CN111432282B - Video recommendation method and device - Google Patents

Video recommendation method and device

Info

Publication number
CN111432282B
CN111432282B (application CN202010251511.6A)
Authority
CN
China
Prior art keywords
video
candidate
recommendation
target
cover
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010251511.6A
Other languages
Chinese (zh)
Other versions
CN111432282A (en)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010251511.6A priority Critical patent/CN111432282B/en
Publication of CN111432282A publication Critical patent/CN111432282A/en
Application granted granted Critical
Publication of CN111432282B publication Critical patent/CN111432282B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4668Learning process for intelligent management, e.g. learning user preferences for recommending movies for recommending content, e.g. movies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer

Abstract

The embodiment of the application discloses a video recommendation method and a video recommendation device. The method comprises: receiving a recommendation request from a terminal; determining at least one target video in a video database according to a search keyword; determining the target video cover thumbnail corresponding to the search keyword as the dynamic video cover thumbnail of the target video; and generating a recommendation response according to the dynamic video cover thumbnail, recommendation information, and playing address of each target video. Based on computer vision technology, the method provides different video cover thumbnails for the same video according to different search keywords, and at recommendation time dynamically uses the video cover thumbnail corresponding to the user's search keyword as the dynamic video cover thumbnail of the target video. The video cover thumbnail of each recommendation result therefore contains the elements corresponding to the search keyword, which improves the display accuracy of video cover thumbnails in video recommendation.

Description

Video recommendation method and device
Technical Field
The application relates to the field of recommendation, in particular to a video recommendation method and device.
Background
With the development of internet applications, many applications can recommend content according to user requirements, for example the recommendation of short videos and other video content.
When video content such as a short video is uploaded, the video recommendation server lets the publisher (or author) designate a certain video frame as the video cover thumbnail of the video, or defaults to the first video frame; a user can then quickly decide whether to watch the corresponding video based on its video cover thumbnail.
In practical applications, a short video often contains a plurality of different elements (corresponding to search keywords), such as different stars or different musical instruments. A video cover thumbnail designated by the publisher (or author), or the default cover chosen by the server, often cannot contain all of these elements at the same time. As a result, when the recommendation server returns a short video in response to a search keyword input by the user, the video cover thumbnail of that short video may not contain the element corresponding to the search keyword; as shown in fig. 8a, the video cover thumbnail of a recommendation result (video 00003) may not contain the element A-01. Although the recommended content includes singing content by element A-01 (e.g., a singer), the user may not open the recommendation when viewing the recommendation presentation interface, because the cover thumbnail specified by the publisher (or author) does not show A-01.
That is, the current video recommendation approach suffers from the technical problem of low display accuracy of the video cover thumbnail.
Summary of the application
The embodiment of the application provides a video recommendation method and device, which are used for improving the display accuracy of a video cover thumbnail of a video recommendation technology.
In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:
the embodiment of the application provides a video recommendation method, which comprises the following steps:
receiving a recommendation request from a terminal, wherein the recommendation request carries search keywords;
determining at least one target video in a video database according to the search keyword, wherein the matching degree of the video frame content of the target video and the search keyword meets the recommendation condition;
when the target video comprises video cover thumbnails corresponding to a plurality of different candidate keywords, determining the target video cover thumbnail corresponding to the search keyword as a dynamic video cover thumbnail of the target video;
generating a recommendation response according to the dynamic video cover thumbnail, the recommendation information and the playing address of each target video;
and sending the recommendation response to the terminal, so that the terminal displays the dynamic video cover thumbnail and recommendation information of each target video, and plays a target video based on its playing address after it is selected.
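The steps claimed above can be sketched as a server-side handler. The following is a minimal, hypothetical illustration (not code from the patent): all identifiers (`VIDEO_DB`, `find_target_videos`, the URL strings) are illustrative, and keyword matching is reduced to a dictionary lookup in place of the frame-content matching described later in the document.

```python
# Hypothetical sketch of the claimed recommendation flow; names are illustrative.

def find_target_videos(video_db, keyword):
    """Select videos whose frame-content tags match the search keyword."""
    return [v for v in video_db if keyword in v["covers"]]

def build_recommendation_response(video_db, keyword):
    """Assemble the recommendation response: dynamic cover thumbnail,
    recommendation information, and playing address per target video."""
    response = []
    for video in find_target_videos(video_db, keyword):
        covers = video["covers"]
        # When a video has covers for several candidate keywords, the cover
        # corresponding to the search keyword becomes its dynamic cover.
        cover = covers[keyword] if len(covers) > 1 else next(iter(covers.values()))
        response.append({
            "cover_thumbnail": cover,
            "recommendation_info": video["info"][keyword],
            "play_address": video["url"],
        })
    return response

# Toy database mirroring part of Table 1 of the description.
VIDEO_DB = [
    {"covers": {"A-01": "S008"}, "info": {"A-01": "T1"}, "url": "addr/00001"},
    {"covers": {"A-01": "S018", "A-02": "S036", "B-01": "S036"},
     "info": {"A-01": "T1", "A-02": "T2", "B-01": "T2"}, "url": "addr/00003"},
]

print(build_recommendation_response(VIDEO_DB, "A-01"))
```

Searching for "A-01" returns both videos, with cover S008 for the single-element video and the keyword-specific cover S018 for the multi-element one.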
The embodiment of the application provides a video recommendation device, which includes:
the receiving module is used for receiving a recommendation request from a terminal, wherein the recommendation request carries a search keyword;
the video searching module is used for determining at least one target video in a video database according to the searching keyword, and the matching degree of the video frame content of the target video and the searching keyword meets the recommendation condition;
the cover determining module is used for determining the target video cover thumbnail corresponding to the search keyword as a dynamic video cover thumbnail of the target video when the target video comprises video cover thumbnails corresponding to a plurality of different candidate keywords;
the response construction module is used for generating recommendation responses according to the dynamic video cover thumbnails, the recommendation information and the playing addresses of the target videos;
and the sending module is used for sending the recommendation response to the terminal, so that the terminal displays the dynamic video cover thumbnail and recommendation information of each target video, and plays a target video based on its playing address after it is selected.
The embodiment of the application provides a server, which comprises a processor and a memory, wherein the memory stores a plurality of instructions, and the instructions are suitable for the processor to load so as to execute the steps in the method.
The embodiment of the present application provides a computer-readable storage medium, which stores a plurality of instructions, where the instructions are suitable for a processor to load, so as to execute the steps in the above method.
The embodiment of the application provides a new video recommendation method and device. The method comprises: first receiving a recommendation request from a terminal; determining at least one target video in a video database according to the search keyword; when the target video includes video cover thumbnails corresponding to a plurality of different candidate keywords, determining the target video cover thumbnail corresponding to the search keyword as the dynamic video cover thumbnail of the target video; generating a recommendation response according to the dynamic video cover thumbnail, recommendation information, and playing address of each target video; and sending the recommendation response to the terminal, so that the terminal displays the dynamic video cover thumbnail and recommendation information of each target video and plays a target video based on its playing address after it is selected. Based on computer vision technology, the method provides different video cover thumbnails for the same video according to different search keywords, and at recommendation time dynamically uses the video cover thumbnail corresponding to the user's search keyword as the dynamic video cover thumbnail of the target video. The video cover thumbnail of each recommendation result (short video) can therefore contain the elements corresponding to the search keyword, making the user more likely to open the recommendation result. This improves the display accuracy of video cover thumbnails in video recommendation, and further improves the conversion rate of video recommendation results and user stickiness.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic networking diagram of a recommendation system provided in an embodiment of the present application.
Fig. 2 is a schematic flowchart of a video recommendation method according to an embodiment of the present application.
Fig. 3 is a second flowchart of a video recommendation method according to an embodiment of the present application.
Fig. 4 is a third flowchart illustrating a video recommendation method according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a video recommendation apparatus according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a model according to an embodiment of the present application.
Fig. 8a to 8d are schematic views of interfaces according to embodiments of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the embodiment of the present application, the videos include, but are not limited to, short videos, long videos, and the like; a short video may be a video shorter than 10 minutes, and a long video may be a video longer than 10 minutes. When a video corresponds to a plurality of different elements, such as a song medley featuring multiple stars, a medley played on different musical instruments, or a medley of songs from different TV series/movies, the recommendation effect of this recommendation method is particularly significant.
In the embodiment of the application, the terminal refers to a terminal device used by a user to be recommended, and the user to be recommended refers to a user using a video recommendation service.
In the embodiment of the present application, the elements in a video may belong to a plurality of categories, such as stars, instruments, and sources, and each element category contains different elements: the star category may contain all stars (public figures, etc.) in the star library; the instrument category may contain all instruments in the instrument library, such as pianos, lutes, flutes, and suona; and the source category may contain all TV series, movies, etc. in the movie library.
In the embodiment of the application, a video cover thumbnail is a video frame, or a combined picture of several video frames, used to represent the main content of a video. A dynamic video cover thumbnail is a concept relative to a static video cover thumbnail. When a video is short and has a single content, such as the climax of a song video, a brief introduction video of a star, or a solo performance video of a musical instrument, the video usually corresponds to only one element, and the user only needs to input the corresponding search keyword to retrieve it; there is no need to provide multiple video cover thumbnails for such a video, only one, and when the video is recommended to the user its cover thumbnail is fixed, which is called a static video cover thumbnail. Conversely, when a video corresponds to a plurality of elements, particularly a medley video, different search keywords can retrieve the video, so multiple video cover thumbnails are provided for it; when the video is recommended to the user, the cover thumbnail changes dynamically according to the search keyword input by the user, which is called a dynamic video cover thumbnail. The dynamic video cover thumbnail reflects the direct relationship between the video content and the search keyword, and can therefore greatly improve the recommendation success rate.
In the embodiment of the present application, recommendation information is text information briefly describing a video, such as which song a certain star sings or which instrument is played. Like the video cover thumbnail, recommendation information can be divided into dynamic and static recommendation information. When a video is very short and has a single content, such as the climax of a song video, a brief introduction video of a star, or a solo performance video of a musical instrument, the video usually corresponds to only one element, and the user only needs to input the corresponding search keyword to retrieve it; there is no need to provide multiple pieces of recommendation information for such a video, only one, and when the video is recommended to the user its recommendation information is fixed, which is called static recommendation information. Conversely, when a video corresponds to a plurality of elements, particularly a medley video, different search keywords can retrieve the video, so multiple pieces of recommendation information are provided for it; when the video is recommended to the user, the recommendation information changes dynamically according to the search keyword input by the user, which is called dynamic recommendation information. Dynamic recommendation information reflects the direct relationship between the video content and the search keyword, and can therefore greatly improve the recommendation success rate.
In this embodiment of the application, the play address refers to a storage address of a video in a storage server, and the terminal may obtain corresponding video content from the storage server based on the address and play the video content.
In the embodiment of the application, search keywords are keywords input by the user to be recommended through a terminal on an interface such as a recommendation application client, while candidate keywords are keywords in the tag database, generated by background personnel or automatically by the server.
In the embodiment of the present application, a video frame is the basic unit of video image content; for example, a 60 Hz video contains 60 video frames per second.
In the embodiment of the present application, the tag database contains candidate keywords and identification pictures corresponding to the candidate keywords, such as a star and a face picture of that star, or an instrument and a picture of that instrument, where the star and instrument names are the candidate keywords and the face picture or instrument picture is the identification picture. In light of the above analysis, the present application may set different tag databases for different element categories, such as a star tag library for the star category, an instrument tag library for the instrument category, a source tag library for the source category, and so on.
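The per-category tag databases described above can be sketched as simple keyword-to-picture mappings. This is an illustrative data layout only (the picture paths and the helper name are hypothetical, not from the patent):

```python
# Hypothetical layout of the tag databases: one library per element
# category, each mapping a candidate keyword to its identification picture.
TAG_DATABASES = {
    "star":       {"A-01": "faces/a01.jpg", "A-02": "faces/a02.jpg"},
    "instrument": {"B-01": "instruments/b01.jpg", "B-02": "instruments/b02.jpg"},
    "source":     {"Show-X": "sources/show_x_frames/"},
}

def identification_picture(category, keyword):
    """Look up the identification picture for a candidate keyword,
    or None if the keyword is not in that category's library."""
    return TAG_DATABASES[category].get(keyword)

print(identification_picture("star", "A-01"))
```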
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. In the application, the artificial intelligence technology is mainly used for realizing the keyword marking of video frames in videos.
Computer Vision (CV) is a science that studies how to make machines "see": cameras and computers are used in place of human eyes to perform machine-vision tasks such as identification, tracking, and measurement of targets, with further image processing so that the processed result becomes an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques, attempting to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
Specifically, for the star tag library, CV mainly realizes face registration (Face Alignment) under face recognition (Face Recognition): a video frame is compared with the star face pictures in the star tag library to obtain the probability that a person in the video frame is a certain star, and face tagging of the video frame is completed based on that probability. For the instrument tag library, CV mainly realizes image classification (Image Classification) under image semantic understanding (ISU): a video frame is matched against the instrument pictures in the instrument tag library to obtain the probability that the video frame contains a certain type of instrument, so that video frames can be tagged by instrument class. For the source tag library, CV mainly realizes similar-image retrieval under image recognition (IR): similar-image retrieval is performed between video frames and the TV series/movie frames in the source tag library to determine the source of each video frame, and the video frame is tagged with that source.
Referring to fig. 1, fig. 1 is a schematic view of a scenario of a recommendation system according to an embodiment of the present application, where the recommendation system may include a user-side device and a service-side device, and the user-side device and the service-side device are connected through an internet formed by various gateways and the like, which are not described again, where the user-side device includes a plurality of terminals 11, and the service-side device includes a plurality of servers 12; wherein:
the terminal 11 includes, but is not limited to, a portable terminal such as a mobile phone and a tablet equipped with various video playing applications, and a fixed terminal such as a computer, an inquiry machine and an advertisement machine, and is a service port that can be used and operated by a user, and in the present application, the terminal provides various functions such as search keyword input, video selection, recommendation result display and video playing for the user; for the convenience of the following description, the terminals 11 are defined as a publisher terminal 11a and a user terminal 11b, the publisher terminal 11a is used for uploading a video to the recommendation server, and the user terminal 11b is used for obtaining a recommended video;
the servers 12 provide various business services for users and include a recommendation server 12a, a storage server 12b, and the like. The storage server 12b is used for storing videos and providing services such as video downloading. The recommendation server 12a is used for receiving recommendation requests from terminals, determining at least one target video in a video database according to the search keyword, determining the target video cover thumbnail corresponding to the search keyword as the dynamic video cover thumbnail of the target video when the target video includes video cover thumbnails corresponding to a plurality of different candidate keywords, generating a recommendation response according to the dynamic video cover thumbnail, recommendation information, and playing address of each target video, and sending the recommendation response to the terminal, so that the terminal displays the dynamic video cover thumbnail and recommendation information of each target video and plays a target video based on its playing address after it is selected.
It should be noted that the system scenario diagram shown in fig. 1 is only an example, and the server and the scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not form a limitation on the technical solution provided in the embodiment of the present application, and as a person having ordinary skill in the art knows, with the evolution of the system and the occurrence of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.
Fig. 2 is a first flowchart of a video recommendation method according to an embodiment of the present application, please refer to fig. 2, where the video recommendation method includes the following steps:
201: process each video in the video database to obtain at least one video cover thumbnail and at least one piece of recommendation information for the video.
In one embodiment, after the entertainment and leisure videos uploaded by the publisher terminal are received, the server notifies the recommendation server to process the videos; this processing is performed offline by the recommendation server. Different videos produce different processing results in this step: some videos end up with only one video cover thumbnail and one piece of recommendation information, while others end up with at least two video cover thumbnails and at least two pieces of recommendation information. After this step is performed, the data shown in Table 1 below are obtained.
Video number | Marks (candidate keywords) | Video cover thumbnail(s) | Recommendation information
00001 | A-01 | S008 | T1 (A-01)
00002 | A-02 | S009 | T1 (A-02)
00003 | A-01, A-02, B-01 | S018 (A-01); S036 (A-02); S036 (B-01) | T1 (A-01); T2 (A-02, B-01)
0000n | A-02, B-02 | S046 (A-02); S076 (B-02) | T1 (A-02); T2 (B-02)
TABLE 1
In Table 1, A-01 and A-02 in the star labels represent 2 different stars, and B-01 and B-02 in the instrument labels represent 2 different instruments; wherein:
for the video numbered 00001, its mark contains only star A-01 (i.e., one search keyword), its video cover thumbnail is the display frame numbered S008 in the video, and its recommendation information is only the text information T1 about star A-01;
for the video numbered 00002, its mark contains only star A-02 (i.e., one search keyword), its video cover thumbnail is the display frame numbered S009 in the video, and its recommendation information is only the text information T1 about star A-02;
for the video numbered 00003, its marks contain star A-01, star A-02, and instrument B-01 at the same time (i.e., three search keywords): the video cover thumbnail corresponding to search keyword A-01 is the display frame numbered S018, the video cover thumbnail corresponding to search keyword A-02 is the display frame numbered S036, and the video cover thumbnail corresponding to search keyword B-01 is also the display frame numbered S036; the recommendation information corresponding to search keyword A-01 is the text information T1 containing star A-01, and the recommendation information corresponding to search keywords A-02 and B-01 is the text information T2 containing star A-02 and instrument B-01;
for the video numbered 0000n, its marks contain star A-02 and instrument B-02 at the same time (i.e., two search keywords): the video cover thumbnail corresponding to search keyword A-02 is the display frame numbered S046, and the video cover thumbnail corresponding to search keyword B-02 is the display frame numbered S076; the recommendation information corresponding to search keyword A-02 is the text information T1 containing star A-02, and the recommendation information corresponding to search keyword B-02 is the text information T2 containing instrument B-02.
In practical applications, table 1 may be extended according to the number of categories of elements.
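The worked example of Table 1 can be captured as a per-video mapping from candidate keyword to cover frame and recommendation information. The following encoding is illustrative only (the structure and helper name are assumptions; the frame numbers and labels come from the description above):

```python
# Table 1, encoded as {video_id: {keyword: (cover_frame, recommendation_info)}}.
TABLE_1 = {
    "00001": {"A-01": ("S008", "T1")},
    "00002": {"A-02": ("S009", "T1")},
    "00003": {"A-01": ("S018", "T1"),
              "A-02": ("S036", "T2"),
              "B-01": ("S036", "T2")},
    "0000n": {"A-02": ("S046", "T1"),
              "B-02": ("S076", "T2")},
}

def dynamic_cover(video_id, keyword):
    """Cover frame shown for `video_id` when the user searches `keyword`."""
    return TABLE_1[video_id][keyword][0]

print(dynamic_cover("00003", "A-02"))  # S036
```

Extending the table to more element categories, as the text notes, amounts to adding more keyword entries per video.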
In one embodiment, taking the star tag library as an example, this step includes: constructing a tag database containing candidate keywords and identification pictures corresponding to the candidate keywords; acquiring candidate video frames of each video in the video database; tagging the candidate video frames of each video with candidate keywords based on the tag database; and determining, for each video, the video cover thumbnail corresponding to each candidate keyword according to the candidate keywords corresponding to its candidate video frames. In this embodiment, the more complete the candidate keywords in the tag database and the clearer the identification pictures, the higher the recognition accuracy.
In one embodiment, the step of obtaining candidate video frames of each video in the video database includes: parsing the video to obtain all of its video frames; and screening all video frames of the video based on a preset selection condition to obtain the candidate video frames. For example, all video frames of a video can be used as its candidate video frames, or screening can be performed based on preset selection conditions to reduce the data volume.
In one embodiment, the step of screening all video frames of a video based on a preset selection condition to obtain its candidate video frames includes: performing a first screening of all video frames according to a preset count condition and a second screening of the first result according to a preset frame content condition; or screening all video frames according to a preset count condition alone; or screening all video frames according to a preset frame content condition alone. Since adjacent video frames of a video have basically the same content, a frame content condition can be set so that a frame is kept only when its content differs from the adjacent kept frame by more than a preset value (such as 10 percent). As another example, as described above, a video may contain 60 frames per second, which is a relatively large amount of data, so a count condition can be set to keep only 3 to 5 video frames per second, chosen randomly or at regular intervals. The two conditions can also be applied together to further screen the video frames and reduce the video data volume to a greater extent.
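The two screening conditions can be sketched as follows. This is a minimal illustration, not the patent's implementation: frames are represented as `(index, content_score)` pairs, where `content_score` stands in for whatever frame feature the content difference is computed on, and the keep-rate and threshold defaults are the examples from the text.

```python
def screen_frames(frames, fps=60, keep_per_second=3, min_diff=0.10):
    """Screen video frames by a count condition, then a content condition.

    frames: list of (index, content_score) pairs; content_score is a
    placeholder for any feature on which a difference can be computed.
    """
    # First screening (count condition): keep ~keep_per_second frames/sec.
    step = max(1, fps // keep_per_second)
    sampled = frames[::step]

    # Second screening (content condition): keep a frame only if its
    # content differs from the previously kept frame by more than min_diff.
    kept = [sampled[0]]
    for idx, score in sampled[1:]:
        prev = kept[-1][1]
        if prev == 0 or abs(score - prev) / abs(prev) > min_diff:
            kept.append((idx, score))
    return kept

# Demo: 4 seconds of 60 Hz video whose content changes once, at frame 120.
demo = [(i, 1.5 if i >= 120 else 1.0) for i in range(240)]
print([i for i, _ in screen_frames(demo)])  # [0, 120]
```

Only the first frame and the frame where the content actually changes survive, illustrating how the combined conditions shrink the data volume.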
In one embodiment, the step of labeling candidate keywords for candidate video frames of each video based on the mark database includes: using a trained neural recognition model to perform image similarity recognition on the candidate video frames and the identification pictures to obtain an image similarity recognition result; and, according to the image similarity recognition result, marking each candidate video frame with the candidate keyword corresponding to the identification picture whose image similarity with that frame meets the marking condition. In an embodiment, this may be completed with a neural recognition model such as a CNN. For example, in the data in table 1, for the video with video number 00001, the image similarities between the video frames numbered S002 to S096 and the identification picture of star a-01 in the mark database all satisfy the marking condition (the similarity is greater than 90%), and the candidate keywords of these video frames are all labeled as star a-01.
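The labeling step can be illustrated with a small sketch that stands in for the neural model's output: identification pictures and candidate frames are represented by hypothetical feature vectors, and a frame receives a candidate keyword when its cosine similarity to the keyword's identification picture clears the 90% marking condition. All names here are illustrative assumptions, not taken from the patent.

```python
import math

def label_frames(frames, mark_db, threshold=0.90):
    """Assign candidate keywords to frames by similarity against the mark database.

    frames:  {frame_id: feature vector} (hypothetical model embeddings)
    mark_db: {candidate_keyword: feature vector of its identification picture}
    Returns {frame_id: [keywords whose similarity exceeds `threshold`]}.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    return {frame_id: [kw for kw, ref in mark_db.items() if cosine(vec, ref) > threshold]
            for frame_id, vec in frames.items()}
```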
In one embodiment, the step of performing image similarity recognition on the candidate video frames and the identification pictures using the trained neural recognition model comprises: acquiring a trained face recognition model as the trained neural recognition model; using the trained face recognition model to recognize the face similarity between the candidate video frames and the identification pictures in the mark database one by one; and taking the face similarity recognition result as the image similarity recognition result. In an embodiment, this may be implemented with any trained face recognition model; the SphereFace model proposed at CVPR 2017 is taken as an example below.
As shown in fig. 7, the SphereFace model includes:
a data Input layer (Input) for inputting data, in this embodiment, for inputting candidate video frames and star Face pictures (Face Images) in the markup database;
a convolution layer (Convolutional Feature Learning) for performing convolution processing and Learning on an input picture;
a feature learning layer (Deeply Learned Features) for processing the data output by the convolution layer to obtain Separable Features and Discriminative Features;
a Label Prediction layer (Label Prediction) for classifying (Classify) according to the output of the feature learning layer to perform label prediction and output Prediction Labels;
and the Loss output layer (Loss Function) is used for obtaining the star face picture corresponding to the video frame based on the Loss Function.
The model shown in fig. 7 is trained and tested in an open-set manner, i.e., the test pictures do not appear in the training set. The prediction result for each test picture is a feature vector; to decide whether the faces in two pictures belong to the same person, the distance between their feature vectors is measured. To this end, the loss function adopts the angular Softmax (A-Softmax) loss, and the similarity between feature vectors is measured by cosine similarity during training. This loss avoids the problem of feature vectors close to the decision boundary being easily classified into the wrong category, because it requires the angle between feature vectors of the same person to be smaller than the angle between feature vectors of different persons, and sufficiently so (the angle between feature vectors of different persons is at least m times that of the same person, with m >= 1).
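At test time, the open-set comparison above reduces to measuring the angle between two feature vectors and thresholding it; a minimal sketch with an assumed, uncalibrated threshold (a real SphereFace deployment would calibrate the threshold on a validation set):

```python
import math

def angle(u, v):
    # Angle between two feature vectors, derived from cosine similarity.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos_t = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.acos(cos_t)

def same_person(feat_a, feat_b, max_angle=0.5):
    # Open-set verification: two faces match when the angle between their
    # feature vectors falls below a calibrated threshold (0.5 rad assumed here).
    return angle(feat_a, feat_b) < max_angle
```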
Based on the face recognition model shown in fig. 7, the recommendation server can recognize the star faces in all short videos in the entertainment category in an off-line manner, and mark the corresponding video frames with corresponding candidate keywords such as star names.
In an embodiment, a plurality of video frames in a video are often labeled with the same label (i.e., the same candidate keyword), and an optimal video frame then needs to be selected as the video cover thumbnail corresponding to that candidate keyword. In this case, the step of determining the video cover thumbnail corresponding to the candidate keyword of each video according to the candidate keyword corresponding to the candidate video frame of each video includes: sorting the candidate keywords corresponding to the candidate video frames of the video to obtain a plurality of candidate video frames corresponding to each candidate keyword; and selecting one of the candidate video frames corresponding to a candidate keyword as the video cover thumbnail corresponding to that candidate keyword according to the image similarity between the identification picture of the candidate keyword and those candidate video frames, and a preset selection condition. In an embodiment, the preset selection condition may be to use the candidate video frame with the largest face recognition accuracy score as the video cover thumbnail corresponding to the candidate keyword: for a star, video frames showing a frontal face and video frames showing the face at various side angles have obviously different face recognition accuracy scores, so the most accurate frame can be determined based on that score, which may be obtained by processing the similarity between the candidate video frame and the star face picture. For example, in the data in table 1, for the video with video number 00001, the face recognition accuracy score of the video frame numbered S008 is the largest, and that video frame is taken as the video cover thumbnail corresponding to the candidate keyword star a-01.
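Selecting the cover under this preset selection condition is then a per-keyword argmax over face recognition accuracy scores; a sketch with hypothetical scores mirroring the table-1 example:

```python
def best_cover_per_keyword(frame_scores):
    """frame_scores: {candidate_keyword: {frame_id: face recognition accuracy score}}.

    Preset selection condition: per candidate keyword, keep the frame whose
    face recognition accuracy score is largest.
    """
    return {keyword: max(scores, key=scores.get)
            for keyword, scores in frame_scores.items()}
```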
In one embodiment, the step of determining the video cover thumbnail corresponding to the candidate keyword of each video according to the candidate keyword corresponding to the candidate video frame of each video includes: sorting the candidate keywords corresponding to the candidate video frames of the video to obtain a plurality of candidate video frames corresponding to each candidate keyword; acquiring the frame definition of the candidate video frames corresponding to the candidate keyword; and selecting one of the candidate video frames corresponding to the candidate keyword as the video cover thumbnail corresponding to that candidate keyword according to the frame definition and a preset selection condition. In one embodiment, the clearer the video frame, the better the user experience, so the preset selection condition may be to select the candidate video frame with the highest frame definition as the video cover thumbnail corresponding to the candidate keyword. For example, in the data in table 1, for the video with video number 00002, the video frame numbered S009 with the largest frame definition is taken as the video cover thumbnail corresponding to the candidate keyword star a-02.
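Frame definition is not specified further in the text; one common proxy is the variance of a Laplacian response over the grayscale frame, since a sharper frame has stronger local contrast and therefore higher variance. The following is an illustrative stand-in for such a metric, not the patent's own:

```python
def sharpness(gray):
    """Frame-definition proxy: variance of a 4-neighbor Laplacian response.

    `gray` is a grayscale image given as a list of rows of intensity values
    (a hypothetical stand-in for decoded frame pixels).
    """
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Discrete Laplacian: sum of 4 neighbors minus 4x the center pixel.
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)
```

A flat (blurry) frame yields near-zero variance, while a frame with sharp edges scores much higher, so picking the max-sharpness frame implements the selection condition above.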
In one embodiment, for instrument tagging of a video, this step includes: acquiring the audio file of the video, preprocessing the audio file, and converting the audio file from a time-domain signal into frequency-domain signals over a preset number of windows through short-time Fourier transform; converting the frequency-domain signals of the preset number of windows from the frequency scale to the Mel scale to obtain a Mel spectrogram; inputting the Mel spectrogram into a pre-constructed musical instrument identification model to obtain the type of musical instrument used in the audio file; searching the instrument mark library for an instrument picture corresponding to the instrument type, identifying whether each candidate video frame in the video includes the instrument picture based on image semantic understanding technology, and marking the candidate keyword of each video frame including the instrument picture as that instrument; and then selecting the video frame with the largest frame definition, or one including the complete instrument picture, as the video cover thumbnail corresponding to the candidate keyword. For example, in the data in table 1, for the video with video number 00003, the video frames numbered S030 to S090 are labeled with the candidate keyword instrument B-01, and the video frame S036 with the largest frame definition is taken as the video cover thumbnail corresponding to the candidate keyword instrument B-01.
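The frequency-to-Mel conversion used when building the Mel spectrogram typically follows the standard mapping mel(f) = 2595 * log10(1 + f / 700), applied to each STFT bin's center frequency; a one-line sketch (this particular constant-based formula is the common convention, assumed here rather than specified by the patent):

```python
import math

def hz_to_mel(f_hz):
    # Standard frequency-to-Mel mapping: approximately linear below ~1 kHz
    # and logarithmic above, matching human pitch perception.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

By construction, 1000 Hz maps to roughly 1000 mel, and the scale compresses higher frequencies.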
In one embodiment, for source tagging of a video, this step includes: acquiring text content such as the video's dialogue, performing similar-text matching between this text content and the text content corresponding to the television series/movies in the source mark library, and determining the television series/movie related to the video's source; identifying, based on image recognition technology, which video frame of the related series/movie is most similar to each candidate video frame in the video, and taking the series/movie including the most similar video frame as the candidate keyword of that video frame; and then selecting the video frame with the highest frame definition as the video cover thumbnail corresponding to the candidate keyword (a certain television series/movie). For example, for the video with video number 00003, the video frames numbered S030 to S090 are labeled with the candidate keyword movie C, and the video frame S036 with the largest frame definition is taken as the video cover thumbnail corresponding to the candidate keyword movie C.
202: and receiving a recommendation request from a terminal, wherein the recommendation request carries search keywords.
In one embodiment, as shown in fig. 8b, a user inputs a search keyword star a-01 on a search interface provided by the terminal 11b, and the terminal generates a recommendation request and sends the recommendation request to the recommendation server, where the recommendation request carries the search keyword.
After receiving the recommendation request, the recommendation server 12b performs parsing to obtain the search keyword star a-01.
203: and determining at least one target video in the video database according to the search keyword, wherein the matching degree of the video frame content of the target video and the search keyword meets the recommendation condition.
In one embodiment, the recommendation server 12b searches the video database provided by the storage server 12a according to the search keyword. Based on the data shown in table 1, it can be determined that the target videos include the videos with video numbers 00001 and 00003, because these videos contain video frames marked as star a-01, so the matching degree between their video frame content and the search keyword satisfies the recommendation condition.
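Step 203's recommendation condition — the search keyword appearing among a video's frame-level marks — can be sketched as a lookup over an inverted index of mark data (the data shapes here are hypothetical, mirroring the table-1 example):

```python
def find_target_videos(video_index, search_keyword):
    # video_index: {video_id: set of candidate keywords marked on its frames}.
    # A video satisfies the recommendation condition when the search keyword
    # appears among its frame-level marks.
    return sorted(video_id for video_id, keywords in video_index.items()
                  if search_keyword in keywords)
```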
204: and when the target video comprises video cover thumbnails corresponding to a plurality of different candidate keywords, determining the target video cover thumbnail corresponding to the search keyword as a dynamic video cover thumbnail of the target video.
In one embodiment, for a video with a video number of 00001, only video cover thumbnails corresponding to 1 candidate keyword are included, and at this time, the recommendation server directly determines a display frame with a number of S008 in the video as the corresponding video cover thumbnail.
In one embodiment, for the video with video number 00003, the corresponding marks include star a-01, star a-02 and instrument B-01 (i.e., corresponding to 3 search keywords); the video cover thumbnail corresponding to the search keyword star a-01 is the display frame numbered S018 in the video, the video cover thumbnail corresponding to the search keyword star a-02 is the display frame numbered S036, and the video cover thumbnail corresponding to the search keyword instrument B-01 is also the display frame numbered S036. In this case, the recommendation server determines display frame S018, the target video cover thumbnail corresponding to the search keyword star a-01, as the dynamic video cover thumbnail of the target video 00003.
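Step 204 then reduces to a keyed lookup: each target video carries a map from candidate keyword to cover frame, and the search keyword selects among them, with a fallback (for example, the uploader's default cover, or the single available one) when no keyword matches. A sketch with the example values above; the function name is illustrative:

```python
def pick_dynamic_cover(cover_map, search_keyword, default=None):
    # cover_map: {candidate_keyword: cover frame id} for one target video.
    # The user's search keyword dynamically selects the cover thumbnail;
    # fall back to `default` when the video carries no cover for it.
    return cover_map.get(search_keyword, default)
```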
205: and generating a recommendation response according to the dynamic video cover thumbnail, the recommendation information and the playing address of each target video.
In one embodiment, the recommendation server 12b constructs a recommendation response including a list of target videos and the dynamic video cover thumbnails, recommendation information, and play addresses of the target videos based on the dynamic video cover thumbnails, recommendation information, and play addresses of the target videos.
In one embodiment, before this step, the method further comprises: when the target video comprises recommendation information corresponding to a plurality of different candidate keywords, determining the target recommendation information corresponding to the search keywords as dynamic recommendation information of the target video; and generating a recommendation response according to the dynamic video cover thumbnail, the dynamic recommendation information and the playing address of each target video.
In one embodiment, for a video with a video number of 00001, only video cover thumbnails corresponding to 1 candidate keyword and one piece of recommendation information are included, at this time, the recommendation server directly determines a display frame with a number of S008 in the video as a corresponding video cover thumbnail and determines text information T1 including star a-01 as corresponding recommendation information.
In one embodiment, the video with video number 00003 corresponds to 2 pieces of recommendation information: the recommendation information corresponding to the search keyword star a-01 is text information T1 comprising star a-01, and the recommendation information corresponding to the search keywords star a-02 and instrument B-01 is text information T2 comprising star a-02 and instrument B-01. In this case, the recommendation server determines the recommendation information corresponding to the search keyword star a-01, i.e., text information T1 comprising star a-01, as the dynamic recommendation information of the target video 00003.
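Steps 204-205 together can be sketched as assembling one response entry per target video, dynamically selecting both the cover thumbnail and the recommendation information by the search keyword (the field names and fallback-to-the-only-entry behavior are assumptions for illustration, not from the patent):

```python
def build_recommendation_response(targets, search_keyword):
    """Assemble a recommendation response for a list of target videos.

    Each target is a dict with per-keyword cover thumbnails ("covers"),
    per-keyword recommendation information ("infos"), and a play address.
    """
    response = []
    for video in targets:
        covers, infos = video["covers"], video["infos"]
        response.append({
            "video_id": video["video_id"],
            # Dynamic selection by search keyword, falling back to the
            # first (possibly only) available entry.
            "cover": covers.get(search_keyword, next(iter(covers.values()))),
            "info": infos.get(search_keyword, next(iter(infos.values()))),
            "play_address": video["play_address"],
        })
    return response
```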
206: and sending a recommendation response to the terminal so that the terminal can display the dynamic video cover thumbnails and recommendation information of all the target videos, and playing the target videos based on the playing addresses after the target videos are selected.
In one embodiment, the recommendation server 12b sends a recommendation response to the terminal 11b, and the terminal 11b presents the interface shown in fig. 8b, where the dynamic video cover thumbnail corresponding to the target video 00003 is video frame S018, associated with star a-01, rather than presenting video frame S036. The user can click on a video cover thumbnail of a certain video to enter a video playing interface.
This embodiment provides a video recommendation method that uses computer vision technology to provide different video cover thumbnails for the same video based on different search keywords. At recommendation time, the video cover thumbnail corresponding to the user's search keyword is dynamically used as the dynamic video cover thumbnail of the target video, so that the cover of each recommendation result (short video) contains elements corresponding to the search keyword. This makes the user more willing to open the recommendation results, improves the display accuracy of video cover thumbnails in the video recommendation technology, and further improves the conversion rate of video recommendation results and user stickiness.
Fig. 3 is a schematic flowchart of a second method for video recommendation according to an embodiment of the present application, please refer to fig. 3, in which the method for video recommendation includes the following steps:
301: the user uploads the video of the song burned in a string.
In one embodiment, a user such as an author or uploader uploads a song-medley video, i.e., video 00003, whose content includes 2 stars, star a-01 and star a-02, to the recommendation server 12b through the uploader terminal 11a; the uploader designates a video frame including only star a-02 as the video cover thumbnail and sets the recommendation information to "song medley of star a-01 + star a-02".
302: the recommendation server 12b processes the video to obtain at least one video cover thumbnail of the video.
In one embodiment, referring to step 201 above, the recommendation server 12b processes the video using the face recognition model; the corresponding labels of the video include both star a-01 and star a-02 (i.e., corresponding to 2 search keywords). The server determines that the video cover thumbnail corresponding to the search keyword star a-01 is the display frame numbered S018 in the video, and that the video cover thumbnail corresponding to the search keyword star a-02 is the display frame numbered S036.
303: the recommendation server 12b stores the video in the storage server 12 a.
In one embodiment, the recommendation server 12b stores the video in the storage server 12a and obtains the storage address of the video.
304: the user terminal 11b transmits a search request.
In one embodiment, as shown in fig. 8c, a user inputs a search keyword star a-01 on a search interface provided by the terminal 11b, and the terminal generates a recommendation request and sends the recommendation request to the recommendation server, where the recommendation request carries the search keyword. After receiving the recommendation request, the recommendation server 12b performs parsing to obtain the search keyword star a-01.
305: the recommendation server determines the target video.
In one embodiment, the recommendation server 12b searches the video database provided by the storage server 12a according to the search keyword. Based on the data shown in table 1, it can be determined that the target videos include the videos with video numbers 00001 and 00003, because these videos contain video frames marked as star a-01, so the matching degree between their video frame content and the search keyword satisfies the recommendation condition.
306: the recommendation server determines a thumbnail of each target video.
In one embodiment, for a video with a video number of 00001, only video cover thumbnails corresponding to 1 candidate keyword are included, and at this time, the recommendation server directly determines a display frame with a number of S008 in the video as the corresponding video cover thumbnail.
In one embodiment, for the video with video number 00003, its corresponding marks simultaneously include star a-01 and star a-02; the video cover thumbnail corresponding to the search keyword star a-01 is the display frame numbered S018 in the video, and the video cover thumbnail corresponding to the search keyword star a-02 is the display frame numbered S036. In this case, the recommendation server determines display frame S018, the target video cover thumbnail corresponding to the search keyword star a-01, as the dynamic video cover thumbnail of the target video 00003.
307: the recommendation server constructs a recommendation response.
In one embodiment, the recommendation server 12b constructs a recommendation response including a list of target videos and the dynamic video cover thumbnails, recommendation information, and play addresses of the target videos based on the dynamic video cover thumbnails, recommendation information, and play addresses of the target videos.
In one embodiment, for a video with a video number of 00001, only video cover thumbnails corresponding to 1 candidate keyword and one piece of recommendation information are included, at this time, the recommendation server directly determines a display frame with a number of S008 in the video as a corresponding video cover thumbnail and determines text information T1 including star a-01 as corresponding recommendation information.
In one embodiment, for the video with video number 00003, the recommendation information is "song medley of star a-01 + star a-02".
308: and the recommendation server sends a recommendation response to the user terminal.
In one embodiment, the recommendation server 12b sends a recommendation response to the terminal 11b.
309: and the user terminal displays the recommendation response.
In one embodiment, terminal 11b presents the interface shown in FIG. 8c, where the thumbnail of the dynamic video cover corresponding to target video 00003 is video frame S018, associated with Star A-01, rather than presenting a thumbnail set by the user that does not include the relevant elements of Star A-01.
310: and the user terminal plays the video.
In one embodiment, the user clicks a thumbnail of a video cover of a certain video in the interface shown in fig. 8c, and the user terminal 11b acquires the corresponding video from the storage server 12a based on the corresponding playing address and plays the corresponding video.
In this embodiment, the recommendation server 12b provides different video cover thumbnails for the same video based on different search keywords by using a computer vision technology, and dynamically uses the video cover thumbnail corresponding to the search keyword as a dynamic video cover thumbnail of a target video according to the search keyword of a user during later recommendation, so that the display accuracy of the video cover thumbnail of the video recommendation technology is improved.
Fig. 4 is a schematic flowchart of a third method for video recommendation according to an embodiment of the present application, please refer to fig. 4, where the method for video recommendation includes the following steps:
401: the user uploads the video of the song burned in a string.
In one embodiment, a user such as an author or uploader uploads a song-medley video, i.e., video 00003, whose content includes 2 stars, star a-01 and star a-02, to the recommendation server 12b through the uploader terminal 11a; the uploader designates a video frame including only star a-02 as the video cover thumbnail and sets the recommendation information to "song medley of star a-01 + star a-02".
402: the recommendation server 12b processes the video to obtain at least one video cover thumbnail of the video and at least one recommendation information.
In one embodiment, referring to step 201 above, the recommendation server 12b processes the video using the face recognition model; the corresponding labels of the video include both star a-01 and star a-02 (i.e., corresponding to 2 search keywords). The server determines that the video cover thumbnail corresponding to the search keyword star a-01 is the display frame numbered S018 in the video and that the video cover thumbnail corresponding to the search keyword star a-02 is the display frame numbered S036; it further determines that the recommendation information corresponding to the search keyword star a-01 is text information T1, "star a-01: song 1", and that the recommendation information corresponding to the search keyword star a-02 is text information T2, "star a-02: song 2".
403: the recommendation server 12b stores the video in the storage server 12 a.
In one embodiment, the recommendation server 12b stores the video in the storage server 12a and obtains the storage address of the video.
404: the user terminal 11b transmits a search request.
In one embodiment, as shown in fig. 8d, the user inputs the search keyword star a-01 on the search interface provided by the terminal 11b, and the terminal generates and sends a recommendation request to the recommendation server, where the recommendation request carries the search keyword. After receiving the recommendation request, the recommendation server 12b performs parsing to obtain the search keyword star a-01.
405: the recommendation server determines the target video.
In one embodiment, the recommendation server 12b searches the video database provided by the storage server 12a according to the search keyword. Based on the data shown in table 1, it can be determined that the target videos include the videos with video numbers 00001 and 00003, because these videos contain video frames marked as star a-01, so the matching degree between their video frame content and the search keyword satisfies the recommendation condition.
406: and the recommendation server determines the thumbnail of each target video and recommendation information.
In one embodiment, for a video with a video number of 00001, only video cover thumbnails corresponding to 1 candidate keyword are included, at this time, the recommendation server directly determines a display frame numbered S008 in the video as a corresponding video cover thumbnail, and the recommendation information is text information T1 including star a-01.
In one embodiment, for the video with video number 00003, its corresponding marks simultaneously include star a-01 and star a-02; the video cover thumbnail corresponding to the search keyword star a-01 is the display frame numbered S018 in the video, and the video cover thumbnail corresponding to the search keyword star a-02 is the display frame numbered S036. In this case, the recommendation server determines display frame S018, the target video cover thumbnail corresponding to the search keyword star a-01, as the dynamic video cover thumbnail of the target video 00003, and determines text information T1, "star a-01: song 1", the recommendation information corresponding to the search keyword star a-01, as the dynamic recommendation information of the target video 00003.
407: the recommendation server constructs a recommendation response.
In one embodiment, the recommendation server 12b constructs a recommendation response including a list of target videos and the dynamic video cover thumbnails, recommendation information, and play addresses of the target videos based on the dynamic video cover thumbnails, recommendation information, and play addresses of the target videos.
408: and the recommendation server sends a recommendation response to the user terminal.
In one embodiment, the recommendation server 12b sends a recommendation response to the terminal 11b.
409: and the user terminal displays the recommendation response.
In one embodiment, terminal 11b presents the interface shown in fig. 8d, where the dynamic video cover thumbnail corresponding to target video 00003 is video frame S018, associated with star a-01, and the recommendation information is text information T1, "star a-01: song 1", which is more accurate.
410: and the user terminal plays the video.
In one embodiment, the user clicks a thumbnail of a video cover of a certain video in the interface shown in fig. 8d, and the user terminal 11b acquires the corresponding video from the storage server 12a based on the corresponding playing address and plays the corresponding video.
In this embodiment, the recommendation server 12b provides different video cover thumbnails and recommendation information for the same video based on different search keywords by using a computer vision technology, and dynamically uses the video cover thumbnail corresponding to the search keyword as a dynamic video cover thumbnail of a target video and the recommendation information corresponding to the search keyword as dynamic recommendation information of the target video according to the search keyword of a user during later recommendation, so that the accuracy of the video recommendation technology is improved.
Correspondingly, fig. 5 is a schematic structural diagram of a video recommendation device according to an embodiment of the present application, please refer to fig. 5, where the video recommendation device includes the following modules:
the identification module 501 is configured to process a video in a video database to obtain at least one video cover thumbnail and at least one recommendation information of the video;
a receiving module 502, configured to receive a recommendation request from a terminal, where the recommendation request carries a search keyword;
the video searching module 503 is configured to determine at least one target video in the video database according to the search keyword, where a matching degree between the video frame content of the target video and the search keyword meets a recommendation condition;
a cover determination module 504, configured to determine, when the target video includes video cover thumbnails corresponding to multiple different candidate keywords, a target video cover thumbnail corresponding to the search keyword as a dynamic video cover thumbnail of the target video;
a response construction module 505, configured to generate a recommendation response according to the dynamic video cover thumbnail, the recommendation information, and the play address of each target video;
and a sending module 506, configured to send a recommendation response to the terminal, so that the terminal displays the dynamic video cover thumbnails and recommendation information of the target videos, and plays the target videos based on the play addresses after the target videos are selected.
In one embodiment, the identification module 501 is configured to: constructing a mark database comprising candidate keywords and identification pictures corresponding to the candidate keywords; acquiring candidate video frames of each video in a video database; based on a mark database, carrying out candidate keyword marking on candidate video frames of each video; and determining the video cover thumbnail corresponding to the candidate keyword of each video according to the candidate keyword corresponding to the candidate video frame of each video.
In one embodiment, the identification module 501 is configured to: analyzing the video to obtain all video frames of the video; and screening all video frames of the video based on a preset selection condition to obtain candidate video frames of the video.
In one embodiment, the identification module 501 is configured to: performing first screening on all video frames of the video according to a preset quantity selection condition, and performing second screening on a first screening result according to a preset video frame content condition to obtain candidate video frames of the video; or screening all video frames of the video according to a preset quantity selection condition to obtain candidate video frames of the video; or screening all video frames of the video according to the content conditions of the preset video frames to obtain candidate video frames of the video.
In one embodiment, the identification module 501 is configured to: using the trained neural recognition model to perform image similarity recognition on the candidate video frame and the recognition picture to obtain an image similarity recognition result; and according to the image similarity identification result, marking the candidate keywords corresponding to the identification pictures with the image similarity meeting the marking conditions of the candidate video frames as the candidate keywords corresponding to the candidate video frames.
In one embodiment, the identification module 501 is configured to: acquiring a trained face recognition model as a trained nerve recognition model; using the trained face recognition model to recognize the face similarity of the candidate video frames and the recognition pictures in the mark database one by one; and taking the face similarity recognition result as an image similarity recognition result.
In one embodiment, the identification module 501 is configured to: sort the candidate keywords corresponding to the candidate video frames of the video to obtain, for each candidate keyword, a plurality of corresponding candidate video frames; and select one of the candidate video frames corresponding to a candidate keyword as the video cover thumbnail for that candidate keyword, according to the image similarity between the identification picture corresponding to the candidate keyword and the candidate video frames corresponding to the candidate keyword, together with a preset selection condition.
In one embodiment, the identification module 501 is configured to: sort the candidate keywords corresponding to the candidate video frames of the video to obtain, for each candidate keyword, a plurality of corresponding candidate video frames; acquire the frame definition (sharpness) of the plurality of candidate video frames corresponding to each candidate keyword; and select one of the plurality of candidate video frames corresponding to a candidate keyword as the video cover thumbnail for that candidate keyword according to the frame definition and a preset selection condition.
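One common proxy for "frame definition" is the variance of a Laplacian filter response, with sharper frames scoring higher. The patent does not name a metric, so the following is a dependency-free sketch of that idea on grayscale pixel grids:

```python
def laplacian_variance(gray):
    """Sharpness proxy: variance of a 4-neighbour Laplacian response.
    `gray` is a 2-D list of pixel intensities; higher means sharper."""
    h, w = len(gray), len(gray[0])
    resp = [gray[y-1][x] + gray[y+1][x] + gray[y][x-1] + gray[y][x+1]
            - 4 * gray[y][x]
            for y in range(1, h - 1) for x in range(1, w - 1)]
    mean = sum(resp) / len(resp)
    return sum((r - mean) ** 2 for r in resp) / len(resp)

def pick_sharpest(frames):
    """Select the candidate frame with the highest frame definition."""
    return max(frames, key=laplacian_variance)

flat = [[100] * 4 for _ in range(4)]                      # uniform: no detail
checker = [[255 if (x + y) % 2 else 0 for x in range(4)]  # high-contrast detail
           for y in range(4)]
best = pick_sharpest([flat, checker])  # the checkerboard frame
```

A completely uniform frame has zero Laplacian variance, so `pick_sharpest` prefers the high-contrast frame; real systems usually combine such a score with the similarity-based selection condition described above.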
In one embodiment, the response building module 505 is configured to: when a target video comprises recommendation information corresponding to a plurality of different candidate keywords, determine the target recommendation information corresponding to the search keyword as the dynamic recommendation information of the target video; and generate the recommendation response according to the dynamic video cover thumbnail, the dynamic recommendation information, and the playing address of each target video.
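Under the assumption that per-keyword covers and recommendation information are stored as keyword-indexed maps (all field names here are illustrative, not from the patent), the dynamic-selection step might look like:

```python
def choose_dynamic_fields(video, search_keyword):
    """When a video carries covers/recommendation info for several
    candidate keywords, pick the entry matching the search keyword,
    falling back to the video's defaults otherwise."""
    return {
        "cover": video.get("covers", {}).get(search_keyword,
                                             video.get("default_cover")),
        "info": video.get("infos", {}).get(search_keyword,
                                           video.get("default_info")),
        "play_url": video["play_url"],
    }

video = {"covers": {"cat": "cover_cat.jpg"},
         "infos": {"cat": "Funny cat moments"},
         "default_cover": "cover_0.jpg",
         "default_info": "A compilation video",
         "play_url": "https://example.com/v/1"}
```

Searching for "cat" yields the cat-specific cover and info; any other keyword falls back to the video's default cover and description.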
Accordingly, embodiments of the present application also provide a server. As shown in fig. 6, the server may include a Radio Frequency (RF) circuit 601, a memory 602 including one or more computer-readable storage media, an input unit 603, a display unit 604, a sensor 605, an audio circuit 606, a Wireless Fidelity (WiFi) module 607, a processor 608 including one or more processing cores, and a power supply 609. Those skilled in the art will appreciate that the server architecture shown in fig. 6 is not limiting: the server may include more or fewer components than those shown, some components may be combined, or the components may be arranged differently. Wherein:
the RF circuit 601 may be used to receive and transmit signals during message transmission or a call; in particular, it receives downlink messages from a base station and hands them to the one or more processors 608 for processing, and transmits uplink data to the base station. The memory 602 may be used to store software programs and modules, and the processor 608 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 602. The input unit 603 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The display unit 604 may be used to display information input by or provided to the user and various graphical user interfaces of the server, which may be made up of graphics, text, icons, video, and any combination thereof.
The server may also include at least one sensor 605, such as light sensors, motion sensors, and other sensors. Audio circuitry 606 includes speakers that may provide an audio interface between the user and the server.
WiFi is a short-range wireless transmission technology. Through the WiFi module 607, the server can help the user send and receive e-mail, browse web pages, access streaming media, and so on; it provides the user with wireless broadband Internet access. Although fig. 6 shows the WiFi module 607, it is understood that the module is not an essential part of the server and may be omitted as needed without changing the essence of the application.
The processor 608 is the control center of the server: it connects the various parts of the entire server using various interfaces and lines, and performs the various functions of the server and processes data by running or executing the software programs and/or modules stored in the memory 602 and calling the data stored in the memory 602, thereby monitoring the server as a whole.
The server also includes a power supply 609 (e.g., a battery) for powering the various components. Preferably, the power supply is logically connected to the processor 608 through a power management system, so that charging, discharging, and power consumption are managed by the power management system.
Although not shown, the server may further include a camera, a Bluetooth module, and the like, which are not described herein. Specifically, in this embodiment, the processor 608 in the server loads the executable file corresponding to the process of one or more application programs into the memory 602 according to the following instructions, and the processor 608 runs the application programs stored in the memory 602 so as to implement the following functions:
receiving a recommendation request from a terminal, wherein the recommendation request carries search keywords;
determining at least one target video in a video database according to the search keyword, wherein the matching degree of the video frame content of the target video and the search keyword meets the recommendation condition;
when the target video comprises video cover thumbnails corresponding to a plurality of different candidate keywords, determining the target video cover thumbnails corresponding to the search keywords as dynamic video cover thumbnails of the target video;
generating a recommendation response according to the dynamic video cover thumbnail, the recommendation information and the playing address of each target video;
and sending a recommendation response to the terminal, so that the terminal displays the dynamic video cover thumbnails and recommendation information of all the target videos and, after a target video is selected, plays the target video based on its playing address.
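Putting the steps above together, a highly simplified request handler might look like the following sketch; the field names, index structure, and matching threshold are all illustrative assumptions, and real matching would use the offline keyword index described earlier:

```python
def handle_recommendation(request, video_index, threshold=0.5):
    """End-to-end sketch: match videos whose frame content fits the
    search keyword, resolve each one's dynamic cover thumbnail, and
    assemble the recommendation response sent back to the terminal."""
    keyword = request["search_keyword"]
    response = []
    for video in video_index:
        # Recommendation condition: the matching degree between the
        # video's frame content and the keyword must reach the threshold.
        if video["keyword_scores"].get(keyword, 0.0) < threshold:
            continue
        # Dynamic cover: the thumbnail tagged with the search keyword,
        # if one exists among the per-keyword covers.
        cover = video["covers"].get(keyword, video["default_cover"])
        response.append({"cover": cover,
                         "info": video["info"],
                         "play_url": video["play_url"]})
    return response

index = [{"keyword_scores": {"cat": 0.9}, "covers": {"cat": "c1.jpg"},
          "default_cover": "d1.jpg", "info": "Cat video",
          "play_url": "https://example.com/v/1"},
         {"keyword_scores": {"dog": 0.8}, "covers": {},
          "default_cover": "d2.jpg", "info": "Dog video",
          "play_url": "https://example.com/v/2"}]
resp = handle_recommendation({"search_keyword": "cat"}, index)
```

For the query "cat", only the first video passes the recommendation condition, and its cat-specific cover is chosen as the dynamic cover thumbnail; a video without a keyword-specific cover falls back to its default.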
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and parts that are not described in detail in a certain embodiment may refer to the above detailed description, and are not described herein again.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to implement the following functions:
receiving a recommendation request from a terminal, wherein the recommendation request carries search keywords;
determining at least one target video in a video database according to the search keyword, wherein the matching degree of the video frame content of the target video and the search keyword meets the recommendation condition;
when the target video comprises video cover thumbnails corresponding to a plurality of different candidate keywords, determining the target video cover thumbnails corresponding to the search keywords as dynamic video cover thumbnails of the target video;
generating a recommendation response according to the dynamic video cover thumbnail, the recommendation information and the playing address of each target video;
and sending a recommendation response to the terminal, so that the terminal displays the dynamic video cover thumbnails and recommendation information of all the target videos and, after a target video is selected, plays the target video based on its playing address.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any method provided in the embodiments of the present application, the beneficial effects that can be achieved by any method provided in the embodiments of the present application can be achieved, for details, see the foregoing embodiments, and are not described herein again.
The video recommendation method and apparatus, the server, and the computer-readable storage medium provided in the embodiments of the present application are described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method and its core ideas. Meanwhile, those skilled in the art may, following the ideas of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method for video recommendation, comprising:
receiving a recommendation request from a terminal, wherein the recommendation request carries search keywords;
determining at least one target video in a video database according to the search keyword, wherein the matching degree of the video frame content of the target video and the search keyword meets recommendation conditions, and candidate keywords and video cover thumbnails of each video in the video database are obtained through offline processing performed by a recommendation server;
for each target video in the at least one target video, when the target video comprises video cover thumbnails corresponding to a plurality of different candidate keywords, determining the target video cover thumbnail corresponding to the search keyword as a dynamic video cover thumbnail of the target video;
generating a recommendation response according to the dynamic video cover thumbnail, the recommendation information and the playing address of each target video;
and sending the recommendation response to the terminal, so that the terminal displays the dynamic video cover thumbnails and recommendation information of all the target videos and, after a target video is selected, plays the target video based on its playing address.
2. The video recommendation method according to claim 1, further comprising, before the step of receiving a recommendation request from the terminal:
constructing a mark database comprising candidate keywords and identification pictures corresponding to the candidate keywords;
acquiring candidate video frames of each video in the video database;
performing candidate keyword marking on the candidate video frames of each video based on the mark database;
and determining the video cover thumbnail corresponding to the candidate keyword of each video according to the candidate keyword corresponding to the candidate video frame of each video.
3. The video recommendation method according to claim 2, wherein the step of obtaining candidate video frames of each video in the video database comprises:
analyzing a video to obtain all video frames of the video;
and screening all video frames of the video based on a preset selection condition to obtain candidate video frames of the video.
4. The video recommendation method according to claim 3, wherein the step of filtering all video frames of the video based on a preset selection condition comprises:
performing first screening on all video frames of the video according to a preset quantity selection condition, and performing second screening on a first screening result according to a preset video frame content condition to obtain candidate video frames of the video;
or screening all video frames of the video according to a preset quantity selection condition to obtain candidate video frames of the video;
or screening all video frames of the video according to a preset video frame content condition to obtain candidate video frames of the video.
5. The video recommendation method according to claim 2, wherein the step of labeling candidate keywords for candidate video frames of each video based on the label database comprises:
using the trained neural recognition model to perform image similarity recognition on the candidate video frame and the recognition picture to obtain an image similarity recognition result;
and according to the image similarity identification result, marking the candidate keywords corresponding to the identification pictures of which the image similarity of the candidate video frames meets the marking condition as the candidate keywords corresponding to the candidate video frames.
6. The video recommendation method according to claim 5, wherein the step of performing image similarity recognition on the candidate video frame and the recognition picture by using the trained neural recognition model comprises:
acquiring a trained face recognition model as the trained neural recognition model;
using the trained face recognition model to recognize the face similarity of the candidate video frames and the recognition pictures in the mark database one by one;
and taking the face similarity recognition result as the image similarity recognition result.
7. The video recommendation method according to claim 5, wherein the step of determining the video cover thumbnail corresponding to the candidate keyword of each video according to the candidate keyword corresponding to the candidate video frame of each video comprises:
sorting candidate keywords corresponding to the candidate video frames of the video to obtain a plurality of candidate video frames corresponding to the candidate keywords;
and selecting one of the candidate video frames corresponding to the candidate keywords as a video cover thumbnail corresponding to the candidate keywords according to the image similarity between the identification picture corresponding to the candidate keywords and the candidate video frames corresponding to the candidate keywords and a preset selection condition.
8. The video recommendation method according to claim 2, wherein the step of determining the video cover thumbnail corresponding to the candidate keyword of each video according to the candidate keyword corresponding to the candidate video frame of each video comprises:
sorting candidate keywords corresponding to the candidate video frames of the video to obtain a plurality of candidate video frames corresponding to the candidate keywords;
acquiring frame definition of a plurality of candidate video frames corresponding to the candidate keywords;
and selecting one of a plurality of candidate video frames corresponding to the candidate keywords as a video cover thumbnail corresponding to the candidate keywords according to the frame definition and a preset selection condition.
9. The video recommendation method according to any one of claims 1 to 8, further comprising, before the step of generating a recommendation response based on the dynamic video cover thumbnail, the recommendation information, and the play address of each target video:
when the target video comprises recommendation information corresponding to a plurality of different candidate keywords, determining the target recommendation information corresponding to the search keywords as dynamic recommendation information of the target video;
and generating a recommendation response according to the dynamic video cover thumbnail, the dynamic recommendation information and the playing address of each target video.
10. A video recommendation apparatus, comprising:
the system comprises a receiving module, a searching module and a recommending module, wherein the receiving module is used for receiving a recommending request from a terminal, and the recommending request carries a searching keyword;
the video search module is used for determining at least one target video in a video database according to the search keyword, wherein the matching degree of the video frame content of the target video and the search keyword meets recommendation conditions, and candidate keywords and video cover thumbnails of each video in the video database are obtained through offline processing performed by a recommendation server;
a cover determination module, configured to determine, for each target video in the at least one target video, a target video cover thumbnail corresponding to the search keyword as a dynamic video cover thumbnail of the target video when the target video includes video cover thumbnails corresponding to a plurality of different candidate keywords;
the response construction module is used for generating recommendation responses according to the dynamic video cover thumbnails, the recommendation information and the playing addresses of the target videos;
and the sending module is used for sending the recommendation response to the terminal, so that the terminal displays the dynamic video cover thumbnails and recommendation information of all the target videos and, after a target video is selected, plays the target video based on its playing address.
CN202010251511.6A 2020-04-01 2020-04-01 Video recommendation method and device Active CN111432282B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010251511.6A CN111432282B (en) 2020-04-01 2020-04-01 Video recommendation method and device

Publications (2)

Publication Number Publication Date
CN111432282A CN111432282A (en) 2020-07-17
CN111432282B true CN111432282B (en) 2022-01-04

Family

ID=71550496


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230107935A1 (en) * 2020-09-16 2023-04-06 Google Llc User interfaces for refining video group packages
CN112231513A (en) * 2020-10-15 2021-01-15 北京爱论答科技有限公司 Learning video recommendation method, device and system
CN112559800B (en) 2020-12-17 2023-11-14 北京百度网讯科技有限公司 Method, apparatus, electronic device, medium and product for processing video
CN115834945A (en) * 2022-11-02 2023-03-21 北京奇艺世纪科技有限公司 Information flow advertisement display method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019933A (en) * 2018-01-02 2019-07-16 阿里巴巴集团控股有限公司 Video data handling procedure, device, electronic equipment and storage medium
CN110019950A (en) * 2019-03-22 2019-07-16 广州新视展投资咨询有限公司 Video recommendation method and device
CN110337011A (en) * 2019-07-17 2019-10-15 百度在线网络技术(北京)有限公司 Method for processing video frequency, device and equipment
CN110674345A (en) * 2019-09-12 2020-01-10 北京奇艺世纪科技有限公司 Video searching method and device and server
CN110879851A (en) * 2019-10-15 2020-03-13 北京三快在线科技有限公司 Video dynamic cover generation method and device, electronic equipment and readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8402025B2 (en) * 2007-12-19 2013-03-19 Google Inc. Video quality measures
US20110047163A1 (en) * 2009-08-24 2011-02-24 Google Inc. Relevance-Based Image Selection
CN108733676A (en) * 2017-04-14 2018-11-02 合信息技术(北京)有限公司 The extracting method and device of video thumbnails
CN108962279A (en) * 2018-07-05 2018-12-07 平安科技(深圳)有限公司 New Method for Instrument Recognition and device, electronic equipment, the storage medium of audio data
CN109543102A (en) * 2018-11-12 2019-03-29 百度在线网络技术(北京)有限公司 Information recommendation method, device and storage medium based on video playing

Also Published As

Publication number Publication date
CN111432282A (en) 2020-07-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant