CN112989116B

CN112989116B - Video recommendation method, system and device

Info

Publication number: CN112989116B
Application number: CN202110503297.3A
Authority: CN
Inventors: 吴庆宁; 谢统玲; 殷焦元; 陈万锋
Original assignee: Guangzhou Kuaizi Information Technology Co ltd
Current assignee: Guangzhou Kuaizi Information Technology Co ltd
Priority date: 2021-05-10
Filing date: 2021-05-10
Publication date: 2021-10-26
Anticipated expiration: 2041-05-10
Also published as: CN112989116A

Abstract

The specification relates to a video recommendation method, a system and a device, wherein the method specifically comprises the following steps: acquiring a plurality of candidate videos; splitting video clips corresponding to a plurality of video shots based on the obtained candidate videos; generating a plurality of shot features corresponding to the plurality of video shots based on the trained shot feature extraction model and the video clips corresponding to each of the plurality of video shots; generating a video feature vector corresponding to each candidate video based on the plurality of shot features; and determining the similarity degree between any two candidate videos based on the trained discriminant model and the video feature vectors corresponding to the candidate videos, and determining a recommended video set based on the similarity degree.

Description

Video recommendation method, system and device

Technical Field

The present disclosure relates to the field of video analysis technologies, and in particular, to a method, a system, and an apparatus for video recommendation.

Background

With the development of the field of short videos and multimedia, more and more audiences watch videos through mobile equipment, public transport televisions, elevator televisions, outdoor advertising screens and other equipment. However, as the number of short videos increases, how the video platform actively recommends videos to determine the user's preference becomes a key for the next stage of development of the industry to recommend the favorite videos to the user accurately.

Therefore, how to recommend videos with low similarity to the user to quickly lock the user's preference becomes a problem to be solved.

Disclosure of Invention

One embodiment of the present specification provides a video recommendation method, including: acquiring a plurality of candidate videos; splitting video clips corresponding to a plurality of video shots based on the obtained candidate videos; generating a plurality of shot features corresponding to the plurality of video shots based on the trained shot feature extraction model and the video clips corresponding to each of the plurality of video shots; generating a video feature vector corresponding to each candidate video based on the plurality of shot features; and determining the similarity degree between any two candidate videos based on the trained discriminant model and the video feature vectors corresponding to the candidate videos, and determining a recommended video set based on the similarity degree.

One of embodiments of the present specification provides a video recommendation system, including: the candidate video acquisition module is used for acquiring a plurality of candidate videos; the video clip splitting module is used for splitting video clips corresponding to a plurality of video shots based on the obtained candidate videos; a shot feature extraction module, configured to generate a plurality of shot features corresponding to the plurality of video shots based on the trained shot feature extraction model and the video clip corresponding to each of the plurality of video shots; the video feature vector generation module is used for generating a video feature vector corresponding to each candidate video based on the plurality of shot features; and the recommended video set determining module is used for determining the similarity degree between any two candidate videos based on the trained discriminant model and the video feature vectors corresponding to the candidate videos, and determining the recommended video set based on the similarity degree.

One of the embodiments of the present specification provides a video recommendation apparatus, including a processor, configured to execute the video recommendation method described above.

Drawings

The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:

FIG. 1 is a schematic diagram of a video recommendation application scenario, shown in accordance with some embodiments of the present description;

2A-2B are system block diagrams of a video recommendation system according to some embodiments herein;

FIG. 3 is an exemplary flow diagram of a video recommendation method according to some embodiments of the present description;

FIG. 4 is a schematic illustration of a model structure shown in accordance with some embodiments of the present description; and

FIG. 5 is an exemplary flow diagram of a model training process, shown in accordance with some embodiments of the present description.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.

It should be understood that "system", "apparatus", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts, portions or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.

As used in this specification and the appended claims, the terms "a," "an," "the," and/or "the" are not intended to be inclusive in the singular, but rather are intended to be inclusive in the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that steps and elements are included which are explicitly identified, that the steps and elements do not form an exclusive list, and that a method or apparatus may include other steps or elements.

Flow charts are used in this description to illustrate operations performed by a system according to embodiments of the present description. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are performed. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or a certain step or several steps of operations may be removed from the processes.

FIG. 1 is a schematic diagram of a video recommendation application scenario according to some embodiments of the present description.

As shown in fig. 1, an application scenario of the video recommendation system 100 referred to in this specification may include one or more terminals 110, a first computing system 120, a second computing system 150, and/or a third computing system 160.

The first computing system 120 may be used to obtain the candidate video 112; the candidate video 112 may be obtained by the terminal 110. The candidate video 112 may enter the first computing system 120 in a variety of common ways. Such as WIFI, bluetooth, microwave communication, etc. The received candidate video 112 may be converted to shot features 130 by a shot feature extraction model 122 in the first computing system 120. Specifically, the first computing system 120 may split video clips corresponding to a plurality of video shots based on the obtained plurality of candidate videos 112; and further, based on the trained shot feature extraction model 122 and the video clips corresponding to each of the plurality of video shots, a plurality of shot features 130 corresponding to the plurality of video shots are generated. In some embodiments, the first computing system 120 may also generate a video feature vector 140 corresponding to each candidate video based on the plurality of shot features 130 to provide to the third computing system 160 for determination.

The second computing system 150 may be used to train the initial model 152 into the discriminative model 162. In some embodiments, the initial model 152 includes an initialized first lens feature extraction model, an initialized second lens feature extraction model, and an initialized discriminative model. In an embodiment of the scenario, the second computing system 150 may obtain a first training set, where the first training set includes a plurality of video pairs, where each video pair includes an image feature corresponding to a corresponding first sample video, an image feature corresponding to a second sample video, and a label value reflecting a degree of similarity between the first sample video and the second sample video. Further, the second computing system 150 may train the parameters of the initial model 152 through multiple iterations based on the first training set to generate a trained discriminative model 162.

In one or more embodiments of the present description, a model (e.g., the shot feature extraction model 122, the initial model 152, or/and the discriminant model 162) may refer to a collection of several methods performed based on the processing device. These methods may include a number of parameters. When executing the model, the parameters used may be preset or may be dynamically adjusted. Some parameters may be obtained by a trained method, and some parameters may be obtained during execution. For a specific description of the model referred to in this specification, reference is made to the relevant part of the specification.

The third computing system 160 may be configured to determine a degree of similarity between any two candidate videos based on the trained discriminant model 162, and determine the recommended video set 170 based on the degree of similarity.

The first computing system 120, the second computing system 150, and the third computing system 160 referred to above may be the same or different. The first computing system 120, the second computing system 150, and the third computing system 160 refer to systems having computing capabilities, and may include various computers, such as servers, personal computers, or computing platforms formed by connecting a plurality of computers in various structures.

Processing devices (not shown) may be included in first computing system 120, second computing system 150, and third computing system 160. The processing device may execute program instructions. The Processing device may include various common general purpose Central Processing Units (CPUs), Graphics Processing Units (GPUs), microprocessors, Application-Specific Integrated circuits (ASICs), or other types of Integrated circuits.

Storage media (not shown) may be included in first computing system 120, second computing system 150, and third computing system 160. A storage medium may store instructions and may also store data. The storage medium may include mass storage, removable storage, volatile read-write memory, read-only memory (ROM), and the like, or any combination thereof.

First computing system 120, second computing system 150, and third computing system 160 may also include networks for internal connections and connections to the outside. Terminals for input or output may also be included. The network may be any one or more of a wired network or a wireless network.

In some embodiments, the candidate videos 112 acquired by the first computing system 120 may come from one or more terminals 110. In one or more embodiments of the present description, the terminal 110 may be a device with information acquisition, storage, and/or transmission functions, including but not limited to one or a combination of mobile device 110-1, tablet computer 110-2, desktop computer 110-3, camera 110-4, and the like. In some embodiments, the terminal 110 may include smart home devices, wearable devices, smart mobile devices, augmented reality devices, and the like, or combinations thereof.

In some embodiments, the terminal 110 may be used to obtain the candidate video 112. The candidate video 112 may be an advertisement video, an animation video, a movie video, a teaching video, etc. It is to be appreciated that the candidate video 112 may be comprised of a plurality of different shots. Further description of determining shot characteristics based on candidate videos 112 can be found in fig. 3, and will not be repeated here.

At present, a new internet mode is continuously rising, social network service communities are continuously developing and strong, and multimedia contents are rapidly developed. In the internet video, the development of pioneering media, but how to keep the user scale increasing and how to deeply plough the stock of clients becomes a key problem. Therefore, each large video platform is dedicated to actively recommend videos of interest to the user, so as to maximize the viewing time of the user.

Generally, when determining videos in which a user is interested, videos with low partial similarity need to be recommended to the user to quickly determine the user's preference. The existing video similarity analysis technology is usually realized based on video content, and mainly comprises the steps of extracting key frames in a video to be processed, extracting deep learning characteristics of the key frames, and sequentially comparing the similarity between frame images in two videos according to frame extraction time, so that the similarity between the two videos is predicted based on the similarity between the frame images; this method is acceptable in the case of a small number of videos. However, in the face of massive videos generated in the information explosion era, strict requirements are put forward on the processing capacity and the processing speed of a video recommendation system, and if the similarity between videos is calculated by using key frames, the time consumption is long, the speed is slow, and the video recommendation speed is seriously influenced.

In one or more embodiments related to this specification, video segments corresponding to a plurality of video shots may be split based on a plurality of acquired candidate videos; generating a plurality of shot features corresponding to the plurality of video shots based on the trained shot feature extraction model and the video clips corresponding to each of the plurality of video shots; generating a video feature vector corresponding to each candidate video based on the plurality of shot features; and determining the similarity degree between any two candidate videos according to the judgment result and the trained discriminant model, thereby determining a recommended video set with low similarity degree. The method for the vector representation is adopted to judge the similarity degree, the judging efficiency is higher, and the accuracy of the similarity degree judgment is improved along with the improvement of the training depth and the training sample amount.

Further, in one or more embodiments related to this specification, a plurality of video feature vectors may be clustered based on a clustering algorithm to obtain a plurality of video cluster clusters; a recommended video set is determined based on the plurality of video cluster clusters. In this way, the recommended video set with low similarity can be acquired more accurately and quickly.

Fig. 2A-2B are system block diagrams of a video recommendation system according to some embodiments of the present description.

As shown in fig. 2A, the system 200 arrangement may be disposed on a computing system (e.g., first computing system 120, second computing system 150, and/or third computing system 160 in fig. 1). The system 200 may include a candidate video acquisition module 210, a video clip splitting module 220, a shot feature extraction module 230, a video feature vector generation module 240, and a recommended video set determination module 250.

A candidate video obtaining module 210, configured to obtain multiple candidate videos.

The video clip splitting module 220 is configured to split video clips corresponding to a plurality of video shots based on the obtained plurality of candidate videos.

A shot feature extraction module 230, configured to generate a plurality of shot features corresponding to the plurality of video shots based on the trained shot feature extraction model and the video segment corresponding to each of the plurality of video shots.

A video feature vector generating module 240, configured to generate a video feature vector corresponding to each candidate video based on the plurality of shot features.

And a recommended video set determining module 250, configured to determine a similarity degree between any two candidate videos based on the trained discriminant model and the video feature vectors corresponding to the candidate videos, and determine a recommended video set based on the similarity degree.

In some embodiments, the recommended video set determination module 250 is further configured to: clustering the plurality of video feature vectors based on a clustering algorithm to obtain a plurality of video clustering clusters; determining a recommended video set based on the plurality of video cluster clusters. In some embodiments, the recommended video set determination module 250 is further configured to: determining a preset value of a recommended video set; processing the video feature vectors by adopting a clustering algorithm for multiple times until the number of the obtained video clustering clusters is greater than a preset value of the recommended video set; randomly selecting a plurality of video clustering clusters with the number equal to a preset value; and respectively acquiring a candidate video from each selected video cluster to determine the recommended video set.

In some embodiments, the video clip splitting module 220 is further configured to: acquiring a plurality of video frames of the candidate video and image characteristics of each video frame; according to the image characteristics of the video frames, respectively calculating the similarity between each video frame and a video frame preselected from a plurality of video frames to determine a shot boundary frame; and dividing the candidate video into a plurality of video segments according to the shot boundary frame.

In some embodiments, the shot feature extraction model is a sequence-based machine learning model. In this scenario embodiment, as shown in fig. 2B, the shot feature extraction module 230 further includes: a video frame acquiring unit 231, configured to acquire a plurality of video frames in a video clip corresponding to each video shot; an image feature determining unit 232, configured to determine one or more image features corresponding to each video frame; and a shot feature determining unit 233, configured to process, based on the trained shot feature extraction model, image features in the multiple video frames and a correlation between the image features in the multiple video frames, and determine shot features corresponding to the video shots.

In some embodiments, the lens feature extraction model and the discrimination model are sub-models of a first neural network model, the first neural network model comprises a first lens feature extraction model, a second lens feature extraction model and a discrimination model, and the first lens feature extraction model and the second lens feature extraction model have the same model structure; the lens feature extraction module further includes a first neural network model training unit 234, where the first neural network model training unit 234 is configured to: acquiring a first training set, wherein the first training set comprises a plurality of video pairs, each video pair comprises an image feature corresponding to a corresponding first sample video, an image feature corresponding to a second sample video and a label value, and the label value reflects the similarity degree between the first sample video and the second sample video; training a first initial model through multiple rounds of iteration based on the first training set to generate a trained first neural network model; the first initial model comprises an initialized first lens feature extraction model, an initialized second lens feature extraction model and an initialized discrimination model.

In some embodiments, the first neural network model training unit 234 trains the first initial model for a plurality of iterations to generate a trained first neural network model, wherein each iteration comprises: acquiring an updated first initial model generated in the previous iteration; for each video pair, processing image features corresponding to a first sample video in the video pair by using the updated first shot feature extraction model to obtain corresponding first shot features; processing image characteristics corresponding to a second sample video in the same video pair by using the updated second lens characteristic extraction model to obtain second lens characteristics; processing the first shot feature and the second shot feature by using the updated discrimination model to generate a discrimination result, wherein the discrimination result is used for reflecting the similarity degree of the first shot feature and the second shot feature; and judging whether to perform the next iteration or determine a trained first neural network model based on the judgment result and the label value. In some embodiments, the image features to which the video frames correspond include at least one of the following: shape information of an object in a video frame, positional relationship information between a plurality of objects in a video frame, color information of an object in a video frame, a degree of completeness of an object in a video frame, and/or a brightness in a video frame.

It should be appreciated that the system and its modules shown in fig. 2A-2B may be implemented in a variety of ways. For example, in some embodiments, an apparatus and its modules may be implemented by hardware, software, or a combination of software and hardware. Wherein the hardware portion may be implemented using dedicated logic; the software portions may then be stored in a memory for execution by a suitable instruction execution device, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and apparatus described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided for example on a carrier medium such as a diskette, CD-or DVD-ROM, a programmable memory such as read-only memory (firmware) or a data carrier such as an optical or electronic signal carrier. The apparatus and modules thereof in this specification may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., but also by software executed by various types of processors, for example, or by a combination of the above hardware circuits and software (e.g., firmware).

FIG. 3 is an exemplary flow diagram of a round of iterative updating in a model training method in accordance with some embodiments shown herein.

One or more steps in flow 300 may be performed by any of the computing devices in fig. 1, such as first computing system 120, second computing system 150, and/or third computing system 160. In some embodiments, flow 300 may be further performed by system 200.

In step 310, a plurality of candidate videos are obtained. In some embodiments, step 310 may be performed by candidate video acquisition module 210.

The candidate video may be any one of an advertisement video, an animation video, a movie video, a teaching video, and the like. In some embodiments, the candidate videos may be presented to the audience group via a television, an outdoor advertising screen, a webpage or a pop-up window of an electronic device (e.g., a mobile phone, a computer, a smart wearable device, etc.), or the like. Candidate videos (such as advertisements) are usually composed of multiple shots, which contain different scenes, people, and merchandise.

In step 320, splitting video segments corresponding to a plurality of video shots based on the obtained plurality of candidate videos. In some embodiments, step 320 may be performed by the video clip splitting module 220.

A video segment may be a sub-sequence of a sequence of images that constitutes a candidate video, and a video segment may be understood as a segment of continuous and un-stitched video. In some embodiments, a candidate video may include only one video clip, and the entire advertising video is treated as one video clip. In other embodiments, a candidate video may be formed by splicing a plurality of shots, and one or more consecutive video frames at the junction of two adjacent video segments may be referred to as a shot boundary frame. In some embodiments, the video clip splitting module 220 may segment a plurality of video clips in the candidate video in units of video clips. When segmenting the candidate video into a plurality of video segments, the segmentation may be performed at shot boundary frames.

Specifically, the method for segmenting the candidate video into the plurality of video segments may specifically include the following specific procedures: step 1, acquiring a plurality of video frames of a candidate video and image characteristics of each video frame; step 2, respectively calculating the similarity between each video frame and a video frame preselected from a plurality of video frames according to the image characteristics of the video frames to determine a shot boundary frame; step 3, the video segment splitting module 220 may split the candidate video into a plurality of video segments according to the shot boundary frame.

The image features in the video frame may include, but are not limited to, a combination of one or more of shape information of an object in the video frame, positional relationship information between a plurality of objects in the video frame, color information of an object in the video frame, a degree of completeness of an object in the video frame, and/or a brightness in the video frame.

Objects in a video frame can be understood as the main objects appearing in each shot. For example, the object may include a living being (human, animal, etc.), a commodity (automobile, daily necessities, ornaments, cosmetics, etc.), a background (mountain, road, bridge, house, etc.), and the like.

The shape information of the object in the video frame refers to contour curve information of the object in the video frame. Such as the contour, shape, etc. of the object. The positional relationship information between the plurality of objects refers to spatial distance information such as orientations and distances between the plurality of objects/human bodies, and may be information of coordinates of an object reference point, for example.

The color information of the object refers to information such as the color saturation of the corresponding object; the integrity of the object refers to the integrity of the target object (for example, only half of the face of the target person, 1/3 faces, etc. are captured in the picture); luminance in a video frame refers to luminance value information in a photograph.

In some embodiments, in step 1 above, the video segment splitting module 220 may further use a feature extraction model to obtain image features of the video frames. For example, the feature extraction model can identify objects in each video and obtain corresponding picture features. For more description of the feature extraction model, reference may be made to the corresponding description of step 330, which is not described herein again.

In step 2, the video segment splitting module 220 may use an inner product of image features of two video frames as a similarity between the two video frames. In some embodiments, calculating the similarity between each video frame and a preselected video frame from the plurality of video frames may be calculating the similarity between each video frame and its preceding and/or succeeding neighboring video frames. For example, in the process of determining the shot boundary frame, the similarity between each video frame and its preceding and/or succeeding neighboring video frames may be calculated, and if the similarity between two neighboring video frames is lower than the similarity threshold, the two neighboring video frames are determined to be the shot boundary frames. In some alternative embodiments, the video segment splitting module 220 may also use a trained discriminant model to determine the similarity between two video frames. For the trained discriminant model, refer to step 350 and the detailed description of fig. 5, which are not repeated herein.

In the step 2, the similarity between each video frame and the video frames preceding and/or following the video frame by the preset interval frame number (e.g. interval 3 frames, 5 frames, 7 frames, etc.) may also be calculated. It is assumed that the preset interval frame number may be set to 2 frames, 3 frames, 5 frames, or the like. And if the similarity between two video frames is calculated to be smaller than a preset threshold value, taking the video frame between the two video frames as a candidate segmentation area, and taking the two video frames as boundary frames of the candidate segmentation area. For example, if the preset number of frames is 2 frames, the similarity between the 10 th frame and the 14 th frame may be calculated, and if the similarity is smaller than the similarity threshold, the 12 th frame and the 13 th frame are taken as candidate segmentation regions, and the 10 th frame and the 14 th frame are taken as boundary frames of the candidate segmentation regions. Then, the candidate segmented regions may be further fused, i.e. the overlapped candidate segmented regions are merged together. If the 12 th frame and the 13 th frame are candidate divided regions and the 13 th frame and the 14 th frame are also candidate divided regions, the 12 th, the 13 th and the 14 th frames are combined into one candidate divided region.

Step 330, generating a plurality of shot features corresponding to the plurality of video shots based on the trained shot feature extraction model and the video clips corresponding to each of the plurality of video shots. In some embodiments, step 330 may be performed by lens feature extraction module 230.

In an embodiment of the present specification, the shot feature extraction module 230 may process the plurality of video segments based on a trained shot feature extraction model to generate corresponding shot features. In particular, it may comprise the following steps:

step S11: the shot feature extraction module 230 may extract a plurality of video frames in a video clip corresponding to each video shot. Video frames may be understood as corresponding frame images that are decomposed from successive images at time intervals. For example, the time interval between each frame of images may illustratively be set to 1/24s (which may also mean that 24 frames of images are acquired at 1 second intervals). In some embodiments, step S11 may be performed by the video frame acquisition unit 231.

Step S12: the shot feature extraction module 230 may obtain one or more image features corresponding to each video frame based on the characterization extraction process. In some embodiments, step S12 may be further performed by the image feature determination unit 232. The feature extraction processing may be processing of the original information and extracting feature data, and the feature extraction processing may improve the expression of the original information to facilitate subsequent tasks.

In some embodiments, the feature extraction process may employ statistical methods (e.g., principal component analysis methods), dimension reduction techniques (e.g., linear discriminant analysis methods), feature normalization, data binning, and the like. For example, taking the brightness in the video frame as an example, the shot feature extraction module 230 may scale the brightness value within 0-80 to [1, 0, 0], scale the brightness value 80-160 to [0, 1, 0], and scale the brightness value above 80 to [0, 0, 1 ].

However, since the obtained image features are various, some obtained image features are difficult to measure with a fixed function or an explicit rule. Therefore, the feature extraction process can also form a predictable model through automatic learning of the collected information by means of machine learning (such as adopting a feature extraction model), so as to obtain higher accuracy. The feature extraction model may be a generation model or a determination model, or may be a deep learning model in machine learning, for example, a deep learning model using a yolo series algorithm, a fasterncn algorithm, or an EfficientDet algorithm. The machine learning model may detect a set object of interest in a frame of each video frame. Objects that need attention may include living things (humans, animals, etc.), merchandise (automobiles, ornaments, cosmetics, etc.), backgrounds (mountains, roads, bridges, houses, etc.), and the like. Further, the object to be focused on may be set as a video, for example, a person or a product. A plurality of shots may be input into the machine learning model, which is capable of outputting image features such as position information, brightness, and the like of objects in the respective shots.

It should be noted that, the feature extraction model used can be arbitrarily changed by those skilled in the art, and the present specification does not limit this. For example, the feature extraction model may be a google lenet model, a VGG model, a ResNet model, or the like. By extracting features of the video frames in a machine learning model manner, image features can be determined more accurately.

Step S13: the shot feature extraction module 230 may process image features in the plurality of video frames and correlations between the image features in the plurality of video frames based on the trained shot feature extraction model, and determine shot features corresponding to the video shots. In some embodiments, step S13 may be performed by the lens characteristic determination unit 233. The trained shot feature extraction model may be a sequence-based machine learning model that may transform a variable-length input into a fixed-length vector representation for output. It can be understood that, because different shots have different durations and the number of corresponding video frames is different, after the shot feature extraction model is processed through training, the shot feature extraction model can be converted into a vector with a fixed length for representation, which is beneficial to subsequent processing.

Illustratively, the sequence-based machine learning model may be a deep learning model (DNN), a recurrent neural network (LSTM) or bidirectional Short-Term Memory (Bi-directional LSTM), a gate repeat unit GRU model, and the like, and combinations thereof, and the description is not limited herein. Specifically, the image features (such as features 1,2,3 …, n) corresponding to the video frames obtained in step 2 and the relationship (such as sequence and/or chronological relationship) thereof are input into the lens feature extraction model,the lens feature extraction model can output the sequence (such as h) of the coding hidden state at each moment₁~h_n) Wherein h is_nAll information of the shots in this period of time is contained. In this way, the shot feature extraction model can convert a plurality of image features within a period of time (such as a video clip corresponding to a shot) into a vector expression h with a fixed length_n(i.e., lens characteristics). The training process of the shot feature extraction model can be referred to the corresponding description of fig. 5, and is not described herein again.

It is understood that the above steps and methods may be respectively applied to a plurality of shots in the candidate video to obtain the shot features of different shots. Here, it is assumed that the shots of a certain candidate video (candidate video c) are 1,2,3 …, m, respectively, and the shot features obtained correspondingly are

This arrangement is used in the following description.

Step 340, generating a video feature vector corresponding to each candidate video based on the plurality of shot features. In some embodiments, step 340 may be performed by video feature vector generation module 240.

The video feature vector generation module 240 may generate a video feature vector corresponding to the candidate video based on the acquisition order of each shot in the candidate video and the shot features. For example, the video feature vector generation module 240 may obtain the video feature vectors corresponding to the candidate videos by means of vector stitching, vector concatenation, or the like. Taking the candidate video c as an example, the previous example is the video feature vector corresponding to the candidate video c

Where the superscript T denotes the matrix transpose.

And 350, determining the similarity degree between any two candidate videos based on the trained discrimination model and the video feature vectors corresponding to the candidate videos, and determining a recommended video set based on the similarity degree. In some embodiments, step 350 may be performed by the recommended video set determination module 250.

The recommended video set determining module 250 may perform similarity degree determination on the video feature vectors corresponding to any two candidate videos based on the trained discriminant model, and determine the recommended video set based on all the similarity degree determination results.

Suppose there are a, b, c … k candidate videos, each with corresponding video feature vectors

The recommended video set determination process may specifically be implemented by the following steps:

step S1: one of the k candidate videos is selected (for example, the candidate video c is selected), and the video feature vectors are respectively calculated

Obtaining the similarity degree comparison results with the similarity degrees of other (k-1) candidate video feature vectors, and placing a group of candidate videos with the minimum similarity degree in the recommended video set (exemplarily, the video feature vectors corresponding to the candidate videos a and c)

，

If the similarity is minimum, adding a and c into the recommended video set);

step S2: further obtaining the average value of the similarity degree of the rest candidate videos (such as k-2 videos) with the existing videos in the recommended video set, and adding the candidate video with the lowest average value of the similarity degree into the recommended video set (for example, if the candidate videos a and c already exist in the recommended video, further determining the video feature vectors corresponding to the rest videos and the video feature vectors corresponding to the rest videos

The average value of (1), adding the video with the minimum average value to the recommended video set);

step S3: and selecting candidate videos with the lowest similarity degree one by one in the step S2 until the number of the selected videos meets the preset requirement.

In some embodiments, in the above step S1, the recommended video set determining module 250 may use an inner product of video feature vectors of two candidate videos as a similarity degree between the two candidate videos. In some embodiments, the recommended video set determination module 250 may also use a vector similarity coefficient to determine a degree of similarity between two candidate videos. The similarity coefficient refers to the similarity between samples calculated by using an equation, and the smaller the value of the similarity coefficient is, the smaller the similarity between individuals is, the larger the difference is. When the similarity coefficient between the two candidate videos is large, it can be determined that the degree of similarity between the two candidate videos is high. In some embodiments, the similarity coefficients used include, but are not limited to, simple match similarity coefficients, Jaccard similarity coefficients, cosine similarity, adjusted cosine similarity, pearson correlation coefficients, and the like.

In some embodiments, in step 1, the recommended video set determining module 250 may also use a trained discriminant model to obtain the similarity degree between the candidate videos. The training process of the discriminant model can be referred to the corresponding description of fig. 5, and is not repeated herein.

In some embodiments, the recommended video set determining module 250 may further determine to cluster the plurality of video feature vectors by using a clustering algorithm to obtain a plurality of video cluster clusters, and then determine the recommended video set based on the plurality of video cluster clusters.

Specifically, it is assumed that the number of videos required by the recommended video set (i.e., the preset value of the recommended video set) to be obtained is P, and the number of actually obtained clusters is Q. If the number P of videos needed by the recommended video set is smaller than or equal to the number Q of actually obtained cluster clusters, selecting P cluster clusters, and selecting a candidate video from each cluster; if the number P of videos needed by the recommended video set is larger than the number Q of the actually obtained cluster clusters, selecting a plurality of candidate videos far away from the cluster center from each cluster, and randomly extracting P videos to form the recommended video set;

in some implementationsIn an example, the recommended video set determination module 250 may obtain a plurality of video cluster clusters based on a density clustering algorithm (e.g., DBSCAN density clustering algorithm). Specifically, the recommended video set determining module 250 determines a preset value of a required recommended video set, that is, determines the number (P) of videos required by the recommended video set; further, neighborhood parameters of the cluster are determined

Wherein, in the step (A),

corresponding to the radius of the cluster in the vector space, corresponding to the minimum value of the number of samples required for forming the cluster by MinPts, obtaining the cluster with the number of Q, and repeatedly adjusting the neighborhood parameters

And clustering the video feature vectors until the obtained cluster number Q is more than or equal to a preset value P of the recommended video set. At the moment, randomly selecting a plurality of video clustering clusters with the number equal to the preset value P; and respectively acquiring a candidate video from each selected video cluster to determine a recommended video set.

It should be understood that the above description of the process 300 is only exemplary and is not intended to limit the scope of the present disclosure. Many modifications and variations will be apparent to those skilled in the art in light of the description. However, such modifications and changes do not depart from the scope of the present specification.

One or more steps in flow 500 may be performed by any of the computing devices in fig. 1, such as first computing system 120, second computing system 150, and/or third computing system 160. In some embodiments, flow 300 may be further performed by system 200. In some embodiments, the flow 300 may be further performed by the lens feature extraction module 230. In some embodiments, the process 300 may be further configured to be performed by the first neural network model training unit 234.

FIG. 4 is a schematic illustration of a model structure according to some embodiments of the present description. In fig. 4, the first neural network model includes a first lens feature extraction model, a second lens feature extraction model and a discrimination model, and the first lens feature extraction model and the second lens feature extraction model have the same model structure; the first neural network model is further trained based on the method of flow 500. It should be noted that any one of the first shot feature extraction model and the second shot feature extraction model may implement the processing of the video clip corresponding to each of the plurality of video shots in step 330 to generate a plurality of shot features corresponding to the plurality of video shots.

Step 510, a first training set is obtained.

The first training set refers to a set of training samples used to train the first initial model. The first training set comprises a plurality of video pairs, wherein each video pair comprises an image feature corresponding to the corresponding first sample video, an image feature corresponding to the second sample video and a label value corresponding to the two sample videos. The corresponding image features obtained from the first sample video and the second sample video can be obtained by adopting a characteristic extraction processing mode. For more description of the image feature and the feature extraction process, reference may be made to the detailed description of step 330 in fig. 3, which is not described herein again.

The label value in the video pair reflects the degree of similarity between the first sample video and the second sample video. The label values in the sample set can be labeled manually, and the corresponding machine learning model can also be used for automatically labeling the video pairs. For example, the similarity degree of each video pair can be obtained by a trained classifier model.

In some embodiments, the manner of acquiring the first training set may be from an image collector such as a camera, a smartphone, or the terminal 110 in fig. 1. In some embodiments, the first training set may be obtained by reading directly from a storage system in which a large number of pictures are stored. In some embodiments, the first training set may also be obtained in any other manner, which is not limited in this embodiment.

The first initial model can be understood as an untrained neural network model or as an untrained complete neural network model. The method further comprises an initialized first lens feature extraction model, an initialized second lens feature extraction model and an initialized discrimination model. Each layer of the initial model may be provided with initial parameters, which may be continuously adjusted during the training process until the training is completed.

Step 520, training a first initial model through multiple iterations based on the first training set to generate a trained first neural network model. Wherein each iteration further comprises:

and step 521, processing the image feature corresponding to the first sample video in the video pair by using the updated first lens feature extraction model to obtain a corresponding first lens feature.

The first shot feature extraction model is a sequence-based machine learning model that can transform a variable-length input into a fixed-length vector representation for output. Specifically, in one or more embodiments of the present description, the first shot feature extraction model performs forward propagation based on the image feature corresponding to the first sample video in the video pair to obtain the corresponding first shot feature.

Step 522, the updated second lens feature extraction model is used for processing the image features corresponding to the second sample video in the same video pair, so as to obtain second lens features.

Similar to step 521, the second shot feature extraction model performs forward propagation based on the image features corresponding to the second sample video in the same video pair, and obtains corresponding second shot features.

Step 523, processing the first lens feature and the second lens feature by using the updated discrimination model to generate a discrimination result, where the discrimination result is used to reflect the similarity between the first lens feature and the second lens feature.

The discriminative model may be a classifier model that determines a degree of similarity between the first lens features and the second lens features. For example, the discriminative model may obtain a distance between the first shot feature and the second shot feature in vector space. The smaller the distance between the vector spaces, the more similar the two shot features. For another example, the discriminative model may also determine a similarity probability value between the first shot feature and the second shot feature.

Step 524, judging whether to perform the next iteration or determine the trained first neural network model based on the judgment result and the label value.

After the discrimination model is propagated forward to obtain the discrimination result, a loss function can be constructed based on the discrimination result and the sample label, and the model parameters are updated based on the reverse propagation of the loss function. In some embodiments, the training sample label data may be represented as

The result of the discrimination is expressed as

The calculated loss function value is expressed as

In some embodiments, different loss functions may be selected according to the type of the model, such as a mean square error loss function or a cross entropy loss function as the loss function, and the like, which is not limited in this specification. In an exemplary manner, the first and second electrodes are,

in some embodiments, a gradient backpass algorithm may be employed to update the model parameters. The back propagation algorithm compares the predicted results for a particular training sample with the label data to determine the update magnitude for each weight of the model. That is, the back propagation algorithm is used to determine the change in the loss function (which may also be referred to as the gradient or error derivative) with respect to each weight, noted as

Furthermore, the gradient back-propagation algorithm may pass the value of the loss function back through the output layer, to the hidden layer and the input layer, and determine the modified value (or gradient) of the model parameter of each layer in turn. The correction value (or gradient) of the model parameter of each layer includes a plurality of matrix elements (e.g., gradient elements) corresponding to the model parameters one-to-one, and each gradient element reflects a correction direction (increase or decrease) and a correction amount of the parameter. In one or more embodiments related to the present specification, after the decision model performs a backward propagation to complete the gradient, the decision model further performs a backward propagation to the first lens feature extraction model and the second lens feature extraction model performs a backward propagation to the model parameters layer by layer to complete a round of iterative updating. Compared with the mode of training each model independently, the mode of training the first shot feature extraction model, the second shot feature extraction model and the judgment model jointly adopts a uniform loss function to train, and the training efficiency is higher.

In some embodiments, it may be determined whether to perform a next iteration or determine a trained first neural network model based on the discrimination result and the label value. The criterion for judgment may be whether the iteration number has reached a preset iteration number, whether the updated model meets a preset performance index threshold, or whether an instruction to terminate training is received. If it is determined that the next iteration is needed, the next iteration may be performed based on the updated first portion of the model for the current iteration process. In other words, the updated model obtained in the next iteration of the current iteration is used as the updated initial model in the next iteration. If it is determined that the next iteration is not required, the updated model obtained in the current iteration process can be used as the finally trained model.

The beneficial effects that may be brought by the embodiments of the present description include, but are not limited to: 1) based on the trained shot feature extraction model, the shot video with variable length is converted into vector expression with fixed length for output, which is beneficial to processing by using a clustering algorithm; 2) clustering is carried out on a plurality of video feature vectors based on a clustering algorithm, so that a recommended video set with low similarity can be obtained more accurately and quickly; 3) the training efficiency can be improved by adopting a mode of joint training of the first lens characteristic extraction model, the second lens characteristic extraction model and the discrimination model. It is to be noted that different embodiments may produce different advantages, and in different embodiments, any one or combination of the above advantages may be produced, or any other advantages may be obtained.

Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.

Also, the description uses specific words to describe embodiments of the description. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.

Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.

Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that more features than are expressly recited in a claim. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.

Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range are approximations, in the specific examples, such numerical values are set forth as precisely as possible within the scope of the application.

For each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., cited in this specification, the entire contents of each are hereby incorporated by reference into this specification. Except where the application history document does not conform to or conflict with the contents of the present specification, it is to be understood that the application history document, as used herein in the present specification or appended claims, is intended to define the broadest scope of the present specification (whether presently or later in the specification) rather than the broadest scope of the present specification. It is to be understood that the descriptions, definitions and/or uses of terms in the accompanying materials of this specification shall control if they are inconsistent or contrary to the descriptions and/or uses of terms in this specification.

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims

1. A method for video recommendation, comprising:

acquiring a plurality of candidate videos;

acquiring a plurality of video frames of the candidate video and image characteristics of each video frame;

according to the image characteristics of the video frames, respectively calculating the similarity between each video frame and a video frame preselected from a plurality of video frames based on a trained discrimination model so as to determine a shot boundary frame;

dividing the candidate video into a plurality of video segments according to shot boundary frames;

generating a plurality of shot features corresponding to the plurality of video shots based on the trained shot feature extraction model and the video clips corresponding to each of the plurality of video shots;

generating a video feature vector corresponding to each candidate video based on the plurality of shot features;

determining the similarity degree between any two candidate videos based on the trained discrimination model and the video feature vectors corresponding to the candidate videos, and determining a recommended video set based on the similarity degree;

the lens feature extraction model and the discrimination model are submodels of a first neural network model, the lens feature extraction model is a sequence-based machine learning model, the discrimination model is a trained classifier model, and the lens feature extraction model and the discrimination model are obtained through multi-round iterative joint training based on a plurality of training samples.

2. The method of claim 1, wherein the determining a similarity degree between any two candidate videos based on the trained discriminant model and the video feature vectors corresponding to the candidate videos and the determining a recommended video set based on the similarity degree further comprises:

clustering the plurality of video feature vectors based on a clustering algorithm to obtain a plurality of video clustering clusters;

determining a recommended video set based on the plurality of video cluster clusters.

3. The method of claim 2, the determining a recommended video set based on the plurality of video cluster clusters, comprising:

determining a preset value of a recommended video set;

processing the video feature vectors by adopting a clustering algorithm for multiple times until the number of the obtained video clustering clusters is greater than a preset value of the recommended video set;

randomly selecting a plurality of video clustering clusters with the number equal to a preset value;

and respectively acquiring a candidate video from each selected video cluster to determine the recommended video set.

4. The method of claim 1, wherein generating a plurality of shot features corresponding to the plurality of video shots based on the trained shot feature extraction model and the video segment corresponding to each of the plurality of video shots comprises:

acquiring a plurality of video frames in a video clip corresponding to each video shot;

determining one or more image features corresponding to each video frame;

and processing the image features in the video frames and the interrelation among the image features in the video frames based on the trained shot feature extraction model, and determining the shot features corresponding to the video shots.

5. The method of claim 4, wherein the first neural network model comprises a first lens feature extraction model, a second lens feature extraction model and a discrimination model, the first lens feature extraction model and the second lens feature extraction model have the same model structure, and any one of the first lens feature extraction model and the second lens feature extraction model can be used for generating a plurality of lens features corresponding to the plurality of video lenses; the first neural network model is trained based on the following steps:

acquiring a first training set, wherein the first training set comprises a plurality of video pairs, each video pair comprises an image feature corresponding to a corresponding first sample video, an image feature corresponding to a second sample video and a label value, and the label value reflects the similarity degree between the first sample video and the second sample video; and

training a first initial model through multiple rounds of iteration based on the first training set to generate a trained first neural network model;

the first initial model comprises an initialized first lens feature extraction model, an initialized second lens feature extraction model and an initialized discrimination model.

6. The method of claim 5, the plurality of iterations to train a first initial model to generate a trained first neural network model, wherein each iteration comprises:

acquiring an updated first initial model generated in the previous iteration;

for each of the video pairs in question,

processing image features corresponding to a first sample video in the video pair by using the updated first lens feature extraction model to obtain corresponding first lens features;

processing image characteristics corresponding to a second sample video in the same video pair by using the updated second lens characteristic extraction model to obtain second lens characteristics;

processing the first shot feature and the second shot feature by using the updated discrimination model to generate a discrimination result, wherein the discrimination result is used for reflecting the similarity degree of the first shot feature and the second shot feature;

and judging whether to perform the next iteration or determine a trained first neural network model based on the judgment result and the label value.

7. The method of claim 4, the corresponding image features of the video frame comprising at least one of: shape information of an object in a video frame, positional relationship information between a plurality of objects in a video frame, color information of an object in a video frame, a degree of completeness of an object in a video frame, and/or a brightness in a video frame.

8. A video recommendation system, comprising:

the candidate video acquisition module is used for acquiring a plurality of candidate videos;

the video clip splitting module is used for acquiring a plurality of video frames of the candidate video and the image characteristics of each video frame, respectively calculating the similarity between each video frame and a video frame preselected from the plurality of video frames according to the image characteristics of the video frames to determine a shot boundary frame, and splitting the candidate video into a plurality of video clips according to the shot boundary frame;

a shot feature extraction module, configured to generate a plurality of shot features corresponding to the plurality of video shots based on the trained shot feature extraction model and the video clip corresponding to each of the plurality of video shots;

the video feature vector generation module is used for generating a video feature vector corresponding to each candidate video based on the plurality of shot features;

the recommended video set determining module is used for determining the similarity degree between any two candidate videos based on the trained discrimination model and the video feature vectors corresponding to the candidate videos and determining a recommended video set based on the similarity degree;

9. The system of claim 8, the recommended video set determination module further to:

10. The system of claim 9, the recommended video set determination module further to:

determining a preset value of a recommended video set;

11. The system of claim 8, the shot feature extraction module further comprising:

the video frame acquisition unit is used for acquiring a plurality of video frames in a video clip corresponding to each video shot;

the image characteristic determining unit is used for determining one or more image characteristics corresponding to each video frame;

and the shot feature determining unit is used for processing the image features in the video frames and the interrelations among the image features in the video frames based on the trained shot feature extraction model and determining the shot features corresponding to the video shots.

12. The system of claim 11, wherein the lens feature extraction model and the discrimination model are sub-models of a first neural network model, the first neural network model comprises a first lens feature extraction model, a second lens feature extraction model and a discrimination model, the first lens feature extraction model and the second lens feature extraction model have the same model structure, and any one of the first lens feature extraction model and the second lens feature extraction model can be used for generating a plurality of lens features corresponding to the plurality of video lenses; the lens feature extraction module further comprises a first neural network model training unit, and the first neural network model training unit is used for:

13. The system of claim 12, the first neural network model training unit to train a first initial model for a plurality of iterations to generate a trained first neural network model, wherein each iteration comprises:

acquiring an updated first initial model generated in the previous iteration;

for each of the video pairs in question,

14. The system of claim 11, the corresponding image features of the video frame comprising at least one of: shape information of an object in a video frame, positional relationship information between a plurality of objects in a video frame, color information of an object in a video frame, a degree of completeness of an object in a video frame, and/or a brightness in a video frame.

15. A video recommendation apparatus comprising a processor for performing the video recommendation method of claims 1-8.