WO2020024958A1

WO2020024958A1 - Method and system for generating video abstract

Info

Publication number: WO2020024958A1
Application number: PCT/CN2019/098495
Authority: WO
Inventors: 曾建平; 吴立薪; 吕晶晶; 包勇军
Original assignee: 北京京东尚科信息技术有限公司; 北京京东世纪贸易有限公司
Priority date: 2018-08-03
Filing date: 2019-07-31
Publication date: 2020-02-06
Also published as: CN110798752A; CN110798752B

Abstract

The present invention relates to the technical field of videos, and provides a method and system for generating a video abstract. The method may comprise: receiving a video and splitting the video into multiple shots according to the video scene change of the video, wherein each of the multiple shots is a video scene having continuous content; calculating the importance score of each shot; selecting a group of shots from the multiple shots, thereby making the total importance score of the selected group of shots be maximum in the case that the constraint condition of the total time length of a video abstract is satisfied; and splicing the selected group of shots into the video abstract and outputting the video abstract. The present invention may make the video abstract comprise some important shots or segments.

Description

Method and system for generating video summary

Cross-reference to related applications

This application is based on an application with a CN application number of 201810874321.2 and an application date of August 3, 2018, and claims its priority. The disclosure of this CN application is incorporated herein as a whole.

Technical field

The present disclosure relates to the field of video technology, and in particular, to a method and system for generating a video abstract.

Background technique

The video summary is to select key frames or key segments from a longer video and stitch them into a shorter video, so that the viewer can understand the content of the original video or enjoy the wonderful clips in the original video in a short time. Video summary has a wide range of application scenarios, including personal video editing, TV movie plot introduction, video assisted criminal investigation and Internet short video. In the related art method for generating a video summary, due to the strong subjectivity of the video evaluation, the generated video summary may lose some important segments or exciting content.

For example, the video abstraction method of related technologies generally selects key frames and key fragments based on some general criteria, and there are few methods for generating video abstractions for specific scenes and applications. This leads to the poor performance of this method in some specific application scenarios, especially in the field of video advertising. The summary-processed advertising video may lose key segments used to introduce product brands and product characteristics.

Summary of the invention

A technical problem solved by the embodiments of the present disclosure is to provide a method for generating a video summary, so that the video summary can include some relatively important shots or fragments.

According to an aspect of the embodiments of the present disclosure, there is provided a method for generating a video summary executed by a computer, including: receiving a video, and dividing the video into multiple shots according to a change in a video scene of the video Each shot of the plurality of shots is a continuous video scene; calculating the importance score of each shot; selecting a set of shots from the plurality of shots so as to meet the constraint of the total length of the video summary The total importance score of the selected group of shots is the largest under the condition; and the selected group of shots is stitched into a video summary and the video summary is output.

In some embodiments, the step of calculating the importance score of each shot includes: extracting a feature vector from each shot using a three-dimensional convolution network to obtain a feature vector sequence of a shot set composed of the plurality of shots; and The feature vector sequence is input to a pre-trained shot importance score calculation network to calculate the importance score of each shot.

In some embodiments, before segmenting the video into multiple shots, the method further includes: training a shot importance score calculation network using a reinforcement learning method, wherein the method of reinforcement learning includes Key elements include: action and value reward functions, which include: diversity indicators and representative indicators.

In some embodiments, before selecting the set of shots from the plurality of shots, the method further includes identifying at least one shot among the plurality of shots that exhibits a key feature.

In some embodiments, the key feature includes at least one of a product brand trademark and a product brand text.

In some embodiments, the step of identifying at least one shot that exhibits a key feature among the plurality of shots includes: using a deep learning-based object detection method to detect a trademark region in each frame of the video of the video; and The image is input to a pre-trained depth model to extract the embedded feature vector, and the embedded feature vector is compared with the feature vector of the trademark image in the database to obtain the brand type of the trademark, so as to identify at least one displaying the brand brand of the product Footage; or, using deep learning-based optical character recognition methods to identify the text in each frame of the video; and word segmentation of the text, and matching the processed text with the brand text in the database, keeping it relevant to the product brand Text to identify at least one shot showing the brand text of the product.

In some embodiments, the step of selecting a group of shots from the plurality of shots includes: selecting at least one main shot from the at least one shot showing key features, and dividing the shot from the plurality of shots. At least one auxiliary lens is selected from the remaining lenses other than the at least one main lens, and the at least one main lens and the at least one auxiliary lens are taken as a selected group of lenses.

In some embodiments, the step of selecting at least one main shot from the at least one shot showing the key feature includes: if the shot selected from the at least one shot showing the key feature is the first Ng shots of the video or The rearmost Ng lens, it is determined that the frontmost Ng lens or the rearmost Ng lens is the at least one main lens and Ng is a positive integer; at least one auxiliary lens is selected from the remaining lenses in the plurality of lenses The step of using the at least one main lens and the at least one auxiliary lens as the selected group of lenses includes: selecting at least one from among the remaining lenses except the at least one main lens among the plurality of lenses. An auxiliary lens, using the at least one main lens and the at least one auxiliary lens as a selected group of lenses, so that the selected group of lenses has a total importance score under the condition that the constraint condition of the total length of the video summary is satisfied Maximum; the step of stitching the selected set of shots into a video summary includes: combining the at least one main shot and the at least one auxiliary shot Lens spliced into video summary in chronological order.

In some embodiments, before at least one shot showing key features is identified in the plurality of shots, the method further includes: calculating the similarity between each shot and the advertised product picture, and using the similarity correction The importance score of the corresponding shot.

In some embodiments, the step of calculating the similarity between each shot and the advertised product picture and using the similarity to modify the importance score of the corresponding shot includes: calculating a feature vector of the advertised product picture; for each Multi-frame images of the lens are sampled to obtain a sample frame, and a feature vector of the sample frame of each lens is calculated; each lens and the product are calculated according to the feature vector of the product picture and the feature vector of the sample frame of each lens Similarity of pictures; and correcting importance scores of corresponding shots according to the similarity and a preset similarity threshold.

According to another aspect of the embodiments of the present disclosure, there is provided a system for generating a video digest, including: a video segmentation unit configured to receive a video and cut the video according to a change in a video scene of the video Divided into a plurality of shots, wherein each shot of the plurality of shots is a continuous video scene; the calculation unit is configured to calculate the importance score of each shot; the selection unit is configured to extract A group of shots is selected from a plurality of shots, so that the total importance score of the selected set of shots is the largest when the constraint condition of the total length of the video summary is satisfied; and a stitching unit is configured to set the selected A set of shots is stitched into a video summary and the video summary is output.

In some embodiments, the calculation unit is configured to extract a feature vector from each shot using a three-dimensional convolution network to obtain a feature vector sequence of a shot set composed of the plurality of shots, and to convert the feature vector sequence Input to a pre-trained shot importance score calculation network to calculate the importance score of each shot.

In some embodiments, the system further includes: a training unit configured to train a lens importance score calculation network by using a reinforcement learning method, wherein the key elements included in the method of reinforcement learning include: action And value reward function, the value reward function includes: a diversity indicator and a representative indicator.

In some embodiments, the system further includes a recognition unit configured to identify at least one shot among the plurality of shots that exhibits a key feature.

In some embodiments, the recognition unit is configured to: use a deep learning-based object detection method to detect a trademark region in each frame of video of the video; and input the image of the trademark region into a pre-trained depth model to extract embedded features Vector, and comparing the embedded feature vector with the feature vector of the trademark image in the database to obtain the brand type of the trademark, thereby identifying at least one shot showing the brand brand of the product; or using optical character recognition based on deep learning The method recognizes the text in each frame of the video; performs word segmentation on the text, matches the processed text with the brand text in the database, and retains the text related to the product brand, thereby identifying at least the product brand text A shot.

In some embodiments, the selection unit is configured to select at least one main shot from the at least one shot that exhibits key features, and to remove from the rest of the plurality of shots except the at least one main shot At least one auxiliary lens is selected from the lenses, and the at least one main lens and the at least one auxiliary lens are taken as a selected group of lenses.

In some embodiments, the selecting unit is configured to determine the foremost Ng if the lens selected from the at least one lens showing the key feature is the foremost Ng lens or the last Ng lens of the video. The three lenses or the rearmost Ng lenses are the at least one main lens and Ng is a positive integer; and at least one auxiliary lens is selected from the remaining lenses of the plurality of lenses except the at least one main lens, Using the at least one main shot and the at least one auxiliary shot as a selected group of shots, so that the selected group of shots has the largest total importance score when the constraint condition of the total length of the video summary is satisfied; The stitching unit is configured to stitch the at least one main shot and the at least one auxiliary shot into a video summary in chronological order.

In some embodiments, the system further includes a correction unit configured to calculate the similarity between each shot and the advertised product picture, and use the similarity to correct the importance score of the corresponding shot.

In some embodiments, the correction unit is configured to: calculate a feature vector of the advertised product picture; sample a plurality of frames of each shot to obtain a sampling frame, and calculate a feature vector of the sampling frame of each shot; Calculating the similarity between each shot and the product picture according to the feature vector of the product picture and the feature vector of the sampling frame of each shot; and the importance of the corresponding shot to the corresponding shot according to the similarity and a preset similarity threshold Scores are corrected.

According to another aspect of the embodiments of the present disclosure, there is provided a system for generating a video digest, including: a memory; and a processor coupled to the memory, the processor being configured to be based on the memory stored in the memory. The instructions execute the method described previously.

According to another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor, implement the steps of the method described above.

In the above method, after receiving the video, after dividing the video into multiple shots, the importance score of each shot is calculated, and in the process of selecting a set of shots, a constraint is selected to satisfy the total length of the video summary In the case of a condition, a group of shots with the highest total importance score is stitched into a video summary and the video summary is output. Therefore, this method can make the video summary contain some more important shots or fragments.

Other features and advantages of the present disclosure will become apparent from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which form a part of the specification, describe embodiments of the present disclosure and, together with the description, serve to explain principles of the present disclosure.

The disclosure can be understood more clearly with reference to the accompanying drawings, based on the following detailed description, in which:

1 is a flowchart illustrating a method for generating a video digest according to some embodiments of the present disclosure;

2 is a flowchart illustrating a method of calculating an importance score of each shot according to some embodiments of the present disclosure;

3 is a flowchart illustrating a method of calculating an importance score of each shot according to other embodiments of the present disclosure;

4 is a flowchart illustrating a method for generating a video digest according to some embodiments of the present disclosure;

5 is a flowchart illustrating a method of correcting an importance score of a lens according to some embodiments of the present disclosure;

6 is a structural diagram illustrating a system for generating a video digest according to some embodiments of the present disclosure;

7 is a structural diagram illustrating a system for generating a video digest according to some embodiments of the present disclosure;

8 is a structural diagram illustrating a system for generating a video digest according to other embodiments of the present disclosure;

FIG. 9 is a structural diagram illustrating a system for generating a video digest according to other embodiments of the present disclosure.

detailed description

Various exemplary embodiments of the present disclosure will now be described in detail with reference to the drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure.

At the same time, it should be understood that, for the convenience of description, the dimensions of the various parts shown in the drawings are not drawn according to the actual proportional relationship.

The following description of at least one exemplary embodiment is actually merely illustrative and in no way serves as any limitation on the present disclosure and its application or use.

Techniques, methods, and equipment known to those of ordinary skill in the relevant field may not be discussed in detail, but where appropriate, the techniques, methods, and equipment should be considered as part of the description.

In all examples shown and discussed herein, any specific value should be construed as exemplary only and not as a limitation. Therefore, other examples of the exemplary embodiments may have different values.

It should be noted that similar reference numerals and letters indicate similar items in the following drawings, so once an item is defined in one drawing, it need not be discussed further in subsequent drawings.

FIG. 1 is a flowchart illustrating a method for generating a video digest according to some embodiments of the present disclosure. FIG. 2 is a flowchart illustrating a method of calculating an importance score of each shot according to some embodiments of the present disclosure. FIG. 3 is a flowchart illustrating a method of calculating an importance score of each shot according to other embodiments of the present disclosure. The method for generating a video summary executed by a computer according to some embodiments of the present disclosure is described in detail below with reference to FIGS. 1 to 3. As shown in FIG. 1, the method may include steps S102 to S108.

As shown in FIG. 1, in step S102, a video is received, and the video is divided into multiple shots according to changes in the video scene of the video, where each shot of the multiple shots is a segment of continuous video scene.

For example, for a video sequence V = {I _i | i = 1, ..., N}, where I _i is a frame of video image. According to the change of the video scene, it is divided into multiple shots S _{t of} different lengths, which constitute a lens set S = {S _t | t = 1, ..., T}, T> 1 and T is a positive integer . Each shot is a continuous video scene. Assuming that the length of each shot (that is, the number of video frames contained in each shot) is sl _t , the set of all shot lengths is expressed as SL = {sl _t | t = 1, ..., T}.

In some embodiments, a KTS (Kernel Temporal Segmentation, Kernel Time Domain Segmentation) method may be used to segment a video into multiple shots. This method has good segmentation effect and fast speed. However, the present disclosure is not limited to using the KTS method, and other lens segmentation methods may also be adopted.

In step S104, the importance score of each shot is calculated.

In some embodiments, step S104 may include: extracting a feature vector from each shot using a three-dimensional convolution network to obtain a feature vector sequence of a shot set composed of the plurality of shots; and inputting the feature vector sequence to A pre-trained shot importance score calculation network to calculate the importance score of each shot.

For example, a block diagram of a model that implements the calculation of importance scores (the process of calculating importance scores can also be referred to as importance scores) is shown in Figure 2. Use a three-dimensional convolutional network (C3D Net) to extract feature vectors from the video shots, and obtain a feature vector sequence X = {X _t | t = 1, for the shot set S (S = {S _t | t = 1, ..., T}) …, T}, where X _t ∈ R ^d1 , R is the set of real numbers, and d1 represents the dimension. Then input the feature vector sequence X to the lens importance score calculation network that has been trained to calculate the importance score (or importance probability value) of each lens sv _t ∈ [0,1] to obtain the lens importance The sexual sequence SV = {sv _t | t = 1, ..., T}. The two sub-networks for calculating the importance score are described below.

(1) Video lens feature extraction network

Video footage is a sequence of images that can be represented by a three-dimensional matrix. A three-dimensional convolutional network (C3D Net) can be used to process the shots and extract one-dimensional feature vectors. That is, a three-dimensional convolutional network is used as a feature extraction network for video shots. For example, an inflated 3D convolutional network (I3D for short) can be used to process the lens.

For example, Kinetics-600 is a video classification dataset, which contains the activities of 600 categories of people, a total of more than 500,000 10-second video clips. First use the Kinetics-600 dataset to pre-train the I3D network, then use the I3D network to process the video shot S _t , and use the output of the last pooling layer of the network as the feature vector X _t , so that the shot set S = {S _t | t = 1, ..., T} is transformed into a feature vector sequence X = {X _t | t = 1, ..., T}. Because the pre-trained I3D network has strong video classification capabilities, the output of its last pooling layer is a type of feature embedding, which characterizes the essential characteristics of video content.

In the embodiment of the present disclosure, it is not limited to adopting an I3D network, and other types of three-dimensional convolutional networks may also be used for feature extraction of a video lens.

(2) Lens importance score calculation network

The lens importance score calculation network may be a time-series network, for example, it may be a recurrent neural network (Recurrent Neural Network, RNN for short). The lens importance score calculation network can be input with a chronological sequence of feature vectors X = {X _t | t = 1, ..., T}, and output a lens importance score sequence SV = {sv _t | t = 1, …, T}. For example, a bidirectional LSTM (Long Short-Term Memory) can be used to implement this network, as shown in Figure 3.

In some embodiments, before segmenting the video into multiple shots, the method may further include: training a shot importance score calculation network using a reinforcement learning method. The key elements of this reinforcement learning method include: action and value reward functions. The value reward function includes: a diversity indicator and a representative indicator. Reinforcement learning is used to train the above model without labeling the video. The reinforcement learning method is an unsupervised learning method.

The basic idea of reinforcement learning is to take multiple actions at random in a certain state of the system, calculate the value generated by each action, and optimize the system by rewarding the action of higher value and punishing the action of lower value, so that it tends to For choosing higher value actions. Therefore, reinforcement learning has two key elements: actions and reward functions.

For example, define actions related to camera selection:

It means that the lens whose time serial number is y _i is selected, so Y can represent the lens selection action, the time serial number set of the selected lens, and | Y | the number of elements in the set. The network outputs the importance probability value p _t = sv _t for each video shot, and samples whether the shot is selected based on the Bernoulli distribution, that is, a _t ～ Bernoulli (p _t ), using π _θ (a _t | p _t ), Where θ is a parameter of the above-mentioned two-way LSTM model, so the probability of occurrence of the lens selection action Y is

The value reward function R (S) has two indicators: diversity R _div and representative R _rep , which are defined as follows:

among them,

R (S) = R _div + R _rep . (4)

Here, the length || || X _t X _t ₂ represents the feature vector, obtained by squaring the square of each element of the feature vector X _t, and opening; || X _{t '||} ₂ indicates the feature vector X _t' length , Obtained by square-summing the squares of the elements of the feature vector X _{t '} ;

Represents the transpose of the feature vector X _t .

The diversity index measures the diversity of content between different shots, and the representative index measures how much the selected video shots represent the original video.

The goal of reinforcement learning is to maximize the expectation of the reward function R (S) for all possible actions. The mathematical description is as follows:

Among them, a _{1: T} represents the action taken, that is, which lenses are selected and which lenses are not selected, and pθ (a _{1: T} ) represents the probability that the action a _{1: T} occurs.

Because the probability of the camera selecting action Y is

The gradient of all objective functions can be expressed as:

By sampling the lens selection action, the above gradient expectations can be approximated, that is:

During the actual calculation of expectations, some actions can be sampled to approximate the expectations. Here, N is the number of actions sampled.

Based on the above reinforcement learning method, for example, the aforementioned two-way LSTM network is trained using a large number of advertisement videos on Jingdong Mall, and a trained lens importance score calculation network is obtained as a video lens importance score network model.

Returning to FIG. 1, in step S106, a group of shots is selected from a plurality of shots, so that the total importance score of the selected group of shots is the largest when the constraint condition of the total length of the video summary is satisfied.

For example, the constraint condition that the total length of the video summary needs to be satisfied may not exceed the required total length of the video summary. A set of shots is selected from the plurality of shots, and the set of shots has the largest total importance score when the constraint condition of the total length of the video summary is satisfied.

In step S108, the selected group of shots is stitched into a video summary and the video summary is output. For example, you can stitch the set of shots into a video summary in chronological order. The video summary may be output to a display to display the video summary. Of course, the video summary can also be output to other devices.

So far, a method for generating a video summary of some embodiments is provided. In this method, after receiving a video, after dividing the video into multiple shots, the importance score of each shot is calculated, and a shot with a larger importance score is a more important shot. In the process of selecting a group of shots, a group of shots with the largest total importance score is selected under the condition that the total length of the video summary is satisfied, and the set of shots is stitched into a video summary and the video summary is output. Therefore, this method can be used to make the video summary contain some more important shots or fragments.

In some embodiments, before step S106, the method may further include identifying at least one shot that exhibits a key feature among the plurality of shots. For example, the key feature may include at least one of a product brand trademark and a product brand text.

In some embodiments, the above step S106 may include: selecting at least one main shot from at least one shot showing key features, and selecting at least one from the remaining shots of the plurality of shots except for at least one main shot The auxiliary lens, using the at least one main lens and the at least one auxiliary lens as a selected group of lenses.

In the method of the above embodiment, at least one shot showing key features is identified, and at least one main shot is selected from at least one shot showing key features, and at least one auxiliary shot is selected from other remaining shots. The at least one main shot and the at least one auxiliary shot are used as a selected group of shots, so that the total importance score of the set of shots is the largest when the constraint condition of the total length of the video summary is satisfied. Stitch the set of shots into a video summary. In this way, the obtained video summary contains key shots, such as the key shots used to introduce product brands or product names in advertising videos, so as to highlight the important information of the video as much as possible.

In some embodiments, before at least one shot showing key features is identified in the plurality of shots, the method may further include: calculating the similarity between each shot and the advertised product picture, and using the similarity correction The importance score of the corresponding shot (ie, the shot corresponding to the similarity). With this amendment, the importance of those lenses that focus on displaying products can be enhanced, thereby enhancing the ability of video summaries to display products.

FIG. 4 is a flowchart illustrating a method for generating a video digest according to other embodiments of the present disclosure. As shown in FIG. 4, the method may include steps S402 to S410.

In step S402, a video is received, and the video is divided into multiple shots according to changes in the video scene of the video, where each shot of the multiple shots is a segment of a continuous video scene. This step S402 is the same as or similar to step S102, and details are not described herein again.

In step S404, the importance score of each shot is calculated. This step S404 is the same as or similar to step S104, and details are not described herein again.

In step S406, the similarity between each lens and the advertised product picture is calculated, and the importance score of the corresponding lens is corrected by using the similarity. The process of step S406 will be described in detail later with reference to FIG. 5.

In step S408, at least one shot showing key features is identified among the plurality of shots. For example, the key feature may include at least one of a product brand trademark and a product brand text.

For example, in advertising videos, there are generally shots showing product brands at the beginning or end of the video. This is to deepen the impression of the product brand on the advertising audience and to promote the brand. Therefore, you can identify and extract the shots of the advertising brand and summarize them in the summary After the ad video. The two sources of information used in the embodiments of the present disclosure to identify advertising brand lenses include: product brand trademarks and product brand text. For example, Jingdong mascots and Jingdong characters.

In some embodiments, advertising brand lens recognition may include two steps of brand trademark or text recognition and brand lens determination. It is as follows: (1) using the object detection technology to identify the brand trademark, or using OCR (Optical Character Recognition, optical character recognition) technology to identify the brand text; (2) brand lens determination: for the lens S _t , its length (ie the number of video frames) For sl _t , if the brand trademark or text is in the center area of the image and appears in consecutive N _c frame images, then this shot is determined as an advertising brand shot. For example, N _c ≥sl _t / 2.

In some embodiments, step S408 may include: using a deep learning-based object detection method to detect a trademark area in each frame of the video. For example, the object detection method can use Faster-RCNN (Faster, Region, CNN, Detector, Faster Region CNN Detector), SSD (Single Shot Detector, Single Frame Detector), and YOLO (Detector by "You only look at the scene" "Can only look at the detector once" etc., but is not limited to these methods. The step S408 may further include: inputting the image of the trademark region into a pre-trained depth model to extract the embedded feature vector, and comparing the embedded feature vector with the feature vector of the trademark image in the database to obtain the brand type of the trademark ( (Such as JD.com, Apple, Haier, etc.), so as to identify at least one shot showing the brand logo of the product. For example, the at least one lens may include a plurality of lenses. For example, if the database stores feature vectors of N trademark images, the extracted embedded feature vectors are compared with the feature vectors of these N trademark images to obtain the brand type of the trademark.

In other embodiments, step S408 may include: identifying the text in each frame of the video using the OCR method based on deep learning; and performing word segmentation processing on the text, and processing the processed text with the brand text in the database. Match and retain the text related to the product brand to identify at least one shot showing the product brand text.

In step S410, a set of shots is selected from a plurality of shots, so that the total importance score of the selected set of shots is the largest when the constraint condition of the total length of the video summary is satisfied.

In the embodiment of the present disclosure, in the process of generating a video summary, a group of shots needs to be selected and stitched together to obtain a final summary video file. Which shots are selected can be represented by the set SU = {su _t | t = 1, ..., T}, where su _t ∈ {0,1} indicates whether the shot is selected. For example, su _t is 1, which means that the lens is selected; su _t is 0, which means that the lens is not selected.

For the lens set S = {S _t | t = 1, ..., T}, selecting a group of lenses under the condition of satisfying the total duration constraint to maximize the total lens importance score can be reduced to an optimization problem, as follows:

limited by

Among them, sv _t is the importance score of the lens, sl _t is the length of the lens, su _t indicates whether the lens is selected, and ST is the maximum duration of the summary video. This optimization problem can be solved using dynamic programming methods.

In some embodiments, step S410 may include: selecting at least one main shot from at least one shot showing key features, and selecting from the remaining shots of the plurality of shots except the at least one main shot At least one auxiliary lens, using the at least one main lens and the at least one auxiliary lens as a selected group of lenses.

In some embodiments, the step of selecting at least one main shot from at least one shot showing key features may include: if the shot selected from the at least one shot showing key features is the first N _g shots or the last face of the video N _g lenses, it is determined that the frontmost N _g lenses or the rearmost N _g lenses are the at least one main lens, and N _g is a positive integer, for example, the value of N _g is 1 to 2.

For example, if the lens S _t is identified as the lens used to display the brand of the advertised product, and it is the first N _g or the last N _g of the lens set S, that is t ≦ N _g or t> KN _g , K Is the total number of shots, then this shot S _t is the selected advertisement brand shot. For example, the value of N _g is 1 to 2. Because a basic purpose of advertising is to let the advertising audience know the brand of the product, the brand of the product can be displayed and emphasized in the summary video.

In some embodiments, the step of selecting at least one auxiliary lens from the remaining lenses in the plurality of lenses, and using the at least one main lens and the at least one auxiliary lens as the selected group of lenses may include: At least one auxiliary lens is selected from the remaining lenses except the at least one main lens, and the at least one main lens and the at least one auxiliary lens are used as a selected group of lenses, so that the selected group The lens has the largest total importance score when it meets the constraint of the total length of the video summary.

For example, S ^pre is above the selected brand advertising lens set in the lens set S \ S ^pre (meant to exclude residual lens after S ^pre set) using a dynamic programming method to solve the optimization problem, choice of lens, and satisfies the remaining length constraint .

In step S412, the selected group of shots is stitched into a video summary and the video summary is output.

In some embodiments, the step of stitching the selected group of shots into a video summary may include: stitching at least one main shot and at least one auxiliary shot into a video summary in chronological order. For example, at least one main shot and at least one auxiliary shot may be sorted in time, and finally stitched into an advertisement video summary.

In other embodiments, the lens may exhibit not critical features of the top N video shots or the rearmost _g N _g shots, but some lenses video intermediate portion. In such a case, one or some shots can be selected as the main shot from these shots showing key features. Then select the secondary lens from the remaining shots. In the process of stitching the main shot and the auxiliary shot into a video summary, the main shot is placed at the front or the rear of the video summary, and the auxiliary shots are arranged in chronological order, so that the main shot and the auxiliary shot are stitched into a video Summary.

So far, a method for generating a video summary according to other embodiments of the present disclosure is provided. In this method, after receiving a video, after dividing the video into multiple shots, the importance score of each shot is calculated. Among them, a shot with a larger importance score is a more important shot. At least one shot showing key features is identified, and at least one main shot is selected from at least one shot showing key features, and at least one auxiliary shot is selected from other remaining shots. The at least one main shot and the at least one auxiliary shot are used as a selected group of shots, so that the total importance score of the set of shots is the largest when the constraint condition of the total length of the video summary is satisfied. The set of shots is stitched into a video summary and the video summary is output. In this way, the obtained video summary contains key shots, such as the key shots used to introduce product brands or product names in advertising videos, so as to highlight the important information of the video as much as possible.

The method of some embodiments of the present disclosure focuses on retaining key segments that introduce product brands and product characteristics in short video advertisements, and guarantees a certain degree of continuity and excitement of the video content after the summary.

One purpose of advertising is to show the appearance of the product to the advertising audience, to build an impression of the product in their minds, so the shots that highlight the product can be identified in the advertising video and output to the video summary. The main image of the product generally contains the overall appearance of the product. The similarity between the video lens and the main image of the product can identify the lens that displays the main content of the product. If the main picture of the product promoted by the advertising video can be obtained, the lens importance score can be corrected.

FIG. 5 is a flowchart illustrating a method of correcting an importance score of a lens according to some embodiments of the present disclosure. The process shown in FIG. 5 is a specific implementation manner of step S406 in FIG. 4. The specific process of step S406 in FIG. 4 is described in detail below with reference to FIG. 5. As shown in FIG. 5, the process of correcting the importance score of the lens may include steps S502 to S508.

In step S502, a feature vector of the advertised product picture is calculated.

For example, a classification model based on deep learning (for example, Very Deep Convolutional Network (VGG), Google Inception Convolutional Network (Inception), ResNet (Residual Convolutional Network, residual) convolutional network)) calculations for the image (or referred to the goods master FIG.) I _M damascene feature vectors X _{_M} ∈R ^d2, X _M is the dimension of the feature vector d2.

In step S504, the multi-frame image of each lens is sampled to obtain a sample frame, and the feature vector of the sample frame of each lens is calculated.

For example, one frame image is selected for every several frames (for example, every 5 frames) of the video image in each shot S _t , and the classification feature model in step S502 is used to calculate the embedded feature vectors of these images to obtain a feature vector set {X _ti | i = 1, ..., N _t }. Here N _t represents the number of images sampled by the lens S _t .

In step S506, the similarity between each shot and the product picture is calculated according to the feature vector of the product picture and the feature vector of the sampling frame of each shot.

For example, for each shot S _t , calculate the cosine similarity between the feature vector set {X _ti | i = 1, ..., N _t } and the feature vector X _{M of the} product picture to obtain the similarity set {sm _ti | i = 1, ..., N _t }, and take the median value of the similarity set sm _t = median {sm _ti | i = 1, ..., N _t } as the similarity between the lens and the product picture.

In step S508, the importance score of the corresponding lens is modified according to the similarity and a preset similarity threshold. For example, the importance scores of some important shots can be modified, or the importance scores of each shot can be modified.

For example, the following formula may be used to modify the lens importance score sv _t , where tsm is a similarity threshold, for example, the similarity threshold may take a value of 0.5 to 0.6. The formula for correcting the lens importance score sv _t is:

So far, a method of correcting the importance score of a lens according to some embodiments is provided. By calculating the similarity between the lens and the product picture, and correcting the importance score of the lens according to the similarity, the importance of the lens that focuses on the product can be enhanced, thereby enhancing the ability of the video summary to display the product.

FIG. 6 is a structural diagram illustrating a system for generating a video digest according to some embodiments of the present disclosure. As shown in FIG. 6, the system may include a video segmentation unit 602, a calculation unit 604, a selection unit 606, and a stitching unit 608.

The video segmentation unit 602 is configured to receive a video, and segment the video into multiple shots according to a change in the video scene of the video. Each shot of the plurality of shots is a continuous video scene.

The calculation unit 604 is configured to calculate an importance score of each shot.

The selecting unit 606 is configured to select a group of shots from the plurality of shots, so that the total importance score of the selected set of shots is the largest when the constraint condition of the total length of the video summary is satisfied.

The stitching unit 608 is configured to stitch the selected group of shots into a video summary and output the video summary.

In the system of this embodiment, the video segmentation unit receives the video and divides the video into multiple shots according to changes in the video scene; the calculation unit calculates the importance score of each shot; and the selection unit selects from among the multiple shots Selecting a group of shots so that the total importance score of the selected group of shots is the largest under the constraint condition of the total length of the video summary; and the stitching unit stitches the selected group of shots into a video summary and outputs the Video summary. This system can make the video summary contain some more important shots or clips.

In some embodiments, the calculation unit 604 may be configured to extract a feature vector from each shot using a three-dimensional convolution network, obtain a feature vector sequence of a shot set composed of the plurality of shots, and input the feature vector sequence. To a pre-trained shot importance score calculation network to calculate the importance score of each shot.

FIG. 7 is a structural diagram illustrating a system for generating a video digest according to other embodiments of the present disclosure. As shown in FIG. 7, the system may include a video segmentation unit 602, a calculation unit 604, a selection unit 606, and a stitching unit 608.

In some embodiments, as shown in FIG. 7, the system may further include a training unit 714. The training unit 714 is configured to train a lens importance score calculation network using a reinforcement learning method. The key elements of this reinforcement learning method include: action and value reward functions. The value reward function includes: a diversity indicator and a representative indicator.

In some embodiments, as shown in FIG. 7, the system may further include an identification unit 710. The identification unit 710 is configured to identify at least one shot that exhibits a key feature among the plurality of shots. For example, the key feature may include at least one of a product brand trademark and a product brand text.

In some embodiments, the recognition unit 710 may be configured to: use a deep learning-based object detection method to detect a trademark region in each frame of video of the video; and input the image of the trademark region into a pre-trained depth model to extract an embedding Feature vector, and comparing the embedded feature vector with the feature vector of the trademark image in the database to obtain the brand type of the trademark, thereby identifying at least one shot showing the brand brand of the product.

In other embodiments, the recognition unit 710 may be configured to: use a deep learning-based optical character recognition method to recognize text in each frame of the video; and perform word segmentation processing on the text, and compare the processed text with a database The brand text in the match is matched, and the text related to the product brand is retained, thereby identifying at least one shot showing the product brand text.

In some embodiments, the selecting unit 606 may be configured to select at least one main shot from at least one shot showing key features, and to select remaining shots from the plurality of shots other than the at least one main shot. Select at least one auxiliary lens, and use the at least one main lens and the at least one auxiliary lens as the selected group of lenses.

In some embodiments, the selection unit 606 may be configured to determine the forefront if the shot selected from the at least one shot showing the key feature is the first N _g shots or the last N _g shots of the video. N _g shots or the rearmost N _g shots are the at least one main shot, and N _g is a positive integer; and at least one selected from the remaining shots of the plurality of shots except the at least one main shot An auxiliary lens, using the at least one main lens and the at least one auxiliary lens as a selected group of lenses, so that the selected group of lenses has the largest total importance score when the constraint condition of the total length of the video summary is satisfied .

In some embodiments, the stitching unit 608 may be configured to stitch the at least one main shot and the at least one auxiliary shot into a video summary in chronological order.

In some embodiments, as shown in FIG. 7, the system may further include a correction unit 712. The correction unit 712 is configured to calculate the similarity between each shot and the advertised product picture, and use the similarity to correct the importance score of the corresponding shot.

In some embodiments, the correction unit 712 may be configured to: calculate a feature vector of the advertised product picture; sample multiple frames of each shot to obtain a sampling frame, and calculate a feature vector of the sampling frame of each shot ; Calculate the similarity between each shot and the product picture according to the feature vector of the product picture and the feature vector of the sampling frame of each shot; and the importance score of the corresponding shot according to the similarity and a preset similarity threshold Make corrections.

FIG. 8 is a structural diagram illustrating a system for generating a video digest according to other embodiments of the present disclosure. The system includes a memory 810 and a processor 820. among them:

The memory 810 may be a magnetic disk, a flash memory, or any other non-volatile storage medium. The memory is configured to store instructions in the embodiment corresponding to at least one of FIG. 1 to FIG. 5.

The processor 820 is coupled to the memory 810 and may be implemented as one or more integrated circuits, such as a microprocessor or a microcontroller. The processor 820 is configured to execute instructions stored in the memory, so that the video summary includes some more important shots or clips, or contains some key shots or clips.

In some embodiments, as shown in FIG. 9, the system 900 includes a memory 910 and a processor 920. The processor 920 is coupled to the memory 910 through a BUS bus 930. The system 900 may also be connected to an external storage device 950 through a storage interface 940 to call external data, and may also be connected to a network or another computer system (not shown) through a network interface 960, which is not described in detail here.

In this embodiment, a data instruction is stored in a memory, and the above instruction is processed by a processor, so that the video summary includes some more important shots or fragments, or some key shots or fragments.

In other embodiments, the present disclosure also provides a computer-readable storage medium having computer program instructions stored thereon, which are executed by a processor to implement at least one of the embodiments corresponding to FIG. 1 to FIG. 5 Steps of the method. Those skilled in the art should understand that the embodiments of the present disclosure may be provided as a method, an apparatus, or a computer program product. Therefore, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code therein. .

The present disclosure is described with reference to flowcharts and / or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present disclosure. It should be understood that each process and / or block in the flowcharts and / or block diagrams and combinations of processes and / or blocks in the flowcharts and / or block diagrams can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions generated by the processor of the computer or other programmable data processing device are used to generate instructions Means for implementing the functions specified in one or more flowcharts and / or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner such that the instructions stored in the computer-readable memory produce a manufactured article including an instruction device, the instructions The device implements the functions specified in one or more flowcharts and / or one or more blocks of the block diagram.

These computer program instructions can also be loaded on a computer or other programmable data processing device, so that a series of steps can be performed on the computer or other programmable device to produce a computer-implemented process, which can be executed on the computer or other programmable device. The instructions provide steps for implementing the functions specified in one or more flowcharts and / or one or more blocks of the block diagrams.

So far, the present disclosure has been described in detail. To avoid obscuring the concept of the present disclosure, some details known in the art are not described. Those skilled in the art can fully understand how to implement the technical solutions disclosed herein based on the above description.

The methods and systems of the present disclosure may be implemented in many ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware or any combination of software, hardware, firmware. The above order of the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated. Further, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, which programs include machine-readable instructions for implementing the method according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing a method according to the present disclosure.

Although some specific embodiments of the present disclosure have been described in detail through examples, those skilled in the art should understand that the above examples are only for the purpose of illustration and are not intended to limit the scope of the present disclosure. Those skilled in the art should understand that the above embodiments can be modified without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims

A computer-implemented method for generating a video summary includes:

Receiving a video, and cutting the video into multiple shots according to changes in the video scene of the video, where each shot of the multiple shots is a continuous video scene;

Calculate the importance score of each shot;

Selecting a set of shots from the plurality of shots so that the total importance score of the selected set of shots is the largest when the constraint condition of the total length of the video summary is satisfied; and

The selected group of shots is stitched into a video summary and the video summary is output.
The method of claim 1, wherein the step of calculating the importance score of each shot includes:

Extracting a feature vector from each shot using a three-dimensional convolutional network to obtain a feature vector sequence of a shot set composed of the plurality of shots; and

The feature vector sequence is input to a pre-trained shot importance score calculation network to calculate the importance score of each shot.
The method according to claim 2, wherein before the video is divided into a plurality of shots, the method further comprises:

Reinforcement learning is used to train the lens importance score calculation network. The key elements included in the reinforcement learning method include: action and value reward functions. The value reward function includes: diversity indicators and representativeness. index.
The method according to claim 1, wherein before selecting the group of shots from the plurality of shots, the method further comprises:

At least one shot is identified among the plurality of shots that exhibits a key feature.
The method of claim 4, wherein the key feature comprises at least one of a product brand trademark and a product brand text.
The method according to claim 5, wherein the step of identifying, among the plurality of shots, at least one shot exhibiting key features comprises:

Use a deep learning-based object detection method to detect the trademark region in each frame of the video; and input the image of the trademark region into a pre-trained depth model to extract the embedded feature vector, and compare the embedded feature vector with the trademark in the database The feature vectors of the images are compared to obtain the brand type of the trademark, so as to identify at least one shot showing the brand trademark of the product; or,

Use deep learning-based optical character recognition to identify the text in each frame of the video; and perform word segmentation on the text, and match the processed text with the brand text in the database, and retain the text related to the product brand, thereby Identify at least one shot showing the brand text of the product.
The method according to claim 4, wherein the step of selecting a group of lenses from the plurality of lenses comprises:

Selecting at least one main lens from the at least one lens exhibiting key features, and selecting at least one auxiliary lens from the remaining lenses of the plurality of lenses except the at least one main lens, and dividing the at least one The main lens and the at least one auxiliary lens serve as a selected group of lenses.
The method according to claim 7, wherein:

The step of selecting at least one main shot from the at least one shot showing the key feature includes: if the shot selected from the at least one shot showing the key feature is the first Ng shot or the last Ng shot of the video, Determining that the foremost Ng lens or the last Ng lens is the at least one main lens, and Ng is a positive integer;

The step of selecting at least one auxiliary lens from the remaining lenses in the plurality of lenses, and using the at least one main lens and the at least one auxiliary lens as the selected group of lenses includes: At least one auxiliary lens is selected from the remaining lenses except the at least one main lens, and the at least one main lens and the at least one auxiliary lens are selected as a selected group of lenses, so that the selected group of lenses satisfies The total importance score is the largest under the constraint of the total length of the video summary;

The step of splicing the selected set of shots into a video summary includes: splicing the at least one main shot and the at least one auxiliary shot into a video summary in chronological order.
The method according to claim 5, wherein before identifying at least one shot showing key features among the plurality of shots, the method further comprises:

Calculate the similarity between each shot and the advertised product picture, and use the similarity to modify the importance score of the corresponding shot.
The method according to claim 9, wherein the step of calculating the similarity between each shot and the advertised product picture and using the similarity to correct the importance score of the corresponding shot comprises:

Calculate the feature vector of the advertised product picture;

Sampling a multi-frame image of each lens to obtain a sampling frame, and calculating a feature vector of the sampling frame of each lens;

Calculating the similarity between each shot and the product picture according to the feature vector of the product picture and the feature vector of the sampling frame of each shot; and

Correct the importance score of the corresponding lens according to the similarity and a preset similarity threshold.
A system for generating a video summary includes:

The video segmentation unit is configured to receive a video and divide the video into multiple shots according to a change in the video scene of the video, where each shot of the multiple shots is a segment of a continuous video scene ;

A calculation unit configured to calculate an importance score of each lens;

A selection unit configured to select a group of shots from the plurality of shots so that the total importance score of the selected set of shots is the largest when the constraint condition of the total length of the video summary is satisfied; and

The stitching unit is configured to stitch the selected group of shots into a video summary and output the video summary.
The system according to claim 11, wherein:

The calculation unit is configured to extract a feature vector from each shot using a three-dimensional convolutional network to obtain a feature vector sequence of a shot set composed of the plurality of shots, and input the feature vector sequence to a pre-trained shot The importance score calculation network calculates the importance score of each shot.
The system of claim 12, further comprising:

A training unit configured to train a lens importance score calculation network by using a reinforcement learning method, wherein the key elements included in the method of reinforcement learning include: an action and a value reward function, the value reward function includes: Diversity and representativeness indicators.
The system of claim 11, further comprising:

The identification unit is configured to identify at least one shot that exhibits a key feature among the plurality of shots.
The system of claim 14, wherein the key feature includes at least one of a product brand trademark and a product brand text.
The system of claim 15, wherein:

The recognition unit is configured to: use a deep learning-based object detection method to detect a trademark region in each frame of the video; and input the image of the trademark region into a pre-trained depth model to extract an embedded feature vector, and Compare the embedded feature vector with the feature vector of the trademark image in the database to obtain the brand type of the trademark, so as to identify at least one shot showing the brand trademark of the product; or use deep learning-based optical character recognition to identify each frame of the video Text in the image; word segmentation of the text, and matching the processed text with the brand text in the database, retaining the text related to the product brand, thereby identifying at least one shot showing the brand text of the product.
The system according to claim 14, wherein:

The selecting unit is configured to select at least one main shot from the at least one shot showing key features, and select at least one auxiliary shot from the remaining shots of the plurality of shots except the at least one main shot. A lens, using the at least one main lens and the at least one auxiliary lens as a selected group of lenses.
The system of claim 17, wherein:

The selecting unit is configured to determine the foremost Ng lens or the last Ng lens if the lens selected from the at least one lens showing the key feature is the foremost Ng lens or the last Ng lens of the video. Ng lenses are the at least one main lens and Ng is a positive integer; and at least one auxiliary lens is selected from the remaining lenses of the plurality of lenses except the at least one main lens, and the at least one main lens is The shot and the at least one auxiliary shot are used as a selected set of shots, so that the selected set of shots has the largest total importance score when the constraint condition of the total length of the video summary is satisfied;

The stitching unit is configured to stitch the at least one main shot and the at least one auxiliary shot into a video summary in chronological order.
The system of claim 15, further comprising:

The correction unit is configured to calculate a similarity between each lens and the advertised product picture, and correct the importance score of the corresponding lens by using the similarity.
The system of claim 19, wherein:

The correction unit is configured to: calculate a feature vector of the advertised product picture; sample multiple frames of each lens to obtain a sample frame, and calculate a feature vector of the sample frame of each lens; The feature vector and the feature vector of the sampling frame of each shot calculate the similarity between each shot and the product picture; and correct the importance score of the corresponding shot according to the similarity and a preset similarity threshold.
A system for generating a video summary includes:

Memory; and

A processor coupled to the memory, the processor being configured to perform the method of any one of claims 1 to 10 based on instructions stored in the memory.
A computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor, implement the steps of the method according to any one of claims 1 to 10.