CN110996183B - Video abstract generation method, device, terminal and storage medium - Google Patents

Info

Publication number
CN110996183B
Authority
CN
China
Prior art keywords: video, video frame, frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911103191.3A
Other languages
Chinese (zh)
Other versions
CN110996183A
Inventor
李马丁
郑云飞
刘建辉
宁小东
章佳杰
于冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Publication of CN110996183A (application publication)
Application granted
Publication of CN110996183B (granted publication)
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85: Assembly of content; Generation of multimedia applications
    • H04N 21/854: Content authoring
    • H04N 21/8549: Creating video summaries, e.g. movie trailer

Abstract

The disclosure provides a video summary generation method, apparatus, terminal, and storage medium, and relates to the field of computer technologies. The method includes extracting a plurality of first video frames from a target video, deduplicating the plurality of first video frames to obtain a plurality of second video frames, and generating a video summary of the target video from the plurality of second video frames. By deduplicating the extracted first video frames and generating the video summary from the second video frames that remain, the method avoids the repetitiveness that results when a summary is generated directly from the extracted first video frames without deduplication. The frames shown in the summary are therefore more diverse and rich, the video content is summarized more completely, the quality of the video summary is improved, and the user experience is optimized.

Description

Video abstract generation method, device, terminal and storage medium
The present disclosure claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on July 12, 2019, with application number 201910631335.6, entitled "Method, apparatus, terminal and storage medium for generating a video summary", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a terminal, and a storage medium for generating a video summary.
Background
With the continuous development of computer technology and the popularization of intelligent hardware devices, more and more professionals and amateurs are entering the field of video editing. Users edit new videos from existing picture or video material with the help of video editing software of every kind, professional or non-professional, complex or simple. In general, some users produce a video summary of the original video material or of the newly edited video, so as to condense the essence of the video and give the audience an overview of its content.
In the related art, in order to generate a video summary quickly, a plurality of video frames are extracted from a video at random or at equal inter-frame intervals, and the extracted plurality of video frames are simply combined to generate the video summary. However, the video summary generated in this way cannot completely summarize the video content, and the quality is difficult to guarantee.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, a terminal, and a storage medium for generating a video summary, which aim to improve the quality of the generated video summary and optimize the user experience.
According to a first aspect of the embodiments of the present disclosure, there is provided a method for generating a video summary, the method including:
extracting a plurality of first video frames from a target video;
deduplicating the plurality of first video frames to obtain a plurality of second video frames;
and generating a video abstract of the target video according to the plurality of second video frames.
Optionally, before the step of performing deduplication on the plurality of first video frames to obtain a plurality of second video frames, the method further includes:
determining a quality score for each of the first video frames;
and removing the first video frames with the quality scores smaller than a first preset threshold value from the plurality of first video frames.
Optionally, the step of performing deduplication on the plurality of first video frames to obtain a plurality of second video frames includes:
respectively calculating the similarity of two first video frames in each video frame combination; the video frame combination comprises two adjacent first video frames in the plurality of first video frames;
and when the similarity is larger than a second preset threshold value, removing the first video frames with low quality scores in the video frame combination according to a preset demand to obtain a plurality of second video frames.
Optionally, when the similarity is greater than a second preset threshold, the step of removing a first video frame with a low quality score in the video frame combination according to a preset demand to obtain a plurality of second video frames includes:
when the similarity is larger than a second preset threshold value, sequentially removing first video frames with low quality scores in the video frame combination according to the time sequence;
when the number of the remaining first video frames is equal to the preset demand, stopping the de-duplication of the plurality of first video frames to obtain a plurality of second video frames;
when all the first video frames are subjected to de-duplication and the number of the remaining first video frames is still larger than the preset demand, sorting the remaining first video frames from high to low according to the quality scores;
determining N first video frames which are sequenced at the top as a plurality of second video frames; and N is the preset demand.
Optionally, the step of generating a video summary of the target video according to the plurality of second video frames includes:
and adding a motion special effect to each second video frame and/or performing transition processing between two adjacent second video frames to generate a video abstract of the target video.
Optionally, when the second video frames include a face image, before the step of adding a motion special effect to each of the second video frames and/or performing transition processing between two adjacent second video frames to generate a video summary of the target video, the method further includes:
for a second video frame that includes a face image, determining a central point of the face image;
and determining the central point of the face image as the starting point or the end point of the motion special effect of the second video frame comprising the face image.
Optionally, after the step of generating the video summary of the target video according to the plurality of second video frames, the method further includes:
adding video introduction information to a first video frame in the video abstract;
and/or adding a color gradient special effect to the last video frame in the video abstract;
and/or adding a preset icon to the last video frame in the video abstract.
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for generating a video summary, the apparatus including:
the first video frame extraction module is used for extracting a plurality of first video frames from the target video;
the first video frame duplicate removal module is used for carrying out duplicate removal on the plurality of first video frames to obtain a plurality of second video frames;
and the video abstract generating module is used for generating the video abstract of the target video according to the plurality of second video frames.
Optionally, the apparatus further comprises:
a quality score determination module for determining a quality score for each of the first video frames;
the first video frame removing module is used for removing the first video frames with the quality scores smaller than a first preset threshold value in the plurality of first video frames.
Optionally, the first video frame deduplication module includes:
the similarity calculation sub-module is used for calculating the similarity of two first video frames in each video frame combination respectively; the video frame combination comprises two adjacent first video frames in the plurality of first video frames;
and the first video frame duplicate removal submodule is used for removing the first video frame with low quality score in the video frame combination according to the preset demand when the similarity is greater than a second preset threshold so as to obtain a plurality of second video frames.
Optionally, the first video frame deduplication sub-module includes:
the first video frame duplicate removal unit is used for sequentially removing first video frames with low quality scores in the video frame combination according to the time sequence when the similarity is greater than a second preset threshold value;
a deduplication stopping unit, configured to stop performing deduplication on the plurality of first video frames to obtain the plurality of second video frames when the number of remaining first video frames is equal to the preset demand;
the sorting unit is used for sorting the remaining first video frames from high to low according to the quality scores when all the first video frames have been deduplicated and the number of the remaining first video frames is still larger than the preset demand;
a second video frame determination unit, configured to determine N first video frames that are ranked at the top as the plurality of second video frames; and N is the preset demand.
Optionally, the video summary generation module includes:
and the video abstract generating submodule is used for adding a motion special effect to each second video frame and/or performing transition processing between two adjacent second video frames so as to generate the video abstract of the target video.
Optionally, when the second video frame includes a face image, the apparatus further includes:
the central point determining module is used for determining the central point of the face image for a second video frame that includes a face image;
and the motion special effect starting and end point determining module is used for determining the central point of the face image as the starting point or the end point of the motion special effect of the second video frame comprising the face image.
Optionally, the apparatus further comprises:
the video introduction information adding module is used for adding video introduction information to a first video frame in the video abstract;
and/or, a color gradient special effect adding module, configured to add a color gradient special effect to a last video frame in the video summary;
and/or the preset icon adding module is used for adding a preset icon to the last video frame in the video abstract.
According to a third aspect of the embodiments of the present disclosure, there is provided a terminal, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video summary generation method provided by the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having instructions stored therein which, when executed by a processor of a terminal, enable the terminal to perform the video summary generation method provided by the present disclosure.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the method comprises the steps of extracting a plurality of first video frames from a target video, carrying out duplication removal on the plurality of first video frames to obtain a plurality of second video frames, and generating a video abstract of the target video according to the plurality of second video frames. Through carrying out the duplicate removal on a plurality of extracted first video frames, the second video frames obtained after the duplicate removal are adopted to generate the video abstract, the problem that the first video frames are not subjected to the duplicate removal is avoided, the repeatability of the video abstract is caused by directly adopting the extracted first video frames to generate the video abstract, the diversity and the richness of each video frame displayed by the video abstract are ensured, the video content can be summarized more completely, the quality of the video abstract is improved, and the user experience is optimized.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of a method for generating a video summary according to an embodiment of the present application;
fig. 2 is a flowchart of another method for generating a video summary according to an embodiment of the present application;
fig. 3 is a schematic diagram of a video summary generation apparatus according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the related art, in order to generate a video summary for a user more quickly, video editing software typically extracts a plurality of video frames from a video at random or at equal inter-frame intervals, and simply combines the extracted plurality of video frames to generate the video summary. However, the video summary generated in this way cannot completely summarize the video content, and the quality is difficult to guarantee.
In view of this, the present application provides a method, an apparatus, a terminal and a storage medium for generating a video summary through the following embodiments, which aim to improve the quality of generating the video summary and optimize the user experience.
Fig. 1 is a flowchart of a method for generating a video summary according to an embodiment of the present application, and as shown in fig. 1, the method includes the following steps:
in step S11, a plurality of first video frames are extracted from the target video.
In the embodiment of the disclosure, a target video is obtained, and a plurality of first video frames are extracted from the target video according to a preset extraction mode.
It should be noted that there are various ways to extract the plurality of first video frames from the target video, and this application does not limit the extraction method. The first video frames may be extracted from the target video at equal inter-frame intervals; user settings may also be received to increase the extraction density during certain periods of the target video, for example, extracting first video frames at shorter inter-frame intervals during highlight periods of the target video.
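As a minimal sketch of the simplest option above, equal-interval extraction, assuming OpenCV is available; the frame count of 40 and the helper name are illustrative, not from this disclosure:

```python
import cv2

def extract_frames(video_path, num_frames=40):
    """Extract `num_frames` first video frames at equal inter-frame intervals."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # jump to the sampled frame index
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames[:num_frames]
```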
In step S12, the plurality of first video frames are deduplicated to obtain a plurality of second video frames.
In the embodiment of the present disclosure, the plurality of first video frames extracted from a target video generally contain groups of highly repetitive video frames. The plurality of first video frames therefore need to be deduplicated: any one (or a specific one) of each group of highly repetitive video frames is retained, and the other video frames in that group are removed, so as to obtain a plurality of second video frames.
For example, the extracted plurality of video frames include video frame 1, video frame 2, video frame 3, and up to video frame 10, where the repeatability of video frame 2 and video frame 3 is high, video frame 2 is retained and video frame 3 is removed, and the obtained plurality of second video frames sequentially include video frame 1, video frame 2, video frame 4, and up to video frame 10.
In step S13, a video summary of the target video is generated according to the plurality of second video frames.
In the embodiment of the present disclosure, after the multiple first video frames are deduplicated to obtain multiple second video frames, the multiple second video frames are processed to generate the video summary of the target video.
By deduplicating the plurality of first video frames and generating the video summary from the second video frames obtained after deduplication, the repetitiveness caused by generating the summary directly from the extracted first video frames without deduplication is avoided. Since the number of video frames a video summary can show is limited, removing the repetitive first video frames makes the content of the frames included in the summary richer and more diverse, so the video content can be summarized more completely.
The video summary is mainly used to show the highlight frames of the target video to users, and can also serve as a dynamic cover image of the target video.
In the embodiment of the disclosure, the extracted plurality of first video frames are deduplicated, and the second video frames obtained after deduplication are used to generate the video summary. This avoids the repetitiveness caused by generating the summary directly from the extracted first video frames without deduplication, ensures that the frames shown in the summary are diverse and rich, and summarizes the video content more completely, thereby improving the quality of the video summary and optimizing the user experience.
Fig. 2 is a flowchart of another method for generating a video summary according to an embodiment of the present application, and as shown in fig. 2, the method includes the following steps:
in step S21, a plurality of first video frames are extracted from the target video.
This step is similar in principle to the step S11 described above and will not be described herein again.
In step S22, a quality score for each of the first video frames is determined.
In the disclosed embodiments, each first video frame is analyzed to determine a quality score for each first video frame, which is used to assess the quality of the first video frame.
Specifically, the quality score of each first video frame may be determined from one or more kinds of dimension data. When the first video frame does not include a face image, the dimension data is only target dimension data, which includes one or more of image sharpness dimension data, color richness dimension data, and value degree dimension data. When the first video frame includes a face image, the dimension data includes both the target dimension data and face dimension data, which includes one or more of face sharpness dimension data, eye opening degree dimension data, mouth opening degree dimension data, composition dimension data, and face direction dimension data. When the quality score of a first video frame is determined from several kinds of dimension data, the dimension data are weighted and summed to obtain the quality score of the first video frame.
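To illustrate the final weighted summation, a minimal Python sketch; the dimension names, scores, and weights here are illustrative assumptions, not values from this disclosure (the individual dimension data are described below):

```python
def quality_score(dim_scores, weights):
    # Weighted summation of the per-dimension data.
    return sum(weights[k] * v for k, v in dim_scores.items())

# Hypothetical normalized dimension scores and weights for one frame.
frame_dims = {"sharpness": 0.8, "color_richness": 0.6, "value_degree": 0.7}
weights = {"sharpness": 0.4, "color_richness": 0.3, "value_degree": 0.3}
print(quality_score(frame_dims, weights))  # 0.71
```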
For the image sharpness dimension data: perform edge detection on the first video frame to obtain an edge detection result for each pixel; compute the variance of those edge detection results to obtain a first sharpness score of the first video frame; blur the first video frame to obtain a blurred image; compute the YUV difference between each pixel in the first video frame and the corresponding pixel in the blurred image; determine a second sharpness score of the first video frame from those YUV differences; and determine the image sharpness dimension data of the first video frame from the first sharpness score and/or the second sharpness score. The larger the first sharpness score, the more detail the first video frame contains and the sharper it is; the larger the second sharpness score, the greater the difference between the first video frame and its blurred image, and the sharper the first video frame.
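A sketch of the two sharpness scores, assuming OpenCV; the patent does not fix the edge detector or the blur kernel, so the Laplacian operator and Gaussian blur below are assumed choices:

```python
import cv2
import numpy as np

def sharpness_scores(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # First sharpness score: variance of an edge-detection result.
    edges = cv2.Laplacian(gray, cv2.CV_64F)
    first_score = float(edges.var())
    # Second sharpness score: mean YUV difference from a blurred copy.
    yuv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YUV).astype(np.float32)
    blurred = cv2.GaussianBlur(yuv, (9, 9), 0)
    second_score = float(np.mean(np.abs(yuv - blurred)))
    return first_score, second_score
```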
For the color richness dimension data: compute the respective variances and means of at least two of the YUV color space components of the first video frame, and determine the color richness dimension data of the first video frame from those variances and means. Specifically, take the square root of the sum of the variances of the at least two components to obtain first color data, take the square root of the sum of the squares of the means of the at least two components to obtain second color data, and weight and sum the first color data and the second color data to obtain the color richness dimension data of the first video frame.
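A sketch of the color richness computation using the U and V chroma components; which pair of YUV components to use, the centering around 128, and the weights are assumptions, not fixed by this disclosure:

```python
import cv2
import numpy as np

def color_richness(frame_bgr, w1=0.7, w2=0.3):
    yuv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YUV).astype(np.float32)
    u = yuv[:, :, 1] - 128.0  # center chroma around zero
    v = yuv[:, :, 2] - 128.0
    first = np.sqrt(u.var() + v.var())               # sqrt of summed variances
    second = np.sqrt(u.mean() ** 2 + v.mean() ** 2)  # sqrt of summed squared means
    return w1 * first + w2 * second                  # weighted summation
```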
For the value degree dimension data: compute the variance and mean of the intra-frame distortion metric values of the first video frame, generate a feature vector from that variance, that mean, and the color richness dimension data, and input the feature vector into a preset value degree prediction model to obtain the value degree dimension data of the first video frame. Specifically, the first video frame is divided into a plurality of region blocks. First, the YUV value of each pixel in each region block is predicted from the YUV values of the pixels surrounding that block; the difference between the predicted and actual YUV value of each pixel is taken as that pixel's intra-frame distortion value; the mean of the intra-frame distortion values of all pixels in a region block is taken as that block's intra-frame distortion metric value; finally, the variance and mean over the metric values of all region blocks are computed as the variance and mean of the intra-frame distortion metric values of the first video frame. The value degree prediction model is trained on sample feature vectors of sample images and the actual value results calibrated for those sample images by users.
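A sketch of the intra-frame distortion statistics; the patent does not specify the predictor, so predicting every pixel of a block from the mean of the pixels bordering that block is a simplifying assumption here, and the feature vector and value degree prediction model are omitted:

```python
import cv2
import numpy as np

def intra_distortion_stats(frame_bgr, block=16):
    yuv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YUV).astype(np.float32)
    h, w = yuv.shape[:2]
    metrics = []
    for y in range(block, h - block, block):
        for x in range(block, w - block, block):
            region = yuv[y:y + block, x:x + block]
            # Pixels bordering the block (top, bottom, left, right).
            ring = np.concatenate([
                yuv[y - 1, x - 1:x + block + 1],
                yuv[y + block, x - 1:x + block + 1],
                yuv[y:y + block, x - 1],
                yuv[y:y + block, x + block],
            ])
            pred = ring.mean(axis=0)  # predicted YUV for every pixel in the block
            metrics.append(np.abs(region - pred).mean())  # block's distortion metric
    metrics = np.asarray(metrics)
    return metrics.var(), metrics.mean()
```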
For the face sharpness dimension data: perform edge detection on the face region where the face image in the first video frame is located to obtain an edge detection result for each pixel in the face region; compute the variance of those edge detection results to obtain a third sharpness score of the face region; blur the face region to obtain a blurred region; compute the YUV difference between each pixel in the face region and the corresponding pixel in the blurred region; determine a fourth sharpness score of the face region from those YUV differences; and determine the face sharpness dimension data of the face region from the third sharpness score and/or the fourth sharpness score. The face region where the face image is located may be the rectangular region returned by face detection, a face contour region framed by the facial feature points, or the inner face region bounded by the two eyes and the chin (or the mouth).
For the eye opening degree dimension data: compute the ratio of the first distance, between the upper and lower eyelids of the left eye in the face region where the face image is located, to the second distance, between the inner and outer canthi of the left eye, to obtain the left-eye opening score; compute the ratio of the third distance, between the upper and lower eyelids of the right eye, to the fourth distance, between the inner and outer canthi of the right eye, to obtain the right-eye opening score. When the left-eye and right-eye opening scores are both less than or equal to a first threshold, the eye opening degree dimension data of the face region is set to a score less than zero. When at least one of the two scores is greater than the first threshold and the absolute value of their difference is less than or equal to a second threshold, the sum of the two scores is taken as the eye opening degree dimension data of the face region. When at least one of the two scores is greater than the first threshold and the absolute value of their difference is greater than the second threshold, twice the larger of the two scores is taken as the eye opening degree dimension data of the face region.
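A sketch of these rules, assuming eyelid and canthus landmark coordinates are available from a face landmark detector; the thresholds t1 and t2 are assumed values:

```python
import numpy as np

def eye_open_score(upper_lid, lower_lid, inner_canthus, outer_canthus):
    """Lid-to-lid distance over corner-to-corner distance for one eye."""
    lid_dist = np.linalg.norm(np.subtract(upper_lid, lower_lid))
    corner_dist = np.linalg.norm(np.subtract(inner_canthus, outer_canthus))
    return lid_dist / corner_dist

def eye_open_dimension(left, right, t1=0.15, t2=0.10):
    if left <= t1 and right <= t1:
        return -1.0                  # both eyes closed: a score below zero
    if abs(left - right) <= t2:
        return left + right          # similar openness: sum the two scores
    return 2.0 * max(left, right)    # one eye much wider: double the maximum
```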
For the mouth opening degree dimension data: determine the first and second included angles of the triangle formed by the left mouth corner, the right mouth corner, and the midpoint of the lower lip, where the first included angle is the angle at the left mouth corner and the second included angle is the angle at the right mouth corner; then determine the mouth opening degree dimension data of the face region from the first and second included angles. The larger the first and second included angles, the larger the corresponding mouth opening degree dimension data.
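A sketch computing the two corner angles of that triangle with the law of cosines; the landmark coordinates are assumed inputs:

```python
import numpy as np

def mouth_corner_angles(left_corner, right_corner, lower_lip_mid):
    """Interior angles (degrees) at the two mouth corners of the triangle
    they form with the lower-lip midpoint."""
    a = np.asarray(left_corner, float)
    b = np.asarray(right_corner, float)
    c = np.asarray(lower_lip_mid, float)

    def angle(at, p1, p2):
        v1, v2 = p1 - at, p2 - at
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    return angle(a, b, c), angle(b, a, c)  # first and second included angles
```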
For the composition dimension data: determine the number of faces included in the face region. When the face region includes one face, determine the composition dimension data of the face region from the distance between the center point of the face and the composition center of gravity; when the face region includes a plurality of faces, determine the composition dimension data from the distance between the center of gravity of the polygon formed by the center points of the faces and the composition center of gravity. When the first video frame is a portrait-orientation image, the composition center of gravity may be a position above the center of the image; when the first video frame is a landscape-orientation image, it may be a position in the upper left or upper right of the image. The smaller the distance, the larger the composition dimension data.
For the face direction dimension data: determine the face direction in the face region, and determine the face direction dimension data of the face region from the deviation angle of the face direction from a reference direction. Specifically, a face direction lookup table may be preset, containing a plurality of angle ranges, each corresponding to one value of face direction dimension data; after the deviation angle of the face direction from the reference direction is obtained, the corresponding face direction dimension data is looked up from the table according to the range in which the deviation angle falls. The face direction may be head up, head down, head turned left, head turned right, head tilted left, head tilted right, and so on.
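A sketch of the lookup-table step; the angle ranges and the dimension data they map to are illustrative assumptions:

```python
def face_direction_score(deviation_deg, table=((10, 1.0), (25, 0.7), (45, 0.4))):
    """Look up face direction dimension data from the deviation angle."""
    for max_angle, score in table:
        if abs(deviation_deg) <= max_angle:
            return score
    return 0.0  # facing far away from the reference direction
```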
In step S23, the first video frames with the quality scores smaller than the first preset threshold are removed from the plurality of first video frames.
In the embodiment of the disclosure, after the quality score of each first video frame is determined, the quality score of each first video frame is compared with a first preset threshold, the first video frames with the quality scores smaller than the first preset threshold are removed from the plurality of first video frames, and the first video frames with the quality scores larger than or equal to the first preset threshold are reserved.
For example, if the first preset threshold is 0.6 and the quality score of each first video frame is a normalized value between 0 and 1, the first video frames with quality scores less than 0.6 are removed from the plurality of first video frames.
The quality of the generated video summary can thus be controlled through the value of the first preset threshold; for example, raising the first preset threshold keeps only higher-scoring frames and further improves the quality of the generated video summary.
Determining the quality score of each first video frame extracted from the target video and removing the first video frames whose quality scores are smaller than the first preset threshold ensures the quality of the video frames subsequently used to generate the video summary, and prevents low-quality first video frames from being selected for the summary.
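The filtering step itself is a one-line comparison; a minimal sketch, assuming the frames and their quality scores come from the previous steps:

```python
def filter_by_quality(frames, scores, threshold=0.6):
    # Keep only frames whose score is at least the first preset threshold.
    return [f for f, s in zip(frames, scores) if s >= threshold]
```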
In step S24, the similarity of the two first video frames in each video frame combination is calculated; a video frame combination includes two adjacent first video frames among the plurality of first video frames.
In the embodiment of the present disclosure, the plurality of first video frames are divided into a plurality of video frame combinations, each video frame combination includes two adjacent first video frames in the plurality of first video frames, and the similarity of the two first video frames in each video frame combination is calculated respectively.
When calculating the similarity of the two first video frames in each video frame combination, various specific calculation methods may be used, and this application does not limit the choice. For example, the SAD value over sampled pixel values of the two first video frames in a combination may be determined first; the similarity of the histograms of the two first video frames may then be determined; and the similarity of the two first video frames in the combination is calculated from their SAD value and/or histogram similarity.
SAD (sum of absolute differences) is an image matching metric, and specifically refers to the sum of the absolute values of the differences between corresponding pixels of the two first video frames. In general, the smaller the SAD value, the smaller the difference between the two first video frames and the greater their similarity; the larger the SAD value, the greater the difference.
If the similarity of the two first video frames in a video frame combination is calculated jointly from the SAD value and the histogram similarity, the SAD value and the histogram similarity are weighted and summed to obtain the similarity of the two first video frames in the combination.
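A sketch of the combined similarity, assuming OpenCV; the SAD is normalized and negated so that higher values mean more similar, and the sampling stride, histogram bins, and weights are assumed values:

```python
import cv2
import numpy as np

def frame_similarity(f1, f2, w_sad=0.5, w_hist=0.5, stride=8):
    # SAD over sampled pixel values, normalized to [0, 1].
    s1 = f1[::stride, ::stride].astype(np.int32)
    s2 = f2[::stride, ::stride].astype(np.int32)
    sad = np.abs(s1 - s2).mean() / 255.0
    sad_sim = 1.0 - sad  # smaller SAD -> more similar
    # Histogram similarity over the three color channels.
    h1 = cv2.calcHist([f1], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
    h2 = cv2.calcHist([f2], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
    hist_sim = cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL)
    return w_sad * sad_sim + w_hist * hist_sim  # weighted summation
```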
In step S25, when the similarity is greater than a second preset threshold, removing the first video frames with low quality scores in the video frame combination according to a preset amount of demand, so as to obtain a plurality of second video frames.
In the embodiment of the disclosure, when the similarity of the two first video frames in a video frame combination is greater than the second preset threshold, it is determined that the two first video frames may show the same scene and are therefore repetitive. The first video frame with the lower quality score in the combination is then removed, subject to the preset demand, so that repetitive, lower-quality first video frames are removed and a plurality of second video frames are obtained.
The preset demand can be determined according to the target duration of the video summary and the target display duration of each second video frame, and specifically, the preset demand is a ratio of the target duration of the video summary to the target display duration of each second video frame.
For example, if the target duration of the video summary is 30 seconds and the target presentation duration of each second video frame is 2 seconds, the preset demand amount may be determined to be 15. It should be understood that, in the present application, the target display duration of each second video frame may also be different, and the determination manner of the preset demand is not limited to the above example.
Specifically, step S25 includes steps a1 to a4:
in step a1, when the similarity is greater than a second preset threshold, sequentially removing, in chronological order, first video frames with low quality scores in the video frame combination;
in step a2, when the number of remaining first video frames is equal to the preset demand, stopping the de-duplication of the plurality of first video frames to obtain the plurality of second video frames;
in step a3, when the plurality of first video frames have all been deduplicated and the number of the remaining first video frames is still greater than the preset demand, sorting the remaining first video frames from high to low according to the quality score;
in step a4, determining the top N first video frames as the plurality of second video frames; and N is the preset demand.
In the embodiment of the disclosure, when the similarity of the two first video frames in a video frame combination is greater than the second preset threshold, the first video frame with the lower quality score in that combination is removed, proceeding in the time order of the first video frames; each time one first video frame is removed, the count of remaining first video frames is updated. When the number of remaining first video frames equals the preset demand, deduplication of the plurality of first video frames stops, meaning no more low-scoring first video frames need to be removed from the subsequent video frame combinations, and the plurality of second video frames is obtained. When all the first video frames have been deduplicated and the number of remaining first video frames is still greater than the preset demand, the remaining first video frames are sorted from high to low by quality score, and the top preset-demand number of first video frames is determined as the plurality of second video frames.
For example, suppose 20 first video frames remain after step S23. The 20 first video frames are divided into 10 video frame combinations, each including two adjacent first video frames: the 1st combination includes the 1st and 2nd first video frames, the 2nd combination includes the 3rd and 4th, and so on, up to the 10th combination, which includes the 19th and 20th. The similarity of the two first video frames in each combination is calculated in turn. If the similarity of the 1st and 2nd first video frames in the 1st combination is greater than the second preset threshold and the quality score of the 2nd first video frame is lower than that of the 1st, the 2nd first video frame is removed and the number of remaining first video frames is updated to 19. Since 19 is greater than the preset demand, the deduplication check moves on to the 2nd combination; if the similarity of the 3rd and 4th first video frames is less than or equal to the second preset threshold, both are retained. Each combination is deduplicated in turn in this manner. Suppose that before the 6th combination is checked, 2 first video frames have already been removed, and that checking the 6th combination removes 1 more; the number of remaining first video frames is then 17.
Assuming the preset demand is 16: after the 6th combination there are 17 remaining first video frames, more than the preset demand, so the deduplication comparison continues with the 7th combination. If 1 first video frame in the 7th combination is removed, the number of remaining first video frames becomes 16, equal to the preset demand; at that point the deduplication comparisons for the remaining 8th to 10th combinations can be stopped, yielding 16 second video frames.
Assuming instead the preset demand is 12: if 15 first video frames remain after all 10 combinations have been deduplicated, which is greater than the preset demand, the remaining 15 first video frames are sorted from high to low by quality score, and the top 12 are taken as the second video frames.
It should be understood that, in the above example, the way of combining two by two the 20 first video frames is not limited to combining into 10 video frame combinations, and the above combination way should not be understood as a limitation to the present application.
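Putting steps a1 to a4 together, a minimal sketch; `similarity` is any pairwise function such as the one sketched earlier, and the pairing of frames into disjoint adjacent combinations follows the example above:

```python
def deduplicate(frames, scores, similarity, threshold, demand):
    kept = list(range(len(frames)))
    # a1: walk disjoint adjacent pairs (the video frame combinations) in time order.
    for i in range(0, len(frames) - 1, 2):
        j = i + 1
        if similarity(frames[i], frames[j]) > threshold:
            kept.remove(i if scores[i] < scores[j] else j)
            if len(kept) == demand:          # a2: stop deduplicating early
                return [frames[k] for k in kept]
    if len(kept) > demand:                   # a3: still too many frames left
        best = sorted(kept, key=lambda k: scores[k], reverse=True)[:demand]
        kept = sorted(best)                  # a4: top-N by score, back in time order
    return [frames[k] for k in kept]
```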
In step S26, a motion effect is added to each of the second video frames and/or transition processing is performed between two adjacent second video frames to generate a video summary of the target video.
In the embodiment of the disclosure, after the plurality of second video frames are obtained, a motion special effect is added to each second video frame and/or transition processing is performed between two adjacent second video frames. The motion special effects specifically include effects such as zooming out, zooming in, and panning; for example, the zoom-in effect means that the second video frame is gradually enlarged from a starting point, and the zoom-out effect means that the second video frame is gradually reduced from the enlarged display toward an end point. Transition processing specifically refers to adding fade-in/fade-out or fly-in/fly-out transition effects between two adjacent second video frames; for example, the fade-in effect means that a second video frame changes from dark to bright when displayed and finally appears clearly, and the fade-out effect means that it changes from bright to dark when displayed and finally disappears entirely.
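A sketch of the zoom-in motion effect as a crop-and-resize sequence, assuming OpenCV; the step count and maximum zoom factor are assumed values:

```python
import cv2

def zoom_in_clip(frame, start_point, steps=50, max_zoom=1.3):
    """Gradually enlarge the frame toward `start_point`."""
    h, w = frame.shape[:2]
    cx, cy = start_point
    clip = []
    for t in range(steps):
        z = 1.0 + (max_zoom - 1.0) * t / (steps - 1)
        cw, ch = int(w / z), int(h / z)
        # Clamp the crop window so it stays inside the frame.
        x0 = min(max(int(cx - cw / 2), 0), w - cw)
        y0 = min(max(int(cy - ch / 2), 0), h - ch)
        crop = frame[y0:y0 + ch, x0:x0 + cw]
        clip.append(cv2.resize(crop, (w, h)))
    return clip
```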
In an optional embodiment of the present disclosure, when the second video frames include a face image, before step S26 the method further includes: for a second video frame that includes a face image, determining a central point of the face image; and determining the central point of the face image as the starting point or the end point of the motion special effect of that second video frame.
If a second video frame includes a plurality of faces, the center point of each face is determined, the center points are connected to form a polygon, and the center of gravity of the polygon is determined as the central point of the face image, so that the plurality of faces are considered together as a joint display core. The central point of the face image is then determined as the starting point or end point of the motion special effect of that second video frame. Finally, the motion special effect is added to the second video frames that include face images based on that starting point or end point, and/or transition processing is performed between two adjacent second video frames, to generate the video summary of the target video.
For example, the center point of a human face or the center of gravity of a polygon can be used as a starting point of a special motion effect, and the picture can be displayed by gradually enlarging the starting point; or, the center point of the face or the center of gravity of the polygon can be used as the end point of the special motion effect, and the picture is displayed by gradually reducing the end point.
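A sketch of choosing the motion effect anchor; for several faces, the polygon's center of gravity is approximated here by the mean of the face center points, a simplification this disclosure does not mandate:

```python
import numpy as np

def motion_anchor(face_centers):
    """Single face: its center point; several faces: centroid of their centers."""
    pts = np.asarray(face_centers, dtype=float)
    return tuple(pts.mean(axis=0))
```

The returned point could then serve as the `start_point` of the zoom-in sketch above.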
In an optional embodiment of the present disclosure, after step S26, the method further includes:
adding video introduction information to a first video frame in the video abstract;
and/or adding a color gradient special effect to the last video frame in the video abstract;
and/or adding a preset icon to the last video frame in the video abstract.
After the video summary is generated, video introduction information, such as a title or a short text introduction, can be added to the first video frame in the summary. A color gradient effect, such as a gradual fade to full black, can be added to the last video frame, so that the picture gradually turns fully black after the last frame is displayed. A preset icon, for example a logo identifying the producer of the target video, can also be added to the last video frame.
The first video frame in the video summary refers to the first frame the user sees when the summary is played; the last video frame in the video summary refers to the last frame the user sees when the summary is played.
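A sketch of the gradual fade-to-black on the last frame; the step count is an assumed value:

```python
import numpy as np

def fade_to_black(last_frame, steps=30):
    """Sequence that darkens the last frame linearly until fully black."""
    return [(last_frame.astype(np.float32) * (1.0 - t / (steps - 1))).astype(np.uint8)
            for t in range(steps)]
```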
In the embodiment of the disclosure, the first video frames whose quality scores are smaller than the first preset threshold are removed from the plurality of first video frames, the remaining first video frames are deduplicated, and the second video frames obtained after deduplication are used to generate the video summary. This avoids the drop in summary quality that results from selecting low-quality first video frames, and at the same time avoids the repetitiveness caused by generating the summary directly from the extracted first video frames without deduplication. The frames shown in the summary are thus more diverse and rich and summarize the video content more completely, which improves the quality of the video summary and optimizes the user experience.
Based on the same inventive concept, the embodiment of the application provides a video abstract generating device.
Fig. 3 is a schematic diagram of an apparatus for generating a video summary according to an embodiment of the present application, and as shown in fig. 3, the apparatus 30 includes:
a first video frame extraction module 31, configured to extract a plurality of first video frames from a target video;
a first video frame deduplication module 32, configured to perform deduplication on the multiple first video frames to obtain multiple second video frames;
a video summary generating module 33, configured to generate a video summary of the target video according to the plurality of second video frames.
Optionally, the apparatus further comprises:
a quality score determination module for determining a quality score for each of the first video frames;
the first video frame removing module is used for removing the first video frames with the quality scores smaller than a first preset threshold value in the plurality of first video frames.
Optionally, the first video frame deduplication module includes:
the similarity calculation sub-module is used for calculating the similarity of two first video frames in each video frame combination respectively; the video frame combination comprises two adjacent first video frames in the plurality of first video frames;
and the first video frame duplicate removal submodule is used for removing the first video frame with low quality score in the video frame combination according to the preset demand when the similarity is greater than a second preset threshold so as to obtain a plurality of second video frames.
Optionally, the first video frame deduplication sub-module includes:
the first video frame duplicate removal unit is used for sequentially removing first video frames with low quality scores in the video frame combination according to the time sequence when the similarity is greater than a second preset threshold value;
a deduplication stopping unit, configured to stop performing deduplication on the plurality of first video frames to obtain the plurality of second video frames when the number of remaining first video frames is equal to the preset demand;
the sorting unit is used for sorting the remaining first video frames from high to low according to the quality scores when all the first video frames have been deduplicated and the number of the remaining first video frames is still larger than the preset demand;
a second video frame determination unit, configured to determine N first video frames that are ranked at the top as the plurality of second video frames; and N is the preset demand.
Optionally, the video summary generation module includes:
and the video abstract generating submodule is used for adding a motion special effect to each second video frame and/or performing transition processing between two adjacent second video frames so as to generate the video abstract of the target video.
Optionally, when the second video frame includes a face image, the apparatus further includes:
the central point determining module is used for determining the central point of the face image for a second video frame that includes a face image;
and the motion special effect starting and end point determining module is used for determining the central point of the face image as the starting point or the end point of the motion special effect of the second video frame comprising the face image.
Optionally, the apparatus further comprises:
the video introduction information adding module is used for adding video introduction information to a first video frame in the video abstract;
and/or, a color gradient special effect adding module, configured to add a color gradient special effect to a last video frame in the video summary;
and/or the preset icon adding module is used for adding a preset icon to the last video frame in the video abstract.
In the embodiment of the disclosure, the extracted plurality of first video frames are deduplicated, and the second video frames obtained after deduplication are used to generate the video summary. This avoids the repetitiveness caused by generating the summary directly from the extracted first video frames without deduplication, ensures that the frames shown in the summary are diverse and rich, and summarizes the video content more completely, thereby improving the quality of the video summary and optimizing the user experience.
Based on the same inventive concept, another embodiment of the present application provides a terminal, including: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the video summary generation method according to any of the above embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor of a terminal, they enable the terminal to perform the video summary generation method according to any of the above embodiments of the present application.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method, apparatus, terminal, and storage medium for generating a video summary provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for those of ordinary skill in the art, there may be variations in the specific implementation and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (16)

1. A method for generating a video summary, the method comprising:
extracting a plurality of first video frames from a target video;
performing de-duplication on the plurality of first video frames to obtain a plurality of second video frames, including: according to the time order of the plurality of first video frames, sequentially removing, from each pair of adjacent first video frames having repetitiveness, the first video frame with the lower quality score, one first video frame being removed at a time and the number of remaining first video frames being updated after each removal; and stopping the de-duplication of the plurality of first video frames when the number of remaining first video frames is equal to a preset required number, so as to obtain the plurality of second video frames; wherein the repetitiveness means that the two adjacent first video frames are likely to show the same scene;
and generating a video summary of the target video according to the plurality of second video frames.
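Outside the claim language, the de-duplication loop of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the quality score (Laplacian-variance sharpness) and the repetitiveness test (grayscale-histogram correlation with an assumed 0.9 threshold) are stand-ins for whatever metrics an embodiment actually uses.

```python
import cv2

def quality_score(frame):
    # Stand-in quality metric: variance of the Laplacian (sharpness).
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def is_repetitive(a, b, threshold=0.9):
    # Stand-in repetitiveness test: correlation of grayscale histograms.
    ha = cv2.calcHist([cv2.cvtColor(a, cv2.COLOR_BGR2GRAY)], [0], None, [64], [0, 256])
    hb = cv2.calcHist([cv2.cvtColor(b, cv2.COLOR_BGR2GRAY)], [0], None, [64], [0, 256])
    return cv2.compareHist(ha, hb, cv2.HISTCMP_CORREL) > threshold

def deduplicate(frames, required):
    # Walk adjacent pairs in time order; whenever a pair is repetitive,
    # drop the lower-scoring frame, update the remaining count, and stop
    # as soon as the count reaches the preset required number.
    frames = list(frames)
    i = 0
    while len(frames) > required and i < len(frames) - 1:
        if is_repetitive(frames[i], frames[i + 1]):
            worse = i if quality_score(frames[i]) < quality_score(frames[i + 1]) else i + 1
            del frames[worse]   # one frame removed; count implicitly updated
        else:
            i += 1
    return frames               # the "second video frames"
```

Note that after a removal the index is not advanced, so the newly adjacent pair is re-checked; this matches the sequential, one-at-a-time removal the claim describes.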
2. The method of claim 1, further comprising, before the step of performing de-duplication on the plurality of first video frames to obtain the plurality of second video frames:
determining a quality score for each of the first video frames;
and removing, from the plurality of first video frames, first video frames whose quality scores are smaller than a first preset threshold.
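A sketch of the pre-filtering step of claim 2, reusing the hypothetical quality_score above; the value 50.0 is an arbitrary illustration of the "first preset threshold":

```python
def prefilter(frames, first_threshold=50.0):
    # Discard frames scoring below the first preset threshold before
    # any de-duplication is attempted.
    return [f for f in frames if quality_score(f) >= first_threshold]
```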
3. The method of claim 2, wherein the step of performing de-duplication on the plurality of first video frames to obtain the plurality of second video frames comprises:
calculating a similarity of the two first video frames in each video frame combination, respectively; wherein each video frame combination comprises two adjacent first video frames among the plurality of first video frames;
and when the similarity is greater than a second preset threshold, removing the first video frame with the lower quality score in the video frame combination according to the preset required number, so as to obtain the plurality of second video frames.
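One plausible reading of the pairwise similarity in claim 3, sketched as a correlation of HSV color histograms; both the metric and the 0.85 value standing in for the "second preset threshold" are assumptions for illustration, not the patent's definition:

```python
import cv2

def similarity(a, b):
    # Correlate hue-saturation histograms of two frames in HSV space.
    ha = cv2.calcHist([cv2.cvtColor(a, cv2.COLOR_BGR2HSV)], [0, 1], None,
                      [50, 60], [0, 180, 0, 256])
    hb = cv2.calcHist([cv2.cvtColor(b, cv2.COLOR_BGR2HSV)], [0, 1], None,
                      [50, 60], [0, 180, 0, 256])
    cv2.normalize(ha, ha)
    cv2.normalize(hb, hb)
    return cv2.compareHist(ha, hb, cv2.HISTCMP_CORREL)

def repetitive_pairs(frames, second_threshold=0.85):
    # Indices i such that the "video frame combination"
    # (frames[i], frames[i + 1]) exceeds the second preset threshold.
    return [i for i in range(len(frames) - 1)
            if similarity(frames[i], frames[i + 1]) > second_threshold]
```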
4. The method according to claim 3, wherein the step of removing, when the similarity is greater than the second preset threshold, the first video frame with the lower quality score in the video frame combination according to the preset required number to obtain the plurality of second video frames comprises:
when the similarity is greater than the second preset threshold, sequentially removing, in time order, the first video frames with the lower quality scores in the video frame combinations;
stopping the de-duplication of the plurality of first video frames when the number of remaining first video frames is equal to the preset required number, so as to obtain the plurality of second video frames;
when all the first video frames have been de-duplicated and the number of remaining first video frames is still greater than the preset required number, sorting the remaining first video frames by quality score from high to low;
and determining the top N first video frames in the sorted order as the plurality of second video frames, where N is the preset required number.
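The fallback branch of claim 4 might look like the sketch below. Restoring the survivors to their original time order after the top-N cut is my assumption; the claim only states that the N highest-scoring frames are kept:

```python
def top_n_by_quality(frames, n):
    # Rank remaining frames by quality score, high to low, keep the top N,
    # then re-sort by original index so playback order is preserved
    # (the re-sort is an assumption, not stated in claim 4).
    scored = sorted(enumerate(frames),
                    key=lambda p: quality_score(p[1]), reverse=True)
    kept = sorted(scored[:n], key=lambda p: p[0])
    return [frame for _, frame in kept]
```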
5. The method according to any one of claims 1 to 4, wherein the step of generating the video summary of the target video according to the plurality of second video frames comprises:
adding a motion special effect to each second video frame and/or performing transition processing between two adjacent second video frames, so as to generate the video summary of the target video.
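For illustration, a still frame can be animated into a short clip with a slow center zoom (a Ken Burns-style motion effect). The patent does not specify which motion effect is used; the zoom factor and clip length below are arbitrary:

```python
import cv2

def zoom_clip(frame, n_frames=30, max_zoom=1.15):
    # Emit n_frames progressively tighter center crops of one frame,
    # each resized back to the original resolution.
    h, w = frame.shape[:2]
    clip = []
    for t in range(n_frames):
        z = 1.0 + (max_zoom - 1.0) * t / (n_frames - 1)
        ch, cw = int(h / z), int(w / z)
        y0, x0 = (h - ch) // 2, (w - cw) // 2
        clip.append(cv2.resize(frame[y0:y0 + ch, x0:x0 + cw], (w, h)))
    return clip
```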
6. The method according to claim 5, wherein, when the second video frames include a face image, before the step of adding a motion special effect to each second video frame and/or performing transition processing between two adjacent second video frames to generate the video summary of the target video, the method further comprises:
determining, for a second video frame including a face image, a center point of the face image;
and determining the center point of the face image as a start point or an end point of the motion special effect for the second video frame including the face image.
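A sketch of locating the face center that anchors the motion effect of claim 6. The Haar-cascade detector is a stand-in, since the patent does not name a face-detection method; the returned center could replace the image-center crop used by zoom_clip above:

```python
import cv2

_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_center(frame):
    # Return the center point of the largest detected face, or None.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest bounding box
    return (x + w // 2, y + h // 2)
```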
7. The method according to any one of claims 1 to 4, further comprising, after the step of generating the video summary of the target video according to the plurality of second video frames:
adding video introduction information to the first video frame in the video summary;
and/or adding a color gradient special effect to the last video frame in the video summary;
and/or adding a preset icon to the last video frame in the video summary.
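The finishing touches of claim 7 might be implemented as follows; the caption placement, fade-to-black gradient, and corner icon position are all illustrative assumptions:

```python
import cv2
import numpy as np

def add_intro_text(frame, text):
    # Overlay introduction text on the opening frame of the summary.
    out = frame.copy()
    cv2.putText(out, text, (30, 60), cv2.FONT_HERSHEY_SIMPLEX,
                1.2, (255, 255, 255), 2, cv2.LINE_AA)
    return out

def fade_to_black(frame, n_frames=15):
    # A simple color-gradient effect: blend the last frame toward black.
    black = np.zeros_like(frame)
    return [cv2.addWeighted(frame, 1.0 - t / (n_frames - 1),
                            black, t / (n_frames - 1), 0)
            for t in range(n_frames)]

def add_icon(frame, icon, margin=10):
    # Paste a preset icon into the bottom-right corner of the last frame.
    out = frame.copy()
    ih, iw = icon.shape[:2]
    h, w = out.shape[:2]
    out[h - ih - margin:h - margin, w - iw - margin:w - margin] = icon
    return out
```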
8. An apparatus for generating a video summary, the apparatus comprising:
a first video frame extraction module configured to extract a plurality of first video frames from a target video;
a first video frame de-duplication module configured to perform de-duplication on the plurality of first video frames to obtain a plurality of second video frames, including: according to the time order of the plurality of first video frames, sequentially removing, from each pair of adjacent first video frames having repetitiveness, the first video frame with the lower quality score, one first video frame being removed at a time and the number of remaining first video frames being updated after each removal; and stopping the de-duplication of the plurality of first video frames when the number of remaining first video frames is equal to a preset required number, so as to obtain the plurality of second video frames; wherein the repetitiveness means that the two adjacent first video frames are likely to show the same scene;
and a video summary generation module configured to generate a video summary of the target video according to the plurality of second video frames.
9. The apparatus of claim 8, further comprising:
a quality score determination module configured to determine a quality score for each of the first video frames;
and a first video frame removal module configured to remove, from the plurality of first video frames, first video frames whose quality scores are smaller than a first preset threshold.
10. The apparatus of claim 9, wherein the first video frame de-duplication module comprises:
a similarity calculation submodule configured to calculate a similarity of the two first video frames in each video frame combination, respectively; wherein each video frame combination comprises two adjacent first video frames among the plurality of first video frames;
and a first video frame de-duplication submodule configured to remove, when the similarity is greater than a second preset threshold, the first video frame with the lower quality score in the video frame combination according to the preset required number, so as to obtain the plurality of second video frames.
11. The apparatus of claim 10, wherein the first video frame de-duplication submodule comprises:
a first video frame de-duplication unit configured to sequentially remove, in time order, the first video frames with the lower quality scores in the video frame combinations when the similarity is greater than the second preset threshold;
a de-duplication stopping unit configured to stop the de-duplication of the plurality of first video frames when the number of remaining first video frames is equal to the preset required number, so as to obtain the plurality of second video frames;
a sorting unit configured to sort the remaining first video frames by quality score from high to low when all the first video frames have been de-duplicated and the number of remaining first video frames is still greater than the preset required number;
and a second video frame determination unit configured to determine the top N first video frames in the sorted order as the plurality of second video frames, where N is the preset required number.
12. The apparatus according to any one of claims 8 to 11, wherein the video summary generation module comprises:
a video summary generation submodule configured to add a motion special effect to each second video frame and/or perform transition processing between two adjacent second video frames, so as to generate the video summary of the target video.
13. The apparatus of claim 12, wherein, when the second video frames include a face image, the apparatus further comprises:
a center point determination module configured to determine, for a second video frame including a face image, a center point of the face image;
and a motion special effect start/end point determination module configured to determine the center point of the face image as a start point or an end point of the motion special effect for the second video frame including the face image.
14. The apparatus of any one of claims 8 to 11, further comprising:
a video introduction information adding module configured to add video introduction information to the first video frame in the video summary;
and/or a color gradient special effect adding module configured to add a color gradient special effect to the last video frame in the video summary;
and/or a preset icon adding module configured to add a preset icon to the last video frame in the video summary.
15. A terminal, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the method for generating a video summary according to any one of claims 1 to 7.
16. A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a terminal, cause the terminal to perform the method for generating a video summary according to any one of claims 1 to 7.
CN201911103191.3A 2019-07-12 2019-11-12 Video abstract generation method, device, terminal and storage medium Active CN110996183B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019106313356 2019-07-12
CN201910631335 2019-07-12

Publications (2)

Publication Number Publication Date
CN110996183A (en) 2020-04-10
CN110996183B (en) 2022-01-21

Family

ID=70084110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911103191.3A Active CN110996183B (en) 2019-07-12 2019-11-12 Video abstract generation method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN110996183B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112911239B (en) * 2021-01-28 2022-11-11 北京市商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium
CN113099128B (en) * 2021-04-08 2022-09-13 杭州竖品文化创意有限公司 Video processing method and video processing system
CN113438538B (en) * 2021-06-28 2023-02-10 康键信息技术(深圳)有限公司 Short video preview method, device, equipment and storage medium
CN114268750A (en) * 2021-12-14 2022-04-01 咪咕音乐有限公司 Video processing method, device, equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN106034264A (en) * 2015-03-11 2016-10-19 中国科学院西安光学精密机械研究所 Coordination-model-based method for obtaining video abstract
CN107679497A (en) * 2017-10-11 2018-02-09 齐鲁工业大学 Video face textures effect processing method and generation system
CN108882057A (en) * 2017-05-09 2018-11-23 北京小度互娱科技有限公司 Video abstraction generating method and device
CN109064387A (en) * 2018-07-27 2018-12-21 北京微播视界科技有限公司 Image special effect generation method, device and electronic equipment
CN109151501A (en) * 2018-10-09 2019-01-04 北京周同科技有限公司 A kind of video key frame extracting method, device, terminal device and storage medium

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US6751776B1 (en) * 1999-08-06 2004-06-15 Nec Corporation Method and apparatus for personalized multimedia summarization based upon user specified theme
AU2007345938B2 (en) * 2007-02-01 2011-11-10 Briefcam, Ltd. Method and system for video indexing and video synopsis
CN103810711A (en) * 2014-03-03 2014-05-21 郑州日兴电子科技有限公司 Keyframe extracting method and system for monitoring system videos
US20160014482A1 (en) * 2014-07-14 2016-01-14 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Generating Video Summary Sequences From One or More Video Segments
CN105045900A (en) * 2015-08-05 2015-11-11 石河子大学 Data extraction method and apparatus
CN105430480B (en) * 2015-11-13 2019-06-11 Tcl集团股份有限公司 A kind of method and system of video data storage
CN107223344A (en) * 2017-01-24 2017-09-29 深圳大学 The generation method and device of a kind of static video frequency abstract
CN109587581A (en) * 2017-09-29 2019-04-05 阿里巴巴集团控股有限公司 Video breviary generation method and video breviary generating means
CN108966004B (en) * 2018-06-27 2022-06-17 维沃移动通信有限公司 Video processing method and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant