CN114201644A - Method, device and equipment for generating abstract video and storage medium - Google Patents

Method, device and equipment for generating abstract video and storage medium

Info

Publication number
CN114201644A
CN114201644A
Authority
CN
China
Prior art keywords
video
images
image
frame image
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111535826.4A
Other languages
Chinese (zh)
Inventor
刘钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202111535826.4A priority Critical patent/CN114201644A/en
Publication of CN114201644A publication Critical patent/CN114201644A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method separates a video to be processed into at least two frame images along its playback timeline and groups the frames into associated image sets based on the pairwise similarity between frame images; the first and last frame images of each associated image set are acquired as candidate frame images; key frame images are determined among the candidate frame images based on the text information corresponding to each candidate frame image; and a summary video of the video to be processed is generated from the key frame images and the main time nodes of a preset summary form. Redundant images in each associated set, namely those other than the first and last frame images, are deleted based on inter-frame similarity. This improves the generation efficiency of the summary video while reducing its redundant image information and improving the user experience.

Description

Method, device and equipment for generating abstract video and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for generating an abstract video.
Background
With the popularization of short videos and the continuous development of multimedia processing technology, product explanation and promotion through short videos has gradually become a mainstream practice. Among these, slide-based explanation videos are currently favored by many users: the presenter does not need to appear on camera, and only narrates with text or voice while the slides play in the background. However, slide-based short videos suffer from redundant slide content and insufficiently highlighted key points. Two summary generation approaches are currently in common use: manual screening and automatic screening based on computer technology. Manually screening the video segment by segment yields accurate summary content, but at high cost and low efficiency; automatic screening, mainly based on semantic understanding of the text or speech in the slides, is cheap and fast, but the summaries it generates contain too much content and considerable redundant information. How to reduce redundant information in the summary video while improving its generation efficiency has therefore become an urgent technical problem.
Disclosure of Invention
The main object of the present invention is to provide a method, an apparatus, a device, and a computer-readable storage medium for generating a summary video, aiming to improve the generation efficiency of the summary video while reducing its redundant information.
In order to achieve the above object, the present invention provides a method for generating a summary video, the method including: separating a video to be processed along its playback timeline to generate at least two frame images, and generating associated image sets based on the pairwise similarity between frame images, wherein the similarity between frame images within an associated image set exceeds a preset similarity threshold; acquiring the first and last frame images of each associated image set as candidate frame images; determining key frame images among the candidate frame images based on the text information corresponding to each candidate frame image; and generating the summary video of the video to be processed from the key frame images and the main time nodes of a preset summary form.
In addition, to achieve the above object, the present invention further provides an apparatus for generating a summary video, including: an associated image screening module, configured to separate a video to be processed along its playback timeline to generate at least two frame images and to generate associated image sets based on the pairwise similarity between frame images, wherein the similarity between frame images within an associated image set exceeds a preset similarity threshold; a candidate image screening module, configured to acquire the first and last frame images of each associated image set as candidate frame images; a key image screening module, configured to determine key frame images among the candidate frame images based on the text information corresponding to each candidate frame image; and a summary video generation module, configured to generate the summary video of the video to be processed from the key frame images and the main time nodes of a preset summary form.
In addition, to achieve the above object, the present invention further provides a digest video generation apparatus, which includes a processor, a memory, and a digest video generation program stored on the memory and executable by the processor, wherein when the digest video generation program is executed by the processor, the steps of the digest video generation method described above are implemented.
In addition, to achieve the above object, the present invention further provides a computer readable storage medium, on which a summary video generation program is stored, wherein the summary video generation program, when executed by a processor, implements the steps of the summary video generation method as described above.
The invention provides a method for generating a summary video, which separates a video to be processed along its playback timeline to generate at least two frame images, and generates associated image sets based on the pairwise similarity between frame images, wherein the similarity between frame images within an associated image set exceeds a preset similarity threshold; acquires the first and last frame images of each associated image set as candidate frame images; determines key frame images among the candidate frame images based on the text information corresponding to each candidate frame image; and generates the summary video of the video to be processed from the key frame images and the main time nodes of a preset summary form. In this way, redundant images in each associated set, other than the first and last frame images, are deleted based on inter-frame similarity; the key frame images are then further screened out from the candidate frame images, namely the first and last frame images of each associated image set, together with their corresponding text information; and the summary video is composed from the key frame images and the main time nodes of the summary form. Manual summary extraction is thus avoided, improving the generation efficiency of the summary video while reducing its redundant image information and improving the user experience.
Drawings
Fig. 1 is a schematic hardware configuration diagram of a digest video generation apparatus according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for generating a summarized video according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of a method for generating a summarized video according to the present invention;
FIG. 4 is a flowchart illustrating a method for generating a summarized video according to a third embodiment of the present invention;
fig. 5 is a functional block diagram of an apparatus for generating a summary video according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The method for generating a summary video is mainly applied to a summary video generation device, which may be any device with display and processing capabilities, such as a PC, a portable computer, or a mobile terminal.
Referring to fig. 1, fig. 1 is a schematic diagram of a hardware structure of a device for generating a summarized video according to an embodiment of the present invention. In this embodiment of the present invention, the apparatus for generating a summarized video may include a processor 1001 (e.g., a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used for realizing connection communication among the components; the user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard); the network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface); the memory 1005 may be a high-speed RAM memory, or may be a non-volatile memory (e.g., a magnetic disk memory), and optionally, the memory 1005 may be a storage device independent of the processor 1001.
Those skilled in the art will appreciate that the hardware configuration shown in fig. 1 does not limit the summary video generation device, which may include more or fewer components than shown, combine some components, or arrange components differently.
With continued reference to fig. 1, the memory 1005 of fig. 1, which is a type of computer-readable storage medium, may include an operating system, a network communication module, and a digest video generation program.
In fig. 1, the network communication module is mainly used for connecting to a server and performing data communication with the server; and the processor 1001 may call the digest video generation program stored in the memory 1005 and execute the digest video generation method provided by the embodiment of the present invention.
The embodiment of the invention provides a method for generating an abstract video.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for generating a summarized video according to a first embodiment of the present invention.
In this embodiment, the method for generating the abstract video includes the following steps:
step S10, separating the video to be processed based on a video playing time axis to generate at least two frames of images, and generating an associated image set based on the similarity between every two frames of images, wherein the similarity between the frames of images in the associated image set exceeds a preset similarity threshold;
in this embodiment, the video to be processed may be downloaded based on the video link of the video to be processed, or may be directly processed on line. The method comprises the steps of firstly separating a video to be processed into frame images according to a time axis, then removing redundant frame images in the frame images, carrying out optical character recognition on residual frame images from which the redundant frame images are removed, extracting pattern information of the frame images in the residual frame images, and determining key frame images in the residual frame images according to the pattern information of the frame images. And inserting the key frame images into each main time node in a preset abstract form, thereby generating the abstract video of the video to be processed.
Specifically, the video to be processed is separated into individual frame images along the video playback timeline, and mutually similar images are grouped into associated image sets.
Wherein the step S10 includes:
based on a video frame extraction algorithm, separating the video to be processed according to the video playing time axis, and generating at least two frames of images;
and calculating the pairwise similarity between the frame images, and taking frame images whose similarity exceeds a preset similarity threshold as an associated image set, thereby generating the associated image sets.
Specifically, image-level features are extracted from each frame image by a pre-trained deep learning model (such as the deep convolutional network VGGNet, the deep residual network ResNet, or the depthwise separable convolutional network MobileNet), and the similarity between the features of every pair of frame images is then computed with a distance function such as the Euclidean distance or the Chebyshev distance, weighted according to the actual scene.
When the similarity between two frame images is not less than the preset similarity threshold, both are added to an associated image set. For example, if the similarity between the first and second frame images exceeds the threshold, both are added to the first associated image set; if the similarity between the first and third frame images also exceeds the threshold, the third frame image is added to the same set, and so on, until all frame images whose pairwise similarity exceeds the threshold belong to the same associated image set. In this way, the associated image sets corresponding to the frame images are generated.
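The grouping step above can be sketched in Python. This is a minimal illustration rather than the patented implementation: a grayscale histogram stands in for the deep features (VGGNet/ResNet/MobileNet) the embodiment describes, frames are flattened pixel lists, and the 0.9 threshold is an arbitrary assumption.

```python
import math

def frame_feature(frame, bins=16):
    """Stand-in feature: a normalized grayscale histogram of pixel values
    in [0, 255]. (The patent uses deep features from a pre-trained model.)"""
    hist = [0] * bins
    for px in frame:
        hist[min(px * bins // 256, bins - 1)] += 1
    total = max(sum(hist), 1)
    return [h / total for h in hist]

def similarity(f1, f2):
    """Similarity in (0, 1] derived from the Euclidean distance between features."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
    return 1.0 / (1.0 + dist)

def group_associated_sets(frames, threshold=0.9):
    """Walk frames in playback order: a frame whose similarity to the first
    frame of the current associated set exceeds the threshold joins that set;
    otherwise it starts a new set."""
    feats = [frame_feature(f) for f in frames]
    sets, current = [], [0]
    for i in range(1, len(frames)):
        if similarity(feats[current[0]], feats[i]) > threshold:
            current.append(i)
        else:
            sets.append(current)
            current = [i]
    sets.append(current)
    return sets  # lists of frame indices, one list per associated set

# three near-identical dark frames, then two bright frames (flattened pixels)
dark = [0] * 64
bright = [250] * 64
print(group_associated_sets([dark, dark, dark, bright, bright]))
# → [[0, 1, 2], [3, 4]]
```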
Further, before the step S10, the method further includes:
acquiring the video to be processed, and judging whether the video to be processed belongs to an effective video or not based on the duration of the video to be processed;
when the video to be processed does not belong to the effective video, acquiring a next video as the video to be processed;
when the video to be processed belongs to the effective video, executing: the method comprises the steps of separating videos to be processed based on a video playing time axis to generate at least two frames of images, and generating at least one associated image set based on the similarity between every two frames of images.
Specifically, when the video to be processed is acquired, its duration is compared with a preset valid-video duration (which is itself no shorter than the summary video duration) to judge whether it is a valid video; a video shorter than the valid-video duration is invalid. When the video to be processed is valid, it is separated into frame images along the playback timeline. Videos too short to yield a summary video are thus filtered out directly, improving the generation efficiency of the summary video.
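The duration pre-filter above amounts to a single comparison. In this sketch, the 60-second minimum and the file names are illustrative assumptions, not values from the patent.

```python
def is_valid_video(duration_s: float, min_valid_s: float = 60.0) -> bool:
    """A video shorter than the minimum valid duration (itself no shorter
    than the target summary length) is filtered out before any frame work."""
    return duration_s >= min_valid_s

# hypothetical queue of (name, duration in seconds) pairs
videos = [("intro.mp4", 35.0), ("lecture.mp4", 540.0)]
to_process = [name for name, d in videos if is_valid_video(d)]
print(to_process)  # → ['lecture.mp4']
```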
Step S20, acquiring a first frame image and a last frame image in each associated image set as candidate frame images;
in this embodiment, each associated image set stores multiple similar frame images; the first and last frame images of each set are retained, and the redundant frames, namely all other frame images, are deleted. The first and last frame images of each associated image set are then acquired as candidate frame images. According to the playback order of the frames of an associated image set on the video timeline, the earliest-played image is taken as the first frame image and the latest-played image as the last frame image.
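Keeping only the first and last frame of each associated set can be sketched as below; `candidate_frames` is a hypothetical helper operating on lists of frame indices already sorted in playback order.

```python
def candidate_frames(associated_sets):
    """Keep only the first and last frame index of each associated set;
    everything in between is redundant. A single-frame set contributes
    just that one frame."""
    candidates = []
    for s in associated_sets:
        candidates.append(s[0])
        if len(s) > 1:
            candidates.append(s[-1])
    return candidates

print(candidate_frames([[0, 1, 2], [3, 4], [5]]))
# → [0, 2, 3, 4, 5]
```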
Step S30, determining a key frame image in each of the candidate frame images based on the text information corresponding to each of the candidate frame images;
in this embodiment, character recognition is performed on each candidate frame image according to an OCR technology, for example, character recognition extraction is performed by pressing a picture in a chat communication software, or character extraction in the picture is performed by an onenote plug-in Office software Office, a picture character recognition applet can also be directly searched in a wechat applet, character recognition and extraction are performed through the recognition applet, and character recognition extraction can also be performed through an OCR character recognition software. After extracting the document information of each frame of image, such as the number of characters in the document and the size of the characters. And deleting the candidate frames without the file information and with the file character number lower than the preset number threshold, and taking the frame images with the other character numbers higher than the preset number threshold as the related frame images. And taking the relevant frame image with the character size exceeding a preset size threshold as a key frame image.
Step S40, generating a digest video of the to-be-processed video according to the key frame image and the main time node in the preset digest form.
In this embodiment, the key frame images are recombined according to the actual requirements and the preset summary form, thereby generating the summary video corresponding to the video to be processed.
The embodiment provides a method for generating a summary video, which separates a video to be processed along its playback timeline to generate at least two frame images, and generates associated image sets based on the pairwise similarity between frame images, wherein the similarity between frame images within an associated image set exceeds a preset similarity threshold; acquires the first and last frame images of each associated image set as candidate frame images; determines key frame images among the candidate frame images based on the text information corresponding to each candidate frame image; and generates the summary video of the video to be processed from the key frame images and the main time nodes of a preset summary form. In this manner, the embodiment deletes, based on inter-frame similarity, the redundant images in each associated set other than the first and last frame images; further screens out the key frame images based on the candidate frame images, namely the first and last frame images of each associated image set, and their corresponding text information; and composes the summary video from the key frame images and the main time nodes of the summary form. Manual summary extraction is thus avoided, improving the generation efficiency of the summary video while reducing its redundant image information and improving the user experience.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of a method for generating a summarized video according to the present invention.
Based on the foregoing embodiment shown in fig. 2, in this embodiment, the step S30 specifically includes:
Step S31, performing character recognition on the candidate frame images, and acquiring the text information corresponding to each candidate frame image, wherein the text information includes the character count and the character size;
Step S32, determining the relevant frame images whose character count exceeds a preset count threshold among the candidate frame images;
step S33, in each of the relevant frame images, determining a key frame image whose text size exceeds a preset size threshold.
In this embodiment, character recognition is performed on each candidate frame image according to optical character recognition technology, and the text information of each frame image, such as the character count and the character size, is extracted. Among the relevant frame images, those whose character size exceeds the preset size threshold are determined to be key frame images; for example, the image with the most characters and the largest font among the relevant frames may be determined as a key frame image. In an embodiment, the introduction start frame, the title frames (i.e., the main-content headline frames), and the summary end frame can further be determined from the text keywords of the key frame images.
Referring to fig. 4, fig. 4 is a flowchart illustrating a method for generating a summarized video according to a third embodiment of the present invention.
Based on the foregoing embodiment shown in fig. 3, in this embodiment, the step S30 further includes:
Step S34, performing character recognition on the candidate frame images, and acquiring the keywords, character positions, and character repetition rates corresponding to the candidate frame images as the text information;
Step S35, determining, as the key frame images, an introduction start frame image, title frame images, and a summary end frame image among the key frame images according to the keywords, character positions, and character repetition rates of the candidate frame images.
In this embodiment, after the introduction start frame, the title frames (i.e., the main-content headline frames), and the summary end frame are identified, the remaining key frames are matched against them according to the text keywords in the images, and each remaining key frame is classified in turn as an introduction start frame image, a title frame image, or a summary end frame image, in combination with its playback order on the video timeline. For example, a key frame whose characters sit at the top or middle of the image and whose text contains welcome keywords serves as an introduction start frame image; an image containing characters with a high repetition rate and keywords such as "Chapter 1" or "Chapter 2" serves as a title frame image; and a key frame whose characters sit at the top or middle of the image and whose text contains thank-you keywords serves as a summary end frame image.
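The keyword-and-position rules of this embodiment might look like the following sketch. The English keyword lists stand in for the ones implied by the source ("welcome", chapter headings, "thank you"), and the `other` fallback label is an assumption.

```python
def classify_frame(text: str, position: str) -> str:
    """Classify a key frame by its text keywords and character position:
    welcome keywords near the top or middle mark an introduction start
    frame; chapter keywords mark a title frame; thank-you keywords mark
    a summary end frame."""
    top_or_middle = position in ("top", "middle")
    if top_or_middle and any(k in text for k in ("welcome", "hello")):
        return "introduction_start"
    if any(k in text for k in ("Chapter", "Section", "Part")):
        return "title"
    if top_or_middle and any(k in text for k in ("thank you", "thanks")):
        return "summary_end"
    return "other"

print(classify_frame("welcome to the course", "top"))     # → introduction_start
print(classify_frame("Chapter 2: Key Frames", "middle"))  # → title
print(classify_frame("thank you for watching", "top"))    # → summary_end
```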
Further, among the title frame images, title levels are determined by font size: for example, the image with the largest font is determined to be the main-content headline frame image, the image with the second-largest font a lower-level section heading frame image, and so on.
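Ranking title frames into heading levels by font size can be sketched as below; the (frame index, font size) pair representation is an assumption for illustration.

```python
def rank_title_frames(title_frames):
    """Order title frames by font size, descending: the largest font is
    level 1 (the main-content headline), the next sizes become lower-level
    section headings. Input: (frame_index, font_size) pairs.
    Output: (level, frame_index) pairs."""
    ranked = sorted(title_frames, key=lambda t: t[1], reverse=True)
    return [(level + 1, idx) for level, (idx, _) in enumerate(ranked)]

print(rank_title_frames([(7, 28), (3, 48), (5, 36)]))
# → [(1, 3), (2, 5), (3, 7)]
```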
Further, the step S40 specifically includes:
and inserting each key frame image into the main time node, and inserting a preset transition frame image between the key frame images of adjacent nodes to generate the abstract video.
The inserting each key frame image into the main time node, and inserting a preset transition frame image between the key frame images of adjacent nodes, and generating the abstract video specifically includes:
inserting the introduction start frame image at the introduction node among the main time nodes, inserting the title frame images at the title nodes among the main time nodes, and inserting the summary end frame image at the summary node among the main time nodes;
and inserting the transition frame image between an adjacent introduction start frame image and title frame image and/or between an adjacent title frame image and summary end frame image.
In this embodiment, a summary form is first determined, including but not limited to a PPT explanation video form or an H5 animation video form, along with the main time nodes corresponding to that form, including but not limited to the general introduction node, the main-content introduction node (with its section nodes), the start and end nodes of each section, and the summary-and-end node. The corresponding key frame images are inserted at each main time node: the introduction start frame image at the general introduction node, the title frame images and section heading frame images at the main-content introduction node and the section nodes, the summary end frame image at the summary-and-end node, and so on.
In a specific embodiment, transition frame images, including but not limited to transition animation images or other non-key-frame images, may also be inserted between adjacent key frames (e.g., between introduction start frame images, between title frame images, between summary end frame images, between an introduction start frame image and a title frame image, between an introduction start frame image and a summary end frame image, and/or between a title frame image and a summary end frame image), thereby generating the summary video.
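Assembling the summary by placing key frames at their main time nodes and inserting a transition between frames of adjacent nodes might be sketched as follows; representing frames as string labels is purely illustrative.

```python
def assemble_summary(intro, titles, summary_end, transition="transition"):
    """Lay out the key frames in node order (introduction, titles,
    summary end) and insert a transition frame between each pair of
    adjacent nodes."""
    nodes = [intro, *titles, summary_end]
    sequence = []
    for i, frame in enumerate(nodes):
        if i > 0:
            sequence.append(transition)
        sequence.append(frame)
    return sequence

print(assemble_summary("intro", ["title_1", "title_2"], "summary"))
# → ['intro', 'transition', 'title_1', 'transition', 'title_2', 'transition', 'summary']
```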
In addition, the embodiment of the invention also provides a device for generating the abstract video.
Referring to fig. 5, fig. 5 is a functional block diagram of a device for generating a summary video according to a first embodiment of the present invention.
In this embodiment, the apparatus for generating a summary video includes:
the related image screening module 10 is configured to separate videos to be processed based on a video playing time axis to generate at least two frames of images, and generate a related image set based on a similarity between every two frames of images, where the similarity between the frames of images in the related image set exceeds a preset similarity threshold;
a candidate image screening module 20, configured to obtain a first frame image and a last frame image in each associated image set as candidate frame images;
a key image screening module 30, configured to determine key frame images among the candidate frame images based on the text information corresponding to each candidate frame image;
and the abstract video generation module 40 is configured to generate an abstract video of the to-be-processed video according to the key frame image and the main time node in a preset abstract form.
Further, the associated image screening module 10 specifically includes:
the video separation and extraction unit is used for separating the video to be processed according to the video playing time axis based on a video frame extraction algorithm and generating at least two frames of images;
the first image screening unit is used for calculating the pairwise similarity between the frame images, taking frame images whose similarity exceeds a preset similarity threshold as an associated image set, and thereby generating the associated image sets.
Further, the key image filtering module 30 specifically includes:
the image character recognition unit is used for performing character recognition on the candidate frame images and acquiring the text information corresponding to each candidate frame image, wherein the text information includes the character count and the character size;
the related image screening unit is used for determining related frame images with the number of characters exceeding a preset number threshold in each candidate frame image;
and the key image screening unit is used for determining the key frame images with the character sizes exceeding a preset size threshold in the relevant frame images.
Further, the key image filtering module 30 specifically includes:
a text information extraction unit, configured to perform character recognition on the candidate frame images and acquire the keywords, character positions, and character repetition rates corresponding to the candidate frame images as the text information;
and a second image screening unit, configured to determine, as the key frame images, an introduction start frame image, title frame images, and a summary end frame image among the key frame images according to the keywords, character positions, and character repetition rates of the candidate frame images.
Further, the summary video generation module 40 specifically includes:
and the abstract video generation unit is used for inserting each key frame image into the main time node, and inserting a preset transition frame image between the key frame images of adjacent nodes to generate the abstract video.
Further, the summary video generation unit specifically includes:
an image inserting subunit, configured to insert the introduction start frame image at the introduction node among the main time nodes, insert the title frame image at the title node, and insert the summary end frame image at the summary node;
and a video generation subunit, configured to insert the transition frame image between the adjacent introduction start frame image and title frame image and/or between the adjacent title frame image and summary end frame image.
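The node placement and transition insertion performed by these subunits can be sketched as follows; key frames are treated as opaque objects, and the fixed intro/title/summary node order is taken from the preset summary form described above. Skipping an absent node is an assumption, not something the source states.

```python
def assemble_summary(intro_frame, title_frame, summary_frame, transition_frame):
    """Place the key frames at the introduction, title and summary nodes in
    order, inserting the preset transition frame between each pair of
    adjacent key frames."""
    timeline = []
    for frame in (intro_frame, title_frame, summary_frame):
        if frame is None:
            continue  # a role may be missing; skip its node (assumption)
        if timeline:
            timeline.append(transition_frame)  # bridge adjacent nodes
        timeline.append(frame)
    return timeline
```

The resulting list is the frame order of the summary video: key frames at their nodes, separated by transition frames.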
Further, the apparatus for generating the summary video further includes a video screening module, and the video screening module specifically includes:
an effective video judging unit, configured to acquire the video to be processed and judge, based on the duration of the video to be processed, whether the video to be processed is an effective video;
an invalid video filtering unit, configured to acquire the next video as the video to be processed when the video to be processed is not an effective video;
and an effective video processing unit, configured to, when the video to be processed is an effective video, turn to the associated image screening module, which is configured to execute: separating the video to be processed based on the video playing time axis to generate at least two frame images, and generating at least one associated image set based on the similarity between every two frame images.
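The duration-based screening loop above can be sketched as below. The source does not give the concrete duration criterion, so a simple min/max range is assumed; the field name `duration` is likewise illustrative.

```python
def next_effective_video(videos, min_duration=30.0, max_duration=600.0):
    """Iterate over pending videos and return the first one whose duration
    makes it an effective video, skipping invalid videos along the way.
    The min/max bounds are illustrative assumptions."""
    for video in videos:
        if min_duration <= video["duration"] <= max_duration:
            return video  # effective video: hand off to image screening
    return None  # no effective video in the queue
```

Only the returned video proceeds to frame separation and associated-image grouping; videos outside the range are silently skipped.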
Each module in the apparatus for generating a summary video corresponds to a step in the embodiments of the method for generating a summary video, and their functions and implementation processes are not described in detail herein.
In addition, an embodiment of the present invention further provides a computer-readable storage medium.
The computer-readable storage medium of the present invention stores a summary video generation program, wherein the summary video generation program, when executed by a processor, implements the steps of the summary video generation method described above.
For the method implemented when the summary video generation program is executed, reference may be made to the embodiments of the summary video generation method of the present invention, and details are not repeated herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for generating a summary video, characterized by comprising the following steps:
separating a video to be processed based on a video playing time axis to generate at least two frame images, and generating an associated image set based on the similarity between every two frame images, wherein the similarity between the frame images in the associated image set exceeds a preset similarity threshold;
acquiring a first frame image and a last frame image in each associated image set as candidate frame images;
determining a key frame image in each candidate frame image based on the text information corresponding to each candidate frame image;
and generating the summary video of the video to be processed according to the main time nodes in a preset summary form and the key frame images.
2. The method for generating a summary video according to claim 1, wherein the separating a video to be processed based on a video playing time axis to generate at least two frame images, and generating at least one associated image set based on the similarity between every two frame images, comprises:
separating the video to be processed according to the video playing time axis based on a video frame extraction algorithm, and generating at least two frame images;
and calculating the similarity between every two frame images, taking the frame images whose similarity exceeds a preset similarity threshold as an associated image set, and generating each associated image set.
3. The method for generating a summary video according to claim 1, wherein the determining a key frame image in each candidate frame image based on the text information corresponding to each candidate frame image comprises:
performing character recognition on each candidate frame image, and acquiring the text information corresponding to each candidate frame image, wherein the text information comprises the number of characters and the character size;
determining, in each candidate frame image, the related frame images whose number of characters exceeds a preset number threshold;
and determining, in each related frame image, the key frame images whose character size exceeds a preset size threshold.
4. The method for generating a summary video according to claim 1, wherein the determining a key frame image in each candidate frame image based on the text information corresponding to each candidate frame image comprises:
performing character recognition on each candidate frame image, and acquiring the keywords, character positions and character repetition rates corresponding to each candidate frame image as the text information;
and determining, among the candidate frame images, an introduction start frame image, a title frame image and a summary end frame image as the key frame images according to the keywords, character positions and character repetition rates of each candidate frame image.
5. The method for generating a summary video according to claim 4, wherein the generating the summary video of the video to be processed according to the main time nodes in the preset summary form and the key frame images comprises:
inserting each key frame image at the corresponding main time node, and inserting a preset transition frame image between the key frame images of adjacent nodes to generate the summary video.
6. The method for generating a summary video according to claim 5, wherein the inserting each key frame image at the corresponding main time node and inserting preset transition frame images between the key frame images of adjacent nodes comprises:
inserting the introduction start frame image at an introduction node among the main time nodes, inserting the title frame image at a title node among the main time nodes, and inserting the summary end frame image at a summary node among the main time nodes;
and inserting the transition frame image between the adjacent introduction start frame image and title frame image and/or between the adjacent title frame image and summary end frame image.
7. The method for generating a summary video according to any one of claims 1 to 6, wherein before the separating a video to be processed based on a video playing time axis to generate at least two frame images and generating at least one associated image set based on the similarity between every two frame images, the method further comprises:
acquiring the video to be processed, and judging, based on the duration of the video to be processed, whether the video to be processed is an effective video;
when the video to be processed is not an effective video, acquiring the next video as the video to be processed;
and when the video to be processed is an effective video, executing the step of: separating the video to be processed based on the video playing time axis to generate at least two frame images, and generating at least one associated image set based on the similarity between every two frame images.
8. An apparatus for generating a summary video, characterized in that the apparatus comprises:
an associated image screening module, configured to separate a video to be processed based on a video playing time axis to generate at least two frame images, and generate an associated image set based on the similarity between every two frame images, wherein the similarity between the frame images in the associated image set exceeds a preset similarity threshold;
a candidate image screening module, configured to acquire a first frame image and a last frame image in each associated image set as candidate frame images;
a key image screening module, configured to determine a key frame image in each candidate frame image based on the text information corresponding to each candidate frame image;
and a summary video generation module, configured to generate the summary video of the video to be processed according to the main time nodes in a preset summary form and the key frame images.
9. A device for generating a summary video, characterized by comprising a processor, a memory, and a summary video generation program stored on the memory and executable by the processor, wherein the summary video generation program, when executed by the processor, implements the steps of the method for generating a summary video according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a summary video generation program is stored on the computer-readable storage medium, wherein the summary video generation program, when executed by a processor, implements the steps of the method for generating a summary video according to any one of claims 1 to 7.
CN202111535826.4A 2021-12-15 2021-12-15 Method, device and equipment for generating abstract video and storage medium Pending CN114201644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111535826.4A CN114201644A (en) 2021-12-15 2021-12-15 Method, device and equipment for generating abstract video and storage medium

Publications (1)

Publication Number Publication Date
CN114201644A true CN114201644A (en) 2022-03-18

Family

ID=80654089


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114697764A (en) * 2022-06-01 2022-07-01 深圳比特微电子科技有限公司 Method and device for generating video abstract and readable storage medium
CN114938462A (en) * 2022-06-07 2022-08-23 平安科技(深圳)有限公司 Intelligent editing method and system of teaching video, electronic equipment and storage medium
CN115022733A (en) * 2022-06-17 2022-09-06 中国平安人寿保险股份有限公司 Abstract video generation method and device, computer equipment and storage medium
CN116489477A (en) * 2023-04-28 2023-07-25 青岛尘元科技信息有限公司 Holographic video generation method, device, equipment and storage medium
CN117812440A (en) * 2024-02-28 2024-04-02 南昌理工学院 Method, system, computer and storage medium for generating monitoring video abstract



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination