CN111625683B - Automatic video abstract generation method and system based on graph structure difference analysis


Info

Publication number
CN111625683B
Authority
CN
China
Prior art keywords
frame
image
video
shot
image block
Prior art date
Legal status
Active
Application number
CN202010376813.6A
Other languages
Chinese (zh)
Other versions
CN111625683A (en)
Inventor
吕晨
柴春蕾
马彩霞
马艳玲
吕蕾
刘弘
Current Assignee
Beijing Senbo Mingde Marketing Technology Co.,Ltd.
Original Assignee
Shandong Normal University
Priority date
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN202010376813.6A
Publication of CN111625683A
Application granted
Publication of CN111625683B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and system for automatically generating video abstracts based on graph structure difference analysis, comprising the following steps: preprocessing a given video stream, and dividing each frame of image in the preprocessed video stream into a plurality of image blocks of equal size; extracting features from each image block to obtain the feature vector of each image block of each frame of image; establishing an undirected weighted graph of each frame of image according to the feature vectors of its image blocks; detecting video shot boundaries through hypothesis testing based on graph structure difference analysis; and extracting the key frames in each video shot based on the median graph of each video shot. The present disclosure addresses the problem that raw features may fail to fully capture the detailed structural information in a frame, making the method more robust in detecting various types of shot transitions.

Description

Automatic video abstract generation method and system based on graph structure difference analysis
Technical Field
The disclosure relates to the technical field of automatic generation of static video abstracts (video key frame extraction), and in particular to an automatic generation method and system for video abstracts based on graph structure difference analysis.
Background
The statements in this section merely mention background art related to the present disclosure and do not necessarily constitute prior art.
In recent years, a large number of new photographing devices and video applications have emerged, and the amount of video on the internet has grown enormously. There is an increasing need to quickly view and review large amounts of video data in a limited time, to facilitate video browsing and video retrieval. This is also a general concern in fields where large amounts of video data must be stored, archived, analyzed, or visualized. Automatic video summarization techniques address these problems by generating a reduced version of a video stream that retains only its most informative and representative content.
Automatic video summarization techniques can be divided into two categories: static video summarization (static key frames) and dynamic video summarization (dynamic video browsing). Currently, in the field of video key frame extraction, key frame extraction via shot boundary detection is widely used.
In general, shot boundary detection may be achieved by analyzing the differences between successive frames, where a significant difference indicates a possible boundary at the currently detected position. Many similarity measures based on different video features exist, such as pixel differences, color histogram differences, compressed-domain techniques, motion vectors, object tracking, and event analysis. These methods have proven effective mainly at detecting abrupt shot changes (e.g., hard cuts).
However, the inventors have found that a major limitation affecting the performance of existing methods is their inability to detect subtle changes. For transitions such as dissolves, wipes, and fade-ins/fade-outs, the inter-frame changes of a gradual shot are relatively fine and difficult to detect using only the low-level features adopted by traditional methods. The reason is that low-level features such as pixels, pixel blocks, and histograms do not express the underlying detailed structure of each frame, which plays a crucial role in distinguishing the nuances between successive frames in a gradual shot.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the present disclosure provides a method and a system for automatically generating video abstracts based on graph structure difference analysis. Taking the structural information of the video into consideration, the video frames are modeled with undirected weighted graphs, and shot boundaries are detected through structure difference analysis between the graphs. A median graph is then calculated within each shot, and the corresponding key frames are extracted. The present disclosure addresses the problem that raw features may fail to fully capture the detailed structural information in a frame, making the method more robust in detecting various types of shot transitions.
In a first aspect, the present disclosure provides a method for automatically generating a video summary based on graph structure difference analysis;
the automatic video abstract generation method based on graph structure difference analysis comprises the following steps:
preprocessing a given video stream, and dividing each frame of image in the preprocessed video stream into a plurality of image blocks with equal size;
extracting the characteristics of each image block to obtain the characteristic vector of each image block of each frame of image;
establishing an undirected weighted graph of each frame of image according to the characteristic vector of each image block of each frame of image; detecting video shot boundaries based on hypothesis testing of graph structure difference analysis;
the key frames in each video shot are extracted based on the median graph of each video shot.
In a second aspect, the present disclosure further provides a video summary automatic generation system based on graph structure difference analysis;
an automatic video abstract generating system based on graph structure difference analysis comprises:
a preprocessing module configured to: preprocessing a given video stream, and dividing each frame of image in the preprocessed video stream into a plurality of image blocks with equal size;
a feature extraction module configured to: extracting the characteristics of each image block to obtain the characteristic vector of each image block of each frame of image;
a shot boundary detection module configured to: establishing an undirected weighted graph of each frame of image according to the characteristic vector of each image block of each frame of image; detecting video shot boundaries based on hypothesis testing of graph structure difference analysis;
a key frame extraction module configured to: the key frames in each video shot are extracted based on the median graph of each video shot.
In a third aspect, the present disclosure also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
In a fifth aspect, the present disclosure also provides a computer program (product) comprising a computer program for implementing the method of any one of the preceding aspects when run on one or more processors.
Compared with the prior art, the beneficial effects of the present disclosure are:
(1) The present disclosure proposes a new graph-modeling-based video representation, in which strong connectivity between graphs becomes a key factor in determining the structural features of video frames, bridging the gap between the actual semantics of video frames and their raw features.
(2) The present disclosure proposes a graph-based dissimilarity measure for inter-frame differences that reflects the potential differences between successive frames, enhancing the robustness and accuracy of detecting various shot transitions.
(3) The present disclosure extracts the frames corresponding to the median graphs as the key frames of each shot, which reflects the overall trend of the video more comprehensively and accurately.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application.
FIG. 1 is an overall flow overview of an algorithm according to a first embodiment of the disclosure;
FIG. 2 is a schematic diagram of video representation based on graph modeling in accordance with an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating detecting video shot boundaries based on graph structure difference analysis according to a first embodiment of the disclosure;
fig. 4 is a diagram illustrating the median graph calculated within a shot according to an embodiment of the present disclosure.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
The embodiment provides a video abstract automatic generation method based on graph structure difference analysis;
as shown in fig. 1, the automatic video abstract generating method based on graph structure difference analysis includes:
s101: preprocessing a given video stream, and dividing each frame of image in the preprocessed video stream into a plurality of image blocks with equal size;
s102: extracting the characteristics of each image block to obtain the characteristic vector of each image block of each frame of image;
s103: establishing an undirected weighted graph of each frame of image according to the characteristic vector of each image block of each frame of image; detecting video shot boundaries based on hypothesis testing of graph structure difference analysis;
s104: the key frames in each video shot are extracted based on the median graph of each video shot.
As one or more embodiments, in S101, a given video stream is preprocessed, and each frame of image in the preprocessed video stream is divided into a plurality of image blocks with equal size; the method comprises the following specific steps:
sampling a given video stream to obtain a video frame set; the video frame set comprises video frames sampled from a given video stream;
and carrying out consistent size adjustment on each frame of image in the video frame set, and dividing each frame of image after adjustment into a plurality of image blocks with equal size.
The image blocks herein are also referred to as patches.
Exemplary, in S101, a given video stream is preprocessed, and each frame of image in the preprocessed video stream is divided into a plurality of image blocks with equal size; the method comprises the following specific steps:
First, for a given video stream, a set of video frames F = {f_1, f_2, f_3, ..., f_n} containing n frames is extracted at a predefined sampling rate. Frame f_i denotes the i-th frame of the video frame set.
A predefined sampling rate r is used:
[equation defining the sampling rate r in terms of a specified constant C; the formula image is not recoverable]
The specified constant C is usually 1 to 3, and is taken as 3 here.
Secondly, considering the influence of noise such as local illumination in video frames, each frame in the video frame set F is resized to 256 × 192.
Each frame is then equally divided into k patches, so that each frame is denoted as f_i = {f_i^1, f_i^2, f_i^3, ..., f_i^k}, where f_i^k represents the k-th patch of frame f_i. As shown in fig. 2, k = 4 is set.
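For illustration, a minimal Python sketch of this preprocessing step follows; the use of OpenCV, the function names, and the fixed sampling stride (standing in for the unrecoverable sampling-rate formula) are assumptions, not part of the disclosed method:

```python
# Sketch of S101: sample frames, resize to 256x192, split into k equal patches.
import cv2

def sample_and_patch(video_path, stride=3, size=(256, 192), grid=(2, 2)):
    """Sample every `stride`-th frame, resize it, and split it into patches."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(cv2.resize(frame, size))  # size = (width, height)
        idx += 1
    cap.release()

    rows, cols = grid
    h, w = size[1] // rows, size[0] // cols
    # Each frame f_i becomes k = rows*cols equal-size patches f_i^1..f_i^k
    # (k = 4 for the 2x2 grid shown in fig. 2).
    return [[f[r*h:(r+1)*h, c*w:(c+1)*w]
             for r in range(rows) for c in range(cols)] for f in frames]
```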
As one or more embodiments, in S102, feature extraction is performed on each image block to obtain a feature vector of each image block of each frame of image; the method comprises the following specific steps:
extracting HSV color histograms from each image block of each frame of preprocessed image;
extracting an HOG direction gradient histogram from each image block of each preprocessed frame image;
and connecting the HSV color histogram and the HOG direction gradient histogram of each image block of each frame of image to obtain the characteristic vector of each image block of each frame of image.
It will be appreciated that feature extraction plays a critical role in video representation as the first step in key frame extraction, with a critical impact on the subsequent extraction process.
It should be appreciated that color histograms are the most expressive features in video representations. In order to use the color histogram, an appropriate color space needs to be selected in advance. HSV is chosen as the color space because it is more robust to noise. HSV can effectively separate RGB into intensity (brightness) and color information, much like the way humans perceive color.
Illustratively, extracting an HSV color histogram for each image block of each frame of the image after preprocessing; the method comprises the following specific steps:
A color quantization step is employed for each of hue (H), saturation (S), and value (V): specifically, 16 hue components and 4 components each for saturation and value. Thus, each image block of each frame is represented as a 256-dimensional (16 × 4 × 4) HSV histogram:
f_i^p,HSV = {h_1, h_2, ..., h_256} ∈ R^256
it should be appreciated that HOG counts the gradient direction information of the local region to describe shape edge information of the image. It perfectly describes geometrical and optical deformations and is therefore very robust against environmental changes. Currently, HOG is widely used in the fields of scene analysis, target detection, recognition systems, and the like.
Illustratively, each patch of a video frame is partitioned into cell units of size 16 × 16, and the 8-bin histogram of each cell unit is calculated, yielding a 384-dimensional HOG histogram for each patch, expressed as:
f_i^p,HOG = {g_1, g_2, ..., g_384} ∈ R^384
Illustratively, the HSV color histogram and the HOG histogram of each image block of each frame image are concatenated into a single feature vector:
f_i^p = [f_i^p,HSV, f_i^p,HOG] ∈ R^640
known as the HSV-HOG histogram.
Thus, the video frame set F is represented as a set of 640 × k × n-dimensional histogram features:
F = {f_1, f_2, ..., f_i, ..., f_n} ∈ R^(640×k×n)    (3)
where f_i = {f_i^1, f_i^2, ..., f_i^p, ..., f_i^k} ∈ R^(640×k).
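As an illustrative sketch of this feature extraction step (the OpenCV calls and the unsigned-gradient convention for HOG are assumptions), each patch can be mapped to its 640-dimensional HSV-HOG vector as follows:

```python
# Sketch of S102: 256-dim HSV histogram + 384-dim HOG histogram per patch.
import cv2
import numpy as np

def hsv_histogram(patch_bgr):
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    # 16 hue x 4 saturation x 4 value bins -> 256 dimensions.
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [16, 4, 4],
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, None).flatten()

def hog_histogram(patch_bgr, cell=16, bins=8):
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy)   # angle in radians, [0, 2*pi)
    ang = ang % np.pi                    # unsigned orientation (assumption)
    feats = []
    h, w = gray.shape
    for r in range(0, h - cell + 1, cell):      # 16x16 cell units
        for c in range(0, w - cell + 1, cell):
            m = mag[r:r+cell, c:c+cell].ravel()
            a = ang[r:r+cell, c:c+cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
            feats.append(hist)
    v = np.concatenate(feats)            # 48 cells x 8 bins = 384 dimensions
    return v / (np.linalg.norm(v) + 1e-8)

def hsv_hog(patch_bgr):
    # Concatenation into the 640-dimensional HSV-HOG histogram f_i^p.
    return np.concatenate([hsv_histogram(patch_bgr), hog_histogram(patch_bgr)])
```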
As one or more embodiments, in S103, an undirected weighted graph of each frame image is established according to the feature vector of each image block of each frame image; the method comprises the following specific steps:
s1031: taking the frequency components of the characteristic vector of each image block of each frame image as the nodes of the undirected weighted graph;
s1032: taking the connections between pairs of nodes as the edges of the undirected weighted graph;
s1033: taking the distance between the frequency amplitudes of two nodes as the weight of the corresponding edge;
s1034: expressing the undirected weighted graph as an adjacency matrix, and regularizing the adjacency matrix;
s1035: processing all frame images in the video frame set in the same way as in S1031 to S1034 to obtain the corresponding set of adjacency matrices.
It is well known that shot detection and key frame selection face two challenges. The first is over-segmentation caused by changes in local content, illumination conditions, shooting angle, and shooting distance; the second is missed key frames in gradual shots. In the over-segmentation problem, the content on the two sides of a falsely detected shot boundary shows no overall structural change. In the missed key frame problem, the background of the missed key frame is similar to that of adjacent key frames, yet expresses completely different structural and spatial information. HSV-HOG histograms can only express one-dimensional statistics of video frames. Therefore, a suitable model is needed to fully represent the structural and spatial information of video frames and to effectively reflect structural changes in the video stream. The present disclosure designs a graph model to express the content of a video frame.
Illustratively, in the step S103, an undirected weighted graph of each frame image is established according to the feature vector of each image block of each frame image; the method comprises the following specific steps:
modeling each image block of each frame as an Undirected Weighted Graph (UWG) G_p = {V, E} based on the features of the HSV-HOG histogram, constructed as follows:
1) each v(i) in V (1 ≤ i ≤ X, X = 640) represents the i-th frequency component of the HSV-HOG histogram, i.e., one node of the graph;
2) every two nodes v(i) and v(j) are connected by an edge e_ij, and the Manhattan distance between the frequency amplitudes of the two nodes is calculated as the edge weight d_ij;
3) the graph G_p is represented as an adjacency matrix A_p, i.e., A_p = {d_ij}, and the matrix is regularized.
The video frame set F is thus modeled as a series of undirected weighted graphs G = {G_1, G_2, ..., G_i, ..., G_n}, and strong connectivity between the graphs becomes a key factor in determining the structural features of the video frames.
Finally, the graph sequence G is represented as an adjacency matrix sequence A, i.e., A = {A_1, A_2, ..., A_i, ..., A_n}, where G_i = {G_i^1, G_i^2, ..., G_i^k} denotes the graph corresponding to frame f_i, comprising k subgraphs G_p (one per patch), and A_i = {A_i^1, A_i^2, ..., A_i^k} denotes the adjacency matrix corresponding to graph G_i, comprising k sub-matrices A_p.
As shown in fig. 3, in S103, a video shot boundary is detected based on a hypothesis test of the graph structure difference analysis; the method comprises the following specific steps:
s103a1: obtaining the degree of difference between two adjacent frame images based on the difference of their corresponding adjacency matrices;
s103a2: predicting the size of the current shot, and analyzing the degree of difference between frames of the current shot using a sliding window with a dynamically adjusted step size;
s103a3: judging the significance of the change in the degree of difference between the current consecutive frames through hypothesis testing, thereby detecting the video shot boundaries.
Further, the step size of the sliding window is adjusted dynamically as follows: the degree of difference between the starting frame image of a new shot and the frame image a specified step away is compared with a set threshold, so that the current shot size is predicted and the sliding-window step size for the current shot is adjusted accordingly.
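By way of illustration only, the sketch below shows one plausible reading of this adjustment rule; since the concrete update formula (equation (7)) is not recoverable from the source, the probe-doubling strategy, the threshold theta, and the function name are assumptions:

```python
# Hypothetical sketch of the sliding-window step prediction (cf. equation (7)).
def predict_window(graphs, start, dissim, w0=2, theta=0.5):
    """Grow the probe step while the frame `w` steps ahead still resembles
    the starting frame of the new shot; `dissim` compares two adjacency
    matrices (see the dissimilarity sketch below)."""
    w = w0
    while start + 2 * w < len(graphs):
        k = len(graphs[start])
        score = sum(dissim(a, b)
                    for a, b in zip(graphs[start], graphs[start + w])) / k
        if score > theta:   # content changed within w frames: stop growing
            break
        w *= 2              # assumed incremental step
    return w
```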
In an exemplary embodiment, in S103a1, the degree of difference between two adjacent frame images is obtained based on the difference of their corresponding adjacency matrices; the specific steps are as follows:
the dissimilarity score between graphs G and G' is measured by the sum of edge weight differences, expressed as:
d(G, G') = Σ_(i,j) |d_ij − d'_ij|    (4)
To eliminate the negative effects of singular sample data, the score is normalized to:
[equation (5): normalized dissimilarity score; the formula image is not recoverable]
where Δ_ij is calculated as:
[equation (6): definition of the per-edge term Δ_ij; the formula image is not recoverable]
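A minimal sketch of this dissimilarity score follows; because the exact normalization of equations (5)-(6) is not recoverable, division by the number of matrix entries stands in for it here:

```python
# Sketch of equation (4): sum of edge-weight differences, scaled to [0, 1]
# by the matrix size (a stand-in for the unrecoverable (5)-(6)).
import numpy as np

def graph_dissimilarity(A, B):
    n = A.shape[0]
    return np.abs(A - B).sum() / (n * n)   # |d_ij - d'_ij| summed over edges
```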
in S103a2, the current shot size is predicted, and a sliding window with a dynamically adjusted step size is adopted to analyze the degree of difference between frames of the current shot; the specific steps are as follows:
[equation (7): update rule for the sliding-window step size; the formula image is not recoverable]
where w_0 is the step size of the initial sliding window.
It should be appreciated that most methods employ a predefined fixed-size sliding window for inter-frame difference analysis. However, an important characteristic of real video is the high temporal variability of similar content: an event/shot in a video always lasts for several frames or more. It is therefore inappropriate to detect boundaries with a fixed-size sliding window across shots of different lengths and types. To further improve detection accuracy, a prediction strategy is adopted to automatically match a sliding window of suitable size.
Illustratively, in S103a3, the significance of the change in dissimilarity between the current consecutive frames is determined through hypothesis testing, so as to obtain the video shot boundary; the specific steps are as follows:
based on equations (8) and (9), hypothesis testing determines whether the video content of the current consecutive frames changes, i.e., whether a shot boundary is present:
H_0: z < k − 1 (no shot boundary)    (8)
H_1: z ≥ k − 1 (shot boundary)    (9)
where z = #{ p | D(G_m^p, G_n^p) > λ, 1 ≤ p ≤ k }, and D(G_m^p, G_n^p) denotes the dissimilarity measure score between subgraph G_m^p of frame f_m and subgraph G_n^p of frame f_n. #{·} is a counting function, so z represents the number of dissimilarity metric values that exceed the predefined threshold; k is the number of patches in a video frame; λ is the shot boundary detection threshold.
If at least k−1 of the k dissimilarity scores between the two frames are greater than the predefined threshold, a shot boundary is detected between the consecutive frames: the shot boundary is marked (the ending frame index of the previous shot and the starting frame index of the next shot), the current sliding window w is predicted through equation (7), and the detection process of a new shot begins; otherwise, detection continues.
The video frame is split into k patches, and hypothesis H_1 requires z to be at least k−1, for two reasons. On the one hand, local variations may cause false positives in shot detection; suppressing the influence of local variations in each pair of frame patches reduces the error rate of shot detection. On the other hand, dividing the video frame into k patches effectively improves the computational efficiency of the matrix operations.
It should be appreciated that in video frame set F, video frames of the same shot express similar content and follow the same data distribution. Shot boundary detection may be achieved by analyzing differences between successive frames, where significant differences indicate that there may be a boundary at the currently detected location.
Shot division is completed to obtain the shot boundary frame index set Index = {In_0, In_1, In_2, ..., In_M}, where In_0 is the initial first frame, i.e., In_0 = 1, and In_M is the last frame, i.e., In_M = n.
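For illustration, a sketch of this boundary decision follows, reusing graph_dissimilarity from the sketch above; the fixed threshold value and the direct frame-pair walk (in place of the unrecoverable window update of equation (7)) are assumptions:

```python
# Sketch of S103a3: declare a boundary when z >= k-1 per-patch dissimilarity
# scores exceed the threshold lambda (hypothesis H1). Indices are 0-based,
# unlike the 1-based In_0 = 1 convention in the text.
def detect_boundaries(graph_seq, lam=0.1):
    """graph_seq: one entry per frame, each a list of k adjacency matrices."""
    index_set = [0]                                  # In_0: first frame
    for t in range(1, len(graph_seq)):
        k = len(graph_seq[t])
        z = sum(graph_dissimilarity(a, b) > lam
                for a, b in zip(graph_seq[t - 1], graph_seq[t]))
        if z >= k - 1:                               # accept H1: shot boundary
            index_set.append(t)
    index_set.append(len(graph_seq) - 1)             # In_M: last frame
    return index_set
```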
As shown in fig. 4, in S104, a key frame in each video shot is extracted based on the median map of each video shot; the method comprises the following specific steps:
for the undirected weighted graphs corresponding to all the image frames in each video shot, calculating a graph with the smallest sum of the distances between the undirected weighted graphs and all other graphs in the shot, namely, a median graph; and selecting the frame corresponding to the median map as a key frame.
Illustratively, in S104, the key frames in each video shot are extracted based on the median graph calculation; the specific steps are as follows:
given the graph set S = {G_1, G_2, G_3, ..., G_N} of each shot, the median graph Ĝ is obtained by solving the minimization optimization problem:
Ĝ = argmin_(G ∈ S) Σ_(G_i ∈ S) d(G, G_i)    (10)
where d(G, G_i) is calculated over the k subgraphs as:
d(G, G_i) = Σ_(p=1..k) D(G^p, G_i^p)    (11)
It is apparent that, in equation (10), Ĝ is the graph with the smallest sum of distances relative to the other remaining graphs. Finally, the frame corresponding to the median graph is selected as the key frame;
and acquiring key frames of all shots to obtain a key frame set.
It should be noted that equation (10) ensures maximum similarity of the key frame to the remaining frames, i.e., the minimal sum of distances to the other frames. In addition, equation (11) helps overcome the sensitivity of frame difference analysis to local noise. The summary composed of the key frames thus extracted can comprehensively reflect the overall trend of a given video.
It will be appreciated that after the shots of a video are detected, the next step is to select the most informative and representative frame of each shot as the key frame. The basic idea is that the extracted key frame should be the frame most similar to the remaining frames in the shot. To this end, the concept of the median graph is introduced for the key frame selection task. In graph theory, the median graph is an effective tool for representing a set of graphs.
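A minimal sketch of this median-graph selection follows, again reusing graph_dissimilarity; summing the per-patch scores as the inter-graph distance matches the reconstruction of equation (11) above:

```python
# Sketch of S104: the key frame of a shot is the frame whose graph has the
# smallest summed distance to all other graphs in the shot (equation (10)).
import numpy as np

def median_frame_index(shot_graphs):
    """shot_graphs: one entry per frame in the shot, each a list of k
    adjacency matrices; returns the in-shot index of the key frame."""
    n = len(shot_graphs)
    totals = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                totals[i] += sum(graph_dissimilarity(a, b)    # equation (11)
                                 for a, b in zip(shot_graphs[i],
                                                 shot_graphs[j]))
    return int(np.argmin(totals))
```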
Example two
The embodiment also provides an automatic video abstract generating system based on graph structure difference analysis;
an automatic video abstract generating system based on graph structure difference analysis comprises:
a preprocessing module configured to: preprocessing a given video stream, and dividing each frame of image in the preprocessed video stream into a plurality of image blocks with equal size;
a feature extraction module configured to: extracting the characteristics of each image block to obtain the characteristic vector of each image block of each frame of image;
a shot boundary detection module configured to: establishing an undirected weighted graph of each frame of image according to the characteristic vector of each image block of each frame of image; detecting video shot boundaries based on hypothesis testing of graph structure difference analysis;
a key frame extraction module configured to: the key frames in each video shot are extracted based on the median graph of each video shot.
It should be noted that the preprocessing module, the feature extraction module, the shot boundary detection module, and the key frame extraction module correspond to steps S101 to S104 in the first embodiment, and the foregoing modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above may be implemented as part of a system in a computer system, such as a set of computer-executable instructions.
The foregoing embodiments are directed to various embodiments, and details of one embodiment may be found in the related description of another embodiment.
The proposed system may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into modules is merely a logical functional division, and other divisions are possible in actual implementations; for example, multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.
Example III
The embodiment also provides an electronic device, which includes a memory, a processor, and computer instructions stored in the memory and running on the processor, where each operation in the method is completed when the computer instructions are run by the processor, and for brevity, details are not repeated here.
The electronic device may be a mobile or non-mobile terminal. Non-mobile terminals include desktop computers; mobile terminals include smart phones (such as Android phones and iOS phones), smart glasses, smart watches, smart bracelets, tablet computers, notebook computers, personal digital assistants, and other mobile internet devices capable of wireless communication.
It should be understood that in this disclosure, the processor may be a central processing unit (CPU), another general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the present disclosure may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in a processor for execution. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided herein. Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the embodiments disclosed herein, i.e., the algorithm steps, can be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the units is merely a division of one logic function, and there may be additional divisions when actually implemented, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Example IV
The present embodiment also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the method of embodiment one.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (8)

1. The automatic video abstract generation method based on graph structure difference analysis is characterized by comprising the following steps of:
preprocessing a given video stream, and dividing each frame of image in the preprocessed video stream into a plurality of image blocks with equal size;
extracting the characteristics of each image block to obtain the characteristic vector of each image block of each frame of image;
establishing an undirected weighted graph of each frame of image according to the characteristic vector of each image block of each frame of image; detecting video shot boundaries based on hypothesis testing of graph structure difference analysis;
extracting key frames in each video shot based on the median graph of each video shot;
establishing an undirected weighted graph of each frame of image according to the characteristic vector of each image block of each frame of image; the method comprises the following specific steps:
taking the frequency components of the characteristic vector of each image block of each frame image as the nodes of the undirected weighted graph;
taking the connections between pairs of nodes as the edges of the undirected weighted graph;
taking the distance between the frequency amplitudes of two nodes as the weight of the corresponding edge;
expressing the undirected weighted graph as an adjacency matrix, and regularizing the adjacency matrix;
subjecting all frame images in the video frame set to the same processing to obtain the corresponding set of adjacency matrices;
detecting video shot boundaries based on hypothesis testing of graph structure difference analysis; the method comprises the following specific steps:
obtaining the degree of difference between two adjacent frame images based on the difference of their corresponding adjacency matrices;
predicting the size of the current shot, and analyzing the degree of difference between frames of the current shot using a sliding window with a dynamically adjusted step size;
and judging the significance of the change in the degree of difference between the current consecutive frames through hypothesis testing, so as to obtain the video shot boundaries.
2. The method of claim 1, wherein a given video stream is preprocessed, and each frame of image in the preprocessed video stream is divided into a plurality of equally sized image blocks; the method comprises the following specific steps:
sampling a given video stream to obtain a video frame set; the video frame set comprises video frames sampled from a given video stream;
and performing size scaling adjustment on each frame of image in the video frame set, and dividing each frame of image after adjustment into a plurality of image blocks with equal size.
3. The method of claim 1, wherein feature extraction is performed on each image block to obtain a feature vector for each image block of each frame of image; the method comprises the following specific steps:
extracting HSV color histograms from each image block of each frame of preprocessed image;
extracting an HOG direction gradient histogram from each image block of each preprocessed frame image;
and connecting the HSV color histogram and the HOG direction gradient histogram of each image block of each frame of image to obtain the characteristic vector of each image block of each frame of image.
4. The method of claim 1, wherein the step size of the sliding window is dynamically adjusted by comparing the degree of difference between the starting frame image of a new shot and the frame image a specified step away with a set threshold value, so as to predict the current shot size, thereby realizing the dynamic adjustment of the sliding-window step size for the current shot.
5. The method of claim 1, wherein the key frames in each video shot are extracted based on the median graph of each video shot; the method comprises the following specific steps:
for the undirected weighted graphs corresponding to all the image frames in each video shot, calculating the graph with the smallest sum of distances to all the other graphs in the shot, namely the median graph; and selecting the frame corresponding to the median graph as the key frame.
6. The automatic video abstract generating system based on graph structure difference analysis, which executes the automatic video abstract generating method based on graph structure difference analysis according to any one of claims 1 to 5, is characterized by comprising the following steps:
a preprocessing module configured to: preprocessing a given video stream, and dividing each frame of image in the preprocessed video stream into a plurality of image blocks with equal size;
a feature extraction module configured to: extracting the characteristics of each image block to obtain the characteristic vector of each image block of each frame of image;
a shot boundary detection module configured to: establishing an undirected weighted graph of each frame of image according to the characteristic vector of each image block of each frame of image; detecting video shot boundaries based on hypothesis testing of graph structure difference analysis;
a key frame extraction module configured to: the key frames in each video shot are extracted based on the median graph of each video shot.
7. An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of any of claims 1-5.
8. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any of claims 1-5.
CN202010376813.6A 2020-05-07 2020-05-07 Automatic video abstract generation method and system based on graph structure difference analysis Active CN111625683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010376813.6A CN111625683B (en) 2020-05-07 2020-05-07 Automatic video abstract generation method and system based on graph structure difference analysis


Publications (2)

Publication Number Publication Date
CN111625683A CN111625683A (en) 2020-09-04
CN111625683B true CN111625683B (en) 2023-05-23

Family

ID=72272797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010376813.6A Active CN111625683B (en) 2020-05-07 2020-05-07 Automatic video abstract generation method and system based on graph structure difference analysis

Country Status (1)

Country Link
CN (1) CN111625683B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380962A (en) * 2020-11-11 2021-02-19 成都摘果子科技有限公司 Animal image identification method and system based on deep learning
CN114079824B (en) * 2021-11-02 2024-03-08 深圳市洲明科技股份有限公司 Transmission card, control method thereof, display device, computer device, and storage medium
CN114090168A (en) * 2022-01-24 2022-02-25 麒麟软件有限公司 Self-adaptive adjusting method for image output window of QEMU (QEMU virtual machine)
CN117177004B (en) * 2023-04-23 2024-05-31 青岛尘元科技信息有限公司 Content frame extraction method, device, equipment and storage medium

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN101216886B (en) * 2008-01-11 2010-06-09 北京航空航天大学 A shot clustering method based on spectral segmentation theory
CN101425088A (en) * 2008-10-24 2009-05-06 清华大学 Key frame extracting method and system based on chart partition
CN101951511B (en) * 2010-08-19 2012-11-28 深圳市亮信科技有限公司 Method for layering video scenes by analyzing depth
CN110913243B (en) * 2018-09-14 2021-09-14 华为技术有限公司 Video auditing method, device and equipment
CN111078943B (en) * 2018-10-18 2023-07-04 山西医学期刊社 Video text abstract generation method and device
CN110163239B (en) * 2019-01-25 2022-08-09 太原理工大学 Weak supervision image semantic segmentation method based on super-pixel and conditional random field

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN101833569A (en) * 2010-04-08 2010-09-15 中国科学院自动化研究所 Method for automatically identifying film human face image
CN102184242A (en) * 2011-05-16 2011-09-14 天津大学 Cross-camera video abstract extracting method
CN108600865A (en) * 2018-05-14 2018-09-28 西安理工大学 A kind of video abstraction generating method based on super-pixel segmentation
CN110210379A (en) * 2019-05-30 2019-09-06 北京工业大学 A kind of lens boundary detection method of combination critical movements feature and color characteristic

Non-Patent Citations (1)

Title
Video key frame extraction method based on dominating set ("基于支配集的视频关键帧提取方法"); 聂秀山; 柴彦娥; 滕聪; Journal of Computer Research and Development (计算机研究与发展) (12); 225-233 *

Also Published As

Publication number Publication date
CN111625683A (en) 2020-09-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231123

Address after: Room 955, 8th Floor, Building B, 1st to 14th Floor, Building 1, No. 59 Huahua Road, Chaoyang District, Beijing, 100020

Patentee after: Beijing Senbo Mingde Marketing Technology Co.,Ltd.

Address before: 250014 No. 88, Wenhua East Road, Lixia District, Shandong, Ji'nan

Patentee before: SHANDONG NORMAL University