CN112883227B - Video summary generation method and apparatus based on multi-scale time sequence features


Info

Publication number: CN112883227B (application number CN202110019685.4A)
Authority: CN (China)
Prior art keywords: sequence, video, time, feature, scale
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN112883227A
Inventors: He Zhiqiang (贺志强), Niu Kai (牛凯), Zhang Yijie (张一杰), Chen Yun (陈云)
Current Assignee: Beijing University of Posts and Telecommunications
Original Assignee: Beijing University of Posts and Telecommunications
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202110019685.4A
Publication of application CN112883227A
Application granted; publication of CN112883227B

Classifications

    • G06F16/739: Information retrieval of video data; presentation of query results in the form of a video summary, e.g. a video sequence, a composite still image or synthesized frames
    • G06F16/75: Information retrieval of video data; clustering; classification
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25: Pattern recognition; fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiment of the present application provides a video summary generation method and apparatus based on multi-scale time sequence features. The method comprises the following steps: obtaining a multi-scale time sequence fusion feature sequence by using a pre-trained video summary generation model; determining the importance score of each video frame in the multi-scale time sequence fusion feature sequence by using the pre-trained video summary generation model; segmenting the multi-scale time sequence fusion feature sequence, based on a shot segmentation algorithm, into a set of basic segments in units of basic shots; dynamically selecting core segments in the basic segment set, based on the importance scores and the basic segment set, by using the pre-trained video summary generation model; and generating a dynamic video summary based on the core segments by using the pre-trained video summary generation model, and outputting the dynamic video summary. With this scheme, a video summary generation model obtained through unsupervised training is used to extract video key frames and obtain a diverse and representative video summary, which reduces the workload of manual intervention and facilitates video retrieval and video surveillance.

Description

Video summary generation method and apparatus based on multi-scale time sequence features
Technical Field
One or more embodiments of the present disclosure relate to the field of computer vision technologies, and in particular, to a method and an apparatus for generating a video summary based on multi-scale time sequence features.
Background
With the development of mobile internet, video surveillance and related fields, video recording devices generate large amounts of video at every moment, and the volume of video data is growing explosively. In general, an administrator has to watch a video in full in order to grasp its main content and screen out the useful segments.
In emerging multimedia services based on video content, such as fast browsing, video retrieval and video surveillance, efficiently acquiring key information from massive amounts of video has become one of the problems urgently awaiting a solution.
Disclosure of Invention
In view of the above, an object of one or more embodiments of the present specification is to provide a method, an apparatus, a device, and a storage medium for generating a video summary based on multi-scale time-series characteristics, so as to solve the problem of how to efficiently acquire key information from a large amount of videos.
In view of the above, one or more embodiments of the present specification provide a method for generating a video summary based on multi-scale time sequence features, including:
obtaining a multi-scale time sequence fusion feature sequence by using a pre-trained video summary generation model;
determining the importance score of each video frame in the multi-scale time sequence fusion feature sequence by using the pre-trained video summary generation model;
segmenting the multi-scale time sequence fusion feature sequence, based on a shot segmentation algorithm, into a set of basic segments in units of basic shots, wherein each basic segment in the basic segment set comprises at least one video frame;
dynamically selecting core segments in the basic segment set, based on the importance scores and the basic segment set, by using the pre-trained video summary generation model;
and generating a dynamic video summary based on the core segments by using the pre-trained video summary generation model, and outputting the dynamic video summary.
Further, the method further comprises:
acquiring a target source video sequence;
determining a target source video frame feature vector sequence according to the target source video sequence and a pre-trained multi-target classification model;
sampling the target source video frame feature vector sequence and performing normalized compression coding to obtain a compressed coding feature sequence of uniform size;
and performing multi-scale time sequence fusion on the compressed coding characteristic sequence to obtain a multi-scale time sequence fusion characteristic sequence.
Further, performing multi-scale time sequence fusion on the compressed coding feature sequence to obtain a multi-scale time sequence fusion feature sequence, including:
performing multi-level time sequence perception on the compressed coding characteristic sequence, and extracting a multi-level short-time characteristic vector sequence corresponding to the compressed coding characteristic sequence;
and determining a multi-scale time sequence fusion characteristic sequence based on the multi-level short-time characteristic vector sequence and the multi-branch association analysis network.
Further, determining a multi-scale time sequence fusion feature sequence based on the multi-level short-time feature vector sequence and the multi-branch association analysis network, including:
performing correlation coefficient calculation, weight vector coding and feature normalization processing on the multi-level short-time feature vector sequence by using a multi-branch correlation analysis network to obtain a long-time feature vector sequence corresponding to multiple branches;
dimension fusion is carried out on each long-time characteristic vector sequence corresponding to the multiple branches, and a multi-scale time sequence fusion characteristic sequence with the same scale as the target source video sequence is obtained through full connection.
Further, the method further comprises:
acquiring an initial generative adversarial network and a training sample set, wherein the training sample set comprises a multi-scale time sequence fusion feature sequence, the annotated importance score of each video frame in the multi-scale time sequence fusion feature sequence, and the annotated dynamic video summary corresponding to each segment of the multi-scale time sequence fusion feature sequence;
and taking the multi-scale time sequence fusion feature sequence in the training sample set as the input of the initial generative adversarial network, taking the annotated importance scores of the video frames in the multi-scale time sequence fusion feature sequence and the annotated dynamic video summaries corresponding to the segments of the multi-scale time sequence fusion feature sequence as the expected output, and iteratively training the initial generative adversarial network to finally obtain the pre-trained video summary generation model.
Further, the iterative training of the initial generative adversarial network comprises:
the following iterative steps are performed a plurality of times:
performing key frame sampling on the multi-scale time sequence fusion characteristic sequence in the training sample set to obtain a key frame set;
reconstructing a video sequence based on the key frame set to obtain a reconstructed feature sequence;
calculating the similarity of the reconstruction characteristic sequence and the compression coding characteristic sequence;
updating a key frame set obtained by sampling key frames of the multi-scale time sequence fusion characteristic sequence in the training sample set according to the similarity and a preset length threshold of the key frame set;
and in response to determining that the similarity is greater than a preset similarity threshold and the length of the key frame set is less than the preset length threshold of the key frame set, ending the training of the initial generative adversarial network to obtain the pre-trained video summary generation model.
A video summary generation device based on multi-scale time sequence characteristics is characterized by comprising:
an obtaining unit configured to obtain a multi-scale time sequence fusion feature sequence by using a pre-trained video summary generation model;
an importance score determination unit configured to determine an importance score of each video frame in the multi-scale time series fusion feature sequence using a pre-trained video summary generation model;
the segmentation unit is configured to segment the multi-scale time sequence fusion feature sequence into a basic segment set taking a basic shot as a unit based on a shot segmentation algorithm, wherein each basic segment in the basic segment set comprises at least one video frame;
a core segment selection unit configured to dynamically select core segments in the basic segment set based on each importance score and the basic segment set by using a pre-trained video summary generation model;
and the video abstract generating unit is configured to generate and output the dynamic video abstract based on the core segments by utilizing the pre-trained video abstract generating model.
Further, the obtaining unit is further configured to: acquiring a target source video sequence; and
the device also includes:
the target source video frame feature vector sequence determining unit is configured to determine a target source video frame feature vector sequence according to the target source video sequence and the pre-trained multi-target classification model;
the compression coding characteristic sequence determining unit is configured to sample and normalize compression coding on a target source video frame characteristic vector sequence to obtain a compression coding characteristic sequence with uniform size;
and the multi-scale time sequence fusion characteristic sequence determining unit is configured to perform multi-scale time sequence fusion on the compressed coding characteristic sequence to obtain a multi-scale time sequence fusion characteristic sequence.
Further, the multi-scale time-series fusion feature sequence determination unit is further configured to:
performing multi-level time sequence perception on the compressed coding characteristic sequence, and extracting a multi-level short-time characteristic vector sequence corresponding to the compressed coding characteristic sequence;
and determining a multi-scale time sequence fusion characteristic sequence based on the multi-level short-time characteristic vector sequence and the multi-branch association analysis network.
Further, the multi-scale time-series fusion feature sequence determination unit is further configured to:
performing correlation coefficient calculation, weight vector coding and feature normalization processing on the multi-level short-time feature vector sequence by using a multi-branch correlation analysis network to obtain a long-time feature vector sequence corresponding to multiple branches;
dimension fusion is carried out on each long-time characteristic vector sequence corresponding to the multiple branches, and a multi-scale time sequence fusion characteristic sequence with the same scale as the target source video sequence is obtained through full connection.
Further, the obtaining unit is further configured to: acquire an initial generative adversarial network and a training sample set, wherein the training sample set comprises a multi-scale time sequence fusion feature sequence, the annotated importance score of each video frame in the multi-scale time sequence fusion feature sequence, and the annotated dynamic video summary corresponding to each segment of the multi-scale time sequence fusion feature sequence; and
the device still includes:
the training unit is configured to take the multi-scale time sequence fusion feature sequence in the training sample set as input of an initial generation countermeasure network, take the importance scores of all video frames in the marked multi-scale time sequence fusion feature sequence and the dynamic video summaries corresponding to all segments in the marked multi-scale time sequence fusion feature sequence as expected output, conduct iterative training on the initial generation countermeasure network based on a preset cost function used by unsupervised training, and finally obtain a pre-trained video summary generation model.
Further, the training unit is further configured to:
the following iterative steps are performed a plurality of times:
performing key frame sampling on the multi-scale time sequence fusion characteristic sequence in the training sample set to obtain a key frame set;
reconstructing a video sequence based on the key frame set to obtain a reconstructed feature sequence;
calculating the similarity of the reconstruction characteristic sequence and the compression coding characteristic sequence;
updating a key frame set obtained by sampling key frames of the multi-scale time sequence fusion characteristic sequence in the training sample set according to the similarity and a preset length threshold of the key frame set;
and in response to determining that the similarity is greater than the preset similarity threshold and the length of the key frame set is less than the preset length threshold of the key frame set, ending the training of the initial generative adversarial network to obtain the pre-trained video summary generation model.
An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method for generating a video summary based on multi-scale time sequence characteristics.
A non-transitory computer readable storage medium storing computer instructions for causing a computer to execute the method for generating a video summary based on multi-scale time series characteristics as described above.
As can be seen from the above, in the method, the apparatus, the device, and the storage medium for generating a video summary based on multi-scale time sequence characteristics provided in one or more embodiments of the present specification, a video summary generation model obtained by unsupervised training is used to extract video key frames, obtain video summaries with diversity and representativeness, reduce the workload of manual intervention, and facilitate video retrieval and video monitoring.
Drawings
In order to more clearly illustrate one or more embodiments or prior art solutions of the present specification, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are only one or more embodiments of the present specification, and that other drawings may be obtained by those skilled in the art without inventive effort from these drawings.
Fig. 1 is a schematic diagram illustrating a video summary generation method based on multi-scale time sequence characteristics according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a video summary generation method based on multi-scale time sequence characteristics according to another embodiment of the present disclosure;
fig. 3 is a block diagram illustrating a structure of a video summary generation apparatus based on multi-scale time-series characteristics according to an embodiment of the present disclosure;
fig. 4 is a hardware structural diagram of an electronic device of a video summary generation method based on multi-scale time sequence characteristics according to an embodiment of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the specification is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
Fig. 1 shows a schematic diagram 100 of a flow framework of a multi-scale temporal feature-based video summary generation method of the present application. As shown in the flowchart frame of fig. 1, the method for generating a video summary based on multi-scale time series characteristics according to this embodiment may include the following steps:
Step 101, obtaining a multi-scale time sequence fusion feature sequence by using a pre-trained video summary generation model.
The execution subject of the video summary generation method based on multi-scale time sequence features of this embodiment (for example, a system including a pre-trained video summary generation model, a pre-trained multi-target classification model, a multi-branch correlation analysis network, and the like) may obtain the multi-scale time sequence fusion feature sequence of a video by using the pre-trained video summary generation model. The pre-trained video summary generation model can generate, as an intermediate product, a multi-scale time sequence fusion feature sequence corresponding to the input video.
Step 102, determining the importance score of each video frame in the multi-scale time sequence fusion feature sequence by using the pre-trained video summary generation model.
The pre-trained video summary generation model may determine the importance score of each video frame in the multi-scale time sequence fusion feature sequence from the generated sequence; it can be understood that the importance scores are also an intermediate product of the pre-trained video summary generation model.
Step 103, segmenting the multi-scale time sequence fusion feature sequence into a basic segment set in units of basic shots based on a shot segmentation algorithm.
Each basic segment in the basic segment set comprises at least one video frame.
The execution subject may segment the intermediate product of the pre-trained video summary generation model, i.e. the multi-scale time sequence fusion feature sequence, into a basic segment set in units of basic shots based on a shot segmentation algorithm. Of course, the execution subject may also, based on the shot segmentation algorithm, segment the source video or the compressed coding feature sequence into video segments each containing only a single scene, forming a segment set corresponding to the source video sequence or to the compressed coding feature sequence; either of these may serve as the above basic segment set. The content of the basic segment set is not specifically limited in the present application.
In this embodiment, a Kernel Temporal Segmentation (KTS) algorithm is used for shot segmentation, dynamic programming is used to select the core segments, and the video summary is finally output.
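By way of illustration only, the sketch below shows one way kernel temporal segmentation can be realized with a linear kernel and dynamic programming; the function name, the linear kernel, the segment-count penalty and the unoptimized triple loop are assumptions of this sketch and do not prescribe the actual implementation of the embodiment.

```python
import numpy as np

def kts_change_points(features, max_segments, penalty=1.0):
    """Illustrative kernel temporal segmentation: split an (N, D) frame-feature
    sequence into shots by minimizing within-segment scatter with dynamic programming."""
    n = features.shape[0]
    gram = features @ features.T                          # linear kernel matrix
    diag_cs = np.concatenate(([0.0], np.cumsum(np.diag(gram))))
    block_cs = np.cumsum(np.cumsum(gram, axis=0), axis=1)

    def block_sum(i, j):                                  # sum of gram[i..j, i..j]
        s = block_cs[j, j]
        if i > 0:
            s += block_cs[i - 1, i - 1] - block_cs[i - 1, j] - block_cs[j, i - 1]
        return s

    def scatter(i, j):                                    # within-segment scatter of frames i..j
        return (diag_cs[j + 1] - diag_cs[i]) - block_sum(i, j) / (j - i + 1)

    # dp[k, j]: minimal scatter when frames 0..j-1 are split into k segments
    dp = np.full((max_segments + 1, n + 1), np.inf)
    back = np.zeros((max_segments + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, max_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = dp[k - 1, i] + scatter(i, j - 1)
                if cost < dp[k, j]:
                    dp[k, j], back[k, j] = cost, i
    best_k = min(range(1, max_segments + 1), key=lambda k: dp[k, n] + penalty * k)
    cps, j = [], n                                        # recover change points
    for k in range(best_k, 0, -1):
        j = back[k, j]
        if j > 0:
            cps.append(j)
    return sorted(cps)
```

The returned change points split the frame sequence into the basic shots used as the basic segment set of step 103.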
Step 104, dynamically selecting core segments in the basic segment set based on the importance scores and the basic segment set by using the pre-trained video summary generation model.
The execution subject can use the pre-trained video summary generation model to compute, for each basic segment in the basic segment set, the sum of the importance scores of the video frames it contains, and use this sum as the extraction weight of the corresponding basic segment; the core segments in the basic segment set are then dynamically selected by considering the extraction weights over the whole basic segment set together with a preset extraction weight threshold. It can be understood that the execution subject may determine the basic segments whose extraction weight is greater than the preset extraction weight threshold as core segments. Of course, the core segments in the basic segment set may also be selected through a preset extraction weight threshold together with a preset number of segments to extract; the method of selecting core segments in the basic segment set is not specifically limited in the present application.
Step 105, generating a dynamic video summary based on the core segments by using the pre-trained video summary generation model, and outputting the dynamic video summary.
The execution subject uses the pre-trained video summary generation model to combine the selected core segments into a dynamic video summary and outputs it through a display screen or a removable storage device. In this embodiment, key segments are selected to compose the dynamic video summary, and the most representative summary is obtained by maximizing the weight score of the summary; at the same time, a regularization constraint is imposed when selecting the key segments in order to limit the length of the generated summary.
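The dynamic selection in steps 104 and 105, i.e. maximizing the summed extraction weights of the chosen shots while limiting the total summary length, can be viewed as a 0/1 knapsack problem. The sketch below is illustrative only; the frame-count budget, the per-shot weights and all names are assumptions.

```python
def select_core_segments(segments, scores, budget_frames):
    """Pick the subset of shots with maximal summed importance subject to a
    total-length (frame count) budget, i.e. a 0/1 knapsack by dynamic programming.

    segments: list of (start, end) frame indices, end exclusive.
    scores:   per-frame importance scores covering the whole sequence.
    """
    weights = [sum(scores[s:e]) for s, e in segments]    # extraction weight per shot
    lengths = [e - s for s, e in segments]
    n = len(segments)
    dp = [0.0] * (budget_frames + 1)                     # dp[c]: best weight within capacity c
    keep = [[False] * (budget_frames + 1) for _ in range(n)]
    for i in range(n):
        for c in range(budget_frames, lengths[i] - 1, -1):
            cand = dp[c - lengths[i]] + weights[i]
            if cand > dp[c]:
                dp[c] = cand
                keep[i][c] = True
    chosen, c = [], budget_frames                        # backtrack the chosen shots
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            chosen.append(i)
            c -= lengths[i]
    return sorted(chosen)                                # indices of the core segments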
In this embodiment, a video summary generation model obtained through unsupervised training is used to extract video key frames and obtain a diverse and representative video summary, which reduces the workload of manual intervention and facilitates video retrieval and video surveillance.
With continued reference to fig. 2, a flow 200 of another embodiment of a method for generating a video summary based on multi-scale temporal features according to the present application is shown. As shown in fig. 2, the method for generating a video summary based on multi-scale time-series characteristics according to this embodiment may include the following steps:
step 201, a target source video sequence is obtained.
The execution subject may obtain the target source video sequence by means of a wired connection or a wireless connection. The target source video sequence may be, for example, a diving video sequence of a diving athlete, and the specific content of the target source video sequence is not limited in the present application.
Step 202, determining a target source video frame feature vector sequence according to the target source video sequence and the pre-trained multi-target classification model.
The image size and picture quality of different videos in the target source video sequence can differ greatly. The execution subject can perform feature extraction on the video frames of the target source video sequence one by one to obtain a target source video frame feature vector sequence of consistent dimensions; specifically, the execution subject can feed the set of video frames of the target source video sequence into the pre-trained multi-target classification model to extract its image features and obtain the target source video frame feature vector sequence.
Step 203, sampling the target source video frame feature vector sequence and performing normalized compression coding to obtain a compressed coding feature sequence of uniform size.
After the execution subject obtains the target source video frame feature vector sequence, it can sample the sequence and apply normalized compression coding to obtain a coding feature sequence of consistent dimensions. The compression coding effectively extracts key features such as target edge contours and reduces the influence of interference factors such as motion blur on video summary generation; the size normalization makes it possible to train an end-to-end video summary generation model, so that video sequences of different scales can be processed while the amount of computation is effectively reduced and the operating efficiency is improved.
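As an illustrative sketch of steps 202 and 203, the code below extracts per-frame features with an ImageNet-pretrained backbone and normalizes the sequence to a uniform size. The choice of a torchvision GoogLeNet as a stand-in for the pre-trained multi-target classification model, the sampling stride, the target sequence length and the use of linear interpolation as the size normalization are all assumptions of the sketch (a recent torchvision is also assumed).

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms

# Assumed backbone: any ImageNet-pretrained classifier can play the role of the
# "pre-trained multi-target classification model"; GoogLeNet is only an example.
backbone = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()            # keep the 1024-d pooled feature vector
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),           # normalize the frame size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def encode_video(frames, stride=15, target_len=320):
    """frames: list of HxWx3 uint8 arrays. Returns a (target_len, 1024) tensor:
    temporal sampling, per-frame feature extraction, then size normalization."""
    sampled = frames[::stride]                                   # temporal down-sampling
    batch = torch.stack([preprocess(f) for f in sampled])
    feats = backbone(batch)                                      # (n, 1024) frame features
    feats = F.normalize(feats, dim=1)                            # per-frame normalization
    feats = F.interpolate(feats.t().unsqueeze(0), size=target_len,
                          mode="linear", align_corners=False)    # uniform sequence length
    return feats.squeeze(0).t()                                  # (target_len, 1024)
```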
Step 204, performing multi-scale time sequence fusion on the compressed coding feature sequence to obtain a multi-scale time sequence fusion feature sequence.
Specifically, step 204 can also be implemented by steps 2041 to 2042:
and 2041, performing multilevel time sequence sensing on the compressed coding feature sequence, and extracting a multilevel short-time feature vector sequence corresponding to the compressed coding feature sequence.
In this embodiment, key information is extracted by extracting multi-level image features. Specifically, for an input compressed coding characteristic sequence X, a multi-level time sequence sensing network is constructed to perform multi-level time sequence sensing on the input compressed coding characteristic sequence X. And acquiring the characteristic information of the input compressed coding characteristic sequence X on different scales by using a short-time characteristic extraction unit in the multi-level time sequence sensing network, and extracting a shallow visual characteristic and a deep semantic characteristic, namely a multi-level short-time characteristic vector sequence. Illustratively, for a given level r, the short time cell corresponding to time t
Figure BDA0002888218600000081
The time sequence range [ t-tau, t + tau ] of the previous level can be obtained]Internal short-time feature extraction unit
Figure BDA0002888218600000082
The output feature vector is used as a current short-time feature extraction unit
Figure BDA0002888218600000091
Is input. Obtaining a feature vector through input feature fusion and short-term feature extraction
Figure BDA0002888218600000092
And passed to the short-term feature extraction unit of the next level r +1 for processing. The time sequence sensing range of the short-time feature extraction unit can be effectively expanded, the features of adjacent video frames are fused, inter-frame similarity measurement is carried out on the basis of single-frame feature extraction, key information in the sensing range is extracted, and similar redundant information interference is reduced.
Specifically, the multi-level time sequence perception network is constructed from short-time feature extraction units. The extracted compressed coding feature sequence is fed into the trained multi-level network to obtain a multi-level short-time feature vector sequence that includes shallow visual features and deep semantic features. The short-time feature extraction unit may specifically be as follows.
The short-time feature extraction unit receives a given number of input feature vectors (for example, a given number of vectors of the compressed coding feature sequence) and outputs a perceptual feature vector after computation. Each short-time feature extraction unit S is assigned a time sequence perception domain τ and takes as input the feature vectors within the time sequence range {t + i | i ∈ [−τ, +τ]}; a hole connection operation can further expand the perception range of the extraction unit. Denoting the hole-connection diffusion factor by d, the extraction unit S receives the feature vectors within the time sequence range {t + i·d | i ∈ [−τ, +τ], d ≥ 1}. For a short-time feature extraction unit at a given time t, the execution flow is as shown in the following equation (1):

s_t^(r) = f(σ({x_j^(r) | j ∈ T_t})),  T_t = {t + i·d | i ∈ [−τ, +τ], d ≥ 1}    (1)

where f(·) denotes the feature extraction operation of the short-time unit under a given perception range, σ(·) denotes the feature vector dimension fusion operation, and, over the perception range, σ(·) specifically refers to feature vector dimension fusion using the hole connection. {x_j^(r) | j ∈ T_t} is the set of feature vectors within the perception range of the current short-time unit and is the set used for the dimension fusion operation; T_t denotes the set of time indices selected by the hole connection operation, and r denotes the current feature extraction level. x denotes the input feature of a given short-time unit and s its output feature; j denotes the time index within the current perception range; t denotes a given time instant; i denotes the time offset of a given input feature within the perception range around time t; and d is the offset diffusion factor.
The hole connection operation refers to connecting a plurality of non-adjacent time sequence feature vectors so as to obtain a wider time sequence perception range and to extract and fuse short-time feature information. Given feature vectors of dimension n × 1, the feature vectors within the local perception range T_t of the current level are sequentially stacked and concatenated to obtain a feature map of dimension (2τ+1) × n × 1. A 1 × 1 convolution is then applied for dimension transformation, yielding a fused short-time feature vector of the same dimension n × 1. The fused short-time feature vector effectively represents the key information within the given time sequence perception range and expands the semantic expression range of the feature vector.
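A minimal sketch of such a short-time feature extraction unit is given below, assuming PyTorch. Gathering the features inside the dilated perception range and fusing the stacked (2τ+1) × n map with a 1 × 1 convolution follow the text above, while the module name, the index clamping at the sequence boundaries and the ReLU activation are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class ShortTimeUnit(nn.Module):
    """Illustrative short-time feature extraction unit: gather features inside the
    dilated perception range {t + i*d | i in [-tau, +tau]}, stack them, and fuse the
    (2*tau+1) x n map back to an n-dimensional vector with a 1x1 convolution."""
    def __init__(self, dim, tau=2, dilation=1):
        super().__init__()
        self.tau, self.d = tau, dilation
        self.fuse = nn.Conv1d(in_channels=2 * tau + 1, out_channels=1, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):                       # x: (T, dim) sequence of feature vectors
        T = x.shape[0]
        outputs = []
        for t in range(T):
            # indices in the perception range, clamped at the sequence boundaries
            idx = [min(max(t + i * self.d, 0), T - 1) for i in range(-self.tau, self.tau + 1)]
            window = x[idx]                                            # (2*tau+1, dim)
            fused = self.fuse(window.unsqueeze(0)).squeeze(0).squeeze(0)  # 1x1 conv -> (dim,)
            outputs.append(self.act(fused))
        return torch.stack(outputs)             # (T, dim), same scale as the input
```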
Specifically, the multi-level time sequence perception network may be constructed from the short-time feature extraction units as follows:
The short-time feature extraction units perform further feature extraction on the coding feature sequence level by level. A multi-level feature extraction network is built by stacking short-time units over multiple levels, so that feature information is obtained at the respective scale of each level: shallow units extract visual image features and deep units extract semantic image features. For a given level r, the short-time feature extraction basic units S_t^(r) at the different times t together form the current short-time feature extraction layer, and the set of feature vectors output by these basic units is the short-time feature sequence S_r at level r. Denoting the total number of levels of the feature extraction network by L, the time sequence perception range of a deep short-time unit expands to cover the accumulated local perception ranges of the lower levels, on the order of Σ_{r=1..L} τ^(r)·d, where τ^(r) denotes the local perception range at level r. Feature extraction through the multi-level network yields a multi-level short-time feature sequence set S comprising shallow feature information, intermediate hidden feature information and deep semantic feature information.
The multi-level short-time feature extraction network can be built from common time sequence modeling networks such as the long short-term memory network and the time convolution network. In general, to expand the perception range of the structural units in a time sequence modeling task, the usual strategy is to connect more structural units or to use a deeper hierarchical network; the parameter scale of the computation units then grows exponentially and stable convergence becomes difficult. In the multi-level feature extraction network of the present application, a wider adjacent-frame perception domain can be obtained effectively by introducing the hole connection structure, and the approach applies both to time sequence modeling methods that perform cyclic computation with a single structural unit, such as the long short-term memory network, and to those that perform parallel computation with multiple structural units, such as the time convolution network.
Next, specific embodiments of the hole connection operation in different time sequence modeling network forms are given. Generally speaking, the input of the short-time feature extraction unit is a source video coding feature vector or an output feature vector of a short-time unit at a corresponding moment of a previous layer, and the dimension of the input feature vector of each short-time unit is consistent with that of the output vector, so as to ensure multi-layer multiplexing of the short-time units. The explanation is given by taking a long-short term memory network and a time convolution network as examples respectively.
The basic computation unit of the long short-term memory network consists of a forget gate, an input gate and an output gate, and the state of the computation unit is updated by controlling the gate structure. For time sequence state t, the input information x̃_t of the computation unit is composed of the current input feature vector x_t and the previous time sequence state h_{t−1}, as shown in equation (2), where the vector mapping matrix W is a network parameter, b denotes the regularization offset, [·;·] denotes column-wise concatenation of vectors, and the feature extraction operation f(·) of the short-time unit under a given perception range, described above, here denotes the vector computation performed by the LSTM structural unit:

x̃_t = W[x_t; h_{t−1}] + b,  h_t = f(x̃_t)    (2)

Simply stacking computation units can extract high-dimensional feature vectors, but the computation units of the different levels remain relatively isolated, short-time memory is not expanded, and time sequence context information is not fully used. When high-order features are extracted, shallow visual feature information is lost, more computing resources are consumed, and the improvement in the performance of the time sequence modeling network is very limited. With the hole connection operation of the present application, the hidden-layer feature of the previous time sequence state and the output features within the time sequence perception range of the previous layer are jointly used as input information to update the state information of the structural unit. For time sequence state t, the input information x̃_t^(r) of the r-th layer structural unit is as shown in equation (3):

x̃_t^(r) = W[x_t^(r); h_{t−1}^(r)] + b,  x_t^(r) = σ({h_j^(r−1) | j ∈ T_t})    (3)

where the input feature vector x_t^(r) is obtained by the hole connection operation on the previous layer, and h_t^(r) denotes the output hidden-layer state of the r-th layer structural unit at time sequence state t. In particular, when the time sequence perception domain τ is 0, the perception range contains no hole connection and the operation degenerates into the input operation of an ordinary long short-term memory network.
For the time convolution network, a sequence of arbitrary length can be mapped to an output sequence of the same length, and deep structural units obtain a wider perception domain by stacking multiple layers of basic structural units. A time sequence modeling network built with hole (dilated) convolution structural units can effectively obtain a wider adjacent-frame perception domain using a small convolution kernel, a limited number of network layers and an increasing diffusion factor. The input of the feature extraction unit can be expressed as {x_j | j ∈ T_t}, where the time sequence perception range T_t = {t + i·d | i ∈ [−τ, +τ], d ≥ 1}.
In particular, when causal convolution is employed, sequence information is obtained only from preceding states, and the time sequence perception domain shrinks to i ≤ 0.
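By way of example only, the sketch below shows how the hole connection of equation (3) can feed an LSTM layer: the input at each step is the dimension fusion of the previous level's hidden states inside the dilated window, and the cell's own gates combine it with the previous hidden state. The linear fusion layer, the boundary clamping and the zero initial states are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class DilatedLSTMLayer(nn.Module):
    """Illustrative LSTM layer whose input at step t is the hole-connection fusion of
    the previous layer's outputs inside {t + i*d | i in [-tau, +tau]}."""
    def __init__(self, dim, tau=2, dilation=2):
        super().__init__()
        self.tau, self.d = tau, dilation
        self.fuse = nn.Linear((2 * tau + 1) * dim, dim)     # dimension-fusion sigma(.)
        self.cell = nn.LSTMCell(input_size=dim, hidden_size=dim)

    def forward(self, x):                                   # x: (T, dim) from the previous level
        T, dim = x.shape
        h = torch.zeros(1, dim)
        c = torch.zeros(1, dim)
        outputs = []
        for t in range(T):
            idx = [min(max(t + i * self.d, 0), T - 1) for i in range(-self.tau, self.tau + 1)]
            fused = self.fuse(x[idx].reshape(1, -1))        # fused input x_t of this level
            h, c = self.cell(fused, (h, c))                 # gates combine x_t with h_{t-1}
            outputs.append(h.squeeze(0))
        return torch.stack(outputs)                         # (T, dim) hidden-state sequence
```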
Step 2042, determining a multi-scale time sequence fusion feature sequence based on the multi-level short-time feature vector sequence and the multi-branch association analysis network.
In this embodiment, the multi-branch correlation analysis network may be constructed as follows:
The time sequence modeling network is illustrated here with a time convolution network. Exemplarily, for a short-time feature vector s_t, its query vector q_t = s_t·W_Q, key vector k_t = s_t·W_K and value vector v_t = s_t·W_V are constructed. The dot product of the query vector q_t with the key vector k_i of each video frame of the video sequence is computed and normalized with the softmax function to obtain the correlation coefficient α_{t,i} between video frame s_t and video frame s_i. Taking α_{t,i} as the attention correlation coefficient between the video frame at time t and each time i in the sequence, the value vectors v_i of all frames are weighted and summed to obtain the weight encoding z_t of the video feature vector s_t, as shown in equations (4) and (5):

α_{t,i} = softmax(q_t·k_i^T / √k)    (4)

z_t = Σ_i α_{t,i}·v_i    (5)

where α_{t,i} denotes the correlation coefficient between the video frame at a given time t and the video frame at any time i in the sequence; s denotes the short-time feature vector at a given time; W_Q, W_K and W_V denote the vector mapping coefficient matrices used to construct the query, key and value vectors respectively; and (·)^T denotes the transpose operation.
A single correlation analysis branch can be described as a mapping H, via correlation coefficient calculation, from the input feature vector set S to the output weight encoding, i.e. the above operation is performed on every feature vector in the set, as shown in equation (6):

H(S) = softmax(Q·K^T / √k)·V    (6)

where H(·) denotes the correlation analysis mapping and encoding operation performed by a single branch; Q, K and V stack the query, key and value vectors of all frames in S; softmax(·) is the widely used normalization function; and k denotes a constraint coefficient.
For short-time memory, the correlation between pairs of video frames gradually decays as the interval between their time sequence states grows. When the time sequence is analyzed with an attention mechanism, the correlation between paired video frames is determined only by their distance in feature space and is not limited by causal ordering, which extends the time sequence range perception of short-time memory to the global context analysis of long-time memory.
On this basis, a structure with several correlation analysis units in parallel is used to concatenate the weight-encoded features of the feature vectors in different subspaces and provide multi-view semantic information. For a given branch r, the input feature sequence S_r of the correlation analysis branch H_r is the short-time feature sequence {s_t^(r) | t ∈ [1, N]} output by level r in step 2041, where N denotes the sequence length. Weight encoding through multi-branch correlation analysis yields the branch feature sequence set {H_r(S_r) | r ∈ [1, L]}, where L denotes the number of branches. The branch feature sequence set is processed with the feature vector dimension fusion operation σ(·) to compute the multi-scale fusion feature sequence. This dimension fusion based on the branch weights does not involve the hole connection described above; it only performs dimension concatenation and conversion on the outputs of the parallel correlation analysis units and obtains the module output by mapping, as shown in equation (7):

X̃ = σ({H_r(S_r) | r ∈ [1, L]})    (7)

where X̃ denotes the multi-scale time sequence feature sequence obtained through multi-level time sequence perception and multi-branch correlation analysis; S_r denotes the short-time feature sequence output by the time sequence perception level r corresponding to the given branch, the number of branches being at most L; σ(·) denotes the feature vector dimension fusion operation; and H(·) denotes the correlation analysis mapping and encoding operation performed by a single branch. Specifically, step 2042 may also be implemented by steps 20421 to 20422:
step 20421, the multi-branch correlation analysis network is used to perform correlation coefficient calculation, weight vector coding and feature normalization processing on the multi-level short-term feature vector sequence, so as to obtain a long-term feature vector sequence corresponding to multiple branches.
Step 20422, performing dimension fusion on each long-term feature vector sequence corresponding to the multiple branches, and obtaining a multi-scale time sequence fusion feature sequence with the same scale as the target source video sequence through full connection.
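Steps 20421 and 20422 amount to scaled dot-product self-attention per branch followed by concatenation and a fully connected fusion (equations (4) to (7)). The sketch below is illustrative only; the equal query, key and value dimensions and the absence of bias terms are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchAttention(nn.Module):
    """One correlation-analysis branch: correlation coefficients from query/key dot
    products, softmax normalization, and value-weighted summation (equations (4)-(6))."""
    def __init__(self, dim):
        super().__init__()
        self.W_Q = nn.Linear(dim, dim, bias=False)
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** 0.5                            # constraint coefficient sqrt(k)

    def forward(self, s):                                  # s: (T, dim) short-time features
        q, k, v = self.W_Q(s), self.W_K(s), self.W_V(s)
        alpha = F.softmax(q @ k.t() / self.scale, dim=-1)  # (T, T) correlation coefficients
        return alpha @ v                                   # (T, dim) weight-encoded features

class MultiBranchFusion(nn.Module):
    """Run one branch per time sequence perception level, then fuse the branch outputs
    by concatenation followed by a fully connected mapping (equation (7))."""
    def __init__(self, dim, num_branches):
        super().__init__()
        self.branches = nn.ModuleList(BranchAttention(dim) for _ in range(num_branches))
        self.fc = nn.Linear(num_branches * dim, dim)

    def forward(self, level_features):                     # list of (T, dim) tensors, one per level
        encoded = [branch(s) for branch, s in zip(self.branches, level_features)]
        return self.fc(torch.cat(encoded, dim=-1))         # (T, dim) multi-scale fused sequence
```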
In this embodiment, the correlation weights of the video segments over the whole time sequence are analyzed in order to select the core segments. Each level's short-time feature vector sequence in the multi-level short-time feature vector sequence extracted above, at its own scale and with its own perception range, is fed into the multi-branch correlation analysis network. The pre-trained video summary generation model in the execution subject feeds the short-time feature vector sequence S_r extracted at level r in step 2041 into branch H_r of the multi-branch correlation analysis network for correlation analysis, and computes, one by one, the correlation coefficients of the short-time features (extracted by the short-time units) in the feature vector sequence of the current level. The pre-trained video summary generation model then combines the short-time feature vector sequence S_r with the corresponding correlation coefficients, computes the weight vector z_t^(r) corresponding to branch H_r at time t, constructs the weight-encoded vector sequence Z_r = {z_t^(r) | t ∈ [1, N]}, and obtains the long-time feature vector sequences corresponding to the multiple branches through feature normalization. The pre-trained video summary generation model in the execution subject performs dimension concatenation on the outputs of the multi-branch correlation analysis network, i.e. the long-time feature vector sequences corresponding to the multiple branches, and then obtains, through fully connected scale conversion, the multi-scale time sequence fusion feature sequence X̃ with the same scale as the target source video sequence.
In this embodiment, by constructing the multi-branch correlation analysis network, a multi-scale time sequence fusion feature sequence with the same scale as the target source video sequence is obtained, so that the resulting dynamic video summary is more accurate.
The video summary generation method based on multi-scale time sequence features further comprises the following unsupervised model training steps:
acquiring an initial generative adversarial network and a training sample set, wherein the training sample set comprises a multi-scale time sequence fusion feature sequence, the annotated importance score of each video frame in the multi-scale time sequence fusion feature sequence, and the annotated dynamic video summary corresponding to each segment of the multi-scale time sequence fusion feature sequence; taking the multi-scale time sequence fusion feature sequence in the training sample set as the input of the initial generative adversarial network, taking the annotated importance scores of the video frames and the annotated dynamic video summaries corresponding to the segments as the expected output, and iteratively training the initial generative adversarial network to finally obtain the pre-trained video summary generation model.
The iterative training of the initial generative adversarial network comprises performing the following steps a plurality of times: sampling key frames from the multi-scale time sequence fusion feature sequence in the training sample set to obtain a key frame set; reconstructing a video sequence based on the key frame set to obtain a reconstructed feature sequence; calculating the similarity between the reconstructed feature sequence and the compressed coding feature sequence; updating the key frame set obtained by key-frame sampling according to the similarity and a preset length threshold of the key frame set; and, in response to determining that the similarity is greater than a preset similarity threshold and the length of the key frame set is less than the preset length threshold of the key frame set, ending the training of the initial generative adversarial network to obtain the pre-trained video summary generation model.
In this embodiment, the network model is trained by using an unsupervised learning method, so that the extracted key frame set is as small as possible and fully represents the feature information of the source video sequence.
Specifically, an unsupervised learning method is used to train a coding-sequence generative adversarial network, from which the video summary generation model is constructed. The generative adversarial network consists of a video sequence generation module and a video sequence discrimination module. For the obtained multi-scale time sequence fusion feature sequence X̃, the execution subject can compute the corresponding key-frame extraction probability vector p_t by using the multi-level time sequence perception network. A set of video key frames is extracted by Bernoulli sampling, and the key-frame coding vector set Z is constructed from the corresponding fusion coding vectors. A video sequence generation module G is set to reconstruct the source video coding sequence X̂ from the key-frame coding vector set Z. A video sequence discrimination module D is set to compute the similarity between the reconstructed source video coding sequence X̂ and the compressed coding feature sequence X, and iterative training is carried out to optimize the video summary generation model according to the obtained similarity and the preset similarity threshold of the key frame set. When the reconstructed source video coding sequence X̂ approaches the compressed coding feature sequence X corresponding to the target source video sequence, the coding-sequence generative adversarial network is considered to have restored and reconstructed the main content of the target source video sequence; that is, the trained video summary generation model accurately extracts the core segments of the target source video sequence and sufficiently expresses its key information, and once training of the video summary generation model is complete it can be used to generate an accurate dynamic video summary.
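For illustration, one training iteration of this unsupervised scheme might be organized as in the sketch below. The score network, generator and discriminator interfaces, the soft (probability-weighted) selection used to keep the step differentiable, the single joint optimizer and the sparsity coefficient are all assumptions of the sketch; the separate adversarial update of the discrimination module is omitted for brevity.

```python
import torch

def train_step(fused_seq, coded_seq, score_net, generator, discriminator,
               optimizer, sim_threshold=0.95, max_keyframes=60):
    """One illustrative unsupervised iteration. fused_seq: (T, dim) multi-scale fused
    features; coded_seq: (T, dim) compressed coding features of the source video."""
    probs = torch.sigmoid(score_net(fused_seq)).squeeze(-1)   # key-frame extraction probabilities
    # Soft selection keeps the step differentiable; hard Bernoulli sampling is used
    # only to measure the size of the key-frame set for the stopping test.
    weighted = fused_seq * probs.unsqueeze(-1)
    recon = generator(weighted)                               # reconstructed coding sequence
    similarity = discriminator(recon, coded_seq)              # assumed to return a value in [0, 1]

    loss = (1.0 - similarity) + 0.01 * probs.mean()           # reconstruction + sparsity terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        n_keyframes = int(torch.bernoulli(probs).sum())       # Bernoulli key-frame sampling
    done = similarity.item() > sim_threshold and n_keyframes < max_keyframes
    return loss.item(), done                                  # stop training when done is True
```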
In summary, the embodiment of the present application provides a video summary generation method based on multi-scale time sequence features. The method first encodes and compresses the input target source video sequence; uses a multi-level time sequence perception network to expand the time sequence perception range of the short-time feature extraction units and extract the key information of the input video images; uses a multi-branch correlation analysis network to calculate the weights of the feature vectors within the whole coding sequence and extract the core content; trains the network with an unsupervised learning method, reconstructing the target source video sequence from the extracted key frame set and evaluating the quality of the generated summary; and constructs the video summary generation model on the basis of shot segmentation to output a representative and diverse dynamic video summary.
The video summary generation method based on multi-scale time sequence features can construct a multi-level time sequence perception network that enlarges the perception range to extract the key information in short-time features and reduces interference from redundant information; can construct a multi-branch correlation analysis network that extracts the core content of the target source video sequence through weight calculation and improves the representativeness and diversity of the generated dynamic video summary; and can train the video summary generation model with an unsupervised learning method, requiring no manual annotation, effectively reducing the manpower and material resources needed for model training, providing an objective and effective summary quality assessment method, and offering good prospects for application and popularization.
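Tying the stages together, an end-to-end flow might look like the sketch below; it reuses the illustrative functions from the earlier sketches, and the model interface with temporal_fusion and frame_scores methods is likewise an assumption rather than part of the claimed implementation.

```python
def summarize(frames, model, budget_frames):
    """Illustrative end-to-end flow: encode, fuse, score, segment, and select."""
    coded = encode_video(frames)                        # compressed coding feature sequence
    fused = model.temporal_fusion(coded)                # multi-scale time sequence fusion
    scores = model.frame_scores(fused)                  # per-frame importance scores
    cps = kts_change_points(coded.numpy(), max_segments=20)
    bounds = [0] + cps + [len(coded)]
    segments = list(zip(bounds[:-1], bounds[1:]))       # basic shots
    core = select_core_segments(segments, scores.tolist(), budget_frames)
    return [segments[i] for i in core]                  # core segments forming the summary
```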
With continuing reference to fig. 3, as an implementation of the method shown in the above-mentioned figures, the present application provides an embodiment of a video summary generation apparatus based on multi-scale time sequence characteristics, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 1, and the apparatus may be applied to various electronic devices in particular.
As shown in fig. 3, the apparatus 300 for generating a video summary based on multi-scale time-series characteristics according to the present embodiment includes: the video summarization method comprises an acquisition unit 301, an importance score determining unit 302, a segmentation unit 303, a core segment selecting unit 304 and a video summary generating unit 305.
An obtaining unit 301 configured to obtain a multi-scale time sequence fusion feature sequence by using a pre-trained video summary generation model.
An importance score determining unit 302 configured to determine an importance score of each video frame in the multi-scale time-series fusion feature sequence by using the pre-trained video summary generation model.
A slicing unit 303 configured to slice the multi-scale time-series fusion feature sequence into a basic segment set in units of basic shots based on a shot slicing algorithm, wherein each basic segment in the basic segment set includes at least one video frame.
The core segment selecting unit 304 is configured to dynamically select core segments in the basic segment set based on the importance scores and the basic segment set by using a pre-trained video summary generation model.
And a video summary generation unit 305 configured to generate and output a dynamic video summary based on the core segments by using the pre-trained video summary generation model.
In some optional implementations of the present embodiment, the obtaining unit 301 in the video summary generating apparatus based on multi-scale time sequence features is further configured to: acquiring a target source video sequence; and the apparatus further comprises: the target source video frame feature vector sequence determining unit is configured to determine a target source video frame feature vector sequence according to the target source video sequence and the pre-trained multi-target classification model; the compression coding characteristic sequence determining unit is configured to sample and normalize compression coding on a target source video frame characteristic vector sequence to obtain a compression coding characteristic sequence with uniform size; and the multi-scale time sequence fusion characteristic sequence determining unit is configured to perform multi-scale time sequence fusion on the compressed coding characteristic sequence to obtain a multi-scale time sequence fusion characteristic sequence.
In some optional implementations of this embodiment, the multi-scale time-series fusion feature sequence determination unit is further configured to: performing multi-level time sequence perception on the compression coding features, and extracting multi-level short-time feature vector sequences corresponding to the compression coding features; and determining a multi-scale time sequence fusion characteristic sequence based on the multi-level short-time characteristic vector sequence and the multi-branch association analysis network.
In some optional implementations of this embodiment, the multi-scale time-series fusion feature sequence determination unit is further configured to: performing correlation coefficient calculation, weight vector coding and feature normalization processing on the multi-level short-time feature vector sequence by using a multi-branch correlation analysis network to obtain a long-time feature vector sequence corresponding to multiple branches; dimension fusion is carried out on each long-time characteristic vector sequence corresponding to the multiple branches, and a multi-scale time sequence fusion characteristic sequence with the same scale as the target source video sequence is obtained through full connection.
In some optional implementations of this embodiment, the acquisition unit 301 is further configured to acquire an initial generative adversarial network and a training sample set, where the training sample set includes multi-scale time sequence fusion feature sequences, the annotated importance score of each video frame in the multi-scale time sequence fusion feature sequences, and the dynamic video summary corresponding to each segment in the annotated multi-scale time sequence fusion feature sequences. The apparatus further includes a training unit, configured to take the multi-scale time sequence fusion feature sequences in the training sample set as the input of the initial generative adversarial network, take the annotated importance scores of the video frames and the dynamic video summaries corresponding to the segments as the expected outputs, and iteratively train the initial generative adversarial network to obtain the pre-trained video summary generation model.
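A bare training-step skeleton of such an adversarial setup is sketched below. Summarizer and Discriminator are placeholder modules (the text does not define the network architectures), the discriminator is assumed to output a single probability of shape (1,), and the mean-squared reconstruction loss is an illustrative choice.

```python
# Skeleton only, with assumed placeholder modules and losses.
import torch
import torch.nn as nn
import torch.nn.functional as F


def train_step(summarizer: nn.Module, discriminator: nn.Module,
               feats: torch.Tensor, opt_g, opt_d) -> None:
    """feats: (T, dim) multi-scale time sequence fusion feature sequence."""
    # scores would drive key-frame sampling; it is unused in this bare skeleton.
    scores, reconstructed = summarizer(feats)

    # Discriminator step: tell original features apart from reconstructed ones.
    d_loss = (F.binary_cross_entropy(discriminator(feats), torch.ones(1)) +
              F.binary_cross_entropy(discriminator(reconstructed.detach()),
                                     torch.zeros(1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: fool the discriminator and keep the reconstruction close.
    g_loss = (F.binary_cross_entropy(discriminator(reconstructed), torch.ones(1)) +
              F.mse_loss(reconstructed, feats))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```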
In some optional implementations of this embodiment, the training unit is further configured to: the following iterative steps are performed a plurality of times: performing key frame sampling on the multi-scale time sequence fusion characteristic sequence in the training sample set to obtain a key frame set; reconstructing a video sequence based on the key frame set to obtain a reconstructed feature sequence; calculating the similarity of the reconstruction characteristic sequence and the compression coding characteristic sequence; updating a key frame set obtained by sampling key frames of the multi-scale time sequence fusion characteristic sequence in the training sample set according to the similarity and a preset length threshold of the key frame set; and in response to the fact that the similarity is larger than the preset similarity threshold and the length of the key frame set is smaller than the length threshold of the preset key frame set, finishing the training of the initially generated confrontation network to obtain a pre-trained video abstract generation model.
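The stopping rule translates into a short loop. In the sketch below the model's sample_keyframes, reconstruct, and update_keyframe_sampler methods are hypothetical names, and cosine similarity with a 0.95 threshold and a 60-frame budget stand in for the unspecified similarity measure and thresholds.

```python
# Illustrative loop for the iterative steps and stopping rule described above.
import torch.nn.functional as F


def unsupervised_iteration(model, fused_feats, coded_feats,
                           sim_threshold=0.95, max_keyframes=60, max_iters=100):
    for _ in range(max_iters):
        keyframe_idx = model.sample_keyframes(fused_feats)         # key frame set
        reconstructed = model.reconstruct(fused_feats, keyframe_idx)
        sim = F.cosine_similarity(reconstructed.flatten(),
                                  coded_feats.flatten(), dim=0).item()
        # Stop once the reconstruction is close enough and the key frame set is short enough.
        if sim > sim_threshold and len(keyframe_idx) < max_keyframes:
            break
        model.update_keyframe_sampler(sim, max_keyframes)          # refine the key frame set
    return model
```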
Technical carriers involved in payment in the embodiments of the present specification may include Near Field Communication (NFC), WIFI, 3G/4G/5G, POS machine card swiping technology, two-dimensional code scanning technology, barcode scanning technology, bluetooth, infrared, Short Message Service (SMS), Multimedia Message (MMS), and the like, for example.
The biometric features involved in biometric identification in the embodiments of the present specification may include, for example, eye features, voice prints, fingerprints, palm prints, heart beats, pulse, chromosomes, DNA, human teeth bites, and the like. Wherein the eye pattern may include biological features of the iris, sclera, etc.
It should be noted that the method of one or more embodiments of the present specification may be performed by a single device, such as a computer or a server. The method of this embodiment may also be applied in a distributed scenario, where it is completed by a plurality of devices cooperating with one another. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present specification, and the devices interact with one another to complete the video summary generation method based on multi-scale time sequence features.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
For convenience of description, the above apparatus is described as being divided into various modules by function, which are described separately. Of course, when one or more embodiments of the present specification are implemented, the functions of the modules may be implemented in one and the same piece, or in multiple pieces, of software and/or hardware.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Fig. 4 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment. The electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050, where the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 are communicatively connected to one another within the device via the bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification.
The memory 1020 may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and called and executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component within the device (not shown in the figure) or may be externally connected to the device to provide the corresponding functions. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and the like, and the output devices may include a display, a speaker, a vibrator, an indicator light, and the like.
The communication interface 1040 is used for connecting a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module may communicate in a wired manner (e.g., USB or network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, or Bluetooth).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the spirit of the present specification, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of one or more embodiments of the present specification as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided figures, for simplicity of illustration and discussion and so as not to obscure one or more embodiments of the present specification. Furthermore, devices may be shown in block diagram form in order to avoid obscuring one or more embodiments of the present specification, which also takes into account the fact that the specifics of implementing such block diagram devices are highly dependent upon the platform on which the one or more embodiments are to be implemented (i.e., the specifics should be well within the purview of one skilled in the art). Where specific details (e.g., circuits) are set forth to describe example embodiments, it should be apparent to one skilled in the art that one or more embodiments of the present specification can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative rather than restrictive.
While the present specification has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
It is intended that the one or more embodiments of the present specification embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (6)

1. A video abstract generation method based on multi-scale time sequence characteristics is characterized by comprising the following steps:
acquiring a target source video sequence;
determining a target source video frame feature vector sequence according to the target source video sequence and a pre-trained multi-target classification model;
sampling the target source video frame feature vector sequence and applying normalized compression coding to obtain a compression coding feature sequence of uniform size;
performing multi-level time sequence perception on the compression coding feature sequence, and extracting a multi-level short-time feature vector sequence corresponding to the compression coding feature sequence;
determining a multi-scale time sequence fusion feature sequence based on the multi-level short-time feature vector sequence and a multi-branch correlation analysis network;
determining the importance score of each video frame in the multi-scale time sequence fusion feature sequence by utilizing a pre-trained video abstract generation model;
based on a shot segmentation algorithm, segmenting the multi-scale time sequence fusion feature sequence into a basic segment set taking basic shots as units, wherein each basic segment in the basic segment set comprises at least one video frame;
dynamically selecting core segments in the basic segment set based on each importance score and the basic segment set by utilizing a pre-trained video abstract generation model;
and generating a dynamic video abstract by utilizing a pre-trained video abstract generation model based on the core segment, and outputting the dynamic video abstract.
2. The method of claim 1, wherein determining the multi-scale time sequence fusion feature sequence based on the multi-level short-time feature vector sequence and the multi-branch correlation analysis network comprises:
performing correlation coefficient calculation, weight vector coding and feature normalization processing on the multi-level short-time feature vector sequence by using the multi-branch correlation analysis network to obtain a long-time feature vector sequence corresponding to each of multiple branches;
performing dimension fusion on each long-time feature vector sequence corresponding to the multiple branches, and obtaining, through full connection, a multi-scale time sequence fusion feature sequence with the same scale as the target source video sequence.
3. A video summary generation device based on multi-scale time sequence characteristics, characterized by comprising:
an acquisition unit configured to acquire a target source video sequence;
a target source video frame feature vector sequence determination unit configured to determine a target source video frame feature vector sequence according to the target source video sequence and a pre-trained multi-target classification model;
a compression coding feature sequence determination unit configured to sample the target source video frame feature vector sequence and apply normalized compression coding to obtain a compression coding feature sequence of uniform size;
a multi-scale time sequence fusion feature sequence determination unit configured to perform multi-level time sequence perception on the compression coding feature sequence, extract a multi-level short-time feature vector sequence corresponding to the compression coding feature sequence, and determine a multi-scale time sequence fusion feature sequence based on the multi-level short-time feature vector sequence and a multi-branch correlation analysis network;
an importance score determination unit configured to determine an importance score of each video frame in the multi-scale time sequence fusion feature sequence by using a pre-trained video summary generation model;
a segmentation unit configured to segment the multi-scale time sequence fusion feature sequence into a basic segment set in units of basic shots based on a shot segmentation algorithm, wherein each basic segment in the basic segment set comprises at least one video frame;
a core segment selection unit configured to dynamically select core segments from the basic segment set based on each of the importance scores and the basic segment set by using the pre-trained video summary generation model;
and a video summary generation unit configured to generate and output a dynamic video summary based on the core segments by using the pre-trained video summary generation model.
4. The apparatus of claim 3, wherein the multi-scale time sequence fusion feature sequence determination unit is further configured to:
perform correlation coefficient calculation, weight vector coding and feature normalization processing on the multi-level short-time feature vector sequence by using the multi-branch correlation analysis network to obtain a long-time feature vector sequence corresponding to each of multiple branches;
perform dimension fusion on each long-time feature vector sequence corresponding to the multiple branches, and obtain, through full connection, a multi-scale time sequence fusion feature sequence with the same scale as the target source video sequence.
5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 2 when executing the program.
6. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 2.
CN202110019685.4A 2021-01-07 2021-01-07 Video abstract generation method and device based on multi-scale time sequence characteristics Active CN112883227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110019685.4A CN112883227B (en) 2021-01-07 2021-01-07 Video abstract generation method and device based on multi-scale time sequence characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110019685.4A CN112883227B (en) 2021-01-07 2021-01-07 Video abstract generation method and device based on multi-scale time sequence characteristics

Publications (2)

Publication Number Publication Date
CN112883227A CN112883227A (en) 2021-06-01
CN112883227B (en) 2022-08-09

Family

ID=76047061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110019685.4A Active CN112883227B (en) 2021-01-07 2021-01-07 Video abstract generation method and device based on multi-scale time sequence characteristics

Country Status (1)

Country Link
CN (1) CN112883227B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114390365B (en) * 2022-01-04 2024-04-26 京东科技信息技术有限公司 Method and apparatus for generating video information
CN115022733B (en) * 2022-06-17 2023-09-15 中国平安人寿保险股份有限公司 Digest video generation method, digest video generation device, computer device and storage medium
CN116178344B (en) * 2023-01-09 2023-10-20 泽田(山东)药业有限公司 Synthesis process of barnidipine hydrochloride


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9076043B2 (en) * 2012-08-03 2015-07-07 Kodak Alaris Inc. Video summarization using group sparsity analysis

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259919A (en) * 2018-11-30 2020-06-09 杭州海康威视数字技术股份有限公司 Video classification method, device and equipment and storage medium
CN111026914A (en) * 2019-12-03 2020-04-17 腾讯科技(深圳)有限公司 Training method of video abstract model, video abstract generation method and device
CN111507275A (en) * 2020-04-20 2020-08-07 北京理工大学 Video data time sequence information extraction method and device based on deep learning
CN111897995A (en) * 2020-08-04 2020-11-06 成都井之丽科技有限公司 Video feature extraction method and video quantization method applying same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video summary key frame extraction method in the HEVC compressed domain; Zhu Shuming et al.; Journal of Signal Processing; Mar. 31, 2019; full text *

Also Published As

Publication number Publication date
CN112883227A (en) 2021-06-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant