CN112584146B - Method and system for evaluating inter-frame similarity

Info

Publication number
CN112584146B
Authority
CN
China
Prior art keywords
frame
feature information
similarity
block
information
Prior art date
Legal status
Active
Application number
CN201910944335.1A
Other languages
Chinese (zh)
Other versions
CN112584146A (en)
Inventor
许燚
高龙文
田凯
周水庚
孙胡杨
Current Assignee
Fudan University
Shanghai Bilibili Technology Co Ltd
Original Assignee
Fudan University
Shanghai Bilibili Technology Co Ltd
Application filed by Fudan University, Shanghai Bilibili Technology Co Ltd filed Critical Fudan University
Priority to CN201910944335.1A
Publication of CN112584146A
Application granted
Publication of CN112584146B

Classifications

    • H: Electricity
    • H04: Electric communication technique
    • H04N: Pictorial communication, e.g. television
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N 19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/132: Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H04N 19/169: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/17: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, e.g. an object
    • H04N 19/176: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, the region being a block, e.g. a macroblock

Abstract

The embodiments of the present application provide an inter-frame similarity evaluation method, which includes the following steps: acquiring a first frame and a second frame in a frame sequence; extracting a plurality of feature information of the first frame and a plurality of feature information of the second frame; partitioning the plurality of feature information of the first frame and the plurality of feature information of the second frame to obtain a plurality of first blocks corresponding to the first frame and a plurality of second blocks corresponding to the second frame; acquiring a plurality of second blocks associated with each first block; and performing, according to the plurality of second blocks associated with each first block, similarity calculation on each piece of feature information of the first frame and part of the feature information of the second frame respectively, to obtain the inter-frame similarity between the first frame and the second frame. The embodiments of the present application can effectively reduce the computing resources consumed in calculating inter-frame similarity.

Description

Method and system for evaluating inter-frame similarity
Technical Field
The embodiments of the present application relate to the field of computer technology, and in particular to an inter-frame similarity evaluation method and system, a computer device, and a computer-readable storage medium.
Background
With the application and development of video services in various fields, video encoding and decoding have become key technologies that all parties focus on and develop. Video coding refers to converting a file in one video format into a file in another video format through a specific compression technique, thereby reducing the bandwidth cost of transmission and the space occupied in storage media.
However, video compression based on a given video compression algorithm is typically lossy, and the resulting lossy video is often accompanied by various compression artifacts, such as blocking, edge/texture floating, mosquito noise and jerkiness. Such compression noise inevitably reduces the picture quality of the video and thus degrades the visual experience of video viewers. The inventors found that a frame can be quality-enhanced using the information in other frames according to the inter-frame similarity between different frames, thereby improving the visual experience of video viewers. However, calculating inter-frame similarity is computationally expensive.
It should be noted that the above findings of the inventors are not admitted to be publicly known; they are described only to explain the technical problem addressed by the present application.
Disclosure of Invention
An object of the embodiments of the present application is to provide an inter-frame similarity evaluation method and system, a computer device, and a computer-readable storage medium, which can be used to solve the technical problem that calculating inter-frame similarity consumes too many computing resources.
One aspect of the embodiments of the present application provides a method for evaluating inter-frame similarity, where the method includes: acquiring a first frame and a second frame in a frame sequence; extracting a plurality of feature information of the first frame and a plurality of feature information of the second frame; partitioning the plurality of feature information of the first frame and the plurality of feature information of the second frame to obtain a plurality of first blocks corresponding to the first frame and a plurality of second blocks corresponding to the second frame; acquiring a plurality of second blocks associated with each first block; and according to a plurality of second blocks associated with each first block, performing similarity calculation on each feature information of the first frame and part of feature information of the second frame respectively to obtain the inter-frame similarity between the first frame and the second frame.
Optionally, obtaining a plurality of second blocks associated with each first block includes: pooling each first block into corresponding first downsampling feature information to obtain M pieces of first downsampling feature information; pooling each second block into corresponding second downsampling feature information to obtain M pieces of second downsampling feature information; calculating the similarity between first downsampling feature information a and each piece of second downsampling feature information, where the first downsampling feature information a corresponds to a first block a; and determining the k second blocks corresponding to the k pieces of second downsampling feature information with the highest similarity as the k second blocks associated with the first block a, where 1 ≤ a ≤ M, 1 ≤ k < M, and a, k and M are natural numbers.
Optionally, performing, according to the plurality of second blocks associated with each first block, similarity calculation on each piece of feature information of the first frame and part of the feature information of the second frame respectively, to obtain the inter-frame similarity between the first frame and the second frame, includes: representing the inter-frame similarity between the first frame and the second frame by a similarity matrix S_t ∈ R^{N×N}, where S_t(i,j) is an element of S_t and represents the similarity between the feature information j of the first frame and the feature information i of the second frame: when the feature information j of the first frame and the feature information i of the second frame are located in a first block and a second block that are associated with each other, the similarity between the feature information j of the first frame and the feature information i of the second frame is calculated; when the feature information j of the first frame and the feature information i of the second frame are not located in an associated first block and second block, the similarity between the feature information j of the first frame and the feature information i of the second frame is set to 0.
Optionally, the similarity between the feature information j of the first frame and the feature information i of the second frame is calculated according to the following formulas:
d_{i,j} = ||F_t(j) - F_{t-1}(i)||_2
S_t(i,j) = exp(-d_{i,j}/β) / Σ_i exp(-d_{i,j}/β)
where F_t(j) is the feature information j of the first frame, F_{t-1}(i) is the feature information i of the second frame, d_{i,j} is the Euclidean distance between the feature information j of the first frame and the feature information i of the second frame, and β is a constant.
Optionally, the inter-frame similarity is used to determine a reference weight between the first frame and the second frame.
Optionally, the method further includes: learning hidden state information at time t through a non-local convolution long-short term memory network, the hidden state information being used for enhancing the first frame; wherein the non-local convolution long-short term memory network is configured to: determine the weight of the hidden state information and the weight of the unit state information output at time t-1 according to the inter-frame similarity between the first frame corresponding to time t and the second frame corresponding to time t-1, and convert the hidden state information and the unit state information output at time t-1 according to these weights to obtain target hidden state information and target unit state information, which serve as input data of the non-local convolution long-short term memory network at time t.
Optionally, converting the hidden state information and the unit state information output at time t-1 according to the weight of the hidden state information and the weight of the unit state information output at time t-1 includes: extracting a plurality of third blocks at corresponding positions from the hidden state information output at time t-1 according to the plurality of second blocks associated with each first block, and generating the target hidden state information according to the hidden state information in the plurality of third blocks and the similarity matrix; and extracting a plurality of fourth blocks at corresponding positions from the unit state information output at time t-1 according to the plurality of second blocks associated with each first block, and generating the target unit state information according to the unit state information in the plurality of fourth blocks and the similarity matrix.
Another aspect of the embodiments of the present application provides an inter-frame similarity evaluation system, where the system includes: a first acquisition module, configured to acquire a first frame and a second frame in a frame sequence; an extraction module, configured to extract a plurality of feature information of the first frame and a plurality of feature information of the second frame; a blocking module, configured to block the plurality of feature information of the first frame and the plurality of feature information of the second frame to obtain a plurality of first blocks corresponding to the first frame and a plurality of second blocks corresponding to the second frame; a second acquisition module, configured to acquire a plurality of second blocks associated with each first block; and a third acquisition module, configured to perform similarity calculation on each piece of feature information of the first frame and part of the feature information of the second frame according to the plurality of second blocks associated with each first block, so as to obtain the inter-frame similarity between the first frame and the second frame.
Yet another aspect of the embodiments of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the inter-frame similarity evaluation method according to any one of the above.
Yet another aspect of embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, is configured to implement the steps of the inter-frame similarity assessment method according to any one of the above.
According to the inter-frame similarity evaluation method and system, the computer device, and the computer-readable storage medium provided by the embodiments of the present application, the computational complexity of calculating the inter-frame similarity can be dynamically reduced according to the sizes of the first blocks and the second blocks and the number of second blocks associated with each first block, which greatly reduces the computing resources consumed in the calculation while approximately maintaining accuracy.
Drawings
Fig. 1 schematically shows a flowchart of an inter-frame similarity evaluation method according to a first embodiment of the present application;
FIG. 2 schematically shows a sub-flowchart of step S106 in FIG. 1;
FIG. 3 schematically shows another flowchart of an inter-frame similarity evaluation method according to a first embodiment of the present application;
FIG. 4 schematically shows a sub-flowchart of step S110 in FIG. 3;
FIG. 5 schematically illustrates an architecture diagram of a video quality enhancement operation;
FIG. 6 schematically illustrates a workflow diagram of a first non-local module;
FIG. 7 is a schematic diagram showing the operational architecture of a forward non-local convolution long-short term memory network;
FIG. 8 is a block diagram schematically illustrating an inter-frame similarity evaluation system according to a second embodiment of the present application; and
fig. 9 schematically shows a hardware architecture diagram of a computer device suitable for implementing the inter-frame similarity evaluation method according to a third embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the descriptions involving "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions of different embodiments may be combined with each other, provided that the combination can be realized by a person of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, such a combination should be considered absent and outside the protection scope of the present invention.
Example one
Fig. 1 schematically shows a flowchart of an inter-frame similarity evaluation method according to a first embodiment of the present application. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The following description is made by way of example with the computer device 2 as the execution subject.
As shown in fig. 1, the method for evaluating inter-frame similarity may include steps S100 to S108, where:
step S100, a first frame and a second frame in a frame sequence are acquired.
The frame sequence is {X_{t-T}, ..., X_{t+T}}. The first frame and the second frame may be any two adjacent frames in the frame sequence, e.g. X_t and X_{t-1}, or X_{t-1} and X_{t-2}.
The frame sequence may be a video segment of a lossy video, which may be compressed video based on various coding schemes, such as H.264/AVC or H.265/HEVC. It is well understood that lossy video obtained via compression may have lost much information, resulting in various compression artifacts.
There will be some temporal relationships between two adjacent frames, such as texture, color and motion trajectories. For example, if an object A is present in the previous frame and also in the next frame, object A constitutes spatio-temporal dependency information between the two frames; based on such spatio-temporal dependency information, a frame with poor details can be repaired using a frame with good details. For the first frame, the information it lost during compression may be present in the second frame or other adjacent frames; similarly, for the second frame, the information it lost during compression may be present in the first frame or other adjacent frames.
Step S102, extracting a plurality of feature information of the first frame and a plurality of feature information of the second frame.
Feature information of each frame may be extracted using methods such as HOG (Histogram of Oriented Gradients) or SIFT (Scale-Invariant Feature Transform), or may be extracted by a deep neural network.
In an exemplary embodiment, the computer device 2 may be configured as an encoder for extracting feature information, wherein the encoder comprises a convolutional neural network and a nonlinear activation function, wherein the convolutional neural network comprises a plurality of convolutional layers.
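For illustration, a minimal sketch of such an encoder follows (PyTorch; the layer count and channel sizes are assumptions, since the embodiment only specifies a convolutional neural network with a plurality of convolutional layers plus a nonlinear activation function):
```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the feature-extraction encoder: a few convolutional
    layers plus a nonlinear activation (all sizes are illustrative)."""
    def __init__(self, in_channels: int = 3, feat_channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (B, 3, H, W) -> feature map F_t: (B, C, H, W)
        return self.net(frame)
```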
For convenience of description, the following takes the first frame as X_t and the second frame as X_{t-1} to describe this embodiment by way of example. Correspondingly, the feature information extracted from the first frame X_t is F_t, and the feature information extracted from the second frame X_{t-1} is F_{t-1}.
Step S104, partition the plurality of feature information F_t of the first frame X_t and the plurality of feature information F_{t-1} of the second frame X_{t-1} to obtain a plurality of first blocks corresponding to the first frame and a plurality of second blocks corresponding to the second frame.
The plurality of feature information F_t and the plurality of feature information F_{t-1} may exist in the form of feature maps, and each feature map may include N pieces of feature information. When the feature maps are partitioned, the block size can be set to p × p, i.e., each feature map is partitioned into N/p² blocks, where p is a natural number.
As can be seen from the above, the plurality of feature information F_t of the first frame X_t is divided into N/p² first blocks, and the plurality of feature information F_{t-1} of the second frame X_{t-1} is divided into N/p² second blocks. The blocking operation is sketched below.
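One way the blocking could be implemented (a sketch; names and shapes are illustrative, not from the patent):
```python
import torch

def to_blocks(feat: torch.Tensor, p: int) -> torch.Tensor:
    """Partition a (B, C, H, W) feature map into non-overlapping p x p
    blocks; with N = H * W this yields M = N / p^2 blocks per image."""
    B, C, H, W = feat.shape
    assert H % p == 0 and W % p == 0, "H and W must be multiples of p"
    blocks = feat.reshape(B, C, H // p, p, W // p, p)
    blocks = blocks.permute(0, 2, 4, 1, 3, 5)  # (B, H/p, W/p, C, p, p)
    return blocks.reshape(B, (H // p) * (W // p), C, p, p)  # (B, M, C, p, p)
```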
And step S106, acquiring a plurality of second blocks associated with each first block.
A plurality of second blocks most similar to each first block are found by calculating the similarity between each first block and each second block: the second blocks most similar to the 1st first block, the second blocks most similar to the 2nd first block, ..., and the second blocks most similar to the (N/p²)-th first block.
In an exemplary embodiment, as shown in fig. 2, step S106 includes the following steps: step S200, pool each first block into corresponding first downsampling feature information to obtain M pieces of first downsampling feature information; step S202, pool each second block into corresponding second downsampling feature information to obtain M pieces of second downsampling feature information; step S204, calculate the similarity between first downsampling feature information a and each piece of second downsampling feature information, where the first downsampling feature information a corresponds to a first block a; step S206, determine the k second blocks corresponding to the k pieces of second downsampling feature information with the highest similarity as the k second blocks associated with the first block a, where 1 ≤ a ≤ M, 1 ≤ k < M, a, k and M are natural numbers, and M = N/p².
The N/p² first blocks obtained by partitioning the plurality of feature information F_t are pooled to obtain the N/p² pieces of downsampling feature information of the first frame X_t, i.e., F_t^p; the N/p² second blocks obtained by partitioning the plurality of feature information F_{t-1} are pooled to obtain the N/p² pieces of downsampling feature information of the second frame X_{t-1}, i.e., F_{t-1}^p.
The Euclidean distance between each piece of downsampling feature information in F_t^p and each piece of downsampling feature information in F_{t-1}^p is calculated to obtain the corresponding Euclidean distance matrix. For each piece of downsampling feature information in F_t^p, the k pieces of downsampling feature information in F_{t-1}^p with the shortest Euclidean distance are screened out. Then, according to the correspondence between each first block and the downsampling feature information in F_t^p, and between each second block and the downsampling feature information in F_{t-1}^p, the k second blocks associated with each first block are obtained. A sketch of this procedure follows.
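A hedged sketch of this pooling, distance and top-k screening step (average pooling is assumed as the pooling operator, which the embodiment does not fix):
```python
import torch
import torch.nn.functional as F

def topk_block_pairs(feat_t: torch.Tensor, feat_tm1: torch.Tensor,
                     p: int, k: int) -> torch.Tensor:
    """For each p x p block of F_t, return the indices of the k blocks of
    F_{t-1} with the shortest Euclidean distance between the pooled
    (downsampled) block descriptors. feat_*: (B, C, H, W)."""
    # Pool each block to one C-dim descriptor: M = N / p^2 descriptors.
    desc_t = F.avg_pool2d(feat_t, p).flatten(2).transpose(1, 2)      # (B, M, C)
    desc_tm1 = F.avg_pool2d(feat_tm1, p).flatten(2).transpose(1, 2)  # (B, M, C)
    # Pairwise Euclidean distance matrix between block descriptors.
    dist = torch.cdist(desc_t, desc_tm1)                             # (B, M, M)
    # k second-frame blocks closest to each first-frame block.
    return dist.topk(k, dim=-1, largest=False).indices               # (B, M, k)
```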
Step S108, according to a plurality of second blocks associated with each first block, similarity calculation is carried out on each feature information of the first frame and part of feature information of the second frame respectively, so as to obtain inter-frame similarity between the first frame and the second frame.
The similarity matrix may be an inter-frame pixel-level similarity matrix that can be used in a non-local attention mechanism, such as the non-local convolutional long-short term memory network described later herein.
The inter-frame similarity between the first frame X_t and the second frame X_{t-1} is represented by a similarity matrix S_t ∈ R^{N×N}, where S_t(i,j) is an element of S_t and represents the similarity between the feature information j of the first frame and the feature information i of the second frame:
when the feature information j of the first frame and the feature information i of the second frame are located in a first block and a second block that are associated with each other, the similarity between the feature information j of the first frame and the feature information i of the second frame is calculated; when the feature information j of the first frame and the feature information i of the second frame are not located in an associated first block and second block, the similarity between the feature information j of the first frame and the feature information i of the second frame is set to 0.
The similarity between the feature information j of the first frame and the feature information i of the second frame is calculated according to the following formulas:
d_{i,j} = ||F_t(j) - F_{t-1}(i)||_2
S_t(i,j) = exp(-d_{i,j}/β) / Σ_i exp(-d_{i,j}/β)
where F_t(j) is the feature information j of the first frame, F_{t-1}(i) is the feature information i of the second frame, d_{i,j} is the Euclidean distance between the feature information j of the first frame and the feature information i of the second frame, and β is a constant. A sketch of this block-sparse similarity computation follows.
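A sketch of the computation, using the block indices returned by topk_block_pairs above for a single image; the normalization over the associated positions i is an assumption, and the loop form is for clarity rather than speed:
```python
import torch

def sparse_similarity(feat_t, feat_tm1, topk_idx, p: int, beta: float = 1.0):
    """Compute S_t for one image. feat_*: (C, H, W); topk_idx: (M, k),
    the associated second blocks per first block, e.g.
    topk_block_pairs(...)[0]. Row j indexes positions of frame t, column
    i positions of frame t-1; non-associated entries stay 0."""
    C, H, W = feat_t.shape
    N = H * W
    Ft = feat_t.reshape(C, N).t()      # (N, C): one feature vector per position
    Ftm1 = feat_tm1.reshape(C, N).t()  # (N, C)
    # Block index of every spatial position.
    rows = torch.arange(H) // p
    cols = torch.arange(W) // p
    block_of = (rows[:, None] * (W // p) + cols[None, :]).reshape(N)
    S = torch.zeros(N, N)
    for j in range(N):                                # position j in frame t
        assoc = topk_idx[block_of[j]]                 # its k associated blocks
        mask = torch.isin(block_of, assoc)            # positions i inside them
        d = (Ft[j] - Ftm1[mask]).norm(dim=1)          # Euclidean distances
        S[j, mask] = torch.softmax(-d / beta, dim=0)  # normalized similarity
    return S                                          # (N, N), block-sparse
```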
In an exemplary embodiment, the inter-frame similarity is used to determine a reference weight between the first frame and the second frame. The reference weights include weights of information in the first frame during the second frame enhancement operation or weights of information in the second frame during the first frame enhancement operation.
In an exemplary embodiment, the inter-frame similarity evaluation method may be used in video quality enhancement operations to handle motion patterns between different frames, such as large or blurry motion, with low computational resource occupancy, for example in the non-local module of a Non-Local Convolutional Long Short-Term Memory network (NL-ConvLSTM).
In an exemplary embodiment, as shown in fig. 3, the method further includes step S110: learning hidden state information at time t through a non-local convolution long-short term memory network, the hidden state information being used for enhancing the first frame. The non-local convolution long-short term memory network is configured to: determine the weight of the hidden state information and the weight of the unit state information output at time t-1 according to the inter-frame similarity between the first frame corresponding to time t and the second frame corresponding to time t-1, and convert the hidden state information and the unit state information output at time t-1 according to these weights to obtain target hidden state information and target unit state information, which serve as input data of the non-local convolution long-short term memory network at time t.
In an exemplary embodiment, to further reduce the consumption of computing resources, as shown in fig. 4, the conversion process is as follows: step S400, according to the plurality of second blocks associated with each first block, extract a plurality of third blocks at corresponding positions (the top-k positions in H_{t-1}) from the hidden state information output at time t-1, and generate the target hidden state information according to the hidden state information in the plurality of third blocks and the similarity matrix; step S402, according to the plurality of second blocks associated with each first block, extract a plurality of fourth blocks at corresponding positions from the unit state information output at time t-1, and generate the target unit state information according to the unit state information in the plurality of fourth blocks and the similarity matrix.
Referring to FIG. 5, for ease of understanding, an operational flow of a video enhancement method is provided below. This flow performs an enhancement operation on the first frame X_t to obtain the enhanced frame X̂_t of the first frame.
Step one, acquire the frame sequence {X_{t-T}, ..., X_{t+T}} to be processed, which includes the first frame X_t, the second frame X_{t-1} and other adjacent frames.
Step two, extract a plurality of feature information of each frame in the frame sequence.
The encoder extracts the corresponding feature information F_t from the first frame X_t, the corresponding feature information F_{t-1} from the second frame X_{t-1}, and so on, to obtain the feature information sequence {F_{t-T}, ..., F_{t-2}, F_{t-1}, F_t, F_{t+1}, F_{t+2}, ..., F_{t+T}} corresponding to the frame sequence {X_{t-T}, ..., X_{t+T}}.
Step three, input the plurality of feature information of each frame into the non-local convolution long-short term memory network, and acquire the reference feature information through the non-local convolution long-short term memory network, where the reference feature information includes the hidden state information H_t corresponding to the first frame X_t.
The non-local convolution long-short term memory network comprises a forward non-local convolution long-short term memory network and a backward non-local convolution long-short term memory network, the forward non-local convolution long-short term memory network comprises a first non-local module and a forward LSTM module, the backward non-local convolution long-short term memory network comprises a second non-local module and a backward LSTM module, the first non-local module is used for determining the weight of hidden state information output by a previous frame and the weight of unit state information output by the previous frame according to the inter-frame similarity between two adjacent frames, and the second non-local module is used for determining the weight of hidden state information output by a next frame and the weight of unit state information output by the next frame according to the inter-frame similarity between the two adjacent frames.
The forward non-local convolution long-short term memory network and the backward non-local convolution long-short term memory network are similar, differing only in temporal direction. For ease of understanding, the operation of the non-local convolution long-short term memory network is now described with reference to fig. 7.
With reference to fig. 6 and fig. 7, the work flow of the forward non-local convolution long-short term memory network at time t is taken as an example:
(1) The first non-local module receives: the plurality of feature information F_t of the first frame X_t corresponding to time t, and the plurality of feature information F_{t-1} of the second frame X_{t-1} corresponding to time t-1, where time t is the current time;
(2) The first non-local module calculates the inter-frame similarity between the first frame X_t and the second frame X_{t-1}. The calculation process is as follows:
(2.1) Partition the plurality of feature information F_t and the plurality of feature information F_{t-1} to obtain a plurality of first blocks corresponding to the first frame and a plurality of second blocks corresponding to the second frame.
The plurality of feature information F_t and the plurality of feature information F_{t-1} may exist in the form of feature maps, and each feature map may include N pieces of feature information. The block size can be set to p × p: the plurality of feature information F_t is divided into N/p² first blocks, and the plurality of feature information F_{t-1} is divided into N/p² second blocks;
(2.2) Pool the N/p² first blocks of F_t to obtain the N/p² pieces of downsampling feature information F_t^p of the first frame X_t; pool the N/p² second blocks of F_{t-1} to obtain the N/p² pieces of downsampling feature information F_{t-1}^p of the second frame X_{t-1};
(2.3) Compute the Euclidean distance between each piece of downsampling feature information in F_t^p and each piece of downsampling feature information in F_{t-1}^p to obtain the corresponding Euclidean distance matrix;
(2.4) For each piece of downsampling feature information in F_t^p, screen out from F_{t-1}^p the k pieces of downsampling feature information with the shortest Euclidean distance, obtaining the k second blocks most similar to each first block (the top-k blocks in F_{t-1});
(2.5) According to the plurality of second blocks most similar to each first block, calculate the similarity matrix S_t ∈ R^{N×N} between the first frame X_t and the second frame X_{t-1}, where S_t(i,j) is an element of S_t representing the similarity between the feature information j of the first frame X_t and the feature information i of the second frame X_{t-1}:
when the feature information j of the first frame X_t and the feature information i of the second frame X_{t-1} are located in a first block and a second block that are associated with each other, the similarity between them is calculated; when they are not located in an associated first block and second block, the similarity between them is set to 0.
The similarity between the feature information j of the first frame and the feature information i of the second frame is calculated according to the following formulas:
d_{i,j} = ||F_t(j) - F_{t-1}(i)||_2
S_t(i,j) = exp(-d_{i,j}/β) / Σ_i exp(-d_{i,j}/β)
where F_t(j) is the feature information j of the first frame, F_{t-1}(i) is the feature information i of the second frame, d_{i,j} is the Euclidean distance between the feature information j of the first frame and the feature information i of the second frame, and β is a constant.
(3) The first non-local module receives the hidden state information H_{t-1} and the cell state information C_{t-1} output at time t-1. According to the k second blocks most similar to each first block, a plurality of third blocks at corresponding positions (the top-k positions in H_{t-1}) are extracted from the hidden state information H_{t-1} output at time t-1, and the target hidden state information Ĥ_{t-1} is generated according to the hidden state information in the plurality of third blocks and the similarity matrix; likewise, a plurality of fourth blocks at corresponding positions are extracted from the cell state information C_{t-1} output at time t-1, and the target cell state information Ĉ_{t-1} is generated according to the cell state information in the plurality of fourth blocks and the similarity matrix. The reference formulas are as follows:
Ĥ_{t-1} = S_t · H_{t-1}
Ĉ_{t-1} = S_t · C_{t-1}
the first non-local module is for assisting in capturing a sequence of frames
Figure BDA0002223753390000137
The trajectory trend in (1) can be seen as a mechanism of attention. The first non-local module may capture global motion trajectories (global motion patterns) more efficiently than motion compensation (motion compensation). In addition, in the processing of the first non-local block, the inter-frame similarity can be directly determined according to the feature information of the corresponding two frames, and an additional network layer (additional layer) for generating a motion vector field (motion field) is required by training, for example, motion compensation.
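A minimal sketch of the conversion Ĥ_{t-1} = S_t · H_{t-1}, Ĉ_{t-1} = S_t · C_{t-1}, for a single image and with S_t held as a dense matrix for simplicity (in practice S_t is block-sparse):
```python
import torch

def convert_states(S: torch.Tensor, H_prev: torch.Tensor, C_prev: torch.Tensor):
    """Apply the similarity matrix as attention weights to the previous
    states. S: (N, N) with S[j, i] = similarity between position j of
    frame t and position i of frame t-1; H_prev, C_prev: (C, H, W)."""
    Ch, Hh, Wh = H_prev.shape
    N = Hh * Wh
    # Each target position j is a similarity-weighted sum over positions i.
    H_hat = (H_prev.reshape(Ch, N) @ S.t()).reshape(Ch, Hh, Wh)
    C_hat = (C_prev.reshape(Ch, N) @ S.t()).reshape(Ch, Hh, Wh)
    return H_hat, C_hat
```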
(4) Input F_t, Ĥ_{t-1} and Ĉ_{t-1} into the forward LSTM module, which outputs the hidden state information H_t and the cell state information C_t at time t. Specifically, this can be written as:
H_t, C_t = ConvLSTM(F_t, Ĥ_{t-1}, Ĉ_{t-1})
illustratively, the forward LSTM module operating principle may be as follows:
f_t = σ(W_xf * F_t + W_hf * Ĥ_{t-1} + b_f)
g_t = tanh(W_xg * F_t + W_hg * Ĥ_{t-1} + b_g)
i_t = σ(W_xi * F_t + W_hi * Ĥ_{t-1} + b_i)
o_t = σ(W_xo * F_t + W_ho * Ĥ_{t-1} + b_o)
C_t = f_t ⊙ Ĉ_{t-1} + i_t ⊙ g_t
H_t = o_t ⊙ tanh(C_t)
where σ denotes the sigmoid function, * denotes convolution and ⊙ denotes element-wise multiplication.
The forget gate receives memory information and decides which part of the memory is retained and which is forgotten. The forgetting factor f_t ∈ [0, 1] represents the weight applied at time t to the target cell state information Ĉ_{t-1} converted from the output at time t-1, and is used to determine whether the memory information learned at time t-1 (i.e., the converted target cell state information Ĉ_{t-1} output at time t-1) passes or partially passes.
The input gate selects the information to be memorized: i_t ∈ [0, 1] indicates the selection weight of the temporary cell state information g_t at time t, where g_t is the temporary cell state information at time t. The term f_t ⊙ Ĉ_{t-1} may indicate the information retained after deletion, and i_t ⊙ g_t may indicate the newly added information; the cell state information C_t at time t is obtained from these two parts.
The output gate outputs the hidden state information H_t at time t, where o_t ∈ [0, 1] denotes the selection weight of the cell state information at time t.
In addition, W_xf, W_hf, W_xg, W_hg, W_xi, W_hi, W_xo and W_ho are all weight parameters in the forward LSTM module; b_f, b_g, b_i and b_o are all bias terms in the forward LSTM module. These parameters are obtained by model training.
It should be noted that the above exemplary structure of the forward LSTM module is not intended to limit the scope of the present invention.
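For concreteness, a minimal ConvLSTM cell along the lines of the equations above; the 3 × 3 kernel and the fused gate convolution are assumptions, not specified by the embodiment:
```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Sketch of the forward LSTM module above; the four gates share one
    convolution over the concatenated input (kernel size is assumed)."""
    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel, padding=kernel // 2)

    def forward(self, F_t, H_hat, C_hat):
        z = self.gates(torch.cat([F_t, H_hat], dim=1))
        f, g, i, o = z.chunk(4, dim=1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        g = torch.tanh(g)                # temporary cell state g_t
        C_t = f * C_hat + i * g          # forget gate + input gate
        H_t = o * torch.tanh(C_t)        # output gate
        return H_t, C_t
```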
(5) Perform a decoding operation on the hidden state information H_t to obtain the residual of the first frame X_t.
The computer device may configure a decoder, where the decoder comprises a convolutional neural network and a nonlinear activation function, and the convolutional neural network comprises a plurality of convolutional layers. The decoder is structurally symmetric to the encoder.
(6) Obtain the enhanced frame X̂_t of the first frame X_t from the residual and the first frame X_t. A sketch of steps (5) and (6) follows.
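A hedged sketch of the decoder and the final combination; layer sizes are illustrative, and the residual skip connection is an assumption consistent with step (6):
```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoder, structurally symmetric to the encoder
    sketched earlier (layer sizes again illustrative)."""
    def __init__(self, feat_channels: int = 64, out_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, H_t: torch.Tensor) -> torch.Tensor:
        return self.net(H_t)  # residual for frame X_t

# Step (6): enhanced frame = input frame plus the decoded residual,
# e.g. X_enhanced = X_t + Decoder()(H_t) for matching shapes.
```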
The video quality enhancement operation provided by this embodiment can improve video quality with fewer computing resources and effectively remove artifacts.
The technical solution provided by this embodiment effectively reduces computational complexity. The analysis is as follows:
The original non-local module calculates the similarity between each piece of feature information of the second frame and each piece of feature information of the first frame to obtain a similarity matrix S_t (S_t ∈ R^{N×N}), and performs the conversion operation on H_{t-1} and C_{t-1} at time t-1 according to the similarity matrix.
For convenience, φ denotes the computational complexity of the non-local module of this embodiment, and ψ denotes the computational complexity of the original non-local module, as shown in Table 1:
         Original non-local module       Non-local module of this embodiment
Time     O(2N²C)                         O((N/p²)²(C + log k) + 2kNCp²)
Space    O(2N²)                          O((N/p²)² + kN/p² + 2kNp²)
TABLE 1
In the case of log k < C, φ = O((N/p²)²C + 2kNCp²), i.e., a constant multiple of (N/p²)²C + 2kNCp², where O denotes the same order of magnitude as the value in parentheses, N denotes the number of pieces of feature information, p × p denotes the block size, and C is the number of channels. Then φ/ψ = 1/(2p⁴) + kp²/N ≤ 1, since kp² ≤ N; therefore, the computational complexity can be dynamically reduced according to k and p. For a given k, taking p = (N/k)^(1/6) minimizes φ/ψ at 1.5(k/N)^(2/3). Further, when p = 10, k = 4, C = 64 and f = 4, φ may be close to O(NC²f²) (a constant multiple of NC²f²), corresponding to the computational complexity of a convolutional layer with convolution kernel size f. With continued reference to Table 1, with p set to 10 and k set to 4, φ may be close to one thousandth of ψ. A quick numeric check of the ratio follows.
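A quick check of the ratio φ/ψ = 1/(2p⁴) + kp²/N for the quoted settings; the value of N is an assumed example, and the ratio depends on it:
```python
# phi/psi = 1/(2 p^4) + k p^2 / N for p = 10, k = 4; N (number of feature
# positions) is an assumed example value here.
p, k, N = 10, 4, 256 * 256
ratio = 1 / (2 * p**4) + k * p**2 / N
print(f"phi/psi ~= {ratio:.5f}")  # ~= 0.00615 for this N; smaller for larger N
```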
Example two
Fig. 8 is a block diagram of an inter-frame similarity evaluation system according to a second embodiment of the present application, which may be partitioned into one or more program modules, stored in a storage medium, and executed by one or more processors to implement the second embodiment of the present application. The program modules referred to in the embodiments of the present application refer to a series of computer program instruction segments that can perform specific functions, and the following description will specifically describe the functions of the program modules in the embodiments.
As shown in fig. 8, the inter-frame similarity evaluation system 800 may include the following components:
a first obtaining module 810, configured to obtain a first frame and a second frame in a frame sequence;
an extracting module 820, configured to extract a plurality of feature information of the first frame and a plurality of feature information of the second frame;
a blocking module 830, configured to block the feature information of the first frame and the feature information of the second frame to obtain a plurality of first blocks corresponding to the first frame and a plurality of second blocks corresponding to the second frame;
a second obtaining module 840, configured to obtain a plurality of second blocks associated with each first block; and
a third obtaining module 850, configured to perform similarity calculation on each feature information of the first frame and part of feature information of the second frame according to a plurality of second blocks associated with each first block, so as to obtain inter-frame similarity between the first frame and the second frame.
In an exemplary embodiment, the second obtaining module 840 is further configured to: pool each first block into corresponding first downsampling feature information to obtain M pieces of first downsampling feature information; pool each second block into corresponding second downsampling feature information to obtain M pieces of second downsampling feature information; calculate the similarity between first downsampling feature information a and each piece of second downsampling feature information, where the first downsampling feature information a corresponds to a first block a; and determine the k second blocks corresponding to the k pieces of second downsampling feature information with the highest similarity as the k second blocks associated with the first block a, where 1 ≤ a ≤ M, 1 ≤ k < M, and a, k and M are natural numbers.
In an exemplary embodiment, the third obtaining module 850 is further configured to: represent the inter-frame similarity between the first frame and the second frame by a similarity matrix S_t ∈ R^{N×N}, where S_t(i,j) is an element of S_t and represents the similarity between the feature information j of the first frame and the feature information i of the second frame: when the feature information j of the first frame and the feature information i of the second frame are located in a first block and a second block that are associated with each other, the similarity between them is calculated; when the feature information j of the first frame and the feature information i of the second frame are not located in an associated first block and second block, the similarity between them is set to 0.
In an exemplary embodiment, the similarity between the feature information j of the first frame and the feature information i of the second frame is calculated according to the following formulas:
d_{i,j} = ||F_t(j) - F_{t-1}(i)||_2
S_t(i,j) = exp(-d_{i,j}/β) / Σ_i exp(-d_{i,j}/β)
where F_t(j) is the feature information j of the first frame, F_{t-1}(i) is the feature information i of the second frame, d_{i,j} is the Euclidean distance between the feature information j of the first frame and the feature information i of the second frame, and β is a constant.
In an exemplary embodiment, the inter-frame similarity is used to determine a reference weight between the first frame and the second frame.
In an exemplary embodiment, the system further comprises a learning module configured to: learn hidden state information at time t through a non-local convolution long-short term memory network, the hidden state information being used for enhancing the first frame; wherein the non-local convolution long-short term memory network is configured to: determine the weight of the hidden state information and the weight of the unit state information output at time t-1 according to the inter-frame similarity between the first frame corresponding to time t and the second frame corresponding to time t-1, and convert the hidden state information and the unit state information output at time t-1 according to these weights to obtain target hidden state information and target unit state information, which serve as input data of the non-local convolution long-short term memory network at time t.
In an exemplary embodiment, the learning module is further configured to: extract a plurality of third blocks at corresponding positions from the hidden state information output at time t-1 according to the plurality of second blocks associated with each first block, and generate the target hidden state information according to the hidden state information in the plurality of third blocks and the similarity matrix; and extract a plurality of fourth blocks at corresponding positions from the unit state information output at time t-1 according to the plurality of second blocks associated with each first block, and generate the target unit state information according to the unit state information in the plurality of fourth blocks and the similarity matrix.
EXAMPLE III
Fig. 9 schematically shows a hardware architecture diagram of a computer device suitable for implementing the inter-frame similarity evaluation method according to a third embodiment of the present application. In this embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. For example, it may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a monitoring device, a video conference system, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of multiple servers), and the like. As shown in fig. 9, the computer device 2 at least includes, but is not limited to: a memory 21, a processor 22 and a network interface 23, which may be communicatively coupled to each other through a system bus. Wherein:
the memory 21 includes at least one type of computer-readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 21 may be an internal storage module of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk provided on the computer device 2, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Of course, the memory 21 may also comprise both an internal memory module of the computer device 2 and an external memory device thereof. In this embodiment, the memory 21 is generally used for storing an operating system installed in the computer device 2 and various types of application software, such as program codes of the inter-frame similarity evaluation method. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is generally configured to control the overall operation of the computer device 2, such as performing control and processing related to data interaction or communication with the computer device 2. In this embodiment, the processor 22 is configured to execute the program code stored in the memory 21 or process data.
The network interface 23 may comprise a wireless network interface or a wired network interface, and the network interface 23 is typically used to establish a communication connection between the computer device 2 and other computer devices. For example, the network interface 23 is used to connect the computer device 2 with an external terminal through a network, establish a data transmission channel and a communication connection between the computer device 2 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a Global System of Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth (Bluetooth), or Wi-Fi.
It is noted that fig. 9 only shows a computer device with components 21-23, but it is to be understood that not all of the shown components are required to be implemented, and that more or less components may be implemented instead.
In this embodiment, the method for evaluating the inter-frame similarity stored in the memory 21 may be further divided into one or more program modules and executed by one or more processors (in this embodiment, the processor 22) to complete the present invention.
Example four
The present embodiment also provides a computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the inter-frame similarity evaluation method in the embodiments.
In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device. Of course, the computer-readable storage medium may also include both internal and external storage devices of the computer device. In this embodiment, the computer-readable storage medium is generally used for storing an operating system and various types of application software installed in the computer device, for example, the program code of the inter-frame similarity evaluation method in the embodiment, and the like. Further, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described herein. They may also be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. An inter-frame similarity evaluation method, the method comprising:
acquiring a first frame and a second frame in a frame sequence;
extracting a plurality of feature information of the first frame and a plurality of feature information of the second frame;
partitioning the plurality of feature information of the first frame and the plurality of feature information of the second frame to obtain a plurality of first blocks corresponding to the first frame and a plurality of second blocks corresponding to the second frame;
acquiring a plurality of second blocks related to each first block by calculating the similarity between each first block and each second block; and
according to the plurality of second blocks associated with each first block, performing similarity calculation between each piece of feature information of the first frame and part of the feature information of the second frame to obtain the inter-frame similarity between the first frame and the second frame, comprising: representing the inter-frame similarity between the first frame and the second frame by a similarity matrix $S_t$, wherein each element $S_t(i,j)$ of $S_t$ represents the similarity between the feature information j of the first frame and the feature information i of the second frame: when the feature information j of the first frame and the feature information i of the second frame are respectively and correspondingly located in a first block and a second block having an association relationship, the similarity between the feature information j of the first frame and the feature information i of the second frame is calculated; and when the feature information j of the first frame and the feature information i of the second frame are not correspondingly located in a first block and a second block having an association relationship, the similarity between the feature information j of the first frame and the feature information i of the second frame is set to 0.
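For concreteness, here is a minimal NumPy sketch of the block-sparse similarity computation described in claim 1. It is not the patented implementation: the original formula images (the FDA figures) are not recoverable, so the symbol $S_t$ above and every name below (`block_sparse_similarity`, `assoc`, `beta`) are illustrative, and the sketch assumes (H, W, C) feature maps, square blocks of side `block`, and the exponential-of-Euclidean-distance similarity of claim 3. The `assoc` mapping comes from the top-k association step of claim 2, sketched after that claim.

```python
import numpy as np

def block_sparse_similarity(F_t, F_prev, assoc, block, beta=1.0):
    """Block-sparse similarity between two (H, W, C) feature maps.

    assoc maps each first-frame block index to the list of second-frame
    block indices associated with it; entries for every other block pair
    are left at 0, as required by claim 1.
    S[i, j] holds the similarity between feature i of the second frame
    and feature j of the first frame.
    """
    H, W, C = F_t.shape
    ft, fp = F_t.reshape(-1, C), F_prev.reshape(-1, C)
    S = np.zeros((H * W, H * W))
    bw = W // block  # number of blocks per row

    def members(b):  # flat pixel indices belonging to block b
        y, x = divmod(b, bw)
        return [(y * block + dy) * W + (x * block + dx)
                for dy in range(block) for dx in range(block)]

    for b1, b2_list in assoc.items():
        for j in members(b1):
            for b2 in b2_list:
                for i in members(b2):
                    S[i, j] = np.exp(-beta * np.linalg.norm(ft[j] - fp[i]))
    return S
```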
2. The method according to claim 1, wherein obtaining a plurality of second blocks associated with each first block comprises:
pooling each first block into corresponding first downsampled feature information to obtain M pieces of first downsampled feature information;
pooling each second block into corresponding second downsampled feature information to obtain M pieces of second downsampled feature information;
calculating the similarity between first downsampled feature information a and each piece of second downsampled feature information, wherein the first downsampled feature information a corresponds to a first block a; and
determining the k second blocks corresponding to the k pieces of second downsampled feature information with the highest similarity as the k second blocks associated with the first block a, wherein 1 ≤ a ≤ M, 1 ≤ k < M, and a, k, and M are natural numbers.
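A hedged sketch of this association step: each block is reduced to one C-dimensional descriptor by average pooling (the claim says only "pooling"; the average is an assumption), and for each first block the k second blocks with the most similar descriptors are retained. All names are illustrative.

```python
import numpy as np

def associate_blocks(F_t, F_prev, block, k, beta=1.0):
    """Map each first-block index to its k associated second-block indices."""
    H, W, C = F_t.shape
    bh, bw = H // block, W // block

    def pool(F):  # average-pool every block to a single vector -> (M, C)
        return F.reshape(bh, block, bw, block, C).mean(axis=(1, 3)).reshape(-1, C)

    d_t, d_prev = pool(F_t), pool(F_prev)              # M descriptors per frame
    dist = np.linalg.norm(d_t[:, None, :] - d_prev[None, :, :], axis=-1)
    sim = np.exp(-beta * dist)                         # (M, M) descriptor similarity
    topk = np.argsort(-sim, axis=1)[:, :k]             # k most similar second blocks
    return {a: topk[a].tolist() for a in range(bh * bw)}
```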
3. The method according to claim 1, wherein the similarity between the feature information j of the first frame and the feature information i of the second frame is calculated by the following formula:
$$S_t(i,j) = e^{-\beta\, D(i,j)}, \qquad D(i,j) = \left\| F_t(j) - F_{t-1}(i) \right\|_2$$

wherein $F_t(j)$ is the feature information j of the first frame, $F_{t-1}(i)$ is the feature information i of the second frame, $D(i,j)$ is the Euclidean distance between the feature information j of the first frame and the feature information i of the second frame, and $\beta$ is a constant.
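In code, the per-pair similarity of claim 3 reduces to a single expression (a sketch; `beta` stands for the constant $\beta$):

```python
import numpy as np

def pair_similarity(f_j, f_i, beta=1.0):
    """exp(-beta * ||F_t(j) - F_{t-1}(i)||_2), as in claim 3."""
    return np.exp(-beta * np.linalg.norm(f_j - f_i))
```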
4. The inter-frame similarity evaluation method according to claim 1, wherein the inter-frame similarity is used to determine a reference weight between the first frame and the second frame.
5. The method of claim 4, further comprising:
learning hidden state information at time t through a non-local convolutional long short-term memory network, wherein the hidden state information is used to enhance the first frame;
wherein the non-local convolutional long short-term memory network is configured to: determine the weight of the hidden state information and the weight of the cell state information output at time t-1 according to the inter-frame similarity between a first frame corresponding to time t and a second frame corresponding to time t-1, and convert the hidden state information and the cell state information output at time t-1 according to these weights to obtain target hidden state information and target cell state information, which serve as input data of the non-local convolutional long short-term memory network at time t.
6. The method according to claim 5, wherein converting the hidden state information and the cell state information output at time t-1 according to the weight of the hidden state information and the weight of the cell state information output at time t-1 comprises:
extracting, according to the plurality of second blocks associated with each first block, a plurality of third blocks at corresponding positions from the hidden state information output at time t-1, and generating the target hidden state information according to the hidden state information in the plurality of third blocks and the similarity matrix; and
extracting, according to the plurality of second blocks associated with each first block, a plurality of fourth blocks at corresponding positions from the cell state information output at time t-1, and generating the target cell state information according to the cell state information in the plurality of fourth blocks and the similarity matrix.
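One plausible reading of claim 6 in code, not the patented implementation: for each target position, the previous hidden (or cell) state is aggregated over the positions of the associated blocks, weighted by the sparse similarity matrix of claim 1. The per-column normalization is an assumption; the claim states only that the states are converted according to the similarity-derived weights.

```python
import numpy as np

def warp_state(state_prev, S):
    """Similarity-weighted aggregation of a previous (H, W, C) state.

    S is the (H*W, H*W) block-sparse matrix of claim 1, with S[i, j]
    relating position i at time t-1 to position j at time t. Because S
    is block-sparse, each output position mixes only the states in its
    associated (third/fourth) blocks.
    """
    H, W, C = state_prev.shape
    flat = state_prev.reshape(-1, C)
    w = S / np.maximum(S.sum(axis=0, keepdims=True), 1e-8)  # normalize per target (assumption)
    return (w.T @ flat).reshape(H, W, C)
```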
7. An inter-frame similarity evaluation system, the system comprising:
the device comprises a first acquisition module, a second acquisition module and a first display module, wherein the first acquisition module is used for acquiring a first frame and a second frame in a frame sequence;
an extraction module, configured to extract a plurality of feature information of the first frame and a plurality of feature information of the second frame;
a blocking module, configured to partition the plurality of feature information of the first frame and the plurality of feature information of the second frame to obtain a plurality of first blocks corresponding to the first frame and a plurality of second blocks corresponding to the second frame;
a second acquisition module, configured to acquire a plurality of second blocks associated with each first block by calculating the similarity between each first block and each second block; and
a third obtaining module, configured to perform similarity calculation on each piece of feature information of the first frame and part of feature information of the second frame according to a plurality of second blocks associated with each first block, so as to obtain inter-frame similarity between the first frame and the second frame;
wherein the third obtaining module is further configured to: represent the inter-frame similarity between the first frame and the second frame by a similarity matrix $S_t$, wherein each element $S_t(i,j)$ of $S_t$ represents the similarity between the feature information j of the first frame and the feature information i of the second frame: when the feature information j of the first frame and the feature information i of the second frame are respectively and correspondingly located in a first block and a second block having an association relationship, the similarity between the feature information j of the first frame and the feature information i of the second frame is calculated; and when the feature information j of the first frame and the feature information i of the second frame are not correspondingly located in a first block and a second block having an association relationship, the similarity between the feature information j of the first frame and the feature information i of the second frame is set to 0.
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the inter-frame similarity evaluation method according to any one of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the inter-frame similarity evaluation method according to any one of claims 1 to 6.
CN201910944335.1A 2019-09-30 2019-09-30 Method and system for evaluating interframe similarity Active CN112584146B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910944335.1A CN112584146B (en) 2019-09-30 2019-09-30 Method and system for evaluating interframe similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910944335.1A CN112584146B (en) 2019-09-30 2019-09-30 Method and system for evaluating interframe similarity

Publications (2)

Publication Number Publication Date
CN112584146A CN112584146A (en) 2021-03-30
CN112584146B (en) 2021-09-28

Family

ID=75116590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910944335.1A Active CN112584146B (en) 2019-09-30 2019-09-30 Method and system for evaluating interframe similarity

Country Status (1)

Country Link
CN (1) CN112584146B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241223B (en) * 2021-12-17 2023-03-24 北京达佳互联信息技术有限公司 Video similarity determination method and device, electronic equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102426606A (en) * 2011-11-11 2012-04-25 南京财经大学 Method for retrieving multi-feature image based on particle swarm algorithm
CN104392439A (en) * 2014-11-13 2015-03-04 北京智谷睿拓技术服务有限公司 Image similarity confirmation method and device
CN105578198A (en) * 2015-12-14 2016-05-11 上海交通大学 Video homologous Copy-Move detection method based on time offset characteristic
CN107103270A (en) * 2016-02-23 2017-08-29 云智视像科技(上海)有限公司 A kind of face identification system of the dynamic calculation divided group coefficient based on IDF
CN107122787A (en) * 2017-02-14 2017-09-01 北京理工大学 A kind of image scaling quality evaluating method of feature based fusion
CN107153824A (en) * 2017-05-22 2017-09-12 中国人民解放军国防科学技术大学 Across video pedestrian recognition methods again based on figure cluster
CN109241911A (en) * 2018-09-07 2019-01-18 北京相貌空间科技有限公司 Human face similarity degree calculation method and device
CN109859245A (en) * 2019-01-22 2019-06-07 深圳大学 Multi-object tracking method, device and the storage medium of video object
CN109948666A (en) * 2019-03-01 2019-06-28 广州杰赛科技股份有限公司 Image similarity recognition methods, device, equipment and storage medium
CN110162657A (en) * 2019-05-28 2019-08-23 山东师范大学 A kind of image search method and system based on high-level semantics features and color characteristic

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180087994A (en) * 2017-01-26 2018-08-03 삼성전자주식회사 Stero matching method and image processing apparatus

Also Published As

Publication number Publication date
CN112584146A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN109493350B (en) Portrait segmentation method and device
CN107967669B (en) Picture processing method and device, computer equipment and storage medium
CN111629262B (en) Video image processing method and device, electronic equipment and storage medium
CN109816615B (en) Image restoration method, device, equipment and storage medium
CN110956219B (en) Video data processing method, device and electronic system
EP2063644A2 (en) Image encoding device and encoding method, and image decoding device and decoding method
Hayat Super-resolution via deep learning
US11328184B2 (en) Image classification and conversion method and device, image processor and training method therefor, and medium
US20160255357A1 (en) Feature-based image set compression
WO2020043296A1 (en) Device and method for separating a picture into foreground and background using deep learning
US9230161B2 (en) Multiple layer block matching method and system for image denoising
Ding et al. A deep learning approach for quality enhancement of surveillance video
CN112584146B (en) Method and system for evaluating interframe similarity
CN112584158B (en) Video quality enhancement method and system
JP6275719B2 (en) A method for sampling image colors of video sequences and its application to color clustering
US11403782B2 (en) Static channel filtering in frequency domain
CN111861940A (en) Image toning enhancement method based on condition continuous adjustment
CN112132769A (en) Image fusion method and device and computer equipment
CN116486009A (en) Monocular three-dimensional human body reconstruction method and device and electronic equipment
CN115984307A (en) Video object segmentation method and device, electronic equipment and storage medium
CN114627211A (en) Video business card generation method and device, computer equipment and storage medium
WO2020077535A1 (en) Image semantic segmentation method, computer device, and storage medium
Seetharaman A block-oriented restoration in gray-scale images using full range autoregressive model
CN114095728B (en) End-to-end video compression method, device and computer readable storage medium
CN115170451A (en) Sky background replacing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant