CN114650421A - Video processing method and device, electronic equipment and storage medium - Google Patents

Video processing method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114650421A
CN114650421A (application CN202011507127.4A)
Authority
CN
China
Prior art keywords
video
picture
content
features
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011507127.4A
Other languages
Chinese (zh)
Inventor
徐异凌
晏航
何大治
孙军
黄成
朱兴昌
陈颖川
尹芹
张宇
朱伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
ZTE Corp
Original Assignee
Shanghai Jiaotong University
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University, ZTE Corp filed Critical Shanghai Jiaotong University
Priority to CN202011507127.4A priority Critical patent/CN114650421A/en
Priority to PCT/CN2021/138819 priority patent/WO2022127865A1/en
Publication of CN114650421A publication Critical patent/CN114650421A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00Details of colour television systems
    • H04N9/12Picture reproducers
    • H04N9/30Picture reproducers using solid-state colour display devices

Abstract

The embodiment of the invention relates to the field of video and discloses a video processing method and apparatus, an electronic device and a storage medium. The method comprises the following steps: extracting initial picture features from each region of a video picture; calculating the content saliency of each region according to the initial picture features; marking content salient regions in the video picture according to the content saliency of each region; and encoding the content salient regions in a first mode and the non-content salient regions in a second mode, where the picture quality of the first mode of encoding is higher than that of the second mode. The video processing method in the embodiment of the invention can reduce the video data volume and the video transmission time while preserving the user's viewing experience, and meets users' real-time interaction requirements.

Description

Video processing method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the field of videos, in particular to a video processing method and device, electronic equipment and a storage medium.
Background
With the development of internet technology, internet users' demand for media consumption is growing day by day, and high-quality media content as well as emerging media such as Virtual Reality (VR) and cloud video-on-demand are gradually becoming mainstream.
In related video processing technologies, in order to guarantee the user's viewing experience, the media publisher sends video with high picture quality to the user side; because high-quality video carries a large amount of data, transmitting it inevitably takes a long time.
The related video processing technology therefore has the following problem: to guarantee picture quality, the amount of video data to be transmitted is huge, which conflicts with users' real-time interaction requirements.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a video processing method and apparatus, an electronic device and a storage medium, so as to reduce the video data volume and the video transmission time while preserving the user's viewing experience and meeting users' real-time interaction requirements.
In order to achieve the above object, an embodiment of the present application provides a video processing method, including the following steps: extracting initial picture characteristics from each area of a video picture; calculating the content significance of each region according to the initial picture characteristics; marking a content salient region in a video picture according to the content saliency of each region; coding a content salient region in a first mode, and coding a non-content salient region in a second mode; the picture quality of the first mode encoding is higher than the picture quality of the second mode encoding.
In order to achieve the above object, an embodiment of the present application further provides a video processing apparatus, including: the extraction module is used for extracting initial picture characteristics from each area of a video picture; the calculation module is used for calculating the content significance of each area according to the initial picture characteristics; the marking module is used for marking the content salient regions in the video pictures according to the content saliency of each region; the coding module is used for coding the content salient region in a first mode and coding the non-content salient region in a second mode; the picture quality coded in the first way is higher than the picture quality coded in the second way.
In order to achieve the above object, an embodiment of the present application further provides an electronic device, including: at least one processor; a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video processing method described above.
To achieve the above object, an embodiment of the present application further provides a computer-readable storage medium storing a computer program, and the computer program is executed by a processor to implement the above video processing method.
According to the video processing method, the content saliency of each region is calculated from the initial picture features extracted from each region of a video picture, the content salient regions in the video picture are marked according to the content saliency of each region, the content salient regions are encoded in a first mode and the non-content salient regions in a second mode. Because the picture quality of the first mode of encoding is higher than that of the second mode, the content salient regions retain higher picture quality while the non-content salient regions have lower picture quality. Since the picture quality of a video is in direct proportion to its data volume, encoding the non-content salient regions in the lower-quality second mode reduces their encoded data volume and therefore the data volume of the whole video. Moreover, because the attention of the human visual system is mainly focused on salient objects or regions, the user's viewing experience can be guaranteed as long as the picture quality of the content salient regions is preserved, even if the picture quality of the non-content salient regions is reduced. The video processing method can therefore reduce the video data volume without degrading the user's perception, shorten the transmission time required by the video, and meet users' real-time interaction requirements.
Drawings
Fig. 1 is a flow chart of a video processing method provided according to a first embodiment of the invention;
FIG. 2 is a schematic illustration of a mask provided according to a first embodiment of the present invention;
fig. 3 is a flow chart of a video processing method according to a second embodiment of the invention;
FIG. 4 is a schematic diagram of an algorithmic network framework provided in accordance with a second embodiment of the present invention;
FIG. 5 is a diagram of an inter-frame feature competition module according to a second embodiment of the present invention;
FIG. 6 is a schematic diagram of a compression and activation model provided in accordance with a second embodiment of the present invention;
FIG. 7 is a schematic diagram of a self-attention model provided in accordance with a second embodiment of the present invention;
FIG. 8 is a diagram of a hierarchical feature competition module provided in accordance with a second embodiment of the present invention;
fig. 9 is a schematic structural diagram of a video processing apparatus according to a third embodiment of the present invention;
fig. 10 is a schematic diagram of an electronic device provided in accordance with a fourth embodiment of the invention.
Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments of the present application in order to provide a better understanding of the application; however, the technical solutions claimed in the present application can be implemented even without these technical details and with various changes and modifications based on the following embodiments. The following embodiments are divided for convenience of description and should not constitute any limitation on the specific implementation manner of the present application; the embodiments may be combined with and refer to one another where there is no contradiction.
A first embodiment of the present invention relates to a video processing method, and a specific flow is as shown in fig. 1:
step 101, extracting initial picture characteristics from each area of a video picture;
102, calculating the content significance of each area according to the initial picture characteristics;
103, marking a content salient region in a video picture according to the content saliency of each region;
104, coding a content salient region in a first mode, and coding a non-content salient region in a second mode; the picture quality of the first mode encoding is higher than the picture quality of the second mode encoding.
The video processing method of this embodiment is applied to a video encoder. The video encoder compresses and encodes video data to meet storage and transmission requirements. It may be a video encoder used by a media platform, such as a video-on-demand platform or a game platform, to encode video before transmission. When a user requests a video through the video-on-demand platform, or plays a game requiring real-time interaction (such as a VR-related game) through the game platform, the platform needs to send the requested video or the video required for game interaction to the user; if high definition is required, the data volume of the video is huge. The video processing method encodes the regions of a video picture non-uniformly: regions with salient content, i.e. the regions the user attends to, are encoded with higher quality, while regions without salient content are encoded with lower quality, producing a new video with a smaller data volume that is transmitted to the user for viewing or interaction.
The implementation details of the video processing method of this embodiment are described below; the following is provided only for ease of understanding and is not required to implement this embodiment.
The video processing method of the present application can be implemented by constructing an "encoder - gated recurrent unit - decoder" algorithmic network framework. The Gated Recurrent Unit (GRU) is a lightweight recurrent neural network; the video encoder may construct the algorithmic network using the lightweight Residual Network 18 (ResNet18) and depthwise separable convolutions.
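A minimal PyTorch sketch of such an "encoder - gated recurrent unit - decoder" framework is shown below. The class name, channel sizes, the use of torchvision's ResNet18 backbone and the single-step upsampling are illustrative assumptions, not the patent's actual implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SaliencyNet(nn.Module):
    """Sketch of an encoder - GRU - decoder saliency network (assumed layout)."""
    def __init__(self, hidden_channels=256):
        super().__init__()
        backbone = resnet18(weights=None)
        # Encoder: convolution, pooling and residual blocks of ResNet18
        # (everything up to, but not including, the classification head).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # Lightweight recurrent unit applied per spatial location (region).
        self.gru = nn.GRUCell(512, hidden_channels)
        # Decoder: depthwise separable convolution + upsampling back to a
        # single-channel saliency map in [0, 1].
        self.decoder = nn.Sequential(
            nn.Conv2d(hidden_channels, hidden_channels, 3, padding=1,
                      groups=hidden_channels),           # depthwise
            nn.Conv2d(hidden_channels, 64, 1),            # pointwise
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, frame, hidden=None):
        feat = self.encoder(frame)                 # B x 512 x H/32 x W/32
        b, c, h, w = feat.shape
        tokens = feat.permute(0, 2, 3, 1).reshape(b * h * w, c)
        if hidden is None:
            hidden = torch.zeros(b * h * w, self.gru.hidden_size,
                                 device=frame.device)
        hidden = self.gru(tokens, hidden)          # temporal state per region
        fused = hidden.reshape(b, h, w, -1).permute(0, 3, 1, 2)
        return self.decoder(fused), hidden         # per-pixel saliency, GRU state
```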
In step 101, the video encoder extracts initial picture features from each region of the video picture through the encoder part of the network, whose convolutional layers, pooling layers and residual blocks extract the initial picture features of each region. The region division specification may use the encoder's default setting, or be adjusted by changing the specification parameters. The encoder may acquire each region of the video picture in a sliding-window manner to extract the initial picture features. One or more initial picture features may be extracted, and they may exist in the form of vectors or matrix arrays. By extracting picture features, the video encoder obtains a digitized representation of the picture content that facilitates the subsequent calculation.
In step 102, the video encoder calculates the content saliency of each region from the initial picture features: the initial picture features of each region are processed by the GRU and the decoder to obtain the content saliency of each region. The content saliency of each region may be classified as salient or non-salient.
In one example, the video encoder may calculate a content saliency value of each region according to the initial picture feature, and obtain the content saliency of each region according to the content saliency value of each region. For example, the video encoder may classify the content saliency of an area having a content saliency value greater than a preset threshold as salient and the content saliency of an area having a content saliency value not greater than the preset threshold as non-salient. And marking the area with the content significance numerical value larger than the preset threshold as a content significant area, and marking the area with the content significance numerical value not larger than the preset threshold as a non-content significant area.
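A hedged sketch of this thresholding step follows: regions whose content saliency value exceeds a preset threshold are marked as content salient regions. The threshold value and the JSON structure of the "indication file" are assumptions chosen for illustration.

```python
import json
import numpy as np

def mark_salient_regions(saliency, threshold=0.5):
    """saliency: 2D array of per-region content saliency values in [0, 1]."""
    salient_mask = saliency > threshold
    rows, cols = np.nonzero(salient_mask)
    # Indication file: position information of the content salient regions.
    indication = {"salient_regions": [[int(r), int(c)] for r, c in zip(rows, cols)]}
    return salient_mask, json.dumps(indication)
```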
Specifically, the video encoder adaptively learns, by means of supervised learning, a mapping function from the initial picture features to the target domain of content saliency values. The mapping function Y is expressed as m = Y(z), where m is the content saliency value and z is the initial picture feature. In the training stage of the algorithm, the video encoder samples the training data set according to a Gaussian distribution, obtains an initial random function from the sampled data, and obtains the final mapping function Y by adaptive learning. Further, the video encoder evaluates the difference between the predicted content saliency value and the ground-truth value in the training data through a loss function, and minimizes the preset loss function through a gradient descent algorithm to obtain the mapping function Y. The training stage performs training simulation on a large data set, and the preset loss function is loss = α·KL(p, s) + β·NSS(p, s) + γ·CC(p, s), where loss denotes the loss function and α, β and γ are multiplicative coefficients whose experimentally obtained optimal values are 1, 0.1 and 0.1 respectively. The three metrics are defined as follows:
KL(p, s) = Σ_i s(x_i) · log( s(x_i) / (p(x_i) + ε) )
NSS(p, s) = (1/n) · Σ_i ( p(x_i) − μ(p) ) / σ(p), summed over the points x_i whose ground-truth saliency value exceeds the preset threshold
CC(p, s) = cov(p, s) / ( σ(p) · σ(s) )
In the above formulas, x_i denotes a pixel point, N denotes the number of pixel points, n denotes the total number of points whose content saliency value in the ground truth exceeds the preset threshold, and μ denotes the mathematical expectation. p denotes the predicted content saliency value of the region where a pixel point is located, and s denotes the ground-truth content saliency value of that region in the training data set. KL(p, s) measures the distributional difference between the predicted and ground-truth saliency values, NSS(p, s) measures the prediction accuracy at the most salient pixel points, and CC(p, s) measures how consistent the linear variation trend of the prediction is with that of the ground truth. In the training stage, the video encoder uses an initial random function as the mapping function Y, substitutes it into the calculation of the content saliency values to obtain predicted values, evaluates the difference between the predicted values and the ground-truth values in the training data with the loss function, adjusts the mapping function in the direction that reduces the difference, and iterates this calculation until the difference between the predicted and ground-truth values is sufficiently small, at which point this mapping function is taken as the finally applied mapping function Y. In the application stage of the algorithm, the video encoder directly maps the features extracted by the encoder to the target domain of content saliency values by convolution and pooled up-sampling operations to obtain the content saliency value results.
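The sketch below illustrates the preset loss loss = α·KL(p, s) + β·NSS(p, s) + γ·CC(p, s) with α = 1, β = 0.1, γ = 0.1 as stated above. The normalisation details, the use of the mean as a stand-in for the preset threshold, and the sign convention for the similarity terms NSS and CC (subtracted here so that a smaller loss means better agreement) are assumptions, not the patent's exact formulation.

```python
import torch

def saliency_loss(p, s, alpha=1.0, beta=0.1, gamma=0.1, eps=1e-8):
    """p: predicted saliency map, s: ground-truth saliency map (same shape)."""
    p_dist = p / (p.sum() + eps)                      # normalise to distributions
    s_dist = s / (s.sum() + eps)
    kl = (s_dist * torch.log(eps + s_dist / (p_dist + eps))).sum()
    p_norm = (p - p.mean()) / (p.std() + eps)         # standardised prediction
    fixations = s > s.mean()                          # assumed stand-in for the preset threshold
    nss = p_norm[fixations].mean()
    cc = ((p - p.mean()) * (s - s.mean())).mean() / (p.std() * s.std() + eps)
    # NSS and CC are similarity measures, so they enter with a negative sign here.
    return alpha * kl - beta * nss - gamma * cc
```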
In step 103, the video encoder marks the content salient regions in the video picture according to the content saliency of each region. Wherein the video encoder may generate an indication file for marking the salient content areas of the video picture. When the content significance of each area is: when the video picture is salient or not salient, the indication file can contain the position information of the salient region in each region of the video picture.
In step 104, the video encoder encodes the content salient regions in the first mode and the non-content salient regions in the second mode, where the picture quality of the first mode of encoding is higher than that of the second mode. The first and second modes of encoding may compress the video picture to different degrees; for example, different quantization parameter (QP) values may be set for the two modes. After the content salient regions are encoded and compressed in the first mode, their picture definition is higher than that of the non-content salient regions encoded and compressed in the second mode. Specifically, the video encoder may encode each region of the video picture according to the position information of the salient regions in the indication file.
In one example, the video encoder may further grade the content saliency according to the content saliency values; for example, the content saliency may be divided into primary saliency, secondary saliency, non-saliency, and so on. Regions whose content saliency value is not greater than the preset threshold are classified as non-salient, regions whose value is greater than a first preset threshold as primary salient, regions whose value is greater than a second preset threshold as secondary salient, and so on. The indication file then contains the position information of the different regions of the video picture and their corresponding content saliency levels. The video encoder may encode each region of the video picture according to the position information of the regions corresponding to the different saliency levels in the indication file. If different saliency levels are set for the content salient regions, the QP values can be set linearly according to those levels, so that the video picture is compressed to different degrees.
In one example, the video encoder may modify a mask interface in an existing video encoding standard, and encode the video using the modified mask interface. The video encoder modifies the existing binary coding into linear non-uniform coding, and the masks before and after modification are shown in fig. 2. If the flag value in the mask is different for each region, the video encoder assigns a different QP value for each region. Where a large value of the flag indicates that the content is more salient, the video encoder allocates a smaller QP value for the region. Preferably, the video encoder may linearly allocate the QP value for each region according to a linear relationship between the content saliency values of each region.
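The following is a sketch of linearly allocating QP values from content saliency flags as described above: larger flag values (more salient content) receive smaller QP values and therefore higher quality. The flag scale and the QP range are illustrative assumptions.

```python
def allocate_qp(flag, flag_max=4, qp_high_quality=22, qp_low_quality=42):
    """Map a region's saliency flag (0..flag_max) linearly to a QP value."""
    flag = max(0, min(flag, flag_max))
    # flag = 0        -> qp_low_quality  (second mode, lower quality)
    # flag = flag_max -> qp_high_quality (first mode, higher quality)
    return round(qp_low_quality - (qp_low_quality - qp_high_quality) * flag / flag_max)
```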
In one example, before extracting the initial picture features from each region of a video picture, the video encoder extracts a video frame from the video to be processed to obtain the video picture, and then extracts the initial picture features from each region of the extracted frame. The video encoder may perform the video processing of steps 101 to 104 of this embodiment on each frame of the video.
Further, before extracting the initial picture features from each region of a video picture, the video encoder extracts video frames from the video at a preset frame interval to obtain the video pictures of those frames. The preset frame interval may be any natural number other than 0. With a preset frame interval of N, if the frame extracted this time is the 10th frame, the next frame extracted by the video encoder is the (10+N)th frame.
In one example, after encoding the regions of a video picture is complete, the video encoder acquires the video picture of the next frame and encodes its regions. If the preset interval of the video encoder is greater than 0, the video encoder may encode the regions of the video pictures of the frames lying between the current video frame and the next extracted video frame according to the encoding scheme of the current frame, where the encoding scheme is the correspondence, recorded in the indication file, between the position of each region of the video picture and its content saliency and encoding mode. For example, if the current video frame is the 1000th frame of the video and the preset interval is 5, the video encoder encodes the regions of the video pictures of the 1001st, 1002nd, 1003rd and 1004th frames according to the encoding scheme of the 1000th frame; that is, it encodes the corresponding regions of those frames according to the position information of the regions that were encoded in the first mode in the video picture of the 1000th frame.
In another example, if the preset interval of the video encoder is greater than 0, the video encoder may also extract the next video frame according to the preset frame interval and process it with the video processing method of this embodiment. It then obtains the content saliency values of the regions of the two frames' video pictures from the encoding schemes of the current frame and of the next frame, and linearly allocates content saliency values to the regions of the frames lying between them according to those two sets of values, thereby obtaining the content saliency values of each region of the intermediate frames, as shown in the sketch below. For example, the current video frame is the 1000th frame of the video and the preset interval is 5, i.e. the next video frame is the 1005th frame and the frames between them are the 1001st, 1002nd, 1003rd and 1004th frames. After finishing the encoding of the 1000th frame, the video encoder encodes the 1005th frame and, from the two encoding schemes, acquires the content saliency value of region A at the same position in the 1000th and 1005th frames. If the content saliency value of region A is 0 in the 1000th frame and 5 in the 1005th frame, then its value is 1 in the 1001st frame, 2 in the 1002nd frame, 3 in the 1003rd frame and 4 in the 1004th frame. The video encoder then grades and encodes the content saliency of each region of each intermediate frame's video picture according to the calculated content saliency values.
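A short sketch of this linear allocation of content saliency values for the frames between two processed key frames follows; frame indexing and the shape of the saliency lists are assumptions.

```python
def interpolate_saliency(sal_start, sal_end, num_between):
    """Linearly interpolate per-region saliency values for intermediate frames."""
    results = []
    for i in range(1, num_between + 1):
        t = i / (num_between + 1)
        results.append([(1 - t) * a + t * b for a, b in zip(sal_start, sal_end)])
    return results

# Example matching the text: region A goes from 0 (frame 1000) to 5 (frame 1005),
# giving values 1, 2, 3, 4 for frames 1001-1004.
print(interpolate_saliency([0.0], [5.0], 4))  # [[1.0], [2.0], [3.0], [4.0]]
```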
In this embodiment, the initial picture features are extracted from each region of the video picture, the content saliency of each region is calculated from them, the content salient regions are marked in the video picture according to the content saliency of each region, and the content salient regions are encoded in the first mode while the non-content salient regions are encoded in the second mode. Since the picture quality of a video is in direct proportion to its data volume, encoding the non-content salient regions in the lower-quality second mode reduces their encoded data volume and therefore the data volume of the whole video. Moreover, because the attention of the human visual system is mainly focused on salient objects or regions, the user's viewing experience can be guaranteed as long as the picture quality of the content salient regions is preserved, even if the picture quality of the non-content salient regions is reduced. The video processing method can therefore reduce the video data volume without degrading the user's perception, shorten the transmission time required by the video, and meet users' real-time interaction requirements.
The steps of the above method are divided only for clarity of description; in implementation they may be combined into one step, or a step may be split into multiple steps, and as long as the same logical relationship is preserved, such variants fall within the protection scope of this patent. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or process without changing its core design also falls within the protection scope of this patent.
A second embodiment of the present invention relates to a video processing method. The second embodiment is substantially the same as the first embodiment, with the main difference that in the first embodiment the content saliency of each region is calculated from the initial picture features, whereas in the second embodiment the content saliency is calculated from the time dimension features and the space dimension features.
The present embodiment relates to a video processing method. The specific flow is shown in fig. 3:
step 301, extracting initial picture features from each area of a video picture;
step 302, acquiring a time dimension characteristic and a space dimension characteristic according to the initial picture characteristic;
step 303, calculating the content significance according to the time dimension characteristics and the space dimension characteristics;
step 304, marking the content salient regions in the video picture according to the content saliency of each region;
step 305, coding a content salient region in a first mode, and coding a non-content salient region in a second mode; the picture quality coded in the first way is higher than the picture quality coded in the second way.
Step 301, step 304, and step 305 are substantially the same as step 101, step 103, and step 104 in the first embodiment, and are not described again.
The video processing method of this embodiment may be implemented by constructing an algorithmic network framework as shown in fig. 4, where the input Frame t-1 is the previous video frame, the input Frame t is the current video frame, IFCM is the Inter-Frame Feature Competition Module used to acquire the time dimension features, HFCM is the Hierarchical Feature Competition Module used to acquire the space dimension features, and CDFE is the Consistent and Differentiated Features Extraction module.
In step 302, the video encoder obtains the time dimension features and the space dimension features from the initial picture features.
In one example, the video encoder acquires the time dimension features as follows: consistency features and difference features are obtained according to the consistency and difference between the initial picture features and the initial picture features of the corresponding region of the previous video picture, and the consistency features and difference features are weighted and fused to obtain the time dimension features. The video encoder may acquire the time dimension features through an inter-frame feature competition module as shown in fig. 5. The inter-frame feature competition module performs correlation operations such as dot multiplication and weighted optimization on the initial picture features of each region of the current frame's video picture and the initial picture features of the corresponding region of the previous video picture through the correlation layer of an optical flow network (FlowNet), a squeeze-and-excitation module and a self-attention module, obtaining similar feature representations of the previous and current video frames at each local spatial position (i.e. region), so that the video encoder can calculate the content saliency of each region of the video picture. Using the squeeze-and-excitation model SE shown in fig. 6, the video encoder applies global pooling and a Sigmoid function after convolution with ReLU activation to obtain an activation value for each dimension of the initial picture features, and indicates the spatial position of each dimension feature according to the initial picture features of the previous video picture, thereby weighting and optimizing, dimension by dimension, the features computed by the correlation layer; the activation value serves as the weight and lies in the range [0, 1]. The video encoder further performs spatial weighted optimization with a Sigmoid function by the residual addition used in the self-attention model SA shown in fig. 7, computes the consistency and difference between the initial picture features and the initial picture features of the corresponding region of the previous video picture to obtain the consistency features and difference features, and then, through the convolution, concatenation and nonlinear activation operations of a Gated Recurrent Unit (GRU), performs the weighted fusion of the consistency and difference features to obtain the time dimension features. The weighted fusion formula is f_time = SA(SE(Cat(δ(W_corr · f_corr), δ(W_diff · f_diff)))), where f_time is the time dimension feature, f_corr the consistency feature, f_diff the difference feature, and W_corr and W_diff the parameters to be learned by convolution. In this formula the SE function is SE(x) = F_scale(x, σ(g(W_2 · δ(W_1 · x)))) and the SA function is SA(x) = x + σ(W · x), where δ denotes the nonlinear activation function ReLU, σ the nonlinear activation function Sigmoid, g the global pooling operation, Cat the concatenation operation, W, W_1 and W_2 the parameters to be learned by convolution, and F_scale the dot multiplication over the feature dimension.
Specifically, the video encoder calculates a consistency mask and a difference mask from the initial picture features and the initial picture features of the corresponding region of the previous video picture; performs dot multiplication of the initial picture features of the corresponding region of the previous video picture with the consistency mask to obtain the consistency features; and performs dot multiplication of the initial picture features with the difference mask to obtain the difference features. The calculation formulas of the consistency mask, the consistency features and the difference features are as follows: M_corr = SA(W · Cat(f_{t-1}, SE(Corr(f_t, f_{t-1})))), where the Corr function is calculated as Corr(x_1, x_2) = Σ_{o ∈ [−k, k] × [−k, k]} f_{t-1}(x_1 + o) · f_t(x_2 + o). In these formulas Corr denotes the correlation layer of the optical flow network, f_t and f_{t-1} denote the initial picture features of a region of the current frame's video picture and of the corresponding region of the previous video picture, [−k, k] denotes the spatial range of x_1 and x_2 evaluated at the respective region positions, and M_corr is the consistency Mask obtained from the attention network, which uses the correlation layer and the squeeze-and-excitation network to express the consistency of each region between the previous and current video frames. The video encoder obtains the consistency and difference features as f_corr = f_{t-1} · M_corr and f_diff = f_t · (1 − M_corr), where f_corr and f_diff are the extracted consistency and difference features and 1 − M_corr is the difference mask.
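A condensed PyTorch sketch of an inter-frame feature competition module in this spirit is given below. It is a simplification under stated assumptions: the optical-flow correlation layer is replaced by a pointwise product between co-located features, the GRU's convolution/concatenation/activation steps are folded into plain convolutions, and a Sigmoid is applied after SA so that M_corr behaves as a mask in [0, 1]; all class names and channel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SE(nn.Module):                      # SE(x) = F_scale(x, σ(g(W2 δ(W1 x))))
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, x):
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(F.adaptive_avg_pool2d(x, 1)))))
        return x * w                      # per-channel activation values in [0, 1]

class SA(nn.Module):                      # residual self-attention, SA(x) = x + σ(Wx) · x
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, 1)

    def forward(self, x):
        return x + x * torch.sigmoid(self.conv(x))

class IFCM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.se_corr = SE(channels)
        self.sa_mask = SA(channels)
        self.mask_conv = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.w_corr = nn.Conv2d(channels, channels, 3, padding=1)
        self.w_diff = nn.Conv2d(channels, channels, 3, padding=1)
        self.se_fuse = SE(2 * channels)
        self.sa_fuse = SA(2 * channels)
        self.out = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_t, f_prev):
        # Pointwise stand-in for the correlation layer Corr between co-located regions.
        corr = self.se_corr(f_t * f_prev)
        m_corr = torch.sigmoid(self.sa_mask(
            self.mask_conv(torch.cat([f_prev, corr], dim=1))))    # consistency mask
        f_corr = f_prev * m_corr                                   # consistency features
        f_diff = f_t * (1.0 - m_corr)                              # difference features
        # f_time = SA(SE(Cat(δ(W_corr f_corr), δ(W_diff f_diff))))
        fused = torch.cat([F.relu(self.w_corr(f_corr)),
                           F.relu(self.w_diff(f_diff))], dim=1)
        return self.out(self.sa_fuse(self.se_fuse(fused)))        # time dimension features
```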
In this embodiment, consistency features and difference features are obtained according to consistency and difference between initial picture features and initial picture features of a corresponding region of a previous video picture, and the consistency features and the difference features are weighted and fused according to convolution operation, cascade operation, and nonlinear activation function calculation, so as to obtain time dimension features. The consistency and the difference of each region of the video picture and each corresponding region of the previous video picture can reflect the dynamic change of the picture content of each region of the video picture in the time dimension, so the time dimension characteristics obtained based on the consistency characteristics and the difference characteristics fully mine the characteristics of a human visual system in the time dimension, and the accuracy of content significance calculation can be further improved.
In one example, the video encoder obtains the space dimension features as follows: low-level features and high-level semantic features are acquired from the initial picture features, where the low-level features are obtained by shallow recognition of the video picture and the high-level semantic features are obtained by content recognition of the video picture; the low-level features and high-level semantic features are then weighted and fused to obtain the space dimension features. The low-level features describe the contour, edges, chroma, contrast, texture, shape and the like of the picture content, and the high-level semantic features are obtained by semantic recognition of the picture content, such as people, cars, trees and wolves. The video encoder may obtain the space dimension features through a hierarchical feature competition module as shown in fig. 8. The hierarchical feature competition module uses the squeeze-and-excitation model SE, applying global pooling and a Sigmoid function after convolution with ReLU activation, to obtain an activation value for each dimension of the initial picture features, the activation value lying in the range [0, 1]. The video encoder also performs spatial weighted fusion with a Sigmoid function by the residual addition of the self-attention model SA to obtain the space dimension features. The video encoder extracts multi-level low-level features from the encoder and high-level semantic features from the decoder, and performs the weighted fusion of the low-level and high-level semantic features according to f_fuse = SA(SE(Cat(δ(W_low · f_low), δ(W_high · f_high)))), where f_low and f_high are the extracted low-level and high-level semantic features and W_low and W_high are the parameters to be learned by convolution.
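A matching sketch of a hierarchical feature competition module follows, implementing the weighted fusion f_fuse = SA(SE(Cat(δ(W_low f_low), δ(W_high f_high)))). It reuses the SE and SA helper modules and imports from the IFCM sketch above; assuming both inputs carry the same number of channels and that the high-level features need spatial upsampling are illustrative choices, not the patent's exact layout.

```python
class HFCM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.w_low = nn.Conv2d(channels, channels, 3, padding=1)
        self.w_high = nn.Conv2d(channels, channels, 3, padding=1)
        self.se = SE(2 * channels)
        self.sa = SA(2 * channels)
        self.out = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_low, f_high):
        # Spatially align the high-level semantic features with the low-level ones.
        f_high = F.interpolate(f_high, size=f_low.shape[-2:], mode="bilinear",
                               align_corners=False)
        # f_fuse = SA(SE(Cat(δ(W_low f_low), δ(W_high f_high))))
        fused = torch.cat([F.relu(self.w_low(f_low)),
                           F.relu(self.w_high(f_high))], dim=1)
        return self.out(self.sa(self.se(fused)))    # space dimension features
```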
In the embodiment, the low-level features and the high-level semantic features are obtained according to the initial picture features, the low-level features and the high-level semantic features are subjected to weighted fusion according to convolution operation, cascade operation and nonlinear activation function calculation, and the spatial dimension features are obtained.
In step 303, the video encoder calculates the content saliency from the temporal dimensional features and the spatial dimensional features.
Specifically, the video encoder calculates the content saliency values from the time dimension features and the space dimension features as follows. After the time dimension features and space dimension features are obtained, a mapping function from the feature domain of the time and space dimension features to the target domain of content saliency values is obtained by supervised, adaptive learning. The mapping function Y' is expressed as m = Y'(z'_1, z'_2), where m is the content saliency value, z'_1 the time dimension feature and z'_2 the space dimension feature. In the training stage of the algorithm, the video encoder samples the training data set according to a Gaussian distribution to obtain an initial random function, and obtains the mapping function Y' by adaptive learning. Further, the video encoder evaluates the difference between the predicted content saliency value and the ground-truth value in the training data through a loss function, and minimizes the preset loss function through adaptive learning with a gradient descent algorithm to obtain the mapping function Y'. The training stage performs training simulation on a large data set, and the preset loss function is loss = α·KL(p, s) + β·NSS(p, s) + γ·CC(p, s), where loss denotes the loss function and α, β and γ are multiplicative coefficients whose experimentally obtained optimal values are 1, 0.1 and 0.1 respectively. The three metrics are defined as follows:
KL(p, s) = Σ_i s(x_i) · log( s(x_i) / (p(x_i) + ε) )
NSS(p, s) = (1/n) · Σ_i ( p(x_i) − μ(p) ) / σ(p), summed over the points x_i whose ground-truth saliency value exceeds the preset threshold
CC(p, s) = cov(p, s) / ( σ(p) · σ(s) )
The symbols have the same meanings as in the first embodiment.
In the training stage, the video encoder uses the initial random function as the mapping function Y', substitutes it into the calculation of the content saliency values to obtain predicted values, evaluates the difference between the predicted values and the ground-truth values in the training data with the loss function, adjusts the mapping function in the direction that reduces the difference, and iterates this calculation until the difference between the predicted and ground-truth values is sufficiently small, at which point this mapping function is taken as the finally applied mapping function Y'. In the application stage of the algorithm, the trained video encoder directly maps the extracted time dimension features and space dimension features to the target domain of content saliency values using convolution and pooled up-sampling operations to obtain the content saliency value results.
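The minimal sketch below illustrates this training stage: the network (the mapping function) starts from random initialisation, predicted saliency values are compared against the ground truth with the preset loss, and the parameters are adjusted iteratively by gradient descent. The optimiser, learning rate, data loading, and the reuse of the `SaliencyNet` and `saliency_loss` sketches from the first embodiment are assumptions for illustration.

```python
import torch

def train(model, data_loader, num_epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for frames, truth in data_loader:            # sampled training data set
            pred, _ = model(frames)                  # predicted content saliency values
            loss = saliency_loss(pred, truth)        # preset loss from the earlier sketch
            optimizer.zero_grad()
            loss.backward()                          # adjust toward a smaller difference
            optimizer.step()
    return model
```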
In this embodiment, the time dimension feature and the space dimension feature are obtained according to the initial picture feature, and the calculation of the content saliency is performed, so that the picture content showing the saliency in different dimensions can be obtained according to different features shown by the video picture in time and space, and the accuracy of the calculation of the content saliency is improved.
The steps of the above method are divided only for clarity of description; in implementation they may be combined into one step, or a step may be split into multiple steps, and as long as the same logical relationship is preserved, such variants fall within the protection scope of this patent. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or process without changing its core design also falls within the protection scope of this patent.
A third embodiment of the present invention relates to a video processing apparatus, as shown in fig. 9, including:
an extracting module 901, configured to extract an initial picture feature from each region of a video picture;
a calculating module 902, configured to calculate content saliency of each region according to the initial picture feature;
a marking module 903, configured to mark a content salient region in a video picture according to the content saliency of each region;
an encoding module 904, configured to perform a first mode encoding on the content salient region and perform a second mode encoding on the non-content salient region; the picture quality of the first mode encoding is higher than the picture quality of the second mode encoding.
In an example, the calculating module 902 is specifically configured to obtain a time dimension feature and a space dimension feature according to the initial picture feature, and calculate the content saliency according to the time dimension feature and the space dimension feature.
In one example, the calculation module 902 further comprises: the inter-frame feature competition submodule is used for acquiring consistency features and difference features according to consistency and difference between the initial picture features and the initial picture features of a corresponding region of a previous video picture; and weighting and fusing the consistency characteristics and the difference characteristics to obtain time dimension characteristics.
In an example, the calculating module 902 is further configured to calculate a consistency mask and a difference mask according to the initial picture features and the initial picture features of the corresponding region of the previous video picture; performing dot multiplication on the initial picture characteristics of the corresponding area of the previous video picture and the consistency mask to obtain consistency characteristics; and performing dot multiplication on the initial picture features and the difference masks to obtain difference features.
In one example, the calculation module 902 further comprises: the hierarchical feature competition submodule is used for acquiring low-level features and high-level semantic features according to the initial picture features; and according to convolution operation, cascade operation and nonlinear activation function calculation, performing weighted fusion on the low-level features and the high-level semantic features to obtain the spatial dimension features.
In one example, the calculating module 902 is further configured to calculate a content saliency value of each region according to the initial picture feature; the marking module 903 is further configured to mark an area with a content saliency value greater than a preset threshold as a content saliency area.
In one example, the video processing apparatus further includes an extraction module, configured to extract video frames from the video to be processed at a preset frame interval before extracting the initial picture features from each region of the video picture; the extracting module 901 is further configured to extract initial picture features from each region of the extracted video frame.
It should be noted that, in practical applications, one logical unit may be one physical unit, may be a part of one physical unit, and may also be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, elements that are not so closely related to solving the technical problems proposed by the present invention are not introduced in the present embodiment, but this does not indicate that other elements are not present in the present embodiment.
A fourth embodiment of the present invention relates to an electronic apparatus, as shown in fig. 10, including: at least one processor 1001; a memory 1002 communicatively coupled to the at least one processor; the memory 1002 stores instructions executable by the at least one processor 1001, and the instructions are used by the at least one processor 1001 to perform the video processing method.
The memory 1002 and the processor 1001 are coupled by a bus, which may comprise any number of interconnecting buses and bridges that interconnect one or more of the various circuits of the processor 1001 and the memory 1002. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, etc., which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Information processed by processor 1001 is transmitted over a wireless medium through an antenna, which further receives the information and passes the information to processor 1001.
The processor 1001 is responsible for managing the bus and general processing and may provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 1002 may be used to store information used by the processor in performing operations.
A fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
That is, as can be understood by those skilled in the art, all or part of the steps in the method for implementing the embodiments described above may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A video processing method, comprising:
extracting initial picture characteristics from each area of a video picture;
calculating the content significance of each region according to the initial picture characteristics;
marking the content salient regions in the video picture according to the content saliency of each region;
coding the content salient region in a first mode, and coding a non-content salient region in a second mode;
the picture quality of the first mode encoding is higher than the picture quality of the second mode encoding.
2. The video processing method according to claim 1, wherein said calculating the content saliency of each of the regions according to the initial picture feature comprises:
acquiring a time dimension characteristic and a space dimension characteristic according to the initial picture characteristic;
and calculating the content significance according to the time dimension characteristics and the space dimension characteristics.
3. The video processing method according to claim 2, wherein the time dimension characteristic is obtained by:
obtaining consistency characteristics and difference characteristics according to consistency and difference between the initial picture characteristics and the initial picture characteristics of the corresponding area of the previous video picture;
and weighting and fusing the consistency features and the difference features to obtain the time dimension features.
4. The video processing method according to claim 3, wherein said obtaining the consistency features and the difference features according to the consistency and difference between the initial picture features and the initial picture features of the corresponding region of the previous video picture comprises:
calculating to obtain a consistency mask and a difference mask according to the initial picture characteristics and the initial picture characteristics of the corresponding region of the previous video picture;
carrying out pixel-by-pixel dot multiplication on the initial picture characteristic of the corresponding area of the previous video picture and the consistency mask to obtain the consistency characteristic;
and performing dot multiplication on the initial picture features and the difference masks to obtain the difference features.
5. The video processing method according to claim 2, wherein the space dimension features are obtained by:
acquiring low-level features and high-level semantic features according to the initial picture features; wherein the low-level features are features obtained by shallow layer recognition of the video pictures, and the high-level semantic features are features obtained by content recognition of the video pictures;
and performing weighted fusion on the low-level features and the high-level semantic features to obtain the spatial dimension features.
6. The video processing method according to any of claims 1 to 5, wherein before said extracting initial picture features from regions of a video picture, the method further comprises:
extracting video frames from a video to be processed according to a preset frame interval;
the extracting of the initial picture features from the regions of the video picture comprises:
extracting the initial picture features from the extracted regions of the video frame.
7. The video processing method according to any one of claims 1 to 5, wherein the calculating the content saliency of each region according to the initial picture feature comprises:
calculating the content significance numerical value of each region according to the initial picture characteristics;
the marking the content salient region in the video picture according to the content saliency of each region comprises:
and marking the area with the content significance degree value larger than a preset threshold value as the content significance area.
8. A video processing apparatus, comprising:
the extraction module is used for extracting initial picture characteristics from each area of a video picture;
the calculation module is used for calculating the content significance of each area according to the initial picture characteristics;
the marking module is used for marking the content salient regions in the video pictures according to the content saliency of each region;
the coding module is used for coding the content salient region in a first mode and coding the non-content salient region in a second mode; the picture quality of the first mode encoding is higher than the picture quality of the second mode encoding.
9. An electronic device, comprising:
at least one processor;
a memory communicatively coupled to the at least one processor;
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video processing method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the video processing method of any one of claims 1 to 7.
CN202011507127.4A 2020-12-18 2020-12-18 Video processing method and device, electronic equipment and storage medium Pending CN114650421A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011507127.4A CN114650421A (en) 2020-12-18 2020-12-18 Video processing method and device, electronic equipment and storage medium
PCT/CN2021/138819 WO2022127865A1 (en) 2020-12-18 2021-12-16 Video processing method, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011507127.4A CN114650421A (en) 2020-12-18 2020-12-18 Video processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114650421A true CN114650421A (en) 2022-06-21

Family

ID=81991428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011507127.4A Pending CN114650421A (en) 2020-12-18 2020-12-18 Video processing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114650421A (en)
WO (1) WO2022127865A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240121445A1 (en) * 2022-10-10 2024-04-11 Alibaba Damo (Hangzhou) Technology Co., Ltd. Pre-analysis based image compression methods

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010057170A1 (en) * 2008-11-17 2010-05-20 Cernium Corporation Analytics-modulated coding of surveillance video
CN102801997B (en) * 2012-07-11 2014-06-11 天津大学 Stereoscopic image compression method based on interest depth
CN103618900B (en) * 2013-11-21 2016-08-17 北京工业大学 Video area-of-interest exacting method based on coding information
CN104539962B (en) * 2015-01-20 2017-12-01 北京工业大学 It is a kind of merge visually-perceptible feature can scalable video coding method
CN110324679B (en) * 2018-03-29 2022-03-22 阿里巴巴(中国)有限公司 Video data processing method and device
CN110310343B (en) * 2019-05-28 2023-10-03 西安万像电子科技有限公司 Image processing method and device
CN111193932A (en) * 2019-12-13 2020-05-22 西安万像电子科技有限公司 Image processing method and device

Also Published As

Publication number Publication date
WO2022127865A1 (en) 2022-06-23

Similar Documents

Publication Publication Date Title
JP6842395B2 (en) Use of image analysis algorithms to provide training data to neural networks
US20120082397A1 (en) Contrast enhancement
CN111754596A (en) Editing model generation method, editing model generation device, editing method, editing device, editing equipment and editing medium
KR20210074360A (en) Image processing method, device and apparatus, and storage medium
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
Yang et al. No reference quality assessment of stereo video based on saliency and sparsity
US20230245351A1 (en) Image style conversion method and apparatus, electronic device, and storage medium
CA3137297C (en) Adaptive convolutions in neural networks
CN116205820A (en) Image enhancement method, target identification method, device and medium
CN114650421A (en) Video processing method and device, electronic equipment and storage medium
CN116600119B (en) Video encoding method, video decoding method, video encoding device, video decoding device, computer equipment and storage medium
CN111932458B (en) Image information extraction and generation method based on inter-region attention mechanism
CN111898638B (en) Image processing method, electronic device and medium fusing different visual tasks
CN113781376B (en) High-definition face attribute editing method based on divide-and-congress
CN115567712A (en) Screen content video coding perception code rate control method and device based on just noticeable distortion by human eyes
CN114529785A (en) Model training method, video generation method and device, equipment and medium
CN113554719B (en) Image encoding method, decoding method, storage medium and terminal equipment
CN116362981A (en) Tone mapping method, computer program product, electronic device, and storage medium
CN111510740B (en) Transcoding method, transcoding device, electronic equipment and computer readable storage medium
JP7479507B2 (en) Image processing method and device, computer device, and computer program
WO2023206420A1 (en) Video encoding and decoding method and apparatus, device, system and storage medium
JP2024515907A (en) Image processing method and device, computer device, and computer program
CN116977235A (en) Image fusion method, electronic device and storage medium
CN116546224A (en) Training method of filter network, video coding method, device and electronic equipment
CN116980603A (en) Video data processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination