WO2022127865A1 - 视频处理方法、装置、电子设备及存储介质 - Google Patents

视频处理方法、装置、电子设备及存储介质 Download PDF

Info

Publication number
WO2022127865A1
WO2022127865A1 (PCT/CN2021/138819)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
video
content
picture
area
Prior art date
Application number
PCT/CN2021/138819
Other languages
English (en)
French (fr)
Inventor
徐异凌
晏航
何大治
孙军
黄成
朱兴昌
陈颖川
尹芹
张宇
朱伟
Original Assignee
中兴通讯股份有限公司
上海交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司, 上海交通大学
Publication of WO2022127865A1 publication Critical patent/WO2022127865A1/zh

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00Details of colour television systems
    • H04N9/12Picture reproducers
    • H04N9/30Picture reproducers using solid-state colour display devices

Definitions

  • The embodiments of the present application relate to the field of video, and in particular to a video processing method, apparatus, electronic device and storage medium.
  • In related video processing technology, in order to ensure the user's viewing experience, the media publisher sends high-picture-quality video to the user end. Because high-quality video carries a large amount of data, transmitting it inevitably takes a long time.
  • Video processing technology therefore faces the following problem: to guarantee picture quality, a huge amount of video data must be transmitted, which conflicts with users' real-time interaction requirements.
  • An embodiment of the present application provides a video processing method, including the following steps: extracting initial picture features from each region of a video picture; calculating the content saliency of each region according to the initial picture features; marking the content-salient areas in the video picture according to the content saliency of each region; and encoding the content-salient areas in a first mode and the non-content-salient areas in a second mode, where the picture quality of the first-mode encoding is higher than that of the second-mode encoding.
  • An embodiment of the present application also provides a video processing apparatus, including: an extraction module for extracting initial picture features from each region of a video picture; a calculation module for calculating the content saliency of each region according to the initial picture features; a marking module for marking the content-salient areas in the video picture according to the content saliency of each region; and an encoding module for encoding the content-salient areas in a first mode and the non-content-salient areas in a second mode, where the picture quality of the first-mode encoding is higher than that of the second-mode encoding.
  • An embodiment of the present application further provides an electronic device, including at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above video processing method.
  • Embodiments of the present application further provide a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, implements the above-mentioned video processing method.
  • FIG. 1 is a flowchart of a video processing method provided according to a first embodiment of the present application
  • FIG. 2 is a schematic diagram of a mask provided according to the first embodiment of the present application.
  • FIG. 3 is a flowchart of a video processing method provided according to a second embodiment of the present application.
  • FIG. 4 is a schematic diagram of an algorithm network framework provided according to a second embodiment of the present application.
  • FIG. 5 is a schematic diagram of an inter-frame feature competition module provided according to a second embodiment of the present application.
  • FIG. 6 is a schematic diagram of a compression and activation model provided according to a second embodiment of the present application.
  • FIG. 7 is a schematic diagram of a self-attention model provided according to a second embodiment of the present application.
  • FIG. 8 is a schematic diagram of a hierarchical feature competition module provided according to a second embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a video processing apparatus provided according to a third embodiment of the present application.
  • FIG. 10 is a schematic diagram of an electronic device provided according to a fourth embodiment of the present application.
  • The main purpose of the embodiments of the present application is to propose a video processing method, apparatus, electronic device and storage medium that reduce the amount of video data and the video transmission time, and meet users' real-time interaction requirements, while preserving the user's perceived quality.
  • The video processing method proposed in the present application extracts initial picture features from each region of the video picture, calculates the content saliency of each region from those features, marks the content-salient areas in the video picture according to the content saliency of each region, and encodes the content-salient areas in a first mode and the non-content-salient areas in a second mode. Because the picture quality of the first-mode encoding is higher than that of the second-mode encoding, the content-salient areas have higher picture quality and the non-content-salient areas have lower picture quality. Since picture quality is proportional to the amount of video data, encoding the non-salient areas at lower quality reduces the data volume of the encoded video; and since the human visual system concentrates its attention on salient targets or regions, keeping the salient areas at high quality preserves the viewing experience even when the quality of non-salient areas is reduced.
  • The video processing method of the present application can therefore reduce the amount of video data while preserving the user's perception, reduce the transmission time required for the video, and meet users' real-time interaction requirements.
  • the first embodiment of the present application relates to a video processing method, and the specific process is shown in Figure 1:
  • Step 101: extract initial picture features from each region of the video picture.
  • Step 102: calculate the content saliency of each region according to the initial picture features.
  • Step 103: mark the content-salient areas in the video picture according to the content saliency of each region.
  • Step 104: encode the content-salient areas in the first mode and the non-content-salient areas in the second mode; the picture quality of first-mode encoding is higher than that of second-mode encoding.
  • Video encoders are used to compress and encode video data to meet storage and transmission requirements.
  • The video encoder may be one used by media platforms, such as video-on-demand platforms and game platforms, to encode videos before transmission.
  • When a user orders a video through a video-on-demand platform, or plays a game requiring real-time interaction (such as a VR-related game) through a game platform, the platform needs to send the user the on-demand video or the video required for the game interaction; if high picture definition is required, the data volume of that video is huge.
  • The video processing method of the present application encodes the regions of the video picture non-uniformly: content-salient areas, i.e. the regions of interest to the user, are encoded at higher quality and non-content-salient areas at lower quality, producing a new video with a smaller data volume that is delivered to the user for viewing or interaction.
  • The video processing method of the present application can be implemented by constructing an "encoder - gated recurrent unit - decoder" algorithm network framework.
  • The gated recurrent unit (Gate Recurrent Unit, "GRU") is a lightweight recurrent neural network.
  • The video encoder can build the algorithm network with the lightweight residual network ResNet-18 (Residual Network 18) and depthwise separable convolution.
  • the video encoder extracts initial picture features from each region of the video picture.
  • the video encoder can extract initial picture features from each region of the video picture through the encoder.
  • the encoder extracts the initial picture features of each region through convolutional layers, pooling layers and residual blocks.
  • The regions may be divided using the encoder's default specification, or the division of the video picture into regions may be adjusted by changing the specification parameters.
  • The encoder can obtain each region of the video picture with a sliding-window approach and extract the initial picture features from it.
  • The encoder may extract one or more initial picture features, and the features may exist as vectors or matrix arrays. By extracting picture features, the video encoder obtains a digital representation of the picture content that is convenient for computation.
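  • As an illustration of this step, the sketch below pools a ResNet-18 feature map into a fixed grid so that each cell carries the initial picture features of one region. This is an assumption for illustration, not the patent's actual implementation; the grid size and the region_features helper are hypothetical.
```python
# Minimal sketch (assumption, not the patent's code): per-region initial features
# from a ResNet-18 backbone, one feature vector per grid cell ("region").
import torch
import torch.nn.functional as F
import torchvision.models as models

backbone = models.resnet18(weights=None)                 # untrained weights for the sketch
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc

def region_features(frame, grid=(8, 8)):
    """frame: float tensor [3, H, W] in [0, 1]; returns [grid_h, grid_w, C]."""
    with torch.no_grad():
        fmap = encoder(frame.unsqueeze(0))                # [1, 512, H/32, W/32]
    pooled = F.adaptive_avg_pool2d(fmap, grid)            # one cell per region
    return pooled.squeeze(0).permute(1, 2, 0)             # [gh, gw, 512]

feats = region_features(torch.rand(3, 720, 1280))
print(feats.shape)  # torch.Size([8, 8, 512])
```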
  • In step 102, the video encoder calculates the content saliency of each region according to the initial picture features: it processes the initial picture features of each region through the GRU and the decoder and obtains the content saliency of each region, which can be classified as salient or non-salient.
  • In one example, the video encoder may calculate a content saliency value for each region according to the initial picture features and derive each region's content saliency from that value. For example, regions whose content saliency value is greater than a preset threshold are classified as salient and the rest as non-salient; the former are marked as content-salient areas and the latter as non-content-salient areas.
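  • A minimal sketch of this thresholding step is shown below; the threshold value and the mark_salient_regions helper are illustrative assumptions.
```python
import numpy as np

def mark_salient_regions(saliency, threshold=0.5):
    """saliency: [gh, gw] array of per-region content-saliency values.
    Returns a boolean mask: True where the region is content-salient."""
    return saliency > threshold   # regions above the preset threshold are salient

sal = np.random.rand(8, 8)
mask = mark_salient_regions(sal)
print(mask.sum(), "of", mask.size, "regions marked content-salient")
```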
  • Specifically, the video encoder uses supervised learning to adaptively learn a mapping function from the initial picture features to the target domain of content saliency values; the mapping function Y has the form m = Y(z), where m is the content saliency value and z is the initial picture feature.
  • In the training stage, the video encoder samples the training data set according to a Gaussian distribution, obtains an initial random function from the sampled data, and obtains the final mapping function Y through adaptive learning. The video encoder evaluates the difference between the predicted content saliency values and the ground-truth values in the training data through a loss function, and obtains the mapping function Y by minimising the preset loss function with a gradient descent algorithm.
  • The training stage is based on training over a large data set, and the preset loss function is loss = α·kl(p,s) + β·nss(p,s) + γ·cc(p,s), where loss denotes the loss function and α, β and γ are weighting coefficients; the best values obtained experimentally are 1, 0.1 and 0.1 respectively.
  • The three measurement indicators take the standard forms of the KL-divergence, NSS and CC saliency metrics: kl(p,s) = Σ_i s(x_i)·log(s(x_i)/p(x_i)); nss(p,s) = (1/N)·Σ_{i: s(x_i)>threshold} (p(x_i) − μ(p))/σ(p); cc(p,s) = cov(p,s)/(σ(p)·σ(s)).
  • In these metrics, x_i denotes each pixel and n the number of pixels; N denotes the total number of points in the ground truth whose content saliency value is greater than the preset threshold, and μ denotes the mathematical expectation.
  • p denotes the predicted content saliency of the region containing the pixel, and s denotes the ground-truth content saliency of that region in the training data set.
  • kl(p,s) measures the difference between the distributions of the predicted and true content saliency, nss(p,s) measures the prediction accuracy at the pixels with the highest predicted values, and cc(p,s) measures the consistency between the linear trend of the predicted values and the linear trend of the true values.
  • In the training stage, the video encoder uses the initial random function as the mapping function Y, substitutes it in to compute content saliency values and obtain their predicted values, evaluates with the loss function the difference between the predicted values and the ground-truth values in the training data, adjusts the mapping function in the direction that reduces the difference, and iterates the calculation.
  • When the difference between the predicted and true values is sufficiently small, the video encoder takes this mapping function as the finally applied mapping function Y.
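  • The sketch below shows how such a combined loss could be assembled. It assumes the standard definitions of the KL, NSS and CC saliency metrics; the patent only states the weights (1, 0.1, 0.1), so the exact metric implementations and the sign convention (NSS and CC are subtracted so that the loss decreases as predictions improve) are assumptions.
```python
import torch

def kl_div(p, s, eps=1e-8):
    # Difference between the predicted saliency distribution p and ground truth s.
    p = p / (p.sum() + eps); s = s / (s.sum() + eps)
    return (s * torch.log(eps + s / (p + eps))).sum()

def nss(p, fixations):
    # Normalised prediction value at the ground-truth salient (fixation) points.
    p = (p - p.mean()) / (p.std() + 1e-8)
    return p[fixations > 0].mean()

def cc(p, s):
    # Linear correlation between predicted and true saliency maps.
    p = p - p.mean(); s = s - s.mean()
    return (p * s).sum() / (p.norm() * s.norm() + 1e-8)

def saliency_loss(p, s, fixations, alpha=1.0, beta=0.1, gamma=0.1):
    # KL is minimised; NSS and CC grow with better predictions, so they are
    # subtracted here (sign convention is an assumption on top of the stated weights).
    return alpha * kl_div(p, s) - beta * nss(p, fixations) - gamma * cc(p, s)
```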
  • In the application stage, the video encoder takes the features extracted by the encoder and maps them to the target domain of content saliency values using convolution and pooling/upsampling operations, obtaining the content saliency value result.
  • the video encoder marks the content salient regions in the video picture according to the content saliency of each region.
  • the video encoder can generate an instruction file for marking the content salient area in the video picture.
  • the indication file may include the position information of the conspicuous area in each area of the video picture.
  • In step 104, the video encoder encodes the content-salient areas in the first mode and the non-content-salient areas in the second mode, with the picture quality of first-mode encoding higher than that of second-mode encoding.
  • The first-mode and second-mode encoding may compress the video picture to different degrees; for example, different quantization parameter (QP, i.e. quantization step Qstep) values can be set for the two modes so that the picture is compressed to different extents. After a content-salient area is compressed with first-mode encoding, its picture definition is higher than that of a non-content-salient area compressed with second-mode encoding.
  • Specifically, the video encoder may encode each region of the video picture according to the position information of the salient areas in the instruction file.
  • In one example, the video encoder may further subdivide content saliency according to the content saliency value, for example into first-level salient, second-level salient and non-salient levels: regions whose content saliency value is not greater than the preset threshold are classified as non-salient, regions whose value is greater than a first preset threshold as first-level salient, regions whose value is greater than a second preset threshold as second-level salient, and so on.
  • In this case the instruction file contains the position information of the different regions of the video picture and their corresponding content saliency.
  • The video encoder may then encode each region of the video picture according to the position information of the regions corresponding to the different saliency levels in the instruction file. If the video encoder sets different saliency levels for the content-salient areas, it can set the QP values linearly according to those levels, compressing the video picture to different degrees.
  • the video encoder may modify the mask interface in the existing video encoding standard, and encode the video using the modified mask interface.
  • the video encoder modifies the existing binary coding into linear non-uniform coding.
  • the masks before and after modification are shown in Figure 2.
  • In the mask, regions with different flag values are assigned different QP values by the video encoder: the larger the flag value, the higher the content saliency and the smaller the QP value assigned to that region.
  • Preferably, the video encoder can allocate the QP values of the regions linearly according to the linear relationship between the regions' content saliency values.
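  • A small sketch of this linear QP allocation is given below; the QP range 22-40 and the assign_qp helper are illustrative assumptions rather than values taken from the patent.
```python
import numpy as np

def assign_qp(mask, qp_min=22, qp_max=40):
    """mask: [gh, gw] array of mask flag values (larger = more content-salient).
    Returns an integer QP per region, allocated linearly: the most salient
    regions receive the smallest QP (highest quality)."""
    m = mask.astype(float)
    norm = (m - m.min()) / (m.max() - m.min() + 1e-8)
    return np.round(qp_max - norm * (qp_max - qp_min)).astype(int)

mask = np.array([[0, 1, 3], [2, 5, 1], [0, 0, 4]])
print(assign_qp(mask))
```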
  • In one example, before extracting initial picture features from the regions of the video picture, the video encoder extracts a video frame from the video to be processed, obtains its video picture, and extracts the initial picture features from the regions of that extracted frame.
  • The video encoder may perform steps 101 to 104 of this embodiment on every frame of the video.
  • Further, before extracting initial picture features from the regions of the video picture, the video encoder may also extract video frames from the video at a preset frame interval and obtain the video pictures of those frames.
  • The preset frame interval can be any natural number other than 0.
  • For example, with a preset frame interval of N, if the frame extracted this time is the 10th frame, the next frame the video encoder extracts is the (10+N)th frame.
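  • The following sketch illustrates sampling frames at a preset interval; the use of OpenCV and the sample_frames helper are assumptions made for illustration.
```python
import cv2

def sample_frames(path, interval=5):
    """Yield (frame_index, frame) every `interval` frames of the video."""
    cap = cv2.VideoCapture(path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:   # keyframes at the preset frame interval
            yield idx, frame
        idx += 1
    cap.release()
```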
  • In one example, after encoding all regions of the current video picture, the video encoder obtains the next frame's video picture and encodes its regions.
  • If the video encoder's preset interval is greater than 0, then for the video frames between the current frame and the next extracted frame, the video encoder can encode the regions of each picture according to the coding scheme of the current frame, the coding scheme being the correspondence, recorded in the instruction file, between the position of each region of the video picture, its content saliency and its encoding mode.
  • For example, if the current frame is the 1000th frame of the video and the preset interval is 5, the video encoder encodes the regions of the 1001st, 1002nd, 1003rd and 1004th frames according to the coding scheme of the 1000th frame; that is, according to the position information of the regions encoded in the first mode in the 1000th frame, it encodes the corresponding regions of the 1001st to 1004th frames in the first mode as well.
  • In another example, if the preset interval is greater than 0, the video encoder may first extract the next video frame according to the preset frame interval and apply the video processing method of this embodiment to it; from the coding scheme of the current frame and the coding scheme of the next frame it then obtains the content saliency values of the regions of both pictures.
  • Based on the content saliency values of corresponding regions in those two frames, it linearly allocates content saliency values to the corresponding regions of each frame lying between them, obtaining the content saliency values for the intermediate frames.
  • For example, let the current frame be the 1000th frame and the preset interval be 5, so that the next extracted frame is the 1005th frame and the intermediate frames are the 1001st to 1004th. After finishing the 1000th frame, the video encoder first encodes the 1005th frame; from the coding schemes of the 1000th and 1005th frames it obtains the content saliency values of a region A located at the same position in both frames. If region A has value 0 in the 1000th frame and value 5 in the 1005th frame, then region A is assigned value 1 in the 1001st frame, 2 in the 1002nd, 3 in the 1003rd and 4 in the 1004th.
  • The video encoder then classifies and encodes the content saliency of each region of each intermediate frame's picture according to the calculated content saliency values.
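  • The linear allocation of saliency values to intermediate frames can be sketched as follows; the interpolate_saliency helper is hypothetical, but the printed result reproduces the 0-to-5 example above.
```python
import numpy as np

def interpolate_saliency(sal_a, sal_b, gap):
    """sal_a, sal_b: per-region saliency of two key frames that are `gap`
    frames apart. Returns one linearly allocated map per intermediate frame."""
    return [sal_a + (sal_b - sal_a) * k / gap for k in range(1, gap)]

a = np.zeros((8, 8)); b = np.full((8, 8), 5.0)
mid = interpolate_saliency(a, b, gap=5)
print([m[0, 0] for m in mid])  # [1.0, 2.0, 3.0, 4.0]
```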
  • In this embodiment, initial picture features are extracted from each region of the video picture, the content saliency of each region is calculated, the content-salient areas of the picture are marked according to that saliency, and the content-salient areas are encoded in the first mode while the non-content-salient areas are encoded in the second mode. Since the picture quality of the first mode is higher than that of the second, the content-salient areas have higher picture quality and the non-content-salient areas have lower picture quality; because picture quality is proportional to data volume, and because human visual attention concentrates on salient targets, this reduces the encoded data volume without harming the viewing experience.
  • The video processing method of the present application can therefore reduce the amount of video data while preserving the user's perception, reduce the transmission time the video requires, and meet users' real-time interaction requirements.
  • the second embodiment of the present application relates to a video processing method.
  • The second embodiment is substantially the same as the first; the main difference is that in the first embodiment the content saliency of each region is calculated from the initial picture features, whereas in the second embodiment the content saliency is calculated from temporal dimension features and spatial dimension features.
  • This embodiment relates to a video processing method.
  • the specific process is shown in Figure 3:
  • Step 301: extract initial picture features from each region of the video picture.
  • Step 302: obtain temporal dimension features and spatial dimension features from the initial picture features.
  • Step 303: calculate the content saliency from the temporal and spatial dimension features.
  • Step 304: mark the content-salient areas in the video picture according to the content saliency of each region.
  • Step 305: encode the content-salient areas in the first mode and the non-content-salient areas in the second mode; the picture quality of first-mode encoding is higher than that of second-mode encoding.
  • step 301 , step 304 , and step 305 are substantially the same as step 101 , step 103 , and step 104 in the first embodiment, and will not be repeated here.
  • The video processing method of this embodiment can be implemented by constructing the algorithm network framework shown in FIG. 4, where input frame t-1 is the video frame of the previous video picture and input frame t is the video frame of the current video picture; IFCM is the Inter-Frame Feature Competition Module, used to obtain temporal dimension features; HFCM is the Hierarchical Feature Competition Module, used to obtain spatial dimension features; and CDFE is the Correlated and Differential Features Extraction Module.
  • In step 302, the video encoder obtains the temporal dimension features and the spatial dimension features from the initial picture features.
  • In one example, the video encoder obtains the temporal dimension feature as follows: it obtains a consistency feature and a difference feature from the consistency and difference between the initial picture features and the initial picture features of the corresponding regions of the previous video picture, and fuses the consistency and difference features with weights to obtain the temporal dimension feature.
  • The video encoder can obtain the temporal dimension feature through the inter-frame feature competition module shown in FIG. 5.
  • The inter-frame feature competition module uses the correlation layer of the optical-flow network (Flow-Net), a squeeze-and-excitation module and a self-attention module to combine, by point-wise multiplication and weighted optimisation, the initial picture features of each region of the current video picture with the initial picture features of the corresponding regions of the previous video picture, yielding for each local spatial position (i.e. each region) a representation of the similarity between the previous and current video frames, from which the video encoder can compute the content saliency of each region of the video picture.
  • Using the squeeze-and-excitation model SE (Squeeze-and-Excitation Module) shown in FIG. 6, the video encoder applies global pooling and a Sigmoid function after the ReLU convolution activation to obtain, in the feature dimension of the initial picture features, an activation value for every feature dimension, and uses the initial picture features of the previous video picture to indicate the spatial position of each feature dimension, so that the features computed by the correlation layer are weighted and optimised per dimension; the activation value serves as the weight and lies in the range [0, 1].
  • The video encoder also uses the residual-addition scheme of the self-attention model SA (Self-Attention Module) shown in FIG. 7, applying the Sigmoid function to perform weighted optimisation in space; it computes the consistency and difference between the initial picture features and the initial picture features of the corresponding regions of the previous video picture to obtain the consistency and difference features, and then uses a gated recurrent unit (GRU) with convolution operations, concatenation operations and non-linear activation functions to fuse the consistency and difference features with weights into the temporal dimension feature. The weighted fusion is f_time = SA(SE(Cat(δ(W_corr·f_corr), δ(W_diff·f_diff)))), where f_time is the temporal dimension feature, f_corr the consistency feature, f_diff the difference feature, and W_corr, W_diff are parameters learned by the convolutions. In these expressions, SE(x) = F_scale(x, σ(g(W_2·δ(W_1·x)))) and SA(x) = x + σ(W·x), where δ denotes the non-linear activation ReLU, σ the non-linear activation Sigmoid, g the global pooling operation, Cat the concatenation operation, W, W_1 and W_2 are parameters learned by the convolutions, and F_scale is the point-wise multiplication in the feature dimension.
  • Specifically, the video encoder computes a consistency mask and a difference mask from the initial picture features and the initial picture features of the corresponding regions of the previous video picture; it point-multiplies the previous picture's corresponding features with the consistency mask to obtain the consistency feature, and point-multiplies the current initial picture features with the difference mask to obtain the difference feature. The masks and features are computed as M_corr = SA(W·Cat(f_{t-1}, SE(Corr(f_t, f_{t-1})))), with the correlation layer Corr(x_1, x_2) = Σ_{o∈[-k,k]×[-k,k]} f_{t-1}(x_1+o)·f_t(x_2+o), f_corr = f_{t-1}·M_corr and f_diff = f_t·(1−M_corr). Here Corr denotes the correlation layer of the optical-flow network; f_t and f_{t-1} denote the initial picture features of each region of the current video picture and of the corresponding region of the previous video picture; [-k, k] is the spatial range of x_1 and x_2 over which f_{t-1} and f_t are respectively evaluated at the corresponding region positions; M_corr is the consistency mask obtained with the correlation layer, the squeeze-and-excitation network and the self-attention network, characterising the consistency of each region between the previous and current video frames; and 1−M_corr is the difference mask.
  • In this embodiment, the consistency and difference features are obtained from the consistency and difference between the initial picture features and the previous picture's corresponding features, and are fused with weights, using convolution operations, concatenation operations and non-linear activation functions, into the temporal dimension feature. Because the consistency and difference between each region of the video picture and the corresponding region of the previous picture reflect the dynamic change of the picture content in the time dimension, a temporal dimension feature built on them fully exploits the characteristics of the human visual system in the time dimension and can further improve the accuracy of the content saliency calculation.
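  • The sketch below illustrates, in PyTorch, the overall shape of the inter-frame feature competition idea: an SE-style channel gate and an SA-style spatial gate produce a consistency mask that splits the features into consistency and difference parts before weighted fusion. Module sizes, the element-wise product used as a stand-in for the FlowNet correlation layer, and all class names are assumptions; the GRU stage is omitted.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SE(nn.Module):
    """Squeeze-and-excitation channel gate: x * sigmoid(W2 relu(W1 gap(x)))."""
    def __init__(self, c, r=8):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(c, c // r), nn.Linear(c // r, c)
    def forward(self, x):                                  # x: [B, C, H, W]
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(x.mean((2, 3))))))
        return x * w[:, :, None, None]                     # per-channel reweighting

class SA(nn.Module):
    """Self-attention style spatial gate with residual addition: x + sigmoid(Wx)."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 1)
    def forward(self, x):
        return x + torch.sigmoid(self.conv(x))

class IFCM(nn.Module):
    """Sketch of the inter-frame feature competition module (sizes are assumptions)."""
    def __init__(self, c):
        super().__init__()
        self.se, self.sa = SE(c), SA(c)
        self.mask_mix = nn.Conv2d(2 * c, c, 1)
        self.w_corr = nn.Conv2d(c, c, 3, padding=1)
        self.w_diff = nn.Conv2d(c, c, 3, padding=1)
        self.fuse = nn.Conv2d(2 * c, c, 1)
        self.fuse_se, self.fuse_sa = SE(c), SA(c)
    def forward(self, f_t, f_prev):
        corr = f_t * f_prev                                # stand-in for the correlation layer
        m_corr = torch.sigmoid(self.sa(self.mask_mix(torch.cat([f_prev, self.se(corr)], 1))))
        f_corr = f_prev * m_corr                           # consistency feature f_{t-1} * M_corr
        f_diff = f_t * (1.0 - m_corr)                      # difference feature  f_t * (1 - M_corr)
        cat = torch.cat([F.relu(self.w_corr(f_corr)), F.relu(self.w_diff(f_diff))], 1)
        return self.fuse_sa(self.fuse_se(self.fuse(cat)))  # time-dimension feature

f_time = IFCM(64)(torch.rand(1, 64, 16, 16), torch.rand(1, 64, 16, 16))
print(f_time.shape)  # torch.Size([1, 64, 16, 16])
```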
  • the video encoder obtains spatial dimension features by: obtaining low-level features and high-level semantic features according to the initial picture features; wherein, the low-level features are features obtained through shallow recognition of video pictures, and the high-level semantic features are Features obtained by recognizing the content of video images; weighted fusion of low-level features and high-level semantic features to obtain spatial dimension features.
  • the low-level features are the features of the outline, edge, chroma, contrast, texture and shape of the picture content
  • the high-level semantic features are the features obtained by semantically identifying the picture content, such as people, cars, trees, wolves, etc.
  • The video encoder can obtain the spatial dimension feature through the hierarchical feature competition module shown in FIG. 8.
  • The hierarchical feature competition module uses the squeeze-and-excitation model SE, with global pooling and a Sigmoid function applied after the ReLU convolution activation, to obtain an activation value for every dimension of the initial picture features, the activation value lying in the range [0, 1].
  • The video encoder also uses the residual-addition scheme of the self-attention model SA, applying the Sigmoid function to perform weighted fusion in space and obtain the spatial dimension feature; it extracts multi-level low-level features from the encoder and high-level semantic features from the decoder, and fuses them as f_fuse = SA(SE(Cat(δ(W_low·f_low), δ(W_high·f_high)))), where f_low and f_high are the extracted low-level and high-level semantic features and W_low, W_high are parameters learned by the convolutions.
  • In this embodiment, low-level features and high-level semantic features are obtained from the initial picture features and are fused with weights, using convolution operations, concatenation operations and non-linear activation functions, into the spatial dimension feature.
  • Because the low-level and high-level semantic features of the video picture describe the content from different dimensions, computing content saliency from the spatial picture feature obtained by their weighted fusion can further improve the accuracy of the content saliency calculation.
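  • A corresponding sketch of the hierarchical feature competition idea is shown below: low-level and high-level features are convolved, concatenated and passed through SE-style channel weighting and SA-style spatial weighting. The layer sizes and the HFCM class name are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HFCM(nn.Module):
    """Sketch of hierarchical feature competition: weighted fusion of low-level
    features (edges, texture, ...) with high-level semantic features."""
    def __init__(self, c, r=8):
        super().__init__()
        self.w_low = nn.Conv2d(c, c, 3, padding=1)
        self.w_high = nn.Conv2d(c, c, 3, padding=1)
        self.mix = nn.Conv2d(2 * c, c, 1)
        self.channel_gate = nn.Sequential(
            nn.Linear(c, c // r), nn.ReLU(), nn.Linear(c // r, c), nn.Sigmoid())
        self.spatial_gate = nn.Conv2d(c, 1, 1)
    def forward(self, f_low, f_high):
        # f_fuse = SA(SE(Cat(relu(W_low f_low), relu(W_high f_high))))
        x = self.mix(torch.cat([F.relu(self.w_low(f_low)), F.relu(self.w_high(f_high))], 1))
        x = x * self.channel_gate(x.mean((2, 3)))[:, :, None, None]  # SE-style channel weighting
        return x + x * torch.sigmoid(self.spatial_gate(x))           # SA-style residual spatial weighting

f_space = HFCM(64)(torch.rand(1, 64, 16, 16), torch.rand(1, 64, 16, 16))
print(f_space.shape)  # torch.Size([1, 64, 16, 16])
```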
  • In step 303, the video encoder calculates the content saliency from the temporal dimension feature and the spatial dimension feature.
  • Specifically, the video encoder computes the content saliency value from the temporal and spatial dimension features as follows: after obtaining the two features, it uses supervised learning to adaptively learn a mapping function from the feature domain of the temporal and spatial dimension features to the target domain of content saliency values; the mapping function Y′ has the form m = Y′(z′_1, z′_2), where m is the content saliency value, z′_1 the temporal dimension feature and z′_2 the spatial dimension feature.
  • In the training stage of the algorithm, the video encoder samples the training data set according to a Gaussian distribution to obtain an initial random function, and obtains the mapping function Y′ through adaptive learning.
  • Further, the video encoder evaluates the difference between the predicted content saliency values and the actual values in the training data through the loss function, and adaptively learns, via the gradient descent algorithm, to minimise the preset loss function and obtain the mapping function Y′.
  • The training stage is based on a large data set, and the preset loss function is the same as in the first embodiment: loss = α·kl(p,s) + β·nss(p,s) + γ·cc(p,s), where loss denotes the loss function and α, β and γ are weighting coefficients whose best experimentally determined values are 1, 0.1 and 0.1 respectively; the three measurement indicators are defined as in the first embodiment.
  • In the training stage, the video encoder uses the initial random function as the mapping function Y′, substitutes it in to compute content saliency values and obtain their predicted values, evaluates with the loss function the difference between the predicted values and the ground-truth values in the training data, adjusts the mapping function in the direction that reduces the difference, and iterates the calculation; when the difference is sufficiently small, this mapping function is taken as the finally applied mapping function Y′.
  • In the application stage, the trained video encoder takes the extracted temporal and spatial dimension features and maps them, using convolution and pooling/upsampling operations, to the target domain of content saliency values, obtaining the content saliency value result.
  • In this embodiment, temporal dimension features and spatial dimension features are obtained from the initial picture features and used to calculate the content saliency, so that picture content that is salient in different dimensions, according to the video picture's behaviour in time and in space, is captured, improving the accuracy of the content saliency calculation.
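  • A sketch of this application-stage mapping is given below: the fused temporal and spatial features are decoded with convolution and bilinear upsampling into a per-pixel saliency map in [0, 1]; the layer configuration and the SaliencyHead name are assumptions.
```python
import torch
import torch.nn as nn

class SaliencyHead(nn.Module):
    """Sketch: map fused temporal and spatial dimension features to a per-pixel
    content-saliency map via convolution and upsampling."""
    def __init__(self, c):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(c, 1, 1), nn.Sigmoid(),        # saliency value in [0, 1]
        )
    def forward(self, f_time, f_space):
        return self.decode(torch.cat([f_time, f_space], dim=1))

sal = SaliencyHead(64)(torch.rand(1, 64, 16, 16), torch.rand(1, 64, 16, 16))
print(sal.shape)  # torch.Size([1, 1, 64, 64])
```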
  • the third embodiment of the present application relates to a video processing apparatus, as shown in FIG. 9 , including:
  • Extraction module 901 used for extracting initial picture features from each region of the video picture
  • a calculation module 902 configured to calculate the content prominence of each area according to the initial picture feature
  • the marking module 903 is used to mark the content salient areas in the video picture according to the content salience of each area;
  • the encoding module 904 is configured to perform first-mode encoding for content-salient areas, and second-mode encoding for non-content-salient areas; the picture quality encoded in the first method is higher than the picture quality encoded in the second method.
  • the calculation module 902 is specifically configured to obtain a temporal dimension feature and a spatial dimension feature according to the initial picture feature, and calculate the content saliency according to the temporal dimension feature and the spatial dimension feature.
  • In one example, the calculation module 902 further includes an inter-frame feature competition sub-module, configured to obtain the consistency feature and the difference feature according to the consistency and difference between the initial picture features and the initial picture features of the corresponding regions of the previous video picture, and to fuse the consistency and difference features with weights to obtain the temporal dimension feature.
  • In one example, the calculation module 902 is further configured to compute the consistency mask and the difference mask from the initial picture features and the initial picture features of the corresponding regions of the previous video picture; to dot-multiply the previous picture's corresponding initial picture features with the consistency mask to obtain the consistency feature; and to dot-multiply the initial picture features with the difference mask to obtain the difference feature.
  • In one example, the calculation module 902 further includes a hierarchical feature competition sub-module, configured to obtain low-level features and high-level semantic features from the initial picture features and, through convolution operations, concatenation operations and non-linear activation function calculations, fuse the low-level and high-level semantic features with weights to obtain the spatial dimension feature.
  • the calculation module 902 is further configured to calculate the content saliency value of each area according to the initial picture feature; the marking module 903 is further configured to mark the area with the content saliency value greater than the preset threshold as the content saliency area.
  • In one example, the video processing apparatus further includes a frame extraction module, configured to extract video frames from the video to be processed at a preset frame interval before the initial picture features are extracted from the regions of the video picture; the extraction module 901 is then further configured to extract the initial picture features from the regions of the extracted video frames.
  • It is worth mentioning that the modules involved in this embodiment are logical modules; in practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of several physical units.
  • In addition, in order to highlight the innovative part of the present application, this embodiment does not introduce units that are not closely related to solving the technical problem raised by the present application, but this does not mean that no other units exist in this embodiment.
  • The fourth embodiment of the present application relates to an electronic device, as shown in FIG. 10, comprising: at least one processor 1001; and a memory 1002 communicatively connected to the at least one processor; the memory 1002 stores instructions executable by the at least one processor 1001, and the instructions are executed by the at least one processor 1001 to perform the above video processing method.
  • the memory 1002 and the processor 1001 are connected by a bus, and the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors 1001 and various circuits of the memory 1002 together.
  • the bus may also connect together various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein.
  • the bus interface provides the interface between the bus and the transceiver.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other devices over a transmission medium.
  • the information processed by the processor 1001 is transmitted on the wireless medium through the antenna, and further, the antenna also receives the information and transmits the information to the processor 1001 .
  • the processor 1001 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interface, voltage regulation, power management, and other control functions.
  • memory 1002 may be used to store information used by the processor in performing operations.
  • the fifth embodiment of the present application relates to a computer-readable storage medium storing a computer program.
  • the above method embodiments are implemented when the computer program is executed by the processor.
  • The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Embodiments of the present application relate to the field of video and disclose a video processing method, apparatus, electronic device and storage medium. The method includes: extracting initial picture features from each region of a video picture; calculating the content saliency of each region according to the initial picture features; marking the content-salient areas in the video picture according to the content saliency of each region; and encoding the content-salient areas in a first mode and the non-content-salient areas in a second mode, the picture quality of the first-mode encoding being higher than that of the second-mode encoding.

Description

Video processing method and apparatus, electronic device and storage medium
Cross-Reference
This application is based on, and claims priority to, Chinese patent application No. 202011507127.4 filed on December 18, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of video, and in particular to a video processing method, apparatus, electronic device and storage medium.
Background
With the development of Internet technology, Internet users' demand for media consumption keeps growing, and high-quality media content and emerging media content such as virtual reality (VR) and cloud video-on-demand are gradually becoming mainstream.
In related video processing technology, in order to guarantee the user's viewing experience, the media publisher sends high-picture-quality video to the user end; because high-quality video carries a large amount of data, transmitting the video inevitably takes a long time.
Video processing technology therefore has the following problem: to guarantee picture quality, a huge amount of video data must be transmitted, which conflicts with users' real-time interaction requirements.
Summary
An embodiment of the present application provides a video processing method, including the following steps: extracting initial picture features from each region of a video picture; calculating the content saliency of each region according to the initial picture features; marking the content-salient areas in the video picture according to the content saliency of each region; and encoding the content-salient areas in a first mode and the non-content-salient areas in a second mode, the picture quality of the first-mode encoding being higher than that of the second-mode encoding.
An embodiment of the present application also provides a video processing apparatus, including: an extraction module for extracting initial picture features from each region of a video picture; a calculation module for calculating the content saliency of each region according to the initial picture features; a marking module for marking the content-salient areas in the video picture according to the content saliency of each region; and an encoding module for encoding the content-salient areas in a first mode and the non-content-salient areas in a second mode, the picture quality of the first-mode encoding being higher than that of the second-mode encoding.
An embodiment of the present application further provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above video processing method.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above video processing method.
Brief Description of the Drawings
FIG. 1 is a flowchart of a video processing method provided according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a mask provided according to the first embodiment of the present application;
FIG. 3 is a flowchart of a video processing method provided according to a second embodiment of the present application;
FIG. 4 is a schematic diagram of an algorithm network framework provided according to the second embodiment of the present application;
FIG. 5 is a schematic diagram of an inter-frame feature competition module provided according to the second embodiment of the present application;
FIG. 6 is a schematic diagram of a squeeze-and-excitation model provided according to the second embodiment of the present application;
FIG. 7 is a schematic diagram of a self-attention model provided according to the second embodiment of the present application;
FIG. 8 is a schematic diagram of a hierarchical feature competition module provided according to the second embodiment of the present application;
FIG. 9 is a schematic structural diagram of a video processing apparatus provided according to a third embodiment of the present application;
FIG. 10 is a schematic diagram of an electronic device provided according to a fourth embodiment of the present application.
Detailed Description
To make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand that many technical details are given so that the reader can better understand the present application; however, the technical solutions claimed in the present application can be implemented even without these technical details and with various changes and modifications based on the following embodiments. The division into the following embodiments is for convenience of description only and does not limit the specific implementation of the present application; the embodiments may be combined with and refer to each other where they do not contradict.
The main purpose of the embodiments of the present application is to propose a video processing method, apparatus, electronic device and storage medium that reduce the amount of video data and the video transmission time and meet users' real-time interaction requirements while preserving the user's perception.
In the video processing method proposed by the present application, initial picture features are extracted from each region of the video picture, the content saliency of each region is calculated from them, the content-salient areas of the video picture are marked according to the content saliency of each region, and the content-salient areas are encoded in a first mode while the non-content-salient areas are encoded in a second mode. Because the picture quality of the first-mode encoding is higher than that of the second-mode encoding, the content-salient areas have higher picture quality and the non-content-salient areas have lower picture quality. Since picture quality is proportional to the amount of video data, encoding the non-content-salient areas with the lower-quality second mode reduces their data volume after encoding and thus the overall data volume of the video; and since the attention of the human visual system concentrates on salient targets or regions, guaranteeing the picture quality of the content-salient areas preserves the user's viewing experience even when the quality of the non-content-salient areas is reduced. The video processing method of the present application can therefore reduce the amount of video data, and hence the transmission time the video requires, while preserving the user's perception, meeting users' real-time interaction requirements.
The first embodiment of the present application relates to a video processing method; the specific flow is shown in FIG. 1:
Step 101: extract initial picture features from each region of the video picture;
Step 102: calculate the content saliency of each region according to the initial picture features;
Step 103: mark the content-salient areas in the video picture according to the content saliency of each region;
Step 104: encode the content-salient areas in a first mode and the non-content-salient areas in a second mode; the picture quality of the first-mode encoding is higher than that of the second-mode encoding.
The video processing method of this embodiment is applied in a video encoder. A video encoder compresses and encodes video data to meet storage and transmission requirements. The video encoder may be one used by media platforms, such as video-on-demand platforms and game platforms, to encode videos before transmission. When a user orders a video through a video-on-demand platform, or plays a game requiring real-time interaction (such as a VR-related game) through a game platform, the platform needs to send the user the on-demand video or the video required for the game interaction; if high picture definition is required, the data volume of that video is huge. The video processing method of the present application encodes the regions of the video picture non-uniformly: the content-salient areas, i.e. the user's regions of interest, are encoded at higher quality and the non-content-salient areas at lower quality, producing a new video with a smaller data volume that is transmitted to the user for viewing or interaction.
The implementation details of the video processing method of this embodiment are described specifically below; the following details are provided only to aid understanding and are not required to implement this solution.
The video processing method of the present application can be implemented by constructing an "encoder - gated recurrent unit - decoder" algorithm network framework, where the gated recurrent unit (Gate Recurrent Unit, "GRU") is a lightweight recurrent neural network; the video encoder may build the algorithm network with the lightweight residual network ResNet-18 (Residual Network 18) and depthwise separable convolution.
In step 101, the video encoder extracts initial picture features from each region of the video picture. It can do so through the encoder, which extracts the initial picture features of each region through convolutional layers, pooling layers and residual blocks. The regions may be divided using the encoder's default specification, or the division of the video picture into regions may be adjusted by changing the specification parameters. The encoder can obtain each region of the video picture with a sliding-window approach and extract the initial picture features from it. One or more initial picture features may be extracted, and they may exist as vectors or matrix arrays. By extracting picture features, the video encoder obtains a digital representation of the picture content that is convenient for computation.
In step 102, the video encoder calculates the content saliency of each region according to the initial picture features. From the extracted initial picture features, the video encoder processes the features of each region through the GRU and the decoder and obtains the content saliency of each region, which can be classified as salient or non-salient.
In one example, the video encoder may calculate a content saliency value for each region from the initial picture features and derive the region's content saliency from that value: for example, regions whose content saliency value is greater than a preset threshold are classified as salient and the rest as non-salient; the former are marked as content-salient areas and the latter as non-content-salient areas.
Specifically, the video encoder uses supervised learning to adaptively learn a mapping function from the initial picture features to the target domain of content saliency values. The mapping function Y has the form m = Y(z), where m is the content saliency value and z is the initial picture feature. In the training stage of the algorithm, the video encoder samples the training data set according to a Gaussian distribution, obtains an initial random function from the sampled data, and obtains the final mapping function Y through adaptive learning. Further, the video encoder evaluates the difference between the predicted content saliency values and the actual ground-truth values in the training data through a loss function, and obtains the mapping function Y by minimising the preset loss function with a gradient descent algorithm. The training stage is based on a large data set, and the preset loss function is loss = α·kl(p,s) + β·nss(p,s) + γ·cc(p,s), where loss denotes the loss function and α, β and γ are weighting coefficients whose best experimentally determined values are 1, 0.1 and 0.1 respectively; the specific formulas of the three measurement indicators are given in the equation images of the original application (not reproduced in this text extraction).
In these metrics, x_i denotes each pixel, n the number of pixels, N the total number of points in the ground truth whose content saliency value exceeds the preset threshold, and μ the mathematical expectation; p denotes the predicted content saliency of the region containing the pixel, and s the ground-truth content saliency of that region in the training data set. kl(p,s) measures the difference between the distributions of the predicted and true content saliency, nss(p,s) measures the prediction accuracy at the pixels with the highest predicted values, and cc(p,s) measures how consistently the linear trend of the predictions follows the linear trend of the ground truth. In the training stage, the video encoder uses the initial random function as the mapping function Y, substitutes it in to compute content saliency values and obtain their predicted values, evaluates with the loss function the difference between the predicted values and the ground-truth values in the training data, adjusts the mapping function in the direction that reduces the difference, and iterates the calculation; when the difference between the predicted and true values is sufficiently small, this mapping function is taken as the finally applied mapping function Y. In the application stage, the video encoder takes the features extracted by the encoder and maps them, using convolution and pooling/upsampling operations, to the target domain of content saliency values, obtaining the content saliency value result.
In step 103, the video encoder marks the content-salient areas in the video picture according to the content saliency of each region. The video encoder can generate an instruction file for marking the content-salient areas in the video picture; when each region's content saliency is either salient or non-salient, the instruction file may contain the position information of the salient areas among the regions of the video picture.
In step 104, the video encoder encodes the content-salient areas in the first mode and the non-content-salient areas in the second mode, with the picture quality of the first-mode encoding higher than that of the second-mode encoding. The first-mode and second-mode encoding may compress the video picture to different degrees; for example, different quantization parameter (QP, i.e. quantization step Qstep) values can be set for the two modes so that the picture is compressed to different extents. After a content-salient area is compressed with first-mode encoding, its picture definition is higher than that of a non-content-salient area compressed with second-mode encoding. Specifically, the video encoder may encode each region of the video picture according to the position information of the salient areas in the instruction file.
In one example, the video encoder may further subdivide content saliency according to the content saliency value, for example into first-level salient, second-level salient, non-salient and other saliency levels: regions whose content saliency value is not greater than the preset threshold are classified as non-salient, regions whose value is greater than a first preset threshold as first-level salient, regions whose value is greater than a second preset threshold as second-level salient, and so on. The instruction file then contains the position information of the different regions of the video picture and their corresponding content saliency, and the video encoder can encode each region of the video picture according to the position information of the regions corresponding to the different saliency levels in the instruction file. If the video encoder sets different saliency levels for the content-salient areas, it can set the QP values linearly according to those levels, compressing the video picture to different degrees.
In one example, the video encoder may modify the mask interface in an existing video coding standard and encode the video with the modified mask interface: the existing binary coding is changed to linear non-uniform coding, with the masks before and after the modification shown in FIG. 2. In the mask, regions with different flag values are assigned different QP values by the video encoder; a larger flag value indicates higher content saliency, and the video encoder assigns that region a smaller QP value. Preferably, the video encoder can allocate the QP values of the regions linearly according to the linear relationship between the regions' content saliency values.
In one example, before extracting initial picture features from the regions of the video picture, the video encoder extracts a video frame from the video to be processed, obtains its video picture, and extracts the initial picture features from the regions of that extracted frame. The video encoder may perform steps 101 to 104 of this embodiment on every frame of the video.
Further, before extracting initial picture features from the regions of the video picture, the video encoder may also extract video frames from the video at a preset frame interval and obtain the video pictures of those frames. The preset frame interval can be any natural number other than 0; with a preset frame interval of N, if the frame extracted this time is the 10th frame, the next frame the video encoder extracts is the (10+N)th frame.
In one example, after encoding all regions of the current video picture, the video encoder obtains the next frame's video picture and encodes its regions. If the video encoder's preset interval is greater than 0, then for the video frames between the current frame and the next extracted frame, the video encoder can encode the regions of each picture according to the coding scheme of the current frame, the coding scheme being the correspondence, recorded in the instruction file, between the position of each region of the video picture, its content saliency and its encoding mode. For example, if the current frame is the 1000th frame of the video and the preset interval is 5, the video encoder encodes the regions of the 1001st, 1002nd, 1003rd and 1004th frames according to the coding scheme of the 1000th frame; that is, according to the position information of the regions encoded in the first mode in the 1000th frame, it also encodes the corresponding regions of the 1001st to 1004th frames in the first mode.
In one example, if the preset interval is greater than 0, the video encoder may instead first extract the next video frame according to the preset frame interval and apply the video processing method of this embodiment to it; from the coding scheme of the current frame and that of the next frame it then obtains the content saliency values of the regions of both pictures and, based on the values of corresponding regions in the two frames, linearly allocates content saliency values to the corresponding regions of each frame lying between them, obtaining the content saliency values of the intermediate frames. For example, let the current frame be the 1000th frame of the video and the preset interval be 5, so that the next frame is the 1005th and the intermediate frames are the 1001st to 1004th. After finishing the 1000th frame, the video encoder first encodes the 1005th frame and, from the coding schemes of the 1000th and 1005th frames, obtains the content saliency values of a region A at the same position in both frames. If region A has value 0 in the 1000th frame and value 5 in the 1005th frame, then region A is assigned value 1 in the 1001st frame, 2 in the 1002nd, 3 in the 1003rd and 4 in the 1004th. The video encoder classifies and encodes the content saliency of the regions of each frame's picture according to the calculated content saliency values.
In this embodiment, initial picture features are extracted from each region of the video picture, the content saliency of each region is calculated, the content-salient areas of the picture are marked according to that saliency, and the content-salient areas are encoded in the first mode while the non-content-salient areas are encoded in the second mode. Because the picture quality of the first-mode encoding is higher than that of the second-mode encoding, the content-salient areas have higher picture quality and the non-content-salient areas lower. Since picture quality is proportional to the amount of video data, encoding the non-content-salient areas with the lower-quality second mode reduces their data volume after encoding and thus the overall data volume of the video; and since the attention of the human visual system concentrates on salient targets or regions, guaranteeing the picture quality of the content-salient areas preserves the user's viewing experience even when the quality of the non-content-salient areas is reduced. The video processing method of the present application can therefore reduce the amount of video data, and hence the transmission time the video requires, while preserving the user's perception, meeting users' real-time interaction requirements.
The division of the above methods into steps is only for clarity of description; in implementation they may be merged into one step or some steps may be split into several, and all such variants fall within the scope of protection of this patent as long as they contain the same logical relationship. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or the flow without changing the core design of the algorithm and flow also falls within the scope of protection of this patent.
The second embodiment of the present application relates to a video processing method. The second embodiment is substantially the same as the first; the main difference is that in the first embodiment the content saliency of each region is calculated from the initial picture features, whereas in the second embodiment the content saliency is calculated from temporal dimension features and spatial dimension features.
This embodiment relates to a video processing method; the specific flow is shown in FIG. 3:
Step 301: extract initial picture features from each region of the video picture;
Step 302: obtain temporal dimension features and spatial dimension features from the initial picture features;
Step 303: calculate the content saliency from the temporal and spatial dimension features;
Step 304: mark the content-salient areas in the video picture according to the content saliency of each region;
Step 305: encode the content-salient areas in the first mode and the non-content-salient areas in the second mode; the picture quality of the first-mode encoding is higher than that of the second-mode encoding.
Steps 301, 304 and 305 are substantially the same as steps 101, 103 and 104 of the first embodiment and are not repeated here.
The video processing method of this embodiment can be implemented by constructing the algorithm network framework shown in FIG. 4, where input frame t-1 is the video frame of the previous video picture and input frame t is the video frame of the current video picture; IFCM is the inter-frame feature competition sub-module (Inter-Frame Feature Competition Module), used to obtain temporal dimension features; HFCM is the hierarchical feature competition sub-module (Hierarchical Feature Competition Module), used to obtain spatial dimension features; and CDFE is the correlated and differential features extraction module (Correlated and Differential Features Extraction Module).
In step 302, the video encoder obtains the temporal dimension features and the spatial dimension features from the initial picture features.
In one example, the video encoder obtains the temporal dimension feature as follows: it obtains a consistency feature and a difference feature from the consistency and difference between the initial picture features and the initial picture features of the corresponding regions of the previous video picture, and fuses them with weights into the temporal dimension feature. The video encoder can obtain the temporal dimension feature through the inter-frame feature competition module shown in FIG. 5. This module uses the correlation layer (Correlation Layer) of the optical-flow network (Flow-Net), a squeeze-and-excitation module and a self-attention module to combine, by point-wise multiplication, weighted optimisation and related operations, the initial picture features of each region of the current video picture with the initial picture features of the corresponding regions of the previous video picture, yielding for each local spatial position (i.e. each region) a representation of the similarity between the previous and current video frames, from which the video encoder can compute the content saliency of each region of the video picture. Using the squeeze-and-excitation model SE (Squeeze-and-Excitation Module) shown in FIG. 6, the video encoder applies global pooling and the Sigmoid function after the ReLU convolution activation to obtain an activation value for each dimension of the initial picture features, and uses the initial picture features of the previous video picture to indicate the spatial position of each feature dimension, so that the features computed by the correlation layer are weighted and optimised per dimension; the activation value, in the range [0, 1], serves as the weight. The video encoder also uses the residual-addition scheme of the self-attention model SA (Self-Attention Module) shown in FIG. 7, applying the Sigmoid function to perform weighted optimisation in space; it computes the consistency and difference between the initial picture features and those of the corresponding regions of the previous video picture to obtain the consistency and difference features, and then uses a gated recurrent unit (Gate Recurrent Unit, "GRU") with convolution operations, concatenation operations and non-linear activation functions to fuse the consistency and difference features with weights into the temporal dimension feature. The weighted fusion is f_time = SA(SE(Cat(δ(W_corr·f_corr), δ(W_diff·f_diff)))), where f_time is the temporal dimension feature, f_corr the consistency feature, f_diff the difference feature, and W_corr, W_diff are parameters learned by the convolutions. In these expressions SE(x) = F_scale(x, σ(g(W_2·δ(W_1·x)))) and SA(x) = x + σ(W·x), where δ denotes the non-linear activation ReLU, σ the non-linear activation Sigmoid, g the global pooling operation, Cat the concatenation operation, W, W_1 and W_2 are parameters learned by the convolutions, and F_scale is the point-wise multiplication in the feature dimension.
Specifically, the video encoder computes a consistency mask and a difference mask from the initial picture features and the initial picture features of the corresponding regions of the previous video picture; it point-multiplies the previous picture's corresponding initial picture features with the consistency mask to obtain the consistency feature, and point-multiplies the initial picture features with the difference mask to obtain the difference feature. The masks and features are computed as M_corr = SA(W·Cat(f_{t-1}, SE(Corr(f_t, f_{t-1})))), with the correlation layer Corr(x_1, x_2) = Σ_{o∈[-k,k]×[-k,k]} f_{t-1}(x_1+o)·f_t(x_2+o). Here Corr denotes the correlation layer of the optical-flow network; f_t and f_{t-1} denote the initial picture features of each region of the current video picture and of the corresponding region of the previous video picture; [-k, k] is the spatial range of x_1 and x_2 over which f_{t-1} and f_t are respectively evaluated at the corresponding region positions; and M_corr is the consistency mask obtained with the correlation layer, the squeeze-and-excitation network and the self-attention network, characterising the consistency of each region between the previous and current video frames. The video encoder then obtains the consistency and difference features as f_corr = f_{t-1}·M_corr and f_diff = f_t·(1−M_corr), where f_corr and f_diff are the extracted consistency and difference features and 1−M_corr is the difference mask.
In this embodiment, the consistency and difference features are obtained from the consistency and difference between the initial picture features and the previous picture's corresponding initial picture features, and are fused with weights, using convolution operations, concatenation operations and non-linear activation function calculations, into the temporal dimension feature. Because the consistency and difference between each region of the video picture and the corresponding region of the previous picture reflect the dynamic change of the picture content in the time dimension, a temporal dimension feature built on the consistency and difference features fully exploits the characteristics of the human visual system in the time dimension and can further improve the accuracy of the content saliency calculation.
In one example, the video encoder obtains the spatial dimension feature as follows: it obtains low-level features and high-level semantic features from the initial picture features, where the low-level features are obtained through shallow recognition of the video picture and the high-level semantic features through recognition of the content of the video picture, and fuses them with weights into the spatial dimension feature. The low-level features describe aspects of the picture content such as contour, edge, chroma, contrast, texture and shape, while the high-level semantic features result from semantic recognition of the picture content, such as people, cars, trees or wolves. The video encoder can obtain the spatial dimension feature through the hierarchical feature competition module shown in FIG. 8, which uses the squeeze-and-excitation model SE, with global pooling and the Sigmoid function applied after the ReLU convolution activation, to obtain an activation value in the range [0, 1] for each dimension of the initial picture features, and uses the residual-addition scheme of the self-attention model SA, applying the Sigmoid function to perform weighted fusion in space, to obtain the spatial dimension feature. The video encoder extracts multi-level low-level features from the encoder and high-level semantic features from the decoder, and fuses them as f_fuse = SA(SE(Cat(δ(W_low·f_low), δ(W_high·f_high)))), where f_low and f_high are the extracted low-level and high-level semantic features and W_low, W_high are parameters learned by the convolutions.
In this embodiment, low-level features and high-level semantic features are obtained from the initial picture features and are fused with weights, using convolution operations, concatenation operations and non-linear activation function calculations, into the spatial dimension feature. Because the low-level and high-level semantic features of the video picture describe the content from different dimensions, computing content saliency from the spatial picture feature obtained by their weighted fusion can further improve the accuracy of the content saliency calculation.
In step 303, the video encoder calculates the content saliency from the temporal dimension feature and the spatial dimension feature.
Specifically, the video encoder computes the content saliency value from the temporal and spatial dimension features as follows: after obtaining the two features, it uses supervised learning to adaptively learn a mapping function from the feature domain of the temporal and spatial dimension features to the target domain of content saliency values. The mapping function Y′ has the form m = Y′(z′_1, z′_2), where m is the content saliency value, z′_1 the temporal dimension feature and z′_2 the spatial dimension feature. In the training stage of the algorithm, the video encoder samples the training data set according to a Gaussian distribution to obtain an initial random function, and obtains the mapping function Y′ through adaptive learning. Further, the video encoder evaluates the difference between the predicted content saliency values and the actual values in the training data through the loss function, and adaptively learns, via the gradient descent algorithm, to minimise the preset loss function and obtain the mapping function Y′. The training stage is based on a large data set, and the preset loss function is loss = α·kl(p,s) + β·nss(p,s) + γ·cc(p,s), where loss denotes the loss function and α, β and γ are weighting coefficients whose best experimentally determined values are 1, 0.1 and 0.1 respectively; the specific formulas of the three measurement indicators are the same as those of the first embodiment.
In the training stage, the video encoder uses the initial random function as the mapping function Y′, substitutes it in to compute content saliency values and obtain their predicted values, evaluates with the loss function the difference between the predicted values and the ground-truth values in the training data, adjusts the mapping function in the direction that reduces the difference, and iterates the calculation; when the difference between the predicted and true values is sufficiently small, this mapping function is taken as the finally applied mapping function Y′. In the application stage, the trained video encoder takes the extracted temporal and spatial dimension features and maps them, using convolution and pooling/upsampling operations, to the target domain of content saliency values, obtaining the content saliency value result.
In this embodiment, temporal dimension features and spatial dimension features are obtained from the initial picture features and used to calculate the content saliency, so that picture content that is salient in different dimensions can be identified from the different behaviour of the video picture in time and in space, improving the accuracy of the content saliency calculation.
The division of the above methods into steps is only for clarity of description; in implementation they may be merged into one step or some steps may be split into several, and all such variants fall within the scope of protection of this patent as long as they contain the same logical relationship. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or the flow without changing the core design of the algorithm and flow also falls within the scope of protection of this patent.
The third embodiment of the present application relates to a video processing apparatus, as shown in FIG. 9, including:
an extraction module 901, configured to extract initial picture features from each region of the video picture;
a calculation module 902, configured to calculate the content saliency of each region according to the initial picture features;
a marking module 903, configured to mark the content-salient areas in the video picture according to the content saliency of each region;
an encoding module 904, configured to encode the content-salient areas in a first mode and the non-content-salient areas in a second mode; the picture quality of the first-mode encoding is higher than that of the second-mode encoding.
In one example, the calculation module 902 is specifically configured to obtain temporal dimension features and spatial dimension features from the initial picture features, and to calculate the content saliency from the temporal and spatial dimension features.
In one example, the calculation module 902 further includes an inter-frame feature competition sub-module, configured to obtain the consistency feature and the difference feature according to the consistency and difference between the initial picture features and the initial picture features of the corresponding regions of the previous video picture, and to fuse the consistency and difference features with weights to obtain the temporal dimension feature.
In one example, the calculation module 902 is further configured to compute the consistency mask and the difference mask from the initial picture features and the initial picture features of the corresponding regions of the previous video picture; to point-multiply the previous picture's corresponding initial picture features with the consistency mask to obtain the consistency feature; and to point-multiply the initial picture features with the difference mask to obtain the difference feature.
In one example, the calculation module 902 further includes a hierarchical feature competition sub-module, configured to obtain low-level features and high-level semantic features from the initial picture features and, through convolution operations, concatenation operations and non-linear activation function calculations, fuse them with weights to obtain the spatial dimension feature.
In one example, the calculation module 902 is further configured to calculate the content saliency value of each region from the initial picture features, and the marking module 903 is further configured to mark regions whose content saliency value is greater than the preset threshold as content-salient areas.
In one example, the video processing apparatus further includes a frame extraction module, configured to extract video frames from the video to be processed at a preset frame interval before the initial picture features are extracted from the regions of the video picture; the extraction module 901 is further configured to extract the initial picture features from the regions of the extracted video frames.
It is worth mentioning that the modules involved in this embodiment are logical modules; in practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of several physical units. In addition, in order to highlight the innovative part of the present application, this embodiment does not introduce units that are not closely related to solving the technical problem raised by the present application, but this does not mean that no other units exist in this embodiment.
The fourth embodiment of the present application relates to an electronic device, as shown in FIG. 10, including: at least one processor 1001; and a memory 1002 communicatively connected to the at least one processor; the memory 1002 stores instructions executable by the at least one processor 1001, and the instructions are executed by the at least one processor 1001 to perform the above video processing method.
The memory 1002 and the processor 1001 are connected by a bus, which may include any number of interconnected buses and bridges linking the one or more processors 1001 with the various circuits of the memory 1002. The bus may also link various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore not described further here. A bus interface provides the interface between the bus and the transceiver. The transceiver may be one element or several, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Information processed by the processor 1001 is transmitted over a wireless medium through an antenna, and the antenna also receives information and passes it to the processor 1001.
The processor 1001 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management and other control functions, while the memory 1002 may be used to store information used by the processor when performing operations.
The fifth embodiment of the present application relates to a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above method embodiments.
That is, those skilled in the art will understand that all or part of the steps of the methods of the above embodiments can be completed by instructing the relevant hardware through a program stored in a storage medium, the program including a number of instructions that cause a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Those of ordinary skill in the art will understand that the above embodiments are specific examples of implementing the present application, and that in practical applications various changes may be made to them in form and detail without departing from the spirit and scope of the present application.

Claims (10)

  1. A video processing method, comprising:
    extracting initial picture features from each region of a video picture;
    calculating the content saliency of each region according to the initial picture features;
    marking content-salient areas in the video picture according to the content saliency of each region; and
    encoding the content-salient areas in a first mode and non-content-salient areas in a second mode;
    wherein the picture quality of the first-mode encoding is higher than that of the second-mode encoding.
  2. The video processing method according to claim 1, wherein calculating the content saliency of each region according to the initial picture features comprises:
    obtaining temporal dimension features and spatial dimension features according to the initial picture features; and
    calculating the content saliency according to the temporal dimension features and the spatial dimension features.
  3. The video processing method according to claim 2, wherein the temporal dimension features are obtained by:
    obtaining a consistency feature and a difference feature according to the consistency and difference between the initial picture features and the initial picture features of corresponding regions of a previous video picture; and
    fusing the consistency feature and the difference feature with weights to obtain the temporal dimension features.
  4. The video processing method according to claim 3, wherein obtaining the consistency feature and the difference feature according to the consistency and difference between the initial picture features and the initial picture features of the corresponding regions of the previous video picture comprises:
    calculating a consistency mask and a difference mask according to the initial picture features and the initial picture features of the corresponding regions of the previous video picture;
    performing pixel-wise point multiplication of the initial picture features of the corresponding regions of the previous video picture with the consistency mask to obtain the consistency feature; and
    performing point multiplication of the initial picture features with the difference mask to obtain the difference feature.
  5. The video processing method according to any one of claims 2 to 4, wherein the spatial dimension features are obtained by:
    obtaining low-level features and high-level semantic features according to the initial picture features, wherein the low-level features are features obtained through shallow recognition of the video picture and the high-level semantic features are features obtained through recognition of the content of the video picture; and
    fusing the low-level features and the high-level semantic features with weights to obtain the spatial dimension features.
  6. The video processing method according to any one of claims 1 to 5, wherein, before extracting the initial picture features from each region of the video picture, the method further comprises:
    extracting video frames from the video to be processed at a preset frame interval;
    and wherein extracting the initial picture features from each region of the video picture comprises:
    extracting the initial picture features from each region of the extracted video frames.
  7. The video processing method according to any one of claims 1 to 5, wherein calculating the content saliency of each region according to the initial picture features comprises:
    calculating a content saliency value of each region according to the initial picture features;
    and wherein marking the content-salient areas in the video picture according to the content saliency of each region comprises:
    marking regions whose content saliency value is greater than a preset threshold as the content-salient areas.
  8. A video processing apparatus, comprising:
    an extraction module, configured to extract initial picture features from each region of a video picture;
    a calculation module, configured to calculate the content saliency of each region according to the initial picture features;
    a marking module, configured to mark content-salient areas in the video picture according to the content saliency of each region; and
    an encoding module, configured to encode the content-salient areas in a first mode and non-content-salient areas in a second mode, wherein the picture quality of the first-mode encoding is higher than that of the second-mode encoding.
  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor;
    wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the video processing method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the video processing method according to any one of claims 1 to 7.
PCT/CN2021/138819 2020-12-18 2021-12-16 视频处理方法、装置、电子设备及存储介质 WO2022127865A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011507127.4 2020-12-18
CN202011507127.4A CN114650421A (zh) 2020-12-18 2020-12-18 视频处理方法、装置、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022127865A1 true WO2022127865A1 (zh) 2022-06-23

Family

ID=81991428

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/138819 WO2022127865A1 (zh) 2020-12-18 2021-12-16 视频处理方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN114650421A (zh)
WO (1) WO2022127865A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024078512A1 (en) * 2022-10-10 2024-04-18 Alibaba Damo (Hangzhou) Technology Co., Ltd. Pre-analysis based image compression methods

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102801997A (zh) * 2012-07-11 2012-11-28 天津大学 基于感兴趣深度的立体图像压缩方法
CN103618900A (zh) * 2013-11-21 2014-03-05 北京工业大学 基于编码信息的视频感兴趣区域提取方法
CN104539962A (zh) * 2015-01-20 2015-04-22 北京工业大学 一种融合视觉感知特征的可分层视频编码方法
US20180139456A1 (en) * 2008-11-17 2018-05-17 Checkvideo Llc Analytics-modulated coding of surveillance video
CN110310343A (zh) * 2019-05-28 2019-10-08 西安万像电子科技有限公司 图像处理方法及装置
CN110324679A (zh) * 2018-03-29 2019-10-11 优酷网络技术(北京)有限公司 一种视频数据处理方法及装置
CN111193932A (zh) * 2019-12-13 2020-05-22 西安万像电子科技有限公司 图像处理方法及装置

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180139456A1 (en) * 2008-11-17 2018-05-17 Checkvideo Llc Analytics-modulated coding of surveillance video
CN102801997A (zh) * 2012-07-11 2012-11-28 天津大学 基于感兴趣深度的立体图像压缩方法
CN103618900A (zh) * 2013-11-21 2014-03-05 北京工业大学 基于编码信息的视频感兴趣区域提取方法
CN104539962A (zh) * 2015-01-20 2015-04-22 北京工业大学 一种融合视觉感知特征的可分层视频编码方法
CN110324679A (zh) * 2018-03-29 2019-10-11 优酷网络技术(北京)有限公司 一种视频数据处理方法及装置
CN110310343A (zh) * 2019-05-28 2019-10-08 西安万像电子科技有限公司 图像处理方法及装置
CN111193932A (zh) * 2019-12-13 2020-05-22 西安万像电子科技有限公司 图像处理方法及装置

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024078512A1 (en) * 2022-10-10 2024-04-18 Alibaba Damo (Hangzhou) Technology Co., Ltd. Pre-analysis based image compression methods

Also Published As

Publication number Publication date
CN114650421A (zh) 2022-06-21

Similar Documents

Publication Publication Date Title
Wang et al. Towards unified depth and semantic prediction from a single image
CN110751649B (zh) 视频质量评估方法、装置、电子设备及存储介质
US11978178B2 (en) Electronic device, control method thereof, and system
EP4090022A1 (en) Image processing method and related device
CN112634296A (zh) 门机制引导边缘信息蒸馏的rgb-d图像语义分割方法及终端
CN113204659B (zh) 多媒体资源的标签分类方法、装置、电子设备及存储介质
CN113822794A (zh) 一种图像风格转换方法、装置、计算机设备和存储介质
WO2022127865A1 (zh) 视频处理方法、装置、电子设备及存储介质
CN114511041A (zh) 模型训练方法、图像处理方法、装置、设备和存储介质
CN113343981A (zh) 一种视觉特征增强的字符识别方法、装置和设备
CN116580278A (zh) 一种基于多注意力机制的唇语识别方法、设备及存储介质
Su et al. Bitstream-based perceptual quality assessment of compressed 3d point clouds
CN114529785A (zh) 模型的训练方法、视频生成方法和装置、设备、介质
CN111898638B (zh) 融合不同视觉任务的图像处理方法、电子设备及介质
CN113177526A (zh) 基于人脸识别的图像处理方法、装置、设备及存储介质
Han Texture image compression algorithm based on self-organizing neural network
CN116310315A (zh) 抠图方法、装置、电子设备以及存储介质
Uchigasaki et al. Deep image compression using scene text quality assessment
Zhao et al. End‐to‐End Retinex‐Based Illumination Attention Low‐Light Enhancement Network for Autonomous Driving at Night
Wang et al. Multi-priors guided dehazing network based on knowledge distillation
CN113657415A (zh) 一种面向示意图的对象检测方法
Zhang et al. TCDM: Transformational Complexity Based Distortion Metric for Perceptual Point Cloud Quality Assessment
JP7479507B2 (ja) 画像処理方法及び装置、コンピューター機器、並びにコンピュータープログラム
CN116778376B (zh) 内容安全检测模型训练方法、检测方法和装置
Jiang et al. An end-to-end dynamic point cloud geometry compression in latent space

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21905791

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21905791

Country of ref document: EP

Kind code of ref document: A1