WO2024001887A1 - Video image processing method and apparatus, electronic device, and storage medium - Google Patents

Video image processing method and apparatus, electronic device, and storage medium Download PDF

Info

Publication number
WO2024001887A1
WO2024001887A1 (PCT/CN2023/101498)
Authority
WO
WIPO (PCT)
Prior art keywords
image
processed
motion estimation
optical flow
layer
Prior art date
Application number
PCT/CN2023/101498
Other languages
English (en)
French (fr)
Inventor
陈杰
易自尧
徐科
孔德辉
Original Assignee
深圳市中兴微电子技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市中兴微电子技术有限公司
Publication of WO2024001887A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • H04N19/147Data rate or code amount at the encoder output according to rate distortion criteria
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/96Tree coding, e.g. quad-tree coding

Definitions

  • the present application relates to the technical field of video image processing, and specifically to a video image processing method, device, electronic equipment and storage medium.
  • video data needs to be partitioned into blocks in different modes, and the video data of different frames needs to be compressed, to meet different application requirements.
  • This application provides a video image processing method, device, electronic equipment and storage medium.
  • Embodiments of the present application provide a video image processing method, which includes: inputting the video image to be processed into a multi-scale optical flow motion estimation network for motion estimation, and obtaining motion estimation information of the video image to be processed.
  • An embodiment of the present application provides a video image processing device, including: a motion estimation module configured to input the video image to be processed into a multi-scale optical flow motion estimation network for motion estimation, and obtain the motion estimation information of the video image to be processed.
  • the multi-scale optical flow motion estimation network is a network that represents optical flow and optical flow guidance information at different scales
  • the encoding module is configured to input the video image to be processed and its corresponding motion estimation information into the encoder for encoding to obtain the target image.
  • Embodiments of the present application provide an electronic device, including: one or more processors; and a memory on which one or more programs are stored. When the one or more programs are executed by the one or more processors, the one or more processors implement the video image processing method of this application.
  • Embodiments of the present application provide a readable storage medium that stores a computer program.
  • when the computer program is executed by a processor, the video image processing method of the present application is implemented.
  • Figure 1 shows a schematic flowchart of a video image processing method according to the present application.
  • FIG. 2 shows a block diagram of a video image processing system according to the present application.
  • Figure 3 shows a schematic structural diagram of a multi-scale optical flow motion estimation network based on pyramid image determination according to the present application.
  • Figure 4 shows a schematic diagram of the training process of the multi-scale optical flow motion estimation network according to the present application.
  • FIG. 5 shows another schematic flowchart of a video image processing method according to the present application.
  • FIG. 6 shows a block diagram of a video image processing device according to the present application.
  • FIG. 7 shows a structural diagram of an exemplary hardware architecture of a computing device that implements a video image processing method and apparatus according to the present application.
  • in the High Efficiency Video Coding (HEVC) protocol or the Versatile Video Coding (VVC) protocol, the Coding Tree Unit (CTU) is usually used as the basic processing structure.
  • the CTU can be further divided into Coding Units (CUs).
  • during intra-frame or inter-frame prediction, CTUs and CUs can also be divided into multiple Prediction Units (PUs), and working parameters (such as the coding mode) are shared among the PUs.
  • in HEVC, a CU can use a quadtree to represent its internal structure.
  • the unit partitioning modes supported by the HEVC protocol and the VVC protocol include: no partition, quaternary partition, two binary partitions, and two ternary partitions (for example, 1/4, 2/4, 1/4 horizontal or vertical partitions of a CU).
  • 64 geometric PU partitioning modes are also introduced to allow non-horizontal or non-vertical splits in rectangular or square CUs. Each of the 64 geometric partitions is represented by an index value pointing to its parameters (e.g., angle and/or distance).
  • the geometric PU partitioning modes cannot be applied to CUs whose width (or height) is greater than 64 or whose width (or height) is less than 8.
  • the VVC protocol also includes specific partitioning modes for Intra Sub-Partitions (ISP). The multiple partitioning modes provide greater partitioning flexibility.
  • at the same video quality, intra-frame coding implemented with the Versatile Video Coding (VVC) protocol can improve image compression efficiency by about 50%.
  • when compressing coding blocks, however, the VVC protocol usually adopts a block partitioning mode that combines quadtrees, ternary trees, and binary trees. Throughout block partitioning, all possible partitioning modes must be traversed recursively and the mode with the smallest rate-distortion cost selected as the target partitioning mode. This greatly increases computational complexity, prolongs the processing time of video files, slows down image compression, and cannot meet users' real-time needs for video file processing.
  • motion estimation is a key part of video encoding and decoding, but in traditional coding methods motion estimation is performed on a PU basis, and after different regions of the video data are partitioned, different macroblocks are usually divided simply according to position information, which easily leads to inaccurate motion estimation of objects in the video, with the corresponding motion compensation unable to meet the usage requirements of the image.
  • FIG. 1 shows a schematic flowchart of a video image processing method according to the present application. This method can be applied to video image processing devices. As shown in Figure 1, the video image processing method according to the present application includes but is not limited to the following steps S110 to S120.
  • in step S110, the video image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation, and motion estimation information of the video image to be processed is obtained.
  • the multi-scale optical flow motion estimation network is a network that represents optical flow and optical flow guidance information at different scales.
  • Optical flow (Optical Flow or Optic Flow) is a concept used to describe the motion of objects in video images, that is, the motion of an observed target, surface, or edge caused by motion relative to the observer.
  • in step S120, the video image to be processed and its corresponding motion estimation information are input into the encoder for encoding to obtain the target image.
  • in this embodiment, inputting the video image to be processed into the multi-scale optical flow motion estimation network for motion estimation yields motion estimation information that reflects the motion estimation information of the video image at different scales, which facilitates subsequent processing of the video image to be processed.
  • inputting the video image to be processed and its corresponding motion estimation information into the encoder for encoding allows the video image to be encoded on the basis of motion estimation information at different scales, which reduces computational complexity, improves image processing efficiency, and shortens processing time, so that the obtained target image can accurately reflect the motion trajectories of objects in the video image to be processed and meet users' needs for video images.
  • before the video image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation and its motion estimation information is obtained (i.e., step S110), the method also includes: obtaining sample optical flow data and sample video image data; pre-training the optical flow motion estimation network based on an endpoint error function and the sample optical flow data to obtain a network to be processed; and inputting sample video images into the network to be processed for fine-tuning training to obtain the multi-scale optical flow motion estimation network.
  • the sample video image includes multiple layers of sample images, and the image resolution corresponding to each layer of sample images is different.
  • optical flow data uses the temporal changes of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, thereby computing the motion information of objects between adjacent frames.
  • optical flow data is data generated due to the movement of the foreground object itself in the scene, the movement of the camera, or the joint movement of the two.
  • the sample optical flow data may be sample data obtained by manually annotating optical flow.
  • the sample optical flow data is input into the optical flow motion estimation network for training, and an endpoint error function (End Point Error Loss) is used as the loss function to determine whether to end the pre-training of the optical flow motion estimation network.
  • the endpoint error function is used to calculate the two-dimensional Euclidean distance between the predicted optical flow of each pixel in the sample optical flow data and the pre-annotated optical flow, and whether that distance lies within a preset distance threshold determines whether the pre-training of the optical flow motion estimation network ends.
  • when the two-dimensional Euclidean distance is within the preset distance threshold (i.e., the endpoint error function has stably converged), the network to be processed is obtained; otherwise, the pre-training process continues.
  • sample video images need to be input into the network to be processed for fine-tuning training, so that the fine-tuned and trained network can meet the processing needs of images of different scales.
  • inputting the video image to be processed and its corresponding motion estimation information into the encoder for encoding to obtain the target image (i.e., step S120) can be implemented in the following manner: the video image to be processed and its corresponding motion estimation information are input into the encoder for encoding, and an encoded image is obtained; when the encoded image is determined to satisfy a preset image quality evaluation index, the target image is obtained.
  • the preset image quality evaluation index may include at least one of: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and encoding speed.
  • by evaluating the encoded image with multiple different evaluation indexes, the encoded image can be reasonably assessed, so that target images meeting user needs are obtained and the user experience improves.
  • FIG. 2 shows a block diagram of a video image processing system according to the present application.
  • the video image sequence 201 is input into the motion estimation network 202 for training, and motion estimation information corresponding to the video image sequence 201 is obtained; the motion estimation information and the video image sequence 201 are then input into the video encoder 203 for encoding, thereby obtaining the compressed video image sequence 204 output by the video encoder 203.
  • the motion estimation network 202 is a multi-scale optical flow motion estimation network capable of representing optical flow and optical flow guidance information at different scales.
  • the motion estimation network 202 may include at least one or several of: a convolution (Convolution) module, a deconvolution (Deconvolution) module, a rectified linear unit (Linear Rectification Function, ReLU) processing module, a neural network activation function (such as a Sigmoid function) processing module, a fully connected (Full-Connection) layer, and a reshape (Reshape) processing module.
  • the video image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation, and the motion estimation information of the video image to be processed is obtained (i.e., step S110). This can be implemented in the following manner: the video image to be processed is layered based on image resolution to obtain the pyramid image corresponding to the video image to be processed; each layer of the image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation, and the motion estimation parameters and residual value corresponding to each layer of the image to be processed are obtained; and the motion estimation information of the video image to be processed is determined based on the residual value corresponding to each layer of the image to be processed and a preset residual threshold.
  • the pyramid image includes multiple layers of images to be processed, and each layer of images to be processed has different image resolutions.
  • each layer of the image to be processed in the pyramid image corresponds to a set of motion estimation parameters and residual values.
  • the residual value corresponding to the image to be processed in a certain layer is compared with a preset residual threshold to determine whether to end the motion estimation of the image to be processed in that layer. For example, when the residual value corresponding to the image to be processed in a certain layer is greater than the preset residual threshold, the motion estimation of the image to be processed in that layer can be ended and the motion estimation information corresponding to that layer's image is obtained; otherwise, motion estimation on that layer's image needs to continue.
  • a comprehensive analysis can be performed based on the motion estimation information corresponding to the images to be processed in each layer to determine the motion estimation information of the video image to be processed. For example, the motion estimation information corresponding to the images to be processed in each layer is combined to obtain the motion estimation information of the video image to be processed.
  • the motion estimation information of the video image to be processed includes motion estimation information of multiple different image resolutions.
  • inputting each layer of the image to be processed into the multi-scale optical flow motion estimation network for motion estimation and obtaining the motion estimation parameters and residual values corresponding to each layer can be implemented in the following manner.
  • the following processing is performed on each layer of the image to be processed: the image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation, and the pixel-based optical flow data corresponding to the image to be processed is obtained; based on the number of macroblocks in the image to be processed and the motion vectors of the multiple pixels in each macroblock, the mean motion vector of the multiple pixels in each macroblock of the image to be processed is determined; and based on the motion vector corresponding to the macroblock and the mean motion vector of the multiple pixels in each macroblock, the residual value corresponding to the macroblock is determined.
  • the motion vector corresponding to the macroblock is the motion vector of the macroblock in the current frame image.
  • Each layer of the image to be processed includes multiple macroblocks, each macroblock includes multiple pixels, and the motion estimation parameters include motion vector averages.
  • Pixel-based optical flow data includes motion vectors of multiple pixels within each macroblock in the image to be processed.
  • by inputting the image to be processed into the multi-scale optical flow motion estimation network for motion estimation, the pixel-based optical flow data corresponding to the image to be processed is obtained; pixel-level optical flow data (for example, the motion vectors of the multiple pixels within each macroblock of the image to be processed) can thus be obtained, improving the processing accuracy for the image to be processed.
  • in addition, the motion vectors of the multiple pixels in each macroblock of the image to be processed can be summed and averaged to obtain the mean motion vector of the multiple pixels in each macroblock, reflecting the directionality of motion within each macroblock.
  • further, by comparing the motion vector corresponding to the macroblock with the calculated mean motion vector of the multiple pixels in each macroblock, it can be determined whether the motion vector of the macroblock agrees with the mean motion vector of the multiple pixels in the macroblock, and the residual value corresponding to the macroblock can be determined, facilitating subsequent processing of the macroblock.
  • the video image to be processed at least includes: a continuous first frame sample image and second frame sample image.
  • for example, FIG. 3 shows a schematic structural diagram of a multi-scale optical flow motion estimation network determined on the basis of pyramid images according to the present application.
  • F1 and F2 respectively represent two consecutive sample frames in the video image (for example, images with an image resolution of 512×512), where F2 is the current frame and F1 is the frame preceding F2.
  • an n-layer pyramid structure is constructed for F1 and F2 respectively (for example, as shown in Figure 3, n equals 3), and the first pyramid image corresponding to F1 and the second pyramid image corresponding to F2 are obtained; the first pyramid and the second pyramid have the same number of layers, i.e., the first pyramid image and the second pyramid image each include n layers of images to be processed.
  • k is an integer greater than or equal to 1 and less than or equal to the number of layers of the first pyramid image (i.e., n), that is, 1 ≤ k ≤ n.
  • if the residual value corresponding to the k-th layer image to be processed is within the range of the preset residual threshold, the motion estimation parameters corresponding to the (k-1)-th layer image to be processed determine the motion estimation information of the video image to be processed, where the motion estimation parameters corresponding to the (k-1)-th layer image may include: the mean motion vector of the multiple pixels within each macroblock of each layer of the image to be processed.
  • for example, if the k-th layer is an image with a resolution of 128×128, the mean motion vector of the multiple pixels in each macroblock of the k-th layer image to be processed (for example, a block with a resolution of 32×32 or 64×64) can be obtained.
  • the above steps need to be repeated until the residual values corresponding to the images to be processed at all levels of the pyramid image meet the requirements of the preset residual threshold; the final result includes the mean motion vectors of multiple pixels in macroblocks of different sizes, reflecting the motion estimation parameters corresponding to each layer of the image to be processed, for convenient subsequent use.
  • determining the motion estimation information of the video image to be processed based on the residual value corresponding to each layer of the image to be processed and the preset residual threshold includes: when the residual value corresponding to the p-th layer image to be processed is determined to be within the range of the preset residual threshold, determining the motion estimation information of the video image to be processed based on the motion estimation parameters corresponding to the (p-1)-th layer image to be processed.
  • the motion estimation parameters corresponding to the (p-1)-th layer image to be processed include: the mean motion vector of the multiple pixels within each macroblock of each layer of the image to be processed.
  • p is an integer greater than 1 and less than or equal to the number of layers of the pyramid image (i.e., n), that is, 1 < p ≤ n, and the first-layer image to be processed is the image to be processed of the initial layer.
  • FIG. 4 shows a schematic diagram of the training process of the multi-scale optical flow motion estimation network according to the present application.
  • the training network of the multi-scale optical flow motion estimation network includes but is not limited to the following modules: a previous-feature module 401, a deformable convolutional network (Deformable Convolutional Networks, DCN) 402, a warped-feature module 403, optical flow data 404, an LR feature module 405, a residual offsets calculation module 406, a DCN offsets module 407, a DCN masks module 408, and a first convolution kernel (Conv C1) 411 through an m-th convolution kernel (Conv Cm) 41m, where m represents the number of convolution kernels and m is an integer greater than or equal to 1.
  • the diversity of the offsets in the DCN enables deformable alignment to outperform alignment based on optical flow data.
  • however, deformable alignment is difficult to train.
  • during network training, owing to the instability of training results, the offsets can easily diverge, thus affecting the performance of the network model.
  • to fully exploit the diversity of the offsets while overcoming the instability of network training, and based on the strong correlation between deformable alignment and optical flow alignment, the deformation data is guided using the optical flow data 404 to obtain good output data.
  • limiting the offsets to the range of the preset optical flow data can prevent offset overflow and greatly increase the stability of network training.
  • in the i-th temporal compensation step, if the deformed feature output by the LR feature module 405 is g_i, the feature output by the previous-feature module 401 is f_{i-1}, and the optical flow data 404 is denoted s_{i→i-1} (i.e., the optical flow from position i to position i-1), then the transformed result f̄_{i-1} can be obtained by applying an affine transformation (warping) to f_{i-1}; the process of affine transformation can be expressed by the following formula: f̄_{i-1} = W(f_{i-1}, s_{i→i-1}), where W(·,·) denotes the spatial warping operation.
  • the offset o_{i→i-1} is output through the DCN offsets module 407, and the modulation mask m_{i→i-1} is applied through the DCN masks module 408, thereby completing the pre-alignment of the features; i is an integer greater than 1.
  • in the above process, by computing the optical flow residual rather than computing the offsets directly, offset-based learning can use the optical flow data 404 to assist the alignment of features; and since the convolutional neural network only learns the residual, the burden on a conventional deformable alignment module can be reduced.
  • the DCN masks module 408 can also function as an attention mechanism, making the trained network more flexible.
  • FIG. 5 shows another schematic flowchart of a video image processing method according to the present application.
  • the video image processing method in the embodiment of the present application includes but is not limited to the following steps S501 to S506.
  • in step S501, sample optical flow data and sample video image data are obtained.
  • the sample video image data includes multiple layers of sample images, and the image resolution corresponding to each layer of sample images is different. For example, multiple sample video images are divided based on a preset number to obtain training set data and test set data.
  • Sample optical flow data may include: data obtained by downloading public data sets on the Internet, and/or by manually annotating optical flow.
  • in step S502, the optical flow motion estimation network is pre-trained based on the endpoint error function and the sample optical flow data to obtain the network to be processed.
  • the optical flow motion estimation network may include: convolution-based neural network (such as optical flow network (FlowNet), etc.).
  • the above optical flow motion estimation network is only an example, and specific settings can be made according to actual needs.
  • other optical flow motion estimation networks not described here are also within the protection scope of this application and will not be described again here.
  • the endpoint error function can be used to calculate the two-dimensional Euclidean distance between the predicted optical flow of each pixel in the sample optical flow data and the pre-annotated optical flow, and this distance is used to determine whether the pre-training of the optical flow motion estimation network is complete.
  • when the two-dimensional Euclidean distance is determined to be within the preset distance threshold (i.e., the endpoint error function has stably converged), the network to be processed is obtained; otherwise, the pre-training process continues.
  • in step S503, the sample video images are input into the network to be processed for fine-tuning training to obtain the multi-scale optical flow motion estimation network.
  • fine-tuning training inputs the sample video images into the network to be processed for training.
  • for example, a mean square error (Mean Square Error, MSE) loss function and/or an L1 loss function is used to fine-tune the training results; when the MSE loss function and/or the L1 loss function converges stably, the multi-scale optical flow motion estimation network is obtained; otherwise, the fine-tuning training process continues.
  • the L1 loss function is used to minimize the error, which is the sum of all absolute differences between the true values and the predicted values.
  • in step S504, the video image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation, and motion estimation information of the video image to be processed is obtained.
  • in step S505, the multi-scale optical flow motion estimation network is connected to the encoder, so that the video image to be processed and its corresponding motion estimation information are input into the encoder for encoding, and the encoded image is obtained.
  • in step S506, if it is determined that the encoded image satisfies the preset image quality evaluation index, the target image is obtained.
  • if the encoded image does not satisfy the preset image quality evaluation index, the multi-scale optical flow motion estimation network needs to be trained further until the encoded image meets the preset image quality evaluation index.
  • the preset image quality evaluation index includes any one or several of: image quality indexes (such as PSNR and SSIM), compression performance indexes (such as coding compression ratio and encoding speed), and weighted sums of the above indexes.
  • in this embodiment, by using pixel-level inter-frame motion estimation based on an optical flow network, the accuracy of the prediction unit's motion estimation and motion compensation is effectively improved, making the obtained motion estimation information of the video image to be processed more accurate.
  • in addition, the multi-scale optical flow network prediction approach can reduce computational complexity, thereby improving image processing efficiency, making the inter-frame prediction process more accurate, and improving encoding and decoding efficiency.
  • FIG. 6 shows a block diagram of a video image processing device according to the present application.
  • the video image processing device 600 includes but is not limited to the following modules: a motion estimation module 601 and an encoding module 602 .
  • the motion estimation module 601 is configured to input the video image to be processed into a multi-scale optical flow motion estimation network for motion estimation, and obtain the motion estimation information of the video image to be processed, where the multi-scale optical flow motion estimation network is a network that represents optical flow and optical flow guidance information at different scales.
  • the encoding module 602 is configured to input the video image to be processed and its corresponding motion estimation information into the encoder for encoding to obtain the target image.
  • the motion estimation module 601 is specifically used to: layer the video image to be processed according to image resolution to obtain a pyramid image corresponding to the video image to be processed, where the pyramid image includes multiple layers of images to be processed and the image resolution corresponding to each layer is different; input each layer of the image to be processed into the multi-scale optical flow motion estimation network for motion estimation to obtain the motion estimation parameters and residual values corresponding to each layer of the image to be processed; and determine the motion estimation information of the video image to be processed according to the residual value corresponding to each layer of the image to be processed and the preset residual threshold.
  • the image to be processed in each layer includes multiple macroblocks, each macroblock includes multiple pixels, and the motion estimation parameters include motion vector averages.
  • the images to be processed in each layer are input into the multi-scale optical flow motion estimation network for motion estimation, and the motion estimation parameters and residual values corresponding to the images to be processed in each layer are obtained, including: performing the following processing on the images to be processed in each layer:
  • the image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation, and the pixel-based optical flow data corresponding to the image to be processed is obtained.
  • the pixel-based optical flow data includes the motion vectors of the multiple pixels within each macroblock of the image to be processed.
  • based on the number of macroblocks in the image to be processed and the motion vectors of the multiple pixels in each macroblock, the mean motion vector of the multiple pixels in each macroblock of the image to be processed is determined; and based on the motion vector corresponding to the macroblock and the mean motion vector of the multiple pixels in each macroblock, the residual value corresponding to the macroblock is determined.
  • the motion vector corresponding to the macroblock is the motion vector of the macroblock in the current frame image.
  • the video image to be processed at least includes: a continuous first frame sample image and second frame sample image, where the first frame sample image corresponds to the first pyramid image, the second frame sample image corresponds to the second pyramid image, and the first pyramid image and the second pyramid image have the same number of layers; inputting the image to be processed into the multi-scale optical flow motion estimation network for motion estimation and obtaining the pixel-based optical flow data corresponding to the image to be processed includes: respectively extracting the k-th layer image to be processed in the first pyramid image and the k-th layer image to be processed in the second pyramid image;
  • the k-th layer image to be processed in the first pyramid image is input into the multi-scale optical flow motion estimation network for motion estimation, and the motion estimation information corresponding to the k-th layer image to be processed in the second pyramid image is obtained;
  • based on the motion estimation information corresponding to the k-th layer image to be processed in the second pyramid image and the k-th layer image to be processed in the second pyramid image, the pixel-based optical flow data corresponding to the k-th layer image to be processed is determined. k is an integer greater than or equal to 1 and less than or equal to the number of layers of the first pyramid image (i.e., n), that is, 1 ≤ k ≤ n.
  • determining the motion estimation information of the video image to be processed based on the residual value corresponding to each layer of the image to be processed and the preset residual threshold includes: when the residual value corresponding to the p-th layer image to be processed is determined to be within the range of the preset residual threshold, determining the motion estimation information of the video image to be processed based on the motion estimation parameters corresponding to the (p-1)-th layer image to be processed.
  • the motion estimation parameters corresponding to the (p-1)-th layer image to be processed include: the mean motion vector of the multiple pixels within each macroblock of each layer of the image to be processed.
  • p is an integer greater than 1 and less than or equal to the number of layers of the pyramid image (i.e., n), that is, 1 < p ≤ n, and the first-layer image to be processed is the image to be processed of the initial layer.
  • the video image processing device also includes: an acquisition module, used to obtain sample optical flow data and sample video image data; a pre-training module, used to pre-train the optical flow motion estimation network based on the endpoint error function and the sample optical flow data to obtain the network to be processed; and a fine-tuning training module, used to input sample video images into the network to be processed for fine-tuning training to obtain the multi-scale optical flow motion estimation network.
  • the sample video image includes multiple layers of sample images, and each layer of sample images corresponds to different image resolutions.
  • the encoding module 602 is specifically used to: input the video image to be processed and its corresponding motion estimation information into the encoder for encoding to obtain the encoded image; and, when it is determined that the encoded image meets the preset image quality evaluation index, obtain the target image.
  • the preset image quality evaluation index includes at least one of: peak signal-to-noise ratio, structural similarity, and encoding speed.
  • video image processing device 600 in this embodiment can implement any video image processing method in the embodiments of this application.
  • the video image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation through the motion estimation module, and the motion estimation information of the video image to be processed is obtained, so that this information reflects the motion estimation information of the video image at different scales, which facilitates subsequent processing of the video image to be processed;
  • the encoding module inputs the video image to be processed and its corresponding motion estimation information into the encoder for encoding, so that the video image can be encoded on the basis of motion estimation information at different scales, reducing computational complexity, improving image processing efficiency, and shortening processing time, so that the obtained target image can accurately reflect the motion trajectories of objects in the video image to be processed and meet users' needs for video images.
  • FIG. 7 shows a structural diagram of an exemplary hardware architecture of a computing device that implements a video image processing method and apparatus according to the present application.
  • computing device 700 includes an input device 701 , an input interface 702 , a central processing unit 703 , a memory 704 , an output interface 705 , and an output device 706 .
  • the input interface 702, the central processing unit 703, the memory 704, and the output interface 705 are connected to each other through the bus 707.
  • the input device 701 and the output device 706 are connected to the bus 707 through the input interface 702 and the output interface 705 respectively, and are thereby connected to the other components of the computing device 700.
  • the input device 701 receives input information from the outside and transmits the input information to the central processor 703 through the input interface 702; the central processor 703 processes the input information based on computer-executable instructions stored in the memory 704 to generate output information, stores the output information temporarily or permanently in the memory 704, and then transmits the output information to the output device 706 through the output interface 705; the output device 706 outputs the output information to the outside of the computing device 700 for use by the user.
  • the computing device shown in FIG. 7 may be implemented as an electronic device, and the electronic device may include: a memory configured to store a program; and a processor configured to run the program stored in the memory to execute the video image processing method described in the above embodiments.
  • the computing device shown in FIG. 7 may be implemented as a video image processing system.
  • the system may include: a memory configured to store a program; and a processor configured to run the program stored in the memory to perform the video image processing method described in the above embodiments.
  • Embodiments of the present application may be implemented by a data processor of a mobile device executing computer program instructions, for example in a processor entity, or by hardware, or by a combination of software and hardware.
  • Computer program instructions may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages.
  • Any block diagram of a logic flow in the figures of this application may represent program steps, or may represent interconnected logic circuits, modules, and functions, or may represent a combination of program steps and logic circuits, modules, and functions.
  • Computer programs can be stored on memory.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as, but not limited to, read-only memory (ROM), random access memory (RAM), and optical storage devices and systems (digital versatile discs (DVDs) or CDs).
  • Computer-readable media may include non-transitory storage media.
  • the data processor may be of any type suitable for the local technical environment, such as, but not limited to, general-purpose computers, special-purpose computers, microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), programmable logic devices (FPGAs), and processors based on multi-core processor architectures.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

This application provides a video image processing method and apparatus, an electronic device, and a storage medium, relating to the technical field of video image processing. The method includes: inputting a video image to be processed into a multi-scale optical flow motion estimation network for motion estimation to obtain motion estimation information of the video image to be processed; and inputting the video image to be processed and its corresponding motion estimation information into an encoder for encoding to obtain a target image.

Description

Video image processing method and apparatus, electronic device, and storage medium
Cross-Reference to Related Application
This application claims priority to Chinese patent application CN 202210770276.2, filed on June 30, 2022 and entitled "Video image processing method and apparatus, electronic device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of video image processing, and specifically to a video image processing method and apparatus, an electronic device, and a storage medium.
Background
At present, in the field of video coding, to obtain good video quality, video data needs to be partitioned into blocks in different modes, and the video data of different frames needs to be compressed, to meet different application requirements.
However, in the process of encoding and partitioning video data, all possible partitioning modes need to be traversed recursively, and the processing mode with the smallest distortion cost is selected to process the video data. After different regions of the video data are partitioned, different macroblocks (e.g., 32×32) are usually further divided simply according to position information, which easily makes the motion estimation of objects in the video inaccurate, so that the corresponding motion compensation cannot meet the usage requirements of the image.
Although the above processing improves the video compression ratio, it greatly increases computational complexity and prolongs the processing time of video files, and thus cannot meet users' real-time processing needs for video files.
Summary
This application provides a video image processing method and apparatus, an electronic device, and a storage medium.
An embodiment of this application provides a video image processing method, including: inputting a video image to be processed into a multi-scale optical flow motion estimation network for motion estimation to obtain motion estimation information of the video image to be processed, where the multi-scale optical flow motion estimation network is a network that represents optical flow and optical flow guidance information at different scales; and inputting the video image to be processed and its corresponding motion estimation information into an encoder for encoding to obtain a target image.
An embodiment of this application provides a video image processing apparatus, including: a motion estimation module configured to input a video image to be processed into a multi-scale optical flow motion estimation network for motion estimation to obtain motion estimation information of the video image to be processed, where the multi-scale optical flow motion estimation network is a network that represents optical flow and optical flow guidance information at different scales; and an encoding module configured to input the video image to be processed and its corresponding motion estimation information into an encoder for encoding to obtain a target image.
An embodiment of this application provides an electronic device, including: one or more processors; and a memory on which one or more programs are stored, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the video image processing method of this application.
An embodiment of this application provides a readable storage medium storing a computer program which, when executed by a processor, implements the video image processing method of this application.
Further description of the above and other aspects of this application and of their implementations is provided in the brief description of the drawings, the detailed description, and the claims.
Brief Description of the Drawings
FIG. 1 shows a schematic flowchart of a video image processing method according to this application.
FIG. 2 shows a block diagram of a video image processing system according to this application.
FIG. 3 shows a schematic structural diagram of a multi-scale optical flow motion estimation network determined on the basis of pyramid images according to this application.
FIG. 4 shows a schematic diagram of the training process of the multi-scale optical flow motion estimation network according to this application.
FIG. 5 shows another schematic flowchart of a video image processing method according to this application.
FIG. 6 shows a block diagram of a video image processing apparatus according to this application.
FIG. 7 shows a structural diagram of an exemplary hardware architecture of a computing device implementing the video image processing method and apparatus according to this application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, embodiments of this application are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another arbitrarily.
In the High Efficiency Video Coding (HEVC) protocol or the Versatile Video Coding (VVC) protocol, the Coding Tree Unit (CTU) is usually used as the basic processing structure, and a CTU can be further divided into Coding Units (CUs). During intra-frame or inter-frame prediction, CTUs and CUs can also be divided into multiple Prediction Units (PUs), with working parameters (e.g., the coding mode) shared among the PUs. In the HEVC protocol, a CU can use a quadtree to represent its internal structure.
The unit partitioning modes supported by the HEVC and VVC protocols include: no partition, quaternary partition, two binary partitions, and two ternary partitions (for example, 1/4, 2/4, 1/4 horizontal or vertical partitions of a CU). The VVC protocol additionally introduces 64 geometric PU partitioning modes to allow non-horizontal or non-vertical splits within rectangular or square CUs. Each of the 64 geometric partitions is represented by an index value pointing to its parameters (e.g., angle and/or distance). The geometric PU partitioning modes cannot be applied to CUs whose width (or height) is greater than 64 or whose width (or height) is less than 8. The VVC protocol also includes specific partitioning modes for Intra Sub-Partitions (ISP). The multiple partitioning modes provide greater partitioning flexibility.
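The recursive partition search described above can be made concrete with a small sketch. The following Python toy is not the VVC reference algorithm: `rd_cost` is a hypothetical stand-in for a real encoder's rate-distortion measurement, and only the quad split is tried. It still shows why the traversal is expensive — every level must recurse into all sub-blocks before a decision can be made.

```python
import numpy as np

def rd_cost(block: np.ndarray) -> float:
    # Placeholder cost: a real encoder would measure rate + lambda * distortion.
    return float(block.var()) * block.size

def best_partition(block: np.ndarray, min_size: int = 8):
    """Recursively choose between 'no split' and a quad split by RD cost."""
    h, w = block.shape
    no_split_cost = rd_cost(block)
    if h <= min_size or w <= min_size:
        return no_split_cost, [block]
    quads = [block[:h // 2, :w // 2], block[:h // 2, w // 2:],
             block[h // 2:, :w // 2], block[h // 2:, w // 2:]]
    split_cost, leaves = 0.0, []
    for sub in quads:
        cost, sub_leaves = best_partition(sub, min_size)
        split_cost += cost
        leaves += sub_leaves
    # Keep whichever alternative has the smaller rate-distortion cost.
    if no_split_cost <= split_cost:
        return no_split_cost, [block]
    return split_cost, leaves

cost, leaves = best_partition(np.random.rand(64, 64))
```

VVC additionally tries binary and ternary splits at every node, which multiplies the number of recursive branches; this is exactly the computational burden discussed next.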
At the same video quality, intra-frame coding implemented with the Versatile Video Coding (VVC) protocol can improve image compression efficiency by about 50%. However, when compressing coding blocks, the VVC protocol usually adopts a block partitioning mode that combines quadtrees, ternary trees, and binary trees. Throughout block partitioning, all possible partitioning modes must be traversed recursively and the mode with the smallest rate-distortion cost selected as the target partitioning mode, which greatly increases computational complexity, prolongs the processing time of video files, slows down image compression, and cannot meet users' real-time processing needs for video files.
Motion estimation is a key part of video encoding and decoding. In traditional coding methods, however, motion estimation is performed on a PU basis, and after different regions of the video data are partitioned, different macroblocks are usually further divided simply according to position information, which easily makes the motion estimation of objects in the video inaccurate, so that the corresponding motion compensation cannot meet the usage requirements of the image.
FIG. 1 shows a schematic flowchart of a video image processing method according to this application. The method can be applied to a video image processing apparatus. As shown in FIG. 1, the video image processing method according to this application includes but is not limited to the following steps S110 to S120.
In step S110, the video image to be processed is input into a multi-scale optical flow motion estimation network for motion estimation, and motion estimation information of the video image to be processed is obtained.
The multi-scale optical flow motion estimation network is a network that represents optical flow and optical flow guidance information at different scales. Optical flow (Optical Flow or Optic Flow) is a concept used to describe the motion of objects in video images, that is, the motion of an observed target, surface, or edge caused by motion relative to the observer.
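As a concrete illustration of the optical flow concept only (this is not the claimed motion estimation network), dense per-pixel flow between two consecutive frames can be computed with OpenCV's Farneback algorithm:

```python
import cv2
import numpy as np

# Two consecutive grayscale frames; synthetic here, real video frames in practice.
prev_frame = np.random.randint(0, 255, (512, 512), dtype=np.uint8)
curr_frame = np.roll(prev_frame, 3, axis=1)  # simulate a small horizontal motion

# Dense optical flow: one (dx, dy) motion vector per pixel, shape (512, 512, 2).
flow = cv2.calcOpticalFlowFarneback(prev_frame, curr_frame, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)
print(flow.shape, flow[..., 0].mean())  # mean dx roughly matches the shift
```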
In step S120, the video image to be processed and its corresponding motion estimation information are input into an encoder for encoding, and a target image is obtained.
In this embodiment, inputting the video image to be processed into the multi-scale optical flow motion estimation network for motion estimation yields motion estimation information that reflects the motion estimation information of the video image at different scales, facilitating subsequent processing of the video image; inputting the video image to be processed and its corresponding motion estimation information into the encoder for encoding allows the video image to be encoded separately on the basis of motion estimation information at different scales, which reduces computational complexity, improves image processing efficiency, and shortens processing time, so that the obtained target image can accurately reflect the motion trajectories of objects in the video image and meet users' needs for video images.
In some specific implementations, before the video image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation and its motion estimation information is obtained (i.e., step S110), the method further includes: obtaining sample optical flow data and sample video image data; pre-training an optical flow motion estimation network on the basis of an endpoint error function and the sample optical flow data to obtain a network to be processed; and inputting sample video images into the network to be processed for fine-tuning training to obtain the multi-scale optical flow motion estimation network.
A sample video image includes multiple layers of sample images, with a different image resolution for each layer. Optical flow data uses the temporal changes of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, thereby computing the motion information of objects between adjacent frames. Generally, optical flow data is produced by the movement of foreground objects in the scene, the movement of the camera, or the joint movement of the two. The sample optical flow data may be sample data obtained by manually annotating optical flow.
The sample optical flow data is input into the optical flow motion estimation network for training, and an endpoint error function (End Point Error Loss) is used as the loss function to decide whether to end the pre-training of the optical flow motion estimation network. For example, the endpoint error function computes the two-dimensional Euclidean distance between the predicted optical flow of each pixel in the sample optical flow data and the pre-annotated optical flow, and whether that distance lies within a preset distance threshold determines whether the pre-training of the optical flow motion estimation network ends.
When the two-dimensional Euclidean distance is determined to be within the preset distance threshold (i.e., the endpoint error function has stably converged), the network to be processed is obtained; otherwise, the pre-training process continues.
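A minimal sketch of the endpoint error loss described above, assuming the predicted and annotated flows are PyTorch tensors of shape (N, 2, H, W):

```python
import torch

def endpoint_error(pred_flow: torch.Tensor, gt_flow: torch.Tensor) -> torch.Tensor:
    """Mean two-dimensional Euclidean distance between predicted and
    pre-annotated flow vectors, taken per pixel over the two flow channels."""
    return torch.norm(pred_flow - gt_flow, p=2, dim=1).mean()

# Pre-training stops once this quantity has stably converged, e.g. when it
# stays within the preset distance threshold across validation batches.
```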
After the network to be processed is obtained, the sample video images also need to be input into it for fine-tuning training, so that the fine-tuned network can meet the processing needs of images at different scales.
In some specific implementations, inputting the video image to be processed and its corresponding motion estimation information into the encoder for encoding to obtain the target image (i.e., step S120) can be implemented as follows: the video image to be processed and its corresponding motion estimation information are input into the encoder for encoding to obtain an encoded image; when the encoded image is determined to satisfy a preset image quality evaluation index, the target image is obtained.
The preset image quality evaluation index may include at least one of: peak signal-to-noise ratio (Peak Signal-to-Noise Ratio, PSNR), structural similarity (Structural Similarity, SSIM), and encoding speed.
Evaluating the encoded image with multiple different evaluation indexes allows it to be assessed reasonably, so that a target image meeting user needs is obtained and the user experience improves.
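A sketch of the evaluation step is given below; PSNR is computed directly and SSIM via scikit-image, while the thresholds are illustrative assumptions, since this application does not fix concrete values:

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def meets_quality_index(ref: np.ndarray, decoded: np.ndarray,
                        psnr_min: float = 35.0, ssim_min: float = 0.95) -> bool:
    """Check a decoded grayscale frame against preset PSNR/SSIM thresholds."""
    ssim = structural_similarity(ref, decoded, data_range=255)
    return psnr(ref, decoded) >= psnr_min and ssim >= ssim_min
```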
FIG. 2 shows a block diagram of a video image processing system according to this application. As shown in FIG. 2, the video image sequence 201 is input into the motion estimation network 202 for training, and motion estimation information corresponding to the video image sequence 201 is obtained; the motion estimation information and the video image sequence 201 are then input together into the video encoder 203 for encoding, yielding the compressed video image sequence 204 output by the video encoder 203.
The motion estimation network 202 is a multi-scale optical flow motion estimation network capable of representing optical flow and optical flow guidance information at different scales. The motion estimation network 202 may include at least one or several of: a convolution (Convolution) module, a deconvolution (Deconvolution) module, a rectified linear unit (Linear Rectification Function, ReLU) processing module, a neural network activation function (e.g., Sigmoid) processing module, a fully connected (Full-Connection) layer, and a reshape (Reshape) processing module.
In some specific implementations, inputting the video image to be processed into the multi-scale optical flow motion estimation network for motion estimation and obtaining its motion estimation information (i.e., step S110) can be implemented as follows: the video image to be processed is layered according to image resolution to obtain a pyramid image corresponding to the video image to be processed; each layer of the image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation, and the motion estimation parameters and residual value corresponding to each layer are obtained; and the motion estimation information of the video image to be processed is determined according to the residual value corresponding to each layer and a preset residual threshold.
The pyramid image includes multiple layers of images to be processed, with a different image resolution for each layer.
It should be noted that different image resolutions correspond to different levels of the pyramid image; the images to be processed at the different levels are input into the multi-scale optical flow motion estimation network for motion estimation, and the obtained motion estimation parameters correspond to the different image resolutions, i.e., each layer of the pyramid image corresponds to one set of motion estimation parameters and one residual value.
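The layering step can be sketched as follows; this application does not specify the downsampling filter, so the Gaussian pyrDown used here is an assumption:

```python
import cv2

def build_pyramid(frame, n_layers: int = 3) -> list:
    """Layer a frame by resolution: index 0 is the full-resolution image and
    each following layer halves the width and height (e.g. 512 -> 256 -> 128)."""
    pyramid = [frame]
    for _ in range(n_layers - 1):
        pyramid.append(cv2.pyrDown(pyramid[-1]))  # Gaussian blur + 2x downsample
    return pyramid
```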
Further, the residual value corresponding to a layer's image to be processed is compared with the preset residual threshold to determine whether to end motion estimation for that layer. For example, when the residual value corresponding to a layer's image exceeds the preset residual threshold, motion estimation for that layer can be ended and that layer's motion estimation information obtained; otherwise, motion estimation on that layer's image needs to continue.
After motion estimation has been completed for every layer of the pyramid image, a comprehensive analysis can be performed on the basis of the per-layer motion estimation information to determine the motion estimation information of the video image to be processed. For example, the motion estimation information of the layers is merged to obtain the motion estimation information of the video image to be processed, which includes motion estimation information at multiple different image resolutions.
In some specific implementations, inputting each layer of the image to be processed into the multi-scale optical flow motion estimation network for motion estimation and obtaining the motion estimation parameters and residual values of each layer can be implemented as follows.
The following processing is performed on each layer of the image to be processed: the image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation, and the pixel-based optical flow data corresponding to the image is obtained; according to the number of macroblocks in the image to be processed and the motion vectors of the multiple pixels within each macroblock, the mean motion vector of the multiple pixels within each macroblock of the image is determined; and according to the motion vector corresponding to the macroblock and the mean motion vector of the multiple pixels within it, the residual value corresponding to the macroblock is determined. In one embodiment, the motion vector corresponding to the macroblock is the motion vector of the macroblock in the current frame image.
Each layer of the image to be processed includes multiple macroblocks, each macroblock includes multiple pixels, and the motion estimation parameters include motion vector means. The pixel-based optical flow data includes the motion vectors of the multiple pixels within each macroblock of the image to be processed.
By inputting the image to be processed into the multi-scale optical flow motion estimation network for motion estimation, pixel-level optical flow data (for example, the motion vectors of the multiple pixels within each macroblock of the image) can be obtained, improving the processing accuracy for the image to be processed. In addition, the motion vectors of the multiple pixels within each macroblock can be summed and averaged to obtain the mean motion vector of the multiple pixels within each macroblock of the image, reflecting the directionality of motion inside each macroblock. Further, comparing the motion vector corresponding to a macroblock with the computed mean motion vector of the pixels within it makes it possible to determine whether the macroblock's motion vector agrees with the mean motion vector of its pixels and to determine the residual value corresponding to the macroblock, facilitating subsequent processing of the macroblock.
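The per-macroblock statistics described above can be sketched as follows; the block-level motion vectors of the current frame are taken as a given input here, since this application does not specify how they are produced:

```python
import numpy as np

def macroblock_stats(flow: np.ndarray, block_mv: np.ndarray, b: int = 16):
    """flow: (H, W, 2) pixel-level motion vectors from the optical flow network.
    block_mv: (H//b, W//b, 2) motion vectors of the macroblocks (assumed given).
    Returns the per-block mean motion vector and the per-block residual."""
    h, w, _ = flow.shape
    gh, gw = h // b, w // b
    # Mean motion vector of the pixels inside each b x b macroblock.
    mean_mv = flow[:gh * b, :gw * b].reshape(gh, b, gw, b, 2).mean(axis=(1, 3))
    # Residual: how far the block's own vector is from its pixels' mean vector.
    residual = np.linalg.norm(block_mv - mean_mv, axis=-1)
    return mean_mv, residual
```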
The video image to be processed includes at least: a continuous first frame sample image and second frame sample image. For example, FIG. 3 shows a schematic structural diagram of a multi-scale optical flow motion estimation network determined on the basis of pyramid images according to this application. As shown in FIG. 3, F1 and F2 denote two consecutive sample frames of the video image (for example, images with an image resolution of 512×512), where F2 is the current frame image and F1 is the frame preceding F2.
An n-layer pyramid structure is constructed for F1 and F2 respectively (for example, n equals 3 as shown in FIG. 3), yielding a first pyramid image corresponding to F1 and a second pyramid image corresponding to F2; the first pyramid and the second pyramid have the same number of layers, that is, the first pyramid image and the second pyramid image each include n layers of images to be processed.
Then, the k-th layer image to be processed is extracted from the first pyramid image and from the second pyramid image respectively; the k-th layer image of the first pyramid image is input into the multi-scale optical flow motion estimation network for motion estimation, and the motion estimation information corresponding to the k-th layer image of the second pyramid image is obtained; according to that motion estimation information and the k-th layer image of the second pyramid image, the pixel-based optical flow data corresponding to the k-th layer image is determined; meanwhile, the residual value corresponding to the k-th layer image is computed. k is an integer greater than or equal to 1 and less than or equal to the number of layers of the first pyramid image (i.e., n), that is, 1 ≤ k ≤ n.
If the residual value corresponding to the k-th layer image to be processed is within the range of the preset residual threshold (for example, the residual value is greater than the preset residual threshold), the motion estimation information of the video image to be processed is determined according to the motion estimation parameters of the (k-1)-th layer image corresponding to the relevant macroblock regions, where the motion estimation parameters of the (k-1)-th layer image may include: the mean motion vector of the multiple pixels within each macroblock of each layer of the image to be processed.
For example, if the k-th layer is an image with a resolution of 128×128, the mean motion vector of the multiple pixels within each macroblock of the k-th layer image to be processed (for example, a block with a resolution of 32×32 or 64×64) can be obtained.
It should be noted that, when each layer of the image to be processed is handled, the above steps need to be repeated until the residual values corresponding to the images to be processed at all levels of the pyramid image meet the requirements of the preset residual threshold; the final result includes the mean motion vectors of multiple pixels within macroblocks of different sizes, reflecting the motion estimation parameters corresponding to each layer of the image to be processed, for convenient subsequent use.
In some specific implementations, determining the motion estimation information of the video image to be processed according to the residual value of each layer and the preset residual threshold includes: when the residual value corresponding to the p-th layer image to be processed is determined to be within the range of the preset residual threshold, determining the motion estimation information of the video image to be processed according to the motion estimation parameters corresponding to the (p-1)-th layer image to be processed.
The motion estimation parameters corresponding to the (p-1)-th layer image to be processed include: the mean motion vector of the multiple pixels within each macroblock of each layer of the image to be processed; p is an integer greater than 1 and less than or equal to the number of layers of the pyramid image (i.e., n), that is, 1 < p ≤ n, and the first-layer image to be processed is the image to be processed of the initial layer.
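The layer-selection rule can be sketched as below, assuming the per-layer parameters and residuals have already been computed; reducing a layer's per-block residuals to a single scalar before the comparison is an assumption made for brevity:

```python
def select_motion_params(layer_params: list, layer_residuals: list,
                         threshold: float):
    """layer_params[p] / layer_residuals[p]: motion-vector means and a scalar
    residual for pyramid layer p, with p = 0 the initial (first) layer.
    Once layer p's residual crosses the preset threshold, the parameters of
    layer p-1 are used; otherwise the deepest layer's parameters are kept."""
    for p in range(1, len(layer_params)):
        if layer_residuals[p] > threshold:
            return layer_params[p - 1]
    return layer_params[-1]
```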
FIG. 4 shows a schematic diagram of the training process of the multi-scale optical flow motion estimation network according to this application. As shown in FIG. 4, the training network of the multi-scale optical flow motion estimation network includes but is not limited to the following modules: a previous-feature module 401, a deformable convolutional network (Deformable Convolutional Networks, DCN) 402, a warped-feature module 403, optical flow data 404, an LR feature module 405, a residual offsets calculation module 406, a DCN offsets module 407, a DCN masks module 408, and a first convolution kernel (Conv C1) 411 through an m-th convolution kernel (Conv Cm) 41m, where m represents the number of convolution kernels and m is an integer greater than or equal to 1.
It should be noted that the diversity of the offsets in a DCN enables deformable alignment to outperform alignment based on optical flow data. However, deformable alignment is difficult to train: during network training, owing to the instability of the training results, the offsets easily diverge, which in turn degrades the performance of the network model. To fully exploit the diversity of the offsets while overcoming the instability of network training, and based on the strong correlation between deformable alignment and optical flow alignment, the deformation data is guided using the optical flow data 404 to obtain good output data.
For example, constraining the offsets to lie within the range of the preset optical flow data limits offset overflow and greatly increases the stability of network training.
In the i-th temporal compensation step, if the deformed feature output by the LR feature module 405 is g_i, the feature output by the previous-feature module 401 is f_{i-1}, and the optical flow data 404 is correspondingly denoted s_{i→i-1} (i.e., the optical flow from position i to position i-1), then the transformed result f̄_{i-1} can be obtained by applying an affine transformation (warping) to f_{i-1}; the affine transformation process can be expressed by the following formula: f̄_{i-1} = W(f_{i-1}, s_{i→i-1}), where W(·,·) denotes the spatial warping operation.
Further, the offset o_{i→i-1} is output through the DCN offsets module 407, and the modulation mask m_{i→i-1} is applied through the DCN masks module 408, thereby completing the pre-alignment of the features; i is an integer greater than 1.
In the above process, by computing the optical flow residual rather than computing the offsets directly, the learning of the offsets can use the optical flow data 404 to assist the alignment of features; and since the convolutional neural network only learns the residual, the burden on a conventional deformable alignment module can be reduced. The DCN masks module 408 can also act as an attention mechanism, giving the trained network greater flexibility.
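The flow-guided pre-alignment can be sketched with torchvision's deformable convolution; the layer sizes, feature concatenation, and (dy, dx) handling below are illustrative assumptions rather than the exact design of modules 402 and 406-408:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class FlowGuidedAlignment(nn.Module):
    """Sketch: DCN offsets are learned as residuals around the optical flow,
    and sigmoid masks modulate the sampled features (attention-like role)."""

    def __init__(self, channels: int = 64, kernel: int = 3):
        super().__init__()
        self.k = kernel
        in_ch = 2 * channels + 2  # previous feature + current feature + 2-ch flow
        self.offset_conv = nn.Conv2d(in_ch, 2 * kernel * kernel, 3, padding=1)
        self.mask_conv = nn.Conv2d(in_ch, kernel * kernel, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(channels, channels, kernel, kernel) * 1e-2)

    def forward(self, feat_prev, feat_cur, flow):
        # flow: (N, 2, H, W) in (dx, dy) order; deform_conv2d expects (dy, dx).
        x = torch.cat([feat_prev, feat_cur, flow], dim=1)
        residual = self.offset_conv(x)                 # optical flow residual
        guide = flow[:, [1, 0]].repeat(1, self.k * self.k, 1, 1)
        offsets = guide + residual                     # offsets stay near the flow
        masks = torch.sigmoid(self.mask_conv(x))       # modulation masks
        return deform_conv2d(feat_prev, offsets, self.weight,
                             padding=self.k // 2, mask=masks)
```

Because only the residual is learned, the offsets remain anchored to the optical flow guidance, which is what keeps them from diverging during training.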
FIG. 5 shows another schematic flowchart of a video image processing method according to this application. As shown in FIG. 5, the video image processing method in this embodiment of this application includes but is not limited to the following steps S501 to S506.
In step S501, sample optical flow data and sample video image data are obtained.
The sample video image data includes multiple layers of sample images, with a different image resolution for each layer. For example, multiple sample video images are divided according to a preset number to obtain training set data and test set data.
The sample optical flow data may include: data obtained by downloading public datasets from the Internet and/or by manually annotating optical flow.
In step S502, the optical flow motion estimation network is pre-trained on the basis of the endpoint error function and the sample optical flow data, and the network to be processed is obtained.
The optical flow motion estimation network may include a convolution-based neural network (e.g., an optical flow network such as FlowNet). The above is merely an example of an optical flow motion estimation network, which can be configured according to actual needs; other optical flow motion estimation networks not described here also fall within the protection scope of this application and are not repeated here.
In some specific implementations, the endpoint error function can be used to compute the two-dimensional Euclidean distance between the predicted optical flow of each pixel in the sample optical flow data and the pre-annotated optical flow, and this distance is used to determine whether the pre-training of the optical flow motion estimation network is complete.
When the two-dimensional Euclidean distance is determined to be within the preset distance threshold (i.e., the endpoint error function has stably converged), the network to be processed is obtained; otherwise, the pre-training process continues.
In step S503, the sample video images are input into the network to be processed for fine-tuning training, and the multi-scale optical flow motion estimation network is obtained.
Fine-tuning (fine-tune) training inputs the sample video images into the network to be processed for training. For example, a mean square error (Mean Square Error, MSE) loss function and/or an L1 loss function is used to fine-tune the training results; when the MSE loss function and/or the L1 loss function converges stably, the multi-scale optical flow motion estimation network is obtained; otherwise, the fine-tuning training process continues.
It should be noted that the L1 loss function is used to minimize the error, which is the sum of all absolute differences between the true values and the predicted values.
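A minimal fine-tuning loop with the MSE and L1 losses mentioned above; `model` and the data loader are placeholders, not this application's actual network or dataset:

```python
import torch
import torch.nn as nn

def fine_tune(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    l1, mse = nn.L1Loss(), nn.MSELoss()
    for _ in range(epochs):
        for frames, target_flow in loader:   # multi-layer sample video images
            pred_flow = model(frames)
            loss = l1(pred_flow, target_flow) + mse(pred_flow, target_flow)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # In practice training stops once the L1/MSE losses have stably converged.
```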
In step S504, the video image to be processed is input into the multi-scale optical flow motion estimation network for motion estimation, and the motion estimation information of the video image to be processed is obtained.
In step S505, the multi-scale optical flow motion estimation network is connected to the encoder, so that the video image to be processed and its corresponding motion estimation information are both input into the encoder for encoding, and an encoded image is obtained.
In step S506, when the encoded image is determined to satisfy the preset image quality evaluation index, the target image is obtained.
It should be noted that, if the encoded image cannot satisfy the preset image quality evaluation index, the multi-scale optical flow motion estimation network needs to be trained further until the encoded image satisfies the preset image quality evaluation index.
The preset image quality evaluation index includes any one or several of: image quality indexes (e.g., PSNR and SSIM), compression performance indexes (e.g., coding compression ratio and encoding speed), and weighted sums of the above indexes.
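The weighted combination of the indexes can be sketched as below; the weights are illustrative assumptions, since this application leaves them unspecified:

```python
def weighted_quality_index(psnr: float, ssim: float,
                           compression_ratio: float, encode_fps: float,
                           weights=(0.4, 0.3, 0.2, 0.1)) -> float:
    """Weighted sum of image-quality (PSNR, SSIM) and compression-performance
    (compression ratio, encoding speed) indexes; weights are illustrative."""
    w1, w2, w3, w4 = weights
    return w1 * psnr + w2 * ssim + w3 * compression_ratio + w4 * encode_fps
```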
In this embodiment, by using pixel-level inter-frame motion estimation based on an optical flow network, the accuracy of the prediction unit's motion estimation and motion compensation is effectively improved, making the obtained motion estimation information of the video image to be processed more accurate. In addition, the multi-scale optical flow network prediction approach can reduce computational complexity, thereby improving image processing efficiency, making the inter-frame prediction process more accurate, and improving encoding and decoding efficiency.
The video image processing apparatus according to embodiments of this application is described in detail below with reference to the accompanying drawings. FIG. 6 shows a block diagram of a video image processing apparatus according to this application. As shown in FIG. 6, the video image processing apparatus 600 includes but is not limited to the following modules: a motion estimation module 601 and an encoding module 602.
The motion estimation module 601 is configured to input a video image to be processed into a multi-scale optical flow motion estimation network for motion estimation to obtain motion estimation information of the video image to be processed, where the multi-scale optical flow motion estimation network is a network that represents optical flow and optical flow guidance information at different scales.
The encoding module 602 is configured to input the video image to be processed and its corresponding motion estimation information into an encoder for encoding to obtain a target image.
In some specific implementations, the motion estimation module 601 is specifically configured to: layer the video image to be processed according to image resolution to obtain a pyramid image corresponding to the video image to be processed, where the pyramid image includes multiple layers of images to be processed with a different image resolution for each layer; input each layer of the image to be processed into the multi-scale optical flow motion estimation network for motion estimation to obtain the motion estimation parameters and residual values corresponding to each layer; and determine the motion estimation information of the video image to be processed according to the residual value corresponding to each layer and the preset residual threshold.
In some specific implementations, each layer of the image to be processed includes multiple macroblocks, each macroblock includes multiple pixels, and the motion estimation parameters include motion vector means. Inputting each layer of the image to be processed into the multi-scale optical flow motion estimation network for motion estimation and obtaining the motion estimation parameters and residual values of each layer includes performing the following processing on each layer of the image to be processed: inputting the image to be processed into the multi-scale optical flow motion estimation network for motion estimation to obtain pixel-based optical flow data corresponding to the image, where the pixel-based optical flow data includes the motion vectors of the multiple pixels within each macroblock of the image; determining, according to the number of macroblocks in the image and the motion vectors of the multiple pixels within each macroblock, the mean motion vector of the multiple pixels within each macroblock of the image; and determining, according to the motion vector corresponding to the macroblock and the mean motion vector of the multiple pixels within each macroblock, the residual value corresponding to the macroblock. In one embodiment, the motion vector corresponding to the macroblock is the motion vector of the macroblock in the current frame image.
In some specific implementations, the video image to be processed includes at least: a continuous first frame sample image and second frame sample image, where the first frame sample image corresponds to a first pyramid image, the second frame sample image corresponds to a second pyramid image, and the first pyramid image and the second pyramid image have the same number of layers. Inputting the image to be processed into the multi-scale optical flow motion estimation network for motion estimation and obtaining the pixel-based optical flow data corresponding to the image includes: respectively extracting the k-th layer image to be processed from the first pyramid image and the k-th layer image to be processed from the second pyramid image; inputting the k-th layer image of the first pyramid image into the multi-scale optical flow motion estimation network for motion estimation to obtain the motion estimation information corresponding to the k-th layer image of the second pyramid image; and determining, according to the motion estimation information corresponding to the k-th layer image of the second pyramid image and the k-th layer image of the second pyramid image, the pixel-based optical flow data corresponding to the k-th layer image. k is an integer greater than or equal to 1 and less than or equal to the number of layers of the first pyramid image (i.e., n), that is, 1 ≤ k ≤ n.
In some specific implementations, determining the motion estimation information of the video image to be processed according to the residual value of each layer and the preset residual threshold includes: when the residual value corresponding to the p-th layer image to be processed is determined to be within the range of the preset residual threshold, determining the motion estimation information of the video image to be processed according to the motion estimation parameters corresponding to the (p-1)-th layer image to be processed.
The motion estimation parameters corresponding to the (p-1)-th layer image to be processed include: the mean motion vector of the multiple pixels within each macroblock of each layer of the image to be processed; p is an integer greater than 1 and less than or equal to the number of layers of the pyramid image (i.e., n), that is, 1 < p ≤ n, and the first-layer image to be processed is the image to be processed of the initial layer.
In some specific implementations, the video image processing apparatus further includes: an acquisition module, configured to obtain sample optical flow data and sample video image data; a pre-training module, configured to pre-train the optical flow motion estimation network on the basis of the endpoint error function and the sample optical flow data to obtain the network to be processed; and a fine-tuning training module, configured to input sample video images into the network to be processed for fine-tuning training to obtain the multi-scale optical flow motion estimation network.
A sample video image includes multiple layers of sample images, with a different image resolution for each layer.
In some specific implementations, the encoding module 602 is specifically configured to: input the video image to be processed and its corresponding motion estimation information into the encoder for encoding to obtain an encoded image; and, when the encoded image is determined to satisfy the preset image quality evaluation index, obtain the target image.
In some specific implementations, the preset image quality evaluation index includes at least one of: peak signal-to-noise ratio, structural similarity, and encoding speed.
It should be noted that the video image processing apparatus 600 in this embodiment can implement any of the video image processing methods in the embodiments of this application.
According to the device of this embodiment of this application, the motion estimation module inputs the video image to be processed into the multi-scale optical flow motion estimation network for motion estimation and obtains the motion estimation information of the video image, so that this information reflects the motion estimation information of the video image at different scales, facilitating subsequent processing of the video image to be processed; the encoding module inputs the video image to be processed and its corresponding motion estimation information into the encoder for encoding, so that the video image can be encoded separately on the basis of motion estimation information at different scales, reducing computational complexity, improving image processing efficiency, and shortening processing time, so that the obtained target image accurately reflects the motion trajectories of objects in the video image to be processed and meets users' needs for video images.
It should be clear that this application is not limited to the specific configurations and processing described in the above embodiments and shown in the figures. For convenience and brevity of description, detailed descriptions of known methods are omitted here, and for the specific working processes of the systems, modules, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
FIG. 7 shows a structural diagram of an exemplary hardware architecture of a computing device implementing the video image processing method and apparatus according to this application.
As shown in FIG. 7, the computing device 700 includes an input device 701, an input interface 702, a central processing unit 703, a memory 704, an output interface 705, and an output device 706. The input interface 702, the central processing unit 703, the memory 704, and the output interface 705 are connected to one another through a bus 707; the input device 701 and the output device 706 are connected to the bus 707 through the input interface 702 and the output interface 705 respectively, and are thereby connected to the other components of the computing device 700.
Specifically, the input device 701 receives input information from outside and transmits it to the central processing unit 703 through the input interface 702; the central processing unit 703 processes the input information on the basis of computer-executable instructions stored in the memory 704 to generate output information, stores the output information temporarily or permanently in the memory 704, and then transmits the output information to the output device 706 through the output interface 705; the output device 706 outputs the output information outside the computing device 700 for use by the user.
In one embodiment, the computing device shown in FIG. 7 may be implemented as an electronic device, which may include: a memory configured to store a program; and a processor configured to run the program stored in the memory to perform the video image processing method described in the above embodiments.
In one embodiment, the computing device shown in FIG. 7 may be implemented as a video image processing system, which may include: a memory configured to store a program; and a processor configured to run the program stored in the memory to perform the video image processing method described in the above embodiments.
The above are merely exemplary embodiments of this application and are not intended to limit its protection scope. In general, the various embodiments of this application may be implemented in hardware or dedicated circuits, software, logic, or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software executable by a controller, microprocessor, or other computing device, although this application is not limited thereto.
Embodiments of this application may be implemented by a data processor of a mobile device executing computer program instructions, for example in a processor entity, or by hardware, or by a combination of software and hardware. The computer program instructions may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages.
Any block diagram of a logic flow in the figures of this application may represent program steps, or may represent interconnected logic circuits, modules, and functions, or may represent a combination of program steps and logic circuits, modules, and functions. A computer program may be stored on a memory. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as, but not limited to, read-only memory (ROM), random access memory (RAM), and optical storage devices and systems (digital versatile discs (DVDs) or CDs). Computer-readable media may include non-transitory storage media. The data processor may be of any type suitable to the local technical environment, such as, but not limited to, general-purpose computers, special-purpose computers, microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), programmable logic devices (FPGAs), and processors based on multi-core processor architectures.
A detailed description of exemplary embodiments of this application has been provided above by way of exemplary and non-limiting examples. Considered together with the drawings and the claims, various modifications of and adjustments to the above embodiments will be apparent to those skilled in the art without departing from the scope of this application. Accordingly, the proper scope of this application is to be determined according to the claims.

Claims (12)

  1. A video image processing method, comprising:
    inputting a video image to be processed into a multi-scale optical flow motion estimation network for motion estimation to obtain motion estimation information of the video image to be processed, wherein the multi-scale optical flow motion estimation network is a network that represents optical flow and optical flow guidance information at different scales; and
    inputting the video image to be processed and its corresponding motion estimation information into an encoder for encoding to obtain a target image.
  2. The method according to claim 1, wherein inputting the video image to be processed into the multi-scale optical flow motion estimation network for motion estimation to obtain the motion estimation information of the video image to be processed comprises:
    layering the video image to be processed according to image resolution to obtain a pyramid image corresponding to the video image to be processed, wherein the pyramid image comprises multiple layers of images to be processed, each layer of the image to be processed having a different image resolution;
    inputting each layer of the image to be processed into the multi-scale optical flow motion estimation network for motion estimation to obtain motion estimation parameters and a residual value corresponding to each layer of the image to be processed; and
    determining the motion estimation information of the video image to be processed according to the residual value corresponding to each layer of the image to be processed and a preset residual threshold.
  3. The method according to claim 2, wherein each layer of the image to be processed comprises multiple macroblocks, each macroblock comprises multiple pixels, and the motion estimation parameters comprise motion vector means, and
    inputting each layer of the image to be processed into the multi-scale optical flow motion estimation network for motion estimation to obtain the motion estimation parameters and residual value corresponding to each layer of the image to be processed comprises:
    performing the following processing on each layer of the image to be processed:
    inputting the image to be processed into the multi-scale optical flow motion estimation network for motion estimation to obtain pixel-based optical flow data corresponding to the image to be processed, wherein the pixel-based optical flow data comprises motion vectors of the multiple pixels within each macroblock of the image to be processed;
    determining, according to the number of macroblocks in the image to be processed and the motion vectors of the multiple pixels within each macroblock, the mean motion vector of the multiple pixels within each macroblock of the image to be processed; and
    determining the residual value corresponding to the macroblock according to the motion vector corresponding to the macroblock and the mean motion vector of the multiple pixels within each macroblock.
  4. The method according to claim 3, wherein the motion vector corresponding to the macroblock is the motion vector of the macroblock in the current frame image.
  5. The method according to claim 3, wherein the video image to be processed at least comprises: a continuous first frame sample image and second frame sample image, wherein the first frame sample image corresponds to a first pyramid image, the second frame sample image corresponds to a second pyramid image, and the first pyramid image and the second pyramid image have the same number of layers, and
    inputting the image to be processed into the multi-scale optical flow motion estimation network for motion estimation to obtain the pixel-based optical flow data corresponding to the image to be processed comprises:
    respectively extracting a k-th layer image to be processed from the first pyramid image and a k-th layer image to be processed from the second pyramid image;
    inputting the k-th layer image to be processed of the first pyramid image into the multi-scale optical flow motion estimation network for motion estimation to obtain motion estimation information corresponding to the k-th layer image to be processed of the second pyramid image; and
    determining, according to the motion estimation information corresponding to the k-th layer image to be processed of the second pyramid image and the k-th layer image to be processed of the second pyramid image, the pixel-based optical flow data corresponding to the k-th layer image to be processed;
    wherein k is an integer greater than or equal to 1 and less than or equal to the number of layers of the first pyramid image.
  6. The method according to claim 3, wherein determining the motion estimation information of the video image to be processed according to the residual value corresponding to each layer of the image to be processed and the preset residual threshold comprises:
    when the residual value corresponding to a p-th layer image to be processed is determined to be within the range of the preset residual threshold, determining the motion estimation information of the video image to be processed according to motion estimation parameters corresponding to a (p-1)-th layer image to be processed;
    wherein the motion estimation parameters corresponding to the (p-1)-th layer image to be processed comprise: the mean motion vector of the multiple pixels within each macroblock of each layer of the image to be processed; p is an integer greater than 1 and less than or equal to the number of layers of the pyramid image; and the first-layer image to be processed is the image to be processed of the initial layer.
  7. The method according to any one of claims 1 to 6, wherein, before inputting the video image to be processed into the multi-scale optical flow motion estimation network for motion estimation to obtain the motion estimation information of the video image to be processed, the method further comprises:
    obtaining sample optical flow data and sample video image data;
    pre-training an optical flow motion estimation network according to an endpoint error function and the sample optical flow data to obtain a network to be processed; and
    inputting sample video images into the network to be processed for fine-tuning training to obtain the multi-scale optical flow motion estimation network, wherein a sample video image comprises multiple layers of sample images, each layer of the sample images having a different image resolution.
  8. The method according to any one of claims 1 to 6, wherein inputting the video image to be processed and its corresponding motion estimation information into the encoder for encoding to obtain the target image comprises:
    inputting the video image to be processed and its corresponding motion estimation information into the encoder for encoding to obtain an encoded image; and
    when the encoded image is determined to satisfy a preset image quality evaluation index, obtaining the target image.
  9. The method according to claim 8, wherein the preset image quality evaluation index comprises at least one of: peak signal-to-noise ratio, structural similarity, and encoding speed.
  10. A video image processing apparatus, comprising:
    a motion estimation module configured to input a video image to be processed into a multi-scale optical flow motion estimation network for motion estimation to obtain motion estimation information of the video image to be processed, wherein the multi-scale optical flow motion estimation network is a network that represents optical flow and optical flow guidance information at different scales; and
    an encoding module configured to input the video image to be processed and its corresponding motion estimation information into an encoder for encoding to obtain a target image.
  11. An electronic device, comprising:
    one or more processors; and
    a memory on which one or more programs are stored, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the video image processing method according to any one of claims 1 to 9.
  12. A readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, causes the processor to implement the video image processing method according to any one of claims 1 to 9.
PCT/CN2023/101498 2022-06-30 2023-06-20 Video image processing method and apparatus, electronic device, and storage medium WO2024001887A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210770276.2A CN117395423A (zh) 2022-06-30 2022-06-30 Video image processing method and apparatus, electronic device, and storage medium
CN202210770276.2 2022-06-30

Publications (1)

Publication Number Publication Date
WO2024001887A1 true WO2024001887A1 (zh) 2024-01-04

Family

ID=89383253

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101498 WO2024001887A1 (zh) 2022-06-30 2023-06-20 Video image processing method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN117395423A (zh)
WO (1) WO2024001887A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201400875D0 (en) * 2008-10-15 2014-03-05 Spinella Ip Holdings Inc Digital processing method and system for determination of optical flow
US20150365696A1 (en) * 2014-06-13 2015-12-17 Texas Instruments Incorporated Optical flow determination using pyramidal block matching
CN111327926A (zh) * 2020-02-12 2020-06-23 北京百度网讯科技有限公司 Video frame interpolation method and apparatus, electronic device, and storage medium
CN111340844A (zh) * 2020-02-24 2020-06-26 南昌航空大学 Multi-scale feature optical flow learning and computation method based on a self-attention mechanism
CN114066761A (zh) * 2021-11-22 2022-02-18 青岛根尖智能科技有限公司 Motion video frame rate enhancement method and system based on optical flow estimation and foreground detection
CN114677412A (zh) * 2022-03-18 2022-06-28 苏州大学 Optical flow estimation method, apparatus, and device


Also Published As

Publication number Publication date
CN117395423A (zh) 2024-01-12

Similar Documents

Publication Publication Date Title
US10965959B2 (en) Multi-frame quality enhancement for compressed video
CN106331723B (zh) Video frame rate up-conversion method and system based on motion region segmentation
CN110248048B (zh) Video jitter detection method and apparatus
WO2017084258A1 (zh) Real-time video denoising method during encoding, terminal, and non-volatile computer-readable storage medium
JP4504230B2 (ja) Moving image processing apparatus, moving image processing method, and moving image processing program
CN118055253A (zh) Optical flow estimation for motion-compensated prediction in video coding
CN102075760A (zh) Fast motion estimation method and apparatus
WO2021093060A1 (zh) Video encoding method, system, and device
US11190766B2 (en) Method and apparatus for determining division of coding unit, computing device, and readable storage medium
CN108574844B (zh) Multi-strategy video frame rate up-conversion method with spatio-temporal saliency perception
Kaviani et al. Frame rate upconversion using optical flow and patch-based reconstruction
US20200380290A1 (en) Machine learning-based prediction of precise perceptual video quality
JP2013532926A (ja) Method and system for encoding video frames using a plurality of processors
CN108449599B (zh) Video encoding and decoding method based on surface transmission transformation
US20100177974A1 (en) Image processing method and related apparatus
CN107483960A (zh) Motion-compensated frame rate up-conversion method based on spatial prediction
CN111310594A (zh) Video semantic segmentation method based on residual correction
US20210407105A1 (en) Motion estimation method, chip, electronic device, and storage medium
WO2024001887A1 (zh) Video image processing method and apparatus, electronic device, and storage medium
CN113824961A (zh) Inter-frame image coding method and system applicable to the VVC coding standard
US20200265557A1 (en) Motion detection method and image processing device for motion detection
Yılmaz et al. Dfpn: Deformable frame prediction network
CN106303545B (zh) Data processing system and method for performing motion estimation in a sequence of frames
WO2021093059A1 (zh) Region-of-interest identification method, system, and device
CN109788297B (zh) Video frame rate up-conversion method based on cellular automata

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23830066

Country of ref document: EP

Kind code of ref document: A1