WO2023123873A1 - Dense optical flow calculation method employing attention mechanism - Google Patents


Info

Publication number
WO2023123873A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
optical flow
generate
network
sequence
Prior art date
Application number
PCT/CN2022/097531
Other languages
French (fr)
Chinese (zh)
Inventor
张继东
吕超
曹靖城
涂娟娟
Original Assignee
天翼数字生活科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 天翼数字生活科技有限公司 filed Critical 天翼数字生活科技有限公司
Publication of WO2023123873A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/207Analysis of motion for motion estimation over a hierarchy of resolutions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4038Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images

Definitions

  • the invention relates to the field of video applications, and mainly relates to dense optical flow calculation in video applications.
  • optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane.
  • the optical flow method uses the changes of pixels in the image sequence in the time domain and the correlation between adjacent frames to find the corresponding relationship between the previous frame and the current frame, thereby calculating the motion of objects between adjacent frames.
  • the traditional methods of calculating optical flow mainly include gradient-based, frequency-based, phase-based and matching-based methods.
  • Dense optical flow is an image registration method for point-by-point matching of an image or a specified area. It calculates the offset of all points on the image to form a dense optical flow field. Through this dense optical flow field, pixel-level image registration can be performed.
  • the Horn-Schunck algorithm and most optical flow methods based on region matching belong to the category of dense optical flow.
  • FlowNet is the most widely used in practical applications.
  • the patent "Robust interpolation optical flow calculation method for pyramid occlusion detection block matching" (CN112509014A) discloses a method in which pyramid occlusion-detection block matching is first performed to obtain a sparse robust motion field: two consecutive frames are downsampled by a factor into a k-level image pyramid, and block matching on each pyramid layer yields matching results with initial occlusions; occlusion information is then obtained by an occlusion-detection algorithm based on deformation error.
  • the accurate sparse matching results are turned into a dense optical flow by a robust interpolation algorithm; the dense optical flow is then refined by global energy-functional variational optimization to obtain the final optical flow.
  • the patent "An image sequence optical flow estimation method based on a learnable occlusion mask and secondary deformation optimization" (CN112465872A) discloses a method in which any two consecutive frames of the input image sequence are downsampled into a feature pyramid to obtain multi-resolution two-frame features; the correlation between the first-frame and second-frame features is computed at each pyramid level and used to build an occlusion-mask module; the resulting occlusion mask removes edge artifacts from the warped features to improve the optical flow at blurred motion edges; the occlusion-constrained flow drives a secondary deformation optimization module that further refines motion-edge flow estimation at the sub-pixel level; the same occlusion masking and secondary deformation are applied to the warped features at every pyramid level to obtain residual flows that refine the optical flow, and the final optimized estimate is output at the bottom of the pyramid.
  • compared with existing dense optical flow methods, the present invention introduces a multi-head self-attention mechanism into the optical flow prediction task and exploits the Transformer's global self-attention advantage in sequence-to-sequence prediction to improve the optical flow calculation task.
  • the present invention can improve the accuracy of the dense optical flow map at key positions, and at the same time improve the timeliness of dense optical flow calculation by reducing the network depth of Unet's up-sampling and down-sampling.
  • a method for calculating dense optical flow, including: stitching adjacent frames on their channels to generate a stitched vector map; inputting the stitched vector map into a downsampling network for feature extraction to generate a feature vector; mapping the generated feature vector into the high-dimensional embedding space of the latent layer to generate a high-dimensional embedding representation sequence; inputting the sequence into a feature processing network composed of I Transformer layers to generate a hidden feature sequence; reorganizing the hidden feature sequence to generate a reorganized feature vector; and inputting the reorganized feature vector into an upsampling network for processing to generate a dense optical flow map.
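The claimed pipeline can be followed as a shape-only walkthrough. All concrete sizes below are illustrative assumptions (the patent does not fix them), and the factor of 32 merely reflects the five stride-2 convolutions described later in the text:

```python
# Shape-only sketch of the claimed six-step pipeline. Every dimension here
# is a hypothetical example; the patent leaves the exact sizes unspecified.
h, w, d = 256, 320, 512          # frame size and embedding width (assumed)
factor = 2 ** 5                  # 5 of the 7 convolution blocks use stride 2

stitched = (h, w, 6)                     # two RGB frames stitched on channels
feat = (h // factor, w // factor, d)     # after the downsampling network
seq = (feat[0] * feat[1], d)             # flattened embedding representation sequence
hidden = seq                             # I Transformer layers preserve the shape
reorg = feat                             # hidden sequence reorganized into a map
flow = (h, w, 3)                         # upsampled dense optical flow map

for name, shape in [("stitched", stitched), ("sequence", seq), ("flow", flow)]:
    print(name, shape)
```

The point of the sketch is that only the middle of the pipeline is a token sequence; the ends remain image-shaped tensors.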
  • a system for computing dense optical flow including a downsampling module, a feature processing module and an upsampling module.
  • the down-sampling module is configured to: splice adjacent frames on the channel to generate a spliced vector map; input the spliced vector map into the down-sampling network for feature extraction to generate a feature vector.
  • the feature processing module is configured to: map the feature vector generated by the downsampling module into the high-dimensional embedding space of the latent layer to generate a high-dimensional embedding representation sequence; and input the sequence into a feature processing network composed of I Transformer layers to generate a hidden feature sequence.
  • the upsampling module is configured to: reorganize the hidden feature sequence generated by the feature processing module to generate a reorganized feature vector; and input the reorganized feature vector into the upsampling network for processing to generate a dense optical flow map.
  • a computing device for dense optical flow calculation, including: a processor; and a memory storing instructions which, when executed by the processor, perform the method described above.
  • FIG. 1 shows a block diagram of a system 100 for dense optical flow calculation according to an embodiment of the present invention
  • FIG. 2 shows a detailed diagram 200 of each module 101-103 in FIG. 1 according to an embodiment of the present invention
  • FIG. 3 shows a flowchart of a method 300 for calculating dense optical flow according to an embodiment of the present invention.
  • FIG. 4 shows a block diagram 400 of an exemplary computing device according to one embodiment of the invention.
  • Unet is a segmentation model; specifically, it is a fully convolutional network comprising 4 downsampling layers, 4 upsampling layers, and skip connections. The structure is symmetrical, and the feature map at the downsampling end can skip the deeper layers and be spliced directly to the corresponding upsampling end.
  • Transformer is a natural language processing (NLP) model that uses an attention mechanism for machine translation tasks.
  • optical flow plays an important role and has very important applications in target object segmentation, recognition, tracking, robot navigation, and shape information recovery.
  • Optical flow computing can be widely used in various scenarios, for example, motion detection of video codec in cloud storage video compression tasks, high-altitude parabolic, fall detection and other motion recognition and video understanding tasks.
  • dense optical flow calculation is a key module in video coding and decoding technology.
  • the traditional dense optical flow calculation method has a large amount of calculation and poor timeliness.
  • the existing optical flow calculation methods based on deep learning methods have improved timeliness, but the accuracy of dense optical flow maps is low, which will have a negative impact on the quality of video encoding and decoding.
  • the present invention proposes a dense optical flow calculation method based on Unet and Transformer.
  • the method introduces the Transformer module into the Unet structure, and utilizes Transformer's global self-attention advantage in sequence-to-sequence prediction to improve the accuracy of dense optical flow at key positions. At the same time, it can also reduce the network depth of Unet's upsampling and downsampling, and improve the timeliness of dense optical flow calculations.
  • FIG. 1 shows a block diagram of a system 100 for calculating dense optical flow according to an embodiment of the present invention.
  • the system 100 is divided into modules, and communication and data exchange are performed between modules in a manner known in the art.
  • each module can be implemented by software or hardware or a combination thereof.
  • the system 100 may include a downsampling module 101 , a feature processing module 102 and an upsampling module 103 .
  • the downsampling module 101 is configured to stitch two adjacent frames on channels (for example, color channels) to form an input picture, which is input to a convolutional network for downsampling, thereby obtaining a feature map.
  • the feature processing module 102 is configured to encode the feature maps output by the down-sampling module 101 into an input sequence and perform global context feature processing on it.
  • the upsampling module 103 is configured as a cascaded upsampler, which upsamples the feature map after feature processing to reconstruct an optical flow map with the same size as the input picture.
  • FIG. 2 shows a detailed diagram 200 of each of the modules 101-103 in FIG. 1 according to one embodiment of the present invention.
  • the downsampling module 101 receives two adjacent frames 201 and first splices them to obtain a vector map of size h × w × 6, which is then input into a downsampling network composed of 7 convolutional blocks; each convolutional block consists of a convolutional layer and a ReLU activation function, and 5 of the convolutional layers have a stride of 2.
  • the down-sampling module 101 outputs a feature map for the feature processing module 102 to process.
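With 5 of the 7 convolution blocks using stride 2, the spatial resolution shrinks by a factor of 2^5 = 32. A rough size calculation, under the assumption of "same" padding (the patent elides the exact feature-map dimensions):

```python
def downsampled_size(h, w, stride2_layers=5):
    """Spatial size after the stride-2 convolution blocks (padding assumed 'same')."""
    factor = 2 ** stride2_layers
    return h // factor, w // factor

# Hypothetical 256 x 320 input frames:
print(downsampled_size(256, 320))  # (8, 10)
```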
  • the feature processing module 102 includes using a trainable linear map E to map the feature map sequence output by the downsampling module 101 into the high-dimensional embedding space of the latent layer, and the calculation method is shown in formula (1) :
  • the high-dimensional embedding representation sequence is then fed into a feature processing network consisting of I Transformer layers.
  • the specific structure of the Transformer layer is shown in Figure 3.
  • the Transformer layer is composed of a Multihead Self-Attention (MSA) and a Multi-Layer Perceptron (MLP).
  • the output of the i-th layer is shown in formulas (2) and (3):
  • the feature processing module 102 finally outputs the hidden feature sequence z I .
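Formulas (2) and (3) are not reproduced in this text, but a Transformer layer of the kind described (MSA followed by an MLP, each with layer normalization and a residual connection, in the standard pre-norm arrangement) can be sketched in NumPy. This is an illustrative reconstruction with randomly initialized weights, not the patent's exact formulation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multihead_self_attention(x, heads, rng):
    """Multihead Self-Attention (MSA) over a sequence x of shape (n, d)."""
    n, d = x.shape
    dh = d // heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(x)
    for h in range(heads):                      # attend within each head's slice
        s = slice(h * dh, (h + 1) * dh)
        att = softmax(q[:, s] @ k[:, s].T / np.sqrt(dh))
        out[:, s] = att @ v[:, s]
    return out @ Wo

def mlp(x, rng):
    """Two-layer Multi-Layer Perceptron (MLP) with ReLU."""
    n, d = x.shape
    W1 = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
    W2 = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)
    return np.maximum(x @ W1, 0.0) @ W2

def transformer_layer(z, heads=4, seed=0):
    rng = np.random.default_rng(seed)
    z = z + multihead_self_attention(layer_norm(z), heads, rng)  # cf. formula (2)
    z = z + mlp(layer_norm(z), rng)                              # cf. formula (3)
    return z

z = np.random.default_rng(1).standard_normal((16, 32))  # 16 tokens, width 32
print(transformer_layer(z).shape)  # (16, 32)
```

Stacking I such layers leaves the sequence shape unchanged, which is why the module can simply hand the hidden feature sequence z_I back for spatial reorganization.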
  • the upsampling module 103 is a cascaded upsampling network, which includes multiple upsampling steps to decode and output the final optical flow picture 202 .
  • the up-sampling module 103 reorganizes the hidden feature sequence z_I output by the feature processing module 102 into a feature vector, which is then input into an upsampling network consisting of 7 deconvolution blocks; each deconvolution block consists of a deconvolution layer and a ReLU activation function, and 5 of the deconvolution layers have a stride of 2.
  • an optical flow map output of size h ⁇ w ⁇ 3 is obtained.
  • the present invention adds three skip connections between the downsampled feature vectors to achieve feature aggregation at different resolution levels (203, 204, 205), thereby refining the details of the optical flow.
  • FIG. 3 shows a flowchart of a method 300 for dense optical flow calculation according to an embodiment of the present invention.
  • in step 301, adjacent frames are spliced on channels to generate a spliced vector map.
  • the channel is a color channel, such as an RGB channel.
  • the size of the vector map is h ⁇ w ⁇ 6.
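The channel-wise splicing of step 301 amounts to concatenating the two frames along the channel axis. A minimal NumPy illustration with a hypothetical frame size:

```python
import numpy as np

h, w = 128, 192                       # hypothetical frame size
frame1 = np.zeros((h, w, 3))          # previous frame, RGB
frame2 = np.ones((h, w, 3))           # current frame, RGB

# Stitch on the channel axis: two h x w x 3 frames become one h x w x 6 map.
stitched = np.concatenate([frame1, frame2], axis=-1)
print(stitched.shape)  # (128, 192, 6)
```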
  • in step 302, the spliced vector map is input into the downsampling network for feature extraction to generate feature vectors.
  • the downsampling network is composed of 7 convolutional blocks, and each convolutional block is composed of a convolutional layer and a ReLU activation function, and the stride of 5 convolutional layers is 2.
  • the size of the feature vector is
  • in step 303, the feature vector generated in step 302 is mapped to the high-dimensional embedding space of the latent layer to generate a high-dimensional embedding representation sequence.
  • a trainable linear map E can be used to map the feature vector obtained in step 302 into the high-dimensional embedding space of the latent layer.
  • in step 304, the high-dimensional embedding representation sequence is input into a feature processing network composed of I Transformer layers to generate a hidden feature sequence.
  • the Transformer layer consists of MSA and MLP for global contextual feature processing.
  • in step 305, the hidden feature sequence generated in step 304 is reorganized to generate a reorganized feature vector.
  • the hidden feature sequence z_I is reorganized into a feature vector.
  • the reorganized feature vector is input into the upsampling network for processing to generate a dense optical flow map.
  • the dense optical flow map may reflect the optical flow of object motion in two adjacent frames obtained in step 301 .
  • the upsampling network is composed of 7 deconvolution blocks, and each deconvolution block is composed of a deconvolution layer and a ReLU activation function, wherein the step size of the 5 deconvolution layers is 2.
  • the size of the dense optical flow map is h ⁇ w ⁇ 3.
  • the upsampling network is a cascaded upsampling network, which realizes feature aggregation at different resolution levels, thereby optimizing the details of dense optical flow.
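The output size of each deconvolution (transposed convolution) layer follows the standard relation out = (in − 1)·stride − 2·pad + kernel. Under assumed kernel and padding values (the patent does not give them), five stride-2 layers restore the input resolution:

```python
def deconv_out(size, kernel=4, stride=2, pad=1):
    """Output length of a transposed convolution along one spatial axis.
    kernel=4, pad=1 are illustrative assumptions that make each stride-2
    layer exactly double the spatial size."""
    return (size - 1) * stride - 2 * pad + kernel

size = 8                      # assumed downsampled height (e.g. 256 / 32)
for _ in range(5):            # the 5 stride-2 deconvolution layers
    size = deconv_out(size)
print(size)  # 256
```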
  • the main advantages of the present invention are: (1) by introducing a multi-head self-attention mechanism into the optical flow prediction task and exploiting the Transformer's global self-attention advantage in sequence-to-sequence prediction, the present invention improves the accuracy of dense optical flow at key positions; (2) thanks to the strong predictive performance of multi-head self-attention at the feature layer, the network depth of Unet's up-sampling and down-sampling can be reduced, so the present invention improves the timeliness of dense optical flow calculation.
  • FIG. 4 illustrates a block diagram 400 of an exemplary computing device, which is an example of a hardware device applicable to aspects of the invention, according to one embodiment of the invention.
  • Computing device 400 can be any machine that can be configured to perform processing and/or computing, and can be, but is not limited to, a workstation, server, desktop, laptop, tablet, personal digital assistant, smartphone, on-board computer, or any combination thereof.
  • Computing device 400 may include components that may be connected or communicate via one or more interfaces and bus 402 .
  • computing device 400 may include a bus 402 , one or more processors 404 , one or more input devices 406 , and one or more output devices 408 .
  • the one or more processors 404 may be any type of processor and may include, but are not limited to, one or more general purpose processors and/or one or more special purpose processors (eg, specialized processing chips).
  • Input device 406 may be any type of device capable of entering information into a computing device and may include, but is not limited to, a mouse, keyboard, touch screen, microphone, and/or remote control.
  • Output devices 408 may be any type of device capable of presenting information and may include, but are not limited to, displays, speakers, video/audio output terminals, vibrators, and/or printers.
  • the computing device 400 may also include, or be connected to, a non-transitory storage device 410 .
  • the non-transitory storage device may be any storage device that is non-transitory and capable of storing data.
  • the non-transitory storage device may include, but is not limited to, a magnetic disk drive, optical storage device, solid-state memory, floppy disk, hard disk, magnetic tape or any other magnetic medium, optical disc or any other optical medium, ROM (read-only memory), RAM (random access memory), cache memory, and/or any other memory chip or cartridge, and/or any other medium from which a computer can read data, instructions and/or code.
  • the non-transitory storage device 410 is detachable from the interface.
  • the non-transitory storage device 410 may have data/instructions/codes for implementing the above methods and steps.
  • Computing device 400 may also include a communication device 412 .
  • Communication device 412 may be any type of device or system capable of communicating with internal devices and/or with a network and may include, but is not limited to, a modem, network card, infrared communication device, wireless communication device, and/or chipset, such as a Bluetooth device, IEEE 802.11 device, WiFi device, WiMax device, cellular communication device, and/or similar devices.
  • Bus 402 may include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
  • Computing device 400 may also include working memory 414, which may be any type of working memory capable of storing instructions and/or data that facilitate the operation of processor 404 and may include, but is not limited to, random access memory and/or read-only memory.
  • Software components may be located in working memory 414 including, but not limited to, an operating system 416, one or more application programs 418, drivers, and/or other data and code.
  • the instructions for implementing the above methods and steps of the present invention may be included in the one or more application programs 418, and may be read and executed by the processor 404 to implement the method 300 described above.
  • custom hardware could also be used, and/or particular components could be implemented in hardware, software, firmware, middleware, microcode, hardware description language, or any combination thereof.
  • connections to other computing devices such as network input/output devices and the like, may be employed.
  • programmable hardware (e.g., programmable logic circuits including field programmable gate arrays (FPGAs) and/or programmable logic arrays (PLAs)) may also be used.
  • such hardware may be programmed using assembly language or hardware description languages (e.g., VERILOG, VHDL, C++).

Abstract

The present invention relates to a dense optical flow calculation method employing an attention mechanism. Provided in the present invention is a dense optical flow calculation method employing a Unet and a Transformer. In the method, a Transformer module is introduced into the Unet architecture to process the feature sequence, effectively using the global self-attention advantage of the Transformer's multi-head self-attention in sequence-to-sequence prediction. In the present invention, two adjacent frames are first joined on their channels by a down-sampling module and input into a convolutional network for down-sampling; a feature processing module then encodes the feature map output by the down-sampling network into an input sequence and carries out global context feature processing; and finally, an up-sampling module up-samples the feature map that has undergone feature processing to reconstruct an optical flow image of the same size as the input image.

Description

A Dense Optical Flow Calculation Method Based on an Attention Mechanism

Technical Field

The invention relates to the field of video applications, and mainly relates to dense optical flow calculation in video applications.

Background

When the human eye observes a moving object, the scene of the object forms a series of continuously changing images on the retina. This continuously changing information constantly "flows" across the retina (i.e., the image plane), like a "flow" of light, hence the name optical flow. Specifically, optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane. The optical flow method uses the temporal changes of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, thereby calculating the motion of objects between adjacent frames. Traditional methods of calculating optical flow are mainly gradient-based, frequency-based, phase-based, and matching-based.
Dense optical flow is an image registration method that performs point-by-point matching over an image or a specified region; it calculates the offset of every point on the image to form a dense optical flow field, through which pixel-level image registration can be performed. The Horn-Schunck algorithm and most region-matching-based optical flow methods belong to the category of dense optical flow. Among optical flow calculation methods using deep learning, FlowNet is the most widely used in practical applications.

The patent "Robust interpolation optical flow calculation method for pyramid occlusion detection block matching" (CN112509014A) discloses a method in which pyramid occlusion-detection block matching is first performed to obtain a sparse robust motion field: two consecutive frames are downsampled by a factor into a k-level image pyramid, and block matching on each pyramid layer yields matching results with initial occlusions; occlusion information is then obtained by an occlusion-detection algorithm based on deformation error. The accurate sparse matching results are turned into a dense optical flow by a robust interpolation algorithm, after which the dense flow is refined by global energy-functional variational optimization to obtain the final optical flow.

The patent "An image sequence optical flow estimation method based on a learnable occlusion mask and secondary deformation optimization" (CN112465872A) discloses a method in which any two consecutive frames of the input image sequence are downsampled into a feature pyramid to obtain multi-resolution two-frame features; the correlation between the first-frame and second-frame features is computed at each pyramid level and used to build an occlusion-mask module; the resulting occlusion mask removes edge artifacts from the warped features to improve the optical flow at blurred motion edges; the occlusion-constrained flow drives a secondary deformation optimization module that further refines motion-edge flow estimation at the sub-pixel level; the same occlusion masking and secondary deformation are applied to the warped features at every pyramid level to obtain residual flows that refine the optical flow, and the final optimized estimate is output at the bottom of the pyramid.

Both of the above patents effectively improve the computational accuracy of optical flow estimation, but the accuracy of dense optical flow still cannot meet the requirements that tasks such as video coding and HDR synthesis place on optical flow. An improved technique is therefore needed to raise the accuracy of dense optical flow calculation.
发明内容Contents of the invention
提供本发明内容以便以简化形式介绍将在以下具体实施方式中进一步的描述一些概念。本发明内容并非旨在标识所要求保护的主题的关键特征或必要特征,也不旨在用于帮助确定所要求保护的主题的范围。This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
相比现有的稠密光流方法,本发明在光流预测计算任务中引入多头自注意力机,利用Transformer在序列到序列预测方面的全局自注意力优势,提高光流计算任务的效果。此外,本发明能够提高关键位置稠密光流图的准确度,同时通过减少Unet的上采样和下采样的网络深度,提高了稠密光流计算的时效性。Compared with the existing dense optical flow method, the present invention introduces a multi-head self-attention machine in the optical flow prediction calculation task, and utilizes Transformer's global self-attention advantage in sequence-to-sequence prediction to improve the effect of the optical flow calculation task. In addition, the present invention can improve the accuracy of the dense optical flow map at key positions, and at the same time improve the timeliness of dense optical flow calculation by reducing the network depth of Unet's up-sampling and down-sampling.
根据本发明的一个实施例,公开了一种用于稠密光流计算的方法,包括:将相邻帧在通道上进行拼接,以生成拼接后的向量图;将拼接后的向量图输入下采样网络进行特征提取,以生成特征向量;将生成的特征向量映射到潜层的高维嵌入空间,以生成一个高维嵌入表示序列;将高维嵌入表示序列输入由I个Transformer层组成的特征处理网络,以生成隐藏特征序列;将生成的隐藏特征序列进行重组,以生成重组后的特征向量;以及将重组后的特征向量输入上采样网络进行处理,以生成稠密光流图。According to an embodiment of the present invention, a method for calculating dense optical flow is disclosed, including: stitching adjacent frames on channels to generate a stitched vector map; inputting the stitched vector map for downsampling The network performs feature extraction to generate feature vectors; the generated feature vectors are mapped to the high-dimensional embedding space of the latent layer to generate a high-dimensional embedding representation sequence; the high-dimensional embedding representation sequence is input into the feature processing composed of I Transformer layers Network to generate hidden feature sequences; reorganize the generated hidden feature sequences to generate reorganized feature vectors; and input the reorganized feature vectors into the upsampling network for processing to generate dense optical flow maps.
根据本发明的另一个实施例,公开了一种用于稠密光流计算的系统,包括下采样模块,特征处理模块和上采样模块。下采样模块被配置为:将相邻帧在 通道上进行拼接,以生成拼接后的向量图;将拼接后的向量图输入下采样网络进行特征提取,以生成特征向量。特征处理模块被配置为:将所述下采样模块生成的特征向量映射到潜层的高维嵌入空间,以生成一个高维嵌入表示序列;将高维嵌入表示序列输入由I个Transformer层组成的特征处理网络,以生成隐藏特征序列。上采样模块被配置为:将所述特征处理模块生成的隐藏特征序列进行重组,以生成重组后的特征向量;以及将重组后的特征向量输入上采样网络进行处理,以生成稠密光流图。According to another embodiment of the present invention, a system for computing dense optical flow is disclosed, including a downsampling module, a feature processing module and an upsampling module. The down-sampling module is configured to: splice adjacent frames on the channel to generate a spliced vector map; input the spliced vector map into the down-sampling network for feature extraction to generate a feature vector. The feature processing module is configured to: map the feature vector generated by the downsampling module to the high-dimensional embedding space of the latent layer to generate a high-dimensional embedding representation sequence; input the high-dimensional embedding representation sequence into the A feature processing network to generate a sequence of hidden features. The upsampling module is configured to: reorganize the hidden feature sequence generated by the feature processing module to generate a reorganized feature vector; and input the reorganized feature vector into the upsampling network for processing to generate a dense optical flow map.
根据本发明的另一个实施例,公开了一种用于稠密光流计算的计算设备,包括:处理器;存储器,所述存储器存储有指令,所述指令在被所述处理器执行时能执行如上所述的方法。According to another embodiment of the present invention, a computing device for dense optical flow calculation is disclosed, including: a processor; a memory, the memory stores instructions, and the instructions can be executed when executed by the processor method as above.
These and other features and advantages will become apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive of the aspects as claimed.
Description of the Drawings
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of what has been briefly summarized above may be had by reference to various embodiments, some aspects of which are illustrated in the accompanying drawings. It is to be noted, however, that the drawings illustrate only certain typical aspects of the invention and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects.
FIG. 1 shows a block diagram of a system 100 for dense optical flow calculation according to an embodiment of the present invention;
FIG. 2 shows a detailed diagram 200 of the modules 101-103 of FIG. 1 according to an embodiment of the present invention;
FIG. 3 shows a flowchart of a method 300 for dense optical flow calculation according to an embodiment of the present invention; and
FIG. 4 shows a block diagram 400 of an exemplary computing device according to an embodiment of the present invention.
Detailed Description
The present invention is described in detail below in conjunction with the accompanying drawings; its features will become further apparent from the following detailed description.
The following explains terms used in the present invention, including the general meanings that are well known to those skilled in the art:
Unet: a segmentation model; specifically, a fully convolutional network comprising 4 downsampling layers, 4 upsampling layers, and skip-connection-like structures. Its characteristic is that the convolutional layers are fully symmetric between the downsampling and upsampling parts, and feature maps from the downsampling side can skip the deeper layers and be concatenated to the corresponding upsampling side.
Transformer: a natural language processing (NLP) model that employs an attention mechanism, originally to perform machine translation tasks.
Optical flow plays an important role in computer vision, with significant applications in target object segmentation, recognition, tracking, robot navigation, and shape information recovery. Optical flow calculation can be widely applied in a variety of scenarios, for example, motion detection for video encoding/decoding in cloud-storage video compression tasks, and motion recognition and video understanding tasks such as high-altitude object-throwing detection and fall detection. To obtain more accurate motion estimation, dense optical flow calculation is a key module in video codec technology. Traditional dense optical flow methods are computationally expensive and poorly timed for real-time use. Existing deep-learning-based optical flow methods improve timeliness, but the accuracy of their dense optical flow maps is low, which negatively affects the quality of video encoding and decoding.
The present invention proposes a dense optical flow calculation method based on Unet and Transformer. The method introduces a Transformer module into the Unet structure and exploits the Transformer's global self-attention advantage in sequence-to-sequence prediction to improve the accuracy of dense optical flow at key positions, while also reducing the depth of Unet's upsampling and downsampling networks and thereby improving the timeliness of dense optical flow calculation.
FIG. 1 shows a block diagram of a system 100 for dense optical flow calculation according to an embodiment of the present invention. As shown in FIG. 1, the system 100 is divided into modules that communicate and exchange data with one another in manners known in the art. In the present invention, each module may be implemented in software, hardware, or a combination thereof. As shown in FIG. 1, the system 100 may include a downsampling module 101, a feature processing module 102, and an upsampling module 103.
According to an embodiment of the present invention, the downsampling module 101 is configured to splice two adjacent frames along their channels (for example, color channels) to form an input picture, which is fed into a convolutional network for downsampling to obtain a feature map. The feature processing module 102 is configured to encode the feature map output by the downsampling module 101 as an input sequence and perform global context feature processing on it. The upsampling module 103 is configured as a cascaded upsampler that upsamples the processed feature map to reconstruct an optical flow map of the same size as the input picture.
FIG. 2 shows a detailed diagram 200 of the modules 101-103 of FIG. 1 according to an embodiment of the present invention.
As shown in FIG. 2, the downsampling module 101 receives two adjacent frames 201 and first splices them to obtain an h×w×6 vector map, which is then fed into a downsampling network composed of 7 convolution blocks. Each convolution block consists of a convolutional layer and a ReLU activation function, and 5 of the convolutional layers have a stride of 2.
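The text above fixes only the block count (7) and the number of stride-2 layers (5); with those constraints, the overall spatial reduction is 2^5 = 32 regardless of where the stride-2 layers sit. The shape arithmetic can be checked with a short sketch (kernel size 3, padding 1, and the ordering of the strides are assumptions, since the filing does not state them):

```python
def conv_out(size: int, stride: int, kernel: int = 3, pad: int = 1) -> int:
    # Standard convolution output-size formula; kernel=3, pad=1 are assumed.
    return (size + 2 * pad - kernel) // stride + 1

def downsample_shape(h: int, w: int) -> tuple[int, int]:
    # 7 convolution blocks, 5 of which use stride 2 (per the description above);
    # the ordering of the stride-2 blocks is an assumption.
    strides = [2, 1, 2, 1, 2, 2, 2]
    for s in strides:
        h, w = conv_out(h, s), conv_out(w, s)
    return h, w

print(downsample_shape(256, 512))  # → (8, 16), i.e. (256/32, 512/32)
```

Any permutation of the same strides gives the same final size, since only the product of the strides matters.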
Finally, the downsampling module 101 outputs a feature map of size (h/32)×(w/32)×c, where c is the channel dimension of the last convolution block, for the feature processing module 102 to process. (The five stride-2 convolutional layers reduce each spatial dimension by a factor of 2^5 = 32; the exact size expression appears as image PCTCN2022097531-appb-000001 in the original filing.)
As shown in FIG. 2, the feature processing module 102 uses a trainable linear map E to project the feature map sequence output by the downsampling module 101 into the high-dimensional latent embedding space, computed as shown in equation (1) (the equation is rendered as image PCTCN2022097531-appb-000002 in the original filing).
The high-dimensional embedding representation sequence is then fed into a feature processing network composed of I Transformer layers. The specific structure of the Transformer layer is shown in FIG. 3. Specifically, a Transformer layer consists of a multi-head self-attention (MSA) block and a multi-layer perceptron (MLP), and the output of the i-th layer is given by equations (2) and (3):
z′_i = MSA(LN(z_{i-1})) + z_{i-1},        (2)
z_i = MLP(LN(z′_i)) + z′_i,          (3)
where LN(·) denotes the layer normalization operation. The feature processing module 102 finally outputs the hidden feature sequence z_I.
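Equations (2) and (3) describe a standard pre-norm Transformer layer. The following NumPy sketch implements one such layer; single-head attention stands in for MSA, and the weight shapes, MLP width, and initialization are illustrative assumptions not taken from the filing:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN(.) in eqs. (2)/(3): normalize each token vector.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def transformer_layer(z, Wq, Wk, Wv, W1, W2):
    # z'_i = MSA(LN(z_{i-1})) + z_{i-1}; single-head attention stands in for MSA.
    h = layer_norm(z)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    z_prime = attn + z
    # z_i = MLP(LN(z'_i)) + z'_i; two-layer MLP with ReLU (hidden width assumed).
    m = layer_norm(z_prime)
    return np.maximum(m @ W1, 0) @ W2 + z_prime

rng = np.random.default_rng(0)
n, d, d_ff = 16, 32, 64  # sequence length / embed dim / MLP width (assumed)
z = rng.standard_normal((n, d))
Ws = [rng.standard_normal(s) * 0.1
      for s in [(d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]]
out = transformer_layer(z, *Ws)
print(out.shape)  # → (16, 32): residual connections preserve the sequence shape
```

Stacking I such layers maps z_0 to the hidden feature sequence z_I without changing its shape, which is what allows the later reorganization back into a spatial map.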
As shown in FIG. 2, the upsampling module 103 is a cascaded upsampling network comprising multiple upsampling steps that decode the features into the final optical flow picture 202. First, the upsampling module 103 reorganizes the hidden feature sequence z_I output by the feature processing module 102 into a feature vector of size (h/32)×(w/32)×c (the size expression appears as image PCTCN2022097531-appb-000003 in the original filing), which is then fed into an upsampling network composed of 7 deconvolution blocks. Each deconvolution block consists of a deconvolution layer and a ReLU activation function, and 5 of the deconvolution layers have a stride of 2. The final output is an optical flow map of size h×w×3. In addition, the present invention adds three skip connections to the downsampling feature vectors to aggregate features at different resolution levels (203, 204, 205), thereby refining the details of the optical flow.
FIG. 3 shows a flowchart of a method 300 for dense optical flow calculation according to an embodiment of the present invention.
In step 301, adjacent frames are spliced along the channel dimension to generate a spliced vector map. According to an embodiment of the present invention, the channels are color channels, for example RGB channels. According to an embodiment of the present invention, the vector map has size h×w×6.
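The channel splicing of step 301 is a plain concatenation along the channel axis: two h×w×3 frames become one h×w×6 input. A minimal NumPy sketch with toy sizes:

```python
import numpy as np

h, w = 4, 5                    # toy frame size, for illustration only
frame1 = np.zeros((h, w, 3))   # two adjacent RGB frames
frame2 = np.ones((h, w, 3))

# Stack the two frames along the channel (last) axis: h x w x 3 -> h x w x 6.
spliced = np.concatenate([frame1, frame2], axis=-1)
print(spliced.shape)  # → (4, 5, 6)
```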
In step 302, the spliced vector map is input into a downsampling network for feature extraction to generate a feature vector. According to an embodiment of the present invention, the downsampling network is composed of 7 convolution blocks, each consisting of a convolutional layer and a ReLU activation function, with 5 of the convolutional layers having a stride of 2. According to an embodiment of the present invention, the feature vector has size (h/32)×(w/32)×c (the size expression appears as image PCTCN2022097531-appb-000004 in the original filing).
In step 303, the feature vector generated in step 302 is mapped into a high-dimensional latent embedding space to generate a high-dimensional embedding representation sequence. According to an embodiment of the present invention, a trainable linear map E may be used to map the feature vector obtained in step 302 into the high-dimensional latent embedding space.
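Steps 303 and 305 are, in effect, inverse reshaping operations around the Transformer stack: the downsampled feature map is flattened into a token sequence and linearly embedded, and the hidden sequence is later reorganized back into a spatial map. A sketch of both directions (the sizes H, W, c and the embedding dimension d are illustrative assumptions, and E is random here rather than trained):

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, c = 8, 16, 64   # downsampled feature-map shape, i.e. h/32 x w/32 x c (values assumed)
d = 128               # embedding dimension of the latent space (assumed)

feat = rng.standard_normal((H, W, c))

# Step 303: flatten the map into an N x c token sequence, then apply a linear map E.
tokens = feat.reshape(H * W, c)
E = rng.standard_normal((c, d)) * 0.1
z0 = tokens @ E       # high-dimensional embedding representation sequence

# ... the I Transformer layers would transform z0 into z_I here ...
z_I = z0

# Step 305: reorganize the hidden sequence back into a spatial feature map.
feat_out = z_I.reshape(H, W, d)
print(z0.shape, feat_out.shape)  # → (128, 128) (8, 16, 128)
```

Because the Transformer layers preserve the sequence shape, the reorganization in step 305 is a lossless reshape back onto the H×W grid.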
In step 304, the high-dimensional embedding representation sequence is input into a feature processing network composed of I Transformer layers to generate a hidden feature sequence. According to an embodiment of the present invention, each Transformer layer consists of an MSA block and an MLP for global context feature processing.
In step 305, the hidden feature sequence generated in step 304 is reorganized to generate a reorganized feature vector. According to an embodiment of the present invention, the hidden feature sequence z_I is reorganized into a feature vector of size (h/32)×(w/32)×c (the size expression appears as image PCTCN2022097531-appb-000005 in the original filing).
In step 306, the reorganized feature vector is input into an upsampling network for processing to generate a dense optical flow map. The dense optical flow map reflects the optical flow of object motion between the two adjacent frames obtained in step 301. According to an embodiment of the present invention, the upsampling network is composed of 7 deconvolution blocks, each consisting of a deconvolution layer and a ReLU activation function, with 5 of the deconvolution layers having a stride of 2. According to an embodiment of the present invention, the dense optical flow map has size h×w×3. According to an embodiment of the present invention, the upsampling network is a cascaded upsampling network that aggregates features at different resolution levels, thereby refining the details of the dense optical flow.
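The transposed-convolution path can be checked with the same kind of shape arithmetic as the downsampling path: five stride-2 deconvolutions undo the 1/32 spatial reduction. A sketch under stated assumptions (kernel sizes, padding, and the ordering of the stride-2 layers are not given in the filing; only the block count and the number of stride-2 layers are):

```python
def deconv_out(size: int, stride: int) -> int:
    # Transposed-convolution output size; kernel/padding are chosen (assumed)
    # so that stride 2 exactly doubles the size and stride 1 preserves it.
    kernel, pad = (4, 1) if stride == 2 else (3, 1)
    return (size - 1) * stride - 2 * pad + kernel

def upsample_shape(h: int, w: int) -> tuple[int, int]:
    # 7 deconvolution blocks, 5 with stride 2, mirroring the downsampling path;
    # the ordering of the stride-2 blocks is an assumption.
    strides = [2, 2, 2, 1, 2, 1, 2]
    for s in strides:
        h, w = deconv_out(h, s), deconv_out(w, s)
    return h, w

print(upsample_shape(8, 16))  # → (256, 512): recovers the h×w input resolution
```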
In summary, compared with the prior art, the main advantages of the present invention are: (1) by introducing multi-head self-attention into the optical flow prediction task and exploiting the Transformer's global self-attention advantage in sequence-to-sequence prediction, the present invention improves the accuracy of dense optical flow at key positions; and (2) thanks to the excellent performance of multi-head self-attention for predictive computation at the feature level, the depth of Unet's upsampling and downsampling networks can be reduced, so the present invention improves the timeliness of dense optical flow calculation.
FIG. 4 shows a block diagram 400 of an exemplary computing device according to an embodiment of the present invention; the computing device is one example of a hardware device to which aspects of the present invention may be applied. The computing device 400 may be any machine configurable to perform processing and/or computation, and may be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a smartphone, an in-vehicle computer, or any combination thereof. The computing device 400 may include components connected to, or communicating via, one or more interfaces and a bus 402. For example, the computing device 400 may include the bus 402, one or more processors 404, one or more input devices 406, and one or more output devices 408. The one or more processors 404 may be any type of processor and may include, but are not limited to, one or more general-purpose processors and/or one or more special-purpose processors (for example, specialized processing chips). The input device 406 may be any type of device capable of inputting information into the computing device and may include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote controller. The output device 408 may be any type of device capable of presenting information and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator, and/or a printer. The computing device 400 may also include, or be connected to, a non-transitory storage device 410, which may be any storage device that is non-transitory and capable of storing data, and which may include, but is not limited to, a disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, an optical disc or any other optical medium, ROM (read-only memory), RAM (random access memory), cache memory, and/or any memory chip or cartridge, and/or any other medium from which a computer can read data, instructions, and/or code. The non-transitory storage device 410 may be detachable from an interface. The non-transitory storage device 410 may have data/instructions/code for implementing the methods and steps described above. The computing device 400 may also include a communication device 412. The communication device 412 may be any type of device or system capable of communicating with internal apparatuses and/or with a network, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a Bluetooth device, an IEEE 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The bus 402 may include, but is not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
The computing device 400 may also include a working memory 414, which may be any type of working memory capable of storing instructions and/or data useful to the work of the processor 404 and may include, but is not limited to, random access memory and/or a read-only storage device.
Software components may be located in the working memory 414, including, but not limited to, an operating system 416, one or more application programs 418, drivers, and/or other data and code. Instructions for implementing the above-described methods and steps of the present invention may be contained in the one or more application programs 418, and the above-described method 300 of the present invention may be implemented by the processor 404 reading and executing the instructions of the one or more application programs 418.
It should also be appreciated that variations may be made according to specific requirements. For example, custom hardware may also be used, and/or particular components may be implemented in hardware, software, firmware, middleware, microcode, a hardware description language, or any combination thereof. In addition, connections to other computing devices, such as network input/output devices, may be employed. For example, some or all of the disclosed methods and devices may be implemented by programming hardware (for example, programmable logic circuits including field-programmable gate arrays (FPGAs) and/or programmable logic arrays (PLAs)) in an assembly language or a hardware programming language (for example, VERILOG, VHDL, C++) using logic and algorithms according to the present invention.
Although aspects of the present invention have thus far been described with reference to the accompanying drawings, the methods and devices described above are merely examples, and the scope of the present invention is not limited to these aspects but is defined only by the appended claims and their equivalents. Various components may be omitted or replaced by equivalent components. In addition, the steps may be implemented in an order different from that described in the present invention. Furthermore, various components may be combined in various ways. It is also important that, as technology develops, many of the described components may be replaced by equivalent components that appear later.

Claims (10)

  1. A method for dense optical flow calculation, comprising:
    splicing adjacent frames along the channel dimension to generate a spliced vector map;
    inputting the spliced vector map into a downsampling network for feature extraction to generate a feature vector;
    mapping the generated feature vector into a high-dimensional latent embedding space to generate a high-dimensional embedding representation sequence;
    inputting the high-dimensional embedding representation sequence into a feature processing network composed of I Transformer layers to generate a hidden feature sequence;
    reorganizing the generated hidden feature sequence to generate a reorganized feature vector; and
    inputting the reorganized feature vector into an upsampling network for processing to generate a dense optical flow map.
  2. The method of claim 1, wherein the downsampling network is composed of 7 convolution blocks, each convolution block consisting of a convolutional layer and a ReLU activation function, 5 of the convolutional layers having a stride of 2.
  3. The method of claim 1, wherein the Transformer layer consists of a multi-head self-attention block and a multi-layer perceptron.
  4. The method of claim 1, wherein the upsampling network is a cascaded upsampling network composed of 7 deconvolution blocks, each deconvolution block consisting of a deconvolution layer and a ReLU activation function, 5 of the deconvolution layers having a stride of 2.
  5. The method of claim 1, wherein mapping the generated feature vector into the high-dimensional latent embedding space to generate a high-dimensional embedding representation sequence further comprises: using a trainable linear map E to map the feature vector into the high-dimensional latent embedding space.
  6. A system for dense optical flow calculation, comprising:
    a downsampling module, the downsampling module being configured to:
    splice adjacent frames along the channel dimension to generate a spliced vector map; and
    input the spliced vector map into a downsampling network for feature extraction to generate a feature vector;
    a feature processing module, the feature processing module being configured to:
    map the feature vector generated by the downsampling module into a high-dimensional latent embedding space to generate a high-dimensional embedding representation sequence; and
    input the high-dimensional embedding representation sequence into a feature processing network composed of I Transformer layers to generate a hidden feature sequence; and
    an upsampling module, the upsampling module being configured to:
    reorganize the hidden feature sequence generated by the feature processing module to generate a reorganized feature vector; and
    input the reorganized feature vector into an upsampling network for processing to generate a dense optical flow map.
  7. The system of claim 6, wherein the downsampling network is composed of 7 convolution blocks, each convolution block consisting of a convolutional layer and a ReLU activation function, 5 of the convolutional layers having a stride of 2; and
    wherein the upsampling network is a cascaded upsampling network composed of 7 deconvolution blocks, each deconvolution block consisting of a deconvolution layer and a ReLU activation function, 5 of the deconvolution layers having a stride of 2.
  8. The system of claim 6, wherein the Transformer layer consists of a multi-head self-attention block and a multi-layer perceptron.
  9. The system of claim 6, wherein mapping the generated feature vector into the high-dimensional latent embedding space to generate a high-dimensional embedding representation sequence further comprises: using a trainable linear map E to map the feature vector into the high-dimensional latent embedding space.
  10. A computing device for dense optical flow calculation, comprising:
    a processor; and
    a memory storing instructions which, when executed by the processor, perform the method of any one of claims 1-5.
PCT/CN2022/097531 2021-12-28 2022-06-08 Dense optical flow calculation method employing attention mechanism WO2023123873A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111623934.7 2021-12-28
CN202111623934.7A CN114913196A (en) 2021-12-28 2021-12-28 Attention-based dense optical flow calculation method

Publications (1)

Publication Number Publication Date
WO2023123873A1 true WO2023123873A1 (en) 2023-07-06

Family

ID=82763430

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/097531 WO2023123873A1 (en) 2021-12-28 2022-06-08 Dense optical flow calculation method employing attention mechanism

Country Status (2)

Country Link
CN (1) CN114913196A (en)
WO (1) WO2023123873A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486107B (en) * 2023-06-21 2023-09-05 南昌航空大学 Optical flow calculation method, system, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140153784A1 (en) * 2012-10-18 2014-06-05 Thomson Licensing Spatio-temporal confidence maps
CN111724360A (en) * 2020-06-12 2020-09-29 深圳技术大学 Lung lobe segmentation method and device and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140153784A1 (en) * 2012-10-18 2014-06-05 Thomson Licensing Spatio-temporal confidence maps
CN111724360A (en) * 2020-06-12 2020-09-29 深圳技术大学 Lung lobe segmentation method and device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN JIENENG, LU YONGYI, YU QIHANG, LUO XIANGDE, ADELI EHSAN, WANG YAN, LU LE, YUILLE ALAN L, ZHOU YUYIN: "Transunet: Transformers make strong encoders for medical image segmentation", 8 February 2021 (2021-02-08), XP093010142, Retrieved from the Internet <URL:https://arxiv.org/pdf/2102.04306.pdf> [retrieved on 20221221], DOI: 10.48550/arXiv.2102.04306 *
YAO-QIAN LI, LI CAI-ZI, LIU RUI-QIANG, SI WEI-XIN, JIN YUE-MING: "Semi-supervised Spatiotemporal Transformer Networks for Semantic Segmentation of Surgical Instrument", JOURNAL OF SOFTWARE, vol. 33, no. 4, 15 April 2022 (2022-04-15), pages 1501-1515, XP093074494, ISSN: 1000-9825, DOI: 10.13328/j.cnki.jos.006469 *

Also Published As

Publication number Publication date
CN114913196A (en) 2022-08-16

Similar Documents

Publication Publication Date Title
Liu et al. Video super-resolution based on deep learning: a comprehensive survey
Chen et al. Learning spatial attention for face super-resolution
Xie et al. Edge-guided single depth image super resolution
US20220222776A1 (en) Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution
WO2023060746A1 (en) Small image multi-object detection method based on super-resolution
WO2021105765A1 (en) Systems and methods for performing direct conversion of image sensor data to image analytics
CN113066017B (en) Image enhancement method, model training method and equipment
CN106663314A (en) Real time skin smoothing image enhancement filter
US20220156943A1 (en) Consistency measure for image segmentation processes
KR102289239B1 (en) Disparity estimation system and method, electronic device, and computer-readable storage medium
WO2022062344A1 (en) Method, system, and device for detecting salient target in compressed video, and storage medium
EP3874404A1 (en) Video recognition using multiple modalities
US20220101539A1 (en) Sparse optical flow estimation
WO2023036157A1 (en) Self-supervised spatiotemporal representation learning by exploring video continuity
WO2021051606A1 (en) Lip shape sample generating method and apparatus based on bidirectional lstm, and storage medium
WO2023123873A1 (en) Dense optical flow calculation method employing attention mechanism
Yu et al. Event-based high frame-rate video reconstruction with a novel cycle-event network
Li et al. Self-supervised monocular depth estimation with frequency-based recurrent refinement
WO2024032331A9 (en) Image processing method and apparatus, electronic device, and storage medium
WO2024041235A1 (en) Image processing method and apparatus, device, storage medium and program product
CN111726621B (en) Video conversion method and device
US20230093827A1 (en) Image processing framework for performing object depth estimation
CN115272906A (en) Video background portrait segmentation model and algorithm based on point rendering
Hussein et al. Enhanced Semantic Segmentation of Aerial images with Spatial Smoothness Using CRF Model
US20230177722A1 (en) Apparatus and method with object posture estimating

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913162

Country of ref document: EP

Kind code of ref document: A1