CN116486107B - Optical flow calculation method, system, equipment and medium - Google Patents

Optical flow calculation method, system, equipment and medium

Info

Publication number
CN116486107B
CN116486107B (application CN202310735464.6A)
Authority
CN
China
Prior art keywords
image
optical flow
global
motion information
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310735464.6A
Other languages
Chinese (zh)
Other versions
CN116486107A (en)
Inventor
王子旭
葛利跃
陈震
张聪炫
卢锋
吕科
胡卫明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Hangkong University
Original Assignee
Nanchang Hangkong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Hangkong University
Priority to CN202310735464.6A
Publication of CN116486107A
Application granted
Publication of CN116486107B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an optical flow calculation method, system, device and medium, relating to the field of optical flow processing. The method comprises: acquiring a target image, where the target image comprises two consecutive frames, a first image and a second image; extracting motion features of the target image with a motion feature extraction network; determining a feature map of the first image and a feature map of the second image from the motion features and the number of feature extraction channels, and computing the matching cost volume of the two feature maps; extracting context features of the first image with a context encoder; and, based on the matching cost volume and the context features, solving by loop iteration with a global-local loop optical flow decoder to obtain the optical flow field of the target image. The global-local loop optical flow decoder is constructed from a depth separable residual block, a multi-layer perceptron block, a depth separable convolution module, and a multi-head attention module. The invention improves the accuracy and robustness of optical flow estimation.

Description

Optical flow calculation method, system, equipment and medium
Technical Field
The present invention relates to the field of optical flow processing, and in particular, to a method, a system, an apparatus, and a medium for optical flow calculation.
Background
Optical flow refers to the two-dimensional motion vectors of pixels on moving objects and scene surfaces in an image sequence; it not only provides the motion vectors of objects and the scene in the image, but also carries rich shape and structure information. Optical flow estimation is therefore a research hotspot in image processing and computer vision, and it provides valuable motion cues for many high-level vision tasks such as action recognition, video interpolation, video segmentation, and object tracking.
In recent years, with the rise of deep learning, optical flow estimation models based on convolutional neural networks (Convolutional Neural Network, CNN) have been highly successful. Such methods first use a data-driven, learning-based optimization strategy and extract image features with a modeled feature encoder. They then compute the similarity of all feature vectors between the feature maps, take the pair of feature vectors with the highest similarity as matching points, and finally decode the displacement field between consecutive frames. Because encoding and decoding require features of sufficient resolution to reduce matching errors caused by large-displacement motion and local ambiguity (occlusion, weak texture, illumination variation, and the like), how to decode motion features efficiently and accurately becomes the key to improving the accuracy and robustness of optical flow estimation. However, existing deep learning optical flow models generally perform optical flow decoding with local convolution operations of limited receptive field, which leaves the model's feature extraction and expression capability insufficient and thereby limits the overall performance of optical flow estimation.
Disclosure of Invention
Based on the above, the embodiment of the invention provides an optical flow calculation method, an optical flow calculation system, an optical flow calculation device and an optical flow calculation medium, so as to improve the accuracy and the robustness of optical flow estimation.
In order to achieve the above object, the embodiment of the present invention provides the following solutions:
an optical flow calculation method, comprising:
acquiring a target image; the target image includes: a first image and a second image; the first image and the second image are two continuous frames of images;
extracting the motion characteristics of the target image by adopting a motion characteristic extraction network; the motion feature extraction network comprises a plurality of convolution layers with different sizes;
determining a feature map of the first image and a feature map of the second image according to the motion feature and the feature extraction channel number of the target image, and calculating the matching cost volume of the feature map of the first image and the feature map of the second image;
extracting a context feature of the first image using a context encoder; the structure of the context encoder is the same as that of the motion feature extraction network;
based on the matching cost volume and the context characteristics, adopting a global-local loop optical flow decoder to carry out loop iteration solution to obtain an optical flow field of the target image;
wherein the global-local loop optical flow decoder comprises: a local motion information encoder, a global motion information encoder and a global-local motion information decoder connected in sequence; the output of the global-local motion information decoder is connected to the input of the local motion information encoder;
the local motion information encoder and the global-local motion information decoder each include: the depth separable residual block and the multi-layer perceptron block are connected in sequence; the global motion information encoder includes: a depth separable convolution module and a multi-head attention module which are connected in sequence;
the local motion information encoder is used for encoding according to the matching cost volume and the residual optical flow of the last iteration to obtain local motion characteristics;
the global motion information encoder is used for encoding according to the local motion characteristics and the context characteristics to obtain global motion information;
the global-local motion information decoder is used for decoding according to the local motion characteristics, the global motion information and the context characteristics to obtain the residual optical flow of the current iteration; the residual optical flow of the last iteration is used to determine the optical flow field of the target image.
Optionally, the motion feature extraction network specifically includes: the first convolution layer, the convolution residual block and the second convolution layer are sequentially connected;
the convolution kernel size of the first convolution layer is 7×7; the convolution residual block includes: the third convolution layer and the fourth convolution layer are sequentially connected; the size of the convolution kernel of the second convolution layer is 1×1; the size of the convolution kernel of the third convolution layer is 3 multiplied by 3, and the step length is 2; the size of the convolution kernel of the fourth convolution layer is 3×3, and the step size is 1.
Optionally, determining a feature map of the first image and a feature map of the second image according to the motion feature of the target image and the feature extraction channel number, and calculating a matching cost volume of the feature map of the first image and the feature map of the second image, which specifically includes:
determining the first half of the motion features of the target image as the feature map of the first image, and determining the second half of the motion features of the target image as the feature map of the second image;
performing dot product similarity operation on the feature images of the first image and the feature images of the second image to obtain matching cost information of the feature images of the first image and the feature images of the second image;
and downsampling the matching cost information by adopting pooling operation to obtain the matching cost volumes of the feature map of the first image and the feature map of the second image.
Optionally, the depth separable residual block specifically includes: the first depth separable convolution layer, the first activation function, the second depth separable convolution layer and the second activation function are connected in sequence;
the convolution kernel size of the first depth separable convolution layer is 7×7; the second depth separable convolution layer is densely connected, with a convolution kernel size of 15×15; the first activation function and the second activation function are both GELU activation functions.
Optionally, the multi-layer perceptron block specifically includes: the fifth convolution layer, the third depth separable convolution layer, the third activation function and the sixth convolution layer are sequentially connected;
the convolution kernel sizes of the fifth convolution layer and the sixth convolution layer are 1×1; the convolution kernel size of the third depth separable convolution layer is 3×3; the third activation function is a GELU activation function.
The present invention also provides an optical flow computing system comprising:
the image acquisition module is used for acquiring a target image; the target image includes: a first image and a second image; the first image and the second image are two continuous frames of images;
the motion feature extraction module is used for extracting the motion features of the target image by adopting a motion feature extraction network; the motion feature extraction network comprises a plurality of convolution layers with different sizes;
the matching cost calculation module is used for determining a feature map of the first image and a feature map of the second image according to the motion feature and the feature extraction channel number of the target image, and calculating the matching cost volume of the feature map of the first image and the feature map of the second image;
a context feature extraction module for extracting context features of the first image using a context encoder; the structure of the context encoder is the same as that of the motion feature extraction network;
the optical flow field solving module is used for carrying out loop iteration solving by adopting a global-local loop optical flow decoder based on the matching cost volume and the context characteristics to obtain an optical flow field of the target image;
wherein the global-local loop optical flow decoder comprises: a local motion information encoder, a global motion information encoder and a global-local motion information decoder connected in sequence; the output of the global-local motion information decoder is connected to the input of the local motion information encoder;
the local motion information encoder and the global-local motion information decoder each include: the depth separable residual block and the multi-layer perceptron block are connected in sequence; the global motion information encoder includes: a depth separable convolution module and a multi-head attention module which are connected in sequence;
the local motion information encoder is used for encoding according to the matching cost volume and the residual optical flow of the last iteration to obtain local motion characteristics;
the global motion information encoder is used for encoding according to the local motion characteristics and the context characteristics to obtain global motion information;
the global-local motion information decoder is used for decoding according to the local motion characteristics, the global motion information and the context characteristics to obtain the residual optical flow of the current iteration; the residual optical flow of the last iteration is used to determine the optical flow field of the target image.
The invention also provides an electronic device, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor runs the computer program to enable the electronic device to execute the optical flow calculation method.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the optical flow calculation method described above.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
aiming at the problem of insufficient feature extraction capability in the existing optical flow estimation model, the embodiment of the invention introduces a depth separable residual block and a multi-layer perceptron block to increase the receptive field, and constructs a global-local circulation optical flow decoder by means of the local characteristics of a depth local motion information encoder and the global characteristics of a global motion information encoder, wherein the global-local circulation optical flow decoder is used as an optical flow estimation model related to global and local motion information, and can improve the accuracy and the robustness of optical flow estimation of a large-displacement image area and a weak texture area.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of an optical flow calculation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a first image according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a second image according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a global-local loop optical flow decoder according to an embodiment of the present invention;
FIG. 5 is a visual image of an optical flow field provided by an embodiment of the present invention;
FIG. 6 is a block diagram of an optical flow computing system according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
Referring to fig. 1, the optical flow calculation method of the present embodiment includes:
step 101: acquiring a target image; the target image includes: a first image and a second image; the first image and the second image are two consecutive frames of images.
In the present embodiment, two consecutive frames, the thirtieth and thirty-first frames of the bamboo_3 image sequence, are selected as input: the thirtieth frame serves as the first image I_1, as shown in fig. 2, and the thirty-first frame serves as the second image I_2, as shown in fig. 3.
Step 102: extracting the motion characteristics of the target image by adopting a motion characteristic extraction network; the motion feature extraction network includes a plurality of different sized convolutional layers.
Specifically, this step first constructs the motion feature extraction network: a stack of successive convolutions extracts motion features from the first image I_1 and the second image I_2. The input of the motion feature extraction network is the stacked image F obtained by stacking the first image I_1 and the second image I_2, where H denotes the height and W the width of the input images; the output is the motion feature F_M of the two consecutive frames.
The motion feature extraction network specifically comprises: a first convolution layer, a convolution residual block, and a second convolution layer connected in sequence. The convolution kernel size of the first convolution layer is 7×7; the convolution residual block includes a third convolution layer and a fourth convolution layer connected in sequence; the convolution kernel size of the second convolution layer is 1×1; the convolution kernel size of the third convolution layer is 3×3 with a stride of 2; the convolution kernel size of the fourth convolution layer is 3×3 with a stride of 1.
The motion feature extraction network is divided into 3 stages (Stage 1, Stage 2 and Stage 3), at 1/2, 1/4 and 1/8 resolution respectively. The input first passes through the 7×7 first convolution layer of Stage 1, which downsamples and extracts features; it then passes through the two successively stacked 3×3 convolution residual blocks of Stage 2 and Stage 3, each of which downsamples the image by a further factor of 2; finally, a 1×1 second convolution layer adjusts the channel number and outputs the motion feature F_M of the two consecutive frames. The specific calculation formula is as follows:
F_M = Conv_1×1(ConvBlock_3×3(ConvBlock_3×3(Conv_7×7(F))))
This formula represents the feature extraction process of the motion feature extraction network. Conv_7×7(·) and Conv_1×1(·) denote feature extraction of the image with the 7×7 first convolution layer and the 1×1 second convolution layer, respectively; ConvBlock_3×3(·) denotes feature extraction with a convolution residual block composed of the third convolution layer (stride 2, kernel 3×3) and the fourth convolution layer (stride 1, kernel 3×3), that is
f_1 = relu(Conv_3×3,s=2(x_1) + R(x_1)),    f = relu(Conv_3×3,s=1(f_1) + f_1)
where f_1 denotes the result of downsampling and feature extraction of the input x_1 by the stride-2 3×3 third convolution layer, followed by the residual connection and the relu activation function; f denotes the result of further feature extraction of f_1 by the stride-1 3×3 fourth convolution layer, followed by the residual connection and the relu activation function; and R(·) denotes the residual (shortcut) branch that matches the resolution of x_1 to that of f_1.
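As a concrete illustration, the following PyTorch sketch assembles the three stages described above. It is a minimal reconstruction under stated assumptions: the channel widths (64, 96, 128, 256), the projection shortcut in the stride-2 residual branch, and the stacking of the two frames along the batch axis are ours, not taken from the patent.

# A minimal sketch of the motion feature extraction network (assumptions noted above).
import torch
import torch.nn as nn

class ConvResidualBlock(nn.Module):
    """Stride-2 3x3 conv (downsampling) + stride-1 3x3 conv, each with a residual connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)   # third convolution layer
        self.conv = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1)  # fourth convolution layer
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=2)              # shortcut R(x_1), an assumption
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f1 = self.relu(self.down(x) + self.skip(x))  # f_1 in the formula above
        return self.relu(self.conv(f1) + f1)         # f in the formula above

class MotionFeatureExtractor(nn.Module):
    def __init__(self, in_ch=3, dim=256):
        super().__init__()
        self.stage1 = nn.Conv2d(in_ch, 64, 7, stride=2, padding=3)  # Stage 1, 1/2 resolution
        self.stage2 = ConvResidualBlock(64, 96)                     # Stage 2, 1/4 resolution
        self.stage3 = ConvResidualBlock(96, 128)                    # Stage 3, 1/8 resolution
        self.out = nn.Conv2d(128, dim, 1)                           # 1x1 channel adjustment

    def forward(self, stacked):
        return self.out(self.stage3(self.stage2(self.stage1(stacked))))

frames = torch.randn(2, 3, 256, 512)    # I1 and I2 stacked along the batch axis
f_m = MotionFeatureExtractor()(frames)  # (2, 256, 32, 64): motion features F_M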
Step 103: and determining a feature map of the first image and a feature map of the second image according to the motion feature of the target image and the feature extraction channel number, and calculating the matching cost volume of the feature map of the first image and the feature map of the second image.
The method specifically comprises the following steps:
(1) The motion feature F_M is split into two halves along the feature extraction channel dimension: the first half of F_M is determined as the feature map F_1 of the first image, and the second half as the feature map F_2 of the second image.
(2) Dot-product similarity is computed between the feature vectors of the feature map of the first image and those of the feature map of the second image, yielding the matching cost information between all pairs of relevant points on the two feature maps.
(3) The matching cost information is downsampled by pooling, which converts large-displacement matching cost information into small-displacement matching cost information and yields the matching cost volumes of the two feature maps in the form of a multi-scale matching cost pyramid. The calculation formula is as follows:
Cost = F_1 ⊗ F_2,    Cost_l = AvgPool(Cost, l)
where Cost denotes the matching cost information; ⊗ denotes a matrix multiplication operation; AvgPool denotes an average pooling operation; l denotes the layer number of the multi-scale matching cost pyramid; and Cost_l denotes the matching cost volume of layer l of the pyramid, obtained by downsampling the matching cost information. As l changes, the size of each feature map in Cost changes, with the change determined by the stride of the average pooling operation. This embodiment obtains the matching cost volume of every layer so that optical flow can be estimated well under both large and small displacements.
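The pyramid construction can be sketched as follows, assuming RAFT-style all-pairs correlation: dot-product similarity between every feature vector of F_1 and every feature vector of F_2, then repeated average pooling over the dimensions of the second image so that large displacements become small displacements at coarse levels. The number of levels and the demo sizes are illustrative assumptions.

import torch
import torch.nn.functional as F

def cost_volume_pyramid(f1, f2, levels=4):
    """f1, f2: (C, H, W) feature maps; returns cost volumes of shape (H, W, H/2^l, W/2^l)."""
    c, h, w = f1.shape
    cost = torch.einsum("chw,cuv->hwuv", f1, f2)  # Cost = F_1 (x) F_2: all-pairs dot product
    pyramid = [cost]
    for _ in range(levels - 1):
        u, v = cost.shape[2], cost.shape[3]
        cost = F.avg_pool2d(cost.reshape(h * w, 1, u, v), 2)  # AvgPool over the (u, v) axes only
        cost = cost.reshape(h, w, u // 2, v // 2)
        pyramid.append(cost)
    return pyramid

pyr = cost_volume_pyramid(torch.randn(256, 32, 64), torch.randn(256, 32, 64))
print([tuple(p.shape) for p in pyr])
# [(32, 64, 32, 64), (32, 64, 16, 32), (32, 64, 8, 16), (32, 64, 4, 8)]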
Step 104: extracting a context feature of the first image using a context encoder; the structure of the context encoder is the same as the structure of the motion feature extraction network.
The context encoder in this step and the motion feature extraction network of step 102 form two parallel branches. The context encoder combines a stack of successive convolutions to extract context features from the first image I_1 of the image sequence; its input is the first image I_1 and its output is the context feature F_C of the first image.
Specifically, the context encoder is divided into 3 stages (Stage 1, Stage 2 and Stage 3), at 1/2, 1/4 and 1/8 resolution respectively. The input first passes through the 7×7 first convolution layer of Stage 1 for downsampling and feature extraction, then through the two successively stacked 3×3 convolution residual blocks of Stage 2 and Stage 3, each downsampling the image by a further factor of 2; finally a 1×1 second convolution layer adjusts the channel number and outputs the context feature F_C of the first image. The calculation formula is as follows:
F_C = Conv_1×1(ConvBlock_3×3(ConvBlock_3×3(Conv_7×7(I_1))))
step 105: and based on the matching cost volume and the context characteristics, adopting a global-local loop optical flow decoder to carry out loop iteration solution to obtain an optical flow field of the target image.
Referring to fig. 4, the global-local loop optical flow decoder includes: a local motion information encoder, a global motion information encoder and a global-local motion information decoder connected in sequence; the output of the global-local motion information decoder is connected to the input of the local motion information encoder.
The various parts of the global-local loop optical flow decoder described above are described in further detail below in conjunction with fig. 4.
(1) A local motion information encoder.
The local motion information encoder includes: a depth separable residual block and a multi-layer perceptron block (Multilayer Perceptron, MLP) connected in sequence; the global motion information encoder includes: the depth separable convolution module and the multi-head attention module are connected in sequence.
The depth separable residual block specifically comprises: a first depth separable convolution layer, a first activation function, a second depth separable convolution layer, and a second activation function connected in sequence. The convolution kernel size of the first depth separable convolution layer is 7×7; the second depth separable convolution layer is densely connected, with a convolution kernel size of 15×15; the first activation function and the second activation function are both GELU activation functions. In this embodiment the second depth separable convolution layer specifically comprises: a seventh convolution layer with a 1×1 kernel, an eighth convolution layer with a 15×15 kernel, and a ninth convolution layer with a 1×1 kernel, connected in sequence; the features are activated with a GELU activation function after each convolution, and a residual connection operation is applied to the features after each convolution.
The multi-layer perceptron block specifically comprises: the fifth convolution layer, the third depth separable convolution layer, the third activation function, and the sixth convolution layer are connected in sequence. The convolution kernel sizes of the fifth convolution layer and the sixth convolution layer are 1×1; the convolution kernel size of the third depth separable convolution layer is 3×3; the third activation function is a GELU activation function.
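A minimal sketch of these two building blocks follows. The 1×1 / depthwise-15×15 / 1×1 composition of the second layer follows the description above; the hidden width of the perceptron block and the exact placement of the residual additions are assumptions.

import torch
import torch.nn as nn

class DepthSeparableResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dw7 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)    # first depth separable layer, 7x7
        self.pw_in = nn.Conv2d(dim, dim, 1)                         # seventh layer, 1x1
        self.dw15 = nn.Conv2d(dim, dim, 15, padding=7, groups=dim)  # eighth layer, depthwise 15x15
        self.pw_out = nn.Conv2d(dim, dim, 1)                        # ninth layer, 1x1
        self.act = nn.GELU()

    def forward(self, x):
        x = x + self.act(self.dw7(x))         # residual connection after each convolution stage
        y = self.act(self.pw_in(x))
        y = y + self.act(self.dw15(y))
        return x + self.act(self.pw_out(y))

class MLPBlock(nn.Module):
    """Fifth 1x1 conv -> third depthwise 3x3 conv -> GELU -> sixth 1x1 conv."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        return x + self.net(x)

x = torch.randn(1, 128, 32, 64)
y = MLPBlock(128, 256)(DepthSeparableResidualBlock(128)(x))  # the "DLP" unit; shape preserved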
The local motion information encoder is used for encoding according to the matching cost volume and the residual optical flow of the last iteration to obtain local motion features. Specifically, its input consists of the motion information in the matching cost volume together with the current optical flow field; these features are encoded in a loop-iterative fashion, and the encoder finally outputs the local motion features F_l of the current iteration. The calculation formula is as follows:
F_f = Conv_1×1(Δf),    F_cost = Conv_1×1(Lookup(Cost, f)),    F_m = DLP(Cat(F_cost, F_f))
where Δf denotes the residual optical flow of each iteration; Lookup(Cost, f) denotes retrieving matching costs from the cost volume at the positions given by the current optical flow f (a notation introduced here for readability); DLP(·) denotes the depth separable MLP block, DLP(·) = MLP(RDSCBlocks(·)); Cat(·) denotes the feature map concatenation operation, i.e. splicing several feature maps of the same resolution along the channel dimension; Conv_1×1(·) denotes feature extraction of the input with a 1×1 convolution layer; F_f denotes the features extracted from the optical flow of the current iteration; F_cost denotes the local motion features retrieved from the matching cost volume by the optical flow of the current iteration; F_m denotes the local motion features enhanced by the DLP block; and F_l denotes the local motion features of the current iteration. RDSCBlocks(·) denotes the depth separable residual block; MLP(·) denotes the multi-layer perceptron block implemented by convolution; DwConv_7×7(x_2) denotes feature extraction of the input x_2 with the first depth separable convolution layer of kernel size 7×7; DwConv_15×15,d(·) denotes feature extraction with the densely connected second depth separable convolution layer of kernel size 15×15; GELU(·) denotes the GELU activation function; Conv_1×1(x_3) denotes feature extraction of the input x_3 with the fifth convolution layer of kernel size 1×1; and DwConv_3×3(·) denotes feature extraction with the third depth separable convolution layer of kernel size 3×3, so that (with residual connections)
RDSCBlocks(x_2) = GELU(DwConv_15×15,d(GELU(DwConv_7×7(x_2)))),    MLP(x_3) = Conv_1×1(GELU(DwConv_3×3(Conv_1×1(x_3))))
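Continuing the previous sketch, a possible forward pass of the local motion information encoder follows: features of the current residual flow and of the cost values looked up by that flow are concatenated and enhanced by the DLP unit. The channel split and the lookup-channel count are assumptions.

import torch
import torch.nn as nn

class LocalMotionEncoder(nn.Module):
    def __init__(self, cost_ch, dim=128):
        super().__init__()
        # e.g. cost_ch = levels * (2 * radius + 1) ** 2 for a windowed lookup (assumption)
        self.flow_conv = nn.Conv2d(2, dim // 2, 1)        # F_f = Conv_1x1(residual flow)
        self.cost_conv = nn.Conv2d(cost_ch, dim // 2, 1)  # F_cost = Conv_1x1(cost lookup)
        self.dlp = nn.Sequential(DepthSeparableResidualBlock(dim), MLPBlock(dim, 2 * dim))

    def forward(self, flow, cost_lookup):
        f = torch.cat([self.cost_conv(cost_lookup), self.flow_conv(flow)], dim=1)
        return self.dlp(f)  # F_m = DLP(Cat(F_cost, F_f)): local motion features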
(2) Global motion information encoder.
The global motion information encoder includes: the depth separable convolution module and the multi-head attention module are connected in sequence.
The global motion information encoder is used for encoding according to the local motion features and the context features to obtain global motion information. Specifically, a depth separable convolution module and a multi-head attention module with local position coding are constructed; the local motion features F_l output by the local motion information encoder and the context feature F_C output by the context encoder are fed into the global motion information encoder, which finally outputs the global motion information F_g. The calculation formula is as follows:
F_g(i,j) = F_l(i,j) + γ Σ_{u,v} f(q_c(i,j), k_c(u,v)) · v_m(u,v)
The global motion information encoder encodes the context feature F_C to obtain the query vector q_c and the key vector k_c. Here i and j denote the abscissa and ordinate of a point in the query vector, and u and v denote the abscissa and ordinate of a point in the key vector; q_c(i,j) and k_c(u,v) denote the feature values of points in the query and key vectors; v_m denotes the value vector constructed from the local motion features F_l, and v_m(u,v) denotes the feature value of a point in the value vector; γ is a learnable factor; f(·) denotes a point-by-point attention function; and F_l(i,j) denotes the feature value of a point in the local motion features.
(3) Global-local motion information decoder.
The structure of the global-local motion information decoder is the same as that of the local motion information encoder, and will not be described again.
The global-local motion information decoder is used for decoding according to the local motion features, the global motion information and the context features to obtain the residual optical flow of the current iteration; the residual optical flow of the last iteration is used to determine the optical flow field of the target image.
Specifically, the input of the global-local motion information decoder is the global and local motion information formed by aggregating the local motion features output by the local motion information encoder, the global motion information output by the global motion information encoder, and the hidden-state features from the context encoder; the output is the residual optical flow Δf of the current iteration. The calculation formula is as follows:
F_a = Cat(F_l, F_g, F_C),    F_gl^t = DLP(Cat(F_a, F_gl^(t-1))),    Δf = Conv(F_gl^t)
where F_a denotes the aggregated features obtained from the global motion information, the local motion features and the context features; F_gl^t denotes the global-local motion features of the current iteration; F_gl^(t-1) denotes the global-local motion features of the last iteration; and Conv(·) denotes the convolution head that maps the decoded features to the 2-channel residual optical flow (a notation introduced here for readability).
The global-local loop optical flow decoder of this embodiment optimizes the optical flow field through n loop iterations, and upsamples the optical flow after the last iteration to the same resolution as the input image to obtain the final optical flow field. At the initial iteration, the initial optical flow field f is set to 0 and the residual optical flow Δf is set to 0; after each iteration the optical flow field is updated as f ← f + Δf, and the residual optical flow of the last iteration determines the final optical flow field. The final optical flow field visualization is shown in fig. 5.
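Putting the pieces together, a sketch of the decoder's outer loop is given below: starting from a zero flow field, each iteration looks up the cost pyramid at the current flow, runs the local encoder, the global aggregator and the global-local decoder to produce a residual flow, and accumulates it; the final 1/8-resolution flow is upsampled to the input resolution. `lookup`, `decoder` and the iteration count are stand-ins for the modules sketched above and are assumptions, not the patent's exact interfaces.

import torch

def solve_flow(pyramid, f_c, local_enc, global_enc, decoder, lookup, n_iters=12, scale=8):
    b, _, h, w = f_c.shape
    flow = torch.zeros(b, 2, h, w, device=f_c.device)  # initial optical flow field f = 0
    for _ in range(n_iters):
        f_l = local_enc(flow, lookup(pyramid, flow))        # local motion features
        f_g = global_enc(f_c, f_l)                          # global motion information
        delta = decoder(torch.cat([f_l, f_g, f_c], dim=1))  # residual flow of this iteration
        flow = flow + delta                                 # f <- f + delta_f
    # Upsample the final 1/8-resolution flow to the input resolution.
    return scale * torch.nn.functional.interpolate(
        flow, scale_factor=scale, mode="bilinear", align_corners=False)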
This embodiment uses the local modeling capability of the large-kernel depth separable convolution residual block and the long-range modeling capability of the locally position-encoded Transformer to improve the capture of motion features and model the motion relations among more pixels, thereby improving optical flow estimation accuracy in large-displacement and weak-texture image regions, reducing errors caused by purely local information, and ensuring the reliability and robustness of optical flow estimation.
Example two
In order to perform a corresponding method of the above embodiment to achieve the corresponding functions and technical effects, an optical flow computing system is provided below.
Referring to fig. 6, the system includes:
an image acquisition module 601, configured to acquire a target image; the target image includes: a first image and a second image; the first image and the second image are two consecutive frames of images.
A motion feature extraction module 602, configured to extract a motion feature of the target image using a motion feature extraction network; the motion feature extraction network includes a plurality of different sized convolutional layers.
The matching cost calculation module 603 is configured to determine a feature map of the first image and a feature map of the second image according to the motion feature and the feature extraction channel number of the target image, and calculate a matching cost volume of the feature map of the first image and the feature map of the second image.
A context feature extraction module 604 for extracting context features of the first image using a context encoder; the structure of the context encoder is the same as the structure of the motion feature extraction network.
And the optical flow field solving module 605 is configured to perform loop iteration solving by using a global-local loop optical flow decoder based on the matching cost volume and the context feature, so as to obtain an optical flow field of the target image.
Wherein the global-local loop optical flow decoder comprises: a local motion information encoder, a global motion information encoder and a global-local motion information decoder connected in sequence; the output of the global-local motion information decoder is connected to the input of the local motion information encoder.
The local motion information encoder and the global-local motion information decoder each include: the depth separable residual block and the multi-layer perceptron block are connected in sequence; the global motion information encoder includes: the depth separable convolution module and the multi-head attention module are connected in sequence.
The local motion information encoder is used for encoding according to the matching cost volume and the residual light stream of the last iteration to obtain local motion characteristics. The global motion information encoder is used for encoding according to the local motion characteristics and the context characteristics to obtain global motion information. The global-local motion information decoder is used for decoding according to the local motion characteristics, the global motion information and the context characteristics to obtain a residual light stream of the current iteration; the residual optical flow of the last iteration is used to determine the optical flow field of the target image.
Example III
The present embodiment provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program, and the processor runs the computer program to enable the electronic device to execute the optical flow calculation method of the first embodiment.
Alternatively, the electronic device may be a server.
In addition, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program, which when executed by a processor, implements the optical flow calculation method of the first embodiment.
In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; for identical or similar parts, reference may be made between the embodiments. Since the disclosed system corresponds to the disclosed method, its description is relatively brief, and the relevant points can be found in the description of the method.
The principles and embodiments of the present invention are described herein with specific examples, which are intended only to help in understanding the method of the present invention and its core ideas. At the same time, those of ordinary skill in the art may, in light of the ideas of the present invention, make modifications to the specific embodiments and their scope of application. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (7)

1. An optical flow calculation method, comprising:
acquiring a target image; the target image includes: a first image and a second image; the first image and the second image are two continuous frames of images;
extracting the motion characteristics of the target image by adopting a motion characteristic extraction network; the motion feature extraction network comprises a plurality of convolution layers with different sizes;
determining a feature map of the first image and a feature map of the second image according to the motion feature and the feature extraction channel number of the target image, and calculating the matching cost volume of the feature map of the first image and the feature map of the second image;
extracting a context feature of the first image using a context encoder; the structure of the context encoder is the same as that of the motion feature extraction network;
based on the matching cost volume and the context characteristics, adopting a global-local loop optical flow decoder to carry out loop iteration solution to obtain an optical flow field of the target image;
wherein the global-local loop optical flow decoder comprises: a local motion information encoder, a global motion information encoder and a global-local motion information decoder connected in sequence; the output of the global-local motion information decoder is connected to the input of the local motion information encoder;
the local motion information encoder and the global-local motion information decoder each include: the depth separable residual block and the multi-layer perceptron block are connected in sequence; the global motion information encoder includes: a depth separable convolution module and a multi-head attention module which are connected in sequence;
the local motion information encoder is used for encoding according to the matching cost volume and the residual optical flow of the last iteration to obtain local motion characteristics;
the global motion information encoder is used for encoding according to the local motion characteristics and the context characteristics to obtain global motion information;
the global-local motion information decoder is used for decoding according to the local motion characteristics, the global motion information and the context characteristics to obtain the residual optical flow of the current iteration; the residual optical flow of the last iteration is used to determine the optical flow field of the target image;
according to the motion characteristics and the characteristic extraction channel number of the target image, determining the characteristic diagram of the first image and the characteristic diagram of the second image, and calculating the matching cost volume of the characteristic diagram of the first image and the characteristic diagram of the second image, wherein the method specifically comprises the following steps:
determining the first half of the motion characteristics of the target image as a characteristic diagram of the first image, and determining the second half of the motion characteristics of the target image as a characteristic diagram of the second image;
performing dot product similarity operation on the feature images of the first image and the feature images of the second image to obtain matching cost information of the feature images of the first image and the feature images of the second image;
and downsampling the matching cost information by adopting pooling operation to obtain the matching cost volumes of the feature map of the first image and the feature map of the second image.
2. The optical flow computing method according to claim 1, characterized in that the motion feature extraction network specifically comprises: the first convolution layer, the convolution residual block and the second convolution layer are sequentially connected;
the convolution kernel size of the first convolution layer is 7×7; the convolution residual block includes: the third convolution layer and the fourth convolution layer are sequentially connected; the size of the convolution kernel of the second convolution layer is 1×1; the size of the convolution kernel of the third convolution layer is 3 multiplied by 3, and the step length is 2; the size of the convolution kernel of the fourth convolution layer is 3×3, and the step size is 1.
3. The optical flow computing method according to claim 1, characterized in that the depth separable residual block comprises: the first depth separable convolution layer, the first activation function, the second depth separable convolution layer and the second activation function are connected in sequence;
the convolution kernel size of the first depth separable convolution layer is 7 x 7; the second depth separable convolution layers are connected in a dense manner, and the convolution kernel size is 15 multiplied by 15; the first activation function and the second activation function are both GELU activation functions.
4. The optical flow computing method according to claim 1, wherein the multi-layer perceptron block specifically comprises: the fifth convolution layer, the third depth separable convolution layer, the third activation function and the sixth convolution layer are sequentially connected;
the convolution kernel sizes of the fifth convolution layer and the sixth convolution layer are 1×1; the convolution kernel size of the third depth separable convolution layer is 3×3; the third activation function is a GELU activation function.
5. An optical flow computing system, comprising:
the image acquisition module is used for acquiring a target image; the target image includes: a first image and a second image; the first image and the second image are two continuous frames of images;
the motion feature extraction module is used for extracting the motion features of the target image by adopting a motion feature extraction network; the motion feature extraction network comprises a plurality of convolution layers with different sizes;
the matching cost calculation module is used for determining a feature map of the first image and a feature map of the second image according to the motion feature and the feature extraction channel number of the target image, and calculating the matching cost volume of the feature map of the first image and the feature map of the second image;
a context feature extraction module for extracting context features of the first image using a context encoder; the structure of the context encoder is the same as that of the motion feature extraction network;
the optical flow field solving module is used for carrying out loop iteration solving by adopting a global-local loop optical flow decoder based on the matching cost volume and the context characteristics to obtain an optical flow field of the target image;
wherein the global-local loop optical flow decoder comprises: a local motion information encoder, a global motion information encoder and a global-local motion information decoder connected in sequence; the output of the global-local motion information decoder is connected to the input of the local motion information encoder;
the local motion information encoder and the global-local motion information decoder each include: the depth separable residual block and the multi-layer perceptron block are connected in sequence; the global motion information encoder includes: a depth separable convolution module and a multi-head attention module which are connected in sequence;
the local motion information encoder is used for encoding according to the matching cost volume and the residual optical flow of the last iteration to obtain local motion characteristics;
the global motion information encoder is used for encoding according to the local motion characteristics and the context characteristics to obtain global motion information;
the global-local motion information decoder is used for decoding according to the local motion characteristics, the global motion information and the context characteristics to obtain the residual optical flow of the current iteration; the residual optical flow of the last iteration is used to determine the optical flow field of the target image;
according to the motion characteristics and the characteristic extraction channel number of the target image, determining the characteristic diagram of the first image and the characteristic diagram of the second image, and calculating the matching cost volume of the characteristic diagram of the first image and the characteristic diagram of the second image, wherein the method specifically comprises the following steps:
determining the first half of the motion characteristics of the target image as a characteristic diagram of the first image, and determining the second half of the motion characteristics of the target image as a characteristic diagram of the second image;
performing dot product similarity operation on the feature images of the first image and the feature images of the second image to obtain matching cost information of the feature images of the first image and the feature images of the second image;
and downsampling the matching cost information by adopting pooling operation to obtain the matching cost volumes of the feature map of the first image and the feature map of the second image.
6. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the optical flow calculation method of any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the optical flow calculation method according to any one of claims 1 to 4.
CN202310735464.6A 2023-06-21 2023-06-21 Optical flow calculation method, system, equipment and medium Active CN116486107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310735464.6A CN116486107B (en) 2023-06-21 2023-06-21 Optical flow calculation method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310735464.6A CN116486107B (en) 2023-06-21 2023-06-21 Optical flow calculation method, system, equipment and medium

Publications (2)

Publication Number Publication Date
CN116486107A CN116486107A (en) 2023-07-25
CN116486107B (en) 2023-09-05

Family

ID=87219922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310735464.6A Active CN116486107B (en) 2023-06-21 2023-06-21 Optical flow calculation method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN116486107B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118381927B (en) * 2024-06-24 2024-08-23 杭州宇泛智能科技股份有限公司 Dynamic point cloud compression method, system, storage medium and device based on multi-mode bidirectional circulating scene flow


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11620328B2 (en) * 2020-06-22 2023-04-04 International Business Machines Corporation Speech to media translation

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101713986B1 (en) * 2016-02-17 2017-03-08 한국항공대학교산학협력단 Optical flow estimator for moving object detection and method thereof
CN106973293A (en) * 2017-04-21 2017-07-21 中国科学技术大学 The light field image coding method predicted based on parallax
WO2021035807A1 (en) * 2019-08-23 2021-03-04 深圳大学 Target tracking method and device fusing optical flow information and siamese framework
WO2021201438A1 (en) * 2020-04-01 2021-10-07 Samsung Electronics Co., Ltd. System and method for motion warping using multi-exposure frames
CN111626308A (en) * 2020-04-22 2020-09-04 上海交通大学 Real-time optical flow estimation method based on lightweight convolutional neural network
CN112686952A (en) * 2020-12-10 2021-04-20 中国科学院深圳先进技术研究院 Image optical flow computing system, method and application
CN113554039A (en) * 2021-07-27 2021-10-26 广东工业大学 Method and system for generating optical flow graph of dynamic image based on multi-attention machine system
CN114299105A (en) * 2021-08-04 2022-04-08 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
WO2023056730A1 (en) * 2021-10-09 2023-04-13 深圳市中兴微电子技术有限公司 Video image augmentation method, network training method, electronic device and storage medium
CN114913196A (en) * 2021-12-28 2022-08-16 天翼数字生活科技有限公司 Attention-based dense optical flow calculation method
CN114677412A (en) * 2022-03-18 2022-06-28 苏州大学 Method, device and equipment for estimating optical flow
CN114565880A (en) * 2022-04-28 2022-05-31 武汉大学 Method, system and equipment for detecting counterfeit video based on optical flow tracking
CN115018888A (en) * 2022-07-04 2022-09-06 东南大学 Optical flow unsupervised estimation method based on Transformer
CN115170826A (en) * 2022-07-08 2022-10-11 杭州电子科技大学 Local search-based fast optical flow estimation method for small moving target and storage medium
CN115272423A (en) * 2022-09-19 2022-11-01 深圳比特微电子科技有限公司 Method and device for training optical flow estimation model and readable storage medium
CN115690170A (en) * 2022-10-08 2023-02-03 苏州大学 Method and system for self-adaptive optical flow estimation aiming at different-scale targets
CN115731263A (en) * 2022-10-28 2023-03-03 苏州工业园区服务外包职业学院 Optical flow calculation method, system, device and medium fusing shift window attention
CN115830090A (en) * 2022-12-01 2023-03-21 大连理工大学 Self-supervision monocular depth prediction training method for predicting camera attitude based on pixel matching
CN115861384A (en) * 2023-02-27 2023-03-28 广东工业大学 Optical flow estimation method and system based on generation of countermeasure and attention mechanism
CN116091793A (en) * 2023-02-27 2023-05-09 南京邮电大学 Light field significance detection method based on optical flow fusion
CN116205953A (en) * 2023-04-12 2023-06-02 华中科技大学 Optical flow estimation method and device based on hierarchical total-correlation cost body aggregation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SS-SF: Piecewise 3D Scene Flow Estimation With Semantic Segmentation; Cheng Feng et al.; IEEE Access; full text *

Also Published As

Publication number Publication date
CN116486107A (en) 2023-07-25

Similar Documents

Publication Publication Date Title
US20220114750A1 (en) Map constructing method, positioning method and wireless communication terminal
CN111985343A (en) Method for constructing behavior recognition deep network model and behavior recognition method
CN109389667B (en) High-efficiency global illumination drawing method based on deep learning
TWI791405B (en) Method for depth estimation for variable focus camera, computer system and computer-readable storage medium
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN114677412B (en) Optical flow estimation method, device and equipment
CN116486107B (en) Optical flow calculation method, system, equipment and medium
CN113850900B (en) Method and system for recovering depth map based on image and geometric clues in three-dimensional reconstruction
CN112084849A (en) Image recognition method and device
CN115761594B (en) Optical flow calculation method based on global and local coupling
CN114723787A (en) Optical flow calculation method and system
CN111612825A (en) Image sequence motion occlusion detection method based on optical flow and multi-scale context
CN111294614B (en) Method and apparatus for digital image, audio or video data processing
CN114708436B (en) Training method of semantic segmentation model, semantic segmentation method, semantic segmentation device and semantic segmentation medium
CN103208109B (en) A kind of unreal structure method of face embedded based on local restriction iteration neighborhood
CN111738092A (en) Method for recovering shielded human body posture sequence based on deep learning
CN115861384B (en) Optical flow estimation method and system based on countermeasure and attention mechanism generation
CN115035173B (en) Monocular depth estimation method and system based on inter-frame correlation
CN116416649A (en) Video pedestrian re-identification method based on multi-scale resolution alignment
CN113222016B (en) Change detection method and device based on cross enhancement of high-level and low-level features
CN114399648A (en) Behavior recognition method and apparatus, storage medium, and electronic device
CN113239771A (en) Attitude estimation method, system and application thereof
CN115661929B (en) Time sequence feature coding method and device, electronic equipment and storage medium
CN115082295B (en) Image editing method and device based on self-attention mechanism
Wang et al. E-HANet: Event-based hybrid attention network for optical flow estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant