CN110751672B - Method and apparatus for implementing multi-scale optical flow pixel transform using dilution convolution - Google Patents


Info

Publication number
CN110751672B
CN110751672B (application CN201810815694.2A)
Authority
CN
China
Prior art keywords
layer
optical flow
optical
image frame
flow field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810815694.2A
Other languages
Chinese (zh)
Other versions
CN110751672A (en)
Inventor
刘景初
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Horizon Robotics Science and Technology Co Ltd
Original Assignee
Shenzhen Horizon Robotics Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Horizon Robotics Science and Technology Co Ltd filed Critical Shenzhen Horizon Robotics Science and Technology Co Ltd
Priority to CN201810815694.2A priority Critical patent/CN110751672B/en
Publication of CN110751672A publication Critical patent/CN110751672A/en
Application granted granted Critical
Publication of CN110751672B publication Critical patent/CN110751672B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/207: Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06T7/262: Analysis of motion using transform domain methods, e.g. Fourier domain methods
    • G06T7/277: Analysis of motion involving stochastic approaches, e.g. using Kalman filters

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and an apparatus for implementing a multi-scale optical flow pixel transform using diluted convolution. According to an embodiment, a method is provided for implementing a multi-scale optical-flow pixel transform with diluted convolution for a pyramid optical-flow estimator comprising at least two layers of optical-flow estimators. In the method, each layer of optical-flow estimator takes a known image frame and the predicted image frame generated by the previous layer as input and estimates a current-layer optical flow field having a current-layer working scale; a current-layer convolution kernel corresponding to the current-layer optical flow field is determined; the current-layer convolution kernel is diluted to obtain a current-layer diluted convolution kernel; and an optical-flow pixel transform is performed on the previous-layer predicted image frame using the current-layer diluted convolution kernel to obtain the current-layer predicted image frame. The previous-layer predicted image frame serving as the input of the first-layer optical-flow estimator is zero, and the working scales of the optical flow fields of the layers differ from one another.

Description

Method and apparatus for implementing multi-scale optical flow pixel transform using dilution convolution
Technical Field
The present invention relates generally to the field of image processing and, more particularly, to a method and apparatus for implementing a multi-scale optical flow pixel transform with diluted convolution for an optical flow field pyramid, and to a related electronic device.
Background
Video prediction can be widely applied in various fields; for example, it can be used in driving assistance, where the future driving environment is predicted from the current driving environment so that a corresponding driving strategy can be adopted in advance. One commonly used video prediction approach relies on an optical flow field, which describes the displacement vectors of corresponding pixels between adjacent image frames in an image sequence; an exemplary posterior optical flow field is shown in fig. 1A and an exemplary a priori optical flow field is shown in fig. 1B. As shown in fig. 1A, when the previous image frame 1 and the next image frame 2 are both known, the optical flow field of corresponding pixels between image frame 1 and image frame 2, called a posterior optical flow field, can be determined; the displacement vectors of a circular pixel and a triangular pixel are shown in fig. 1A as an example. As shown in fig. 1B, when image frame 1 is known but the next image frame 2 is unknown, an a priori optical flow field, which represents the possible displacement vectors of pixels and their probabilities, may be estimated from the known image frame 1, and possible image frames 2, such as the possible image frame 2 with probability A and the possible image frame 2 with probability B shown in fig. 1B, may be obtained by applying an affine transformation to the corresponding pixels according to the optical flow field. Thus, when an a priori optical flow field is predicted from several known image frames, the predicted image frame can be obtained by affine transformation of the pixels.
When estimating an optical flow field for the image sequence prediction problem, both prediction accuracy and output dynamic range must be considered. The displacement of pixels in an image may vary greatly in magnitude depending on the moving speed of the vehicle, and the displacements of different pixels within the same frame may also differ greatly from one another. To account for pixel motion at different speeds during prediction, the vector length of the corresponding optical flow field requires a large dynamic range. On the other hand, at the same relative prediction accuracy, increasing the dynamic range means a decrease in absolute accuracy, while guaranteeing absolute prediction accuracy means that the prediction dynamic range must be limited; this constitutes a fundamental contradiction that needs to be reconciled. Existing single-scale optical flow field prediction struggles to balance accuracy and dynamic range and generally preserves accuracy by sacrificing dynamic range, for example by limiting the maximum length of the optical flow field, or by mathematically re-weighting the contribution of optical flow data samples of different lengths to the final model estimate.
For the more general optical flow estimation problem, prior work has proposed optical flow field estimation based on a spatial pyramid, which balances accuracy and dynamic range better than single-scale estimation. The spatial pyramid approach decomposes the optical flow field estimation problem into a superposition of sub-problems at several spatial scales and can be regarded as a cascade of optical flow field estimators of similar relative precision: the top layer of the pyramid estimates the optical flow field at a coarse spatial scale, and each lower layer estimates a residual optical flow field at a relatively finer spatial scale on top of the coarser estimate. At each level of the spatial pyramid, the multi-scale pixel transform is realized by scaling the original image to the level's working scale with bilinear interpolation and then applying a pixel-wise affine transformation to the scaled image.
However, these methods are all posterior estimation methods for non-prediction tasks (fig. 1A) and differ substantially from the prior estimation required by prediction tasks (fig. 1B). Specifically, in posterior estimation all image frames of the image sequence are known at estimation time, so the optical flow field between them is deterministic; in an image sequence prediction task, by contrast, the estimation is a priori, i.e., the optical flow field to an unknown image frame must be predicted. Because of the uncertainty inherent in prediction, the optical flow field to an unknown image frame is random, and the probability distribution over its possible values must be estimated. How to express and estimate the probability distribution of such a random optical flow field on a spatial pyramid, how to use the resulting random optical flow field for prediction and inference, and how to perform pixel-level image transformation efficiently are problems that remain unsolved. At present there is no suitable way to apply the spatial pyramid idea to prior optical flow field estimation for prediction tasks.
On the other hand, implementing a multi-scale optical flow pyramid conventionally requires first scaling the input image to the working scale of the optical flow field at each pyramid level, then applying a pixel-wise affine transformation to the scaled image based on the optical flow field, and finally scaling the transformed image back to the original scale or to another target scale. For example, if the working scale of the optical flow field is 2 times the original scale, the input image is first reduced to half its original size, the reduced image is pixel-transformed, and the small image is finally enlarged back to the original scale by an algorithm such as bilinear interpolation. This scaling process inevitably blurs pixels, which degrades prediction accuracy and makes the prediction result harder to use in subsequent processing.
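For concreteness, the conventional pipeline just described can be sketched as follows. This is a minimal illustration only: it assumes a 2-times working scale and a single uniform integer displacement, uses scipy.ndimage.zoom in place of a full bilinear warping routine, and the function name conventional_multiscale_warp is invented for this description rather than taken from any prior-art implementation.

    import numpy as np
    from scipy.ndimage import zoom  # order=1 gives (bi)linear interpolation

    def conventional_multiscale_warp(image, displacement, scale=2):
        """Downscale, shift pixels at the working scale, then upscale back."""
        small = zoom(image, 1.0 / scale, order=1)      # scaling down blurs pixels
        dy, dx = displacement
        shifted = np.roll(np.roll(small, dy, axis=0), dx, axis=1)  # pixel transform
        return zoom(shifted, float(scale), order=1)    # scaling back up blurs again

    frame = np.random.rand(8, 8)
    out = conventional_multiscale_warp(frame, displacement=(1, 0), scale=2)
    print(out.shape)  # (8, 8): the size is restored, but two resampling steps blurred it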
Therefore, there is still a need to solve the above problem when performing optical flow field estimation in an image sequence prediction task.
Disclosure of Invention
An aspect of the present invention is to provide an image prediction method using an optical flow field that preserves the desired dynamic range while providing improved accuracy.
According to an exemplary embodiment, there is provided a method for implementing a multi-scale optical-flow pixel transform with diluted convolution for a pyramid optical-flow estimator, the pyramid optical-flow estimator comprising at least two layers of optical-flow estimators, the method comprising: each layer of optical flow estimator estimates a current layer of optical flow field by using a known image frame and a previous layer of predicted image frame generated by the previous layer of optical flow estimator as input, wherein the current layer of optical flow field has a current layer working scale; determining a current layer convolution kernel corresponding to the current layer optical flow field; diluting the current layer convolution kernel to obtain a current layer diluted convolution kernel; performing an optical flow pixel transform process on the previous layer predicted image frame using the current layer dilution convolution kernel to obtain a current layer predicted image frame, wherein the previous layer predicted image frame as an input of the first layer optical flow estimator is zero, and wherein the working scales of each layer of optical flow field are different from each other.
In some examples, the known image frame and each layer of the predicted image frame have the same image size.
In some examples, the working scale of the optical flow field of an upper layer is greater than the working scale of the optical flow field of a lower layer.
In some examples, the working scale of the optical flow field of an upper layer is twice the working scale of the optical flow field of the layer below it.
In some examples, the working scale of the bottom-most optical flow field corresponds to the image size of the known image frame, and the dilution factor of the bottom-most diluted convolution kernel is 1.
In some examples, the current layer optical-flow field estimated by each layer of the optical-flow estimator is a residual optical-flow field between the prediction target and the previous layer of the predicted image frame.
In some examples, the step of performing optical-flow pixel transform processing on the previous layer predicted image frame using the current layer diluted convolution kernel further comprises performing extended padding on a boundary of the previous layer predicted image frame.
In some examples, the extended padding includes cyclic padding, mirror padding, or fixed-value padding.
In some examples, each layer of the pyramid optical-flow estimator comprises a fully convolutional network or a recurrent convolutional network.
In some examples, the optical-flow field generated by each layer of the pyramid optical-flow estimator includes a random optical-flow field probability distribution, and the current-layer convolution kernel corresponding to each layer of the optical-flow field includes a plurality of non-zero values.
In some examples, diluting the current layer convolution kernel includes adding a forced all-zero region between rows and columns of the convolution kernel.
According to another exemplary embodiment, there is provided an apparatus for implementing a multi-scale optical-flow pixel transform using diluted convolution, comprising: a pyramid optical flow estimator comprising at least two layers of optical flow estimators; and a convolution module including a convolution kernel determining unit, a convolution kernel diluting unit and a convolution executing unit, wherein each layer of optical flow estimator of the pyramid optical flow estimator estimates a current layer optical flow field using a known image frame and a previous layer predicted image frame generated by the previous layer optical flow estimator as input, the current layer optical flow field having a current layer working scale; the convolution kernel determining unit determines a current layer convolution kernel corresponding to the current layer optical flow field; the convolution kernel diluting unit dilutes the current layer convolution kernel to obtain a current layer diluted convolution kernel; and the convolution executing unit performs an optical flow pixel transform on the previous layer predicted image frame using the current layer diluted convolution kernel to obtain the current layer predicted image frame, wherein the previous layer predicted image frame serving as the input of the first layer optical flow estimator is zero, and wherein the working scales of the optical flow fields of the layers are different from each other.
In some examples, the convolution module further includes an image expansion unit configured to expand and fill in a boundary of the previous layer predicted image frame when the previous layer predicted image frame is subjected to optical flow pixel transform processing using the current layer diluted convolution kernel.
According to another exemplary embodiment, there is provided an electronic device including: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the above method.
According to another exemplary embodiment, a vehicle is provided that includes the above-described electronic apparatus.
According to another exemplary embodiment, a computer-readable medium is provided, on which computer program instructions are stored, which computer program instructions, when executed by a processor, cause the processor to carry out the above-mentioned method.
In the embodiments of the invention, the spatial pyramid idea is applied to prior optical flow field estimation for prediction tasks, so that prediction accuracy and output dynamic range can both be taken into account and a good prediction effect is achieved. Moreover, by adopting diluted convolution, the pixel transform is carried out at the original scale of the image, the information loss caused by scaling is avoided, and improved prediction accuracy can be provided.
The above and other features and advantages of the present invention will become apparent from the following description of exemplary embodiments, which is to be read in connection with the accompanying drawings.
Drawings
FIG. 1A shows a schematic of a posterior optical flow field.
FIG. 1B shows a schematic of an a priori optical flow field.
Fig. 2 is a schematic diagram illustrating a training process of an image prediction method according to an exemplary embodiment of the present invention.
Fig. 3A and 3B illustrate exemplary convolution kernels and diluted convolution kernels, respectively.
Fig. 4 illustrates a schematic diagram of a prediction process of an image prediction method according to an exemplary embodiment of the present invention.
Fig. 5 illustrates a functional block diagram of an image prediction apparatus according to an exemplary embodiment of the present invention.
Fig. 6 illustrates a block diagram of an electronic device according to an exemplary embodiment of the present invention.
Fig. 7 shows a schematic diagram of a vehicle equipped with the electronic device of fig. 6 according to an exemplary embodiment of the present invention.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein. Note that the drawings are not to scale.
Fig. 2 illustrates a schematic diagram of the training process of an image prediction method according to an exemplary embodiment of the present invention, and fig. 4 illustrates a schematic diagram of its prediction process. The image prediction method of the present invention adopts the optical flow pyramid concept. Specifically, in the example of fig. 2, a layer 1 optical flow estimator 10a, a layer 2 optical flow estimator 10b, and a layer 3 optical flow estimator 10c are employed, and together they constitute the pyramid optical flow estimator 10. Although a 3-layer estimator is shown here, it should be understood that the pyramid optical flow estimator 10 may instead comprise, for example, 2 layers, 4 layers, or more.
In training, a known image frame 1, which may include a plurality of known image frames, is provided to the optical flow estimators 10a, 10b, and 10c of the respective layers. Each of the layer optical flow estimators 10a, 10b, and 10c may comprise a convolutional network, for example a fully convolutional network or a recurrent convolutional network. It should be noted that the known image frame 1 of original size is provided to each of the layer optical flow estimators 10a, 10b and 10c, while the current layer optical flow field 11 generated by each estimator has a different working scale, which is determined by the corresponding optical flow estimator 10. In fig. 2, the layer 3 estimator 10c, as the bottom layer, may generate an optical flow field 11c corresponding to the original size of the known image frame 1; the working scale of the optical flow field 11c may then be regarded as the base scale corresponding to the original image frame 1. The layer 2 estimator 10b generates an optical flow field 11b corresponding to half the original image size, so the working scale of the optical flow field 11b can be regarded as twice that of the optical flow field 11c. The layer 1 estimator 10a generates an optical flow field 11a corresponding to a quarter of the original image size, so its working scale can be regarded as twice that of the optical flow field 11b and four times that of the optical flow field 11c. Although the working scale of each upper layer optical flow field is shown here as twice that of the layer below, other ratios are possible. Generally, the working scale of an upper layer optical flow field is greater than that of a lower layer, but other configurations may exist: for example, the working scales of adjacent layers may be equal, so that repeated computation improves prediction accuracy, and the working scale of some upper layer may even be smaller than that of a lower layer.
Each layer optical flow estimator 10 receives the known image frame 1 of original size and also receives, as an input, the previous layer predicted image frame 2 generated by the immediately preceding layer's optical flow estimator, in order to estimate the current layer optical flow field 11. The uppermost optical flow estimator 10a does not receive a previous layer predicted image frame 2, or equivalently receives a previous layer predicted image frame 2 that is zero. The process is described in detail below.
As shown in fig. 2, the layer 1 estimator 10a receives the known image frame 1 and produces a layer-1 a priori optical flow field distribution 11a between the known frame and a future frame, having a working scale of 4 (i.e., 4 times the base scale). In some embodiments, the layer 1 estimator 10a (as well as the estimators 10b and 10c described below) may generate a random optical flow field probability distribution and then derive a deterministic a priori optical flow field 11a from it by averaging, sampling, taking the maximum-probability value, or the like, or may directly use the random optical flow field probability distribution as the a priori optical flow field 11a. A convolution kernel 12a corresponding to the layer 1 optical flow field 11a may then be determined, having the same working scale as the layer 1 optical flow field 11a, namely 4. The principle applied here is that performing a convolution operation on an image frame with a convolution kernel having only one non-zero term is equivalent to translating the image. It should be understood, however, that the convolution kernel 12 of each layer may have a plurality of non-zero terms; because convolution is linear, such a kernel is equivalent to a linear superposition of translation operations weighted by the term values and can therefore express a discrete probability distribution. It can thus be used when the a priori optical flow field 11a is a random optical flow field probability distribution, to efficiently obtain the average predicted image under that probability distribution.
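To make this principle concrete, the following toy sketch (an illustration under assumptions, not the patent's implementation; scipy.ndimage.convolve is used purely for convenience, and all names are invented here) shows that a kernel with a single non-zero term translates the image, while a kernel holding a discrete probability distribution yields the probability-weighted average of the corresponding translations.

    import numpy as np
    from scipy.ndimage import convolve

    image = np.zeros((5, 5))
    image[2, 2] = 1.0                          # a single bright pixel in the centre

    shift_kernel = np.zeros((3, 3))
    shift_kernel[0, 1] = 1.0                   # one non-zero term: a pure translation
    shifted = convolve(image, shift_kernel, mode='constant')
    print(np.argwhere(shifted == 1.0))         # the bright pixel has moved by one position

    prob_kernel = np.zeros((3, 3))
    prob_kernel[0, 1] = 0.7                    # one possible unit displacement, probability 0.7
    prob_kernel[1, 2] = 0.3                    # another possible unit displacement, probability 0.3
    expected = convolve(image, prob_kernel, mode='constant')
    print(np.sort(expected[expected > 0]))     # [0.3 0.7]: the average of the two shifted images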
The layer 1 convolution kernel 12a may then be subjected to dilution processing to obtain a layer 1 diluted convolution kernel 13a. Here the diluted convolution kernel 13a should have a working scale of 1 (i.e., the base scale), so the layer 1 convolution kernel 12a needs to be diluted by a factor of 4, i.e., forced all-zero regions are added between the rows and columns of the convolution kernel 12a; fig. 3A and 3B show examples of this dilution.
Fig. 3A shows an example 3 x 3 convolution kernel which, as will be appreciated, can shift a pixel on the image by at most one pixel in each dimension. If the 3 x 3 convolution kernel is diluted by a factor of 2 so that it can move a pixel by two pixels, a row or column of forced zeros (i.e., a region in which every value is zero) is added between each pair of adjacent rows and columns of the 3 x 3 kernel, resulting in the 5 x 5 kernel shown in fig. 3B. Although not shown, it should be understood that if the 3 x 3 kernel is diluted by a factor of 4 so that it can move a pixel by four pixels, three rows or three columns of forced zeros are added between adjacent rows and columns, resulting in a 9 x 9 kernel. Similarly, a 5 x 5 kernel can shift a pixel by two pixels; if the 5 x 5 kernel is diluted by a factor of 2 so that it can shift a pixel by four pixels, one row or column of forced zeros is added between adjacent rows and columns, and if it is diluted by a factor of 4 so that it can shift a pixel by eight pixels, three rows or three columns of forced zeros are added between adjacent rows and columns. Further examples are omitted here.
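The dilution operation described above can be written compactly; the following is a minimal NumPy sketch (the function name dilute_kernel is illustrative, not part of the patent) that inserts the forced all-zero rows and columns between the entries of a kernel.

    import numpy as np

    def dilute_kernel(kernel, factor):
        """Insert (factor - 1) all-zero rows/columns between the kernel's entries."""
        if factor == 1:
            return kernel.copy()               # base scale: the kernel is left unchanged
        rows, cols = kernel.shape
        diluted = np.zeros((factor * (rows - 1) + 1, factor * (cols - 1) + 1))
        diluted[::factor, ::factor] = kernel   # original entries land on a sparse grid
        return diluted

    k = np.arange(9, dtype=float).reshape(3, 3)
    print(dilute_kernel(k, 2).shape)           # (5, 5), as in fig. 3B
    print(dilute_kernel(k, 4).shape)           # (9, 9), as described above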
Then the known image frame 1 of original size is subjected to an optical flow pixel transform (warping) using the layer 1 diluted convolution kernel 13a, resulting in the layer 1 predicted image frame 2a. It will be appreciated that, since the working scale of the diluted convolution kernel 13a is the base scale and the known image frame 1 has the original size, the image size of the layer 1 predicted image frame 2a also equals the original image size of the known frame 1. When performing the optical flow pixel transform with the layer 1 diluted convolution kernel 13a, the boundary of the known image frame 1 may be extended and padded so that pixels moved beyond the image boundary are not lost. Typical padding schemes include cyclic padding, mirror padding, fixed-value padding (e.g., zero or another background pixel value), and the like.
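A sketch of this warping step is given below. Mapping the padding schemes onto scipy.ndimage.convolve modes ('wrap' for cyclic padding, 'reflect' for mirror padding, 'constant' for fixed-value padding) is an assumption made for illustration, as are the function and variable names.

    import numpy as np
    from scipy.ndimage import convolve

    def warp_with_diluted_kernel(frame, diluted_kernel, padding='reflect', fill_value=0.0):
        """Optical flow pixel transform performed at the original image size."""
        return convolve(frame, diluted_kernel, mode=padding, cval=fill_value)

    frame = np.random.rand(16, 16)             # known image frame at its original size
    diluted = np.zeros((9, 9))
    diluted[0, 4] = 1.0                        # a one-term 3 x 3 kernel diluted by a factor of 4
    predicted = warp_with_diluted_kernel(frame, diluted, padding='wrap')
    print(predicted.shape)                     # (16, 16): same size as the known frame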
The layer 1 predicted image frame 2a may then be compared with the ground truth of the predicted frame, i.e., the next frame 3, and the error between the two computed. It will be appreciated that the image size of the next frame 3 may also equal the original size of the known frame 1. The layer 1 estimator 10a is optimized using the error between the layer 1 predicted image frame 2a and the next frame 3 as the training cost.
Similarly, the layer 2 estimator 10b receives the known image frame 1 and also receives the layer 1 predicted image frame 2a generated by the previous layer optical flow estimator 10a, and generates the layer-2 a priori optical flow field 11b. Here both the known image frame 1 and the layer 1 predicted image frame 2a have the same original size, while the working scale of the layer-2 a priori optical flow field 11b may be 2 times the base scale, i.e., corresponding to half the original image size. Layer 2 can thus perform a priori optical flow field estimation on a finer scale than layer 1, whose working scale is 4. The layer 2 estimator 10b generates, as the layer-2 a priori optical flow field distribution 11b, the residual optical flow field between the coarse-scale optical flow field from the previous layer and the target optical flow field, that is, the optical flow field between the known image frame 1 and the prediction target frame. In other words, the layer 2 estimator 10b estimates a residual optical flow field between the prediction target and the previous layer predicted image frame. It should be understood that the layer 1 estimator 10a in fact also generates a residual optical flow field between the prediction target and the previous layer predicted image frame, except that the previous layer predicted image frame it receives is zero.
Similarly, a convolution kernel 12b corresponding to the layer-2 a priori optical flow field 11b is generated and subjected to dilution to obtain the layer 2 diluted convolution kernel 13b. Here the layer 2 convolution kernel 12b only needs to be diluted by a factor of 2 for the working scale of the layer 2 diluted convolution kernel 13b to reach the base scale. Then the layer 2 diluted convolution kernel 13b is used to perform an optical flow pixel transform on the layer 1 predicted image frame 2a to generate the layer 2 predicted image frame 2b, whose image size also equals the original image size of the known frame 1. The layer 2 estimator 10b may then be optimized by comparing the layer 2 predicted image frame 2b with the ground truth of the predicted frame, i.e., the next frame 3, and using the error between the two as the training cost.
Similarly to layer 2, the layer 3 estimator 10c generates the layer-3 a priori optical flow field 11c using the known image frame 1 and the previous layer predicted image frame 2b. In the example of fig. 2, layer 3 is the bottommost, or base, layer, and the working scale of the layer 3 optical flow field 11c may be the base scale. Therefore, after the layer 3 convolution kernel 12c corresponding to the layer 3 optical flow field 11c is determined, its dilution factor is 1, i.e., the diluted convolution kernel 13c is the same as the convolution kernel 12c; alternatively, the layer 3 convolution kernel 12c may simply not be diluted. The layer 3 diluted convolution kernel 13c is then used to perform an optical flow pixel transform on the layer 2 predicted image frame 2b, resulting in the layer 3 predicted image frame 2c, whose image size also equals the original image size of the known frame 1. The layer 3 estimator 10c may then be optimized by comparing the layer 3 predicted image frame 2c with the ground truth of the predicted frame, i.e., the next frame 3, and using the error between the two as the training cost.
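The per-layer training cost described above can be sketched as follows. The choice of a mean squared error is an assumption made for illustration; the description only requires an error between each layer's predicted frame and the true next frame, and the function names are invented here.

    import numpy as np

    def layer_training_cost(predicted_frame, next_frame):
        """Error between one layer's predicted frame and the ground-truth next frame."""
        return float(np.mean((predicted_frame - next_frame) ** 2))

    def pyramid_training_costs(predicted_frames, next_frame):
        """One cost term per layer; each estimator is optimized against its own term."""
        return [layer_training_cost(p, next_frame) for p in predicted_frames]

    predicted = [np.random.rand(4, 4) for _ in range(3)]   # stand-ins for frames 2a, 2b, 2c
    truth = np.random.rand(4, 4)                           # stand-in for the next frame 3
    print(pyramid_training_costs(predicted, truth))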
As described above, the optical flow distributions 11a, 11b, and 11c are optical flow distributions at different scales, and together they constitute the optical flow pyramid 11. By using diluted convolution, the optical flow transform at each scale is carried out at the original image size, so the predicted image frames 2a, 2b and 2c have the same original size as the known frame 1; the loss of pixel information and the reduction of prediction accuracy caused by image scaling are avoided, and a more accurate prediction model is obtained during training.
While an unsupervised training mode is described above in connection with fig. 2, it should be understood that supervised training or other modes of unsupervised training may also be performed within the framework shown in fig. 2. For example, as an instance of supervised training, the training data set may include ground-truth optical flow fields between image frames; a subset of the image frames can then be treated as known frames, the remaining subsequent image frames as prediction target frames, and the corresponding optical flow field data used as the supervisory signal for the pyramid's output given the known frames. Another unsupervised training approach may obtain a posterior estimate of the optical flow field between the known frame and the target frame by a posterior optical flow estimation method, and train the estimators to generate the prior optical flow field in a similar supervised manner using this posterior estimate as the supervisory signal. Of course, other training modes are possible.
When training is complete, the trained estimators can be used to perform image prediction; the prediction process is described below with reference to fig. 4. In brief, and similarly to the training process shown in fig. 2, each layer of optical flow estimator estimates a current layer optical flow field using the known image frame and the previous layer predicted image frame generated by the previous layer optical flow estimator as inputs, and a convolution kernel corresponding to the current layer optical flow field is determined. The convolution kernel is then diluted so that the working scale of the diluted kernel is the base scale, and the diluted kernel is used to perform an optical flow pixel transform on the previous layer predicted image frame, producing a current layer predicted image frame with the same original image size as the known frame. The current layer predicted image frame generated by the base layer is output as the final prediction result. The previous layer predicted image frame used as the input of the first layer optical flow estimator is zero, and the first layer diluted convolution kernel performs the optical flow pixel transform on the known image frame to obtain the first layer predicted image frame.
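The per-layer loop just summarized can be sketched end to end as follows. Everything in the sketch is illustrative: each estimator is a stand-in assumed to return the convolution kernel derived from its optical flow field together with its working scale, mirror padding is chosen arbitrarily, and none of the names are taken from the patent.

    import numpy as np
    from scipy.ndimage import convolve

    def dilute(kernel, factor):
        """Insert forced all-zero rows/columns so the kernel acts at the base scale."""
        if factor == 1:
            return kernel
        r, c = kernel.shape
        out = np.zeros((factor * (r - 1) + 1, factor * (c - 1) + 1))
        out[::factor, ::factor] = kernel
        return out

    def pyramid_predict(known_frame, estimators):
        """Run the pyramid from the top (coarsest) layer down to the base layer."""
        prev_prediction = None                         # the layer-1 previous prediction is taken as zero
        for estimator in estimators:
            zeros = np.zeros_like(known_frame)
            prev_input = prev_prediction if prev_prediction is not None else zeros
            kernel, scale = estimator(known_frame, prev_input)
            diluted = dilute(kernel, scale)            # bring the kernel to the base scale
            target = known_frame if prev_prediction is None else prev_prediction
            prev_prediction = convolve(target, diluted, mode='reflect')   # pixel transform
        return prev_prediction                         # the base-layer output is the final prediction

    # Toy usage: two layers whose "estimators" ignore their inputs and return fixed kernels.
    shift_kernel = np.zeros((3, 3))
    shift_kernel[0, 1] = 1.0                           # move every pixel by one kernel step
    identity_kernel = np.zeros((3, 3))
    identity_kernel[1, 1] = 1.0                        # leave the image unchanged
    layers = [lambda known, prev: (shift_kernel, 2),     # coarse layer, working scale 2
              lambda known, prev: (identity_kernel, 1)]  # base layer, working scale 1
    print(pyramid_predict(np.random.rand(8, 8), layers).shape)   # (8, 8): original image size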
Referring to fig. 4, the known image frame 1 is provided to a layer 1 estimator 10a, which generates a layer 1 optical flow field 11a having a working dimension of 4. A convolution kernel 12a corresponding to the layer 1 optical flow field 11a is determined and then the convolution kernel 12a is diluted by a factor of 4 to obtain a diluted convolution kernel 13a having a working scale of 1. The known frame 1 is processed with a dilution convolution kernel 13a to obtain a layer 1 predicted image frame 2a having the same image size as the known frame 1.
The known image frame 1 is also provided to a layer 2 estimator 10b, which also receives the previous layer predicted image frame 2a, and generates a residual optical flow field between the predicted target and the previous layer predicted image frame 2a as a layer 2 optical flow field 11b, which has a working scale of 2. A convolution kernel 12b corresponding to the layer 2 optical flow field 11b is determined and then the convolution kernel 12b is diluted by a factor of 2 to obtain a diluted convolution kernel 13b having a working scale of 1. The previous layer predicted image frame 2a is processed with a dilution convolution kernel 13b to obtain a layer 2 predicted image frame 2b having the same image size as the known frame 1.
The known image frame 1 is also provided to the layer 3 estimator 10c, which also receives the previous layer predicted image frame 2b and generates a residual optical flow field between the prediction target and the previous layer predicted image frame 2b as the layer 3 optical flow field 11c, which has a working scale of 1. The convolution kernel 12c corresponding to the layer 3 optical flow field 11c is determined and then diluted by a factor of 1 (i.e., left undiluted) to obtain the diluted convolution kernel 13c with a working scale of 1. The previous layer predicted image frame 2b is processed using the diluted convolution kernel 13c to obtain the layer 3 predicted image frame 2c, which has the same image size as the known frame 1. Since layer 3 is the base layer, the layer 3 predicted image frame 2c is output as the final prediction result. In some embodiments, the layer 3 predicted image frame 2c obtained as the prediction result may also be used as the known image frame 1 in order to further predict the next image frame.
Fig. 5 illustrates a functional block diagram of the image prediction apparatus 100 according to an exemplary embodiment of the present invention. As shown in FIG. 5, the image prediction apparatus 100 according to an exemplary embodiment of the present invention may include a training unit 110, a pyramid optical flow estimator 120, and a convolution module 130.
The pyramid optical-flow estimator 120 may include at least two layers of optical-flow estimators, such as the layer 1 estimator 121 and layer 2 estimator 122 shown in FIG. 5, each of which may estimate the current layer optical flow field using a known image frame and the previous layer predicted image frame generated by the previous layer optical flow estimator as inputs. Here, the previous layer predicted image frame used as the input of the first layer optical flow estimator 121 is zero.
Although not shown individually, the convolution module 130 may include a convolution kernel determining unit 131, a convolution kernel diluting unit 132, and a convolution executing unit 133 corresponding to each layer of optical flow estimator. The convolution kernel determining unit 131 may determine the current layer convolution kernel corresponding to the current layer optical flow field, and the convolution kernel diluting unit 132 may dilute the current layer convolution kernel to obtain a diluted convolution kernel having a working scale of 1 (i.e., the base scale). The convolution executing unit 133 may then perform an optical flow pixel transform on the previous layer predicted image frame using the obtained diluted convolution kernel to obtain the current layer predicted image frame. The first layer convolution executing unit processes the known image frame using the first layer diluted convolution kernel to generate the first layer predicted image frame. The last layer's predicted image frame is output as the prediction result.
The training unit 110 may be configured to train the pyramid optical-flow estimator 120 with a training data set, and the specific training process may refer to the embodiment described above with reference to fig. 2, which is not repeated here.
The detailed functions and operations of the respective units and blocks in the above-described image prediction apparatus 100 have been described in detail in the image prediction method described above with reference to fig. 2 to 4, and thus only a brief explanation is made here, and a repeated detailed description thereof is omitted.
The image prediction apparatus 100 according to the embodiment of the present application may be implemented in an image prediction device, and may be integrated into the image prediction device as a software module and/or a hardware module, for example. Fig. 6 shows a block diagram of an exemplary electronic device 200 that may implement the image prediction apparatus 100.
As shown in fig. 6, the electronic device 200 includes one or more processors 210 and memory 220.
The processor 210 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 200 to perform desired functions, such as the image prediction functions described above.
Memory 220 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 210 to implement the image prediction methods of the various embodiments of the present application described above and/or other desired functions.
In one example, the electronic device 200 may also include an input unit 230 and an output unit 240, which are interconnected by a bus system and/or other form of connection mechanism (not shown). The input unit 230 may be used to receive successive video images, for example the input unit 230 may be connected to an on-board camera to receive video images taken by it, which may be used to perform the training or prediction process described above. The output unit 240 may output the prediction result, and may output the prediction result to the vehicle-mounted driving assistance system, for example. The vehicle-mounted driving assist system can make a driving strategy judgment based on the prediction result, thereby realizing safe and reliable driving assist.
Of course, for simplicity, only some of the components of the electronic device 200 that are relevant to the present application are shown in fig. 6, while many other necessary or optional components are omitted. In addition, electronic device 200 may include any other suitable components depending on the particular application.
Fig. 7 shows a schematic view of a vehicle that may be equipped with such an electronic device 200. As shown in fig. 7, a vehicle 300 may include a camera 301 and an electronic device 310. The camera 301 may be a monocular or binocular camera, or may be an infrared camera, a laser radar, or the like, for capturing images of the surrounding driving environment. The electronic device 310 may be implemented as the electronic device 200 described with reference to fig. 6, and receives video images from the camera 301 to perform the training or prediction processes described above.
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the image prediction method according to various embodiments of the present application described above.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps in the image prediction method according to various embodiments of the present application described above.
The computer readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including", "comprising", and "having" are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or", unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, each component or step can be decomposed and/or re-combined. These decompositions and/or recombinations should be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (16)

1. A method for a pyramid optical-flow estimator to implement a multi-scale optical-flow pixel transform using dilution convolution, the pyramid optical-flow estimator comprising at least two layers of optical-flow estimators, the method comprising:
each layer of optical flow estimator estimates a current layer of optical flow field by using a known image frame and a previous layer of predicted image frame generated by the previous layer of optical flow estimator as input, wherein the current layer of optical flow field has a current layer working scale;
determining a current layer convolution kernel corresponding to the current layer optical flow field;
diluting the current layer convolution kernel to obtain a current layer diluted convolution kernel;
performing an optical-flow pixel transform process on the previous layer predicted image frame using the current layer dilution convolution kernel to obtain a current layer predicted image frame,
wherein the predicted image frame of the previous layer as an input of the first layer optical flow estimator is zero, and
wherein the working scales of the optical flow fields of the layers are different from each other,
wherein the dilution adds a forced all-zero region between the rows and columns of the convolution kernel.
2. The method of claim 1, wherein the known image frame and each layer of predicted image frames have the same image size.
3. The method of claim 1, wherein the working scale of the optical flow field of an upper layer is greater than the working scale of the optical flow field of a lower layer.
4. The method of claim 3, wherein the working scale of the optical flow field of an upper layer is twice the working scale of the optical flow field of the layer below it.
5. The method of claim 3, wherein the working scale of the bottommost optical flow field corresponds to the image size of the known image frame, and
wherein the dilution factor of the bottommost diluted convolution kernel is 1.
6. The method as claimed in claim 1, wherein the current layer optical-flow field estimated by each layer of optical-flow estimator is a residual optical-flow field between a prediction target and a previous layer of predicted image frame.
7. The method of claim 1, wherein performing an optical-flow pixel transform on the previous layer predicted image frame using the current layer diluted convolution kernel further comprises extended padding of boundaries of the previous layer predicted image frame.
8. The method of claim 7, wherein the extended padding comprises cyclic padding, mirror padding, or fixed value padding.
9. The method of claim 1, wherein each layer of optical flow estimator of the pyramid optical flow estimator comprises a fully convolutional network or a recurrent convolutional network.
10. The method of claim 1 wherein the optical-flow field generated by each layer of the pyramid optical-flow estimator comprises a random optical-flow field probability distribution, and the current-layer convolution kernel corresponding to each layer of the optical-flow field comprises a plurality of non-zero values.
11. The method of claim 1, wherein diluting the current layer convolution kernel comprises adding a forced all-zero region between the rows and columns of the convolution kernel.
12. An apparatus for implementing a multi-scale optical flow pixel transform using diluted convolution, comprising:
a pyramid optical flow estimator comprising at least two layers of optical flow estimators; and
a convolution module including a convolution kernel determination unit, a convolution kernel dilution unit and a convolution execution unit,
wherein each layer of optical flow estimator of the pyramid optical flow estimator estimates a current layer optical flow field using a known image frame and a previous layer predicted image frame generated by a previous layer optical flow estimator as input, the current layer optical flow field having a current layer working scale, the convolution kernel determining unit determines a current layer convolution kernel corresponding to the current layer optical flow field, the convolution kernel diluting unit dilutes the current layer convolution kernel to obtain a current layer diluted convolution kernel, the convolution executing unit performs an optical flow pixel transform process on the previous layer predicted image frame using the current layer diluted convolution kernel to obtain a current layer predicted image frame,
wherein the predicted image frame of the previous layer as an input of the first layer optical flow estimator is zero, and
wherein the working scales of the optical flow fields of the layers are different from each other,
wherein the dilution adds a forced all-zero region between the rows and columns of the convolution kernel.
13. The apparatus of claim 12, wherein the convolution module further comprises an image extension unit configured to extend fill in a boundary of the previous layer predicted image frame when the previous layer predicted image frame is subjected to optical flow pixel transform processing using the current layer diluted convolution kernel.
14. An electronic device, comprising:
a processor; and
a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the method of any of claims 1-11.
15. A vehicle comprising the electronic device of claim 14.
16. A computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1-11.
CN201810815694.2A 2018-07-24 2018-07-24 Method and apparatus for implementing multi-scale optical flow pixel transform using dilution convolution Active CN110751672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810815694.2A CN110751672B (en) 2018-07-24 2018-07-24 Method and apparatus for implementing multi-scale optical flow pixel transform using dilution convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810815694.2A CN110751672B (en) 2018-07-24 2018-07-24 Method and apparatus for implementing multi-scale optical flow pixel transform using dilution convolution

Publications (2)

Publication Number Publication Date
CN110751672A CN110751672A (en) 2020-02-04
CN110751672B true CN110751672B (en) 2022-06-21

Family

ID=69275255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810815694.2A Active CN110751672B (en) 2018-07-24 2018-07-24 Method and apparatus for implementing multi-scale optical flow pixel transform using dilution convolution

Country Status (1)

Country Link
CN (1) CN110751672B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991386A (en) * 2021-02-20 2021-06-18 浙江欣奕华智能科技有限公司 Optical flow tracking device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292912A * 2017-05-26 2017-10-24 浙江大学 An optical flow estimation method based on multi-scale correspondence structure learning
CN107465911A * 2016-06-01 2017-12-12 东南大学 A depth information extraction method and device
CN107993255A * 2017-11-29 2018-05-04 哈尔滨工程大学 A dense optical flow estimation method based on convolutional neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107465911A * 2016-06-01 2017-12-12 东南大学 A depth information extraction method and device
CN107292912A * 2017-05-26 2017-10-24 浙江大学 An optical flow estimation method based on multi-scale correspondence structure learning
CN107993255A * 2017-11-29 2018-05-04 哈尔滨工程大学 A dense optical flow estimation method based on convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-resolution optical flow tracking algorithm based on multi-scale Harris corner points feature; Meng Liu et al.; Control and Decision Conference; 2008-08-31; pp. 5287-5291 *
Sparse optical flow object extraction and tracking against a dynamic background (动态背景下的稀疏光流目标提取与跟踪); 兰红 et al.; 《中国图象图形学报》 (Journal of Image and Graphics); 2016-06-30; vol. 21, no. 6; pp. 771-780 *

Also Published As

Publication number Publication date
CN110751672A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
US11763466B2 (en) Determining structure and motion in images using neural networks
EP3451241A1 (en) Device and method for performing training of convolutional neural network
US11100646B2 (en) Future semantic segmentation prediction using 3D structure
US9905025B2 (en) Moving object tracking apparatus, method and unmanned aerial vehicle using the same
US11144782B2 (en) Generating video frames using neural networks
KR20180065498A (en) Method for deep learning and method for generating next prediction image using the same
EP4160271B1 (en) Method and apparatus for processing data for autonomous vehicle, electronic device, and storage medium
JP7345664B2 (en) Image processing system and method for landmark position estimation with uncertainty
US11822900B2 (en) Filter processing device and method of performing convolution operation at filter processing device
JP2017068608A (en) Arithmetic unit, method and program
CN107403440B (en) Method and apparatus for determining a pose of an object
CN114677412A (en) Method, device and equipment for estimating optical flow
WO2021218037A1 (en) Target detection method and apparatus, computer device and storage medium
US20210019621A1 (en) Training and data synthesis and probability inference using nonlinear conditional normalizing flow model
CN108881899B (en) Image prediction method and device based on optical flow field pyramid and electronic equipment
CN110751672B (en) Method and apparatus for implementing multi-scale optical flow pixel transform using dilution convolution
CN114387197A (en) Binocular image processing method, device, equipment and storage medium
US9323995B2 (en) Image processor with evaluation layer implementing software and hardware algorithms of different precision
CN110753239B (en) Video prediction method, video prediction device, electronic equipment and vehicle
KR20220129976A (en) Method and apparatus for scene flow estimation
Choe et al. Visual tracking based on particle filter with spline resampling
JP2010039968A (en) Object detecting apparatus and detecting method
JP7505640B2 (en) Inference device, inference method, and inference program
JP2020024612A (en) Image processing device, image processing method, processing device, processing method and program
JP2024512344A (en) Efficient pose estimation through iterative improvement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant