CN114881841A - Image generation method and device


Info

Publication number
CN114881841A
Authority
CN
China
Prior art keywords
image
view image
view
parallax
target
Prior art date
Legal status
Pending
Application number
CN202210410031.9A
Other languages
Chinese (zh)
Inventor
Xing Jun (邢俊)
Current Assignee
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd
Priority to CN202210410031.9A
Publication of CN114881841A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/14 Transformations for image registration, e.g. adjusting or mapping for alignment of images
    • G06T 3/147 Transformations for image registration, e.g. adjusting or mapping for alignment of images using affine transformations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/40 Filling a planar surface by adding surface attributes, e.g. colour or texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The application discloses an image generation method and an image generation device, and belongs to the technical field of information. The method comprises the following steps: acquiring a first view image through a target camera; determining a first parallax image corresponding to the first view image; performing affine transformation on the first view image based on the first parallax image to obtain a second view image; the first view image and the second view image are view images of the same shooting scene from different perspectives.

Description

Image generation method and device
Technical Field
The application belongs to the technical field of information, and particularly relates to an image generation method and an image generation device.
Background
With the development of binocular imaging technology, binocular images have gained wide attention in recent years.
At present, the mainstream binocular image acquisition schemes mainly include the following two: (1) fake image data is generated using 3D rendering software. Because the difference between such image data and actual scenes is too large, its generalization is insufficient, and a model trained on the fake image data generally performs poorly on real data sets. The usage threshold and time cost of this scheme are also high: a large amount of time is needed to learn the 3D rendering software and to produce image data for various scenes. (2) Binocular images are acquired cooperatively by an RGB camera and a depth camera. The annotation precision of such image data is poor, acquisition is expensive and labor-intensive, the acquisition scenes are limited, and this scheme is generally only suitable for acquiring indoor scenes.
Therefore, the two existing schemes have high binocular image acquisition cost and difficulty, and the acquired binocular images have poor generalization, resulting in low calculation precision; how to acquire binocular images has thus become an urgent problem to be solved.
Disclosure of Invention
An object of the embodiments of the present application is to provide a binocular data generation method, which can solve the problems of high data set acquisition cost and difficulty, and of poor generalization of the acquired data set resulting in low calculation precision.
In a first aspect, an embodiment of the present application provides an image generation method, where the method includes: acquiring a first view image through a target camera; determining a first parallax image corresponding to the first view image; performing affine transformation on the first view image based on the first parallax image to obtain a second view image; the first view image and the second view image are view images of the same shooting scene from different perspectives.
In a second aspect, an embodiment of the present application provides an image generating apparatus, including: shooting module and execution module, wherein: the shooting module is used for acquiring a first view image through the target camera; the execution module is used for determining a first parallax image corresponding to the first view image acquired by the shooting module; the execution module is further configured to perform affine transformation on the first view image based on the first parallax image to obtain a second view image; the first view image and the second view image are view images of the same shooting scene from different perspectives.
In a third aspect, embodiments of the present application provide an electronic device, which includes a processor and a memory, where the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product, stored on a storage medium, for execution by at least one processor to implement the method according to the first aspect.
In the embodiment of the application, an image generation device acquires a first view image through a target camera, determines a first parallax image corresponding to the first view image, and performs affine transformation on the first view image based on the first parallax image to obtain a second view image; the first view image and the second view image are view images of the same shooting scene at different viewing angles. By the method, the image generation device can acquire the first view image through the target camera, and perform image affine transformation on the first view image according to the disparity map of the first view image to obtain the second view image matched with the first view image. Thus, view images can be acquired through one camera without complex binocular data acquisition procedures and expensive binocular data acquisition equipment, and the generalization of the generated second view images is improved.
Drawings
Fig. 1 is a schematic flowchart of an image generation method provided in an embodiment of the present application;
fig. 2 is a schematic diagram of performing image affine transformation processing on a first-view image according to an embodiment of the present application;
fig. 3 is a schematic diagram of disparity prediction on a fourth-view image according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present application;
fig. 5 is a first schematic hardware structure diagram of an electronic device according to an embodiment of the present application;
fig. 6 is a second schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein fall within the protection scope of the present application.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar elements and not necessarily to describe a particular sequence or chronological order. It should be understood that data so used may be interchanged under appropriate circumstances, so that the embodiments of the application can be implemented in sequences other than those illustrated or described herein. Objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, a first object can be one or more than one. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally means that the preceding and succeeding related objects are in an "or" relationship.
The image generation method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings by specific embodiments and application scenarios thereof.
An embodiment of the present application provides an image generation method, and fig. 1 shows a flowchart of the image generation method provided in the embodiment of the present application. As shown in fig. 1, the image generation method provided in the embodiment of the present application may include the following steps 201 to 203:
step 201: the image generation device acquires a first view image through the target camera.
In this embodiment, the target camera may be a monocular camera.
Optionally, in this embodiment of the application, the monocular camera is a camera on an electronic device. Illustratively, the monocular camera is a single camera on the electronic device, for example, a front or rear camera of the electronic device, the camera of a camera device, etc.
In the embodiment of the present application, the first view image may be a left eye image, i.e., a left view, or the first view image may be a right eye image, i.e., a right view.
Optionally, in this embodiment of the present application, the first view image is a color image, i.e., an image including a plurality of colors.
Optionally, in this embodiment of the application, the first view image includes a plurality of view images, for example, the first view image may include 100 left views, or the first view image may include 100 right views.
Optionally, in this embodiment of the application, the first view image may be a plurality of images acquired by a monocular camera in a plurality of scenes. Alternatively, the scene may be a landscape, a building, a portrait, a still life, and the like. Illustratively, the scene may be a sea, a forest, a house, an indoor or outdoor portrait, food, and the like. The above only lists some common shooting scenes; scenes may be selected according to actual needs, and this is not limited in this embodiment of the present application.
Illustratively, the number of the first-view images may be equal to or greater than 100 and equal to or less than 1000.
For example, the image generation apparatus may acquire 100 left views by a monocular camera, and the 100 left views include different types of images acquired in different shooting scenes, such as landscape images, portrait images, and building images.
For another example, the image generation apparatus may acquire 550 left views by a monocular camera, and the 550 left views include different types of images acquired in different shooting scenes, such as landscape images, portrait images, and building images.
For another example, the image generation device may acquire 1000 view images by using a monocular camera, and the 1000 view images include different types of images acquired in different shooting scenes, such as a landscape image, a portrait image, and a building image.
Therefore, the image generation device can acquire the monocular image containing more scenes through the monocular camera, the cost for acquiring the monocular image is low, and the generalization capability of the image data of the monocular image is effectively improved.
Step 202: the image generation device determines a first parallax image corresponding to the first view image.
In an embodiment of the present application, the first parallax image includes: the disparity value corresponding to each pixel point in the first view image.
Optionally, in an embodiment of the present application, the disparity value includes: the offset of each pixel point in the first view image. For example, the offset may be an offset of the pixel coordinates of each pixel point. Further, the offset may be an offset of the pixel coordinates of each pixel point in the horizontal direction.
Alternatively, the disparity value corresponding to each pixel point in the first-view image may be the same or different. For example, the disparity value of each pixel point in the first view image is 3 px; for another example, the parallax value of the pixel point 1 in the first view image is 3px, and the parallax value of the pixel point 2 in the first view image is 4 px.
Optionally, in this embodiment of the application, the image generating device may obtain the first parallax image corresponding to the first view image based on a deep learning network (e.g., a neural network). For example, the deep learning network may be a binocular stereo matching network, and the image generation device may input the first view image into the binocular stereo matching network, and use a disparity map output by the binocular stereo matching network as a first disparity image corresponding to the first view image.
Alternatively, the image generation apparatus may perform the first parallax image prediction using the deep learning network with the first view image as an input sample set.
Illustratively, assuming that the input sample set includes 1000 image samples, the samples may be divided into 10 batches for prediction, each batch including 100 image samples.
Optionally, in this embodiment of the application, the image generation device may obtain a first parallax image corresponding to the first view image based on the depth information of the first view image. For example, the image generation device may calculate disparity information of the first view image based on depth information of the first view image, and obtain a first disparity image corresponding to the first view image according to the disparity information.
Step 203: the image generation device performs affine transformation on the first view image based on the first parallax image to obtain a second view image.
The first view image and the second view image are view images of the same shooting scene at different view angles.
Optionally, in this embodiment of the application, the image generation device may perform image affine transformation processing on the first view image based on the disparity value corresponding to each pixel point in the first view image, so as to obtain a second view image corresponding to the first view image.
Alternatively, in the embodiment of the present application, the image generation device may perform affine transformation processing on the first image region of the first view image based on the first parallax image. For example, the first image area may be a foreground or a background of the first-view image, or the first image area may be an image area corresponding to the first object of the first-view image. For example, the first image area is an image area where a portrait is located in the first-view image.
Illustratively, the first view image is taken as a left eye image (i.e., a left view). Fig. 2 is a schematic diagram of image affine transformation processing performed on the first view image. As shown in fig. 2, after obtaining the parallax image corresponding to the left eye image 21, the image generation device performs image affine transformation on the left eye image 21 based on the disparity value of each pixel point in the left eye image 21 included in the parallax image, to obtain the corresponding right eye image 22. As can be seen from fig. 2, with the right eye image 22 and the left eye image 21 placed in the same coordinate system, each pixel point in the right eye image 22 is translated leftward.
The left eye image 21 is a color image, and the size of the left eye image 21 is the same as that of the corresponding parallax image.
It should be noted that the affine transformation of the image is a process of mapping each pixel point on the image to a new position according to a certain rule, and is essentially a process of solving new horizontal and vertical coordinates of the pixel points of the image, that is, mapping M × N pixel points of the original image to M × N new positions of the target image.
Note that the affine transformation (warp) of an image is also referred to as image warping.
In the image generation method provided by the embodiment of the application, an image generation device acquires a first view image through a target camera, determines a first parallax image corresponding to the first view image, and performs affine transformation on the first view image based on the first parallax image to obtain a second view image; the first view image and the second view image are view images of the same shooting scene at different viewing angles. By the method, the image generation device can acquire the first view image through the target camera, and perform image affine transformation on the first view image according to the disparity map of the first view image to obtain the second view image matched with the first view image. Thus, view images can be acquired through one camera without complex binocular data acquisition procedures and expensive binocular data acquisition equipment, and the generalization of the generated second view images is improved.
Optionally, in this embodiment of the present application, the step 203 may include the following steps 203a1 and 203a2:
Step 203a1: The image generation device performs affine transformation on the first position information of each pixel point in the first view image respectively, based on the disparity value corresponding to each pixel point in the first view image, to obtain second position information.
Step 203a2: The image generation device generates a second view image based on the second position information and the pixel information of each pixel point.
Alternatively, the first position information of the pixel point may be represented by a coordinate of the pixel point. For example, the coordinates of the pixel points may be two-dimensional coordinates (x, y). For example, the position of the pixel point i can be represented by coordinates (xi, yi).
Optionally, the second position information may be: and the position information of each pixel point in the first view image in the second view image. Alternatively, the second position information of the pixel point may be represented by a coordinate of the pixel point. For example, the position of the pixel point i in the first view image may be represented by coordinates (xi, yi), and the position of the pixel point in the second view image may be represented by coordinates (xi ', yi').
Optionally, the pixel information of each pixel point may be a pixel value of each pixel point. Illustratively, the pixel value may be a gray value.
Illustratively, the first view image is a left eye image, and the second view image is a right eye image. Assuming that the coordinate position of a pixel point i in the left eye image is (xi, yi), and the disparity value corresponding to the pixel point i is dsi, the coordinate position corresponding to the pixel point i in the right eye image is (xi - dsi, yi).
Illustratively, the first view image is a right eye image, and the second view image is a left eye image. Assuming that the coordinate position of a pixel point j in the right eye image is (xj, yj), and the disparity value corresponding to the pixel point j is dsj, the coordinate position corresponding to the pixel point j in the left eye image is (xj + dsj, yj).
It should be noted that when the right eye image is determined from the first view image serving as the left eye image, each pixel point i in the left eye image needs to be translated leftward by Di pixels, that is, Di is subtracted from the abscissa of each pixel point i; when the left eye image is determined from the first view image serving as the right eye image, each pixel point j in the right eye image needs to be translated rightward by Dj pixels, that is, Dj is added to the abscissa of each pixel point j. Di is the disparity value corresponding to each pixel point i in the left eye image, and Dj is the disparity value corresponding to each pixel point j in the right eye image.
Alternatively, the image generation device may determine coordinates of each pixel point in the first view image in the second view image, and assign a pixel value to the pixel point at each coordinate to generate the second view image.
In this way, the image generation device may perform affine transformation on the coordinates of each pixel point in the first view image based on the parallax value corresponding to each pixel point in the first view image to obtain the coordinates of each pixel point in the second view image, so as to obtain the second view image with a higher matching degree with the first view image.
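As an illustrative sketch only, and not the claimed implementation, the per-pixel transformation of steps 203a1 and 203a2 can be written in a few lines of Python with NumPy; the function name warp_left_to_right, the rounding of disparities to integers, and the returned validity mask are assumptions made for this example:

    import numpy as np

    def warp_left_to_right(left_img, disparity):
        # Move every left-view pixel at (xi, yi) to (xi - dsi, yi), as in
        # steps 203a1 and 203a2; disparity is an H x W array of dsi values.
        h, w = disparity.shape
        right_img = np.zeros_like(left_img)
        valid = np.zeros((h, w), dtype=bool)
        ys, xs = np.mgrid[0:h, 0:w]
        new_xs = xs - np.round(disparity).astype(np.int64)
        inside = new_xs >= 0  # pixels shifted past the left edge are invalid
        right_img[ys[inside], new_xs[inside]] = left_img[ys[inside], xs[inside]]
        valid[ys[inside], new_xs[inside]] = True
        # Where several source pixels land on the same target, the last write
        # wins here; the method described below instead deletes such collision
        # pixels from the parallax image beforehand.
        return right_img, valid

Positions not marked in valid correspond to the hole regions that are filled later (see steps 203b1 and 203b2 below).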
Optionally, in this embodiment of the present application, the step 202 may include the following steps 202a and 202b:
Step 202a: The image generation device generates a second parallax image corresponding to the first view image based on the depth information of the first view image.
Step 202b: The image generation device deletes the pixel points of the invalid value region in the second parallax image to obtain the first parallax image.
Alternatively, the image generation device may generate the second parallax image corresponding to the first-view image based on the depth information of the first-view image, and the baseline and focal length information of the binocular imaging system.
For example, the image generation apparatus may perform depth prediction on the first view image by using a monocular depth network model, and obtain the depth information of the first view image. Optionally, the monocular depth network model may be a MiDaS network or another network, which is not limited in this embodiment of the present application.
For example, the image generation device may acquire parallax information of the first-view image and generate the second parallax image based on the parallax information. Illustratively, the above-described disparity information may include disparity values. For example, the image generation device generates the second parallax image by using each parallax value of the first view image as pixel point information of the second parallax image.
Illustratively, take the case where the first view image includes an image a. The image generation device inputs the image a into a monocular depth model m and performs prediction on the image a, obtaining a monocular depth estimation result z of the image a, that is, z = m(a).
For example, the image generation apparatus may determine disparity information of the first-view image based on depth information of the first-view image.
Illustratively, the image generation device acquires the baseline b and focal length information f of the binocular imaging system, and calculates the parallax information d of the image a by using the parallax formula based on the baseline b, the focal length information f, and the predicted depth estimation result z.
It should be noted that the parallax formula is formula (1) below; for the specific calculation of the parallax information d from the baseline b, the focal length information f, and the predicted depth estimation result z, reference may be made to the description below, and details are not repeated here.
Optionally, the invalid value region in the second parallax image includes at least one of:
an image region including scatter noise;
an image region including invalid pixel values.
It should be noted that, geometric constraint of the depth estimation result z obtained by using the monocular depth network is generally weak, and therefore, if the disparity map obtained by converting the depth estimation result z is directly used for image affine transformation processing, that is, warp, the result obtained by warp is accompanied by some isolated noise points, and thus the effect of the generated second view image is poor.
Optionally, the image generating device may extract, by using a Sobel operator, the region of the second parallax image with a gradient greater than g (for example, g may be 3), obtain an edge map e, and then delete the edge map e from the second parallax image, so as to delete the image region including scatter noise in the second parallax image.
Alternatively, the image generation device may perform image filling on the second parallax image from which the invalid value region is deleted after deleting the pixel points of the invalid value region in the second parallax image.
For example, the image generation device may perform interpolation processing on the second parallax image from which the invalid value region has been deleted, so as to fill the hole regions formed by the deletion. For example, after the image region e including scatter noise is deleted from the disparity map d, a disparity map d' containing hole regions is obtained, and the hole regions are then filled by multivariate interpolation to obtain the filled disparity map ds.
In this way, the image generation device can remove the image region including the scatter noise in the second parallax image, thereby effectively improving the effect of the generated second view image.
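A minimal sketch of this noise-removal step, assuming OpenCV and SciPy are available, that the gradient threshold g = 3 from the example above is used, and that nearest-neighbour interpolation stands in for the multivariate interpolation mentioned above:

    import cv2
    import numpy as np
    from scipy.interpolate import griddata

    def remove_scatter_noise(disp, g=3.0):
        # Edge map e: regions of the disparity map whose gradient exceeds g.
        gx = cv2.Sobel(disp, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(disp, cv2.CV_32F, 0, 1, ksize=3)
        edge = np.sqrt(gx ** 2 + gy ** 2) > g
        holed = disp.astype(np.float32).copy()
        holed[edge] = np.nan  # delete the edge map e, leaving holes (d')
        # Fill the holes by interpolating from the remaining valid pixels (ds).
        valid = ~np.isnan(holed)
        ys, xs = np.nonzero(valid)
        grid_y, grid_x = np.mgrid[0:disp.shape[0], 0:disp.shape[1]]
        return griddata((ys, xs), holed[valid], (grid_y, grid_x), method="nearest")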
Alternatively, the image generation device may determine an image region where the invalid pixel point in the second parallax image is located, and delete the image region.
Illustratively, the invalid pixel points include at least one of the following:
a pixel point whose disparity value is larger than the abscissa of its corresponding pixel point in the first view image, i.e., a pixel point that, according to the parallax calculation, falls outside the range represented by the second view image;
a plurality of pixel points in the first view image that, according to the parallax calculation, correspond to the same pixel position in the second view image.
It should be noted that the pixel value of each pixel point in the second parallax image is essentially the parallax value of each pixel point in the first view image, and each pixel point in the second parallax image corresponds to each pixel point in the first view image one to one.
It should be noted that, in general, some pixel points near the left edge of the left eye image are not reflected in the right eye image, that is, some image content present in the left eye image is invisible in the right eye image; similarly, some pixel points near the right edge of the right eye image are not reflected in the left eye image. Therefore, when the right eye image is generated from the left eye image, the pixel points close to the left edge of the left eye image can be regarded as invalid pixel points; when the left eye image is generated from the right eye image, the pixel points close to the right edge of the right eye image can be regarded as invalid pixel points.
For example, the image generation device may delete a pixel point in the second parallax image whose pixel value is greater than the abscissa of its corresponding pixel point in the first view image, so that the generated second view image does not include image content near the left edge or the right edge of the first view image, thereby obtaining a first view image and a second view image that are strictly consistent with binocular images acquired by a binocular device.
Exemplarily, the first view image is taken as the left eye image. If the position of a pixel point i in the left eye image is (xi, yi) and its disparity is dsi, the coordinate of the corresponding pixel point in the right eye image is (xi - dsi, yi). By counting whether (xi - dsi, yi) falls within the size of the right eye image (i.e., whether xi - dsi is less than 0), and whether several pixel points correspond to the same pixel point of the right eye image, and deleting the pixel point i if so, a disparity map dt without invalid values can be obtained.
In this way, by deleting the invalid value in the second parallax image, the generated second view image does not include the image content near the left edge or the right edge in the first view image, so that the first view image and the second view image which are strictly matched with the binocular image acquired by the binocular device are obtained, and the effect of the generated second view image is improved.
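The two kinds of invalid pixel points could be marked, for a left view, with a sketch like the following; the function name and the integer rounding of disparities are assumptions for illustration:

    import numpy as np

    def invalid_pixel_mask(disp):
        # disp[y, x] is the disparity of the left-view pixel at (x, y).
        h, w = disp.shape
        invalid = np.zeros((h, w), dtype=bool)
        for y in range(h):
            targets = {}  # right-view x coordinate -> list of source xs
            for x in range(w):
                xr = x - int(round(disp[y, x]))
                if xr < 0:  # (xi - dsi, yi) falls outside the right view
                    invalid[y, x] = True
                else:
                    targets.setdefault(xr, []).append(x)
            for xs in targets.values():
                if len(xs) > 1:  # several left pixels share one target position
                    for x in xs:
                        invalid[y, x] = True
        return invalid

Deleting the masked pixel points from the second parallax image yields the disparity map dt used in the subsequent warp operation.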
Further optionally, the step 202a may include the following steps A1 and A2:
Step A1: calculating a disparity value corresponding to each pixel point in the first view image according to the depth information of the first view image and a first parameter.
Step A2: obtaining a second parallax image based on the disparity value corresponding to each pixel point.
Wherein the first parameter includes: first distance and focal length information.
Optionally, the first distance is used to represent a distance between different viewing angles of the same shooting scene.
Illustratively, the first distance may be a baseline of the binocular imaging system.
It should be noted that a binocular imaging system is generally a system that performs three-dimensional stereo imaging by using a binocular imaging device (e.g., a binocular camera); the baseline of the binocular imaging system is the distance between the two cameras of the binocular imaging device, and the baselines of different binocular imaging devices may be different.
For example, in a left-right binocular camera, both cameras may be regarded as two pinhole cameras positioned horizontally, with the aperture centers of both cameras located on the x-axis. The distance between the two cameras is called the baseline of the binocular camera (b). The longer the baseline, the larger the maximum measurable distance of the binocular system; conversely, the shorter the baseline, the smaller the maximum measurable distance.
Optionally, the depth information of the first view image is: the distance (depth) values from the monocular camera that acquires the first view image to each point in the scene.
Optionally, the image generating apparatus may perform depth prediction on the first view image by using a monocular depth network model, so as to obtain the depth information of the first view image. Optionally, the monocular depth network model may be a MiDaS network or another network, which is not limited in this embodiment of the present application.
Exemplarily, the first view image includes 100 left views. The 100 left views are input as a batch monocular data set a into a monocular depth network model m for depth prediction, and a monocular depth estimation result z of the 100 left views is output, where the relationship between the monocular data set a and the monocular depth estimation result z is z = m(a).
For example, the image generation apparatus may generate a first depth image corresponding to the first view image based on the depth information of the first view image. Illustratively, the pixel values of the respective pixel points of the first depth image are depth values of the first view image.
Example 1, when image depth information prediction is performed on a first-view image, a monocular image is input into a monocular depth network for depth prediction, and a depth image corresponding to the monocular image is output.
The monocular image input to the monocular depth network is a color image, and the output depth image is a depth map of the monocular image.
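A sketch of such a depth-prediction step using a MiDaS model loaded from torch.hub follows; the repository and entry-point names are those of the public MiDaS release and should be treated as assumptions about the reader's environment:

    import cv2
    import torch

    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
    midas.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

    img = cv2.cvtColor(cv2.imread("left_view.png"), cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        batch = transforms.small_transform(img)  # 1 x 3 x H' x W' tensor
        pred = midas(batch)  # relative inverse depth, 1 x H' x W'
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    z = pred.cpu().numpy()
    # Note: MiDaS predicts relative inverse depth, so a scale/shift
    # calibration is needed before z can serve as metric depth in formula (1).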
Alternatively, the image generation device may generate the second parallax image corresponding to the first view image by using the parallax value corresponding to each pixel point in the first view image as a pixel value.
Illustratively, the image generation apparatus may calculate the disparity value d based on the baseline b and the focal length information f of the binocular imaging system, and the first depth map z corresponding to the first view image, as shown in formula (1).
d = b × f / z        (1)
Example 2, in combination with example 1, when performing parallax conversion on the first-view image, the image generation device performs parallax conversion processing on the depth image corresponding to the monocular image, and obtains a parallax image corresponding to the monocular image.
It should be noted that, for performing the parallax conversion processing on the depth image, reference may be made to related technologies, and details are not described here.
Therefore, the baseline and the focal length can be flexibly selected according to actual requirements, so that disparity values for various baselines and focal lengths can be conveniently generated, and second view images with various baselines and focal lengths can be obtained subsequently.
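Formula (1) translates directly into code. A minimal sketch, assuming the depth map z, the baseline b, and the focal length f are expressed in consistent units, and with the placeholder depth map and example (b, f) values chosen arbitrarily for illustration:

    import numpy as np

    z = np.full((480, 640), 2.0)  # placeholder depth map, illustration only

    def depth_to_disparity(z, b, f):
        # Formula (1): d = b * f / z; clamp z to avoid division by zero.
        return b * f / np.maximum(z, 1e-6)

    # Different (b, f) pairs yield disparity maps, and hence second view
    # images, for different baselines and focal lengths from one depth map.
    d_short = depth_to_disparity(z, b=0.05, f=720.0)
    d_long = depth_to_disparity(z, b=0.12, f=720.0)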
Further optionally, in this embodiment, the step 203 may include the following steps 203b1 and 203b2:
Step 203b1: The image generation device performs affine transformation on the first view image based on the first parallax image to obtain a third view image.
Step 203b2: The image generation device performs image filling on the target image area of the third view image to obtain a second view image.
Wherein, the target image area is: an image region in the third view image corresponding to the invalid value region in the second parallax image.
It can be understood that, because some invalid value pixel points have been deleted from the first parallax image, some pixel points in the first view image have no corresponding disparity values; therefore, the third view image obtained by performing the warp operation on the first view image based on the first parallax image has some missing, i.e., invalid, pixels.
Optionally, the image generating device may perform image filling on the target image area of the third view image according to acquired background images. For example, the image generation device may capture background images in a plurality of scenes to obtain background images covering the plurality of scenes.
Illustratively, the first view image is taken as a left view, and the third view image is taken as a right view. First, a plurality of acquired background images are used as a batch background data set. Second, the warp operation is performed on the left view a by using the disparity map dt from which the pixel points of the invalid value region have been deleted, obtaining a right view b. Finally, the background data set is randomly sampled, and the sampled pixel values are used as the pixel values of the invalid value region, that is, the invalid value region in the right view b is filled, obtaining the filled right view b.
Optionally, the image generating device may perform image filling on the target image region in the third view image according to the pixel point of the second image region in the third view image. For example, the second image area may be an image area around the target image area.
Illustratively, the first view image is taken as a left view, and the third view image is taken as a right view. First, the warp operation is performed on the left view a by using the disparity map dt from which the pixel points of the invalid value region have been deleted, obtaining a right view b. The right view b now contains a partial invalid value region. Then, template matching is used to search for the N matching points whose neighborhoods are most similar to the neighborhood of each pixel to be filled. Finally, the N matching points are sorted by matching similarity, and the best matching point is selected to fill the invalid value region, obtaining a complete right eye image.
In this way, the image generation apparatus may fill the missing region of the third view image with texture from a randomly selected background image, or with texture surrounding the missing region in the third view image, thereby improving the realism and accuracy of the finally generated second view image.
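The first filling strategy can be sketched as follows; backgrounds (a list of background images) and the function name are assumptions for illustration:

    import random
    import numpy as np

    def fill_from_backgrounds(view, valid, backgrounds):
        # Fill every invalid pixel of `view` with a randomly sampled pixel
        # from a background data set (`backgrounds`: list of HxWx3 arrays).
        out = view.copy()
        for y, x in zip(*np.nonzero(~valid)):
            bg = random.choice(backgrounds)
            out[y, x] = bg[random.randrange(bg.shape[0]),
                           random.randrange(bg.shape[1])]
        return out

The second strategy, template matching over the neighborhood of each pixel to be filled, could be built on a routine such as cv2.matchTemplate; it costs more computation but fills holes with textures consistent with the surrounding image.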
Optionally, in an embodiment of the present application, the image generation method provided in the embodiment of the present application further includes the following steps 204 to 206:
step 204: the image generation device trains the binocular stereo matching network by using the first view image and the second view image to obtain the trained binocular stereo matching network.
Step 205: and the image generation device inputs the fourth view image into the trained binocular stereo matching network and outputs a target parallax image.
Step 206: the image generation device performs affine transformation on the fourth view image based on the target parallax image to obtain a target view image.
The fourth view image and the target view image are view images of the same shooting scene at different angles of view.
Optionally, the fourth view image may be a view image acquired by a monocular camera.
For example, the image generating device may train the binocular stereo matching network by using the first view image as a left eye image, the second view image as a right eye image, and the first parallax image as an output true value, to obtain the trained binocular stereo matching network p.
Illustratively, the binocular Stereo matching network may be a RAFT-Stereo network.
It should be noted that, at present, more and more camera algorithms are adopting AI algorithms in place of traditional algorithms; as an important branch of computer vision tasks, binocular stereo matching algorithms have very wide application prospects in fields such as robotics and autonomous driving.
In the related art, when the AI-based stereo matching algorithm calculates the disparity of the corresponding pixels in the binocular image, a large amount of training data is required.
In an embodiment of the application, the image generation means may generate a second view image based on the first view image, and train the binocular stereo matching network using the first view image and the second view image as a pair of binocular training images as input. In this way, when binocular training data is needed to train a model, only monocular data needs to be collected, and binocular training data matching the monocular data can be generated automatically, without a complex binocular data acquisition process or expensive binocular data acquisition equipment. This reduces the data acquisition cost of model training and effectively improves the precision of the model, thereby solving the problems of small training data sets and poor generalization, and in turn improving the training effect and robustness of the binocular stereo matching network.
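A minimal training-loop sketch under stated assumptions: RAFTStereo is a hypothetical constructor standing in for any binocular stereo matching network (the RAFT-Stereo network is named above as one example), and loader is assumed to yield (left, right, disparity) triples built from the generated first view images, second view images, and first parallax images:

    import torch
    import torch.nn.functional as F

    model = RAFTStereo()  # hypothetical constructor for the stereo network
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for left, right, disp_gt in loader:  # generated binocular pairs + truth
        disp_pred = model(left, right)  # predicted disparity map
        loss = F.l1_loss(disp_pred, disp_gt)  # supervise with parallax image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()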
Optionally, after the binocular stereo matching network is trained by using the first view image and the second view image generated by the first view image to obtain the trained binocular stereo matching network, disparity estimation may be performed on an input fourth view image by using the trained binocular stereo matching network to obtain a target disparity image corresponding to the fourth view image.
Optionally, the fourth view image may be the same as, at least partially the same as, or different from the first view image.
In an example, the image generating apparatus may use a trained binocular stereo matching network to predict the first view image again, so as to obtain a predicted disparity value dp, and further optimize the obtained first disparity image.
Example 3, fig. 3 is a schematic diagram of performing disparity prediction on a fourth view image. As shown in fig. 3, the left eye image 31 (i.e., a left view) is input into the trained binocular stereo matching network 32 for disparity estimation, and a parallax image 33 corresponding to the left eye image 31 is output. The trained binocular stereo matching network is obtained by optimization using the first view image and the second view image generated based on the first view image; that is, the optimized binocular stereo matching network has higher prediction precision, so that predicting the fourth view image through the optimized binocular stereo matching network can yield a more accurate parallax image corresponding to the fourth view image.
It should be noted that the depth information of a scene corresponding to an image is generally described by a grayscale map with the same size, and the grayscale value of each pixel in the grayscale map describes the depth value of the scene corresponding to the image, which is also referred to as a depth map.
In practical applications, usually, image datasets formed by a plurality of images are processed, for example, a plurality of images are input into a network as a batch of image datasets to perform prediction to obtain a parallax image corresponding to each image.
It should be noted that the binocular stereo matching network trained on the first view image and the second view image has already been trained on a large batch of data, so its prediction result is better than the disparity value dt obtained based on depth information; therefore, re-predicting the first view image with the trained binocular stereo matching network yields a predicted disparity estimate dp with higher accuracy.
In another example, the image generation apparatus may continue to predict a new fourth view image by using the trained binocular stereo matching network, so as to improve the accuracy of the parallax image corresponding to the subsequently generated view image.
Further optionally, after the trained binocular stereo matching network is obtained, the image generation device may predict a disparity value corresponding to the first view image again by using the binocular stereo matching network, and determine the depth information of the first view image based on the disparity value and the selected baseline b and the selected focal length f to obtain a depth map corresponding to the first view image, so as to obtain more accurate depth information.
Optionally, the image generation device may train the monocular depth network model by using the first view image and the corresponding depth map, and update the weight of the monocular depth network model to obtain the trained monocular depth network model. Therefore, the effect of the monocular depth network model can be optimized, the prediction precision of the monocular depth network model is effectively improved, and the robustness of the model is improved.
Optionally, after obtaining the optimized monocular depth network model, the depth information of the first view image may be predicted again by using the optimized monocular depth network model. Further, the image generation device performs depth conversion on the depth information of the first view image obtained by re-prediction to obtain a parallax image corresponding to the first view image, so as to obtain a more accurate parallax image and further obtain a more accurate second view image.
Illustratively, the monocular image is input into the optimized monocular depth network model for depth prediction, and a depth image corresponding to the monocular image is output. Therefore, the optimized monocular depth network model is used for predicting the depth information of the first view image again, and the accuracy of the obtained depth information is improved.
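The refinement loop described in the last few paragraphs could look like the following sketch; stereo_net, mono_depth_net, mono_optimizer, left, right, b, and f are all assumed names for this illustration:

    import torch
    import torch.nn.functional as F

    # 1. Re-predict disparity with the trained stereo matching network.
    with torch.no_grad():
        disp = stereo_net(left, right)

    # 2. Invert formula (1) to recover depth: z = b * f / d.
    depth_target = b * f / disp.clamp(min=1e-6)

    # 3. Fine-tune the monocular depth model on the refined depth maps.
    depth_pred = mono_depth_net(left)
    loss = F.l1_loss(depth_pred, depth_target)
    mono_optimizer.zero_grad()
    loss.backward()
    mono_optimizer.step()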
According to the image generation method provided by the embodiment of the application, the execution subject can be an image generation device. The image generation device provided by the embodiment of the present application will be described with an example in which an image generation device executes an image generation method.
An embodiment of the present application provides an image generation apparatus 400, as shown in fig. 4, including: a photographing module 401 and an executing module 402, wherein: the shooting module 401 is configured to acquire a first view image through a target camera; the executing module 402 is configured to determine a first parallax image corresponding to the first view image acquired by the shooting module; the executing module 402 is further configured to perform affine transformation on the first view image based on the first parallax image to obtain a second view image; the first view image and the second view image are view images of a same shooting scene from different perspectives.
Optionally, in an embodiment of the present application, the first parallax image includes: a disparity value corresponding to each pixel point in the first view image;
the executing module 402 is specifically configured to perform affine transformation on first position information of each pixel point in the first view image respectively based on a disparity value corresponding to each pixel point in the first view image, so as to obtain second position information;
the executing module 402 is specifically configured to generate a second view image based on the second position information and the pixel information of each pixel point.
Alternatively, in the embodiments of the present application,
the executing module 402 is specifically configured to generate a second parallax image corresponding to a first view image based on depth information of the first view image;
the executing module 402 is specifically configured to delete a pixel point in an invalid value region in the second parallax image, so as to obtain the first parallax image.
Alternatively, in the embodiments of the present application,
the executing module 402 is specifically configured to perform affine transformation on the first view image based on the first parallax image to obtain a third view image;
the executing module 402 is specifically configured to perform image filling on a target image area of the third view image to obtain a second view image;
wherein, the target image area is: an image region in the third view image corresponding to the invalid value region in the second parallax image.
Optionally, in an embodiment of the present application, the apparatus further includes: a training module 403 and a processing module 404, wherein,
the training module 403 is configured to train a binocular stereo matching network by using the first view image acquired by the shooting module 401 and the second view image obtained by the executing module 402, so as to obtain a trained binocular stereo matching network;
the processing module 404 is configured to input a fourth view image into the trained binocular stereo matching network, and output a target parallax image;
the executing module 402 is further configured to perform affine transformation on the fourth view image based on the target parallax image to obtain a target view image;
the fourth view image and the target view image are view images of the same shooting scene at different angles of view.
In the image generation device provided by the embodiment of the application, the image generation device acquires a first view image through a target camera, determines a first parallax image corresponding to the first view image, and performs affine transformation on the first view image based on the first parallax image to obtain a second view image; the first view image and the second view image are view images of the same shooting scene at different viewing angles. By the method, the image generation device can acquire the first view image through the target camera, and perform image affine transformation on the first view image according to the disparity map of the first view image to obtain the second view image matched with the first view image. Thus, view images can be acquired through one camera without complex binocular data acquisition procedures and expensive binocular data acquisition equipment, and the generalization of the generated second view images is improved.
The image generation apparatus in the embodiment of the present application may be an electronic device, or may be a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. The electronic Device may be, for example, a Mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic Device, a Mobile Internet Device (MID), an Augmented Reality (AR)/Virtual Reality (VR) Device, a robot, a wearable Device, an ultra-Mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and may also be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine, a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
The image generation apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android (Android) operating system, an ios operating system, or other possible operating systems, and embodiments of the present application are not limited specifically.
The image generation device provided in the embodiment of the present application can implement each process implemented by the method embodiments of fig. 1 to fig. 3, and is not described here again to avoid repetition.
Optionally, as shown in fig. 5, an electronic device 500 is further provided in an embodiment of the present application, and includes a processor 501 and a memory 502, where the memory 502 stores a program or an instruction that can be executed on the processor 501, and when the program or the instruction is executed by the processor 501, the steps of the embodiment of the image generation method are implemented, and the same technical effects can be achieved, which are not described again to avoid repetition.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 6 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 100 includes, but is not limited to: a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, and a processor 110.
Those skilled in the art will appreciate that the electronic device 100 may further comprise a power source (e.g., a battery) for supplying power to various components, and the power source may be logically connected to the processor 110 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The electronic device structure shown in fig. 6 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.
The input unit 104 is a target camera in the embodiment of the present application, and is configured to acquire a first view image; the processor 110 is configured to determine a first parallax image corresponding to a first view image acquired by a target camera; the processor 110 is further configured to perform affine transformation on the first view image based on the first parallax image to obtain a second view image; the first view image and the second view image are view images of a same shooting scene from different perspectives.
Optionally, in an embodiment of the present application, the first parallax image includes: a disparity value corresponding to each pixel point in the first view image;
the processor 110 is specifically configured to perform affine transformation on first position information of each pixel point in the first view image respectively based on a disparity value corresponding to each pixel point in the first view image, so as to obtain second position information;
the processor 110 is specifically configured to generate a second view image based on the second position information and the pixel information of each pixel point.
Alternatively, in the embodiments of the present application,
the processor 110 is specifically configured to generate a second parallax image corresponding to a first-view image based on depth information of the first-view image;
the processor 110 is specifically configured to delete a pixel point in an invalid value region in the second parallax image, so as to obtain the first parallax image.
Alternatively, in the embodiments of the present application,
the processor 110 is specifically configured to perform affine transformation on the first view image based on the first parallax image to obtain a third view image;
the processor 110 is specifically configured to perform image filling on a target image area of the third view image to obtain a second view image;
wherein, the target image area is: an image region in the third view image corresponding to the invalid value region in the second parallax image.
Optionally, in this embodiment of the application, the processor 110 is configured to train a binocular stereo matching network by using a first view image acquired by a monocular camera and an obtained second view image, so as to obtain the trained binocular stereo matching network;
the processor 110 is configured to input a fourth view image into the trained binocular stereo matching network, and output a target parallax image;
the processor 110 is further configured to perform affine transformation on the fourth view image based on the target parallax image to obtain a target view image;
the fourth view image and the target view image are view images of the same shooting scene at different angles of view.
In the electronic device provided by the embodiment of the application, the electronic device acquires a first view image through a target camera, determines a first parallax image corresponding to the first view image, and performs affine transformation on the first view image based on the first parallax image to obtain a second view image; the first view image and the second view image are view images of the same shooting scene at different viewing angles. By the method, the electronic equipment can acquire the first view image through the target camera, and perform image affine transformation on the first view image according to the disparity map of the first view image to obtain the second view image matched with the first view image. Thus, view images can be acquired through one camera of the electronic device without a complex binocular data acquisition process and expensive binocular data acquisition equipment, and the generalization of the generated second view images is improved.
It should be understood that, in the embodiment of the present application, the input unit 104 may include a Graphics Processing Unit (GPU) 1041 and a microphone 1042; the graphics processing unit 1041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display, an organic light-emitting diode display, or the like. The user input unit 107 includes at least one of a touch panel 1071 and other input devices 1072. The touch panel 1071, also referred to as a touch screen, may include a touch detection device and a touch controller. Other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described in detail here.
The memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a first storage area storing a program or instructions and a second storage area storing data, where the first storage area may store an operating system and an application program or instructions required for at least one function (such as a sound playing function or an image playing function), and the like. Further, the memory 109 may include volatile memory or non-volatile memory, or the memory 109 may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), a Static RAM (SRAM), a Dynamic RAM (DRAM), a Synchronous DRAM (SDRAM), a Double Data Rate SDRAM (DDR SDRAM), an Enhanced SDRAM (ESDRAM), a Synchronous Link DRAM (SLDRAM), or a Direct Rambus RAM (DRRAM). The memory 109 in the embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
The processor 110 may include one or more processing units; optionally, the processor 110 integrates an application processor, which mainly handles operations relating to the operating system, user interface, and application programs, and a modem processor, such as a baseband processor, which mainly handles wireless communication signals. It will be appreciated that the modem processor may not be integrated into the processor 110.
An embodiment of the present application further provides a readable storage medium storing a program or instructions which, when executed by a processor, implement the processes of the above image generation method embodiment and achieve the same technical effects; to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a computer Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc.
An embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement the processes of the above image generation method embodiment and achieve the same technical effects; to avoid repetition, details are not repeated here.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, a system-on-chip, or the like.
An embodiment of the present application provides a computer program product, where the program product is stored in a storage medium and is executed by at least one processor to implement the processes of the above image generation method embodiment and achieve the same technical effects; to avoid repetition, details are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatuses in the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions substantially simultaneously or in a reverse order; for example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. An image generation method, characterized in that the method comprises:
acquiring a first view image through a target camera;
determining a first parallax image corresponding to the first view image;
performing affine transformation on the first view image based on the first parallax image to obtain a second view image; the first view image and the second view image are view images of the same shooting scene from different perspectives.
2. The method according to claim 1, wherein the first parallax image comprises: a disparity value corresponding to each pixel point in the first view image;
performing affine transformation on the first view image based on the first parallax image to obtain a second view image, including:
performing affine transformation on first position information of each pixel point in the first view image respectively based on a parallax value corresponding to each pixel point in the first view image to obtain second position information;
and generating a second view image based on the second position information and the pixel information of each pixel point.
3. The method according to claim 1, wherein the determining a first parallax image corresponding to the first view image comprises:
generating a second parallax image corresponding to the first view image based on the depth information of the first view image;
and deleting the pixel points of the invalid value area in the second parallax image to obtain the first parallax image.
4. The method according to claim 3, wherein performing affine transformation on the first view image based on the first parallax image to obtain a second view image comprises:
performing affine transformation on the first view image based on the first parallax image to obtain a third view image;
performing image filling on a target image area of the third view image to obtain a second view image;
wherein the target image area is: an image region in the third view image corresponding to an invalid value region in the second parallax image.
5. The method of claim 1, further comprising:
training a binocular stereo matching network by using the first view image and the second view image to obtain the trained binocular stereo matching network;
inputting a fourth view image into the trained binocular stereo matching network, and outputting a target parallax image;
performing affine transformation on the fourth view image based on the target parallax image to obtain a target view image;
wherein the fourth view image and the target view image are view images of the same shooting scene from different perspectives.
6. An image generation apparatus, characterized in that the apparatus comprises: a shooting module and an execution module, wherein:
the shooting module is used for acquiring a first view image through a target camera;
the execution module is used for determining a first parallax image corresponding to the first view image acquired by the shooting module;
the execution module is further configured to perform affine transformation on the first view image based on the first parallax image to obtain a second view image; the first view image and the second view image are view images of the same shooting scene from different perspectives.
7. The apparatus according to claim 6, wherein the first parallax image comprises: a disparity value corresponding to each pixel point in the first view image;
the execution module is specifically configured to perform affine transformation on first position information of each pixel point in the first view image respectively based on a disparity value corresponding to each pixel point in the first view image, so as to obtain second position information;
the execution module is specifically configured to generate a second view image based on the second position information and the pixel information of each pixel point.
8. The apparatus of claim 6,
the execution module is specifically configured to generate a second parallax image corresponding to the first view image based on the depth information of the first view image;
the execution module is specifically configured to delete a pixel point of an invalid value region in the second parallax image, so as to obtain the first parallax image.
9. The apparatus of claim 8,
the execution module is specifically configured to perform affine transformation on the first view image based on the first parallax image to obtain a third view image;
the execution module is specifically configured to perform image filling on a target image area of the third view image to obtain a second view image;
wherein the target image area is: an image region in the third view image corresponding to an invalid value region in the second parallax image.
10. The apparatus of claim 6, further comprising: a training module and a processing module, wherein:
the training module is configured to train a binocular stereo matching network using the first view image acquired by the shooting module and the second view image obtained by the execution module, so as to obtain the trained binocular stereo matching network;
the processing module is used for inputting a fourth view image into the trained binocular stereo matching network and outputting a target parallax image;
the execution module is further configured to perform affine transformation on the fourth view image based on the target parallax image to obtain a target view image;
wherein the fourth view image and the target view image are view images of the same shooting scene from different perspectives.
CN202210410031.9A 2022-04-19 2022-04-19 Image generation method and device Pending CN114881841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210410031.9A CN114881841A (en) 2022-04-19 2022-04-19 Image generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210410031.9A CN114881841A (en) 2022-04-19 2022-04-19 Image generation method and device

Publications (1)

Publication Number Publication Date
CN114881841A true CN114881841A (en) 2022-08-09

Family

ID=82672541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210410031.9A Pending CN114881841A (en) 2022-04-19 2022-04-19 Image generation method and device

Country Status (1)

Country Link
CN (1) CN114881841A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024056020A1 (en) * 2022-09-14 2024-03-21 北京字跳网络技术有限公司 Binocular image generation method and apparatus, electronic device and storage medium
CN117061720A (en) * 2023-10-11 2023-11-14 广州市大湾区虚拟现实研究院 Stereo image pair generation method based on monocular image and depth image rendering
CN117061720B (en) * 2023-10-11 2024-03-01 广州市大湾区虚拟现实研究院 Stereo image pair generation method based on monocular image and depth image rendering


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination