WO2022100419A1 - Image processing method and related device - Google Patents

Image processing method and related device

Info

Publication number
WO2022100419A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
historical
expanded
pose
training
Prior art date
Application number
PCT/CN2021/126000
Other languages
English (en)
French (fr)
Inventor
邹纯稳
刘建滨
陈江龙
刘超
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022100419A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/04 Texture mapping
    • G06T15/50 Lighting effects
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/006 Mixed reality
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Definitions

  • the embodiments of the present application relate to the field of computer vision, and in particular, to an image processing method and related devices.
  • Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis, and military applications. It is concerned with how to obtain the data and information of the photographed subject that we need. Figuratively speaking, it means installing eyes (cameras/camcorders) and a brain (algorithms) on the computer so that it can identify, track, and measure targets in place of the human eye, enabling the computer to perceive the environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make artificial systems "perceive" from images or multidimensional data. In general, computer vision uses various imaging systems in place of the visual organ to obtain input information, and then uses the computer in place of the brain to process and interpret that input information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, and to adapt to the environment autonomously.
  • the determination of ambient light parameters is particularly important for object surface shading and ambient reflection.
  • Embodiments of the present application provide an image processing method and related device, which can improve the quality of subsequent ambient light rendering.
  • a first aspect of the embodiments of the present application provides an image processing method, which may be executed by an image processing apparatus or by a component of an image processing apparatus (for example, a processor, a chip, or a chip system). The image processing apparatus may be a local device (e.g., a mobile phone or a camera) or a cloud device, and the method may also be executed jointly by the local device and the cloud device.
  • the method includes: acquiring a first image and a second image, where the first image and the second image are images collected from the same scene at different viewing angles; performing mapping processing on the first image and the second image based on a spatial mapping model to obtain a first expanded image and a second expanded image; fusing the first expanded image and the second expanded image to obtain a third image; and inputting the third image into a trained image prediction network for image prediction to obtain a predicted image, where the predicted image is used for ambient light rendering of a virtual object in the aforementioned scene.
  • By mapping images from multiple viewing angles onto the spatial mapping model and fusing the resulting expanded images, the predicted image output by the image prediction network carries more texture, which improves the ambient light rendering quality of virtual objects in the subsequent scene.
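  • As a minimal, self-contained illustration of this acquire-map-fuse-predict flow (the 2-D "pose" offsets, the mean-colour "prediction", and all function names below are toy assumptions made for the sketch, not details taken from the publication):

```python
import numpy as np

def map_to_canvas(img, offset, canvas_shape):
    """Toy stand-in for mapping an image onto the spatial mapping model:
    the image is placed into a wider canvas at a pose-dependent offset."""
    canvas = np.zeros(canvas_shape, dtype=np.float32)
    h, w = img.shape[:2]
    y, x = offset
    canvas[y:y + h, x:x + w] = img
    return canvas

def fuse(expanded_images):
    """Toy fusion: for each pixel, keep the value from whichever expanded
    image has content there (non-zero), averaging where the views overlap."""
    stack = np.stack(expanded_images)
    valid = (stack > 0).astype(np.float32)
    counts = np.maximum(valid.sum(axis=0), 1.0)
    return stack.sum(axis=0) / counts

def predict(third_image):
    """Toy stand-in for the trained image prediction network: empty pixels
    are filled with the mean colour of the observed pixels."""
    out = third_image.copy()
    observed = out[out > 0]
    out[out == 0] = observed.mean() if observed.size else 0.0
    return out

# Two single-channel "images" of the same scene from different viewing angles.
first_image = np.full((4, 6), 0.8, dtype=np.float32)
second_image = np.full((4, 6), 0.4, dtype=np.float32)

first_expanded = map_to_canvas(first_image, (0, 0), (4, 10))
second_expanded = map_to_canvas(second_image, (0, 4), (4, 10))
third_image = fuse([first_expanded, second_expanded])   # wider view than either input
predicted_image = predict(third_image)                  # used for ambient light rendering
print(predicted_image.shape, predicted_image.min(), predicted_image.max())
```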
  • performing mapping processing on the first image and the second image based on the spatial mapping model to obtain the first expanded image and the second expanded image includes: constructing a spatial mapping model according to the optical center of the first device, where the first device is the device that collects the first image and the second device is the device that collects the second image; determining the first projection area of the first image on the spatial mapping model according to the first pose of the first device, where the first pose is the pose when the first device collects the first image; mapping each pixel in the first image to the first projection area to obtain the first expanded image; determining the second projection area of the second image on the spatial mapping model according to the second pose of the second device, where the second pose is the pose when the second device collects the second image; and mapping each pixel in the second image to the second projection area to obtain the second expanded image.
  • A spatial mapping model is constructed using the pose at which the first device collects the first image, subsequent images are mapped onto the spatial mapping model based on their poses, and texture mapping and fusion are then performed, so that the obtained texture information is similar to that of the captured images; the third image input to the subsequent image prediction network therefore carries more texture information, which improves the realism of ambient light rendering.
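  • A hedged sketch of this mapping step, assuming a pinhole camera with a known intrinsic matrix and a camera-to-world rotation derived from the pose (the intrinsics and the equirectangular target below are illustrative placeholders; the publication does not fix the concrete form of the spatial mapping model):

```python
import numpy as np

def expand_to_panorama(image, K, R, pano_h=256, pano_w=512):
    """Map each pixel of `image` onto an equirectangular panorama whose sphere
    is centred at the camera optical centre (the spatial mapping model assumed
    here); R is the camera-to-world rotation taken from the pose."""
    h, w = image.shape[:2]
    panorama = np.zeros((pano_h, pano_w, image.shape[2]), dtype=image.dtype)

    # Pixel grid -> normalised camera rays.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = (np.linalg.inv(K) @ pix.T).T
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)

    # Rotate rays into the world frame using the pose.
    world = rays @ R.T
    x, y, z = world[:, 0], world[:, 1], world[:, 2]

    # Direction -> longitude/latitude -> panorama pixel (the projection area).
    lon = np.arctan2(x, z)                 # [-pi, pi]
    lat = np.arcsin(np.clip(y, -1, 1))     # [-pi/2, pi/2]
    pu = ((lon + np.pi) / (2 * np.pi) * (pano_w - 1)).astype(int)
    pv = ((lat + np.pi / 2) / np.pi * (pano_h - 1)).astype(int)

    panorama[pv, pu] = image.reshape(-1, image.shape[2])
    return panorama

# Made-up intrinsics and an identity pose, just to exercise the function.
K = np.array([[300.0, 0.0, 160.0],
              [0.0, 300.0, 120.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
first_image = np.random.rand(240, 320, 3).astype(np.float32)
first_expanded = expand_to_panorama(first_image, K, R)
print(first_expanded.shape)  # (256, 512, 3)
```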
  • the above steps further include: acquiring a historical image and a historical pose of the historical image from a server, where the historical image is an image collected before the moment at which the first image or the second image is collected, the historical pose is the pose when the historical device collects the historical image, and the historical image stores texture information of the same position as the first image and/or the second image; determining the historical projection area of the historical image on the spatial mapping model according to the historical pose; and mapping each pixel in the historical image into the historical projection area to obtain a historical expanded image. Fusing the first expanded image and the second expanded image to obtain the third image then includes: fusing the first expanded image, the second expanded image, and the historical expanded image to obtain the third image.
  • In this way, the third image can be obtained by also fusing historical images stored in the cloud, so that texture information of the scene stored in the cloud can be referred to, which improves the texture details of the subsequent predicted image and the quality of ambient light rendering.
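  • A minimal sketch of such a fusion, assuming each expanded image carries a validity mask marking its projection area and that current views are simply preferred over the historical view where they overlap (a design choice made here purely for illustration):

```python
import numpy as np

def fuse_expanded(expanded_images, masks):
    """Fuse several expanded images into one third image: each pixel takes the
    value of the first expanded image in the list whose mask covers it, so the
    current views listed first take priority over the historical view."""
    fused = np.zeros_like(expanded_images[0])
    filled = np.zeros(masks[0].shape, dtype=bool)
    for img, mask in zip(expanded_images, masks):
        take = mask & ~filled          # only fill pixels not yet covered
        fused[take] = img[take]
        filled |= mask
    return fused

# Three partially-filled panoramas: first, second, and historical expanded images.
h, w = 64, 128
first_exp, second_exp, hist_exp = (np.zeros((h, w), np.float32) for _ in range(3))
first_mask, second_mask, hist_mask = (np.zeros((h, w), bool) for _ in range(3))
first_exp[:, :60], first_mask[:, :60] = 0.9, True
second_exp[:, 50:100], second_mask[:, 50:100] = 0.5, True
hist_exp[:, :], hist_mask[:, :] = 0.2, True        # cloud history covers everything

third_image = fuse_expanded([first_exp, second_exp, hist_exp],
                            [first_mask, second_mask, hist_mask])
print(np.unique(third_image))  # 0.9 / 0.5 where current views exist, 0.2 elsewhere
```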
  • the trained image prediction network in the above steps is obtained by training the image prediction network with the training image as its input and with reducing the value of the loss function below a first threshold as the target; the loss function indicates the difference between the image output by the image prediction network and a third target image, where the third target image is a collected image.
  • the training process of the image prediction network is implemented through the training image and the third target image, so as to provide a better-optimized image prediction network for subsequent use and to improve the fineness of the output image (i.e., the predicted image).
  • the weight of the loss function in the above steps is controlled by the mask image corresponding to the training image.
  • When the weight of the loss function is controlled by the mask image, the weight of the area containing the scene is 1 and the weight of the area without the scene is 0, so that the invalid part is removed and interference from the invalid area is reduced.
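  • A small sketch of such a mask-weighted loss in plain NumPy (the L1 distance is chosen here only for illustration; the publication does not fix the exact distance measure):

```python
import numpy as np

def masked_loss(predicted, target, mask):
    """Per-pixel weight is 1 inside the valid (scene) area and 0 in the
    invalid area, so invalid pixels contribute nothing to the loss."""
    weight = mask.astype(np.float32)            # 1 = scene area, 0 = no scene
    diff = np.abs(predicted - target)
    return (weight * diff).sum() / np.maximum(weight.sum(), 1.0)

pred = np.random.rand(8, 16).astype(np.float32)
gt = np.random.rand(8, 16).astype(np.float32)
mask = np.zeros((8, 16), dtype=bool)
mask[:, :10] = True                             # only the left part contains scene
print(masked_loss(pred, gt, mask))
```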
  • the above steps further include: acquiring spherical harmonic coefficients of the predicted image; and performing ambient light rendering on the virtual object by using the spherical harmonic coefficients.
  • the virtual object can be rendered by acquiring the spherical harmonic coefficient, so that the illumination of the virtual object is more realistic.
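  • A hedged sketch of extracting second-order (9-coefficient) spherical harmonic lighting from an equirectangular predicted image and evaluating it for a surface normal; the SH order and the rendering formula are assumptions here, since the publication does not specify them:

```python
import numpy as np

def sh_basis(x, y, z):
    """Real spherical harmonic basis up to l = 2 (9 terms)."""
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z * z - 1),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ], axis=-1)

def sh_coefficients(panorama):
    """Project an equirectangular (H x W x 3) predicted image onto the SH
    basis, weighting each pixel by its solid angle on the unit sphere."""
    h, w, _ = panorama.shape
    theta = (np.arange(h) + 0.5) / h * np.pi          # polar angle
    phi = (np.arange(w) + 0.5) / w * 2 * np.pi        # azimuth
    phi, theta = np.meshgrid(phi, theta)
    x = np.sin(theta) * np.cos(phi)
    y = np.sin(theta) * np.sin(phi)
    z = np.cos(theta)
    basis = sh_basis(x, y, z)                         # (H, W, 9)
    d_omega = np.sin(theta) * (np.pi / h) * (2 * np.pi / w)
    weighted = panorama * d_omega[..., None]          # (H, W, 3)
    return np.einsum('hwc,hwk->kc', weighted, basis)  # (9, 3) coefficients

def shade(normal, coeffs):
    """Evaluate the SH lighting for one unit surface normal (simple ambient term)."""
    b = sh_basis(*normal)
    return b @ coeffs                                 # RGB ambient light for this normal

pano = np.random.rand(64, 128, 3).astype(np.float32)  # stand-in for the predicted image
coeffs = sh_coefficients(pano)
print(shade(np.array([0.0, 0.0, 1.0]), coeffs))        # light arriving around +z
```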
  • the angle of view of the third image in the above steps is larger than that of the first image or the second image.
  • An image with a larger viewing angle is obtained by inputting images with a small viewing angle into the image prediction network, which is beneficial for enlarging the area covered by subsequent ambient light rendering.
  • a second aspect of the embodiments of the present application provides an image processing apparatus, where the image processing apparatus may be a local device (for example, a mobile phone, a camera, etc.) or a cloud device.
  • the image processing device includes:
  • an acquisition unit configured to acquire a first image and a second image, where the first image and the second image are images collected from the same scene and different viewing angles;
  • a mapping unit, configured to perform mapping processing on the first image and the second image based on the spatial mapping model to obtain the first expanded image and the second expanded image;
  • a fusion unit, configured to fuse the first expanded image and the second expanded image to obtain a third image;
  • a prediction unit, configured to input the third image into the trained image prediction network for image prediction to obtain a predicted image, where the predicted image is used for ambient light rendering.
  • the mapping unit in the image processing apparatus includes:
  • a construction subunit for constructing a spatial mapping model according to an optical center of a first device, where the first device is a device for collecting a first image, and the second device is a device for collecting a second image;
  • a determination subunit configured to determine the first projection area of the first image on the spatial mapping model according to the first pose of the first device, where the first pose is the pose when the first device collects the first image;
  • a mapping subunit, configured to map each pixel in the first image to the first projection area to obtain the first expanded image;
  • the determining subunit is further configured to determine the second projection area of the second image on the spatial mapping model according to the second pose of the second device, where the second pose is the pose when the second device collects the second image;
  • the mapping subunit is further configured to map each pixel in the second image to the second projection area to obtain a second expanded image.
  • the acquisition unit of the above-mentioned image processing apparatus is further configured to acquire a historical image and the historical pose of the historical image from the cloud, where the historical pose is the pose when the historical device collects the historical image, and the historical image stores texture information of the same position as the first image and/or the second image;
  • the determining subunit is also used to determine the historical projection area of the historical image on the spatial mapping model according to the historical pose,
  • the mapping subunit is also used to map each pixel in the historical image to the historical projection area to obtain the historical expanded image;
  • the fusion unit is specifically configured to fuse the first expanded image, the second expanded image, and the historical expanded image to obtain a third image.
  • the above-mentioned trained image prediction network is obtained by training the image prediction network with the training image as its input and with reducing the value of the loss function below the first threshold as the target; the loss function indicates the difference between the image output by the image prediction network and the third target image, where the third target image is a collected image.
  • the weight of the above-mentioned loss function is controlled by the mask image corresponding to the training image.
  • the acquisition unit of the image processing apparatus is further configured to acquire spherical harmonic coefficients of the predicted image
  • the above-mentioned image processing device also includes:
  • the rendering unit is used to perform ambient light rendering on virtual objects using spherical harmonic coefficients.
  • the angle of view of the third image is larger than that of the first image or the second image.
  • a third aspect of the embodiments of the present application provides an image processing apparatus, where the image processing apparatus may be a mobile phone, a video camera, or a cloud device (such as a server), and the image processing apparatus executes the method in the foregoing first aspect or any possible implementation of the first aspect.
  • a fourth aspect of the embodiments of the present application provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a computer program or instruction, so that the chip implements the method in the foregoing first aspect or any possible implementation of the first aspect.
  • a fifth aspect of the embodiments of the present application provides a computer-readable storage medium in which an instruction is stored; when the instruction is executed on a computer, the computer is caused to execute the method in the foregoing first aspect or any possible implementation of the first aspect.
  • a sixth aspect of the embodiments of the present application provides a computer program product, which, when executed on a computer, enables the computer to execute the method in the foregoing first aspect or any possible implementation manner of the first aspect.
  • a seventh aspect of the embodiments of the present application provides an image processing apparatus, including: a processor coupled to a memory, where the memory is used to store programs or instructions, and when the programs or instructions are executed by the processor, the image processing apparatus implements the method in the above first aspect or any possible implementation of the first aspect.
  • the embodiments of the present application have the following advantages: a first image and a second image are acquired; mapping processing is performed on the first image and the second image based on a spatial mapping model to obtain a first expanded image and a second expanded image; the first expanded image and the second expanded image are fused to obtain a third image; and the third image is input into a trained image prediction network for image prediction to obtain a predicted image, which is used for ambient light rendering.
  • FIG. 1 is a schematic structural diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of another convolutional neural network provided by an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a training method for an image prediction model provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a training image provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a mask image corresponding to a training image provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a third target image provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an output image provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of an image prediction model provided by an embodiment of the application.
  • FIG. 11 is a schematic structural diagram of another image prediction model provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of another image prediction model provided by an embodiment of the present application.
  • FIG. 13 is a schematic flowchart of an image processing method provided by an embodiment of the present application.
  • FIG. 14A is a schematic diagram of overlapping two viewing angles according to an embodiment of the present application.
  • FIG. 14B is another schematic diagram of overlapping two viewing angles according to an embodiment of the present application.
  • FIG. 15A is a schematic diagram of a first image provided by an embodiment of the present application.
  • FIG. 15B is a schematic diagram of a second image provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of a spatial mapping model constructed based on a first image provided by an embodiment of the present application.
  • FIG. 17 is a schematic diagram of mapping a second image onto the spatial mapping model provided by an embodiment of the present application.
  • FIG. 18 is a schematic diagram of a second expanded image provided by an embodiment of the present application.
  • FIG. 19 is a schematic diagram of a third image provided by an embodiment of the present application.
  • FIG. 20 is a schematic diagram of vertical fusion of two images according to an embodiment of the present application.
  • FIG. 21 is a schematic diagram of a historical image provided by an embodiment of the present application.
  • FIG. 22 is another schematic diagram of a third image or a fourth image provided by an embodiment of the present application.
  • FIGS. 23-26 are several schematic diagrams of user interfaces provided by embodiments of the present application.
  • FIG. 27 is a schematic structural diagram of another image prediction model provided by an embodiment of the present application.
  • FIG. 28 is a schematic structural diagram of another image prediction model provided by an embodiment of the present application.
  • FIG. 29 is a schematic diagram of a predicted image provided by an embodiment of the present application.
  • FIG. 30 is a schematic structural diagram of another image prediction model provided by an embodiment of the present application.
  • FIG. 31A is a schematic diagram of a unit sphere model of a predicted image provided by an embodiment of the present application.
  • FIG. 31B is a schematic diagram of illumination recovered from spherical harmonic coefficients provided by an embodiment of the present application.
  • FIG. 32 is a schematic structural diagram of an image processing apparatus provided by an embodiment of the present application.
  • FIG. 33 is another schematic structural diagram of an image processing apparatus provided by an embodiment of the present application.
  • FIG. 34 is another schematic structural diagram of an image processing apparatus provided by an embodiment of the present application.
  • the image processing method provided by the embodiments of the present application can be applied to augmented reality (AR), game production, film production, and other scenes that require ambient light rendering.
  • the following is a brief introduction to the AR scene and the movie production scene.
  • AR technology is a new technology developed on the basis of virtual reality. It increases the user's perception of the real world by using information provided by a computer system, and superimposes computer-generated virtual objects, scenes, or system prompt information onto the real scene, thereby realizing the "augmentation" of reality and "seamlessly" integrating real-world information with virtual-world information. Therefore, how well the rendering effect of virtual objects is coordinated with the environment is of great significance to the user experience of AR products. Rendering virtual objects using light estimation is an important part of "seamless" AR. With the image processing method provided by the embodiments of the present application, the illumination of virtual objects in an AR scene can be made more realistic.
  • Lighting capture of unreal shots in film production requires estimating the lighting conditions of the real scene, so that the unreal shots in the film are more realistic and can present the shading, shadow and reflection effects of the real scene.
  • With the image processing method provided by the embodiments of the present application, the illumination of the unreal shots in film production can be made more realistic.
  • a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes inputs x_s and an intercept of 1; the output of the operation unit can be h_{W,b}(x) = f(Σ_s W_s·x_s + b), where W_s is the weight of x_s and b is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
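  • In code, a single neural unit of this form (with the sigmoid chosen as the activation f) is simply:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neural_unit(x, w, b):
    """Output of one neural unit: f(sum_s W_s * x_s + b)."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x_s
w = np.array([0.8, 0.1, -0.4])   # weights W_s
print(neural_unit(x, w, b=0.3))  # this value feeds the next layer
```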
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • A deep neural network (DNN) is also known as a multi-layer neural network.
  • The layers of a DNN can be divided into three categories: the input layer, hidden layers, and the output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the middle layers are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • the coefficient from the kth neuron of the L-1 layer to the jth neuron of the Lth layer is defined as W jk L .
  • the input layer does not have a W parameter.
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrices, and its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers.
  • the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolution with an input image or a convolutional feature map using a trainable filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can only be connected to some of its neighbors.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle.
  • Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of the image are the same as those of other parts, so image information learned in one part can also be used in another part; the same learned image information can therefore be used for all positions on the image.
  • multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
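  • A minimal NumPy sketch of a single-channel convolution, showing that one and the same kernel (the shared weights) is applied at every position of the image:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid convolution (cross-correlation) of a 2-D image with one shared kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # same weights at every location
    return out

image = np.random.rand(6, 6).astype(np.float32)
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])       # one kernel extracts one kind of feature
print(conv2d(image, edge_kernel, stride=1).shape)  # (4, 4) feature map
```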
  • the convolutional neural network can use the error back propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward-propagating the input signal to the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
  • the back-propagation algorithm is a back-propagation motion dominated by the error loss, aiming to obtain the parameters of the optimal super-resolution model, such as the weight matrix.
  • A generative adversarial network (GAN) is a deep-learning model. The model includes at least two modules: one is the generative model and the other is the discriminative model, and the two modules learn from each other to produce better outputs.
  • Both the generative model and the discriminative model can be neural networks, specifically, deep neural networks or convolutional neural networks.
  • The basic principle of a GAN is as follows. Taking a GAN that generates pictures as an example, suppose there are two networks, G (Generator) and D (Discriminator): G is a network that generates pictures; it receives a random noise z and generates a picture from this noise, denoted G(z). D is a discriminative network used to determine whether a picture is "real".
  • Its input parameter is x, where x represents a picture, and the output D(x) represents the probability that x is a real picture: a value of 1 means the picture is certainly real, and a value of 0 means it cannot be real.
  • The goal of the generative network G is to generate pictures that are as realistic as possible so as to deceive the discriminative network D, while the goal of D is to distinguish the pictures generated by G from real pictures as well as possible. In this way, G and D constitute a dynamic "game" process, namely the "adversarial" part of the "generative adversarial network".
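  • The adversarial objective described above can be written down directly; a small sketch of the two losses, with the discriminator's outputs shown as plain probabilities (a real setup would of course compute them with neural networks):

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """D wants D(x) -> 1 for real pictures and D(G(z)) -> 0 for generated ones."""
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-8):
    """G wants the discriminator to believe its pictures are real: D(G(z)) -> 1."""
    return -np.mean(np.log(d_fake + eps))

d_real = np.array([0.9, 0.8, 0.95])   # D's scores on real pictures
d_fake = np.array([0.2, 0.1, 0.3])    # D's scores on pictures produced by G
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```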
  • the pixel value of the image can be a red-green-blue (RGB) color value, and the pixel value can be a long integer representing the color.
  • the pixel value is 256*Red + 100*Green + 76*Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, a smaller value means lower brightness and a larger value means higher brightness.
  • the pixel values can be grayscale values.
  • the encoder is used to extract the features of the input image.
  • the encoder may employ a neural network, eg, a convolutional neural network.
  • the decoder is used to restore the extracted features into an image.
  • the decoder may employ a neural network, eg, a convolutional neural network.
  • the upsampling in the decoder may use, for example, bilinear interpolation (bilinear), deconvolution (transposed convolution), or unpooling.
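  • As an example of the first option, bilinear upsampling of a feature map can be sketched in a few lines of NumPy:

```python
import numpy as np

def bilinear_upsample(feat, scale=2):
    """Upsample a 2-D feature map by bilinear interpolation."""
    h, w = feat.shape
    out_h, out_w = h * scale, w * scale
    # Coordinates of the output grid mapped back into the input grid.
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = feat[np.ix_(y0, x0)] * (1 - wx) + feat[np.ix_(y0, x1)] * wx
    bottom = feat[np.ix_(y1, x0)] * (1 - wx) + feat[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bottom * wy

feat = np.arange(16, dtype=np.float32).reshape(4, 4)
print(bilinear_upsample(feat).shape)  # (8, 8)
```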
  • an embodiment of the present invention provides a system architecture 100 .
  • the data collection device 160 is used to collect training data
  • the training data in this embodiment of the present application includes: training images.
  • the training image and the first image and/or the second image are images collected under the same scene.
  • the training data is stored in the database 130 , and the training device 120 obtains the target model/rule 101 through training based on the training data maintained in the database 130 .
  • the first embodiment will be used to describe in more detail how the training device 120 obtains the target model/rule 101 based on the training data.
  • the target model/rule 101 can be used to implement the image processing method provided by the embodiments of the present application; the predicted image can be obtained by inputting the data into the target model/rule 101 after relevant preprocessing.
  • the target model/rule 101 in the embodiment of the present application may specifically be an image prediction network.
  • the image prediction network is obtained by training training images.
  • the training data maintained in the database 130 may not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • the training device 120 does not necessarily train the target model/rule 101 entirely based on the training data maintained by the database 130, and may also obtain training data from the cloud or elsewhere for model training; the above description should not be taken as a limitation on the embodiments of the present application.
  • the target model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1; the execution device 110 may be a notebook computer, an AR/VR device, a vehicle-mounted terminal, or the like, and may also be a server or a cloud device.
  • the execution device 110 is configured with an I/O interface 112, which is used for data interaction with external devices.
  • The user can input data to the I/O interface 112 through the client device 140. The input data in this embodiment of the present application may include the first image and the second image, which may be input by the user or uploaded by the user through a photographing device, and of course may also come from a database, which is not specifically limited here.
  • the preprocessing module 113 is configured to perform preprocessing according to the input data (such as the first image and the second image) received by the I/O interface 112.
  • the preprocessing module 113 may be configured to map the first image and the second image to obtain the first expanded image and the second expanded image.
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processing, the execution device 110 can call the data, code, etc. in the data storage system 150 for corresponding processing, and the data and instructions obtained by such processing may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the predicted image obtained as described above, to the client device 140, so as to be provided to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired result.
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
  • Alternatively, the I/O interface 112 may directly store the input data input into the I/O interface 112 and the output result of the I/O interface 112, as shown in the figure, as new sample data in the database 130.
  • FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship among the devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110 , and in other cases, the data storage system 150 may also be placed in the execution device 110 .
  • A target model/rule 101 is obtained by training with the training device 120; in this embodiment of the present application, the target model/rule 101 may be the image prediction network, and the image prediction network may be a convolutional neural network.
  • CNN is a very common neural network
  • a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture; a deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
  • a convolutional neural network (CNN) 100 may include an input layer 110 , a convolutional/pooling layer 120 , where the pooling layer is optional, and a neural network layer 130 .
  • the convolutional/pooling layer 120 may include layers 121-126 as examples.
  • In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, 121 and 122 are convolutional layers, 123 is a pooling layer, 124 and 125 are convolutional layers, and 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 121 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, and this weight matrix is usually predefined. In the process of convolving an image, the weight matrix is usually processed one pixel after another (or two pixels after two pixels, and so on, depending on the value of the stride) along the horizontal direction of the input image, thereby completing the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same dimensions are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on. The dimensions of the multiple weight matrices are the same, so the dimensions of the feature maps extracted by these weight matrices are also the same, and the multiple extracted feature maps of the same dimensions are then combined to form the output of the convolution operation.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
  • Compared with the initial convolutional layer (for example, 121), the features extracted by the later convolutional layers become more and more complex, such as high-level semantic features.
  • The layers 121-126 illustrated by 120 in FIG. 2 may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the average value of the pixel values in the image within a certain range.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
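  • A small sketch of both pooling operators on a 2-D feature map (a 2x2 window with stride 2 is assumed for illustration):

```python
import numpy as np

def pool2d(feat, size=2, mode="max"):
    """Max or average pooling with a square window and matching stride."""
    h, w = feat.shape
    feat = feat[:h - h % size, :w - w % size]            # drop ragged border
    blocks = feat.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

feat = np.arange(16, dtype=np.float32).reshape(4, 4)
print(pool2d(feat, mode="max"))    # each output pixel = max of a 2x2 sub-region
print(pool2d(feat, mode="avg"))    # each output pixel = average of a 2x2 sub-region
```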
  • After being processed by the convolutional layers/pooling layers 120, the convolutional neural network 100 is not yet able to output the required output information, because, as mentioned before, the convolutional layers/pooling layers 120 only extract features and reduce the parameters brought by the input image. In order to generate the final output information (the required class information or other relevant information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a set of outputs of the desired number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 2) and the output layer 140; the parameters contained in the multiple hidden layers may be pre-trained based on training data relevant to the specific task type, and the task type may include, for example, image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the neural network layer 130 comes the output layer 140, which is the last layer of the entire convolutional neural network 100; the output layer 140 has a loss function similar to the categorical cross-entropy and is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 100 (the propagation from 110 to 140 in FIG. 2) is completed, back propagation (the propagation from 140 to 110 in FIG. 2) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the network through the output layer and the ideal result.
  • the convolutional neural network 100 shown in FIG. 2 is only used as an example of a convolutional neural network.
  • the convolutional neural network may also exist in the form of other network models, for example, a network in which multiple convolutional layers/pooling layers are arranged in parallel as shown in FIG. 3, with the separately extracted features all fed into the neural network layer 130 for processing.
  • FIG. 4 is a hardware structure of a chip according to an embodiment of the present invention, where the chip includes a neural network processor 40 .
  • the chip can be set in the execution device 110 as shown in FIG. 1 to complete the calculation work of the calculation module 111 .
  • the chip can also be set in the training device 120 as shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101 .
  • the algorithms of each layer in the convolutional neural network shown in Figure 2 can be implemented in the chip shown in Figure 4.
  • the neural network processor 40 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or the like, all of which are suitable for large-scale processing.
  • NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and tasks are allocated by the main CPU.
  • the core part of the NPU is the operation circuit 403, and the controller 404 controls the operation circuit 403 to extract the data in the memory (weight memory or input memory) and perform operations.
  • the operation circuit 403 includes multiple processing units (Process Engine, PE).
  • In some implementations, the arithmetic circuit 403 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 403 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 402 and buffers it on each PE in the operation circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 401 to perform matrix operation, and stores the partial result or final result of the matrix in the accumulator 408 accumulator.
  • the vector calculation unit 407 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and the like.
  • the vector calculation unit 407 can be used for network calculation of non-convolutional/non-FC layers in the neural network, such as pooling (Pooling), batch normalization (Batch Normalization), local response normalization (Local Response Normalization), etc. .
  • the vector computation unit 407 can store the processed output vectors to the unified buffer 406 .
  • the vector calculation unit 407 may apply a nonlinear function to the output of the arithmetic circuit 403, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 407 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as an activation input to the arithmetic circuit 403, such as for use in subsequent layers in a neural network.
  • Unified memory 406 is used to store input data and output data.
  • The direct memory access controller (DMAC) 405 transfers the input data in the external memory to the input memory 401 and/or the unified memory 406, stores the weight data in the external memory into the weight memory 402, and stores the data in the unified memory 406 into the external memory.
  • the bus interface unit (bus interface unit, BIU) 410 is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory 409 through the bus.
  • An instruction fetch buffer 409 connected to the controller 404 is used to store the instructions used by the controller 404.
  • the controller 404 is used for invoking the instructions cached in the memory 409 to control the working process of the operation accelerator.
  • the unified memory 406, the input memory 401, the weight memory 402, and the instruction fetch memory 409 are all on-chip memories, while the external memory is memory outside the NPU; the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • each layer in the convolutional neural network shown in FIG. 2 or FIG. 3 may be performed by the operation circuit 403 or the vector calculation unit 407 .
  • the embodiment of the present application provides an image processing method, which can improve the quality of ambient light rendering.
  • the training method 500 of the image prediction network will be introduced in detail with reference to FIG. 5 .
  • the method shown in FIG. 5 can be performed by a training device of an image prediction network.
  • the training device of the image prediction network may be a cloud service device or a terminal device, for example, a computer, a server, or another device whose computing power is sufficient to perform the training of the image prediction network; it may also be a system composed of cloud service devices and terminal devices.
  • the method 500 may be performed by the training device 120 in FIG. 1 or by the neural network processor 40 in FIG. 4.
  • the method 500 may be processed by the CPU, or may be jointly processed by the CPU and the GPU, or other processors suitable for neural network computing may be used without using the GPU, which is not limited in this application.
  • the method 500 includes steps 501 and 502 . Steps 501 and 502 are described in detail below.
  • Step 501 acquiring training images.
  • the training image may be a partial view image selected from a captured panoramic image, or may be obtained by fusing at least two images captured by a terminal device, which is not specifically limited here.
  • During training, a training sample can be generated from a panoramic image. The training sample includes: an input image (that is, the above training image; for example, the training image is shown in FIG. 6), a mask image corresponding to the input image (for example, the mask image corresponding to the training image is shown in FIG. 7), and a third target image. The third target image can be a collected panoramic image or an image selected from the panoramic image, and its viewing angle range is larger than that of the training image; for example, if the third target image is a panoramic image, the third target image is shown in FIG. 8, and the training image shown in FIG. 6 can be the partial view selected from the panoramic image shown in FIG. 8.
  • the mask image in the embodiment of the present application can be used to distinguish the valid area from the invalid area, and the mask image can also be understood as a black and white image, and black and white represent different areas, for example: the scene part in the training image is the valid area ( or white area), the area other than the scene is invalid area (black area).
  • the acquired training image may be in the same scene as the first image and/or the second image, or may be in a different scene.
  • the explanation of the same scene can refer to the follow-up.
  • the number of acquired training images may be one or more, which is not specifically limited here.
  • Step 502 take the training image as the input of the image prediction network, train the image prediction network with the value of the loss function less than the first threshold as the target, and obtain a trained image prediction network.
  • the loss function is used to indicate the difference between the output image of the image prediction network (for example, the output image is shown in Figure 9) and the third target image.
  • the image prediction network is trained with the goal of reducing the value of the loss function, that is, the difference between the output image of the image prediction network and the third target image is continuously reduced.
  • This training process can be understood as a prediction task.
  • the loss function can be understood as the loss function corresponding to this prediction task, and the viewing angle range of the output image is larger than that of the input image.
  • the penalty weight in the loss function is controlled by the mask image corresponding to the training image.
  • A weight, which is either 0 or 1, is added before the generic loss function. The weight corresponding to the area containing the scene in FIG. 6 (also called the valid area) is 1, and the weight corresponding to the area without the scene (also called the invalid area) is 0; that is, the white area in FIG. 7 (the valid area) has a weight of 1 and the black area (the invalid area) has a weight of 0. This is equivalent to the black area not participating in subsequent calculations, which reduces the computing power consumed during training.
  • the weight of the loss function is controlled by the mask image. For example, the weight of the area with the scene is 1, and the weight of the area without the scene is 0, which can remove the invalid part, reduce the interference of the invalid area, and improve the texture details of the output image.
  • the training image is used as the input of the image prediction network, and the image prediction network is trained with the value of the loss function smaller than the first threshold as the target to obtain a trained image prediction network.
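  • A toy sketch of this training target ("train until the value of the loss function drops below the first threshold"); the network, optimiser, and loss are replaced here by trivial placeholders so that the loop stays self-contained, and the mask weighting discussed above is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.random((16, 32)).astype(np.float32)   # stands in for the third target image
pred = np.zeros_like(target)                        # stands in for the network's output
first_threshold = 1e-3
lr = 0.3
step = 0

while True:
    loss = np.mean((pred - target) ** 2)            # difference between output and target
    if loss < first_threshold:                      # the stated training target is reached
        break
    grad = 2.0 * (pred - target)                    # gradient of the loss w.r.t. the output
    pred -= lr * grad                               # stands in for updating network weights
    step += 1

print(step, float(loss))
```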
  • the image prediction network can be a CNN.
  • the image prediction network is shown in Fig. 10, and the image prediction network includes an encoder and a decoder.
  • the encoder can include convolution, activation and pooling.
  • the decoder may include: convolution and upsampling. Of course, the decoder can also include deconvolution.
  • the specific structure of the image prediction network is not limited here.
  • a GAN can be introduced.
  • the generator in GAN is an image prediction network, and the generator generates an output image; the discriminator determines whether the output image is "real".
  • the goal of the generator is to generate as real pictures as possible to deceive the discriminator, and the goal of the discriminator is to try to distinguish the output pictures generated by the generator from the real pictures.
  • the real picture is the third target image.
  • the generator can generate output images that are "real" enough, and it is difficult for the discriminator to determine whether the output images generated by the generator are real or not. This results in an excellent generator that can be used to generate output images.
  • FIG. 12 can be understood as an example of FIG. 11 .
  • the encoder of the image prediction network establishes cross-layer connections with the corresponding layers of the decoder. Because details may be lost after feature extraction, the cross-layer connection provides the image before feature extraction as a reference, so that the result has more texture detail.
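  • A structural sketch of such an encoder-decoder with cross-layer connections (no trained weights; average pooling and nearest-neighbour upsampling stand in for the learned layers, purely to show where the skip connections attach):

```python
import numpy as np

def downsample(x):
    """2x2 average pooling: the 'encoder' step that loses spatial detail."""
    h, w, c = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour upsampling: the 'decoder' step."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def encoder_decoder(image, depth=3):
    skips, feat = [], image
    for _ in range(depth):               # encoder: keep a copy of each resolution level
        skips.append(feat)
        feat = downsample(feat)
    for skip in reversed(skips):         # decoder: upsample and attach the cross-layer copy
        feat = upsample(feat)
        feat = np.concatenate([feat, skip], axis=-1)   # cross-layer connection
    return feat

image = np.random.rand(64, 64, 3).astype(np.float32)
out = encoder_decoder(image)
print(out.shape)  # (64, 64, ...) - the skip copies restore detail lost by the encoder
```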
  • the training process may not adopt the aforementioned method 500 but adopt other training methods, which is not limited here.
  • FIG. 13 is an embodiment of the image processing method in the embodiment of the present application.
  • the method 1300 includes steps 1301 to 1306 .
  • Step 1301 The image processing apparatus acquires a first image and a second image.
  • the image processing apparatus in this embodiment of the present application may be a cloud service device or a terminal device, for example, a computer, a server, or another device whose computing power is sufficient to execute the image processing method; it may also be a system composed of a cloud service device and a terminal device.
  • the image processing apparatus may be the execution device 110 in FIG. 1 or the neural network processor 40 in FIG. 4. The image processing method may be executed by a CPU, jointly by a CPU and a GPU, or by other processors suitable for neural network computation instead of a GPU, which is not specifically limited here.
  • the first image and the second image in the embodiments of the present application are images shot from different viewing angles for the same scene.
  • the first image is an image captured by the first device at a first viewing angle
  • the second image is an image captured by the second device at a second viewing angle.
  • the first device and the second device may be the same device or different devices.
  • the moment when the first device collects the first image and the moment when the second device collects the second image may be the same or different, which is not specifically limited here.
  • the same scene may mean that two images (for example, the first image and the second image) have overlapping picture content; for example, the overlapping content (or area) of the first image and the second image is greater than or equal to 30%.
  • the same scene may mean that the distance between the first position of the device when one image is collected and the position of the second device when another image is collected is less than a certain threshold (for example: the position of the first device when the first image is collected)
  • a certain threshold for example: the position of the first device when the first image is collected
  • the distance from the position where the second device collected the second image is 1 meter, and the threshold is 2 meters, that is, the distance is less than the threshold, then it can be determined that the first image and the second image are images collected in the same field); and/or
  • the overlapping angle of the field of view of the two images eg: the first image and the second image
  • a certain threshold eg: the overlapping angle of the first viewing angle and the second viewing angle is greater than 30 degrees
  • two images are collected
  • the difference in the rotation angles of the devices is less than a certain threshold.
  • the rotation angle may be an angle value rotated by the horizontal angle of the device, or may be an angle value rotated by a top-down angle
  • the above-mentioned overlapping angle may be as shown by the arrows in FIG. 14A or FIG. 14B : the first image is captured by the first device at the first viewing angle, the second image is captured by the second device at the second viewing angle, and the angle over which the two overlap in the middle is called the overlapping angle.
  • the position mentioned above may be a relative position or a geographic position, etc. If the position is a relative position, the relative position of the first device and the second device can be determined by establishing a scene model; if the position is a geographic position, the position of the first device and the position of the second device can be determined based on the global positioning system (GPS) or the Beidou navigation system, and the distance between the two positions is then obtained.
  • the same scene can also be judged according to the light intensity, for example, based on whether the weather type when one image is collected is similar to the weather type when the other image is collected. If it is sunny when the first image is collected and sunny when the second image is collected, it can be determined that the first image and the second image belong to the same scene; if it is sunny when the first image is collected and rainy when the second image is collected, it can be determined that the first image and the second image do not belong to the same scene.
  • the same scene may also mean that the texture similarity between the first image and the second image is greater than or equal to a certain threshold. Generally, this method needs to be determined in conjunction with the other methods mentioned above.
  • the above examples can be determined individually or jointly. For example, after determining that the distance is less than a certain threshold and the weather type is the same, it is determined that the first image and the second image are images collected in the same scene. Or after it is determined that the distance is less than a certain threshold, and the texture similarity of the two images is greater than or equal to a certain threshold, it can be determined that the first image and the second image are images collected in the same scene. Or after it is determined that the distance is less than a certain threshold, it can also be judged whether the overlapping angle of the field angles of the two images is greater than a certain threshold, and if it is greater than the threshold, it is determined that the first image and the second image are images collected in the same scene.
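  • As an illustration of how the above criteria could be combined, the following is a minimal sketch; the function name, thresholds, and the particular combination of criteria are illustrative assumptions only.

```python
import math

def same_scene(pos1, pos2, overlap_angle_deg, rot1_deg, rot2_deg,
               weather1, weather2,
               dist_thresh=2.0, overlap_thresh=30.0, rot_thresh=30.0):
    """Jointly apply several of the criteria described above; thresholds are examples."""
    dist = math.dist(pos1, pos2)                      # metres between capture positions
    close_enough = dist < dist_thresh
    enough_overlap = overlap_angle_deg >= overlap_thresh
    similar_rotation = abs(rot1_deg - rot2_deg) < rot_thresh
    same_weather = weather1 == weather2
    # e.g. require distance + (overlap or rotation) + weather to all hold
    return close_enough and (enough_overlap or similar_rotation) and same_weather

print(same_scene((0, 0, 1.5), (1.0, 0.2, 1.5), 45.0, 10.0, 25.0, "sunny", "sunny"))
```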
  • the above attributes can be stored together with the first image or the second image, and the image processing apparatus can also obtain the corresponding attributes before or after obtaining an image, which makes it convenient to judge whether several images belong to the same scene and then correctly perform the subsequent fusion operation.
  • the image processing apparatus may collect or photograph the first image and the second image from different viewing angles. That is, the image processing apparatus, the first device, and the second device are the same device.
  • the image processing apparatus receives the first image and the second image sent by other devices.
  • the other device may be a camera or an inertial measurement unit (IMU), etc., having an acquisition or shooting function; that is, the image processing apparatus is a different device from the first device or the second device.
  • the image processing apparatus can also acquire images from other perspectives in addition to acquiring the first image and the second image, and the embodiments of the present application only take the first image and the second image as examples for schematic illustration.
  • for illustrative purposes, the device that collects the first image or the second image is described as a camera.
  • the first image is shown in FIG. 15A
  • the second image is shown in FIG. 15B .
  • Step 1302 The image processing apparatus performs mapping processing on the first image and the second image based on the spatial mapping model to obtain the first expanded image and the second expanded image.
  • the pose when the first device collects the first image is called the first pose
  • the pose when the second device collects the second image is called the second pose.
  • the pose can be understood as the position and direction when the device collects the image.
  • the pose can be described by parameters such as six-degree-of-freedom (6-DoF) pose data (the positions along the three X, Y and Z axes and the rotation angles around the three axes) or a transformation matrix.
  • the poses of the first image and the second image in the embodiments of the present application are based on poses in respective spatial coordinate systems, that is, the first image corresponds to one spatial coordinate system, and the second image corresponds to another spatial coordinate system.
  • the role of the spatial mapping model is to map the first image and the second image into the same spatial coordinate system according to the first pose and the second pose, and to expand them to obtain the first expanded image and the second expanded image.
  • the first expanded image and the second expanded image are two-dimensional images based on the same spatial coordinate system, which is beneficial to improving the realism of the subsequent ambient light rendering of virtual objects in the same scene.
  • the following only takes the space mapping model as a spherical model as an example for illustration. It can be understood that the spatial mapping model may also be a cube model, etc., which is not specifically limited here.
  • the first step is to construct the ball model of the scene.
  • a world coordinate system is constructed according to the first image, and a spherical model is constructed with the optical center position of the first device as the origin O, with the projection of the optical axis OF onto the horizontal plane as the X axis, the vertical direction as the Z axis, and the remaining horizontal direction as the Y axis.
  • the center point of the world coordinate system and the position of the optical center of the first device are the same point.
  • the theoretical optical center position is the center of a convex lens, and the actual optical center position may be the virtual center of a combination of multiple convex lenses.
  • the radius of the ball is set as required, for example, the indoor scene can be set to more than 3 meters, and the outdoor scene can be set to more than 10 meters.
  • the specific value of the radius is not limited here.
  • the second step is to determine the projection area.
  • the first projection area of the first image on the spherical model is determined according to the pose of the first image. It can be understood that the position and orientation of the first image in the spherical model are determined according to the first pose, and each pixel of the first image can be mapped onto the spherical model based on the principle of pinhole imaging to obtain the first projection area.
  • the image processing device determines the position and orientation of the second image on the created ball model according to the second pose.
  • the projection mode of the second image is shown in FIG. 17 , where O is the center point of the world coordinate system (which is also the position of the optical center when the first device collects the first image), O 1 is the position of the second device when it collects the second image, and A 1 B 1 C 1 D 1 is the imaging plane of the camera.
  • P 1 is any pixel point in the second image
  • the projection point of P 1 in the second image on the spherical model can be determined as P 2 through the intersection of OP 1 and the surface of the spherical model; the angle of OP 2 relative to the XOZ plane is θ, and the projection of OP 2 on the XOZ plane makes an angle φ with OX.
  • the projection point of P 1 in the second image on the spherical model can also be determined as P 2 through the intersection of O 1 P 1 and the surface of the spherical model, which is not specifically limited here.
  • Other points are similar to the way P 1 determines P 2 , and then the second projection area A 2 B 2 C 2 D 2 of the second image A 1 B 1 C 1 D 1 on the spherical model can be obtained.
  • the third step is texture mapping.
  • the position of the pixel point in the second image in the second projection area can be determined by using the straight line where OP 1 or O 1 P 1 is located.
  • the current pose of the second device is O 1 (x 1 , y 1 , z 1 ), and the rotation angles of the optical axis around the X-axis, the Y-axis and the Z-axis are α, β and γ, respectively.
  • the coordinates of the point P 1 in the camera coordinate system are P cam (x*dx, y*dy, fx*dx); according to the transformation from the camera coordinate system to the world coordinate system, the coordinates P world (x world , y world , z world ) of the point P 1 in the world coordinate system can be obtained.
  • Converting from the camera coordinate system to the world coordinate system is a rigid body transformation (the object does not change, only needs to be rotated and translated), that is, the camera coordinate system can be rotated and translated to obtain the world coordinate system.
  • the correspondence between the attitude angles and the rotation angles is: pitch - α, yaw - β, roll - γ.
  • the straight-line equation of OP 1 can be determined from point O and P world ; the specific calculation is not repeated here.
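  • As an illustration of the camera-to-world rigid-body transformation described above and of the ray from O through P world , the following is a minimal sketch; the rotation order, angle convention and numerical values are assumptions.

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """Rotation built from the angles around the X, Y and Z axes (order assumed X -> Y -> Z)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return rz @ ry @ rx

def cam_to_world(p_cam, o1, alpha, beta, gamma):
    """Rigid-body transform: rotate the camera-frame point, then translate by the camera position O1."""
    return rotation_matrix(alpha, beta, gamma) @ np.asarray(p_cam) + np.asarray(o1)

# Pixel (x, y) of the second image, pixel sizes dx, dy and focal length fx in pixels (illustrative values).
x, y, dx, dy, fx = 320.0, 240.0, 0.002, 0.002, 1000.0
p_cam = np.array([x * dx, y * dy, fx * dx])             # P_cam as described above
p_world = cam_to_world(p_cam, o1=(0.5, 0.1, 0.0), alpha=0.0, beta=0.1, gamma=0.05)
direction = p_world / np.linalg.norm(p_world)           # direction of the ray from O (the origin) through P_world
print(direction)
```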
  • the intersection point between this straight line and the surface of the spherical model on the side close to P 1 is the mapping point P 2 , and the angular coordinates of the point P 2 on the spherical surface are denoted as (θ, φ).
  • the pixel coordinates P 3 (x exp , y exp ) of the point P 2 in the second expanded image can then be obtained from the angular coordinates (θ, φ) through a latitude-longitude expansion.
  • the pixel coordinates of any point in the second image in the second expanded image can be obtained.
  • the way of determining other points is similar to determining P 3 through P 1.
  • the value of each pixel in the second image is filled into the second expanded image, so that the second expanded image has a texture effect similar to that of the second image.
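  • As an illustration of mapping a ray direction to the point P 2 on the sphere and then to pixel coordinates in the expanded image, the following is a minimal sketch; since the expansion formula itself is not reproduced in the text, a typical equirectangular (latitude-longitude) convention is assumed here.

```python
import numpy as np

def ray_sphere_intersection(direction, radius=3.0):
    """P2: where the ray from the sphere centre O (the world origin) hits the sphere surface."""
    d = np.asarray(direction, dtype=float)
    return radius * d / np.linalg.norm(d)

def sphere_to_expanded(p2, width=1024, height=512):
    """Assumed equirectangular expansion: azimuth -> x_exp, elevation -> y_exp.
    The exact angle/axis convention of the embodiments may differ; this is only illustrative."""
    x, y, z = p2 / np.linalg.norm(p2)
    phi = np.arctan2(y, x)                  # azimuth in [-pi, pi]
    theta = np.arcsin(np.clip(z, -1, 1))    # elevation in [-pi/2, pi/2]
    x_exp = (phi + np.pi) / (2 * np.pi) * (width - 1)
    y_exp = (np.pi / 2 - theta) / np.pi * (height - 1)
    return x_exp, y_exp

p2 = ray_sphere_intersection([0.4, 0.2, 0.1])
print(sphere_to_expanded(p2))
```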
  • the values of some pixels in the second expanded image may not be determined due to the distance, that is, these points are not associated with any pixel in the second image.
  • alternatively, the correspondence between the pixels in the second expanded image and the pixels in the second image can also be used to determine the value of each pixel point in the second expanded image.
  • the straight-line equation of OP 2 can be determined from point O and P 2 ; by combining it with the plane equation of the second image, the intersection of the straight line OP 2 and the plane can be solved, and this intersection is the mapping point P 4 .
  • the plane equation can be solved by knowing a point on the plane and the normal vector of that point.
  • the intersection of the optical axis passing through the optical center O 1 with the image plane is O c (x c , y c , z c );
  • the normal vector of the plane where the second image is located is n = (cosα·cosβ, cosα·sinβ, sinα).
  • the coordinates of the mapping point P 4 in the world coordinate system can be converted into coordinates in the camera coordinate system; dividing the x value in the camera coordinate system by dx gives the x value of P 4 in the second image, and dividing the y value in the camera coordinate system by dy gives the y value of P 4 in the second image, that is, the three-dimensional coordinates in the camera coordinate system are converted into two-dimensional coordinates in the image coordinate system, so that the pixel coordinates of the mapping point in the second image are obtained.
  • the pixel value of this point in the second image is then assigned to the corresponding point in the second expanded image, so that the second expanded image is closer to the second image.
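  • As an illustration of this reverse lookup (intersecting the line OP 2 with the plane of the second image to obtain the mapping point P 4 ), the following is a minimal sketch; the plane parameters and numerical values are assumptions.

```python
import numpy as np

def line_plane_intersection(p2, plane_point, plane_normal):
    """Intersection P4 of the line through O (the origin) and P2 with the image plane."""
    p2 = np.asarray(p2, dtype=float)
    n = np.asarray(plane_normal, dtype=float)
    denom = np.dot(n, p2)                   # the line direction is p2 - O = p2
    if abs(denom) < 1e-9:
        return None                         # line parallel to the plane
    t = np.dot(n, np.asarray(plane_point, dtype=float)) / denom
    return t * p2                           # P4 in world coordinates

# Plane of the second image: passes through O_c with normal n along the optical axis (illustrative values).
alpha, beta = 0.1, 0.3
n = np.array([np.cos(alpha) * np.cos(beta), np.cos(alpha) * np.sin(beta), np.sin(alpha)])
o_c = np.array([0.5, 0.1, 0.0]) + 2.0 * n   # O1 plus an assumed focal distance along the axis
p4 = line_plane_intersection([0.9, 0.4, 0.2], o_c, n)
print(p4)
```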
  • the second projection area may be expanded according to the latitude and longitude to obtain a second expanded image.
  • the second expanded image is shown in FIG. 18 .
  • a first projection area corresponding to the first image is determined, and after mapping, the first projection area is expanded to obtain a first expanded image.
  • the image processing apparatus may also acquire historical images from the server.
  • the historical images may be taken at different times and under different lighting conditions, and the acquisition moments of the historical images are before the acquisition moments of the first image and/or the second image.
  • the image processing apparatus may also obtain historical images from the server and historical poses when the historical device collects the historical images (for example, the position and direction when the historical device collects the historical images). Then, the historical image is placed in the aforementioned ball model through the historical pose, and similar to the acquisition of the second expanded image of the second image, the historical expanded image corresponding to the historical image is obtained.
  • the historical device, the first device and the second device are devices with the function of capturing images
  • the historical device, the first device and the second device may be the same device or different devices, which are not specifically limited here.
  • the historical images to be fused may be determined by matching based on attributes.
  • the historical image can be determined according to the pose (for example, if the distance between the position where a device captured an image and the position where the first device captured the first image, or the position where the second device captured the second image, is less than a certain threshold, the image is determined to be a historical image, indicating that the image and the first image and/or the second image are likely to be of the same scene), according to the weather type (for example, images whose weather type is the same as that of the first image or the second image are determined to be historical images), or according to attributes such as the time interval (for example, images captured within a certain time interval are determined to be historical images) or the overlapping angle/area of the viewing angles. Of course, historical images can also be selected from the server through a combination of the above attributes (for example, images whose capture position is within a certain distance of the position where the first image or the second image was captured and whose other attributes also match).
  • Step 1303 The image processing apparatus fuses the first expanded image and the second expanded image to obtain a third image.
  • the first expanded image and the second expanded image are fused to obtain a third image (exemplarily, the third image is as shown in FIG. 19 ).
  • I 1 is a part of the first expanded image
  • I 2 is a part of the second expanded image
  • different fusion methods may be adopted according to the different shapes of the overlapping areas.
  • the principle of vertical fusion is similar to that of horizontal fusion, and the following description will only take vertical fusion as an example.
  • the height of the actual fusion area can be taken as the minimum of a preset fusion height and the height of the overlapping area.
  • for a certain point P(x, y) in the fusion area, the Y coordinate of the upper boundary point of the column in which P is located is y min , and the Y coordinate of the lower boundary point is y max ; the pixel value I of point P can then be determined by the following formula:
  • I = ω*I 1 + (1-ω)*I 2 , where ω is a blending weight determined by the position of y between y min and y max ;
  • I 1 is the pixel value of point P in the first expanded image
  • I 2 is the pixel value of point P in the second expanded image
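  • As an illustration of the vertical fusion described above, the following is a minimal sketch that assumes the blending weight ω varies linearly from 1 at the upper boundary y min to 0 at the lower boundary y max ; the exact form of ω used in the embodiments is not spelled out in the text.

```python
import numpy as np

def blend_vertical(img1, img2, y_min, y_max):
    """Linearly blend two expanded images over the rows y_min..y_max of the overlap area.
    Assumes img1 dominates above the fusion area and img2 dominates below it."""
    out = img1.astype(np.float32).copy()
    for y in range(y_min, y_max + 1):
        w = (y_max - y) / float(y_max - y_min)   # assumed linear weight: 1 at y_min, 0 at y_max
        out[y] = w * img1[y] + (1.0 - w) * img2[y]
    out[y_max + 1:] = img2[y_max + 1:]           # below the fusion area, take img2
    return out.astype(img1.dtype)

a = np.full((8, 4, 3), 200, dtype=np.uint8)      # part of the first expanded image
b = np.full((8, 4, 3), 50, dtype=np.uint8)       # part of the second expanded image
print(blend_vertical(a, b, y_min=2, y_max=5)[:, 0, 0])
```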
  • the image processing apparatus obtains the third image by fusing the first expanded image, the second expanded image and the historical expanded image.
  • the image processing apparatus obtains a fourth image by fusing the historically expanded image with the third image, and performs subsequent steps by using the fourth image as the third image.
  • FIG. 21 is a historical image
  • FIG. 22 is a fourth image or a third image after fusion of the historical expanded image.
  • subsequent images are processed in a manner similar to the historical image, and are fused to obtain the third image.
  • alternatively, a subsequent image and the previously fused target image can be directly fused again to obtain a fused image, and the fused image can be regarded as the third image for performing the subsequent steps.
  • the above method is an operation performed on the sphere model established based on the first image. It can be understood that the above method is not necessarily performed on the spherical model established based on the first image: since the camera may undergo both rotation and displacement, when the displacement between the device that captures one image and the device that captures another image is greater than or equal to the second threshold, the subsequent fusion quality will be affected; in that case the fused image can be cleared, a new ball model can be reconstructed based on the current image (i.e., the new position of the camera), and the above-mentioned mapping and fusion steps can then be performed on the new ball model for subsequent images.
  • the fusion of the multi-view images may also be controlled according to whether the difference between the rotation angles of the devices that collect the two images exceeds the third threshold.
  • the user may perform image processing settings on the above-mentioned historical image, displacement transformation, and viewing angle transformation through a user interface (UI). That is, the user inputs an instruction, and the image processing apparatus executes the corresponding steps.
  • the cloud data fusion is used for the user to set whether to combine the historical images of the server for image fusion.
  • High quality and high performance in multi-view fusion are used for the user to set the aforementioned second and third thresholds.
  • the second threshold corresponding to high performance is greater than the second threshold corresponding to high quality
  • the third threshold corresponding to high performance is greater than the third threshold corresponding to high quality, so that fewer images are mapped and fused, which increases the processing efficiency of the image processing apparatus.
  • for high quality, the second threshold and the third threshold are slightly smaller, so that the accuracy of the obtained third image is higher.
  • the second threshold corresponding to high performance is 2 meters
  • the second threshold corresponding to high quality is 1 meter.
  • the second threshold is 2 meters. If the distance between the position when the device collects one image and the position when the device collects another image is less than 2 meters, operations such as projection and fusion can be performed on the two images. If the distance between the position when the device captures one image and the position when the device captures another image is greater than or equal to 2 meters, the previous image can be cleared, and a new ball model can be rebuilt based on the current image (ie, the new position of the camera) , and then perform the above operations such as mapping and fusion on the new spherical model for subsequent images.
  • the third threshold corresponding to high performance is 30 degrees
  • the third threshold corresponding to high quality is 5 degrees.
  • the user inputs an instruction to determine high quality, that is, the third threshold is 5 degrees. If the difference between the rotation angles is greater than or equal to 5 degrees, perform operations such as mapping and fusion on the two images. If the difference between the rotation angles is less than 5 degrees, the current image is discarded, the ball model constructed based on the previous image is used, and the above operations such as mapping and fusion are performed on subsequent images.
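  • As an illustration of how the second and third thresholds could gate the processing, the following is a minimal sketch using the example values above; the preset names and the decision logic are assumptions.

```python
PRESETS = {"high_performance": {"second_threshold_m": 2.0, "third_threshold_deg": 30.0},
           "high_quality":     {"second_threshold_m": 1.0, "third_threshold_deg": 5.0}}

def decide_action(displacement_m, rotation_diff_deg, mode="high_quality"):
    """Decide what to do with a newly collected image, following the examples above."""
    cfg = PRESETS[mode]
    if displacement_m >= cfg["second_threshold_m"]:
        return "rebuild_sphere_model"        # displacement too large: clear and rebuild the ball model
    if rotation_diff_deg < cfg["third_threshold_deg"]:
        return "discard_current_image"       # viewpoint too similar: keep using the existing model
    return "map_and_fuse"                    # otherwise map the image and fuse it

print(decide_action(0.8, 12.0, "high_quality"))       # map_and_fuse
print(decide_action(0.8, 12.0, "high_performance"))   # discard_current_image (12 < 30)
```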
  • the user may have many choices through the UI.
  • if the user selects cloud data fusion (that is, selects fusion of historical images), the above-described steps of fusing the historical images are performed.
  • as shown in FIG. 25 , if the user selects high performance, the above-mentioned second threshold and third threshold are slightly larger.
  • if the user selects high quality, the above-mentioned second threshold and third threshold are slightly smaller.
  • the user may also select cloud data fusion (that is, fusion of historical images) in combination with these settings.
  • FIG. 23 to FIG. 26 are just examples of several situations.
  • Step 1304 The image processing apparatus inputs the third image into the trained image prediction network to perform image prediction, and obtains a predicted image.
  • the image prediction model used in this step may be constructed by the method shown in FIG. 5 above, or constructed by other means, which is not specifically limited here.
  • the third image is input into the trained image prediction network to perform image prediction to obtain the predicted image.
  • after the image processing apparatus acquires the third image, the third image and the mask image corresponding to the third image are input into the trained image prediction network to perform image prediction, and the predicted image is obtained.
  • FIG. 29 is a predicted image
  • FIG. 30 is an example of FIG. 27 .
  • after the image processing apparatus acquires the predicted image, it can use the predicted image to perform operations such as ambient light rendering on virtual objects in the same scene, where a virtual object is an object that requires ambient light rendering and may be a virtual item or a virtual scene, which is not specifically limited here.
  • step 1305 and step 1306 will introduce the use of the spherical harmonic coefficients of the predicted image for ambient light rendering. It can be understood that the predicted image can also be used for ambient light rendering through spherical Gaussians or image-based lighting (IBL), which is not limited here.
  • Step 1305 The image processing apparatus acquires spherical harmonic coefficients of the predicted image. This step is optional.
  • the image processing apparatus maps the predicted image to a unit spatial mapping model, such as a unit sphere model (shown in FIG. 31A ), so as to obtain the spherical harmonic coefficients of the predicted image.
  • the spherical harmonic coefficients can be used as ambient light data describing the predicted image.
  • spherical harmonic lighting samples the surrounding ambient light into several coefficients (i.e., several spherical harmonic coefficients), and then uses these spherical harmonic coefficients to restore the lighting when rendering, which is a simplification of the ambient light.
  • for each sampling point (pixel), the corresponding spherical harmonic basis is calculated, and the pixel value and the corresponding spherical harmonic basis are multiplied and then summed, which is equivalent to integrating each spherical harmonic basis over all pixels.
  • the spherical harmonic coefficients are obtained by projecting the sampled pixels onto the spherical harmonic basis, where:
  • i is the index of spherical harmonic coefficient
  • N is the number of sampling points
  • n is the order; there are n 2 spherical harmonic coefficients corresponding to a single image channel, and for a three-channel environment map there are 3n 2 spherical harmonic coefficients
  • light (x j ) is the RGB value of the sample point.
  • y i (x j ) is a spherical harmonic basis, and for order n, y i (x j ) is divided into multiple bands: 0,...,l,...,n-1.
  • Band l includes 2l+1 spherical harmonics.
  • a set of spherical harmonic coefficients can be obtained through the above formula.
  • the spherical harmonic coefficients are of the third order, and the three RGB channels respectively include 9 coefficients, totaling 27 coefficients.
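  • As an illustration of computing third-order spherical harmonic coefficients from the predicted image, the following is a minimal sketch; since the projection formula itself is not reproduced in the text, the standard estimator c i ≈ (4π/N)·Σ j light(x j )·y i (x j ) with the usual real spherical harmonic basis constants is assumed.

```python
import numpy as np

def sh_basis(d):
    """First 9 real spherical harmonic basis values for a unit direction d = (x, y, z)."""
    x, y, z = d
    return np.array([
        0.282095,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y)])

def sh_coefficients(env, samples=4096, seed=0):
    """env: H x W x 3 lat-long image. Returns 9 coefficients per RGB channel (27 in total)."""
    h, w, _ = env.shape
    rng = np.random.default_rng(seed)
    coeffs = np.zeros((9, 3))
    for _ in range(samples):
        # uniform sample direction on the unit sphere
        z = rng.uniform(-1.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * np.pi)
        r = np.sqrt(1.0 - z * z)
        d = (r * np.cos(phi), r * np.sin(phi), z)
        # look the direction up in the lat-long image (same convention as the expansion sketch above)
        u = int(phi / (2.0 * np.pi) * (w - 1))
        v = int((np.pi / 2 - np.arcsin(z)) / np.pi * (h - 1))
        coeffs += np.outer(sh_basis(d), env[v, u])
    return coeffs * (4.0 * np.pi / samples)

env = np.random.rand(64, 128, 3)      # stand-in for the predicted image
print(sh_coefficients(env).shape)     # (9, 3) -> 27 coefficients
```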
  • Step 1306 The image processing apparatus uses spherical harmonic coefficients to perform ambient light rendering on the virtual object. This step is optional.
  • the R value in the RGB value is equal to the sum of the first value, the second value, and the third value.
  • the normal vector of the P point is (x, y, z, 1)
  • the first value is equal to the dot product of the vector formed by the first 4 spherical harmonic coefficients of the R channel and the normal vector.
  • the second value is equal to the dot product of the vector formed by the 5th to 8th spherical harmonic coefficients of the R channel and VB, where VB is (xy, yz, zz, zx).
  • the third value is equal to the product of the ninth spherical harmonic coefficient of the R channel and VC, where VC is x 2 -y 2 .
  • the calculation method of the value of the G channel and the value of the B channel is similar to the calculation method of the R channel, and will not be repeated here.
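  • As an illustration of the per-channel shading computation described above, the following is a minimal sketch; it follows the decomposition into the three values given above and assumes the 9 coefficients of each channel are ordered to match the (x, y, z, 1), (xy, yz, zz, zx) and x 2 -y 2 terms (reorder them if they come from a different convention).

```python
import numpy as np

def shade_channel(c, normal):
    """Ambient light value of one colour channel from its 9 coefficients, as described above."""
    x, y, z = normal
    va = np.array([x, y, z, 1.0])                    # the normal vector extended with 1
    vb = np.array([x * y, y * z, z * z, z * x])      # the (xy, yz, zz, zx) vector VB
    vc = x * x - y * y                               # VC
    return np.dot(c[0:4], va) + np.dot(c[4:8], vb) + c[8] * vc

def shade_rgb(coeffs, normal):
    """coeffs: (9, 3) array of per-channel coefficients; returns the RGB value at the point."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    return np.array([shade_channel(coeffs[:, ch], n) for ch in range(3)])

coeffs = np.random.rand(9, 3)                        # e.g. coefficients obtained in step 1305
print(shade_rgb(coeffs, (0.0, 0.0, 1.0)))            # shading of an upward-facing point
```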
  • the virtual object may be a virtual object or a virtual scene in a scene, which is not specifically limited here.
  • the image processing method includes steps 1301 to 1304 . In another possible implementation manner, the image processing method includes steps 1301 to 1306 .
  • the method shown in FIG. 13 can be performed cyclically, that is, after the spatial mapping model is constructed, multiple images can be obtained, and the multiple images can be mapped to the spatial mapping model to obtain multiple expanded images, and then fused to obtain the image to be input.
  • the third image of the prediction network may be obtained by fusing two images from different viewing angles, or may be obtained by fusing multiple images from different viewing angles.
  • the third image can also be continuously updated, that is, the current third image is fused with a subsequent expanded image (the subsequent expanded image is obtained in a manner similar to that of the aforementioned second expanded image) to obtain a new third image, and the new third image is then input into the image prediction network for prediction to obtain a predicted image.
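  • As an illustration of this continuous update, the following is a minimal pipeline sketch; every callable is a placeholder standing in for the corresponding step (mapping, fusion, prediction) rather than an actual implementation.

```python
def process_stream(images_and_poses, sphere_model, map_to_expanded, fuse, predict):
    """Placeholder pipeline for the loop described above; all callables are assumed stand-ins."""
    third_image = None
    for image, pose in images_and_poses:
        expanded = map_to_expanded(image, pose, sphere_model)                           # step 1302
        third_image = expanded if third_image is None else fuse(third_image, expanded)  # step 1303
        predicted = predict(third_image)                                                # step 1304
        yield predicted                           # used for ambient light rendering (steps 1305-1306)
```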
  • the spatial mapping model is used to map and fuse expanded images from multiple viewing angles, so that the predicted image obtained after inputting the image prediction network has more textures, thereby improving the ambient light rendering quality of the virtual object in the subsequent scene. For example: enhancing the realism of virtual object rendering.
  • the input image may also make reference to images stored in the server for the same location (i.e., the texture information of a certain location in the scene), and the user can flexibly set the fusion mode through the user interface, for example, high performance or high quality.
  • the embodiments of the present application further provide corresponding apparatuses, including corresponding modules for executing the foregoing embodiments.
  • the modules may be software, hardware, or a combination of software and hardware.
  • the image processing apparatus may be a local device (eg, a mobile phone, a camera, etc.) or a cloud device.
  • the image processing device includes:
  • the acquisition unit 3201 is used to acquire a first image and a second image, the first image and the second image are images collected from the same scene and different viewing angles;
  • the mapping unit 3202 is used to perform mapping processing on the first image and the second image based on the spatial mapping model to obtain the first expanded image and the second expanded image;
  • the fusion unit 3203 is used to fuse the first expanded image and the second expanded image to obtain a third image
  • the prediction unit 3204 is configured to input the third image into the trained image prediction network for image prediction to obtain a predicted image, and the predicted image is used for ambient light rendering of the virtual object in the foregoing scene.
  • each unit in the image processing apparatus is similar to those described in the foregoing embodiments shown in FIG. 5 to FIG. 13 , and details are not repeated here.
  • the mapping unit 3202 performs spatial mapping model mapping on the multi-view image and the fusion unit 3203 fuses the expanded images of multiple views, so that the predicted image obtained by the prediction unit 3204 after inputting the image prediction network has more textures, which improves the subsequent Ambient rendering quality of virtual objects in the scene.
  • the image processing apparatus may be a local device (eg, a mobile phone, a camera, etc.) or a cloud device.
  • the image processing device includes:
  • the acquiring unit 3301 is used to acquire a first image and a second image, where the first image and the second image are images collected from the same scene and different viewing angles;
  • the mapping unit 3302 is used to perform mapping processing on the first image and the second image based on the spatial mapping model to obtain the first expanded image and the second expanded image;
  • the fusion unit 3303 is used to fuse the first expanded image and the second expanded image to obtain a third image
  • the prediction unit 3304 is configured to input the third image into the trained image prediction network for image prediction to obtain a predicted image, and the predicted image is used for ambient light rendering of the virtual object in the foregoing scene.
  • mapping unit 3302 includes:
  • the construction subunit 33021 is used to construct a spatial mapping model according to the optical center of the first device, where the first device is the device that collects the first image, and the second device is the device that collects the second image;
  • the determination subunit 33022 is used to determine the first projection area of the first image on the spatial mapping model according to the first pose of the first device, where the first pose is the pose when the first device collects the first image;
  • the mapping subunit 33023 is used to map each pixel in the first image to the first projection area to obtain the first expanded image
  • the determination subunit 33022 is further configured to determine the second projection area of the second image on the spatial mapping model according to the second pose of the second device, where the second pose is the pose when the second device collects the second image;
  • the mapping subunit 33023 is further configured to map each pixel in the second image to the second projection area to obtain a second expanded image.
  • the obtaining unit 3301 is also used to obtain, from the server, the historical image and the historical pose of the historical image, where the historical image is collected before the first image or the second image is collected, the historical pose is the pose when the historical device collects the historical image, and the historical image stores content of the same position as in the first image and/or the second image;
  • the obtaining unit 3301 is also used to obtain spherical harmonic coefficients of the predicted image
  • the rendering unit 3305 is configured to perform ambient light rendering on the virtual object by using spherical harmonic coefficients.
  • the determination subunit 33022 is also used to determine the historical projection area of the historical image on the spatial mapping model according to the historical pose,
  • the mapping subunit 33023 is also used to map each pixel in the historical image to the historical projection area to obtain the historical expanded image;
  • the fusion unit 3303 is specifically configured to fuse the first expanded image, the second expanded image, and the historical expanded image to obtain a third image.
  • each unit in the image processing apparatus is similar to those described in the foregoing embodiments shown in FIG. 5 to FIG. 13 , and details are not repeated here.
  • the mapping unit 3302 performs spatial mapping model mapping on the multi-view image and the fusion unit 3303 fuses the expanded images of multiple views, so that the predicted image obtained by the prediction unit 3304 after inputting the image prediction network has more textures, which improves the subsequent Ambient rendering quality of virtual objects in the scene. For example: enhancing the realism of virtual object rendering.
  • the input image may also make reference to images stored in the server for the same location (i.e., the texture information of a certain location in the scene), and the user can flexibly set the fusion mode through the user interface, for example, high performance or high quality.
  • FIG. 34 is a schematic diagram of a hardware structure of an image processing apparatus provided by an embodiment of the present application.
  • the image processing apparatus 3400 shown in FIG. 34 (the apparatus 3400 may specifically be a computer device) includes a memory 3401 , a processor 3402 , a communication interface 3403 and a bus 3404 .
  • the memory 3401 , the processor 3402 , and the communication interface 3403 are connected to each other through the bus 3404 for communication.
  • the memory 3401 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 3401 may store a program. When the program stored in the memory 3401 is executed by the processor 3402, the processor 3402 and the communication interface 3403 are used to execute each step of the image processing method of the embodiment of the present application.
  • the processor 3402 may be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute relevant programs to implement the functions required to be performed by the units in the image processing apparatus of the embodiments of the present application, or to execute the image processing method of the method embodiments of the present application.
  • the processor 3402 may also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the image processing method of the present application can be completed by an integrated logic circuit of hardware in the processor 3402 or instructions in the form of software.
  • the above-mentioned processor 3402 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 3401, and the processor 3402 reads the information in the memory 3401 and, in combination with its hardware, completes the functions required to be performed by the units included in the image processing apparatus of the embodiments of the present application, or performs the image processing of the method embodiments of the present application. method.
  • the communication interface 3403 implements communication between the apparatus 3400 and other devices or a communication network using a transceiving device such as, but not limited to, a transceiver.
  • training data (such as the training images described in the embodiments of the present application) can be acquired through the communication interface 3403 .
  • the bus 3404 may include a pathway for communicating information between the various components of the device 3400 (eg, the memory 3401, the processor 3402, the communication interface 3403).
  • the apparatus 3400 shown in FIG. 34 only shows a memory, a processor, and a communication interface, in the specific implementation process, those skilled in the art should understand that the apparatus 3400 also includes other devices necessary for normal operation . Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 3400 may further include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the apparatus 3400 may only include the necessary devices for implementing the embodiments of the present application, and does not necessarily include all the devices shown in FIG. 34 .
  • Embodiments of the present application also provide a computer program product that, when running on a computer, causes the computer to perform the steps performed by the aforementioned image processing apparatus.
  • Embodiments of the present application further provide a computer-readable storage medium, where a program for performing signal processing is stored in the computer-readable storage medium, and when the program runs on a computer, it causes the computer to perform the steps performed by the above-mentioned image processing apparatus.
  • the image processing apparatus or terminal device may specifically be a chip, and the chip includes: a processing unit and a communication unit, the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or circuit, etc.
  • the processing unit can execute the computer-executed instructions stored in the storage unit, so that the chip in the image processing apparatus executes the image processing method described in the above embodiments.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application can be embodied in the form of software products in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, and the computer software products are stored in a storage medium , including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: U disk, removable hard disk, ROM, RAM, magnetic disk or optical disk and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose an image processing method and a related device, relating to the field of artificial intelligence and, in particular, to the field of computer vision. The method includes: acquiring a first image and a second image; performing mapping processing on the first image and the second image based on a spatial mapping model to obtain a first expanded image and a second expanded image; fusing the first expanded image and the second expanded image to obtain a third image; and inputting the third image into a trained image prediction network to perform image prediction to obtain a predicted image, where the predicted image is used for ambient light rendering. Mapping through the spatial mapping model and fusing expanded images from multiple viewing angles give the predicted image obtained from the image prediction network more texture, which improves the subsequent ambient light rendering quality of virtual objects in the scene.

Description

一种图像处理方法及相关设备
本申请要求于2020年11月10日提交中国专利局、申请号为202011248424.1、发明名称为“一种图像处理方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及计算机视觉领域,尤其涉及一种图像处理方法及相关设备。
背景技术
计算机视觉是各个应用领域,如制造业、检验、文档分析、医疗诊断,和军事等领域中各种智能/自主系统中不可分割的一部分,它是一门关于如何运用照相机/摄像机和计算机来获取我们所需的,被拍摄对象的数据与信息的学问。形象地说,就是给计算机安装上眼睛(照相机/摄像机)和大脑(算法)用来代替人眼对目标进行识别、跟踪和测量等,从而使计算机能够感知环境。因为感知可以看作是从感官信号中提取信息,所以计算机视觉也可以看作是研究如何使人工系统从图像或多维数据中“感知”的科学。总的来说,计算机视觉就是用各种成象系统代替视觉器官获取输入信息,再由计算机来代替大脑对这些输入信息完成处理和解释。计算机视觉的最终研究目标就是使计算机能像人那样通过视觉观察和理解世界,具有自主适应环境的能力。
在AR图像编辑和电影制作等一些领域,需要将三维虚拟物体合成到真实场景的图像中。为了实现逼真的渲染效果,需要估计真实场景的照明情况,以使得合成图像中的虚拟物体能够呈现与该场景中的真实物体一致的着色、阴影和反射效果,并且使得合成图像能够正确地呈现虚拟物体和真实物体之间的投影或遮挡。
因此,为了物体表面着色和环境反射,对于环境光参数的确定尤为重要。
发明内容
本申请实施例提供了一种图像处理方法及相关设备。可以提升后续环境光渲染的质量。
本申请实施例第一方面提供了一种图像处理方法,该方法可以由图像处理装置执行,也可以由图像处理装置的部件(例如处理器、芯片、或芯片系统等)执行,其中,该图像处理装置可以是本地设备(例如,手机、摄像机等)或云端设备。该方法也可以由本地设备以及云端设备共同执行。该方法包括:获取第一图像以及第二图像,第一图像与第二图像为同一场景不同视角下采集的图像;基于空间映射模型对第一图像以及第二图像进行映射处理,得到第一展开图像以及第二展开图像;融合第一展开图像以及第二展开图像,得到第三图像;将第三图像输入训练好的图像预测网络进行图像预测,得到预测图像,预测图像用于前述场景下虚拟对象的环境光渲染。
本申请实施例中,通过空间映射模型映射以及融合多个视角的展开图像使得输入图像预测网络后得到的预测图像具有更多纹理,提升后续该场景下虚拟对象的环境光渲染质量。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的基于空间映射模型对第一 图像以及第二图像进行映射处理,得到第一展开图像以及第二展开图像,包括:根据第一设备的光心构建空间映射模型,第一设备为采集第一图像的设备,第二设备为采集第二图像的设备;根据第一设备的第一位姿确定第一图像在空间映射模型上的第一投影区域,第一位姿为第一设备采集第一图像时的位姿;将第一图像内各像素点映射到第一投影区域,得到第一展开图像;根据第二设备的第二位姿确定第二图像在空间映射模型上的第二投影区域,第二位姿为第二设备采集第二图像时的位姿;将第二图像内各像素点映射到第二投影区域内,得到第二展开图像。
该种可能的实现方式中,通过第一设备采集第一图像时的位姿构建空间映射模型,后续的图像基于位姿映射到该空间映射模型后,经过纹理映射、融合,得到纹理信息与采集的图像(第一图像以及第二图像)相近,使得后续图像预测网络输出的第三图像具有更多的纹理信息,提升环境光渲染的真实效果。
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:从服务器获取历史图像以及该历史图像的历史位姿,历史图像的采集时刻在第一图像或第二图像的采集时刻之前,历史位姿为历史设备采集历史图像时的位姿,历史图像存储有与第一图像和/或第二图像中相同位置的图像;根据历史位姿确定历史图像在空间映射模型上的历史投影区域,将历史图像内各像素点映射到历史投影区域,得到历史展开图像;融合第一展开图像以及第二展开图像,得到第三图像,包括:融合第一展开图像、第二展开图像历史展开图像,得到第三图像。
该种可能的实现方式中,可以结合云端的历史图像进行融合,得到第三图像,使得可以参考云端存储的该场景下的纹理信息,可以提升后续预测图像的纹理细节以及环境光渲染的质量。
可选地,在第一方面的一种可能的实现方式中,上述步骤中训练好的图像预测网络是通过以训练图像作为图像预测网络的输入,以损失函数的值小于第一阈值为目标对图像预测网络进行训练得到;损失函数用于指示图像预测网络输出的图像与第三目标图像之间的差异,第三目标图像为采集的图像。
该种可能的实现方式中,通过训练图像以及第三目标图像实现图像预测网络的训练过程,为后续提供更加优化的图像预测网络,提升输出图像(即预测图像)的精细程度。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的损失函数的权重由训练图像对应的蒙板图像控制。
该种可能的实现方式中,通过蒙板图像控制损失函数的权重,示例性,有场景的区域权重为1,没有场景的区域权重为0,这样可以除去无效的部分,减少无效区域的干扰,提升输出图像的纹理细节。
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:获取预测图像的球谐系数;利用球谐系数对虚拟物体进行环境光渲染。
该种可能的实现方式中,可以通过获取球谐系数的方式,对虚拟物体进行渲染,使得虚拟物体的光照更加真实。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的第三图像的视场角大于第一图像或第二图像。
该种可能的实现方式中,通过将小视角的图像输入图像预测模型的方式,得到更大视角 的第三图像,并且利于增大后续环境光渲染的面积。
本申请实施例第二方面提供一种图像处理装置,该图像处理装置可以是本地设备(例如,手机、摄像机等)或云端设备。该图像处理装置包括:
获取单元,用于获取第一图像以及第二图像,第一图像与第二图像为同一场景不同视角下采集的图像;
映射单元,用于基于空间映射模型对第一图像以及第二图像进行映射处理,得到第一展开图像以及第二展开图像;
融合单元,用于融合第一展开图像以及第二展开图像,得到第三图像;
预测单元,用于将第三图像输入训练好的图像预测网络进行图像预测,得到预测图像,预测图像用于环境光渲染。
可选地,在第二方面的一种可能的实现方式中,上述图像处理装置中的映射单元包括:
构建子单元,用于根据第一设备的光心构建空间映射模型,第一设备为采集第一图像的设备,第二设备为采集第二图像的设备;
确定子单元,用于根据第一设备的第一位姿确定第一图像在空间映射模型上的第一投影区域,第一位姿为第一设备采集第一图像时的位姿;
映射子单元,用于将第一图像内各像素点映射到第一投影区域,得到第一展开图像;
确定子单元,还用于根据第二设备的第二位姿确定第二图像在空间映射模型上的第二投影区域,第二位姿为第二设备采集第二图像时的位姿;
映射子单元,还用于将第二图像内各像素点映射到第二投影区域内,得到第二展开图像。
可选地,在第二方面的一种可能的实现方式中,上述图像处理装置的获取单元,还用于从云端获取历史图像以及历史图像的历史位姿,历史位姿为历史设备采集历史图像时的位姿,历史图像存储有与第一图像和/或第二图像中相同位置的纹理信息;
确定子单元,还用于根据历史位姿确定历史图像在空间映射模型上的历史投影区域,
映射子单元,还用于将历史图像内各像素点映射到历史投影区域,得到历史展开图像;
融合单元,具体用于融合第一展开图像、第二展开图像历史展开图像,得到第三图像。
可选地,在第二方面的一种可能的实现方式中,上述训练好的图像预测网络是通过以训练图像作为图像预测网络的输入,以损失函数的值小于第一阈值为目标对图像预测网络进行训练得到;损失函数用于指示图像预测网络输出的图像与第三目标图像之间的差异,第三目标图像为采集的图像。
可选地,在第二方面的一种可能的实现方式中,上述损失函数的权重由训练图像对应的蒙板图像控制。
可选地,在第二方面的一种可能的实现方式中,上述图像处理装置的获取单元,还用于获取预测图像的球谐系数;
上述图像处理装置还包括:
渲染单元,用于利用球谐系数对虚拟物体进行环境光渲染。
可选地,在第二方面的一种可能的实现方式中,上述第三图像的视场角大于第一图像或第二图像。
本申请实施例第三方面提供了一种图像处理装置,该图像处理装置可以是手机或摄像机。 也可以是云端设备(例如服务器等),该图像处理装置执行前述第一方面或第一方面的任意可能的实现方式中的方法。
本申请实施例第四方面提供了一种芯片,该芯片包括处理器和通信接口,通信接口和处理器耦合,处理器用于运行计算机程序或指令,使得该芯片实现上述第一方面或第一方面的任意可能的实现方式中的方法。
本申请实施例第五方面提供了一种计算机可读存储介质,该计算机可读存储介质中存储有指令,该指令在计算机上执行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法。
本申请实施例第六方面提供了一种计算机程序产品,该计算机程序产品在计算机上执行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法。
本申请实施例第七方面提供了一种图像处理装置,包括:处理器,处理器与存储器耦合,存储器用于存储程序或指令,当程序或指令被处理器执行时,使得该图像处理装置实现上述第一方面或第一方面的任意可能的实现方式中的方法。
其中,第二、第三、第四、第五、第六、第七方面或者其中任一种可能实现方式所带来的技术效果可参见第一方面或第一方面不同可能实现方式所带来的技术效果,此处不再赘述。
从以上技术方案可以看出,本申请实施例具有以下优点:获取第一图像以及第二图像,基于空间映射模型对第一图像以及第二图像进行映射处理,得到第一展开图像以及第二展开图像;融合第一展开图像以及第二展开图像,得到第三图像;将第三图像输入训练好的图像预测网络进行图像预测,得到预测图像,该预测图像用于环境光渲染。通过空间映射模型映射以及融合多个视角的展开图像使得输入图像预测网络后得到的预测图像具有更多纹理,提升后续该场景下虚拟对象的环境光渲染质量。
附图说明
图1为本申请实施例提供的系统架构的结构示意图;
图2为本发明实施例提供的一种卷积神经网络结构示意图;
图3为本发明实施例提供的另一种卷积神经网络结构示意图;
图4为本申请实施例提供的一种芯片硬件结构示意图;
图5为本申请实施例提供的一种图像预测模型的训练方法的示意性流程图;
图6为本申请实施例提供的一个训练图像的示意图;
图7为本申请实施例提供的一个训练图像对应的蒙板图像的示意图;
图8为本申请实施例提供的一个第三目标图像的示意图;
图9为本申请实施例提供的一个输出图像的示意图;
图10为本申请实施例提供的一种图像预测模型的结构示意图;
图11为本申请实施例提供的另一种图像预测模型的结构示意图;
图12为本申请实施例提供的另一种图像预测模型的结构示意图;
图13为本申请实施例提供的图像处理方法一个流程示意图;
图14A为本申请实施例提供的两个视角重叠的一种示意图;
图14B为本申请实施例提供的两个视角重叠的另一种示意图;
图15A为本申请实施例提供的第一图像的示意图;
图15B为本申请实施例提供的第二图像的示意图;
图16为本申请实施例提供的基于第一图像构建的空间映射模型的一种示意图;
图17为本申请实施例提供的第二图像在空间映射模型的映射示意图;
图18为本申请实施例提供的第二展开图像的示意图;
图19为本申请实施例提供的第三图像的一种示意图;
图20为本申请实施例提供的两个图像竖直融合的一种示意图;
图21为本申请实施例提供的历史图像的示意图;
图22为本申请实施例提供的第三图像或第四图像的另一种示意图;
图23-图26为本申请实施例提供的用户界面的几种示意图;
图27为本申请实施例提供的另一种图像预测模型的结构示意图;
图28为本申请实施例提供的另一种图像预测模型的结构示意图;
图29为本申请实施例提供的预测图像的示意图;
图30为本申请实施例提供的另一种图像预测模型的结构示意图;
图31A为本申请实施例提供的预测图像的单位球模型示意图;
图31B为本申请实施例提供的球谐系数恢复的照度示意图;
图32为本申请实施例提供的图像处理装置一个结构示意图;
图33为本申请实施例提供的图像处理装置另一结构示意图;
图34为本申请实施例提供的图像处理装置另一结构示意图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述。
本申请实施例提供的图像处理方法能够应用在增强现实(augmented reality,AR)、游戏制作、电影制作以及需要进行环境光渲染的其他场景。下面分别对AR场景和电影制作场景进行简单的介绍。
AR场景:
AR技术是在虚拟现实基础上发展起来的新技术,是通过计算机系统提供的信息增加用户对现实世界感知的技术,并将计算机生成的虚拟对象、场景或系统提示信息叠加到真实场景中,从而实现对现实的“增强”,是一种将真实世界信息和虚拟世界信息“无缝”集成的新技术。因此,如何将虚拟对象的渲染效果与环境协调,对AR产品的用户体验有重要意义。利用光照估计对虚拟对象进行渲染,是“无缝”AR的重要组成部分。通过本申请实施例提供的图像处理方法,可以使得AR场景下虚拟对象的光照更加真实。
电影制作:
电影制作中非真实镜头的光照捕捉需要估计真实场景的照明情况,以使得电影中的非真实镜头更加真实,能够呈现真实场景下的着色、阴影和反射效果。通过本申请实施例提供的图像处理方法,可以使得电影制作中非真实镜头的光照更加真实。
由于本申请实施例涉及神经网络的应用,为了便于理解,下面先对本申请实施例主要涉及的神经网络的相关术语和概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以X s和截距1为输入的运算单元,该运算单元的输出可以为:
h_{W,b}(X) = f(W^T X) = f(∑_{s=1}^{n} W_s X_s + b)
其中,s=1、2、……n,n为大于1的自然数,W s为X s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
y = α(W·x + b)
其中，x是输入向量，y是输出向量，b是偏移向量，W是权重矩阵（也称系数），α()是激活函数。每一层仅仅是对输入向量经过如此简单的操作得到输出向量。由于DNN层数多，则系数W和偏移向量b
的数量也就很多了。这些参数在DNN中的定义如下所述:以系数W为例:假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为W 24 3。上标3代表系数所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是:第L-1层的第k个神经元到第L层的第j个神经元的系数定义为W jk L。需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
(3)卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中, 通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(5)反向传播算法
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的超分辨率模型中参数的大小,使得超分辨率模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的超分辨率模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的超分辨率模型的参数,例如权重矩阵。
(6)生成式对抗网络
生成式对抗网络(generative adversarial networks,GAN)是一种深度学习模型。该模型中至少包括两个模块:一个模块是生成模型(Generative Model),另一个模块是判别模型(Discriminative Model),通过这两个模块互相博弈学习,从而产生更好的输出。生成模型和判别模型都可以是神经网络,具体可以是深度神经网络,或者卷积神经网络。GAN的基本原理如下:以生成图片的GAN为例,假设有两个网络,G(Generator)和D(Discriminator),其中G是一个生成图片的网络,它接收一个随机的噪声z,通过这个噪声生成图片,记做G(z);D是一个判别网络,用于判别一张图片是不是“真实的”。它的输入参数是x,x代表一张图片,输出D(x)代表x为真实图片的概率,如果为1,就代表100%是真实的图片,如果为0,就代表不可能是真实的图片。在对该生成式对抗网络进行训练的过程中,生成网络G的目标就是尽可能生成真实的图片去欺骗判别网络D,而判别网络D的目标就是尽量把G生成的图片和真实的图片区分开来。这样,G和D就构成了一个动态的“博弈”过程,也即“生成式对抗网络”中的“对抗”。最后博弈的结果,在理想的状态下,G可以生成足以“以 假乱真”的图片G(z),而D难以判定G生成的图片究竟是不是真实的,即D(G(z))=0.5。这样就得到了一个优异的生成模型G,它可以用来生成图片。
(7)像素值
图像的像素值可以是一个红绿蓝(RGB)颜色值,像素值可以是表示颜色的长整数。例如,像素值为256*Red+100*Green+76Blue,其中,Blue代表蓝色分量,Green代表绿色分量,Red代表红色分量。各个颜色分量中,数值越小,亮度越低,数值越大,亮度越高。对于灰度图像来说,像素值可以是灰度值。
(8)编码器、解码器
编码器(encoder)用于提取输入图像的特征。具体地,编码器可以采用神经网络,例如,卷积神经网络。
解码器(decoder)用于将提取的特征恢复为图像。具体地,解码器可以采用神经网络,例如,卷积神经网络。
(9)上采样
在应用在计算机视觉的深度学习领域,由于输入图像通过卷积神经网络(CNN)提取特征后,输出的尺寸往往会变小,而有时我们需要将图像恢复到原来的尺寸以便进行进一步的计算(例如:图像的语义分割),这个采用扩大图像尺寸,实现图像由小分辨率到大分辨率的映射的操作,叫做上采样(Upsample)。
其中,上采样有3种常见的方法:双线性插值(bilinear)、反卷积(Transposed Convolution)以及反池化(Unpooling)。
下面介绍本申请实施例提供的系统架构。
参见附图1,本发明实施例提供了一种系统架构100。如所述系统架构100所示,数据采集设备160用于采集训练数据,本申请实施例中训练数据包括:训练图像。进一步的,训练图像与第一图像和/第二图像是针对同一场景下采集的图像。并将训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。下面将以实施例一更详细地描述训练设备120如何基于训练数据得到目标模型/规则101,该目标模型/规则101能够用于实现本申请实施例提供的图像处理方法,即,将第三图像通过相关预处理后输入该目标模型/规则101,即可得到预测图像。本申请实施例中的目标模型/规则101具体可以为图像预测网络,在本申请提供的实施例中,该图像预测网络是通过训练训练图像得到的。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图1所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,AR/VR,车载终端等,还可以是服务器或者云端等。在附图1中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:第一图像以及第二图像,可以是用户输入的,也可以是用户通过拍摄设备上传的,当然还可以来自数据库,具体此处不做限 定。
预处理模块113用于根据I/O接口112接收到的输入数据(如第一图像以及第二图像)进行预处理,在本申请实施例中,预处理模块113可以用于基于空间映射模型对第一图像以及第二图像进行映射处理,得到第一展开图像以及第二展开图像。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果,如上述得到的预测图像返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在附图1中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,附图1仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图1中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图1所示,根据训练设备120训练得到目标模型/规则101,该目标模型/规则101在本申请实施例中可以是图像预测网络,具体的,在本申请实施例提供的网络中,图像预测网络都可以是卷积神经网络。
由于CNN是一种非常常见的神经网络,下面结合图2重点对CNN的结构进行详细的介绍。如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图2所示,卷积神经网络(CNN)100可以包括输入层110,卷积层/池化层120,其中池化层为可选的,以及神经网络层130。
卷积层/池化层120:
卷积层:
如图2所示卷积层/池化层120可以包括如示例121-126层,在一种实现中,121层为卷 积层,122层为池化层,123层为卷积层,124层为池化层,125为卷积层,126为池化层;在另一种实现方式中,121、122为卷积层,123为池化层,124、125为卷积层,126为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
以卷积层121为例,卷积层121可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化……该多个权重矩阵维度相同,经过该多个维度相同的权重矩阵提取后的特征图维度也相同,再将提取到的多个维度相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入图像中提取信息,从而帮助卷积神经网络100进行正确的预测。
当卷积神经网络100有多个卷积层的时候,初始的卷积层(例如121)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络100深度的加深,越往后的卷积层(例如126)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,即如图2中120所示例的121-126各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像大小相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层130:
在经过卷积层/池化层120的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层120只会提取特征,并减少输入图像带来的参数。然而为了生 成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络100需要利用神经网络层130来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层130中可以包括多层隐含层(如图2所示的131、132至13n)以及输出层140,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等……
在神经网络层130中的多层隐含层之后,也就是整个卷积神经网络100的最后层为输出层140,该输出层140具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络100的前向传播(如图2由110至140的传播为前向传播)完成,反向传播(如图2由140至110的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络100的损失及卷积神经网络100通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图2所示的卷积神经网络100仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,如图3所示的多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层130进行处理。
下面介绍本申请实施例提供的一种芯片硬件结构。
图4为本发明实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器40。该芯片可以被设置在如图1所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图1所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图2所示的卷积神经网络中各层的算法均可在如图4所示的芯片中得以实现。
神经网络处理器40可以是神经网络处理器(neural-network processing unit,NPU),张量处理器(tensor processing unit,TPU),或者图形处理器(graphics processing unit,GPU)等一切适合用于大规模异或运算处理的处理器。以NPU为例:神经网络处理器NPU40作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路403,控制器404控制运算电路403提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路403内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路403是二维脉动阵列。运算电路403还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路403是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器402中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器401中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器408accumulator中。
向量计算单元407可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元407可以用于神经网络中非卷积/非FC层的网络计算,如池化(Pooling),批归一化(Batch Normalization),局部响应归一化(Local Response Normalization)等。
在一些实现种,向量计算单元能407将经处理的输出的向量存储到统一缓存器406。例如,向量计算单元407可以将非线性函数应用到运算电路403的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元407生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路403的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器406用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器405(direct memory access controller,DMAC)将外部存储器中的输入数据搬运到输入存储器401和/或统一存储器406、将外部存储器中的权重数据存入权重存储器402,以及将统一存储器506中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)410,用于通过总线实现主CPU、DMAC和取指存储器409之间进行交互。
与控制器404连接的取指存储器(instruction fetch buffer)409,用于存储控制器404使用的指令。
控制器404，用于调用取指存储器409中缓存的指令，实现控制该运算加速器的工作过程。
一般地,统一存储器406,输入存储器401,权重存储器402以及取指存储器409均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,简称DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
其中,图2或图3所示的卷积神经网络中各层的运算可以由运算电路403或向量计算单元407执行。
在AR图像编辑和电影制作等一些领域,需要将三维虚拟物体合成到真实场景的图像中。为了实现逼真的渲染效果,需要估计真实场景的照明情况,以使得合成图像中的虚拟物体能够呈现与该场景中的真实物体一致的着色、阴影和反射效果,并且使得合成图像能够正确地呈现虚拟物体和真实物体之间的投影或遮挡。
因此,如何确定更加真实的环境光参数成为一个亟待解决的问题。
本申请实施例提出一种图像处理方法,能够提升环境光渲染的质量。
下面结合附图对本申请实施例的图像预测网络的训练方法和图像处理方法进行详细的介绍。
首先,结合图5对本申请实施例的图像预测网络的训练方法500进行详细介绍。图5所示的方法可以由图像预测网络的训练装置来执行,该图像预测网络的训练装置可以是云服务设备,也可以是终端设备,例如,电脑、服务器等运算能力足以用来执行图像预测网络的训练方法的装置,也可以是由云服务设备和终端设备构成的系统。示例性地,方法500可以由图1中的训练设备120、图4中的神经网络处理器40执行。
可选地,方法500可以由CPU处理,也可以由CPU和GPU共同处理,也可以不用GPU,而使用其他适合用于神经网络计算的处理器,本申请不做限制。
方法500包括步骤501与步骤502。下面对步骤501与步骤502进行详细说明。
步骤501、获取训练图像。
可选地，训练图像可以是从拍摄的全景图像中选取出来的部分视角图像，也可以是通过终端设备拍摄的至少两张图像融合得到，具体此处不做限定。
可选地,在训练过程中可以从全景图像中生成训练样本,训练样本包括:输入图像(也即上述的训练图像,例如:训练图像如图6所示)、输入图像对应的蒙板图像(例如:训练图像对应的蒙板图像如图7所示)以及第三目标图像(第三目标图像可以是采集的全景图像,也可以是从全景图像中选取出来的图像,且第三目标图像的视角范围大于训练图像,例如:若第三目标图像是全景图像,第三目标图像如图8所示,图6所示的训练图像可以从图8所示的全景图像选取的部分视角得到)。
本申请实施例中的蒙板图像可以用于区分有效区域与无效区域,蒙板图像也可以理解为黑白色图像,黑色和白色表示不同的区域,例如:训练图像中的场景部分为有效区域(或白色区域),除了场景外的区域为无效区域(黑色区域)。
可选地,获取的训练图像可以与第一图像和/或第二图像在同一场景,也可以在不同场景,当然,如果在同一场景下,训练的效果更佳,同一场景的解释可参照后续第一图像与第二图像在同一场景下的解释。可选地,获取的训练图像的数量可以是一个或多个,具体此处不做限定。
步骤502、以训练图像作为图像预测网络的输入,以损失函数的值小于第一阈值为目标对图像预测网络进行训练,得到训练好的图像预测网络。
其中,损失函数用于指示图像预测网络的输出图像(例如:输出图像如图9所示)与第三目标图像之间的差异。
在该情况下,以减小损失函数的值为目标对图像预测网络进行训练,也就是不断缩小图像预测网络的输出图像与第三目标图像之间的差异。该训练过程可以理解为预测任务。损失函数可以理解为预测任务对应的损失函数。其中,输出图像的视角范围大于输入图像。
可选地，损失函数中的惩罚权重由训练图像对应的蒙板图像控制。在通用的损失函数之前增加一个权重，该权重为0或1。例如：图6中有场景的区域（也可以称为有效区域）对应的权重为1，没有场景的区域（也可以称为无效区域）对应的权重为0，即图7中的白色区域（即有效区域）的权重为1，黑色区域（即无效区域）的权重为0。相当于黑色区域不参与后续计算，这样可以减少训练过程中的算力消耗。通过蒙板图像控制损失函数的权重，示例性的，有场景的区域权重为1，没有场景的区域权重为0，这样可以除去无效的部分，减少无效区域的干扰，提升输出图像的纹理细节。
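示例性的，下面给出一段由蒙板图像控制损失函数权重的示意代码：有效区域的权重为1，无效区域的权重为0，使无效区域不参与损失计算。该示例假设使用PyTorch，并以L1损失为例，损失的具体形式仅为示意，并非对本申请实施例所采用的损失函数的限定。

```python
import torch

def masked_l1_loss(pred, target, mask):
    """pred/target: (N, C, H, W)；mask: (N, 1, H, W)，有效区域为1，无效区域为0。"""
    weight = mask.expand_as(pred)                  # 蒙板作为逐像素的惩罚权重（0或1）
    diff = weight * torch.abs(pred - target)       # 无效区域的误差被置零，不参与计算
    return diff.sum() / weight.sum().clamp(min=1)  # 仅在有效像素上求平均

# 示例：统计输出图像与第三目标图像之间的差异，仅计入有效区域
pred = torch.rand(1, 3, 128, 256)
target = torch.rand(1, 3, 128, 256)
mask = (torch.rand(1, 1, 128, 256) > 0.3).float()
loss = masked_l1_loss(pred, target, mask)
```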
在一种可能实现的方式中,以训练图像作为图像预测网络的输入,以损失函数的值小于第一阈值为目标对图像预测网络进行训练,得到训练好的图像预测网络。可选地,图像预测网络可以是CNN。示例性的,图像预测网络如图10所示,图像预测网络包括编码器和解码器。其中,编码器可以包括卷积、激活以及池化。解码器可以包括:卷积以及上采样。当然,解码器也可以包括反卷积。对于图像预测网络的具体结构此处不做限定。
在另一种可能实现的方式中,如图11所示,可以引入GAN。GAN中的生成器为图像预测网络,生成器生成输出图像;判别器判别该输出图像是不是“真实的”。在对该生成式对抗网络进行训练的过程中,生成器的目标就是尽可能生成真实的图片去欺骗判别器,而判别器的目标就是尽量把生成器生成的输出图片和真实的图片区分开来。其中,该真实的图片为第三目标图像。在理想的状态下,生成器可以生成足以“以假乱真”的输出图像,而判别器难以 判定生成器生成的输出图片究竟是不是真实的。这样就得到了一个优异的生成器,它可以用来生成输出图像。
示例性的,延续上述举例,图12可以理解为图11的一种示例。
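示例性的，下面给出上述生成式对抗训练过程的一段简化示意代码：生成器即图像预测网络，判别器用于区分输出图像与第三目标图像（真实的图片）。该示例假设使用PyTorch，判别器输出为未经激活的logits，网络结构、优化器等均为示意性假设。

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # 判别器输出假设为logits

def gan_step(generator, discriminator, opt_g, opt_d, inp, real):
    # 1) 训练判别器：尽量把生成器生成的输出图像和真实的图片（第三目标图像）区分开
    fake = generator(inp).detach()
    d_real, d_fake = discriminator(real), discriminator(fake)
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) 训练生成器：尽可能生成足以“以假乱真”的输出图像去欺骗判别器
    d_fake = discriminator(generator(inp))
    g_loss = bce(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```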
可选地,上述两种方式中图像预测网络的编码器与解码器的对应层建立跨层连接,即在特征提取后,可能丢失细节,通过跨层连接,可以提供未进行特征提取的图像作为参考,使得结果具有更多纹理细节。
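示例性的，下面给出一个带跨层连接的编码器-解码器结构的简化示意。其中的通道数、层数以及上采样方式均为示意性假设，并非图10所示图像预测网络的实际结构。

```python
import torch
import torch.nn as nn

class TinyPredictNet(nn.Module):
    """简化的编码器-解码器示意：编码器含卷积/激活/池化，解码器含卷积/上采样，并带一条跨层连接。"""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # 跨层连接：解码时拼接enc1未经深层特征提取的特征，为结果保留更多纹理细节
        self.dec = nn.Sequential(nn.Conv2d(64 + 32, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, x):
        f1 = self.enc1(x)                         # 低层特征
        f2 = self.enc2(self.pool(f1))             # 下采样后的高层特征
        return self.dec(torch.cat([self.up(f2), f1], dim=1))
```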
需要说明的是,训练过程也可以不采用前述方法500而采用其他训练方法,此处不做限定。
下面结合附图对本申请实施例的图像处理方法进行详细的介绍。
请参阅图13,本申请实施例中图像处理方法一个实施例,该方法1300包括步骤1301至步骤1306。
步骤1301、图像处理装置获取第一图像以及第二图像。
本申请实施例中的图像处理装置可以是云服务设备,也可以是终端设备,例如,电脑、服务器等运算能力足以用来执行图像处理方法的装置,也可以是由云服务设备和终端设备构成的系统。示例性地,图像处理装置可以是图1中的执行设备110、图4中的神经网络处理器40。
可选的,图像处理装置可以是CPU,也可以是CPU和GPU,也可以不是GPU,而使用其他适合用于神经网络计算的处理器,具体此处不做限定。
本申请实施例中的第一图像以及第二图像是针对同一场景不同视角下拍摄的图像。第一图像为第一设备在第一视角下采集的图像,第二图像为第二设备在第二视角下采集的图像。其中,第一设备与第二设备可以是同一设备,也可以是不同设备。第一设备采集第一图像的时刻与第二设备采集第二图像的时刻可能相同或不同,具体此处不做限定。
本申请实施例中的同一场景可以理解为以下属性中的至少一项被满足:
1、同一场景可以是指两个图像(例如:第一图像与第二图像)中的画面内容部分相同,例如:第一图像与第二图像重叠的内容(或区域、面积)大于或等于30%。
2、同一场景可以是指采集一张图像时设备所在的位置与采集另一张图像时设备所在的位置之间的距离小于某一阈值（例如：第一设备采集第一图像时的位置与第二设备采集第二图像时的位置之间距离为1米，阈值为2米，即距离小于阈值，则可以确定第一图像与第二图像为同一场景下采集的图像）；和/或两个图像（例如：第一图像与第二图像）的视场角的重叠角度大于某一阈值（例如：第一视角与第二视角的重叠角度大于30度）；和/或采集两个图像的设备旋转角度的差异小于某一阈值。其中，旋转角度可以是设备水平角旋转的角度值，也可以是相机俯视角旋转的角度值。
示例性的,上述的重叠角度可以如图14A或14B中的箭头所示,第一设备在第一视角下采集的第一图像,第二设备在第二视角下采集的第二图像,中间重叠的角度称为重叠角度。
上述中的位置可以是相对位置或地理位置等，如果位置是相对位置，可以通过建立场景模型等方式确定第一设备与第二设备的相对位置；如果位置是地理位置，可以是基于全球定位系统（global positioning system，GPS）或北斗导航系统等确定的第一设备的位置与第二设备的位置，进而得到两个位置之间的距离。
3、同一场景还可以是根据光照强度来评判,例如:基于采集一张图像时的天气类型与采集另一张图像时的天气类型是否相近来判断两个图像是否为同一场景,例如:若采集第一图像时为晴天,采集第二图像时为晴天,则可以确定第一图像与第二图像为同一场景。若采集第一图像时为晴天,采集第二图像时为雨天,则可以确定第一图像与第二图像不属于同一场景。
4、同一场景还可以是指第一图像与第二图像的纹理相似度大于或等于某一阈值,一般这种方式需要联合上述其他方式一起判定。
可以理解的是,上述确定第一图像与第二图像是否为同一场景只是举例,实际应用中,还可以有其他方式,具体此处不做限定。
上述几种举例可以单独判定,也可以联合判定,例如:确定距离小于某一阈值后,且天气类型一致,确定第一图像与第二图像为同一场景下采集的图像。或者确定距离小于某一阈值后,且两个图像的纹理相似度大于或等于某一阈值,则可以确定第一图像与第二图像为同一场景下采集的图像。或者确定距离小于某一阈值后,还可以判断两个图像的视场角的重叠角度是否大于某一阈值,若大于该阈值,则确定第一图像与第二图像为同一场景下采集的图像。
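示例性的，下面给出一段联合多项属性判定两张图像是否属于同一场景的示意代码，其中的阈值取值与条件组合方式仅为示例，可按实际需要调整。

```python
def is_same_scene(dist_m, overlap_deg, weather_a, weather_b,
                  dist_thr=2.0, overlap_thr=30.0):
    """联合判定示例：位置距离小于阈值、视场角重叠角度大于阈值且天气类型一致时，判定为同一场景。
    判定条件的组合方式与各阈值仅为示意。"""
    return dist_m < dist_thr and overlap_deg > overlap_thr and weather_a == weather_b

# 例如：两次采集相距1米、视角重叠40度、均为晴天 -> 判定为同一场景
print(is_same_scene(1.0, 40.0, "晴天", "晴天"))  # True
```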
可选地,上述中的属性(位置、重叠角度、重叠面积和/或光照强度信息)可以跟随第一图像或第二图像一起存储,图像处理装置获取图像的前后,还可以获取该图像对应的属性,便于判断几张图像是否为同一场景,进而正确执行后续的融合操作。
一种可能实现的方式中,图像处理装置可以在不同视角下采集或拍摄第一图像以及第二图像。即图像处理装置、第一设备以及第二设备为同一设备。
另一种可能实现的方式中，图像处理装置接收其他设备发送的第一图像以及第二图像。其中，该其他设备可以是相机或惯性测量单元（inertial measurement unit，IMU）等具有采集或拍摄功能的设备。即图像处理装置与第一设备或第二设备为不同设备。
可以理解的是,图像处理装置除了获取第一图像以及第二图像还可以获取其他视角的图像,本申请实施例仅以第一图像以及第二图像为例进行示意性说明。
本申请实施例仅以采集第一图像或第二图像的设备是相机为例进行示意性说明。
示例性的,第一图像如图15A所示,第二图像如图15B所示。
步骤1302、图像处理装置基于空间映射模型对第一图像以及第二图像进行映射处理,得到第一展开图像以及第二展开图像。
本申请实施例中,第一设备采集第一图像时的位姿称为第一位姿,第二设备采集第二图像时的位姿称为第二位姿。其中,位姿可以理解为设备采集图像时的位置和方向,示例性的,位姿可以用6自由度姿态数据(XYZ三个轴的位置以及三个轴的旋转角度)或变换矩阵等参数描述。
本申请实施例中的第一图像与第二图像的位姿是基于各自空间坐标系下的位姿,即第一图像对应一个空间坐标系,第二图像对应另一个空间坐标系。空间映射模型的作用是根据第一位姿以及第二位姿分别将第一图像以及第二图像映射在同一空间坐标系下,并展开得到第一展开图像以及第二展开图像,第一展开图像以及第二展开图像是基于同一空间坐标系下的二维图像。有利于提升后续对同一场景下虚拟对象的环境光渲染的真实度。
本申请实施例中的空间映射模型的作用是根据第一位姿以及第二位姿将第一图像以及第二图像映射在同一空间坐标系,下面仅以空间映射模型是球模型为例进行示意性说明,可以理解的是空间映射模型也可以是立方体模型等,具体此处不做限定。
第一步、构建场景的球模型。
示例性的，如图16所示，根据第一图像构建世界坐标系，并以第一设备的光心位置为原点O构建球模型，以光轴OF在水平方向的投影为X轴，以竖直方向为Z轴，水平方向为Y轴。其中，世界坐标系的中心点与第一设备的光心位置为同一点。理论的光心位置为一个凸透镜的中心，实际的光心位置可能是多个凸透镜组合的虚拟中心。
其中,球的半径根据需要设置,例如:室内场景可以设置3米以上,室外场景可以设置10米以上,半径的具体数值此处不做限定。
第二步、确定投影区域。
根据第一图像的位姿确定第一图像在球模型上的第一投影区域。可以理解为：根据第一位姿确定第一图像在球模型中的位置和朝向，并基于小孔成像原理，将第一图像中的各像素数据映射到球模型上得到第一投影区域。下面详细介绍第二图像在上述球模型上如何确定第二投影区域。
图像处理装置根据第二位姿，确定第二图像在上述创建的球模型中的位置和朝向。示例性的，第二图像的投影方式如图17所示，O为世界坐标系的中心点坐标（也是第一设备采集第一图像时的光心位置），O 1为第二设备采集第二图像时的位置，A 1B 1C 1D 1为相机成像平面。其中，P 1为第二图像中的任意一个像素点，可以通过OP 1与球模型表面的交点确定第二图像中的P 1在球模型上的投影点为P 2，OP 2相对XOZ平面的角度为Φ，OP 2在XOZ平面上的投影相对OX的角度为θ。当然，也可以通过O 1P 1与球模型表面的交点确定第二图像中的P 1在球模型上的投影点为P 2，具体此处不做限定。其他点类似于P 1点确定P 2的方式，进而可以得到第二图像A 1B 1C 1D 1在球模型上的第二投影区域A 2B 2C 2D 2。
第三步、纹理映射。
确定第二图像对应的第二投影区域后,可以通过OP 1或O 1P 1所在的直线确定第二图像中的像素点在第二投影区域中的位置。
下面简单介绍通过下述公式确定第二图像与第二展开图像中各像素点的对应关系:
假设第二设备的内部参数:焦距fx,fy,像元尺寸dx,dy;第二图像的宽为w,高为h。第二设备的当前的位姿:O 1(x 1,y 1,z 1),光轴绕X轴、Y轴以及Z轴的旋转角度分别为:α、β、γ。
对于第二图像上任意一点P 1的像素坐标为(x,y),则P 1点在相机坐标系中的坐标为:P cam(x*dx,y*dy,fx*dx),根据相机坐标系到世界坐标系的变换可以得到P 1点在世界坐标系中的坐标P world(x world,y world,z world)。从相机坐标系转换到世界坐标系属于刚体变换(物体不会发生变化,只需要进行旋转和平移),即相机坐标系经过旋转和平移后可以得到世界坐标系。
P_world = R·P_cam^T + T_cam→world
其中，P_cam^T为P_cam的转置矩阵，即P_cam^T = (x*dx, y*dy, fx*dx)^T。
由于相机坐标系的旋转与XYZ三个轴方向的旋转相关,所以相机坐标系转换到世界坐标系中的旋转矩阵包括三个分量,即R=R xR yR z,T cam→world表示相机坐标系转换到世界坐标系的位移矩阵。
R_x = [[1, 0, 0], [0, cos(pitch), -sin(pitch)], [0, sin(pitch), cos(pitch)]]
R_y = [[cos(yaw), 0, sin(yaw)], [0, 1, 0], [-sin(yaw), 0, cos(yaw)]]
R_z = [[cos(roll), -sin(roll), 0], [sin(roll), cos(roll), 0], [0, 0, 1]]
（即绕X轴、Y轴、Z轴旋转的标准旋转矩阵形式）
其中,pitch=-α、yaw=-β、roll=-γ。
确定P 1点在世界坐标系中的坐标P world之后，可以根据O点与P world确定OP 1的直线方程，即求解
x/x world = y/y world = z/z world（其中O为世界坐标系的原点）
得到OP 1的直线方程，具体计算这里不再赘述。求解得到OP 1的直线方程之后，该直线与球模型表面上靠近P 1一侧的交点即为映射点P 2，该P 2点在球面上的角度坐标记为(θ,Φ)。根据P 2点的坐标，可以通过下述公式求取P 2点在第二展开图像中的像素坐标P 3(x exp,y exp)。
x exp = θ/(2π)·w exp
y exp = (1/2 − Φ/π)·h exp
其中，w exp与h exp分别为第二展开图像的宽和高，即按经纬度等间隔展开。
通过上述方式可以获取第二图像中任一点在第二展开图像中的像素点坐标。其他点的确定方式与通过P 1确定P 3类似,通过该种方式,将第二图像中各像素点的数值填充至第二展开图像中,可以使得第二展开图像具有与第二图像相近的纹理效果。
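示例性的，下面给出上述“像素点→世界坐标→球面交点→展开图像坐标”映射过程的一段示意实现。其中旋转矩阵采用绕X、Y、Z轴的标准形式，球面角度与展开方式按照上文的描述实现，但角度的符号约定、展开图像的分辨率等均为示意性假设。

```python
import numpy as np

def rot_xyz(pitch, yaw, roll):
    """绕X、Y、Z轴的标准旋转矩阵的乘积 R = Rx @ Ry @ Rz（角度符号约定为示意性假设）。"""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def pixel_to_expanded(x, y, fx, dx, dy, T, pitch, yaw, roll, w_exp, h_exp):
    """把第二图像中的像素点P1映射为第二展开图像中的坐标P3（示意实现）。"""
    p_cam = np.array([x * dx, y * dy, fx * dx])        # P1在相机坐标系中的坐标
    p_world = rot_xyz(pitch, yaw, roll) @ p_cam + T    # 刚体变换到世界坐标系
    d = p_world / np.linalg.norm(p_world)              # O与P_world连线方向，即与球面交点P2的方向
    phi = np.arcsin(d[1])                              # OP2相对XOZ平面的角度Φ
    theta = np.arctan2(d[2], d[0]) % (2 * np.pi)       # OP2在XOZ平面上的投影相对OX的角度θ
    x_exp = theta / (2 * np.pi) * w_exp                # 按经纬度等间隔展开（展开方式为示意）
    y_exp = (0.5 - phi / np.pi) * h_exp
    return x_exp, y_exp
```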
可选地,通过OP 1的直线方程将第二图像中的各像素点映射到第二展开图像中之后,由于距离的关系,第二展开图像中可能还有一些像素点的数值无法确定,即这些点并未与第二图像中的像素点关联起来。为了后续渲染的真实性,还可以根据上述通过P 1确定P 3的逆过程,通过将第二展开图像中像素点与第二图像中像素点的对应关系,从而确定第二展开图像中各像素点的数值。
延续上述举例，假设P 2为第二展开图像中的任意一点，可以通过O点与P 2点确定OP 2的直线方程，联立第二图像的平面方程可以求解出OP 2的直线方程与平面方程的交点为映射点P 4。
已知平面上的一点与该点的法向量可以求解出平面方程。光心O 1的光轴与相平面的交点为O c(x c,y c,z c):
O c=OO 1+f*n=(x 1+f*cosαcosβ,y 1+f*cosαsinβ,z 1+f*sinα)。
其中,f为焦距物理长度:f=fx*dx。第二图像所在平面的法向量n=(cosαcosβ,cosαsinβ,sinα)。
则第二图像的平面方程为:
(cosαcosβ,cosαsinβ,sinα)*(x 1+f*cosαcosβ-x,y 1+f*cosαsinβ-y,z 1+f*sinα-z)=0。
因此,联立OP 2的直线方程以及第二图像的平面方程可以求解出OP 2的直线方程与平面方程的交点为映射点P 4在世界坐标系下的三维坐标。
由前述公式
P_world = R·P_cam^T + T_cam→world
可以将世界坐标系下该映射点P 4的坐标转换为P 4点在相机坐标系下的坐标,并将相机坐标系下的x值除以dx得到P 4点在第二图像的x值,将相机坐标系下的y值除以dy得到P 4点在第二图像的y值,即从相机坐标系下的三维坐标转换为图像坐标系下的二维坐标,进而得到该映射点在第二图像中的像素点坐标。并将该点在第二图像中的像素值赋予第二展开图像中对应点的像素值,从而使得第二展开图像的像素值更接近第二图像。
第二图像中的各像素点映射完成后,可以将第二投影区域按照经纬度展开,得到第二展开图像。示例性的,第二展开图像如图18所示。同理,确定第一图像对应的第一投影区域,映射后展开第一投影区域得到第一展开图像。
可选地,若服务器(例如:云端服务器)存储有与第一图像和/或第二图像中某一相同位置的图像(以下称为历史图像),图像处理装置还可以从服务器获取历史图像,该历史图像可能是在不同时刻及光照下拍摄的,且该历史图像的采集时刻在第一图像和/或第二图像的采集时刻之前。图像处理装置还可以从服务器获取历史图像以及历史设备采集历史图像时的历史位姿(例如:历史设备采集历史图像时的位置和方向)。然后通过历史位姿将历史图像放置在前述的球模型中,与获取第二图像的第二展开图像类似,获取历史图像对应的历史展开图像。其中,历史设备、第一设备以及第二设备是具有采集图像功能的设备,历史设备、第一设备以及第二设备可以是同一设备,也可以是不同的设备,具体此处不做限定。
可选地,若服务器存储的图像较多,可以根据属性进行匹配确定需要融合的历史图像。示例性的,可以根据位姿确定历史图像(例如:若设备采集图像时的位置与第一设备采集第一图像时的位置或第二设备采集第二图像时的位置之间的距离小于某一阈值,则确定该图像 为历史图像,说明该图像与第一图像和/或第二图像大概率在同一场景下),也可以根据天气类型确定历史图像(例如:确定天气类型与第一图像或第二图像一致的图像为历史图像),还可以根据时间间隔等属性确定历史图像(例如:确定时间间隔在一定阈值内的图像为历史图像,或视角的重叠角度/区域),当然,也可以通过上述属性的组合从服务器选取历史图像(例如:设备采集图像时的位置与第一设备采集第一图像时的位置或第二设备采集第二图像时的位置之间的距离小于某一阈值,且第一图像与第二图像的重叠区域大于某一阈值),具体的选取方式此处不做限定。
步骤1303、图像处理装置融合第一展开图像以及第二展开图像,得到第三图像。
获取第一展开图像以及第二展开图像之后,融合第一展开图像以及第二展开图像,得到视场角(或视角范围)更大的第三图像(示例性的,第三图像如图19所示)。
示例性的,请参阅图20,I 1为第一展开图像的一部分,I 2为第二展开图像的一部分,由于第一图像与第二图像是同一场景不同视角下采集的图像。因此,第一展开图像与第二展开图像会存在重叠区域。
可选地，根据重叠区域形状的不同可以采用不同的方式融合，示例性的，若重叠区域的宽度大于高度，则进行竖直融合；若重叠区域的宽度小于或等于高度，则进行水平融合。竖直融合与水平融合原理类似，下面仅以竖直融合为例进行描述。
确定重叠区域内各列像素的高度Δh以及融合区域的像素高度阈值δ。融合区域太大会导致模糊区域过大，融合区域太小会导致融合不够，δ的设置需要适中，示例性的，δ的取值范围可以为0至150，δ根据实际需要设置，具体此处不做限定。实际融合区域的高度取值可以为min(δ,Δh)，示例性的，如图20中所示的融合区域高度。设融合区域的某一点P(x,y)所在列的上侧边界点的Y坐标为y min，下侧边界点的Y坐标为y max，则P点的像素值I可以通过下述公式确定：
I=α*I 1+(1-α)*I 2
α = (y max − y)/(y max − y min)
其中，α在融合区域内随y线性变化：在y min处为1，在y max处为0，即P点越靠近上侧边界越接近I 1的像素值，越靠近下侧边界越接近I 2的像素值。
上述公式中,I 1为P点在第一展开图像中的像素值,I 2为P点在第二展开图像中的像素值。
可以理解的是,上述公式只是举例,P点的像素值可以是通过上述加权的方式,也可以是平均的方式等得到,具体方式此处不做限定。
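示例性的，下面给出重叠区域竖直融合的一段示意实现：对融合区域内某一列的每个像素，按其在列内的相对位置计算权重α，再对两幅展开图像的像素值加权求和。其中α随y线性变化的取法为示意性假设。

```python
import numpy as np

def blend_column(col1, col2, y_min, y_max):
    """对同一列上融合区域[y_min, y_max]内的像素做加权融合：I = α*I1 + (1-α)*I2。
    α 在 y_min 处为1（完全取第一展开图像），在 y_max 处为0（完全取第二展开图像）。"""
    out = col1.copy()
    for y in range(y_min, y_max + 1):
        alpha = (y_max - y) / max(y_max - y_min, 1)
        out[y] = alpha * col1[y] + (1 - alpha) * col2[y]
    return out

# 示例：某一列像素（灰度值）在高度为min(δ, Δh)的融合区域内做竖直融合
col1 = np.full(10, 200.0)   # 第一展开图像该列的像素值 I1
col2 = np.full(10, 100.0)   # 第二展开图像该列的像素值 I2
print(blend_column(col1, col2, y_min=3, y_max=7))
```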
可选地,图像处理装置融合第一展开图像、第二展开图像以及历史展开图像,得到第三图像。或者图像处理装置将历史展开图像与第三图像进行融合得到第四图像,并将第四图像当做第三图像执行后续步骤。
示例性的,图21为历史图像,图22为融合历史展开图像后的第四图像或第三图像。
可选地，当图像为多个时，后续图像的处理方式与历史图像的操作类似，融合得到第三图像。当然，如果后续待融合的图像与第一图像的采集时刻的间隔在一定阈值范围内，可以直接将后续图像与已融合得到的图像进行再次融合，得到新的融合图像，并将该融合图像当做第三图像执行后续步骤。
可选地,上述的方法是基于第一图像建立的球模型进行的操作。可以理解的是,上述的方法不一定是在第一图像建立的球模型上进行的操作。由于相机同时会有旋转和位移的姿态变化,当采集一个图像的设备与采集另一个图像的设备之间的位移变化大于或等于第二阈值时,会影响后续的融合质量,可以清空已融合的图像,基于当前的图像(即相机的新位置)重新构建新球模型,再对后续图像在新的球模型上执行上述映射、融合的步骤。
可选地,多视角图像的融合还可以根据采集两个图像的设备旋转角度的差异与第三阈值的大小进行控制。可选地,用户可以通过用户界面(user interface,UI)对上述历史图像、位移变换、视角变换进行图像处理设置。即用户输入指令,图像处理装置执行相应的步骤。如图23所示,云数据融合用于用户设置是否结合服务器的历史图像进行图像融合。多视角融合中的高质量以及高性能用于用户设置前述的第二阈值与第三阈值。高性能对应的第二阈值大于高质量对应的第二阈值,高性能对应的第三阈值大于高质量对应的第三阈值,即高性能时,第二阈值和第三阈值稍大一些,使得淘汰的图像多一些,增加图像处理装置的处理效率。高质量时,第二阈值和第三阈值稍小一些,使得获得的第三图像的精度高一些。
示例性的,高性能对应的第二阈值为2米,高质量对应的第二阈值为1米。假设用户输入确定高性能的指令,即第二阈值为2米。若设备采集一个图像时的位置与设备采集另一个图像时的位置之间的距离小于2米,则可以对两个图像执行投影、融合等操作。若设备采集一个图像时的位置与设备采集另一个图像时的位置之间的距离大于或等于2米,则可以清空前面的图像,基于当前的图像(即相机的新位置)重新构建新球模型,再对后续图像在新的球模型上执行上述映射、融合等操作。
示例性的,高性能对应的第三阈值为30度,高质量对应的第三阈值为5度。假设用户输入确定高质量的指令,即第三阈值为5度。若旋转角度的差异大于或等于5度,则对两个图像进行映射、融合等操作。若旋转角度的差异小于5度,则丢弃当前图像,基于之前图像构建的球模型,再对后续图像执行上述映射、融合等操作。
示例性的,用户可以通过UI的选择方式很多。一种可能实现的方式中,如图24所示,用户选择云数据融合,即选择融合历史图像,则执行上述融合历史图像的步骤。另一种可能实现的方式中,如图25所示,用户选择高性能,则上述的第二阈值和第三阈值稍大一些。另一种可能实现的方式中,如图26所示,用户选择高质量,则上述的第二阈值和第三阈值稍小一些。用户选择云数据融合,即选择融合历史图像,则执行上述融合历史图像的步骤。当然,用户选择的方式有多种,图像处理装置根据用户的设置进行相应的操作,图23至图26只是几种情况的举例。
步骤1304、图像处理装置将第三图像输入训练好的图像预测网络进行图像预测，得到预测图像。
本步骤中使用的图像预测模型可以是通过上述图5中的方法构建的,也可以是通过其他方式构建的,具体此处不做限定。
一种可能实现的方式中,如图27所示,图像处理装置获取第三图像之后,将第三图像输入训练好的图像预测网络进行图像预测,得到预测图像。
另一种可能实现的方式中,如图28所示,图像处理装置获取第三图像之后,将第三图像以及第三图像对应的蒙板图像输入训练好的图像预测网络进行图像预测,得到预测图像。
示例性的,延续上述举例,图29为预测图像,图30为图27的一种示例。
图像处理装置获取预测图像后,可以利用预测图像对同一场景下虚拟对象进行环境光渲染等操作,其中,虚拟对象为需要进行环境光渲染的对象,虚拟对象可以是虚拟物体,也可以是虚拟场景,具体此处不做限定。下面通过步骤1305以及步骤1306介绍通过利用预测图像的球谐系数进行环境光渲染,可以理解的是,预测图像也可以通过球面高斯或基于图像的光照(image based lighting,IBL)等方式用于环境光渲染,具体此处不做限定。
步骤1305、图像处理装置获取预测图像的球谐系数。本步骤是可选地。
可选地,图像处理装置获取预测图像之后,将预测图像映射到单位空间映射模型上,例如:单位球模型(图31A所示),从而获取预测图像的球谐系数,球谐系数可以用于描述预测图像的环境光数据。
球谐光照实际上就是将周围的环境光采样成几个系数(即几个球谐系数),然后渲染的时候用这几个球谐系数来对光照进行还原,这种过程可以看作是对周围环境光的简化。每采样一个像素,就计算相应的球谐基,并且对像素与对应的球谐基相乘后再求和,这样相当于每个球谐基在所有像素上的积分。不过,为了得到球谐基上的平均光照强度,还需要将积分得到的数值乘以立体角并且除以总像素。简单来说就是运用下面的公式求得球谐系数:
c_i = (4π/N)·Σ_{j=1}^{N} light(x j)·y i(x j)
其中，c_i为第i个球谐系数，i为球谐系数索引，N为采样点的个数，对于阶数n，对应单个图像通道的球谐系数有n²个，对于三通道环境图，球谐系数为3n²个，light(x j)为样本点的RGB值。y i(x j)为球谐基，对于阶数n，y i(x j)分为多个带：0,…,l,…,n-1。带l中包括2l+1个球谐基。球谐基可以通过下述公式
y_l^m(θ,φ) = √2·K_l^m·cos(mφ)·P_l^m(cosθ)，m>0
y_l^m(θ,φ) = √2·K_l^m·sin(−mφ)·P_l^{−m}(cosθ)，m<0
y_l^0(θ,φ) = K_l^0·P_l^0(cosθ)，m=0
计算得到，其中，i=l(l+1)+m，m取值：-l,-(l-1),…,0,…,(l-1),l。
其中，K_l^m = √[(2l+1)·(l−|m|)!/(4π·(l+|m|)!)]相当于一个缩放系数，用于归一化；P_l^m为伴随勒让德多项式，含义为在勒让德多项式P_l(x)的基础上对x求m阶导数，即P_l^m(x) = (−1)^m·(1−x²)^(m/2)·d^m P_l(x)/dx^m，其中P_l(x)为l阶勒让德多项式。
若图像处理装置将预测图像映射到单位球模型上,通过上述公式,可以得到一组球谐系数。示例性的,球谐系数取三阶,RGB三个通道分别包括9个系数,共27个系数。
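示例性的，下面给出由经纬度展开的预测图像计算三阶球谐系数的一段示意实现：对每个采样像素计算前9个实球谐基，与像素RGB相乘后累加，再乘以立体角4π并除以采样点数N。球谐基采用常见的实数形式，经纬度与方向向量的对应关系为示意性假设。

```python
import numpy as np

def sh_basis_9(d):
    """前9个实球谐基（三阶，l=0,1,2），d为单位方向向量(x, y, z)。"""
    x, y, z = d
    return np.array([
        0.282095,                                            # l=0
        0.488603 * y, 0.488603 * z, 0.488603 * x,            # l=1
        1.092548 * x * y, 1.092548 * y * z,                  # l=2
        0.315392 * (3 * z * z - 1),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ])

def sh_coeffs(env):
    """env: (H, W, 3) 经纬度展开的预测图像；返回 (9, 3) 的球谐系数，RGB三通道共27个。"""
    H, W, _ = env.shape
    coeffs = np.zeros((9, 3))
    for v in range(H):
        phi = (0.5 - (v + 0.5) / H) * np.pi                  # 纬度角
        for u in range(W):
            theta = (u + 0.5) / W * 2 * np.pi                # 经度角
            d = np.array([np.cos(phi) * np.cos(theta), np.sin(phi), np.cos(phi) * np.sin(theta)])
            coeffs += np.outer(sh_basis_9(d), env[v, u])     # 像素RGB与球谐基相乘后累加
    return coeffs * (4 * np.pi) / (H * W)                    # 乘以立体角并除以总像素数N
```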
步骤1306、图像处理装置利用球谐系数对虚拟物体进行环境光渲染。本步骤是可选地。
下面以RGB中的R通道为例,进行示意性说明。
计算虚拟物体上任意一点P的RGB值。RGB值中的R数值等于第一数值、第二数值以及第三数值的和。首先确定该P点的法向量为(x,y,z,1)，则第一数值等于R通道的球谐系数前4个构成的向量与法向量的点积。第二数值等于R通道的球谐系数的第5个至第8个构成的向量与VB的点积，其中VB为(xy,yz,zz,zx)。第三数值等于R通道的球谐系数的第9个与VC的乘积，其中，VC为x²-y²。通过上述方式，可以得到P点对应的R通道数值。G通道数值以及B通道数值的计算方式与R通道的计算方式类似，此处不再赘述。
确定P点的RGB值之后,球面上其他点的RGB值计算与P点类似,从而完成虚拟对象的渲染。其中,虚拟对象可以是场景下的虚拟物体或虚拟场景,具体此处不做限定。
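示例性的，下面给出利用上述27个球谐系数计算虚拟对象上某一点着色值的示意实现：对每个通道，法向量(x,y,z,1)与前4个系数点积得到第一数值，(xy,yz,zz,zx)与第5至第8个系数点积得到第二数值，第9个系数与x²-y²相乘得到第三数值，三者求和即为该通道的数值。该实现仅为示意，函数名与数据组织方式均为假设。

```python
import numpy as np

def shade_point(normal, coeffs):
    """normal: 单位法向量(x, y, z)；coeffs: (9, 3) 球谐系数；返回P点的RGB值（示意实现）。"""
    x, y, z = normal
    va = np.array([x, y, z, 1.0])                 # 与前4个球谐系数点积 -> 第一数值
    vb = np.array([x * y, y * z, z * z, z * x])   # 与第5至第8个系数点积 -> 第二数值
    vc = x * x - y * y                            # 与第9个系数相乘 -> 第三数值
    rgb = []
    for c in range(3):
        ch = coeffs[:, c]
        rgb.append(va @ ch[:4] + vb @ ch[4:8] + vc * ch[8])
    return np.array(rgb)

# 示例：对法向量朝上的一点计算RGB
print(shade_point(np.array([0.0, 0.0, 1.0]), np.random.rand(9, 3)))
```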
示例性的,球谐光照效果如图31B。
一种可能实现的方式中,图像处理方法包括步骤1301至步骤1304。另一种可能实现的方式中,图像处理方法包括步骤1301至步骤1306。
可选地,图13所示的方法可以循环执行,即构建空间映射模型之后,可以获取多个图像,并将多个图像映射到空间映射模型,得到多个展开图像,再融合得到待输入图像预测网络的第三图像。换句话说,第三图像可以由不同视角的两个图像融合得到,也可以是由不同视角的多个图像融合得到。
可选地,融合第三图像之后,未输入图像预测网络之前,还可以不断更新第三图像,即融合目标的第三图像以及后续的展开图像(获取后续展开图像的方式与前述第二展开图像获取的方式类似),得到新的第三图像,再将新的第三图像输入图像预测网络进行预测,得到预测图像。
本申请实施例,通过空间映射模型映射以及融合多个视角的展开图像使得输入图像预测网络后得到的预测图像具有更多纹理,提升后续该场景下虚拟对象的环境光渲染质量。例如:增强虚拟物体渲染的真实性。进一步的,输入图像还可以参考服务器存储的相同位置的图像(即场景中某一位置的纹理信息)。以及用户可以通过用户界面灵活设置融合的方式,例如:是高性能还是高质量。
相应于上述方法实施例给出的方法,本申请实施例还提供了相应的装置,包括用于执行上述实施例相应的模块。所述模块可以是软件,也可以是硬件,或者是软件和硬件结合。
请参阅图32,本申请实施例中图像处理装置的一个实施例,该图像处理装置可以是本地设备(例如,手机、摄像机等)或云端设备。该图像处理装置包括:
获取单元3201,用于获取第一图像以及第二图像,第一图像与第二图像为同一场景不同视角下采集的图像;
映射单元3202,用于基于空间映射模型对第一图像以及第二图像进行映射处理,得到第一展开图像以及第二展开图像;
融合单元3203,用于融合第一展开图像以及第二展开图像,得到第三图像;
预测单元3204,用于将第三图像输入训练好的图像预测网络进行图像预测,得到预测图像,预测图像用于前述场景下虚拟对象的环境光渲染。
本实施例中,图像处理装置中各单元所执行的操作与前述图5至图13所示实施例中描述的类似,此处不再赘述。
本实施例中,通过映射单元3202对多视角图像进行空间映射模型映射以及融合单元3203融合多个视角的展开图像使得预测单元3204输入图像预测网络后得到的预测图像具有更多纹理,提升后续该场景下虚拟对象的环境光渲染质量。
请参阅图33,本申请实施例中图像处理装置的另一实施例,该图像处理装置可以是本地设备(例如,手机、摄像机等)或云端设备。该图像处理装置包括:
获取单元3301,用于获取第一图像以及第二图像,第一图像与第二图像为同一场景不同视角下采集的图像;
映射单元3302,用于基于空间映射模型对第一图像以及第二图像进行映射处理,得到第一展开图像以及第二展开图像;
融合单元3303,用于融合第一展开图像以及第二展开图像,得到第三图像;
预测单元3304,用于将第三图像输入训练好的图像预测网络进行图像预测,得到预测图像,预测图像用于前述场景下虚拟对象的环境光渲染。
上述映射单元3302包括:
构建子单元33021,用于根据第一设备的光心构建空间映射模型,第一设备为采集第一图像的设备,第二设备为采集第二图像的设备;
确定子单元33022,用于根据第一设备的第一位姿确定第一图像在空间映射模型上的第一投影区域,第一位姿为第一设备采集第一图像时的位姿;
映射子单元33023,用于将第一图像内各像素点映射到第一投影区域,得到第一展开图像;
确定子单元33022,还用于根据第二设备的第二位姿确定第二图像在空间映射模型上的第二投影区域,第二位姿为第二设备采集第二图像时的位姿;
映射子单元33023,还用于将第二图像内各像素点映射到第二投影区域内,得到第二展开图像。
获取单元3301,还用于从服务器获取历史图像以及该历史图像的历史位姿,历史图像的采集时刻在第一图像或第二图像的采集时刻之前,历史位姿为历史设备采集历史图像时的位姿,历史图像存储有与第一图像和/或第二图像中相同位置的图像;
获取单元3301,还用于获取预测图像的球谐系数;
本实施例中的图像处理装置还包括:
渲染单元3305,用于利用球谐系数对虚拟物体进行环境光渲染。
确定子单元33022,还用于根据历史位姿确定历史图像在空间映射模型上的历史投影区域,
映射子单元33023,还用于将历史图像内各像素点映射到历史投影区域,得到历史展开图像;
融合单元3303，具体用于融合第一展开图像、第二展开图像以及历史展开图像，得到第三图像。
本实施例中,图像处理装置中各单元所执行的操作与前述图5至图13所示实施例中描述的类似,此处不再赘述。
本实施例中,通过映射单元3302对多视角图像进行空间映射模型映射以及融合单元3303融合多个视角的展开图像使得预测单元3304输入图像预测网络后得到的预测图像具有更多纹理,提升后续该场景下虚拟对象的环境光渲染质量。例如:增强虚拟物体渲染的真实性。进一步的,输入图像还可以参考服务器存储的相同位置的图像(即场景中某一位置的纹理信息)。以及用户可以通过用户界面灵活设置融合的方式,例如:是高性能还是高质量。
图34是本申请实施例提供的图像处理装置的硬件结构示意图。图34所示的图像处理装置3400(该装置3400具体可以是一种计算机设备)包括存储器3401、处理器3402、通信接口3403以及总线3404。其中,存储器3401、处理器3402、通信接口3403通过总线3404实现彼此之间的通信连接。
存储器3401可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器3401可以存储程序,当存储器3401中存储的程序被处理器3402执行时,处理器3402和通信接口3403用于执行本申请实施例的图像处理方法的各个步骤。
处理器3402可以采用通用的中央处理器(central processing unit,CPU),微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的图像处理装置中的单元所需执行的功能,或者执行本申请方法实施例的图像处理方法。
处理器3402还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的图像处理方法的各个步骤可以通过处理器3402中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器3402还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器3401,处理器3402读取存储器3401中的信息,结合其硬件完成本申请实施例的图像处理装置中包括的单元所需执行的功能,或者执行本申请方法实施例的图像处理方法。
通信接口3403使用例如但不限于收发器一类的收发装置,来实现装置3400与其他设备或通信网络之间的通信。例如,可以通过通信接口3403获取训练数据(如本申请实施例所述的训练图像)。
总线3404可包括在装置3400各个部件(例如,存储器3401、处理器3402、通信接口3403)之间传送信息的通路。
应注意,尽管图34所示的装置3400仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置3400还包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置3400还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置3400也可仅仅包括实现本申请实施例所必须的器件,而不必包括图34中所示的全部器件。
本申请实施例中还提供一种计算机程序产品，当其在计算机上运行时，使得计算机执行如前述图像处理装置所执行的步骤。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述图像处理装置所执行的步骤。
本申请实施例提供的图像处理装置或终端设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使图像处理装置内的芯片执行上述实施例描述的图像处理方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (16)

  1. 一种图像处理方法,其特征在于,包括:
    获取第一图像以及第二图像,所述第一图像与所述第二图像为同一场景不同视角下采集的图像;
    基于空间映射模型对所述第一图像以及所述第二图像进行映射处理,得到第一展开图像以及第二展开图像;
    融合所述第一展开图像以及所述第二展开图像,得到第三图像;
    将所述第三图像输入训练好的图像预测网络进行图像预测,得到预测图像,所述预测图像用于所述场景下虚拟对象的环境光渲染。
  2. 根据权利要求1所述的方法,其特征在于,所述基于空间映射模型对所述第一图像以及所述第二图像进行映射处理,得到第一展开图像以及第二展开图像,包括:
    根据第一设备的光心构建所述空间映射模型,所述第一设备为采集所述第一图像的设备,所述第二设备为采集所述第二图像的设备;
    根据所述第一设备的第一位姿确定所述第一图像在所述空间映射模型上的第一投影区域,所述第一位姿为所述第一设备采集所述第一图像时的位姿;
    将所述第一图像内各像素点映射到所述第一投影区域,得到所述第一展开图像;
    根据所述第二设备的第二位姿确定所述第二图像在所述空间映射模型上的第二投影区域,所述第二位姿为所述第二设备采集所述第二图像时的位姿;
    将所述第二图像内各像素点映射到所述第二投影区域内,得到所述第二展开图像。
  3. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    从服务器获取历史图像以及所述历史图像的历史位姿,所述历史图像的采集时刻在所述第一图像或所述第二图像的采集时刻之前,所述历史位姿为历史设备采集所述历史图像时的位姿,所述历史图像存储有与所述第一图像和/或所述第二图像中相同位置的图像;
    根据所述历史位姿确定所述历史图像在所述空间映射模型上的历史投影区域;
    将所述历史图像内各像素点映射到所述历史投影区域,得到历史展开图像;
    所述融合所述第一展开图像以及所述第二展开图像,得到第三图像,包括:
    融合所述第一展开图像、所述第二展开图像以及所述历史展开图像，得到所述第三图像。
  4. 根据权利要求1至3中任一项所述的方法,其特征在于,所述训练好的图像预测网络是通过以训练图像作为所述图像预测网络的输入,以损失函数的值小于第一阈值为目标对图像预测网络进行训练得到;
    所述损失函数用于指示图像预测网络输出的图像与第三目标图像之间的差异,所述第三目标图像为采集的图像。
  5. 根据权利要求4所述的方法,其特征在于,所述损失函数的权重由所述训练图像对应的蒙板图像控制。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述方法还包括:
    获取所述预测图像的球谐系数;
    利用所述球谐系数对虚拟物体进行环境光渲染。
  7. 根据权利要求1至6中任一项所述的方法，其特征在于，所述第三图像的视场角大于所述第一图像或所述第二图像。
  8. 一种图像处理装置,其特征在于,包括:
    获取单元,用于获取第一图像以及第二图像,所述第一图像与所述第二图像为同一场景不同视角下采集的图像;
    映射单元,用于基于空间映射模型对所述第一图像以及所述第二图像进行映射处理,得到第一展开图像以及第二展开图像;
    融合单元,用于融合所述第一展开图像以及所述第二展开图像,得到第三图像;
    预测单元,用于将所述第三图像输入训练好的图像预测网络进行图像预测,得到预测图像,所述预测图像用于所述场景下虚拟对象的环境光渲染。
  9. 根据权利要求8所述的装置,其特征在于,所述映射单元包括:
    构建子单元,用于根据第一设备的光心构建所述空间映射模型,所述第一设备为采集所述第一图像的设备,所述第二设备为采集所述第二图像的设备;
    确定子单元,用于根据所述第一设备的第一位姿确定所述第一图像在所述空间映射模型上的第一投影区域,所述第一位姿为所述第一设备采集所述第一图像时的位姿;
    映射子单元,用于将所述第一图像内各像素点映射到所述第一投影区域,得到所述第一展开图像;
    所述确定子单元,还用于根据所述第二设备的第二位姿确定所述第二图像在所述空间映射模型上的第二投影区域,所述第二位姿为所述第二设备采集所述第二图像时的位姿;
    所述映射子单元,还用于将所述第二图像内各像素点映射到所述第二投影区域内,得到所述第二展开图像。
  10. 根据权利要求9所述的装置,其特征在于,所述获取单元,还用于从服务器获取历史图像以及所述历史图像的历史位姿,所述历史图像的采集时刻在所述第一图像或所述第二图像的采集时刻之前,所述历史位姿为历史设备采集所述历史图像时的位姿,所述历史图像存储有与所述第一图像和/或所述第二图像中相同位置的图像;
    所述确定子单元,还用于根据所述历史位姿确定所述历史图像在所述空间映射模型上的历史投影区域;
    所述映射子单元,还用于将所述历史图像内各像素点映射到所述历史投影区域,得到历史展开图像;
    所述融合单元，具体用于融合所述第一展开图像、所述第二展开图像以及所述历史展开图像，得到所述第三图像。
  11. 根据权利要求8至10中任一项所述的装置,其特征在于,所述训练好的图像预测网络是通过以训练图像作为所述图像预测网络的输入,以损失函数的值小于第一阈值为目标对图像预测网络进行训练得到;
    所述损失函数用于指示图像预测网络输出的图像与第三目标图像之间的差异,所述第三目标图像为采集的图像。
  12. 根据权利要求11所述的装置,其特征在于,所述损失函数的权重由所述训练图像对应的蒙板图像控制。
  13. 根据权利要求8至12中任一项所述的装置，其特征在于，所述获取单元，还用于获取所述预测图像的球谐系数；
    所述图像处理装置还包括:
    渲染单元,用于利用所述球谐系数对虚拟物体进行环境光渲染。
  14. 根据权利要求8至13中任一项所述的装置,其特征在于,所述第三图像的视场角大于所述第一图像或所述第二图像。
  15. 一种图像处理装置,其特征在于,包括:处理器,所述处理器与存储器耦合,所述存储器用于存储程序或指令,当所述程序或指令被所述处理器执行时,使得所述图像处理装置执行如权利要求1至7中任一项所述的方法。
  16. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有指令,所述指令在计算机上执行时,使得所述计算机执行如权利要求1至7中任一项所述的方法。