CN116630514A - Image processing method, device, computer readable storage medium and electronic equipment - Google Patents


Info

Publication number
CN116630514A
Authority
CN
China
Prior art keywords
image
dimensional
target
features
depth
Prior art date
Legal status
Pending
Application number
CN202310596577.2A
Other languages
Chinese (zh)
Inventor
张琦
杨明川
刘巧俏
邹航
Current Assignee
Beijing Research Institute Of China Telecom Corp ltd
China Telecom Corp Ltd
Original Assignee
Beijing Research Institute Of China Telecom Corp ltd
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Research Institute Of China Telecom Corp ltd, China Telecom Corp Ltd filed Critical Beijing Research Institute Of China Telecom Corp ltd
Priority to CN202310596577.2A
Publication of CN116630514A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G06T15/205 Image-based rendering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application belongs to the field of artificial intelligence, and relates to an image processing method, an image processing apparatus, a storage medium and an electronic device. The method comprises the following steps: acquiring an image to be processed, together with a preset camera pose and view angle information corresponding to the image to be processed; inputting the image to be processed, the preset camera pose and the view angle information into a three-dimensional reconstruction model; determining, through the three-dimensional reconstruction model, color information and depth information corresponding to the image to be processed according to the preset camera pose and the view angle information; and rendering, according to the color information and the depth information, a two-dimensional image corresponding to the image to be processed and having the view angle information. The application can improve the efficiency and quality of three-dimensional reconstruction and ensure view angle consistency.

Description

Image processing method, device, computer readable storage medium and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an image processing method, an image processing apparatus, a computer readable storage medium, and an electronic device.
Background
Three-dimensional reconstruction and new view angle image rendering are at the core of computer graphics and are also a focus of image processing research. With the emergence of new concepts such as digital twins, holographic communication and the metaverse, industry demand for three-dimensional reconstruction and new view angle image rendering has gradually increased.
At present, when performing three-dimensional reconstruction and new view angle image rendering, the position information of a two-dimensional image is used as the input of a three-dimensional reconstruction model and color information is used as supervision to recover the three-dimensional information of an object; as a result, after lengthy training, the whole system can only render new view angles for a single, specific object.
It should be noted that the information disclosed in the foregoing background section is only for enhancement of understanding of the background of the application.
Disclosure of Invention
The application aims to provide an image processing method, an image processing apparatus, a computer readable storage medium and an electronic device, so as to improve, at least to a certain extent, the accuracy of a three-dimensional reconstruction model and its capability of modeling different scenes.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to a first aspect of the present application, there is provided an image processing method comprising: acquiring an image to be processed, and preset camera pose and view angle information corresponding to the image to be processed; inputting the image to be processed, the preset camera pose and the view angle information into a three-dimensional reconstruction model; determining, through the three-dimensional reconstruction model, color information and depth information corresponding to the image to be processed according to the preset camera pose and the view angle information; and rendering, according to the color information and the depth information, a two-dimensional image corresponding to the image to be processed and having the view angle information.
According to a second aspect of the present application, there is provided an image processing apparatus comprising: an acquisition module, configured to acquire an image to be processed, and preset camera pose and view angle information corresponding to the image to be processed; and a reconstruction module, configured to input the image to be processed, the preset camera pose and the view angle information into a three-dimensional reconstruction model, determine, through the three-dimensional reconstruction model, color information and depth information corresponding to the image to be processed according to the preset camera pose and the view angle information, and render, according to the color information and the depth information, a two-dimensional image corresponding to the image to be processed and having the view angle information.
According to a third aspect of the present application, there is provided a computer storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the above-mentioned image processing method.
According to a fourth aspect of the present application, there is provided an electronic apparatus characterized by comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the image processing method described above via execution of the executable instructions.
As can be seen from the above technical solutions, the image processing method, the image processing apparatus, the computer-readable storage medium, and the electronic device according to the exemplary embodiments of the present application have at least the following advantages and positive effects:
the image processing method in the embodiment of the application comprises the steps of firstly, acquiring an image to be processed with a preset camera pose and visual angle information; and then inputting the image to be processed and the view angle information into a three-dimensional reconstruction model, determining color information and depth information corresponding to the image to be processed according to the preset camera pose and view angle information through the three-dimensional reconstruction model, and rendering according to the color information and the depth information to generate a two-dimensional image corresponding to the image to be processed and having the view angle information. According to the image processing method, when three-dimensional reconstruction is carried out, the reconstruction is carried out according to the RGB features and the depth features of the image to be processed, and compared with the reconstruction carried out according to the RGB features of the image to be processed, the accuracy of the three-dimensional reconstruction can be improved, and the consistency of the visual angles is maintained.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 schematically shows a system architecture diagram to which an image processing method is applied in an embodiment of the present application.
Fig. 2 schematically shows a flowchart of an image processing method in an embodiment of the application.
Fig. 3 schematically shows a schematic structural diagram of a three-dimensional reconstruction model in an embodiment of the present application.
Fig. 4 schematically illustrates a flowchart of acquiring color information and depth information corresponding to a pixel point in an image to be processed in an embodiment of the present application.
Fig. 5 schematically illustrates a flowchart of acquiring target features corresponding to each spatial point in an embodiment of the present application.
Fig. 6 schematically shows a training flow diagram of a three-dimensional reconstruction model in an embodiment of the application.
Fig. 7 schematically shows a schematic structural diagram of a three-dimensional reconstruction model to be trained in an embodiment of the present application.
Fig. 8 schematically shows a schematic configuration of an image processing apparatus in an embodiment of the present application.
Fig. 9 schematically shows a block diagram of a computer system suitable for use in implementing embodiments of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. in addition to the listed elements/components/etc.; the terms "first" and "second" and the like are used merely as labels, and are not intended to limit the number of their objects.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
In the related art in this field, when training a three-dimensional reconstruction model, the position information of a two-dimensional image is usually used as input and the color information is used as supervision to recover the three-dimensional information of an object. However, the three-dimensional reconstruction model obtained by such training can only perform new view angle rendering for a single object, which makes it inefficient and impractical. Moreover, because only color information is used as supervision during training, the amount of information available for three-dimensional reconstruction is small, so the trained model cannot achieve accurate three-dimensional reconstruction; its accuracy is low and the rendered view angles are inconsistent. In addition, because several different spatial points may correspond to the same color feature, three-dimensional ambiguity can arise; during training it is therefore difficult for the model to capture accurate three-dimensional information, and accurate three-dimensional reconstruction and new view angle rendering cannot be achieved with such a model.
Aiming at the technical problems in the related art, the embodiment of the application provides an image processing method for improving the accuracy and viewing angle consistency of three-dimensional reconstruction. Before describing the technical solution in the embodiments of the present application in detail, technical terms that may be related to the embodiments of the present application will be explained and described first.
(1) Three-dimensional reconstruction: establishing, for a three-dimensional object, a mathematical model suitable for computer representation and processing. It is the basis for processing, operating on and analyzing three-dimensional objects in a computer environment, and a key technology for building, in a computer, virtual reality that expresses the objective world.
(2) View angle: when photographing an object, the angle formed at the optical center of the camera by the rays drawn from the two ends of the object (top and bottom, or left and right).
(3) Camera pose: the position and orientation of a camera, typically characterized by a rotation matrix, a translation matrix and the camera intrinsic parameters.
(4) World coordinate system: a three-dimensional rectangular coordinate system, also called the measurement coordinate system, with respect to which the spatial positions of the camera and of the object to be measured can be described; its position can be chosen freely according to the actual situation.
(5) Camera coordinate system: also a three-dimensional rectangular coordinate system; its origin is at the optical center of the lens, its x-axis and y-axis are respectively parallel to the two sides of the image plane, and its z-axis is the optical axis of the lens, perpendicular to the image plane.
(6) Image coordinate system: a two-dimensional coordinate system in the plane parallel to the imaging plane, with its origin at the center of the image.
(7) Pixel coordinate system: a two-dimensional coordinate system measured in pixels, with its origin at the upper-left corner of the image.
After describing the technical terms possibly related to the embodiments of the present application, the image processing method in the present application is described in detail.
Fig. 1 schematically illustrates a block diagram of a system architecture to which the technical solution of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include a terminal device 101, a server 102 and a network 103. The terminal device 101 may include various electronic devices with display screens, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart television and an intelligent vehicle-mounted terminal. The terminal device 101 may also be an electronic device with a display screen and a capturing unit, so that it can not only capture an object in a three-dimensional scene to generate a two-dimensional image, but also display, through the display screen, both the captured two-dimensional image and the two-dimensional image with a preset view angle generated by the three-dimensional reconstruction model. The server 102 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The network 103 may be a communication medium of various connection types capable of providing a communication link between the terminal device 101 and the server 102, for example a wired or wireless communication link.
The system architecture in embodiments of the present application may have any number of terminal devices, networks, and servers, as desired for implementation. For example, the server may be a server group composed of a plurality of server devices.
In one embodiment of the present application, a user selects an image to be processed on the terminal device 101 and sets the preset camera pose and view angle information corresponding to the image to be processed, that is, the camera pose and the new view angle to be used in three-dimensional reconstruction. The image to be processed, the preset camera pose and the view angle information are sent to the server 102 through the network 103. After receiving this information, the server 102 invokes the three-dimensional reconstruction model built into it, determines, through the three-dimensional reconstruction model, color information and depth information corresponding to the image to be processed according to the preset camera pose and the view angle information, and renders, according to the color information and the depth information, the two-dimensional image corresponding to the image to be processed and having the view angle information.
The technical scheme provided by the embodiment of the application can also be applied to the terminal equipment 101, a three-dimensional reconstruction model is built in the terminal equipment 101 and trained, after the image to be processed, the preset camera pose and the visual angle information are acquired, the three-dimensional reconstruction model is called to determine color information and depth information corresponding to the image to be processed according to the preset camera pose and the visual angle information, and a two-dimensional image which corresponds to the image to be processed and has the visual angle information is rendered and generated according to the color information and the depth information.
The image processing method is realized based on a three-dimensional reconstruction model, which is a machine learning model and relates to artificial intelligence.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision (CV) is a science that studies how to make a machine "see"; more specifically, it replaces the human eye with a camera and a computer to recognize and measure targets, and performs further graphic processing so that the computer produces images better suited to human observation or to transmission to instruments for detection. As a scientific discipline, research on computer vision theory and technology attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image information labeling, OCR, video processing, video semantic understanding, video content/behavior recognition, abnormal sound detection, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, among others.
Machine learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like.
The image processing method provided by the embodiment of the application can be executed by a cloud server, and relates to cloud computing and cloud storage.
Cloud computing refers to the delivery and usage mode of an IT infrastructure, meaning that required resources are obtained in an on-demand, easily scalable manner through a network; cloud computing in the broad sense refers to the delivery and usage mode of services, meaning that the required services are obtained in an on-demand, easily scalable manner over a network. Such services may be IT, software or internet related, or other services. Cloud computing is a product of the fusion of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization and load balancing.
With the development of the internet, real-time data streams, the diversification of connected devices, and the growing demand for search services, social networks, mobile commerce and open collaboration, cloud computing has developed rapidly. Unlike earlier parallel distributed computing, the emergence of cloud computing will, conceptually, drive a revolutionary change in the whole internet model and in enterprise management models.
Cloud storage is a concept extended and developed from cloud computing. A distributed cloud storage system (hereinafter referred to simply as a storage system) is a storage system that, through functions such as cluster applications, grid technology and distributed storage file systems, integrates a large number of storage devices of various types in a network (also referred to as storage nodes) so that they work cooperatively through application software or application interfaces, jointly providing data storage and service access functions to the outside.
At present, the storage method of such a storage system is as follows: when logical volumes are created, each logical volume is allocated physical storage space, which may be composed of the disks of one or of several storage devices. A client stores data on a certain logical volume, that is, the data is stored on a file system; the file system divides the data into a plurality of parts, each part being an object, and an object contains not only the data but also additional information such as a data identifier (ID). The file system writes each object into the physical storage space of the logical volume and records the storage location information of each object, so that when the client requests access to the data, the file system allows the client to access the data according to the storage location information of each object.
The process by which the storage system allocates physical storage space for a logical volume is as follows: physical storage space is divided in advance into stripes according to the estimated capacity of the objects to be stored on the logical volume (an estimate that often has a large margin with respect to the capacity of the objects actually stored) and the RAID (Redundant Array of Independent Disks) configuration; a logical volume can be understood as a stripe, and physical storage space is thereby allocated to the logical volume.
The image processing method provided by the application is described in detail below with reference to the specific embodiments.
Fig. 2 shows a flowchart of an image processing method, as shown in fig. 2, including:
step S210: acquiring an image to be processed and preset camera pose and view angle information corresponding to the image to be processed;
step S220: inputting the image to be processed, the preset camera pose and the visual angle information into a three-dimensional reconstruction model, determining color information and depth information corresponding to the image to be processed according to the preset camera pose and the visual angle information through the three-dimensional reconstruction model, and rendering and generating a two-dimensional image corresponding to the image to be processed and having the visual angle information according to the color information and the depth information.
The image processing method first acquires an image to be processed, together with a preset camera pose and view angle information; the image to be processed, the preset camera pose and the view angle information are then input into a three-dimensional reconstruction model, which determines color information and depth information corresponding to the image to be processed according to the preset camera pose and the view angle information, and renders, according to the color information and the depth information, a two-dimensional image corresponding to the image to be processed and having the view angle information. When performing three-dimensional reconstruction, this method reconstructs from both the RGB features and the depth features of the image to be processed; compared with reconstructing from the RGB features alone, this improves the accuracy of the three-dimensional reconstruction and maintains view angle consistency.
The respective steps of the image processing method shown in fig. 2 are described in detail below.
In step S210, an image to be processed, and preset camera pose and view angle information corresponding to the image to be processed are acquired.
In an exemplary embodiment of the present application, a user may capture an object in a three-dimensional scene with a terminal device that has a capturing unit to obtain a two-dimensional image, or may obtain a two-dimensional image in other ways, such as downloading a picture from a network. A preset camera pose that can be used to form the desired target image may be set for the two-dimensional image. At the same time, the view angle information may be determined according to the desired target image, or a view angle set may be generated according to the desired rendering effect; the view angle set comprises a plurality of pieces of view angle information, which may be formed by deflecting the view angle corresponding to the two-dimensional image within a 360-degree range, adjusting the depth of field, and applying jitter.
In an exemplary embodiment of the present application, the acquired two-dimensional image may be used as an image to be processed, and then the image to be processed, the preset camera pose and the view angle information are input into a three-dimensional reconstruction model to perform three-dimensional reconstruction, so as to acquire a two-dimensional image corresponding to the image to be processed and having the view angle information, where the two-dimensional image is an image obtained after the three-dimensional reconstruction and the new view angle rendering of the image to be processed.
In step S220, the image to be processed, the preset camera pose and the view angle information are input into a three-dimensional reconstruction model, color information and depth information corresponding to the image to be processed are determined according to the preset camera pose and the view angle information through the three-dimensional reconstruction model, and a two-dimensional image corresponding to the image to be processed and having the view angle information is rendered according to the color information and the depth information.
In an exemplary embodiment of the present application, after determining an image to be processed, a preset camera pose and view angle information, the information may be input into a three-dimensional reconstruction model to perform three-dimensional reconstruction, so as to obtain a two-dimensional image corresponding to the image to be processed and having the view angle information.
Next, a method for three-dimensional reconstruction of the three-dimensional reconstruction model according to the image to be processed, the preset camera pose and the view angle information will be described in detail.
In one embodiment of the present application, when three-dimensional reconstruction is performed, depth information and color information corresponding to each pixel point in an image to be processed are obtained, and rendering is performed according to the depth information and the color information corresponding to each pixel point to generate a new view angle two-dimensional image corresponding to the image to be processed.
Before explaining a method of acquiring depth information and color information, the structure of the three-dimensional reconstruction model in the present application will be explained first.
Fig. 3 schematically shows the structure of the three-dimensional reconstruction model. As shown in fig. 3, the model includes a structure-from-motion (SFM) module 301, a first image feature extraction module 302, a second image feature extraction module 303, a multi-modal attention module 304, a Transformer network module 305, a spatial point sampling module 306 and a neural rendering network module 307. The SFM module 301 is used for camera pose estimation and sparse depth estimation; the first image feature extraction module 302 is used to extract image features from the image to be processed to obtain RGB features; the second image feature extraction module 303 is used to extract depth features from the sparse depth map containing HSV values; the multi-modal attention module 304 is used to fuse the RGB features and the depth features to obtain multi-modal features; the Transformer network module 305 is used to encode the multi-modal features together with the view angle feature of the image to be processed, and to decode the encoded fusion feature together with the view angle information specified by the user, so as to obtain a target fusion feature corresponding to the view angle required by the user; the spatial point sampling module 306 generates three-dimensional virtual rays based on the preset camera pose and the image to be processed, samples spatial points on the three-dimensional virtual rays, and can further encode the three-dimensional coordinates corresponding to the spatial points to generate coordinate codes and convert the three-dimensional coordinates into two-dimensional coordinates. The neural rendering network module 307 comprises a neural radiance field network module 307-1 and a three-dimensional rendering module 307-2: the neural radiance field network module 307-1 performs feature extraction on the coordinate codes and the target features corresponding to the spatial points to obtain three-dimensional image information, and the three-dimensional rendering module 307-2 performs volume rendering on the three-dimensional image information of the spatial points located on the same three-dimensional virtual ray to obtain the depth value and RGB value corresponding to each pixel point in the image to be processed; a two-dimensional image corresponding to the image to be processed and having the preset view angle can then be rendered from the depth values and RGB values of all the pixel points.
In one embodiment of the present application, the SFM module 301 is an algorithm module, and the first image feature extraction module 302, the second image feature extraction module 303, and the spatial point sampling module 306 are all pre-trained network modules, so that the parameter optimization amount of the three-dimensional reconstruction model in the training process can be reduced, and the training efficiency and quality can be improved.
Based on the structure of the three-dimensional reconstruction model shown in fig. 3, fig. 4 schematically shows a flowchart of acquiring color information and depth information corresponding to pixel points in an image to be processed, and as shown in fig. 4, the flowchart at least includes steps S401 to S404, specifically:
in step S401, a three-dimensional virtual ray is constructed according to the preset camera pose and the pixel points in the image to be processed, and sampling is performed on the three-dimensional virtual ray to generate a plurality of spatial points.
In one embodiment of the application, the spatial point sampling module in the three-dimensional reconstruction model can construct three-dimensional virtual rays based on the preset camera pose and the pixel points in the image to be processed, and sample spatial points on these rays. Specifically, for each pixel point in the image to be processed, a virtual ray is generated originating at the camera optical center under the preset camera pose; this virtual ray is a three-dimensional virtual ray, which is then uniformly sampled according to a preset number of spatial points, so that a plurality of spatial points are obtained. The preset number of spatial points may be any value between 60 and 128 and can be set according to actual needs. It should be noted that the camera in the present application is preferably a monocular camera, so that the camera optical center can be uniquely determined.
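By way of illustration only, the following Python sketch shows one way such per-pixel rays could be constructed and uniformly sampled under a pinhole camera model; the intrinsic matrix K, the camera-to-world pose (R, t) and the near/far sampling bounds are assumptions introduced for the example and are not taken from the application.

```python
import numpy as np

def sample_points_on_rays(H, W, K, R, t, n_samples=64, near=0.1, far=6.0):
    """For every pixel, build the virtual ray from the camera optical center through
    that pixel and uniformly sample n_samples spatial points along it (60-128 typical)."""
    # Pixel grid in homogeneous pixel coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project pixels to ray directions in the camera frame, then rotate to the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T                 # (H*W, 3)
    dirs_world = dirs_cam @ R.T                         # camera-to-world rotation
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    origin = t.reshape(1, 1, 3)                         # camera optical center in world coordinates

    # Uniform sample depths between the near and far bounds, shared by all rays.
    z = np.linspace(near, far, n_samples).reshape(1, n_samples, 1)
    points = origin + dirs_world[:, None, :] * z        # (H*W, n_samples, 3)
    return points, dirs_world

# Toy example: a 4x4 image with an identity pose.
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
pts, dirs = sample_points_on_rays(4, 4, K, np.eye(3), np.zeros(3), n_samples=60)
print(pts.shape)  # (16, 60, 3)
```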
In step S402, RGB features and depth features of the image to be processed are obtained, and the RGB features and the depth features are fused to obtain multi-modal features.
In one embodiment of the application, sparse depth estimation can be performed on an image to be processed through an SFM module in a three-dimensional reconstruction model to obtain depth features, image feature extraction is performed on the image to be processed through a first image feature extraction module to obtain RGB features, and then feature fusion is performed on the depth features and the RGB features through a multi-modal attention module to obtain multi-modal features.
When the SFM module is used to perform sparse depth estimation on the image to be processed in order to obtain depth features, the SFM module first performs sparse depth estimation to obtain a sparse depth map; the single-channel depth values in the sparse depth map are then converted into HSV values, and the converted sparse depth map is input into the second image feature extraction module for feature extraction to obtain the depth features.
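The application does not spell out how the single-channel depth values are mapped to HSV, so the following sketch should be read as one plausible assumption: the normalized depth drives the hue, and the resulting color is stored in three channels so that a standard image feature extractor can consume it.

```python
import numpy as np
import colorsys

def sparse_depth_to_hsv(depth, missing=0.0):
    """Convert a single-channel sparse depth map into a three-channel image in which
    the hue encodes normalized depth; pixels without a depth estimate stay black."""
    valid = depth != missing
    d_min, d_max = depth[valid].min(), depth[valid].max()
    norm = np.zeros_like(depth, dtype=np.float64)
    norm[valid] = (depth[valid] - d_min) / max(d_max - d_min, 1e-8)

    out = np.zeros(depth.shape + (3,), dtype=np.float64)
    for i, j in zip(*np.nonzero(valid)):
        # HSV color with hue proportional to depth, full saturation and value,
        # written out as three channels for the second image feature extraction module.
        out[i, j] = colorsys.hsv_to_rgb(0.66 * norm[i, j], 1.0, 1.0)
    return out

depth = np.zeros((4, 4)); depth[1, 2] = 2.5; depth[3, 0] = 5.0
print(sparse_depth_to_hsv(depth)[1, 2])
```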
After the depth features and the RGB features corresponding to the image to be processed are acquired, they can be input into the multi-modal attention module for feature fusion. The depth features carry spatial information while the RGB features carry the semantic and texture information of the image, so the two are complementary; by fusing them with an attention mechanism, the spatial, semantic and texture information in the image to be processed can be fully captured, which in turn ensures the view angle consistency and accuracy of the new view angle two-dimensional image generated by rendering.
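As a sketch of how such a multi-modal attention module might be realized (the use of multi-head cross-attention with the RGB features as queries, the feature dimension and the residual connection are all assumptions, not details given in the application):

```python
import torch
import torch.nn as nn

class MultiModalAttentionFusion(nn.Module):
    """Fuse RGB features (semantics and texture) with depth features (spatial cues)
    by cross-attention, treating each feature-map location as one token."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_feat, depth_feat):
        # rgb_feat, depth_feat: (B, C, h, w) -> token sequences of shape (B, h*w, C)
        B, C, h, w = rgb_feat.shape
        rgb_tok = rgb_feat.flatten(2).transpose(1, 2)
        dep_tok = depth_feat.flatten(2).transpose(1, 2)
        # RGB queries attend to depth keys/values; the residual keeps the RGB content.
        fused, _ = self.attn(query=rgb_tok, key=dep_tok, value=dep_tok)
        fused = self.norm(rgb_tok + fused)
        return fused.transpose(1, 2).reshape(B, C, h, w)   # multi-modal feature map

fusion = MultiModalAttentionFusion(dim=64, heads=4)
print(fusion(torch.randn(1, 64, 8, 8), torch.randn(1, 64, 8, 8)).shape)  # (1, 64, 8, 8)
```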
In step S403, the multi-modal feature is fused with the view angle feature of the image to be processed, so as to obtain a fused feature, and a target feature corresponding to the spatial point is determined according to the two-dimensional coordinates corresponding to the spatial point, the fused feature and the view angle information.
In an exemplary embodiment of the present application, after the multi-modal feature corresponding to the image to be processed is acquired, it may be fused with the view angle feature of the image to be processed to obtain the fusion feature. The view angle feature of the image to be processed is generated from the view angle value of the image to be processed; when converting the view angle value into the view angle feature, positional encoding (Position Encoding) may be used for the vector conversion.
When fusing the multi-modal feature with the view angle feature, they can be input into the Transformer network module, which comprises an encoder and a decoder; the encoder performs fusion learning on the multi-modal feature and the view angle feature, so that the fusion feature is obtained.
Further, the decoder can decode the fusion feature together with the view angle feature corresponding to the view angle information to be rendered, so as to obtain the feature corresponding to that view angle information. The target features corresponding to different spatial points may then be determined based on the feature corresponding to the view angle information.
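A minimal sketch of this encoder/decoder stage using a standard Transformer is given below; appending the view angle feature as an extra token on the encoder side and adding it to the decoder queries are assumptions made for the example, as are the layer sizes.

```python
import torch
import torch.nn as nn

class ViewConditionedTransformer(nn.Module):
    """Encoder fuses the multi-modal tokens with the source view angle feature;
    the decoder produces a target fusion feature conditioned on the view angle to render."""
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=dim, nhead=heads,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True)

    def forward(self, multimodal_tokens, src_view_feat, tgt_view_feat):
        # Append the source view angle feature as one extra token before encoding.
        src = torch.cat([multimodal_tokens, src_view_feat.unsqueeze(1)], dim=1)
        # Decoder queries: the multi-modal positions shifted by the target view angle feature.
        tgt = multimodal_tokens + tgt_view_feat.unsqueeze(1)
        return self.transformer(src, tgt)        # target fusion feature, (B, h*w, dim)

model = ViewConditionedTransformer(dim=64, heads=4, layers=1)
out = model(torch.randn(2, 16, 64), torch.randn(2, 64), torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```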
Fig. 5 schematically illustrates a flowchart of acquiring the target feature corresponding to each spatial point. As shown in fig. 5, in step S501, the view angle information is encoded to obtain a view angle feature, and the fusion feature and the view angle feature are decoded by the decoder in the Transformer network module to generate a target fusion feature corresponding to the view angle information; in step S502, bilinear interpolation is performed on the target fusion feature to obtain a feature map with the same size as the image to be processed; in step S503, the three-dimensional coordinates of each spatial point are converted into two-dimensional coordinates, and the target feature corresponding to the spatial point is determined in the feature map according to the two-dimensional coordinates.
The reason for the bilinear interpolation in step S502 is that the fusion feature is obtained through feature extraction and feature fusion, so its size is smaller than that of the image to be processed, while the spatial points are generated based on the image to be processed; the target features of all spatial points therefore cannot be obtained directly from the fusion feature, and it must first be interpolated and expanded to the size of the image to be processed before being matched against the two-dimensional coordinates of the spatial points to obtain their target features. In step S503, when converting the three-dimensional coordinates of a spatial point into two-dimensional coordinates, the three-dimensional coordinates may first be converted from the world coordinate system to the camera coordinate system according to the preset camera pose and the camera intrinsic parameters to obtain first coordinates; the first coordinates are then converted from the camera coordinate system to the image coordinate system to obtain second coordinates; finally, the second coordinates are converted from the image coordinate system to the pixel coordinate system to obtain the two-dimensional coordinates corresponding to the spatial point.
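The coordinate conversion chain described above (world coordinate system to camera coordinate system, then to the image coordinate system and finally to the pixel coordinate system) can be written compactly as below; folding the last two steps into a single intrinsic matrix K and using a nearest-neighbour feature lookup are simplifying assumptions for the sketch.

```python
import numpy as np

def world_to_pixel(points_world, R, t, K):
    """Project 3D spatial points from the world coordinate system to 2D pixel coordinates.
    points_world: (N, 3); R, t: world-to-camera rotation and translation; K: 3x3 intrinsics."""
    p_cam = points_world @ R.T + t          # world -> camera coordinate system ("first coordinates")
    p_img = p_cam / p_cam[:, 2:3]           # camera -> image coordinate system (perspective division)
    p_pix = p_img @ K.T                     # image -> pixel coordinate system
    return p_pix[:, :2]                     # (u, v) two-dimensional coordinates

def lookup_target_features(feat, uv):
    """Pick per-point target features from an (H, W, C) feature map that has already
    been interpolated to the size of the image to be processed."""
    H, W, _ = feat.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    return feat[v, u]                       # (N, C)

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
uv = world_to_pixel(np.array([[0.1, -0.2, 2.0]]), np.eye(3), np.zeros(3), K)
print(uv)  # approximately [[37. 22.]]
```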
In step S404, the three-dimensional coordinates of the spatial points are encoded to generate spatial point coordinate codes, and color information and depth information corresponding to each pixel point in the image to be processed are obtained according to the target feature and the spatial point coordinate codes.
In an exemplary embodiment of the application, after the target features corresponding to the spatial points are acquired, the three-dimensional coordinates of the spatial points can be encoded to generate spatial point coordinate codes; the neural rendering module in the three-dimensional reconstruction model then performs feature extraction and volume rendering on the target features and the coordinate codes corresponding to all spatial points to obtain the color information and depth information corresponding to each pixel point in the image to be processed. When encoding the three-dimensional coordinates to generate the spatial point coordinate codes, positional encoding can be used for the vector conversion.
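The coordinate encoding can be illustrated with the sinusoidal positional encoding commonly used for neural radiance fields, which is a reasonable reading of the "positional encoding" mentioned above; the number of frequency bands is an assumed hyperparameter.

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """Map each coordinate to [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..num_freqs-1.
    x: (N, D) three-dimensional points (D=3) or view angle values (D=1)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # (num_freqs,)
    scaled = x[..., None] * freqs                        # (N, D, num_freqs)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(x.shape[0], -1)                   # (N, D * 2 * num_freqs)

pts = np.array([[0.1, -0.3, 2.0]])
print(positional_encoding(pts, num_freqs=4).shape)  # (1, 24)
```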
In an exemplary embodiment of the present application, the neural rendering module in the three-dimensional reconstruction model performs feature extraction and volume rendering on the target features and the spatial point coordinate codes corresponding to all spatial points to obtain the color information and depth information corresponding to each pixel point in the image to be processed. This may be implemented as follows: first, the target feature and the coordinate code corresponding to a target spatial point are input into the neural radiance field network module for feature extraction, so as to obtain the three-dimensional image information corresponding to that spatial point; then, the three-dimensional image information of all spatial points on the same three-dimensional virtual ray is input into the three-dimensional volume rendering module for volume rendering, so as to obtain the depth information and RGB information of the pixel point in the image to be processed that corresponds to that ray. That is, to obtain the depth information and RGB information of each pixel point, the three-dimensional volume rendering module performs a volume rendering operation along the spatial points on each three-dimensional virtual ray: if each ray has n spatial points, the neural rendering network processes n points, and compositing the values of these n points yields the depth information and RGB information of one pixel of the new view angle two-dimensional image. When the size of the image to be processed is H*W, the neural rendering network must be run H*W times to obtain the depth information and RGB information of all the pixel points, after which the new view angle two-dimensional image can be rendered.
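The per-ray volume rendering step, which composites the n sampled spatial points of one three-dimensional virtual ray into a single RGB value and a single depth value, is sketched below with the standard alpha-compositing formulation; treating the radiance-field outputs (densities and colors) as given inputs is an assumption made to keep the example self-contained.

```python
import numpy as np

def volume_render_ray(densities, colors, z_vals):
    """Composite n samples along one ray into the RGB and depth of the corresponding pixel.
    densities: (n,) non-negative volume densities; colors: (n, 3); z_vals: (n,) sample depths."""
    deltas = np.append(np.diff(z_vals), 1e10)            # spacing between consecutive samples
    alpha = 1.0 - np.exp(-densities * deltas)            # opacity contributed by each interval
    # Transmittance: probability that the ray reaches sample i without earlier absorption.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1] + 1e-10))
    weights = alpha * trans                              # (n,)
    rgb = (weights[:, None] * colors).sum(axis=0)        # pixel color
    depth = (weights * z_vals).sum()                     # expected ray termination depth
    return rgb, depth

n = 64
rgb, depth = volume_render_ray(np.full(n, 0.5),
                               np.tile([0.2, 0.6, 0.9], (n, 1)),
                               np.linspace(0.1, 6.0, n))
print(rgb, depth)
```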
In an exemplary embodiment of the present application, before three-dimensional reconstruction and new view rendering of an image to be processed using a three-dimensional reconstruction model, the three-dimensional reconstruction model needs to be trained to obtain a three-dimensional reconstruction model with stable performance.
Next, a training method of the three-dimensional reconstruction model will be described.
Fig. 6 schematically illustrates a training process of the three-dimensional reconstruction model, as shown in fig. 6, where the training process at least includes steps S601-S604, specifically:
in step S601, a plurality of two-dimensional image sample sets corresponding to different three-dimensional scenes are obtained, and each two-dimensional image sample set includes a plurality of two-dimensional image samples corresponding to different viewing angles under the same three-dimensional scene.
In an exemplary embodiment of the application, a plurality of two-dimensional image sample sets corresponding to different three-dimensional scenes can be acquired, each set comprising two-dimensional image samples of the same three-dimensional scene captured from different view angles. Training the three-dimensional reconstruction model with sample sets from different three-dimensional scenes improves the model's ability to recognize and reconstruct different scenes, and avoids the problem in the related art that a three-dimensional reconstruction model can only reconstruct a single scene.
In step S602, an input image and a target image, which is different from the input image, are determined from a set of two-dimensional image samples corresponding to a target scene.
In an exemplary embodiment of the present application, any one of the plurality of different three-dimensional scenes may be taken as the target scene, and an input image and a target image different from the input image may be determined from the two-dimensional image sample set corresponding to the target scene. That is, the goal of training is to perform three-dimensional reconstruction and new view angle rendering on the input image and generate the target image, which corresponds to the same three-dimensional scene as the input image but has a different view angle.
In step S603, the input image and the target image are input into a three-dimensional reconstruction model to be trained for feature extraction, so as to obtain a predicted image corresponding to the input image.
In an exemplary embodiment of the present application, after determining the input image and the target image, the input image and the target image may be input into a three-dimensional reconstruction model to be trained to perform feature extraction, so as to obtain a predicted image corresponding to the input image, where the training target of the three-dimensional reconstruction model is to make the predicted image infinitely close to the target image.
In an exemplary embodiment of the present application, the structure of the three-dimensional reconstruction model to be trained differs slightly from that of the three-dimensional reconstruction model shown in fig. 3. Fig. 7 schematically illustrates the structure of the three-dimensional reconstruction model to be trained. As shown in fig. 7, it includes an SFM module 701, a first image feature extraction module 702, a second image feature extraction module 703, a multi-modal attention module 704 to be trained, a Transformer network module 705 to be trained, a spatial point sampling module 706, a neural rendering network 707 to be trained, and a depth completion module 708. The SFM module 701 and the spatial point sampling module 706 are algorithm modules, and the first image feature extraction module 702, the second image feature extraction module 703 and the depth completion module 708 are pre-trained neural network modules, so they do not require parameter training.
Specifically, the camera pose corresponding to the target image may be estimated by the SFM module 701, a plurality of three-dimensional virtual rays may be constructed based on this camera pose and the pixel points in the target image, and each ray is uniformly sampled to generate a plurality of spatial points. After the input image is fed to the first image feature extraction module 702, that module extracts image features from the input image to obtain its RGB features; at the same time, the input image may be fed to the SFM module 701 for sparse depth estimation to obtain a sparse depth map, which contains depth values for part of the pixels of the input image. The single-channel depth values in the sparse depth map are then converted into three-channel HSV values, and the converted sparse depth map is input to the second image feature extraction module 703 for feature extraction to obtain the depth features of the input image. Further, the depth features and the RGB features may be input to the multi-modal attention module 704 to be trained for feature fusion, yielding a multi-modal feature that fuses the two modalities; this multi-modal feature may then be further fused with the view angle feature of the input image through the encoder in the Transformer network module 705 to be trained, forming a fusion feature that combines the multi-modal feature and the view angle feature.
In an exemplary embodiment of the application, the target feature corresponding to each spatial point can be determined based on the fusion feature. When doing so, all spatial points can be polled: any one spatial point is taken as the target spatial point and its three-dimensional coordinates are converted into two-dimensional coordinates; the fusion feature and the target view angle corresponding to the target image are then input to the decoder in the Transformer network module to be trained for decoding, so as to obtain the target fusion feature corresponding to the target view angle; bilinear interpolation is then performed on the target fusion feature to obtain a fusion feature map with the same size as the input image; finally, the two-dimensional coordinates of the target spatial point are looked up in the fusion feature map to determine the target feature corresponding to the target spatial point. The method for converting the three-dimensional coordinates into two-dimensional coordinates is the same as the coordinate conversion method in step S503 of the above embodiment and is not repeated here.
After the target feature corresponding to the target spatial point is obtained, the three-dimensional coordinates of the target spatial point can be encoded to obtain its coordinate code; the depth information and RGB information corresponding to each pixel point in the input image can then be obtained from the coordinate codes and target features of the target spatial points. Specifically, the target feature and coordinate code of a target spatial point are first input into the neural radiance field network module to be trained for feature extraction, so as to obtain the three-dimensional image information corresponding to that spatial point; the three-dimensional image information of all target spatial points located on the same three-dimensional virtual ray is then input into the three-dimensional rendering module to be trained for volume rendering, so as to obtain the depth information and RGB information corresponding to each pixel point in the input image.
Further, a predicted image corresponding to the input image can be obtained by rendering based on depth information and RGB information corresponding to all pixel points, and the predicted image is a two-dimensional image after new view rendering according to the view angle of the target image.
In an exemplary embodiment of the present application, in addition to extracting depth features for multi-modal information fusion from the sparse depth map containing HSV values, the sparse depth map may be input to the depth completion module 708 for depth completion to obtain a full-pixel depth map; the single-channel depth values in the full-pixel depth map are then converted into HSV values, and the converted full-pixel depth map is input to the second image feature extraction module for feature extraction to obtain the depth features. Compared with extracting depth features from the sparse depth map, extracting them from the full-pixel depth map can improve the accuracy of the depth features and thereby improve the three-dimensional reconstruction effect and view angle consistency, but the amount of data to be processed is larger and the timeliness is worse.
In step S604, a loss function is determined according to the predicted image and the target image, and parameters of the three-dimensional reconstruction model to be trained are optimized according to the loss function until convergence.
In an exemplary embodiment of the present application, the purpose of model training is to bring the predicted image infinitely close to the target image, so that a loss function can be determined from the predicted image and the target image, and parameter optimization is performed on the three-dimensional reconstruction model to be trained according to the loss function until the model converges.
When determining the loss function from the predicted image and the target image, it may be constructed from the depth information and RGB information of the predicted image together with those of the target image. Specifically, the sparse depth map is input to the depth completion module 708 for depth completion to obtain a full-pixel depth map containing depth values for all pixels; a predicted full-pixel depth map is then constructed from the depth information of all pixel points, and a first loss function is constructed from the predicted full-pixel depth map and the full-pixel depth map; a second loss function is then constructed from the RGB information of all pixel points and the RGB information of the target image; finally, the first loss function and the second loss function are weighted and summed to obtain the loss function. When performing this weighted summation, the weight of the first loss function can be set smaller than that of the second loss function; that is, the consistency of the RGB information is considered primarily, and the depth information serves only as an auxiliary criterion.
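A minimal sketch of this weighted loss is shown below, assuming an L1 penalty for the depth term and an L2 penalty for the RGB term; the particular norms and weight values are illustrative assumptions, with the depth weight kept smaller than the RGB weight as described above.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(pred_rgb, target_rgb, pred_depth, completed_depth,
                        w_depth=0.1, w_rgb=1.0):
    """Weighted sum of a depth loss (against the depth-completed full-pixel depth map)
    and an RGB loss (against the target image), with w_depth < w_rgb."""
    loss_depth = F.l1_loss(pred_depth, completed_depth)   # first loss function
    loss_rgb = F.mse_loss(pred_rgb, target_rgb)           # second loss function
    return w_depth * loss_depth + w_rgb * loss_rgb

H, W = 8, 8
loss = reconstruction_loss(torch.rand(H, W, 3), torch.rand(H, W, 3),
                           torch.rand(H, W), torch.rand(H, W))
print(float(loss))
```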
The image processing method can be applied to any scene that requires three-dimensional reconstruction and new-view-angle rendering of a two-dimensional image, for example in the construction, medical, and manufacturing fields. By adopting the image processing method of the present application, a two-dimensional image at a new view angle can be generated from a two-dimensional image at the original view angle while view-angle consistency is ensured.
According to the image processing method of the present application, the trained three-dimensional reconstruction model determines the color information and depth information corresponding to the image to be processed according to the preset camera pose and view angle information, and renders a two-dimensional image corresponding to the image to be processed and having the view angle information according to the color information and the depth information. During the three-dimensional reconstruction, both the RGB features and the depth features of the image to be processed are used, so that, compared with reconstruction based on the RGB features alone, the accuracy of the three-dimensional reconstruction can be improved and view-angle consistency can be maintained.
In addition, in the embodiment of the present application, when the three-dimensional reconstruction model is trained, two-dimensional image sets corresponding to a plurality of different three-dimensional scenes are adopted as training samples, and each two-dimensional image set contains a plurality of two-dimensional images with different view angles, so that the three-dimensional reconstruction model has the capability of distinguishing different scenes. In the training process, depth information is introduced for supervision, which explicitly brings in three-dimensional information; compared with supervision using only RGB two-dimensional information, this makes up for the shortcomings of RGB features, enables the model to capture accurate three-dimensional information, improves the precision of the three-dimensional reconstruction model, improves the efficiency and effect of three-dimensional reconstruction, and reduces the problem of view-angle inconsistency. Furthermore, the fusion feature is introduced to adaptively guide the three-dimensional reconstruction process; since different features correspond to different three-dimensional objects, the three-dimensional reconstruction model is universal and can reconstruct a plurality of different three-dimensional objects.
The present application also provides an image processing apparatus, fig. 8 shows a schematic structural diagram of the image processing apparatus, and as shown in fig. 8, the image processing apparatus 800 may include an acquisition module 801 and a reconstruction module 802, specifically:
an obtaining module 801, configured to obtain an image to be processed, and preset camera pose and view angle information corresponding to the image to be processed;
the reconstruction module 802 is configured to input the image to be processed, the preset camera pose and the view angle information into a three-dimensional reconstruction model, determine color information and depth information corresponding to the image to be processed according to the preset camera pose and the view angle information through the three-dimensional reconstruction model, and render and generate a two-dimensional image corresponding to the image to be processed and having the view angle information according to the color information and the depth information.
In an exemplary embodiment of the present application, the reconstruction module 802 includes: the sampling unit is used for constructing a three-dimensional virtual ray according to the preset camera pose and the pixel points in the image to be processed, and sampling on the three-dimensional virtual ray to generate a plurality of space points; the first fusing unit is used for acquiring RGB features and depth features of the image to be processed, and fusing the RGB features and the depth features to acquire multi-modal features; the second fusing unit is used for fusing the multi-modal feature with the visual angle feature of the image to be processed to obtain a fusion feature, and determining a target feature corresponding to the space point according to the two-dimensional coordinate corresponding to the space point, the fusion feature and the visual angle information; the rendering unit is used for encoding the three-dimensional coordinates of the space points to generate space point coordinate codes, and acquiring color information and depth information corresponding to each pixel point in the image to be processed according to the target features and the space point coordinate codes; and the circulation unit is used for repeating the steps until the color information and the depth information corresponding to all the pixel points in the image to be processed are acquired.
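The ray construction and sampling performed by the sampling unit could look like the following sketch, which back-projects one pixel through the preset camera pose and samples points uniformly between assumed near and far bounds. The uniform sampling scheme, the bounds, and the function name are illustrative assumptions, not the patent's concrete scheme.

```python
import torch

def sample_points_on_ray(pixel_uv, K, c2w, near=0.1, far=10.0, n_samples=64):
    """Build the 3-D virtual ray through one pixel and sample space points along it.

    pixel_uv: (u, v) pixel coordinates as Python floats
    K:        (3, 3) camera intrinsics tensor
    c2w:      (4, 4) camera-to-world pose tensor (the preset camera pose)
    """
    u, v = pixel_uv
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Back-project the pixel into a viewing direction in camera coordinates.
    dir_cam = torch.stack([(u - cx) / fx, (v - cy) / fy, torch.tensor(1.0)])
    dir_world = c2w[:3, :3] @ dir_cam              # rotate into world coordinates
    dir_world = dir_world / dir_world.norm()
    origin = c2w[:3, 3]                            # camera centre = ray origin
    # Uniformly sample depths between assumed near and far bounds.
    t_vals = torch.linspace(near, far, n_samples)
    points = origin[None, :] + t_vals[:, None] * dir_world[None, :]   # (n_samples, 3)
    return points, t_vals
```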
In an exemplary embodiment of the present application, the first fusing unit is configured to: extracting image features of the image to be processed through a first image feature extraction module to obtain the RGB features; performing sparse depth estimation on the image to be processed through a motion recovery structure module to obtain a sparse depth map; and converting the depth value in the sparse depth map into an HSV value, and inputting the converted sparse depth map into a second image feature extraction module for feature extraction so as to acquire the depth feature.
In an exemplary embodiment of the present application, the second fusing unit is configured to: obtaining a view angle value corresponding to the image to be processed, and encoding the view angle value to generate the view angle feature; and inputting the multi-modal features and the view angle features into a Transformer network module, and encoding the multi-modal features and the view angle features through an encoder in the Transformer network module to generate the fusion features.
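The following sketch illustrates one way the encoder-side fusion could be realised: the view angle value is sinusoidally encoded, projected to a token, and concatenated with the multi-modal tokens before a standard Transformer encoder. The encoding scheme, the token layout, and the layer sizes are assumptions for illustration, not the patent's concrete architecture.

```python
import torch
import torch.nn as nn

def encode_view_angle(view_value, n_freqs=4):
    """Sinusoidal encoding of a scalar or vector view angle value (encoding is assumed)."""
    view = torch.as_tensor(view_value, dtype=torch.float32).flatten()
    feats = [view]
    for k in range(n_freqs):
        feats += [torch.sin((2.0 ** k) * view), torch.cos((2.0 ** k) * view)]
    return torch.cat(feats)

class FusionEncoder(nn.Module):
    """Transformer encoder that fuses multi-modal tokens with a view-angle token."""
    def __init__(self, dim=256, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.view_proj = nn.LazyLinear(dim)   # projects the view encoding to token size

    def forward(self, multimodal_tokens, view_code):
        # multimodal_tokens: (B, N, dim) tokens from the fused RGB/depth features
        view_token = self.view_proj(view_code).expand(multimodal_tokens.size(0), 1, -1)
        tokens = torch.cat([view_token, multimodal_tokens], dim=1)
        return self.encoder(tokens)           # fusion features
```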
In an exemplary embodiment of the application, the second fusing unit is further configured to: encoding the view information to obtain view features, and decoding the fusion features and the view features by a decoder in the Transformer network module to generate target fusion features corresponding to the view information; carrying out bilinear interpolation on the target fusion features so as to obtain a feature map with the same size as the image to be processed; and converting the three-dimensional coordinates of the space points into two-dimensional coordinates, and determining target features corresponding to the space points in the feature map according to the two-dimensional coordinates.
In an exemplary embodiment of the present application, when converting the three-dimensional coordinates of the spatial point into two-dimensional coordinates, the second fusing unit is configured to: converting the three-dimensional coordinates from a world coordinate system to a camera coordinate system according to the preset camera pose and camera internal parameters so as to obtain first coordinates; converting the first coordinates from the camera coordinate system to an image coordinate system to obtain second coordinates; and converting the second coordinates from the image coordinate system to a pixel coordinate system to obtain the two-dimensional coordinates.
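For the coordinate conversion described above, a minimal sketch using a pinhole camera model is shown below. The world-to-camera matrix is assumed to be derived from the preset camera pose, and the intrinsics matrix K stands in for the camera internal parameters; the function name is illustrative.

```python
import numpy as np

def world_to_pixel(point_w, w2c, K):
    """Project a 3-D world point to 2-D pixel coordinates.

    point_w: (3,) point in the world coordinate system
    w2c:     (4, 4) world-to-camera extrinsics derived from the preset camera pose
    K:       (3, 3) camera intrinsics (the camera internal parameters)
    """
    # World -> camera coordinate system (the first coordinates).
    p_cam = w2c[:3, :3] @ point_w + w2c[:3, 3]
    # Camera -> image coordinate system: perspective division (the second coordinates).
    x, y = p_cam[0] / p_cam[2], p_cam[1] / p_cam[2]
    # Image -> pixel coordinate system via the intrinsics (the two-dimensional coordinates).
    u = K[0, 0] * x + K[0, 2]
    v = K[1, 1] * y + K[1, 2]
    return np.array([u, v])
```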
In an exemplary embodiment of the present application, the image processing apparatus 800 further includes: the sample acquisition module is used for acquiring a plurality of two-dimensional image sample sets corresponding to different three-dimensional scenes before the image to be processed, the preset camera pose and the visual angle information are input into the three-dimensional reconstruction model, wherein each two-dimensional image sample set comprises a plurality of two-dimensional image samples corresponding to different visual angles under the same three-dimensional scene; a training image determining module for determining an input image and a target image from a set of two-dimensional image samples corresponding to a target scene, the target image being different from the input image; the prediction module is used for inputting the input image and the target image into a three-dimensional reconstruction model to be trained to perform feature extraction so as to obtain a predicted image corresponding to the input image; and the optimizing module is used for determining a loss function according to the predicted image and the target image, and optimizing the parameters of the three-dimensional reconstruction model to be trained according to the loss function until convergence.
In an exemplary embodiment of the present application, the prediction module includes: the sampling unit is used for acquiring a camera pose corresponding to the target image and determining a plurality of space points according to the camera pose and the target image; the feature extraction unit is used for extracting features of the input image to obtain depth features and RGB features corresponding to the input image; the fusion unit is used for fusing the depth features and the RGB features to obtain multi-modal features, and fusing the multi-modal features and the visual angle features of the input image to obtain fusion features; a target feature determining unit, configured to determine target features corresponding to the spatial points according to two-dimensional coordinates corresponding to the spatial points and a target viewing angle corresponding to the target image, and the fusion features; and the rendering unit is used for acquiring depth information and RGB information corresponding to each pixel point in the input image according to the target feature and coordinate code corresponding to each spatial point, and rendering according to the depth information and RGB information corresponding to all the pixel points so as to acquire the predicted image.
In an exemplary embodiment of the application, the three-dimensional reconstruction model to be trained comprises a motion restoration structure module; the sampling unit is configured to: inputting the target image into the motion restoration structure module to acquire a camera pose corresponding to the target image; and generating a three-dimensional virtual ray based on the camera pose and each pixel point in the target image, and sampling on the three-dimensional virtual ray to generate the space point.
In an exemplary embodiment of the present application, the three-dimensional reconstruction model to be trained further includes a pre-trained first image feature extraction module and a pre-trained second image feature extraction module; the feature extraction unit is configured to: inputting the input image to the first image feature extraction module for image feature extraction to acquire the RGB features; inputting the input image to the motion restoration structure module for sparse depth estimation to obtain a sparse depth map; and converting the depth value in the sparse depth map into an HSV value, and inputting the converted sparse depth map into the second image feature extraction module for feature extraction so as to acquire the depth feature.
In an exemplary embodiment of the present application, the three-dimensional reconstruction model to be trained further includes a multi-modal attention module to be trained and a Transformer network module to be trained; the fusion unit is configured to: inputting the depth features and the RGB features into the multi-modal attention module to be trained for feature fusion so as to acquire the multi-modal features; inputting the multi-modal feature and the view angle feature corresponding to the input image to the to-be-trained Transformer network module, and encoding the multi-modal feature and the view angle feature through an encoder in the to-be-trained Transformer network module to generate the fusion feature, wherein the view angle feature is generated by encoding a view angle value corresponding to the input image.
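One possible form of the multi-modal attention fusion is sketched below: RGB feature tokens attend to depth feature tokens through cross-attention with a residual connection. The attention pattern, normalisation, and dimensions are assumptions for illustration and not the patent's concrete module.

```python
import torch.nn as nn

class MultiModalAttentionFusion(nn.Module):
    """Cross-attention fusion of RGB and depth feature maps (architecture assumed)."""
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_feat, depth_feat):
        # rgb_feat, depth_feat: (B, C, H, W) feature maps from the two extractors
        rgb_tokens = rgb_feat.flatten(2).transpose(1, 2)      # (B, H*W, C)
        depth_tokens = depth_feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
        # RGB tokens query the depth tokens; the residual keeps the RGB content.
        fused, _ = self.attn(rgb_tokens, depth_tokens, depth_tokens)
        return self.norm(rgb_tokens + fused)                  # (B, H*W, C) multi-modal tokens
```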
In an exemplary embodiment of the application, the target feature determination unit is configured to: polling all the space points, taking any one space point as a target space point, and converting the three-dimensional coordinates of the target space point into two-dimensional coordinates; inputting the fusion features and the target view angle to a decoder in the to-be-trained Transformer network module for decoding so as to obtain target fusion features corresponding to the target view angle; carrying out bilinear interpolation on the target fusion features so as to obtain a fusion feature map with the same size as the input image; and determining target features corresponding to the target space points in the fusion feature map according to the two-dimensional coordinates corresponding to the target space points.
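The bilinear look-up of a target feature at the projected two-dimensional coordinate could be written as follows. torch.nn.functional.grid_sample is used here merely as one convenient way to realise the bilinear interpolation; the coordinate normalisation convention and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_target_feature(fusion_map, uv, image_size):
    """Read the target feature for one space point from the fusion feature map.

    fusion_map: (1, C, H, W) feature map already interpolated to the input-image size
    uv:         (u, v) pixel coordinates of the projected space point, as floats
    image_size: (W, H) of the input image, used to normalise uv to [-1, 1]
    """
    w, h = image_size
    # grid_sample expects coordinates normalised to [-1, 1] in (x, y) order.
    grid = torch.tensor([[[[2.0 * uv[0] / (w - 1) - 1.0,
                            2.0 * uv[1] / (h - 1) - 1.0]]]], dtype=torch.float32)
    feat = F.grid_sample(fusion_map, grid, mode='bilinear', align_corners=True)
    return feat.view(-1)   # (C,) target feature for this space point
```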
In an exemplary embodiment of the present application, the three-dimensional reconstruction model to be trained further includes a neural rendering module to be trained, the neural rendering module to be trained including a neural radiance field network module to be trained and a three-dimensional volume rendering module to be trained; the rendering unit is configured to: encoding the three-dimensional coordinates of the target space point to obtain a coordinate code corresponding to the target space point; inputting the target features and coordinate codes corresponding to the target space points into the neural radiance field network module to be trained for feature extraction so as to acquire three-dimensional image information corresponding to the target space points; and inputting the three-dimensional image information corresponding to all target space points located on the same three-dimensional virtual ray to the three-dimensional volume rendering module to be trained for volume rendering so as to acquire depth information and RGB information corresponding to each pixel point in the input image.
In an exemplary embodiment of the present application, the three-dimensional reconstruction model to be trained further includes a pre-trained depth completion module; the optimizing module is configured to: inputting the sparse depth map to the depth completion module for depth completion to obtain a full-pixel depth map containing depth values of all pixel points; constructing a predicted full-pixel depth map according to depth information corresponding to all pixel points in the predicted image, and constructing a first loss function according to the predicted full-pixel depth map and the full-pixel depth map; constructing a second loss function according to the RGB information corresponding to all the pixel points and the RGB information of the target image; and carrying out weighted summation on the first loss function and the second loss function to obtain the loss function.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods of the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all of the illustrated steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps, etc.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Fig. 9 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the application, which may be provided in the terminal device 101 or the server 102.
It should be noted that, the computer system 900 of the electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 9, the computer system 900 includes a central processing unit 901 (Central Processing Unit, CPU) which can execute various appropriate actions and processes according to a program stored in a read-only memory 902 (Read-Only Memory, ROM) or a program loaded from a storage portion 908 into a random access memory 903 (Random Access Memory, RAM). In the random access memory 903, various programs and data required for system operation are also stored. The CPU 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output interface 905 (i.e., an I/O interface) is also connected to the bus 904.
In some embodiments, the following components are connected to the input/output interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a cathode ray tube (Cathode Ray Tube, CRT) or a liquid crystal display (Liquid Crystal Display, LCD), a speaker, and the like; a storage portion 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a local area network card, a modem, or the like. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the input/output interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read therefrom is installed into the storage portion 908 as needed.
In particular, the processes described in the various method flowcharts may be implemented as computer software programs according to embodiments of the application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from the network via the communication section 909 and/or installed from the removable medium 911. When the computer program is executed by the central processing unit 901, it performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (Random Access Memory, RAM), a read-only memory (Read-Only Memory, ROM), an erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a mobile hard disk, etc.) or on a network, comprising several instructions for causing an electronic device to perform the method according to the embodiments of the present application.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (17)

1. An image processing method, comprising:
acquiring an image to be processed and preset camera pose and view angle information corresponding to the image to be processed;
inputting the image to be processed, the preset camera pose and the visual angle information into a three-dimensional reconstruction model, determining color information and depth information corresponding to the image to be processed according to the preset camera pose and the visual angle information through the three-dimensional reconstruction model, and rendering and generating a two-dimensional image corresponding to the image to be processed and having the visual angle information according to the color information and the depth information.
2. The method according to claim 1, wherein the determining, by the three-dimensional reconstruction model, color information and depth information corresponding to the image to be processed according to the preset camera pose and the view angle information, includes:
constructing a three-dimensional virtual ray according to the preset camera pose and pixel points in the image to be processed, and sampling on the three-dimensional virtual ray to generate a plurality of space points;
acquiring RGB features and depth features of the image to be processed, and fusing the RGB features and the depth features to acquire multi-modal features;
fusing the multi-modal features with view angle features of the image to be processed to obtain fusion features, and determining target features corresponding to the space points according to two-dimensional coordinates corresponding to the space points, the fusion features and the view angle information;
coding the three-dimensional coordinates of the space points to generate space point coordinate codes, and acquiring color information and depth information corresponding to each pixel point in the image to be processed according to the target characteristics and the space point coordinate codes;
repeating the steps until the color information and the depth information corresponding to all the pixel points in the image to be processed are obtained.
3. The method of claim 2, wherein the acquiring RGB features and depth features of the image to be processed comprises:
extracting image features of the image to be processed through a first image feature extraction module to obtain the RGB features;
performing sparse depth estimation on the image to be processed through a motion recovery structure module to obtain a sparse depth map;
and converting the depth value in the sparse depth map into an HSV value, and inputting the converted sparse depth map into a second image feature extraction module for feature extraction so as to acquire the depth feature.
4. The method of claim 2, wherein the fusing the multi-modal feature with the perspective feature of the image to be processed to obtain a fused feature comprises:
obtaining a view angle value corresponding to the image to be processed, and encoding the view angle value to generate the view angle characteristic;
and inputting the multi-modal features and the view angle features into a Transformer network module, and encoding the multi-modal features and the view angle features through an encoder in the Transformer network module to generate the fusion features.
5. The method of claim 4, wherein the determining the target feature corresponding to the spatial point based on the two-dimensional coordinates corresponding to the spatial point, the fusion feature, and the perspective information comprises:
encoding the view information to obtain view features, and decoding the fusion features and the view features by a decoder in the Transformer network module to generate target fusion features corresponding to the view information;
carrying out bilinear interpolation on the target fusion features so as to obtain a feature map with the same size as the image to be processed;
and converting the three-dimensional coordinates of the space points into two-dimensional coordinates, and determining target features corresponding to the space points in the feature map according to the two-dimensional coordinates.
6. The method of claim 5, wherein converting the three-dimensional coordinates of the spatial point to two-dimensional coordinates comprises:
converting the three-dimensional coordinates from a world coordinate system to a camera coordinate system according to the preset camera pose and camera internal parameters so as to obtain first coordinates;
converting the first coordinates from the camera coordinate system to an image coordinate system to obtain second coordinates;
and converting the second coordinates from the image coordinate system to a pixel coordinate system to obtain the two-dimensional coordinates.
7. The method according to claim 1, characterized in that before inputting the image to be processed, the preset camera pose and the view angle information into a three-dimensional reconstruction model, the method further comprises:
acquiring a plurality of two-dimensional image sample sets corresponding to different three-dimensional scenes, wherein each two-dimensional image sample set comprises a plurality of two-dimensional image samples corresponding to different visual angles under the same three-dimensional scene;
determining an input image and a target image from a set of two-dimensional image samples corresponding to a target scene, the target image being different from the input image;
inputting the input image and the target image into a three-dimensional reconstruction model to be trained for feature extraction so as to obtain a predicted image corresponding to the input image;
and determining a loss function according to the predicted image and the target image, and optimizing parameters of the three-dimensional reconstruction model to be trained according to the loss function until convergence.
8. The method of claim 7, wherein the inputting the input image and the target image into a three-dimensional reconstruction model to be trained for feature extraction to obtain a predicted image corresponding to the input image comprises:
acquiring a camera pose corresponding to the target image, and determining a plurality of space points according to the camera pose and the target image;
extracting features of the input image to obtain depth features and RGB features corresponding to the input image;
fusing the depth features and the RGB features to obtain multi-modal features, and fusing the multi-modal features and visual angle features of the input image to obtain fusion features;
determining target features corresponding to the space points according to the two-dimensional coordinates corresponding to the space points, the target viewing angles corresponding to the target images and the fusion features;
and obtaining depth information and RGB information corresponding to each pixel point in the input image according to the target feature and coordinate code corresponding to each spatial point, and rendering according to the depth information and RGB information corresponding to all the pixel points to obtain the predicted image.
9. The method of claim 8, wherein the three-dimensional reconstruction model to be trained comprises a motion restoration structure module;
the obtaining the camera pose corresponding to the target image, and determining a plurality of spatial points according to the camera pose and the target image, includes:
inputting the target image into the motion restoration structure module to acquire a camera pose corresponding to the target image;
and generating a three-dimensional virtual ray based on the camera pose and each pixel point in the target image, and sampling on the three-dimensional virtual ray to generate the space point.
10. The method of claim 9, wherein the three-dimensional reconstruction model to be trained further comprises a pre-trained first image feature extraction module and a pre-trained second image feature extraction module;
the feature extraction of the input image to obtain depth features and RGB features corresponding to the input image includes:
inputting the input image to the first image feature extraction module for image feature extraction to acquire the RGB features;
inputting the input image to the motion restoration structure module for sparse depth estimation to obtain a sparse depth map;
and converting the depth value in the sparse depth map into an HSV value, and inputting the converted sparse depth map into the second image feature extraction module for feature extraction so as to acquire the depth feature.
11. The method of claim 10, wherein the three-dimensional reconstruction model to be trained further comprises a multi-modal attention module to be trained and a Transformer network module to be trained;
the fusing the depth feature and the RGB feature to obtain a multi-modal feature, and fusing the multi-modal feature and a view angle feature corresponding to the input image to obtain a fused feature, including:
inputting the depth features and the RGB features into the multi-modal attention module to be trained for feature fusion so as to acquire the multi-modal features;
inputting the multi-modal feature and the view angle feature corresponding to the input image to the to-be-trained Transformer network module, and encoding the multi-modal feature and the view angle feature through an encoder in the to-be-trained Transformer network module to generate the fusion feature, wherein the view angle feature is generated by encoding a view angle value corresponding to the input image.
12. The method of claim 11, wherein the determining the target feature corresponding to each spatial point from the two-dimensional coordinates corresponding to each spatial point and the target perspective corresponding to the target image and the fusion feature comprises:
polling all the space points, taking any one space point as a target space point, and converting the three-dimensional coordinates of the target space point into two-dimensional coordinates;
inputting the fusion features and the target view angle to a decoder in the to-be-trained Transformer network module for decoding so as to obtain target fusion features corresponding to the target view angle;
carrying out bilinear interpolation on the target fusion features so as to obtain a fusion feature map with the same size as the input image;
and determining target features corresponding to the target space points in the fusion feature map according to the two-dimensional coordinates corresponding to the target space points.
13. The method of claim 12, wherein the three-dimensional reconstruction model to be trained further comprises a neural rendering module to be trained, the neural rendering module to be trained comprising a neural radiance field network module to be trained and a three-dimensional volume rendering module to be trained;
the obtaining depth information and RGB information corresponding to each pixel point in the input image according to the target feature and the coordinate code corresponding to each spatial point includes:
encoding the three-dimensional coordinates of the target space point to obtain a coordinate code corresponding to the target space point;
inputting the target features and coordinate codes corresponding to the target space points into the neural radiance field network module to be trained to perform feature extraction so as to acquire three-dimensional image information corresponding to the target space points;
and inputting the three-dimensional image information corresponding to all target space points located on the same three-dimensional virtual ray to the three-dimensional volume rendering module to be trained for volume rendering so as to acquire depth information and RGB information corresponding to each pixel point in the input image.
14. The method of claim 10, wherein the three-dimensional reconstruction model to be trained further comprises a pre-trained depth completion module;
the determining a loss function from the predicted image and the target image includes:
inputting the sparse depth map to the depth completion module for depth completion to obtain a full-pixel depth map containing depth values of all pixel points;
constructing a predicted full-pixel depth map according to depth information corresponding to all pixel points in the predicted image, and constructing a first loss function according to the predicted full-pixel depth map and the full-pixel depth map;
constructing a second loss function according to the RGB information corresponding to all the pixel points and the RGB information of the target image;
and carrying out weighted summation on the first loss function and the second loss function to obtain the loss function.
15. An image processing apparatus, comprising:
the acquisition module is used for acquiring an image to be processed and preset camera pose and view angle information corresponding to the image to be processed;
the reconstruction module is used for inputting the image to be processed, the preset camera pose and the visual angle information into a three-dimensional reconstruction model, determining color information and depth information corresponding to the image to be processed according to the preset camera pose and the visual angle information through the three-dimensional reconstruction model, and rendering according to the color information and the depth information to generate a two-dimensional image corresponding to the image to be processed and having the visual angle information.
16. A computer storage medium having stored thereon a computer program, which when executed by a processor implements the image processing method of any of claims 1 to 14.
17. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the image processing method of any one of claims 1 to 14 via execution of the executable instructions.
CN202310596577.2A 2023-05-24 2023-05-24 Image processing method, device, computer readable storage medium and electronic equipment Pending CN116630514A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310596577.2A CN116630514A (en) 2023-05-24 2023-05-24 Image processing method, device, computer readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310596577.2A CN116630514A (en) 2023-05-24 2023-05-24 Image processing method, device, computer readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116630514A true CN116630514A (en) 2023-08-22

Family

ID=87620834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310596577.2A Pending CN116630514A (en) 2023-05-24 2023-05-24 Image processing method, device, computer readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116630514A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115145A (en) * 2023-10-19 2023-11-24 宁德思客琦智能装备有限公司 Detection method and device, electronic equipment and computer readable medium
CN117115145B (en) * 2023-10-19 2024-02-09 宁德思客琦智能装备有限公司 Detection method and device, electronic equipment and computer readable medium
CN117689822A (en) * 2024-01-31 2024-03-12 之江实验室 Three-dimensional model construction method and device, storage medium and electronic equipment
CN117689822B (en) * 2024-01-31 2024-04-16 之江实验室 Three-dimensional model construction method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination