CN116012547A - Image processing method, device, equipment and computer readable storage medium - Google Patents

Info

Publication number: CN116012547A
Application number: CN202111238437.5A
Authority: CN (China)
Prior art keywords: image, viewpoint, dimensional, target object, point
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 董广泽
Current assignee: Tencent Technology Shenzhen Co Ltd
Original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202111238437.5A
Publication of CN116012547A


Landscapes

  • Image Processing (AREA)

Abstract

The embodiment of the application discloses an image processing method, an image processing device, image processing equipment and a computer readable storage medium. The method comprises the following steps: acquiring a multi-view image set of a target object, and predicting a three-dimensional coordinate point of the target object under the view point corresponding to each view point image based on the position information of the target object in each view point image of the multi-view image set; calibrating the three-dimensional coordinate point of the target object under each viewpoint according to the position relation between the viewpoint corresponding to each viewpoint image and the calibration point, so as to obtain a three-dimensional calibration coordinate point of the target object based on the calibration point; and fusing the three-dimensional calibration coordinate points of the target object based on the calibration points to obtain a three-dimensional model of the target object. Therefore, the three-dimensional calibration coordinate points of the target object based on the calibration point can be obtained based on the position information of the target object in the multi-viewpoint images, and the three-dimensional model of the target object is built by fusing the three-dimensional calibration coordinate points based on the calibration point, so that the modeling efficiency is effectively improved.

Description

Image processing method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image processing method, an apparatus, a device, and a computer readable storage medium.
Background
With the continuous development of computer technology, augmented reality (Augmented Reality, AR) technology is widely used in various fields of daily life, for example AR games, AR house viewing, AR home furnishing, and the like. AR technology relies on three-dimensional models; taking AR home furnishing as an example, in order to display target furniture through AR technology, a three-dimensional model of the target furniture needs to be built first. In practice, such a three-dimensional model is built manually by three-dimensional modeling staff according to the relevant data of the actual object (such as its length, width and height), and the modeling efficiency is low.
Disclosure of Invention
The embodiments of the present application provide an image processing method, an image processing device, image processing equipment and a computer readable storage medium, which can effectively improve modeling efficiency.
In one aspect, an embodiment of the present application provides an image processing method, including:
acquiring a multi-view image set of a target object, wherein the multi-view image set comprises a plurality of view images; the multiple viewpoint images are obtained by shooting the target object at multiple viewpoints respectively;
predicting a three-dimensional coordinate point of the target object under the view point corresponding to each view point image based on the position information of the target object in each view point image of the multi-view image set;
calibrating a three-dimensional coordinate point of the target object under each viewpoint according to the position relation between the viewpoint corresponding to each viewpoint image and the calibration point, so as to obtain a three-dimensional calibration coordinate point of the target object based on the calibration point;
and fusing the three-dimensional calibration coordinate points of the target object based on the calibration points to obtain a three-dimensional model of the target object.
In one aspect, an embodiment of the present application provides an image processing apparatus, including:
an acquisition unit, configured to acquire a multi-view image set of a target object, where the multi-view image set includes a plurality of view images; the multiple viewpoint images are obtained by shooting the target object at multiple viewpoints respectively;
the processing unit is used for predicting three-dimensional coordinate points of the target object under the view points corresponding to the view point images based on the position information of the target object in the view point images of the multi-view image set;
the three-dimensional coordinate point calibration method comprises the steps that according to the position relation between the corresponding view point of each view point image and the calibration point, the three-dimensional coordinate point of a target object under each view point is calibrated, and the three-dimensional calibration coordinate point of the target object based on the calibration point is obtained;
And the three-dimensional calibration coordinate points are used for fusing the three-dimensional calibration coordinate points of the target object based on the calibration points, so that a three-dimensional model of the target object is obtained.
In one embodiment, the processing unit is configured to predict, based on the position information of the target object in each viewpoint image, a three-dimensional coordinate point of the target object in a viewpoint corresponding to each viewpoint image, specifically configured to:
and carrying out convolution processing on the position information of the target object in each view point image of the multi-view image set to obtain a three-dimensional coordinate point of the target object under the view point corresponding to each view point image.
In one embodiment, the processing unit is configured to perform convolution processing on position information of the target object in each view point image of the multi-view image set to obtain a three-dimensional coordinate point of the target object under a view point corresponding to each view point image, and specifically is configured to:
encoding each view image to obtain a binary mask of each view image;
invoking a three-dimensional structure generation model to perform generation processing on the binary mask of each viewpoint image, so as to obtain three-dimensional coordinate points of the target object under the viewpoints corresponding to the viewpoint images;
the three-dimensional structure generation model comprises a two-dimensional convolution layer, and the two-dimensional convolution layer is used for carrying out two-dimensional convolution operation on binary masks of each view image.
In one embodiment, the processing unit is further configured to:
acquiring a spatial position of a reference viewpoint, wherein the reference viewpoint refers to any viewpoint except the viewpoint corresponding to each viewpoint image in the multi-viewpoint image set;
based on the space position of the reference viewpoint, projecting the three-dimensional model to obtain a projection image of the three-dimensional model of the target object under the reference viewpoint;
performing pseudo rendering processing on the projection image of the three-dimensional model at the reference viewpoint to obtain a predicted image of the three-dimensional model at the reference viewpoint;
and optimizing parameters in the three-dimensional structure generating model according to the loss value between the predicted image under the reference viewpoint and the marked image under the reference viewpoint to obtain the optimized three-dimensional structure generating model.
In one embodiment, the processing unit is configured to perform pseudo-rendering processing on a projection image of the three-dimensional model at the reference viewpoint to obtain a predicted image of the three-dimensional model at the reference viewpoint, and is specifically configured to:
performing U times up-sampling processing on a projection image of the three-dimensional model under a reference viewpoint to obtain an up-sampled image, wherein U is a positive integer;
updating the depth value of each pixel point contained in the up-sampled image to the inverse of the depth value of the pixel point; and carrying out downsampling processing on the updated up-sampled image, retaining the pixel point with the minimum depth value at each pixel position of the downsampled image, so as to obtain a predicted image of the three-dimensional model under the reference viewpoint.
In one embodiment, the processing unit is configured to optimize the three-dimensional structure generating model according to a loss value between the predicted image at the reference viewpoint and the labeling image at the reference viewpoint, and specifically is configured to:
calculating a depth loss component according to the difference between the depth image corresponding to the predicted image under the reference viewpoint and the depth image corresponding to the labeling image under the reference viewpoint;
calculating a mask loss component based on a difference between a mask image corresponding to the predicted image at the reference viewpoint and a mask image corresponding to the annotation image at the reference viewpoint;
obtaining a loss value between the predicted image under the reference viewpoint and the marked image under the reference viewpoint through the depth loss component and the mask loss component;
and optimizing parameters in the three-dimensional structure generating model according to the loss value so that the three-dimensional structure generating model meets the optimization condition.
In one embodiment, the processing unit is configured to calibrate a three-dimensional coordinate point of the target object under each viewpoint according to a positional relationship between a viewpoint corresponding to each viewpoint image and a calibration point, so as to obtain a three-dimensional calibration coordinate point of the target object based on the calibration point, and specifically is configured to:
determining a calibration matrix corresponding to each viewpoint according to the spatial position of the viewpoint corresponding to each viewpoint image and the spatial position of the calibration point;
and calibrating the three-dimensional coordinate point under each viewpoint through the corresponding calibration matrix of each viewpoint to obtain the three-dimensional calibration coordinate point of the target object based on the calibration point.
In one embodiment, the processing unit is further configured to:
constructing a three-dimensional point cloud map, and acquiring the placing position information of a three-dimensional model of a target object in the three-dimensional point cloud map;
in response to the placement position information being confirmed, adding the three-dimensional model of the target object to the position in the three-dimensional point cloud map indicated by the placement position information.
In one embodiment, the processing unit is configured to construct a three-dimensional point cloud map, specifically configured to:
acquiring position information of a reference object;
establishing a reference coordinate system based on the position information of the reference object, wherein the reference coordinate system is associated with projection information, and the projection information is used for indicating the projection relation between the reference coordinate system and the real world;
and constructing a three-dimensional point cloud map in a reference coordinate system through projection information.
In one embodiment, the processing unit is further configured to:
responding to the viewing operation of the three-dimensional point cloud map, and acquiring the observation position of the three-dimensional point cloud map;
And carrying out plane projection on the three-dimensional point cloud map based on the observation position to obtain an observation effect image of the three-dimensional point cloud map under the observation position.
Accordingly, the present application provides an image processing device, comprising:
a processor for loading and executing the computer program;
a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the above-described image processing method.
Accordingly, the present application provides a computer readable storage medium storing a computer program adapted to be loaded by a processor and to perform the above-described image processing method.
Accordingly, the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the above-described image processing method.
In the embodiment of the application, a multi-view image set of a target object is obtained, and three-dimensional coordinate points of the target object under the view points corresponding to each view point image are predicted based on the position information of the target object in each view point image of the multi-view image set; the three-dimensional coordinate point of the target object under each viewpoint is calibrated according to the position relation between the viewpoint corresponding to each viewpoint image and the calibration point, so as to obtain a three-dimensional calibration coordinate point of the target object based on the calibration point; and the three-dimensional calibration coordinate points of the target object based on the calibration points are fused to obtain a three-dimensional model of the target object. Therefore, the three-dimensional calibration coordinate points of the target object based on the calibration point can be obtained based on the position information of the target object in the multi-viewpoint images, and the three-dimensional model of the target object is built by fusing the three-dimensional calibration coordinate points based on the calibration point, so that the modeling efficiency is effectively improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from these drawings by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a scene architecture diagram of an image processing system according to an embodiment of the present application;
fig. 2 is a schematic flow chart of an image processing method according to an embodiment of the present application;
fig. 3 is a flowchart of another image processing method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of model optimization according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Artificial intelligence (Artificial Intelligence, AI): AI is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. The embodiments of the present application mainly involve, in the image processing process, performing convolution processing on the binary masks of the viewpoint images of a target object through a three-dimensional structure generation model, so as to obtain the three-dimensional coordinate points of the target object at the viewpoint corresponding to each viewpoint image.
AI technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Machine learning is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of AI and the fundamental way of making computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning/deep learning typically includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from teaching. The embodiments of the present application mainly involve training a three-dimensional structure generation model through the difference between the labeled image at a reference viewpoint and the predicted image of the point cloud model at the reference viewpoint, so as to obtain an optimized three-dimensional structure generation model.
Deep learning: the concept of deep learning is derived from the study of artificial neural networks. A multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning forms more abstract high-level representations of attribute categories or features by combining low-level features, so as to discover distributed feature representations of data. The embodiments of the present application mainly involve encoding each viewpoint image of a target object through an encoder to obtain the binary mask of each viewpoint image.
The embodiments of the present application provide an image processing method and an image processing system to improve modeling efficiency. Referring to fig. 1, fig. 1 is a scene architecture diagram of an image processing system according to an embodiment of the present application. As shown in fig. 1, the image processing system may include a terminal device 101 and a server 102. The image processing method provided by the embodiments of the present application may be executed by the server 102. The terminal device 101 may include, but is not limited to, devices having a display function such as smart phones (for example Android phones, iOS phones, and the like), tablet computers, portable personal computers, mobile internet devices (MID for short), and vehicle-mounted terminals, which is not limited in the embodiments of the present application. The server 102 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and basic cloud computing services such as big data and artificial intelligence platforms.
It should be noted that, in fig. 1, the terminal and the server may be directly or indirectly connected through a wired communication or a wireless communication, which is not limited herein. The number of terminal devices and servers is for example only and does not constitute a practical limitation of the present application; for example, the terminal device 103, the server 104, or the like may be also included in the image processing system.
In a specific implementation, the general principle of the image processing method is as follows:
(1) The server 102 acquires a multi-viewpoint image set of the target object transmitted by the terminal device 101, wherein the multi-viewpoint image set contains a plurality of viewpoint images; the multiple viewpoint images are obtained by shooting the target object at multiple viewpoints respectively; for example, it is assumed that a multi-viewpoint image set includes a viewpoint image 1-a viewpoint image 3, the viewpoint image 1 is obtained by photographing a target object at a first viewpoint, and the viewpoint image 2 and the viewpoint image 3 are obtained by photographing the target object at a second viewpoint, the first viewpoint and the second viewpoint being different two viewpoints. The target object may be furniture, decorations, automobiles, workpieces, animals, etc., which the present application is not limited to.
(2) The server 102 predicts three-dimensional coordinate points of the target object at the viewpoints corresponding to the respective viewpoint images based on the positional information of the target object in the respective viewpoint images of the multi-viewpoint image set. The position information of the target object in each view image of the multi-view image set is used for indicating the position of the target object in each view image; for example, the position information may be a coordinate point indicating the position of the target object in each viewpoint image, or may be a pixel point for displaying the target object in each viewpoint image. The server 102 carries out convolution processing on the position information of the target object in each view point image of the multi-view point image set to obtain a three-dimensional coordinate point of the target object under the view point corresponding to each view point image; the convolution processing may specifically be two-dimensional convolution processing or three-dimensional convolution processing. In one embodiment, the server 102 may perform convolution processing on position information of the target object in each view point image of the multi-view point image set through a neural network to obtain a three-dimensional coordinate point of the target object under a view point corresponding to each view point image, where the neural network includes a two-dimensional convolution layer, or a three-dimensional convolution layer; the two-dimensional convolution layer is used for carrying out two-dimensional convolution processing on the position information of the target object in each view point image of the multi-view point image set, and the three-dimensional convolution layer is used for carrying out three-dimensional convolution processing on the position information of the target object in each view point image of the multi-view point image set.
(3) The server 102 calibrates the three-dimensional coordinate point of the target object under each viewpoint according to the position relation between the viewpoint corresponding to each viewpoint image and the calibration point, and obtains the three-dimensional calibration coordinate point of the target object based on the calibration point. The calibration point may be any point of the viewpoints corresponding to the respective viewpoint images, or may be a point other than the viewpoint corresponding to the respective viewpoint images; the calibration points may be preset or specified by the user. The position of the viewpoint corresponding to each viewpoint image means: the image acquisition equipment shoots the spatial position of the target object; for example, assuming that the viewpoint image 1 is obtained by capturing the target object at the spatial position a by the image capturing apparatus, the spatial position of the viewpoint corresponding to the viewpoint image 1 is the spatial position a. The three-dimensional calibration coordinate point of the target object at each viewpoint means: absolute positioning coordinate points of the three-dimensional coordinate points of the target object under each viewpoint relative to the calibration points; for example, the three-dimensional coordinate point of the point a in the target object at the viewpoint 1 is (a, b, c), and after the three-dimensional coordinate point of the point a at the viewpoint 1 is calibrated based on the positional relationship between the calibration point and the viewpoint 1, the three-dimensional calibration coordinate point (e, f, g) of the point a in the target object based on the calibration point is obtained.
(4) The server 102 fuses the three-dimensional calibration coordinate points of the target object based on the calibration points to obtain a three-dimensional model of the target object. Fusion refers to combining three-dimensional calibration coordinate points of a target object based on calibration points to obtain a three-dimensional model of the target object. After obtaining the three-dimensional model of the target object, the server 102 returns the three-dimensional model of the target object to the terminal device 101.
In practical applications, the computer program corresponding to the image processing method provided in the present application may be installed in the terminal device 101 or the server 102. It is understood that when the computer program corresponding to the image processing method provided in the present application is loaded in the terminal apparatus 101, the above-described image processing method may be executed by the terminal apparatus 101 alone.
In the embodiment of the application, a multi-view image set of a target object is obtained, and three-dimensional coordinate points of the target object under the view points corresponding to each view point image are predicted based on the position information of the target object in each view point image of the multi-view image set; the three-dimensional coordinate point of the target object under each viewpoint is calibrated according to the position relation between the viewpoint corresponding to each viewpoint image and the calibration point, so as to obtain a three-dimensional calibration coordinate point of the target object based on the calibration point; and the three-dimensional calibration coordinate points of the target object based on the calibration points are fused to obtain a three-dimensional model of the target object. Therefore, the three-dimensional calibration coordinate points of the target object based on the calibration point can be obtained based on the position information of the target object in the multi-viewpoint images, and the three-dimensional model of the target object is built by fusing the three-dimensional calibration coordinate points based on the calibration point, so that the modeling efficiency is effectively improved.
The image processing method provided in the embodiment of the present application is described in detail below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a flowchart of an image processing method according to an embodiment of the present application. The image processing method may be performed by an image processing apparatus, which may be specifically the terminal apparatus 101 in fig. 1, or the server 102; the method comprises steps S201-S204, wherein:
S201, acquiring a multi-view image set of the target object.
The multi-view image set comprises a plurality of view images; the multiple viewpoint images are obtained by shooting the target object at multiple viewpoints respectively; the multi-view image set of the target object may include one or more images of the target object at one view point; for example, it is assumed that a multi-viewpoint image set includes a viewpoint image 1-a viewpoint image 3, the viewpoint image 1 is obtained by photographing a target object at a first viewpoint, and the viewpoint image 2 and the viewpoint image 3 are obtained by photographing the target object at a second viewpoint, the first viewpoint and the second viewpoint being different two viewpoints. The target object may specifically be furniture, decorations, automobiles, workpieces, animals, etc., which the present application is not limited to.
In one embodiment, the image capturing device captures the target object in different directions, so as to obtain a multi-view image set of the target object, where the image capturing device may be mounted on an independent image capturing device or on an image processing device.
S202, based on the position information of the target object in each view point image of the multi-view image set, predicting the three-dimensional coordinate point of the target object under the view point corresponding to each view point image.
The position information of the target object in each view image of the multi-view image set is used for indicating the position of the target object in each view image; the position information of the target object in any viewpoint image includes the pixel points corresponding to the visible region of the target object in that viewpoint image.
The image processing equipment carries out convolution processing on the position information of the target object in each view point image of the multi-view point image set to obtain a three-dimensional coordinate point of the target object under the view point corresponding to each view point image; the convolution processing may specifically be two-dimensional convolution processing or three-dimensional convolution processing.
In one embodiment, the image processing device performs convolution processing on position information of a target object in each view point image of the multi-view point image set through a neural network to obtain a three-dimensional coordinate point of the target object under a view point corresponding to each view point image, wherein the neural network comprises a two-dimensional convolution layer or a three-dimensional convolution layer; the two-dimensional convolution layer is used for carrying out two-dimensional convolution processing on the position information of the target object in each view point image of the multi-view point image set, and the three-dimensional convolution layer is used for carrying out three-dimensional convolution processing on the position information of the target object in each view point image of the multi-view point image set.
In one embodiment, the neural network includes an encoder and a three-dimensional structure generation model, and the convolution processing of the position information of the target object in each viewpoint image of the multi-viewpoint image set by the neural network proceeds as follows: each viewpoint image is encoded by the encoder to obtain a binary mask of each viewpoint image, where the binary mask of each viewpoint image carries the feature information of the pixels corresponding to the visible region of the target object in that viewpoint image. After the binary mask of each viewpoint image is obtained, the three-dimensional structure generation model is invoked to perform generation processing on the binary mask of each viewpoint image, so as to obtain the three-dimensional coordinate points of the target object under the viewpoint corresponding to each viewpoint image. The three-dimensional structure generation model includes a two-dimensional convolution layer, the two-dimensional convolution layer is used to perform a two-dimensional convolution operation on the binary mask of each viewpoint image, and the three-dimensional coordinate points of the target object under the viewpoint corresponding to each viewpoint image are predicted based on the result of that two-dimensional convolution operation.
Optionally, the three-dimensional structure generation model may instead include a three-dimensional convolution layer, and the three-dimensional convolution layer is used to perform a three-dimensional convolution operation on the binary mask of each viewpoint image. In this case, the three-dimensional coordinate points of the target object under the viewpoint corresponding to each viewpoint image are predicted based on the result of that three-dimensional convolution operation. Practice shows that performing the convolution processing on the position information of the target object in each viewpoint image of the multi-viewpoint image set with a two-dimensional convolution layer consumes fewer computing resources than doing so with a three-dimensional convolution layer.
S203, calibrating the three-dimensional coordinate points of the target object under each viewpoint according to the position relation between the viewpoints corresponding to the viewpoint images and the calibration points, so as to obtain the three-dimensional calibration coordinate points of the target object based on the calibration points.
The calibration point may be any point of the viewpoints corresponding to the respective viewpoint images, or may be a point other than the viewpoint corresponding to the respective viewpoint images; the calibration points may be preset or specified by the user.
The image processing apparatus determines a positional relationship between the viewpoint corresponding to each viewpoint image and the calibration point according to the spatial position of the viewpoint corresponding to each viewpoint image and the spatial position of the calibration point, the spatial position of the viewpoint corresponding to each viewpoint image being: the image acquisition equipment shoots the spatial position of the target object; for example, assuming that the viewpoint image 1 is obtained by capturing the target object at the spatial position a by the image capturing apparatus, the spatial position of the viewpoint corresponding to the viewpoint image 1 is the spatial position a.
The image processing equipment calibrates the three-dimensional coordinate point of the target object under each viewpoint according to the position relation between the viewpoint corresponding to each viewpoint image and the calibration point, and obtains the three-dimensional calibration coordinate point of the target object based on the calibration point; for example, assuming that the positional relationship between the viewpoint a corresponding to the viewpoint image 1 and the calibration point is a first positional relationship, the positional relationship between the three-dimensional coordinate point 1 under the viewpoint a and the viewpoint a is a second positional relationship, and based on the first positional relationship and the second positional relationship, a three-dimensional calibration coordinate point (i.e., determining coordinates of the three-dimensional coordinate point 1 based on the calibration point) of the three-dimensional coordinate point 1 based on the calibration point can be obtained.
S204, fusing the three-dimensional calibration coordinate points of the target object based on the calibration points to obtain a three-dimensional model of the target object.
In one embodiment, the image processing apparatus combines three-dimensional calibration coordinate points of the target object based on the calibration points to obtain a three-dimensional model of the target object.
In the embodiment of the application, a multi-view image set of a target object is obtained, and three-dimensional coordinate points of the target object under the view points corresponding to each view point image are predicted based on the position information of the target object in each view point image of the multi-view image set; the three-dimensional coordinate point of the target object under each viewpoint is calibrated according to the position relation between the viewpoint corresponding to each viewpoint image and the calibration point, so as to obtain a three-dimensional calibration coordinate point of the target object based on the calibration point; and the three-dimensional calibration coordinate points of the target object based on the calibration points are fused to obtain a three-dimensional model of the target object. Therefore, the three-dimensional calibration coordinate points of the target object based on the calibration point can be obtained based on the position information of the target object in the multi-viewpoint images, and the three-dimensional model of the target object is built by fusing the three-dimensional calibration coordinate points based on the calibration point, so that the modeling efficiency is effectively improved.
Referring to fig. 3, fig. 3 is a flowchart illustrating another image processing method according to an embodiment of the present application. The image processing method may be performed by an image processing apparatus, which may be specifically the terminal apparatus 101 in fig. 1, or the server 102; the method comprises steps S301-S310, wherein:
S301, acquiring a multi-view image set of a target object.
The specific embodiment of step S301 can refer to the embodiment of step S201 in fig. 2, and will not be described herein.
S302, encoding processing is carried out on each view image, and a binary mask of each view image is obtained.
In one embodiment, the image processing apparatus invokes an encoder to encode each view image to obtain a binary mask for each view image.
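As an illustration of this encoding step, the following Python sketch shows a minimal convolutional encoder that maps a viewpoint image to a binary mask of the target object's visible region; the layer sizes, the 0.5 threshold and the use of PyTorch are assumptions for illustration and are not specified by the embodiment.

```python
# Minimal sketch (not the patent's exact network): a small convolutional
# encoder that maps an RGB viewpoint image to a binary mask of the target
# object's visible region. Layer sizes and the 0.5 threshold are assumptions.
import torch
import torch.nn as nn

class MaskEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),           # one channel: object-ness score
            nn.Sigmoid(),
        )

    def forward(self, image):              # image: (B, 3, H, W)
        score = self.net(image)            # (B, 1, H, W) scores in [0, 1]
        return (score > 0.5).float()       # binarize to a 0/1 mask
                                            # (the soft score would be kept
                                            #  during training of the encoder)

encoder = MaskEncoder()
views = torch.rand(4, 3, 128, 128)          # 4 viewpoint images
masks = encoder(views)                       # (4, 1, 128, 128) binary masks
```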
S303, invoking the three-dimensional structure generation model to perform generation processing on the binary mask of each viewpoint image, so as to obtain the three-dimensional coordinate points of the target object under the viewpoint corresponding to each viewpoint image.
The three-dimensional structure generation model comprises a two-dimensional convolution layer, the two-dimensional convolution layer is used for carrying out two-dimensional convolution operation on the binary mask of each view point image, and the three-dimensional coordinate point of the target object under the view point corresponding to each view point image is obtained by prediction based on the two-dimensional convolution operation result of the binary mask of the view point image.
Optionally, the three-dimensional structure generation model may also include a three-dimensional convolution layer, where the three-dimensional convolution layer is configured to perform a three-dimensional convolution operation on the binary mask of each viewpoint image, and the three-dimensional coordinate points of the target object under the viewpoint corresponding to each viewpoint image are predicted based on the result of that three-dimensional convolution operation. Practice shows that, compared with a three-dimensional convolution layer, a two-dimensional convolution layer saves computing resources and improves computing efficiency. A sketch of such a generator is given below.
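The following sketch illustrates, under assumed layer sizes and an assumed point count, how a three-dimensional structure generation model built only from two-dimensional convolution layers can map a per-view binary mask to a set of per-view three-dimensional coordinate points; the architecture shown is illustrative and is not the model of the embodiment.

```python
# Sketch under assumptions: a generator built from 2D convolution layers that
# maps each view's binary mask to a fixed number of per-view 3D coordinate
# points (1024 here; the count and layer sizes are illustrative).
import torch
import torch.nn as nn

class StructureGenerator2D(nn.Module):
    def __init__(self, num_points=1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # 2D convolutions
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # only, no 3D
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(), # convolutions
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, num_points * 3)
        self.num_points = num_points

    def forward(self, mask):                        # mask: (B, 1, H, W)
        feat = self.features(mask).flatten(1)       # (B, 128)
        pts = self.head(feat)                       # (B, num_points * 3)
        return pts.view(-1, self.num_points, 3)     # per-view 3D coordinate points

gen = StructureGenerator2D()
points_per_view = gen(torch.rand(4, 1, 128, 128))   # (4, 1024, 3)
```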
S304, determining a calibration matrix corresponding to each viewpoint according to the spatial position of the viewpoint corresponding to each viewpoint image and the spatial position of the calibration point.
The calibration matrix is used to estimate the relevant parameters of the viewpoint image and the image sensor. The size of the target object in the real world can be measured using these parameters, or the positional relationship of the viewpoint corresponding to the viewpoint image and the calibration point can be indicated.
S305, calibrating the three-dimensional coordinate points under each viewpoint through the calibration matrix corresponding to each viewpoint to obtain the three-dimensional calibration coordinate points of the target object based on the calibration points.
In one embodiment, the image processing apparatus converts three-dimensional coordinate points at the respective viewpoints into three-dimensional coordinate points at the calibration points (i.e., three-dimensional calibration coordinate points) through a calibration matrix corresponding to each viewpoint. Equation 1 is an expression for calibrating a three-dimensional coordinate point under a viewpoint corresponding to a viewpoint image provided in an embodiment of the present application:
$\hat{P}_i = R_n^{-1}\left(K^{-1}\,\hat{x}_i^{\,n} - t_n\right)$  (Equation 1)

wherein $\hat{P}_i$ is the three-dimensional calibration coordinate point of the $i$-th point based on the calibration point; $R_n^{-1}$ denotes the inverse of the orientation matrix of the $n$-th viewpoint; $K^{-1}$ denotes the inverse of the calibration matrix; $\hat{x}_i^{\,n}$ is the three-dimensional coordinate point of the $i$-th point at the $n$-th viewpoint, which can be expressed as $\hat{x}_i^{\,n}=\hat{z}_i^{\,n}\,[x_i^{\,n},\,y_i^{\,n},\,1]^{T}$, where $(x_i^{\,n},\,y_i^{\,n})$ are the coordinates of the $i$-th point and $\hat{z}_i^{\,n}$ is the depth value of the $i$-th point; and $t_n$ is the offset vector of the $n$-th viewpoint.
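For illustration, a possible numpy implementation of Equation 1 is sketched below; the intrinsic (calibration) matrix, orientation matrix and offset vector used in the example are assumed values, not values from the embodiment.

```python
# A numpy sketch of Equation 1: back-project each predicted image-space point
# (x, y) with depth z at viewpoint n into the calibration-point coordinate
# system using the calibration matrix K, orientation matrix R_n and offset
# vector t_n. All matrix values here are illustrative.
import numpy as np

def calibrate_points(xy, depth, K, R_n, t_n):
    """xy: (N, 2) image coords, depth: (N,) depth values -> (N, 3) calibrated points."""
    homo = np.concatenate([xy, np.ones((xy.shape[0], 1))], axis=1)   # (N, 3): [x, y, 1]
    cam = (np.linalg.inv(K) @ (homo * depth[:, None]).T).T           # K^{-1} * z * [x, y, 1]
    return (np.linalg.inv(R_n) @ (cam - t_n).T).T                    # R_n^{-1} (... - t_n)

K = np.array([[500., 0., 64.], [0., 500., 64.], [0., 0., 1.]])        # assumed intrinsics
R_n = np.eye(3)                                                        # assumed orientation
t_n = np.array([0., 0., 1.])                                           # assumed offset
xy = np.array([[64., 64.], [80., 60.]])
depth = np.array([2.0, 2.5])
print(calibrate_points(xy, depth, K, R_n, t_n))
```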
S306, fusing the three-dimensional calibration coordinate points of the target object based on the calibration points to obtain a three-dimensional model of the target object.
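Because the calibrated points of every viewpoint are already expressed relative to the same calibration point, the fusion of this step can be illustrated as a simple combination of the per-view point sets, as in the sketch below; the duplicate-merging threshold is an assumption for illustration.

```python
# Minimal sketch of the fusion step: concatenate the per-view calibrated point
# sets and optionally drop near-duplicate points. The merge distance is assumed.
import numpy as np

def fuse_calibrated_points(per_view_points, merge_dist=1e-3):
    merged = np.concatenate(per_view_points, axis=0)         # stack all views
    rounded = np.round(merged / merge_dist).astype(np.int64)
    _, keep = np.unique(rounded, axis=0, return_index=True)  # drop duplicates
    return merged[np.sort(keep)]                             # (M, 3) model points

model_points = fuse_calibrated_points([np.random.rand(1024, 3) for _ in range(4)])
```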
In one embodiment, after the three-dimensional model of the target object is obtained, parameters in the three-dimensional structure generating model may be optimized based on the three-dimensional model of the target object, so as to obtain an optimized three-dimensional structure generating model. The specific optimization steps comprise the steps S11-S14:
S11, acquiring the spatial position of the reference viewpoint.
The reference viewpoint refers to any viewpoint other than the viewpoint corresponding to each viewpoint image in the multi-viewpoint image set. In one embodiment, the spatial position of the reference viewpoint is determined by an orientation matrix and an offset vector. The orientation matrix and the offset vector of the reference viewpoint are used to indicate the positional relationship of the reference viewpoint and the calibration point, and the spatial position of the reference viewpoint can be determined based on the positional relationship of the reference viewpoint and the calibration point.
S12, based on the spatial position of the reference viewpoint, projecting the three-dimensional model to obtain a projection image of the three-dimensional model of the target object under the reference viewpoint.
The projected image of the three-dimensional model of the target object at the reference viewpoint is used to simulate the viewpoint image of the target object at the reference viewpoint. Because the three-dimensional model is projected into a two-dimensional plane, each pixel position in the projected image may include one or more pixels, each pixel carrying the position coordinates of the pixel, and depth value information. The depth value information of each pixel point is used for indicating the position of the pixel point from the reference viewpoint; the greater the distance of a pixel point from a reference viewpoint, the greater the depth value of the pixel point.
Equation 2 is an expression for projecting a point on a three-dimensional model of a target object into a two-dimensional plane corresponding to a reference viewpoint according to an embodiment of the present application:
$\hat{x}_i^{\,k} = K\left(R_k\,\hat{P}_i + t_k\right)$  (Equation 2)

Equation 2 is the inverse of Equation 1 above, wherein $\hat{x}_i^{\,k}$ is the three-dimensional coordinate point of the $i$-th point at the $k$-th reference viewpoint, which can be expressed as $\hat{x}_i^{\,k}=\hat{z}_i^{\,k}\,[x_i^{\,k},\,y_i^{\,k},\,1]^{T}$, where $(x_i^{\,k},\,y_i^{\,k})$ are the coordinates of the $i$-th point and $\hat{z}_i^{\,k}$ is the depth value of the $i$-th point; $K$ is the calibration matrix, which can be obtained from the positional relationship between the reference viewpoint and the calibration point; $R_k$ is the orientation matrix of the $k$-th reference viewpoint; $t_k$ is the offset vector of the $k$-th reference viewpoint; and $\hat{P}_i$ is the three-dimensional calibration coordinate point of the $i$-th point based on the calibration point.
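A numpy sketch of Equation 2 is given below for illustration; it projects calibrated three-dimensional points into the image plane of a reference viewpoint while keeping the depth value of each projected point. The intrinsic matrix used is an assumed value.

```python
# A numpy sketch of Equation 2 (the inverse of Equation 1): project calibrated
# 3D points into the image plane of the kth reference viewpoint. Each projected
# point keeps its depth value so the projection image can be pseudo-rendered.
import numpy as np

def project_points(points, K, R_k, t_k):
    """points: (N, 3) calibrated points -> (N, 2) pixel coords and (N,) depths."""
    cam = (R_k @ points.T).T + t_k           # R_k * P_i + t_k
    img = (K @ cam.T).T                      # K (R_k * P_i + t_k) = z * [x, y, 1]
    depth = img[:, 2]
    xy = img[:, :2] / depth[:, None]         # divide out the depth value
    return xy, depth

K = np.array([[500., 0., 64.], [0., 500., 64.], [0., 0., 1.]])  # assumed intrinsics
xy, depth = project_points(np.random.rand(1024, 3), K, np.eye(3), np.array([0., 0., 2.]))
```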
S13, performing pseudo rendering processing on the projection image of the three-dimensional model at the reference viewpoint to obtain a predicted image of the three-dimensional model at the reference viewpoint.
The predicted image is a pixelated depth image of the three-dimensional model of the target object at the reference viewpoint; a pixelated depth image is an image in which the information of each pixel contains the depth value of that pixel. Equation 3 is an expression, provided in an embodiment of the present application, for performing pseudo-rendering processing on the projection image of the three-dimensional model at the reference viewpoint to obtain the predicted image of the three-dimensional model at the reference viewpoint:
$\hat{Z}_k = \mathcal{PR}\big(\hat{P}\big)$  (Equation 3)

wherein $\hat{Z}_k$ is the pixelated depth image of the three-dimensional model of the target object at the reference viewpoint, $\hat{P}$ is the set of three-dimensional calibration coordinate points of the target object based on the calibration point, and $\mathcal{PR}(\cdot)$ denotes the pseudo-rendering processing performed by the pseudo renderer on that set of three-dimensional calibration coordinate points.
Further, a mask $\hat{M}_k$ corresponding to the predicted image can be obtained from the predicted image of the three-dimensional model at the reference viewpoint, wherein the mask value of a pixel belonging to the target object is 1 and the mask value of a pixel not belonging to the target object is 0.
In one embodiment, a U-times upsampling process is performed on the projection image of the three-dimensional model at the reference viewpoint to obtain an upsampled image, where U is a positive integer. After the upsampled image is obtained, the depth value of each pixel contained in the upsampled image is updated to the inverse of that depth value; for example, assuming that the depth value of pixel A before updating is 2, the depth value of pixel A after updating is 0.5; for another example, assuming that the depth value of pixel B before updating is 5, the depth value of pixel B after updating is 0.2. After the depth value of each pixel in the upsampled image has been updated, a convolution kernel of size U is used to perform downsampling processing on the upsampled image (for example, performing a max pooling operation on the upsampled image), so as to obtain the predicted image of the three-dimensional model at the reference viewpoint.
Downsampling the upsampled image with a convolution kernel of size U restores the upsampled image to the resolution of the projection image; and by updating the depth value of each pixel contained in the upsampled image to the inverse of that depth value and performing the max pooling operation on the upsampled image, the pixel with the smallest depth value at each pixel position can be retained in the downsampled image.
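The following PyTorch sketch illustrates this pseudo-rendering step under the assumption that the upsampled projection image is given as a dense depth map with zeros at empty pixel positions; the value U = 4 and the image size are illustrative.

```python
# Sketch of the pseudo-rendering step: depth values are replaced by their
# inverses, a U x U max pooling performs the downsampling back to the original
# resolution, and inverting again leaves the smallest depth (nearest point)
# at every pixel position. Empty pixels are encoded as 0.
import torch
import torch.nn.functional as F

def pseudo_render(up_depth, U):
    """up_depth: (B, 1, U*H, U*W) upsampled depth image -> (B, 1, H, W) predicted depth."""
    inv = torch.where(up_depth > 0, 1.0 / up_depth, torch.zeros_like(up_depth))
    pooled = F.max_pool2d(inv, kernel_size=U)          # max of inverses == min depth
    return torch.where(pooled > 0, 1.0 / pooled, torch.zeros_like(pooled))

U = 4
up = torch.zeros(1, 1, 128 * U, 128 * U)
up[0, 0, 200, 300] = 2.0                                # two points fall in the
up[0, 0, 201, 301] = 5.0                                # same pooling window
pred = pseudo_render(up, U)                             # keeps the depth-2.0 point
```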
In practical applications, since the image produced by a conventional rendering process is generally not differentiable, the rendered image cannot be directly used in and incorporated into a deep learning framework. Therefore, the present application performs pseudo-rendering processing on the projection image of the three-dimensional model at the reference viewpoint. Pseudo-rendering is differentiable rendering, that is, a rendering process whose output can be differentiated with respect to its input: the forward pass is the same as conventional rendering, producing an image from the input model and parameters, while the backward pass gives the derivatives of the pixels with respect to the scene parameters. Differentiable rendering therefore provides not only a rendering result but also the derivative of that result with respect to the input, and it maintains the differentiability and parallelism of the predicted image of the three-dimensional model at the reference viewpoint.
S14, optimizing parameters in the three-dimensional structure generating model according to the loss value between the predicted image under the reference viewpoint and the marked image under the reference viewpoint to obtain an optimized three-dimensional structure generating model.
Calculating a depth loss component according to the difference between the depth image corresponding to the predicted image under the reference viewpoint and the depth image corresponding to the labeling image under the reference viewpoint; the reference viewpoint labeling image is a viewpoint image obtained by photographing the target object at the reference viewpoint. Equation 4 is a calculation expression of a depth loss component provided in the embodiment of the present application:
$L_{depth} = \sum_{k=1}^{K}\big\|\hat{Z}_k - Z_k\big\|$  (Equation 4)

wherein $L_{depth}$ denotes the depth loss component, $k$ denotes the $k$-th of the $K$ reference viewpoints, $\hat{Z}_k$ is the predicted image of the three-dimensional model of the target object at the $k$-th reference viewpoint, $Z_k$ is the labeled image of the target object at the $k$-th reference viewpoint, and $\|x\|$ denotes the norm of $x$.
Calculating a mask loss component based on a difference between a mask image corresponding to the predicted image at the reference viewpoint and a mask image corresponding to the annotation image at the reference viewpoint; equation 5 is a calculation expression of a mask loss component provided in the embodiment of the present application:
$L_{mask} = \sum_{k=1}^{K}\big\|\hat{M}_k - M_k\big\|$  (Equation 5)

wherein $L_{mask}$ denotes the mask loss component, $k$ denotes the $k$-th of the $K$ reference viewpoints, $M_k$ is the mask of the labeled image of the target object at the $k$-th reference viewpoint, and $\hat{M}_k$ is the mask of the predicted image of the three-dimensional model of the target object at the $k$-th reference viewpoint.
Obtaining a loss value between the predicted image under the reference viewpoint and the marked image under the reference viewpoint through the depth loss component and the mask loss component; equation 6 is a calculation expression of a loss value provided in the embodiment of the present application:
$L = L_{mask} + \lambda \cdot L_{depth}$  (Equation 6)

wherein $L$ denotes the loss value between the predicted image at the reference viewpoint and the labeled image at the reference viewpoint, $L_{mask}$ denotes the mask loss component, $L_{depth}$ denotes the depth loss component, and $\lambda$ is a weighting coefficient whose value range is $(0, 1)$.
Parameters in the three-dimensional structure generation model are optimized according to the loss value so that the three-dimensional structure generation model meets the optimization condition; for example, so that the loss value L obtained with the optimized three-dimensional structure generation model is smaller than a loss threshold.
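For illustration, the loss of Equations 4 to 6 can be sketched as follows, assuming that both the depth loss and the mask loss use an L1 norm over the image and that λ = 0.5; the batch dimension here plays the role of the K reference viewpoints. These choices are assumptions, not values fixed by the embodiment.

```python
# Sketch of Equations 4-6 under an assumed L1 norm and an assumed lambda.
# The leading tensor dimension indexes the K reference viewpoints.
import torch

def reconstruction_loss(pred_depth, gt_depth, pred_mask, gt_mask, lam=0.5):
    l_depth = (pred_depth - gt_depth).abs().sum()   # Equation 4: sum over k of ||Z_hat_k - Z_k||
    l_mask = (pred_mask - gt_mask).abs().sum()      # Equation 5: sum over k of ||M_hat_k - M_k||
    return l_mask + lam * l_depth                   # Equation 6: L = L_mask + lambda * L_depth

pred_depth = torch.rand(8, 1, 128, 128, requires_grad=True)   # stands in for generator output
pred_mask = torch.rand(8, 1, 128, 128, requires_grad=True)
loss = reconstruction_loss(pred_depth, torch.rand(8, 1, 128, 128),
                           pred_mask, torch.rand(8, 1, 128, 128))
loss.backward()   # in training, gradients flow back to the generator parameters
```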
Fig. 4 is a schematic diagram of model optimization according to an embodiment of the present application. As shown in fig. 4, first, n viewpoint images of a target object are encoded by an encoder to obtain the binary masks corresponding to the n viewpoint images; then, the three-dimensional structure generation model is invoked to perform generation processing on the binary mask corresponding to each viewpoint image, obtaining the three-dimensional coordinate points of the target object at the n viewpoints; the three-dimensional coordinate points at the n viewpoints are calibrated based on the spatial positions of the n viewpoints (represented, for example, by $(R_1, t_1), \ldots, (R_N, t_N)$) and the calibration point, obtaining the three-dimensional calibration coordinate points of the target object based on the calibration point; and the three-dimensional calibration coordinate points of the target object based on the calibration point are fused to obtain the three-dimensional model of the target object. Further, after the three-dimensional model of the target object is obtained, the three-dimensional model can be projected based on the spatial positions of k reference viewpoints to obtain k projection images of the three-dimensional model of the target object at the k reference viewpoints; pseudo-rendering processing is then performed on the k projection images to obtain the predicted images of the three-dimensional model of the target object at the k reference viewpoints; a loss value is calculated based on the difference between the predicted image at each reference viewpoint and the labeled image at that reference viewpoint, and one or more parameters in the three-dimensional structure generation model are adjusted based on the loss value so that the three-dimensional structure generation model meets the optimization condition; for example, so that the loss value obtained with the optimized three-dimensional structure generation model is smaller than the loss threshold.
It should be noted that, the steps S11 to S14 may be performed by a single model optimizing apparatus or may be performed by an image processing apparatus, which is not limited in this application.
Therefore, the three-dimensional structure generating model is optimized through the loss value between the predicted image at the reference viewpoint and the annotated image at the reference viewpoint, which improves the prediction accuracy of the three-dimensional model of the target object. It should be noted that, by optimizing the three-dimensional structure generating model with reference viewpoints other than the viewpoints corresponding to the viewpoint images in the multi-viewpoint image set, the errors can be distributed uniformly over the reference viewpoints rather than concentrated on the viewpoints corresponding to the viewpoint images in the multi-viewpoint image set.
S307, constructing a three-dimensional point cloud map, and acquiring the placing position information of the three-dimensional model of the target object in the three-dimensional point cloud map.
In one embodiment, the image processing device acquires position information of a reference object. The reference object may be specified by a user; for example, the user designates the reference object and specifies the three-dimensional coordinates corresponding to it. Alternatively, the position information of the reference object may be set in advance. After the position information of the reference object is acquired, a reference coordinate system is established based on that position information; projection information is associated with the reference coordinate system and is used to indicate the projection relation between the reference coordinate system and the real world; for example, the projection information may be scale information.
Further, the image processing device may construct a three-dimensional point cloud map in the reference coordinate system based on the projection information; for example, the three-dimensional point cloud map is tracked and expanded in the reference coordinate system by a simultaneous localization and mapping (Simultaneous Localization And Mapping, SLAM) technique.
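As a small, hedged illustration of how scale-type projection information could relate the reference coordinate system to the real world (the SLAM-based map construction itself is not reproduced here), consider the helper below; the choice of origin and the units-per-metre scale are assumptions made for the example.

```python
import numpy as np

def to_reference_frame(world_point, ref_origin, scale):
    """Map a real-world 3D position into the reference coordinate system.

    world_point: (3,) position measured in the real world (e.g. metres)
    ref_origin:  (3,) real-world position of the reference object (frame origin)
    scale:       assumed projection information: reference-frame units per metre
    """
    return (np.asarray(world_point) - np.asarray(ref_origin)) * scale

# Example: a point 1.2 m in front of the reference object, with 100 units per metre.
print(to_reference_frame([0.0, 0.0, 1.2], [0.0, 0.0, 0.0], 100.0))  # [  0.   0. 120.]
```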
And S308, in response to the confirmation of the placement position information, adding the three-dimensional model of the target object to the position indicated by the placement position information in the three-dimensional point cloud map.
In one embodiment, after the user confirms the placement position of the three-dimensional model of the target object in the point cloud map, the three-dimensional model of the target object, projected according to the projection information, is added to the position indicated by the placement position information in the three-dimensional point cloud map.
S309, responding to the viewing operation of the three-dimensional point cloud map, and acquiring the observation position of the three-dimensional point cloud map.
In one embodiment, when a user selects an observation position in a three-dimensional point cloud map, the image processing apparatus acquires coordinates of the observation position in the three-dimensional point cloud map.
And S310, carrying out plane projection on the three-dimensional point cloud map based on the observation position to obtain an observation effect image of the three-dimensional point cloud map under the observation position.
Taking AR home as an example, after a user puts a furniture model in a virtual room, different viewpoints can be selected to observe the furniture model; each time a user selects an observation position, the furniture model in the virtual room is subjected to plane projection based on the observation position, and a two-dimensional observation effect image of the furniture model in the virtual room under the observation position is obtained.
Optionally, the user can save or share the observation effect images at each observation position, so as to facilitate subsequent use; for example, after a plurality of observation effect images are obtained, the observation effect images may be compared with each other.
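As a rough sketch of the plane projection described above, the snippet below projects map points into pixel coordinates for a chosen observation pose using a simple pinhole model; the intrinsic parameters and the pose convention are assumptions made for illustration, not the application's actual procedure.

```python
import numpy as np

def observe(points, R, t, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Project map points (N, 3) into pixel coordinates for an observation pose.

    R, t: rotation (3, 3) and translation (3,) taking map coordinates to the
          observation (camera) frame; fx, fy, cx, cy are assumed intrinsics.
    """
    cam = points @ R.T + t                 # map frame -> observation frame
    cam = cam[cam[:, 2] > 1e-6]            # keep points in front of the viewer
    u = fx * cam[:, 0] / cam[:, 2] + cx    # perspective divide + intrinsics
    v = fy * cam[:, 1] / cam[:, 2] + cy
    return np.stack([u, v], axis=1)

# Example: view three map points from the map origin, looking along +z.
pts = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 2.0], [0.0, -0.5, 4.0]])
print(observe(pts, np.eye(3), np.zeros(3)))
```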
In the embodiment of the present application, a multi-viewpoint image set of a target object is acquired, and the three-dimensional coordinate point of the target object under the viewpoint corresponding to each viewpoint image is predicted based on the position information of the target object in each viewpoint image of the multi-viewpoint image set; the three-dimensional coordinate point of the target object under each viewpoint is calibrated according to the positional relationship between the viewpoint corresponding to each viewpoint image and the calibration point, to obtain the three-dimensional calibration coordinate points of the target object based on the calibration point; and the three-dimensional calibration coordinate points of the target object based on the calibration point are fused to obtain the three-dimensional model of the target object. In this way, the three-dimensional calibration coordinate points of the target object based on the calibration point can be obtained from the position information of the target object in the multi-viewpoint images, and the three-dimensional model of the target object is built by fusing these points, which effectively improves modeling efficiency. In addition, a two-dimensional convolution layer is used to perform the two-dimensional convolution operation on the binary mask of each viewpoint image, which saves computing resources and improves computing efficiency; the projection image of the three-dimensional model at the reference viewpoint is pseudo-rendered, which preserves the differentiability and parallelism of the projection image of the three-dimensional model at the reference viewpoint; and the three-dimensional structure generating model is optimized through the loss value between the predicted image and the annotated image under the reference viewpoint, which improves the prediction accuracy of the three-dimensional model of the target object.
The foregoing details of the method of embodiments of the present application are set forth in order to provide a better understanding of the foregoing aspects of embodiments of the present application, and accordingly, the following provides a device of embodiments of the present application.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application, where the apparatus may be mounted on the terminal device 101 or the server 102 shown in fig. 1. The image processing device shown in fig. 5 may be used to perform some or all of the functions of the method embodiments described above with respect to fig. 2 and 3. Referring to fig. 5, the detailed descriptions of the respective units are as follows:
an obtaining unit 501, configured to obtain a multi-view image set of a target object, where the multi-view image set includes a plurality of view images; the multiple viewpoint images are obtained by shooting the target object at multiple viewpoints respectively;
a processing unit 502, configured to predict a three-dimensional coordinate point of the target object under a viewpoint corresponding to each viewpoint image based on position information of the target object in each viewpoint image of the multi-viewpoint image set;
and configured to calibrate the three-dimensional coordinate point of the target object under each viewpoint according to the positional relationship between the viewpoint corresponding to each viewpoint image and the calibration point, to obtain a three-dimensional calibration coordinate point of the target object based on the calibration point;
and configured to fuse the three-dimensional calibration coordinate points of the target object based on the calibration point, so as to obtain a three-dimensional model of the target object.
In one embodiment, the processing unit is configured to predict, based on the position information of the target object in each viewpoint image, a three-dimensional coordinate point of the target object in a viewpoint corresponding to each viewpoint image, specifically configured to:
and carrying out convolution processing on the position information of the target object in each view point image of the multi-view image set to obtain a three-dimensional coordinate point of the target object under the view point corresponding to each view point image.
In one embodiment, the processing unit 502 is configured to perform convolution processing on the position information of the target object in each viewpoint image of the multi-viewpoint image set to obtain a three-dimensional coordinate point of the target object under the viewpoint corresponding to each viewpoint image, and is specifically configured to:
encoding each view image to obtain a binary mask of each view image;
invoking a three-dimensional structure generation model to generate and process binary masks of all the viewpoint images to obtain three-dimensional coordinate points of the target object under the viewpoints corresponding to all the viewpoint images;
the three-dimensional structure generation model comprises a two-dimensional convolution layer, and the two-dimensional convolution layer is used for carrying out two-dimensional convolution operation on binary masks of each view image.
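Purely as an illustration of a structure-generation network built around two-dimensional convolution layers, the toy PyTorch model below maps a binary mask to a fixed number of three-dimensional coordinate points; the layer sizes, the output point count, and the framework choice are assumptions for the example and are not taken from the application.

```python
import torch
import torch.nn as nn

class StructureGenerator(nn.Module):
    """Toy 3D structure generation model: a binary mask in, N 3D points out."""

    def __init__(self, num_points=1024):
        super().__init__()
        self.features = nn.Sequential(          # 2D convolutions over the binary mask
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.head = nn.Linear(32 * 8 * 8, num_points * 3)
        self.num_points = num_points

    def forward(self, mask):                    # mask: (B, 1, H, W), values in {0, 1}
        x = self.features(mask).flatten(1)
        return self.head(x).view(-1, self.num_points, 3)   # (B, N, 3) coordinate points

model = StructureGenerator()
points = model(torch.zeros(1, 1, 128, 128))     # example forward pass
print(points.shape)                             # torch.Size([1, 1024, 3])
```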
In one embodiment, the processing unit 502 is further configured to:
acquiring a spatial position of a reference viewpoint, wherein the reference viewpoint refers to any viewpoint except the viewpoint corresponding to each viewpoint image in the multi-viewpoint image set;
based on the space position of the reference viewpoint, projecting the three-dimensional model to obtain a projection image of the three-dimensional model of the target object under the reference viewpoint;
performing pseudo rendering processing on the projection image of the three-dimensional model at the reference viewpoint to obtain a predicted image of the three-dimensional model at the reference viewpoint;
and optimizing parameters in the three-dimensional structure generating model according to the loss value between the predicted image under the reference viewpoint and the marked image under the reference viewpoint to obtain the optimized three-dimensional structure generating model.
In one embodiment, the processing unit 502 is configured to perform pseudo-rendering processing on the projection image of the three-dimensional model at the reference viewpoint to obtain a predicted image of the three-dimensional model at the reference viewpoint, and is specifically configured to:
performing U times up-sampling processing on a projection image of the three-dimensional model under a reference viewpoint to obtain an up-sampled image, wherein U is a positive integer;
and carrying out downsampling processing on the updated upsampled image, and reserving the pixel point with the minimum depth value in each pixel position of the downsampled image to obtain a predicted image of the three-dimensional model under the reference viewpoint.
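A minimal sketch of the pseudo-rendering idea is given below. For brevity, the U-times upsampling is folded into scattering the projected points directly onto a U-times finer grid, after which downsampling keeps the smallest depth value at each output pixel position; the function name and interface are assumptions made for illustration.

```python
import numpy as np

def pseudo_render(uv, depth_vals, h, w, u=2, empty=np.inf):
    """Pseudo-rendering sketch: scatter projected points onto a U-times finer grid,
    then downsample by keeping the smallest depth at each output pixel position.

    uv:         (N, 2) projected pixel coordinates of the model's points
    depth_vals: (N,)  corresponding depth values
    h, w:       output image size; u: upsampling factor
    """
    grid = np.full((h * u, w * u), empty)
    cols = np.clip((uv[:, 0] * u).astype(int), 0, w * u - 1)
    rows = np.clip((uv[:, 1] * u).astype(int), 0, h * u - 1)
    np.minimum.at(grid, (rows, cols), depth_vals)             # nearest point wins per cell
    blocks = grid.reshape(h, u, w, u).swapaxes(1, 2).reshape(h, w, u * u)
    return blocks.min(axis=2)                                  # min-depth per pixel position

# Example: two points landing in the same output pixel; the nearer one is kept.
uv = np.array([[1.2, 0.7], [1.4, 0.6]])
print(pseudo_render(uv, np.array([3.0, 2.0]), h=2, w=3, u=2))
```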
In one embodiment, the processing unit 502 is configured to optimize the three-dimensional structure generating model according to the loss value between the predicted image at the reference viewpoint and the annotated image at the reference viewpoint, and is specifically configured to:
calculating a depth loss component according to the difference between the depth image corresponding to the predicted image under the reference viewpoint and the depth image corresponding to the labeling image under the reference viewpoint;
calculating a mask loss component based on a difference between a mask image corresponding to the predicted image at the reference viewpoint and a mask image corresponding to the annotation image at the reference viewpoint;
obtaining a loss value between the predicted image under the reference viewpoint and the marked image under the reference viewpoint through the depth loss component and the mask loss component;
and optimizing parameters in the three-dimensional structure generating model according to the loss value so that the three-dimensional structure generating model meets the optimization condition.
In one embodiment, the processing unit 502 is configured to calibrate the three-dimensional coordinate point of the target object under each viewpoint according to the positional relationship between the viewpoint corresponding to each viewpoint image and the calibration point, so as to obtain the three-dimensional calibration coordinate points of the target object based on the calibration point, and is specifically configured to:
Determining a calibration matrix corresponding to each viewpoint according to the spatial position of the viewpoint corresponding to each viewpoint image and the spatial position of the calibration point;
and calibrating the three-dimensional coordinate point under each viewpoint through the corresponding calibration matrix of each viewpoint to obtain the three-dimensional calibration coordinate point of the target object based on the calibration point.
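The following sketch shows one way a calibration matrix built from a viewpoint's rotation and translation relative to the calibration point could be applied to that viewpoint's three-dimensional coordinate points; the homogeneous-transform formulation is an assumption made for illustration, and the application's actual matrix composition may differ.

```python
import numpy as np

def calibration_matrix(R, t):
    """Build a 4x4 homogeneous transform from a viewpoint's rotation R (3, 3)
    and translation t (3,) relative to the calibration point."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def calibrate_points(points, R, t):
    """Map per-viewpoint 3D coordinate points (N, 3) into the calibration-point frame."""
    homo = np.hstack([points, np.ones((points.shape[0], 1))])   # to homogeneous coordinates
    return (homo @ calibration_matrix(R, t).T)[:, :3]

# Example: a viewpoint rotated 90 degrees about z and shifted along x.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(calibrate_points(np.array([[1.0, 0.0, 0.0]]), Rz, np.array([2.0, 0.0, 0.0])))
# -> [[2. 1. 0.]]
```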
In one embodiment, the processing unit 502 is further configured to:
constructing a three-dimensional point cloud map, and acquiring the placing position information of a three-dimensional model of a target object in the three-dimensional point cloud map;
in response to the pose location information being confirmed, adding the three-dimensional model of the target object to a location in the three-dimensional point cloud map indicated by the pose location information.
In one embodiment, the processing unit 502 is configured to construct a three-dimensional point cloud map, and is specifically configured to:
acquiring position information of a reference object;
establishing a reference coordinate system based on the position information of the reference object, wherein the reference coordinate system is associated with projection information, and the projection information is used for indicating the projection relation between the reference coordinate system and the real world;
and constructing a three-dimensional point cloud map in a reference coordinate system through projection information.
In one embodiment, the processing unit 502 is further configured to:
responding to the viewing operation of the three-dimensional point cloud map, and acquiring the observation position of the three-dimensional point cloud map;
And carrying out plane projection on the three-dimensional point cloud map based on the observation position to obtain an observation effect image of the three-dimensional point cloud map under the observation position.
According to one embodiment of the present application, part of the steps involved in the image processing methods shown in fig. 2 and 3 may be performed by respective units in the image processing apparatus shown in fig. 5. For example, step S201 shown in fig. 2 may be performed by the acquisition unit 501 shown in fig. 5, and steps S202 to S204 may be performed by the processing unit 502 shown in fig. 5. Step S301 shown in fig. 3 may be performed by the acquisition unit 501 shown in fig. 5, and steps S302 to S310 may be performed by the processing unit 502 shown in fig. 5. The respective units in the image processing apparatus shown in fig. 5 may be individually or collectively combined into one or several additional units, or some unit(s) thereof may be further split into a plurality of units smaller in function, which may achieve the same operation without affecting the achievement of the technical effects of the embodiments of the present application. The above units are divided based on logic functions, and in practical applications, the functions of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present application, the image processing apparatus may also include other units, and in practical applications, these functions may also be realized with assistance of other units, and may be realized by cooperation of a plurality of units.
According to another embodiment of the present application, the image processing apparatus shown in fig. 5 may be constructed, and the image processing method of the embodiments of the present application may be implemented, by running a computer program (including program code) capable of executing the steps involved in the methods shown in fig. 2 and 3 on a general-purpose computing device, such as a computer, that includes processing elements such as a central processing unit (CPU) and storage elements such as a random access memory (RAM) and a read-only memory (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, loaded into the above computing device through the computer-readable recording medium, and run therein.
Based on the same inventive concept, the principle and beneficial effects of the image processing device for solving the problems provided in the embodiments of the present application are similar to those of the image processing method for solving the problems in the embodiments of the method of the present application, and may refer to the principle and beneficial effects of implementation of the method, which are not described herein for brevity.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an image processing device according to an embodiment of the present application. As shown in fig. 6, the image processing device at least includes a processor 601, a communication interface 602, and a memory 603, where the processor 601, the communication interface 602, and the memory 603 may be connected by a bus or in other ways. The processor 601 (also called a central processing unit (Central Processing Unit, CPU)) is the computing core and control core of the terminal; it can parse various instructions in the terminal and process various data of the terminal. For example, the CPU can parse a power-on/power-off instruction sent by the user to the terminal and control the terminal to perform the power-on/power-off operation; as another example, the CPU can transmit various kinds of interactive data between the internal structures of the terminal, and so on. The communication interface 602 may optionally include a standard wired interface or a wireless interface (e.g., WI-FI, a mobile communication interface, etc.), and may be controlled by the processor 601 to receive and transmit data; the communication interface 602 may also be used for transmission and interaction of data inside the terminal. The memory 603 (Memory) is a memory device in the terminal for storing programs and data. It will be appreciated that the memory 603 here may include both the built-in memory of the terminal and the expansion memory supported by the terminal. The memory 603 provides storage space that stores the operating system of the terminal, which may include, but is not limited to: an Android system, an iOS system, a Windows Phone system, etc., which is not limited in this application.
The embodiment of the application also provides a computer readable storage medium (Memory), which is a Memory device in the terminal and is used for storing programs and data. It will be appreciated that the computer readable storage medium herein may include both a built-in storage medium in the terminal and an extended storage medium supported by the terminal. The computer readable storage medium provides a storage space that stores a processing system of the terminal. Also stored in this memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor 601. Note that the computer readable storage medium can be either a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory; alternatively, it may be at least one computer-readable storage medium located remotely from the aforementioned processor.
In one embodiment, the computer-readable storage medium has one or more instructions stored therein; loading and executing, by the processor 601, one or more instructions stored in a computer-readable storage medium to implement the respective steps in the image processing method embodiments described above; in particular implementations, one or more instructions in a computer-readable storage medium are loaded by processor 601 and perform the following:
Acquiring a multi-view image set of a target object through a communication interface 602, wherein the multi-view image set comprises a plurality of view images; the multiple viewpoint images are obtained by shooting the target object at multiple viewpoints respectively;
predicting a three-dimensional coordinate point of the target object under the view point corresponding to each view point image based on the position information of the target object in each view point image of the multi-view image set;
calibrating a three-dimensional coordinate point of the target object under each viewpoint according to the position relation between the viewpoint corresponding to each viewpoint image and the calibration point, so as to obtain a three-dimensional calibration coordinate point of the target object based on the calibration point;
and fusing the three-dimensional calibration coordinate points of the target object based on the calibration points to obtain a three-dimensional model of the target object.
As an alternative embodiment, the processor 601 predicts, based on the position information of the target object in each viewpoint image, a specific embodiment of the three-dimensional coordinate point of the target object under the viewpoint corresponding to each viewpoint image as follows:
and carrying out convolution processing on the position information of the target object in each view point image of the multi-view image set to obtain a three-dimensional coordinate point of the target object under the view point corresponding to each view point image.
As an optional embodiment, the processor 601 performs convolution processing on position information of the target object in each view point image of the multi-view image set, so as to obtain a specific embodiment of a three-dimensional coordinate point of the target object under the view point corresponding to each view point image, where the specific embodiment is as follows:
encoding each view image to obtain a binary mask of each view image;
invoking a three-dimensional structure generation model to generate and process binary masks of all the viewpoint images to obtain three-dimensional coordinate points of the target object under the viewpoints corresponding to all the viewpoint images;
the three-dimensional structure generation model comprises a two-dimensional convolution layer, and the two-dimensional convolution layer is used for carrying out two-dimensional convolution operation on binary masks of each view image.
As an alternative embodiment, the processor 601 further performs the following operations by executing executable program code in the memory 603:
acquiring a spatial position of a reference viewpoint, wherein the reference viewpoint refers to any viewpoint except the viewpoint corresponding to each viewpoint image in the multi-viewpoint image set;
based on the space position of the reference viewpoint, projecting the three-dimensional model to obtain a projection image of the three-dimensional model of the target object under the reference viewpoint;
Performing pseudo rendering processing on the projection image of the three-dimensional model at the reference viewpoint to obtain a predicted image of the three-dimensional model at the reference viewpoint;
and optimizing parameters in the three-dimensional structure generating model according to the loss value between the predicted image under the reference viewpoint and the marked image under the reference viewpoint to obtain the optimized three-dimensional structure generating model.
As an alternative embodiment, the processor 601 performs pseudo rendering processing on a projection image of the three-dimensional model at the reference viewpoint, and specific embodiments of obtaining a predicted image of the three-dimensional model at the reference viewpoint are as follows:
performing U times up-sampling processing on a projection image of the three-dimensional model under a reference viewpoint to obtain an up-sampled image, wherein U is a positive integer;
and carrying out downsampling processing on the updated upsampled image, and reserving the pixel point with the minimum depth value in each pixel position of the downsampled image to obtain a predicted image of the three-dimensional model under the reference viewpoint.
As an alternative embodiment, the specific embodiment of optimizing the three-dimensional structure generating model by the processor 601 according to the loss value between the predicted image at the reference viewpoint and the labeling image at the reference viewpoint is as follows:
calculating a depth loss component according to the difference between the depth image corresponding to the predicted image under the reference viewpoint and the depth image corresponding to the labeling image under the reference viewpoint;
Calculating a mask loss component based on a difference between a mask image corresponding to the predicted image at the reference viewpoint and a mask image corresponding to the annotation image at the reference viewpoint;
obtaining a loss value between the predicted image under the reference viewpoint and the marked image under the reference viewpoint through the depth loss component and the mask loss component;
and optimizing parameters in the three-dimensional structure generating model according to the loss value so that the three-dimensional structure generating model meets the optimization condition.
As an optional embodiment, the processor 601 calibrates a three-dimensional coordinate point of the target object under each viewpoint according to a positional relationship between a viewpoint corresponding to each viewpoint image and a calibration point, so as to obtain a specific embodiment of the three-dimensional calibration coordinate point of the target object based on the calibration point, where the specific embodiment is as follows:
determining a calibration matrix corresponding to each viewpoint according to the spatial position of the viewpoint corresponding to each viewpoint image and the spatial position of the calibration point;
and calibrating the three-dimensional coordinate point under each viewpoint through the corresponding calibration matrix of each viewpoint to obtain the three-dimensional calibration coordinate point of the target object based on the calibration point.
As an alternative embodiment, the processor 601 further performs the following operations by executing executable program code in the memory 603:
Constructing a three-dimensional point cloud map, and acquiring the placing position information of a three-dimensional model of a target object in the three-dimensional point cloud map;
in response to the pose location information being confirmed, adding the three-dimensional model of the target object to a location in the three-dimensional point cloud map indicated by the pose location information.
As an alternative embodiment, the specific embodiment of the processor 601 constructing the three-dimensional point cloud map is:
acquiring position information of a reference object;
establishing a reference coordinate system based on the position information of the reference object, wherein the reference coordinate system is associated with projection information, and the projection information is used for indicating the projection relation between the reference coordinate system and the real world;
and constructing a three-dimensional point cloud map in a reference coordinate system through projection information.
As an alternative embodiment, the processor 601 further performs the following operations by executing executable program code in the memory 603:
responding to the viewing operation of the three-dimensional point cloud map, and acquiring the observation position of the three-dimensional point cloud map;
and carrying out plane projection on the three-dimensional point cloud map based on the observation position to obtain an observation effect image of the three-dimensional point cloud map under the observation position.
Based on the same inventive concept, the principle and beneficial effects of the image processing device for solving the problems provided in the embodiments of the present application are similar to those of the image processing method for solving the problems in the embodiments of the method of the present application, and may refer to the principle and beneficial effects of implementation of the method, which are not described herein for brevity.
The present application also provides a computer readable storage medium having one or more instructions stored therein, the one or more instructions adapted to be loaded by a processor and to perform the image processing method of the above method embodiment.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the image processing method of the method embodiment described above.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the method of image processing described above.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device of the embodiment of the application can be combined, divided and deleted according to actual needs.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program to instruct related hardware, the program may be stored in a computer readable storage medium, and the readable storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing disclosure is only a preferred embodiment of the present application and is not intended to limit the scope of the claims; those of ordinary skill in the art will understand that all or part of the processes for implementing the above embodiments, as well as equivalent changes made according to the claims of the present application, still fall within the scope of the claims.

Claims (14)

1. An image processing method, the method comprising:
acquiring a multi-view image set of a target object, wherein the multi-view image set comprises a plurality of view images; the plurality of viewpoint images are obtained by shooting the target object at a plurality of viewpoints respectively;
predicting a three-dimensional coordinate point of the target object under the view point corresponding to each view point image based on the position information of the target object in each view point image of the multi-view image set;
calibrating a three-dimensional coordinate point of the target object under each viewpoint according to the position relation between the viewpoint corresponding to each viewpoint image and the calibration point, so as to obtain a three-dimensional calibration coordinate point of the target object based on the calibration point;
and fusing the three-dimensional calibration coordinate points of the target object based on the calibration points to obtain a three-dimensional model of the target object.
2. The method of claim 1, wherein predicting the three-dimensional coordinate point of the target object at the viewpoint corresponding to each viewpoint image based on the position information of the target object in each viewpoint image of the multi-viewpoint image set comprises:
and carrying out convolution processing on the position information of the target object in each view point image of the multi-view image set to obtain a three-dimensional coordinate point of the target object under the view point corresponding to each view point image.
3. The method of claim 2, wherein the convolving the position information of the target object in each view point image of the multi-view image set to obtain a three-dimensional coordinate point of the target object at the view point corresponding to each view point image comprises:
encoding each view point image to obtain a binary mask of each view point image;
invoking a three-dimensional structure generation model to perform generation processing on the binary masks of the viewpoint images to obtain three-dimensional coordinate points of the target object under the viewpoints corresponding to the viewpoint images;
The three-dimensional structure generation model comprises a two-dimensional convolution layer, and the two-dimensional convolution layer is used for carrying out two-dimensional convolution operation on binary masks of the view images.
4. A method as claimed in claim 3, wherein the method further comprises:
acquiring a spatial position of a reference viewpoint, wherein the reference viewpoint refers to any viewpoint except for a viewpoint corresponding to each viewpoint image in the multi-viewpoint image set;
based on the spatial position of the reference viewpoint, projecting the three-dimensional model to obtain a projection image of the three-dimensional model of the target object under the reference viewpoint;
performing pseudo rendering processing on the projection image of the three-dimensional model under the reference viewpoint to obtain a prediction image of the three-dimensional model under the reference viewpoint;
and optimizing parameters in the three-dimensional structure generating model according to the loss value between the predicted image under the reference viewpoint and the marked image under the reference viewpoint to obtain an optimized three-dimensional structure generating model.
5. The method of claim 4, wherein pseudo-rendering the projected image of the three-dimensional model at the reference viewpoint to obtain the predicted image of the three-dimensional model at the reference viewpoint comprises:
Performing U times up-sampling processing on the projection image of the three-dimensional model under the reference viewpoint to obtain an up-sampled image, wherein U is a positive integer;
and carrying out downsampling processing on the updated upsampled image, and reserving a pixel point with the minimum depth value in each pixel position of the downsampled image to obtain a predicted image of the three-dimensional model under the reference viewpoint.
6. The method of claim 4, wherein optimizing the three-dimensional structure generation model based on a loss value between the predicted image at the reference viewpoint and the annotated image at the reference viewpoint comprises:
calculating a depth loss component according to the difference between the depth image corresponding to the predicted image under the reference viewpoint and the depth image corresponding to the labeling image under the reference viewpoint;
calculating a mask loss component based on a difference between a mask image corresponding to the predicted image at the reference viewpoint and a mask image corresponding to the annotation image at the reference viewpoint;
obtaining a loss value between the predicted image under the reference viewpoint and the marked image under the reference viewpoint through the depth loss component and the mask loss component;
And optimizing parameters in the three-dimensional structure generating model according to the loss value so that the three-dimensional structure generating model meets optimization conditions.
7. The method of claim 1, wherein calibrating the three-dimensional coordinate point of the target object under each viewpoint according to the positional relationship between the viewpoint corresponding to each viewpoint image and the calibration point to obtain the three-dimensional calibration coordinate point of the target object based on the calibration point comprises:
determining a calibration matrix corresponding to each viewpoint according to the spatial position of the viewpoint corresponding to each viewpoint image and the spatial position of the calibration point;
and calibrating the three-dimensional coordinate point under each viewpoint through the calibration matrix corresponding to each viewpoint to obtain the three-dimensional calibration coordinate point of the target object based on the calibration point.
8. The method of claim 1, wherein the method further comprises:
constructing a three-dimensional point cloud map, and acquiring the placing position information of a three-dimensional model of the target object in the three-dimensional point cloud map;
and in response to the placement location information being confirmed, adding the three-dimensional model of the target object to a location in the three-dimensional point cloud map indicated by the placement location information.
9. The method of claim 8, wherein the constructing a three-dimensional point cloud map comprises:
acquiring position information of a reference object;
establishing a reference coordinate system based on the position information of the reference object, wherein projection information is associated with the reference coordinate system and used for indicating the projection relation between the reference coordinate system and the real world;
and constructing a three-dimensional point cloud map in the reference coordinate system through the projection information.
10. The method of claim 8, wherein the method further comprises:
responding to the viewing operation of the three-dimensional point cloud map, and acquiring the observation position of the three-dimensional point cloud map;
and carrying out plane projection on the three-dimensional point cloud map based on the observation position to obtain an observation effect image of the three-dimensional point cloud map under the observation position.
11. An image processing apparatus, comprising:
an acquisition unit, configured to acquire a multi-view image set of a target object, where the multi-view image set includes a plurality of view images; the plurality of viewpoint images are obtained by shooting the target object at a plurality of viewpoints respectively;
a processing unit, configured to predict a three-dimensional coordinate point of the target object under a viewpoint corresponding to each viewpoint image based on position information of the target object in each viewpoint image of the multi-viewpoint image set;
and configured to calibrate the three-dimensional coordinate point of the target object under each viewpoint according to the positional relationship between the viewpoint corresponding to each viewpoint image and the calibration point, to obtain a three-dimensional calibration coordinate point of the target object based on the calibration point;
and configured to fuse the three-dimensional calibration coordinate points of the target object based on the calibration point to obtain a three-dimensional model of the target object.
12. An image processing apparatus, characterized by comprising: a memory device and a processor;
the memory device stores a computer program;
the processor executes the computer program to implement the image processing method according to any one of claims 1-10.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the image processing method according to any one of claims 1-10.
14. A computer program product or computer program, characterized in that the computer program product or the computer program comprises computer instructions stored in a computer readable storage medium;
A processor of a computer device reads the computer instructions from the computer readable storage medium, the processor executing the computer instructions, causing the computer device to perform the image processing method of any one of claims 1-10.
CN202111238437.5A 2021-10-22 2021-10-22 Image processing method, device, equipment and computer readable storage medium Pending CN116012547A (en)

Priority Applications (1)

Application Number: CN202111238437.5A — Publication: CN116012547A (en) — Title: Image processing method, device, equipment and computer readable storage medium

Publications (1)

Publication Number: CN116012547A — Publication Date: 2023-04-25

Family

ID=86022013

Family Applications (1)

Application Number: CN202111238437.5A (Pending) — Publication: CN116012547A (en) — Title: Image processing method, device, equipment and computer readable storage medium

Country Status (1)

Country: CN — Document: CN116012547A (en)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
REG: Reference to a national code — ref country code: HK; ref legal event code: DE; ref document number: 40084578; country of ref document: HK