CN112802202A - Image processing method, image processing device, electronic equipment and computer storage medium


Info

Publication number
CN112802202A
CN112802202A (application CN201911115151.0A)
Authority
CN
China
Prior art keywords: image, deformed, dimensional, sub, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911115151.0A
Other languages
Chinese (zh)
Inventor
李炜明
考月英
刘洋
汪昊
洪性勳
金祐湜
张超
马林
王强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Samsung Telecom R&D Center
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Original Assignee
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Samsung Telecommunications Technology Research Co Ltd and Samsung Electronics Co Ltd
Priority to CN201911115151.0A
Priority to KR1020200108091A
Priority to US17/095,784
Publication of CN112802202A
Priority to US18/126,042
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20Finite element generation, e.g. wire-frame surface description, tesselation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/149Segmentation; Edge detection involving deformable models, e.g. active contour models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows

Abstract

The embodiments of the present invention provide an image processing method, an image processing apparatus, an electronic device and a computer storage medium. The method includes: acquiring an image to be processed, where the image to be processed includes a depth image of a scene; determining, based on the depth image, three-dimensional point cloud data corresponding to the depth image; and obtaining a proposal result of an object in the scene based on the three-dimensional point cloud data. Because the three-dimensional point cloud data represents a set of discrete three-dimensional points, and the data volume of this point set is smaller than that of the corresponding three-dimensional voxels, determining the proposal result of the object in the scene based on the three-dimensional point cloud data saves storage space, reduces the amount of computation and improves the operating efficiency of the algorithm.

Description

Image processing method, image processing device, electronic equipment and computer storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image processing method and apparatus, an electronic device, and a computer storage medium.
Background
Conventionally, to determine the proposal result of an object in an image, that is, the region of the image that contains the object, three-dimensional voxels derived from a depth image are used as features and the proposal result of the object is obtained from these voxels. The prior-art scheme for determining the proposal result has the following drawback: obtaining an object proposal based on three-dimensional voxels consumes a large amount of storage space and computational resources, making the algorithm inefficient.
Disclosure of Invention
The embodiments of the present invention mainly aim to provide an image processing method, an image processing apparatus, an electronic device, and a computer storage medium.
In a first aspect, an embodiment of the present invention provides an image processing method, where the method includes acquiring an image to be processed, where the image to be processed includes a depth image of a scene;
determining three-dimensional point cloud data corresponding to the depth image based on the depth image;
and obtaining a proposal result of the object in the scene based on the three-dimensional point cloud data.
In an optional embodiment of the first aspect, obtaining a proposed result of an object in a scene based on three-dimensional point cloud data includes:
determining a matrix corresponding to the three-dimensional point cloud data based on the three-dimensional point cloud data;
determining a first feature map based on the matrix;
and obtaining a proposal result of the object in the scene based on the first feature map.
In an optional embodiment of the first aspect, determining a matrix corresponding to the three-dimensional point cloud data based on the three-dimensional point cloud data includes:
determining point cloud data belonging to an object in the three-dimensional point cloud data;
and determining a matrix corresponding to the three-dimensional point cloud data based on the point cloud data belonging to the object in the three-dimensional point cloud data.
In an optional embodiment of the first aspect, the image to be processed further includes a color image of the scene, and the method further includes:
performing feature extraction on the color image to obtain a second feature map;
obtaining a proposal result of an object in the scene based on the first feature map, comprising:
and obtaining a proposal result of the object in the scene based on the first feature map and the second feature map.
In an optional embodiment of the first aspect, obtaining a proposed result of an object in the scene based on the first feature map and the second feature map includes:
fusing the first feature map and the second feature map to obtain a third feature map corresponding to the image to be processed;
and obtaining a proposal result of the object in the scene based on the third feature map.
In an optional embodiment of the first aspect, obtaining a proposed result of an object in the scene based on the third feature map includes:
segmenting an image to be processed to obtain at least two sub-images;
determining a proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature map corresponding to the adjacent sub-image of each sub-image;
and fusing the proposal results corresponding to the sub-images to obtain the proposal result of the object in the scene.
In an optional embodiment of the first aspect, determining a proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature maps corresponding to adjacent sub-images of each sub-image includes:
determining a weight for each sub-image;
and determining a proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature map corresponding to the adjacent sub-image of each sub-image and the weight corresponding to each sub-image.
In an optional embodiment of the first aspect, determining the weight for each sub-image comprises any one of:
determining the weight of each sub-image based on the sub-feature map corresponding to each sub-image;
and determining candidate points of the image to be processed, and determining the weight corresponding to each sub-image based on the candidate points corresponding to each sub-image or the sub-feature maps corresponding to the candidate points corresponding to each sub-image.
In an optional embodiment of the first aspect, determining a weight corresponding to each sub-image based on candidate points corresponding to the respective sub-images includes:
for the candidate point corresponding to each sub-image, determining the similarity relation between the candidate point and the candidate points of the adjacent sub-images; determining the weight corresponding to each sub-image based on the similarity relation between each candidate point and the candidate points of the adjacent sub-images;
determining the weight of each sub-image based on the sub-feature map corresponding to each sub-image, wherein the weight comprises any one of the following:
for each sub-image, determining a first feature vector corresponding to the center position of the sub-image and a second feature vector corresponding to the sub-feature map corresponding to the sub-image; determining the weight of each sub-image based on the first feature vector and the second feature vector corresponding to each sub-image;
for the sub-feature map corresponding to each sub-image, the sub-feature map corresponds to at least one probability value, and each probability value represents the probability that the sub-feature map belongs to the corresponding object; the highest probability value of the at least one probability value is used as the weight of the sub-image.
In an optional embodiment of the first aspect, the method further comprises:
and determining a three-dimensional detection result of the object in the image to be processed based on the proposal result, wherein the three-dimensional detection result comprises at least one of a three-dimensional posture result and a three-dimensional segmentation result.
In an optional embodiment of the first aspect, the three-dimensional detection result comprises a three-dimensional pose result and a three-dimensional segmentation result;
based on the proposal result, determining the three-dimensional detection result of the object in the image to be processed comprises:
extracting three-dimensional point cloud features and two-dimensional image features corresponding to the proposal result;
concatenating the three-dimensional point cloud features and the two-dimensional image features to obtain a fourth feature map;
and determining a three-dimensional detection result of the object in the image to be processed based on the fourth feature map.
In an optional embodiment of the first aspect, determining a result of three-dimensional detection of an object in the image to be processed based on the proposal result comprises:
determining an initial three-dimensional detection result of an object in the image to be processed based on the proposal result;
determining an original image corresponding to an object in an image to be processed;
determining difference information corresponding to the initial three-dimensional detection result of each object based on the initial three-dimensional detection result of each object and the corresponding original image;
and updating the initial three-dimensional detection result of the corresponding object based on the difference information corresponding to the initial three-dimensional detection result of each object to obtain the three-dimensional detection result of each object in the image to be processed.
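To illustrate only the update step of this refinement (the network that predicts the difference information, its feature inputs and the pose parameterisation are all assumptions made for the example, not the disclosed model), a minimal sketch could look like:

```python
import torch
import torch.nn as nn

class DetectionRefiner(nn.Module):
    """Predicts difference information from an initial three-dimensional detection result
    and the object's corresponding original image, and updates the initial result with it."""
    def __init__(self, feat_dim=128, pose_dim=7):   # pose_dim: e.g. translation + quaternion (assumed)
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim * 2, 256), nn.ReLU(),
                                  nn.Linear(256, pose_dim))

    def forward(self, initial_pose, initial_feat, original_feat):
        # Difference information predicted from features of the initial result
        # and features of the corresponding original image.
        delta = self.head(torch.cat([initial_feat, original_feat], dim=-1))
        return initial_pose + delta                 # updated three-dimensional detection result
```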
In a second aspect, the present invention provides an image processing method, comprising:
acquiring deformation information of a virtual object to a real object in an image to be processed;
and deforming the real object based on the deformation information to obtain a deformed image to be processed.
In an optional embodiment of the second aspect, deforming the real object based on the deformation information to obtain a deformed image to be processed includes:
determining an original image corresponding to a real object;
determining a transformation relation between a deformed image corresponding to the real object and an image before deformation based on a three-dimensional posture result corresponding to the real object, the deformation information and an original image corresponding to the real object, wherein the image before deformation is an image corresponding to the real object in the image to be processed;
determining a deformed image corresponding to the real object based on the transformation relation and the image corresponding to the real object;
and determining the deformed image to be processed based on the deformed image corresponding to the real object.
In an optional embodiment of the second aspect, determining a transformation relationship between a deformed image and a pre-deformed image corresponding to the object to be deformed based on the three-dimensional pose result corresponding to the object to be deformed, the deformation information, and the original image corresponding to the object to be deformed includes:
determining deformed deformation points corresponding to the object to be deformed in the original image based on the original image, the deformation information and the corresponding relation of the object to be deformed, wherein the corresponding relation is established based on the corresponding deformation points of the object in the sample image before and after deformation under different deformation information;
and determining a transformation relation between the deformed image corresponding to the object to be deformed and the image before deformation based on the deformed deformation point corresponding to the object to be deformed, the deformed point of the object to be deformed before deformation and the three-dimensional posture result corresponding to the object to be deformed.
In an optional embodiment of the second aspect, determining a transformation relationship between a deformed image corresponding to the object to be deformed and a pre-deformed image based on the deformed deformation point corresponding to the object to be deformed, the deformation point before the object to be deformed is deformed, and the three-dimensional posture result corresponding to the object to be deformed includes:
determining the weight of each deformation point in the deformation points corresponding to the object to be deformed;
and determining a transformation relation between the deformed image corresponding to the object to be deformed and the image before deformation based on the weight of each deformation point, the deformed deformation point corresponding to the object to be deformed, the deformation point before deformation of the object to be deformed and the three-dimensional posture result corresponding to the object to be deformed.
In an optional embodiment of the second aspect, the determining the deformed image to be processed based on the deformed image corresponding to the object to be deformed includes at least one of:
replacing the image before deformation in the image to be processed with the deformed image corresponding to the object to be deformed to obtain the deformed image to be processed;
determining a difference image based on the deformed image corresponding to the object to be deformed and the image before deformation corresponding to the object to be deformed, and determining the deformed image to be processed based on the difference image.
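As one hedged illustration of estimating such a transformation relation, the sketch below fits a weighted least-squares 2D affine transform between the deformation points before and after deformation; the choice of an affine model, the 2D point representation and the example values are assumptions made for the example, not the specific transformation of the embodiment.

```python
import numpy as np

def weighted_affine_from_points(src, dst, weights):
    """Estimate a 2D affine transform A (2 x 3) mapping pre-deformation points `src`
    to post-deformation points `dst` (both (N, 2)), using per-point weights.

    Solves the weighted least-squares problem  min sum_i w_i * ||A [x_i; 1] - y_i||^2.
    """
    n = src.shape[0]
    src_h = np.hstack([src, np.ones((n, 1))])                  # homogeneous source points, (N, 3)
    w = np.sqrt(weights).reshape(-1, 1)
    A, *_ = np.linalg.lstsq(w * src_h, w * dst, rcond=None)    # (3, 2) solution
    return A.T                                                 # (2, 3) affine matrix

# Usage with hypothetical deformation points and weights:
src_pts = np.array([[10., 20.], [40., 22.], [35., 60.], [12., 58.]])       # before deformation
dst_pts = src_pts + np.array([[0., 5.], [0., 6.], [0., -2.], [0., -1.]])   # after deformation
transform = weighted_affine_from_points(src_pts, dst_pts, np.array([1., 1., 0.5, 0.5]))
```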
In a third aspect, the present invention provides an image processing apparatus comprising:
the image acquisition module is used for acquiring an image to be processed, and the image to be processed comprises a depth image of a scene;
the three-dimensional point cloud data determining module is used for determining three-dimensional point cloud data corresponding to the depth image based on the depth image;
and the proposal result determining module is used for obtaining the proposal result of the object in the scene based on the three-dimensional point cloud data.
In an optional embodiment of the third aspect, when obtaining a proposed result of an object in a scene based on three-dimensional point cloud data, the proposed result determining module is specifically configured to:
determining a matrix corresponding to the three-dimensional point cloud data based on the three-dimensional point cloud data;
determining a first feature map based on the matrix;
and obtaining a proposal result of the object in the scene based on the first feature map.
In an optional embodiment of the third aspect, when the proposed result determining module determines the matrix corresponding to the three-dimensional point cloud data based on the three-dimensional point cloud data, the proposed result determining module is specifically configured to:
determining point cloud data belonging to an object in the three-dimensional point cloud data;
and determining a matrix corresponding to the three-dimensional point cloud data based on the point cloud data belonging to the object in the three-dimensional point cloud data.
In an optional embodiment of the third aspect, the image to be processed further includes a color image of the scene, and the apparatus further includes:
the feature extraction module is used for performing feature extraction on the color image to obtain a second feature map;
the proposal result determining module is specifically configured to, when obtaining a proposal result of an object in a scene based on the first feature map:
and obtaining a proposal result of the object in the scene based on the first feature map and the second feature map.
In an optional embodiment of the third aspect, when obtaining a proposed result of an object in a scene based on the first feature map and the second feature map, the proposed result determining module is specifically configured to:
fusing the first feature map and the second feature map to obtain a third feature map corresponding to the image to be processed;
and obtaining a proposal result of the object in the scene based on the third feature map.
In an optional embodiment of the third aspect, when obtaining a proposed result of an object in a scene based on the third feature map, the proposed result determining module is specifically configured to:
segmenting an image to be processed to obtain at least two sub-images;
determining a proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature map corresponding to the adjacent sub-image of each sub-image;
and fusing the proposal results corresponding to the sub-images to obtain the proposal result of the object in the scene.
In an optional embodiment of the third aspect, when determining the proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature maps corresponding to adjacent sub-images of each sub-image, the proposal result determining module is specifically configured to:
determining a weight for each sub-image;
and determining a proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature map corresponding to the adjacent sub-image of each sub-image and the weight corresponding to each sub-image.
In an optional embodiment of the third aspect, the proposal result determination module, when determining the weight of each sub-image, determines the weight of each sub-image by any one of:
determining the weight of each sub-image based on the sub-feature map corresponding to each sub-image;
and determining candidate points of the image to be processed, and determining the weight corresponding to each sub-image based on the candidate points corresponding to each sub-image or the sub-feature maps corresponding to the candidate points corresponding to each sub-image.
In an optional embodiment of the third aspect, when the proposed result determining module determines the weight corresponding to each sub-image based on the candidate point corresponding to each sub-image, the proposed result determining module is specifically configured to:
for the candidate point corresponding to each sub-image, determining the similarity relation between the candidate point and the candidate points of the adjacent sub-images; determining the weight corresponding to each sub-image based on the similarity relation between each candidate point and the candidate points of the adjacent sub-images;
the proposal result determining module determines the weight of each sub-image based on the sub-feature map corresponding to each sub-image by any one of the following methods:
for each sub-image, determining a first feature vector corresponding to the center position of the sub-image and a second feature vector corresponding to the sub-feature map corresponding to the sub-image; determining the weight of each sub-image based on the first feature vector and the second feature vector corresponding to each sub-image;
for the sub-feature map corresponding to each sub-image, the sub-feature map corresponds to at least one probability value, and each probability value represents the probability that the sub-feature map belongs to the corresponding object; the highest probability value of the at least one probability value is used as the weight of the sub-image.
In an optional embodiment of the third aspect, the apparatus further comprises:
and the three-dimensional detection result determining module is used for determining a three-dimensional detection result of the object in the image to be processed based on the proposal result, wherein the three-dimensional detection result comprises at least one of a three-dimensional posture result and a three-dimensional segmentation result.
In an optional embodiment of the third aspect, the three-dimensional detection result comprises a three-dimensional pose result and a three-dimensional segmentation result;
the three-dimensional detection result determining module is specifically configured to, when determining the three-dimensional detection result of the object in the image to be processed based on the proposal result:
extracting three-dimensional point cloud features and two-dimensional image features corresponding to the proposal result;
concatenating the three-dimensional point cloud features and the two-dimensional image features to obtain a fourth feature map;
and determining a three-dimensional detection result of the object in the image to be processed based on the fourth feature map.
In an optional embodiment of the third aspect, when determining the three-dimensional detection result of the object in the image to be processed based on the proposed result, the three-dimensional detection result determining module is specifically configured to:
determining an initial three-dimensional detection result of an object in the image to be processed based on the proposal result;
determining an original image corresponding to an object in an image to be processed;
determining difference information corresponding to the initial three-dimensional detection result of each object based on the initial three-dimensional detection result of each object and the corresponding original image;
and updating the initial three-dimensional detection result of the corresponding object based on the difference information corresponding to the initial three-dimensional detection result of each object to obtain the three-dimensional detection result of each object in the image to be processed.
In a fourth aspect, the present invention provides an image processing apparatus comprising:
the deformation information acquisition module is used for acquiring the deformation information of the virtual object to the real object in the image to be processed;
and the image deformation module is used for deforming the real object based on the deformation information to obtain a deformed image to be processed.
In an optional embodiment of the fourth aspect, the image deformation module is specifically configured to, when deforming the real object based on the deformation information to obtain the deformed to-be-processed image:
determining an original image corresponding to a real object;
determining a transformation relation between a deformed image corresponding to the real object and an image before deformation based on a three-dimensional posture result corresponding to the real object, the deformation information and an original image corresponding to the real object, wherein the image before deformation is an image corresponding to the real object in the image to be processed;
determining a deformed image corresponding to the real object based on the transformation relation and the image corresponding to the real object;
and determining the deformed image to be processed based on the deformed image corresponding to the real object.
In an optional embodiment of the fourth aspect, when determining, based on the three-dimensional pose result corresponding to the object to be deformed, the deformation information, and the original image corresponding to the object to be deformed, a transformation relationship between the deformed image corresponding to the object to be deformed and the image before deformation, the image deformation module is specifically configured to:
determining deformed deformation points corresponding to the object to be deformed in the original image based on the original image, the deformation information and the corresponding relation of the object to be deformed, wherein the corresponding relation is established based on the corresponding deformation points of the object in the sample image before and after deformation under different deformation information;
and determining a transformation relation between the deformed image corresponding to the object to be deformed and the image before deformation based on the deformed deformation point corresponding to the object to be deformed, the deformed point of the object to be deformed before deformation and the three-dimensional posture result corresponding to the object to be deformed.
In an optional embodiment of the fourth aspect, when the image deformation module determines a transformation relationship between the deformed image corresponding to the object to be deformed and the pre-deformed image based on the deformed deformation point corresponding to the object to be deformed, the deformation point before the object to be deformed is deformed, and the three-dimensional posture result corresponding to the object to be deformed, the image deformation module is specifically configured to:
determining the weight of each deformation point in the deformation points corresponding to the object to be deformed;
and determining a transformation relation between the deformed image corresponding to the object to be deformed and the image before deformation based on the weight of each deformation point, the deformed deformation point corresponding to the object to be deformed, the deformation point before deformation of the object to be deformed and the three-dimensional posture result corresponding to the object to be deformed.
In an optional embodiment of the fourth aspect, when determining the deformed to-be-processed image based on the deformed image corresponding to the to-be-deformed object, the image deformation module determines by at least one of:
replacing the image before deformation in the image to be processed with the deformed image corresponding to the object to be deformed to obtain the deformed image to be processed;
determining a difference image based on the deformed image corresponding to the object to be deformed and the image before deformation corresponding to the object to be deformed, and determining the deformed image to be processed based on the difference image.
In a fifth aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes a processor and a memory; the memory has stored therein readable instructions which, when loaded and executed by the processor, implement the method as shown in any one of the optional embodiments of the first or second aspect described above.
In a sixth aspect, the present invention provides a computer-readable storage medium, in which readable instructions are stored, and when the readable instructions are loaded and executed by a processor, the method as shown in any optional embodiment of the first aspect or the second aspect is implemented.
The technical solutions provided by the embodiments of the present invention have the following beneficial effects: according to the image processing method, image processing apparatus, electronic device and computer storage medium, after the image to be processed is acquired, the three-dimensional point cloud data corresponding to the depth image can be determined based on the depth image of the scene in the image to be processed, and a proposal result of the object in the scene is then obtained based on the three-dimensional point cloud data. Because the three-dimensional point cloud data represents a set of discrete three-dimensional points, and the data volume of this point set is smaller than that of the corresponding three-dimensional voxels, determining the proposal result of the object in the scene based on the three-dimensional point cloud data saves storage space, reduces the amount of computation and improves the operating efficiency of the algorithm.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly described below.
Fig. 1 is a schematic flow chart illustrating an image processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a method for determining a proposal result of an object based on sub-images according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating a method for inferring weights of a single grid based on information of the grid provided in an embodiment of the present invention;
FIG. 4a is a schematic diagram illustrating a distribution of five adjacent grids as provided in an embodiment of the invention;
FIG. 4b is a diagram illustrating a dependency relationship between adjacent grids provided in an embodiment of the present invention;
FIG. 4c is a schematic diagram illustrating a further dependency relationship between adjacent grids provided in embodiments of the present invention;
FIG. 5 is a flow chart illustrating a method for inferring weights of grids according to relationships between adjacent grids provided in an embodiment of the present invention;
FIG. 6 is a flow chart illustrating a method for determining a proposal result of an object based on a color image and a depth image according to an embodiment of the present invention;
FIG. 7 is a flow chart illustrating a method for determining a proposal result of an object based on a color image and a depth image according to an embodiment of the present invention;
FIG. 8 is a flow chart illustrating a method of shape completion provided in an embodiment of the present invention;
FIG. 9 is a flow chart illustrating a method of completing a shape provided in an embodiment of the present invention;
FIG. 10 is a flow chart illustrating a method for training a model based on a spatial loss function according to an embodiment of the present invention;
FIG. 11 is a schematic diagram illustrating a spatial position relationship between three-dimensional bounding boxes of two adjacent three-dimensional objects according to an embodiment of the present invention;
fig. 12 is a schematic diagram illustrating a spatial position relationship of three-dimensional bounding boxes of two adjacent three-dimensional objects provided in the embodiment of the present invention;
fig. 13 is a schematic flowchart illustrating a method for refining a three-dimensional detection result according to an embodiment of the present invention;
fig. 14 is a schematic flow chart illustrating a further method for refining a three-dimensional detection result according to an embodiment of the present invention;
fig. 15 is a schematic flow chart illustrating a method for determining a three-dimensional detection result of an object based on a color image and a depth image according to an embodiment of the present invention;
FIG. 16 is a flow chart illustrating a further image processing method according to an embodiment of the present invention;
fig. 17 is a flowchart illustrating a method for a virtual object to deform an object to be deformed in an image to be processed according to an embodiment of the present invention;
fig. 18 is a flowchart illustrating a method for deforming an object to be deformed in an image to be processed by using a virtual object according to another embodiment of the present invention;
fig. 19a is a schematic diagram illustrating an effect of a virtual object deforming a sofa in an image to be processed according to an embodiment of the present invention;
FIG. 19b is a schematic diagram illustrating an effect of a virtual object deforming a sofa in an image to be processed provided in an embodiment of the present invention;
fig. 20 is a schematic structural diagram showing an image processing apparatus provided in an embodiment of the present invention;
fig. 21 is a schematic structural diagram showing still another image processing apparatus provided in the embodiment of the present invention;
fig. 22 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
For better understanding and description of the embodiments of the present invention, some technical terms used in the embodiments of the present invention will be briefly described below.
Voxel: a voxel (short for volume element) is the smallest unit of digital data in a segmentation of three-dimensional space, analogous to the pixel, the smallest unit of two-dimensional space.
Three-dimensional geometric features: a three-dimensional geometric feature is a geometric representation of a three-dimensional element. The element may be a point cloud or a mesh, or a point in a point cloud, a vertex in a mesh, or a surface.
Three-dimensional point cloud data: a set of points consisting of a plurality of discrete three-dimensional points; the three-dimensional point cloud data may include three-dimensional geometric features of an object.
Depth image: an image or image channel containing information about the distance between the surfaces of objects in a scene and the viewpoint. The gray value of each pixel of the depth image can represent the distance between a point in the scene and the camera.
Feature map: a feature map is obtained by convolving an image with a filter, and a feature map can in turn be convolved with a filter to generate a new feature map.
Neural network (NN): a mathematical model that simulates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. Depending on the complexity of the system, the network processes information by adjusting the interconnections among a large number of nodes within the network.
MLP (Multilayer Perceptron): also called an artificial neural network (ANN); in addition to the input and output layers, it may contain multiple hidden layers.
CAD (Computer Aided Design): an interactive drawing system that helps designers carry out design work using a computer and its graphics devices.
In the prior art, determining the proposal result of an object in an image, that is, the region of the image containing the object, can generally be achieved in the following ways:
The first way: a proposal result of the object is obtained based on two-dimensional image features of the image: a bounding box of the object is determined on the color image based on an object detection result of the color image, and a viewing frustum is cropped from the depth point cloud data; 3D object segmentation, 3D bounding box estimation and pose estimation are then performed on the cropped point cloud.
The second way: an image region and a 2D bounding box of the object are extracted from the color image using a trained model, where the model is trained on the image regions and 2D bounding boxes of objects in sample images and is used to determine the image region and 2D bounding box of a two-dimensional image; three-dimensional voxels corresponding to the object are then obtained based on features of the color image and features of the depth image, and a pose estimation result of the object is obtained based on the three-dimensional voxels.
The third way: the object pose is estimated from a single image based on appearance image features and structural information of the object.
The fourth way: a three-dimensional model of the object is aligned with the object in the image; a matching three-dimensional model can be retrieved based on the shape style of the object in the image, and the viewing angle of the three-dimensional model relative to the camera is estimated.
A proposal result of the object, which can include the image region, the 2D bounding box and the pose of the object, can be obtained through the above solutions, but the above solutions have the following technical problems:
The first way is only suitable for object proposals on color images and ignores the three-dimensional features of the object, so the proposal result is inaccurate.
The second way is only suitable for object proposals on color images and not for object proposals on depth images, and the scheme of obtaining the object proposal based on three-dimensional voxels consumes a large amount of storage space and computing resources, making the algorithm inefficient.
The third way is only suitable for object proposals on color images and is not suitable for object proposals on depth images.
The fourth way determines the object proposal based on structural features of the object; since the structural features of the object cannot reflect its detailed features, the obtained object proposal is inaccurate.
To address these technical problems, in the present invention the three-dimensional point cloud data corresponding to the depth image in the image to be processed can be determined after the image to be processed is acquired, and a proposal result of the object in the scene is then obtained based on the three-dimensional point cloud data. Because the three-dimensional point cloud data represents a set of discrete three-dimensional points whose data volume is smaller than that of the corresponding three-dimensional voxels, determining the proposal result of the object in the scene based on the three-dimensional point cloud data saves storage space, reduces the amount of computation and improves the operating efficiency of the algorithm. At the same time, the three-dimensional point cloud data can describe the three-dimensional structural features of the object, so a proposal result determined based on the three-dimensional point cloud data is more accurate. In addition, when features of the three-dimensional point cloud data are extracted, an MLP encoder is used for feature extraction and the three-dimensional point cloud data can be converted into a matrix, which further reduces the amount of data to be processed and improves the operating efficiency of the algorithm.
The following describes the technical solution of the present invention and how to solve the above technical problems with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 1 shows a flow chart of an image processing method provided by the present invention, and as shown in the figure, the method may include steps S110 to S130, where:
step S110: and acquiring an image to be processed, wherein the image to be processed comprises a depth image of a scene.
The image to be processed refers to an image for which the proposal result of an object needs to be determined. The image to be processed may be a depth image captured by a terminal device having a depth image capture function, or a depth image obtained by processing a color image. Objects included in the scene include, but are not limited to, people, animals and the like, and one scene may include one or more objects at the same time.
Step S120: and determining three-dimensional point cloud data corresponding to the depth image based on the depth image.
Specifically, based on the depth image, one implementation of determining the three-dimensional point cloud data corresponding to the depth image is as follows: the two-dimensional image coordinates and depth information of the depth image are converted from the image coordinate system to the world coordinate system. The three-dimensional point cloud data can describe the three-dimensional structural features, that is, the three-dimensional geometric features, of an object in three-dimensional space, and each three-dimensional point obtained by back-projecting the depth image into three-dimensional space corresponds to a pixel in the original depth image.
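As an illustration only, a minimal sketch of such a back-projection is given below, assuming a pinhole camera model with known intrinsic parameters fx, fy, cx and cy; these parameter names and the NumPy implementation are assumptions made for the example and are not part of the disclosed embodiment. The sketch produces points in the camera coordinate system; a further camera-to-world transform can be applied if the world coordinate system differs from the camera coordinate system.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H x W, in meters) into an N x 3 point set.

    Each valid depth pixel (u, v) maps to one three-dimensional point, so the
    points correspond one-to-one with the pixels of the original depth image.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                            # pinhole camera model
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop invalid (zero-depth) pixels
```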
Step S130: and obtaining a proposal result of the object in the scene based on the three-dimensional point cloud data.
The proposal result of an object represents the region of the image to be processed that includes the object; if the scene includes a plurality of objects, the proposal result represents the object region corresponding to each object in the image to be processed. The proposal result can be an image with object region markers; an object region marker may be a marking box, and the region enclosed by the marking box is the object region.
According to the solution of the present invention, after the image to be processed is acquired, the three-dimensional point cloud data corresponding to the depth image is determined based on the depth image of the scene in the image to be processed, and a proposal result of the object in the scene is then obtained based on the three-dimensional point cloud data. Because the three-dimensional point cloud data represents a set of discrete three-dimensional points, and the data volume of this point set is smaller than that of the corresponding three-dimensional voxels, determining the proposal result of the object in the scene based on the three-dimensional point cloud data saves storage space, reduces the amount of computation and improves the operating efficiency of the algorithm.
In an alternative embodiment of the present invention, in step S130, obtaining a proposed result of an object in a scene based on three-dimensional point cloud data may include:
determining a matrix corresponding to the three-dimensional point cloud data based on the three-dimensional point cloud data;
determining a first feature map based on the matrix;
and obtaining a proposal result of the object in the scene based on the first feature map.
Specifically, when feature extraction is performed on the three-dimensional point cloud data, the three-dimensional point cloud data can be converted into a matrix to reduce the amount of data to be processed. An MLP encoder can be used to extract the features of the three-dimensional point cloud data; when extracting features from the data, the MLP encoder first converts the data into a matrix and then performs subsequent processing on the matrix to obtain the feature map corresponding to the data. For example, if the three-dimensional point cloud data includes N points, the matrix corresponding to the three-dimensional point cloud data is represented as an N x 3 matrix.
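For illustration, the following is a minimal sketch of an MLP encoder operating on such an N x 3 matrix, written here as a PointNet-style shared MLP in PyTorch; the layer sizes, the max-pooling step and the shared-MLP design are assumptions made for the example rather than the specific encoder of the embodiment.

```python
import torch
import torch.nn as nn

class PointMLPEncoder(nn.Module):
    """Maps an (N, 3) point matrix to per-point features and a global feature."""
    def __init__(self, out_dim=256):
        super().__init__()
        # The same MLP is applied to every point (shared weights).
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, points):                      # points: (N, 3) matrix
        per_point = self.mlp(points)                # (N, out_dim) per-point features
        global_feat = per_point.max(dim=0).values   # (out_dim,) order-invariant pooling
        return per_point, global_feat

# Usage: encode the back-projected point cloud of a depth image.
pts = torch.randn(1024, 3)                          # stand-in for real point cloud data
per_point_feat, global_feat = PointMLPEncoder()(pts)
```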
In the scheme of the invention, an MLP encoder is used to extract features, and the MLP encoder can be obtained by training in the following way: obtaining sample images, where each sample image includes a depth image of a scene and is annotated with a labeling result for each object, the labeling result representing the three-dimensional detection result of each object in the sample image; and training an initial network model based on the depth image of each sample image until the loss function of the initial network model converges, and taking the trained model as the MLP encoder, where the value of the loss function represents the degree of difference between the prediction result and the labeling result of each sample image.
The three-dimensional detection result may include a three-dimensional object frame, three-dimensional key points, a three-dimensional object segmentation result and the like, and the corresponding prediction result corresponds to the three-dimensional detection result. It is understood that the above three-dimensional detection results can be used in combination during training, and whether the features extracted by the trained MLP encoder are accurate can be determined according to the three-dimensional detection results.
In an alternative of the present invention, determining a matrix corresponding to three-dimensional point cloud data based on the three-dimensional point cloud data may include:
determining point cloud data belonging to an object in the three-dimensional point cloud data;
and determining a matrix corresponding to the three-dimensional point cloud data based on the point cloud data belonging to the object in the three-dimensional point cloud data.
Before feature extraction is performed on the three-dimensional point cloud data, the point cloud data belonging to objects within the three-dimensional point cloud data can be determined, so that during feature extraction, features are extracted only from the point cloud data belonging to objects and not from the point cloud data that does not belong to any object, which reduces the amount of data to be processed. The point cloud data not belonging to any object may be the point cloud data corresponding to the background of the image.
In an alternative of the present invention, if the image to be processed further includes a color image of the scene, the depth image is determined based on the color image.
In some scenes, if the depth image is not easily acquired, the corresponding depth image may be obtained based on the color image corresponding to the same scene.
One implementation of obtaining the depth image based on the color image may be: predicting, by a depth image prediction model and based on the color image, the depth image corresponding to the color image. The input of the depth image prediction model is a color image of a scene, and the output is the depth image of the scene. The model can be obtained by training an initial model on sample images, where each sample image includes a color image and a corresponding depth image of the same scene.
In an alternative of the present invention, the image to be processed further includes a color image of the scene, and the method may further include:
performing feature extraction on the color image to obtain a second feature map;
in step S130, obtaining a proposed result of the object in the scene based on the first feature map may include:
and obtaining a proposal result of the object in the scene based on the first feature map and the second feature map.
If the image to be processed further includes a color image of the scene, the color image can reflect the two-dimensional features of the object. Therefore, when the proposal result of the object in the scene is obtained based on the first feature map (three-dimensional features), the two-dimensional features of the color image (the second feature map) can be combined with the first feature map, so that the obtained proposal result is more accurate.
Feature extraction on the color image can be implemented by a feature extraction method in the prior art, such as a convolutional neural network.
It is understood that, if the depth image is not predicted from the color image, the two images captured of the same scene may be aligned in advance, for example converted to the same viewing angle or the same lighting, in order to minimize the difference between them. After alignment, the pixels of the depth image correspond one-to-one with the pixels of the color image, which avoids the influence of parallax between the two images. The alignment processing can be implemented by existing methods and is not described again here.
In an alternative aspect of the present invention, obtaining a proposed result of an object in a scene based on a first feature map and a second feature map may include:
fusing the first characteristic diagram and the second characteristic diagram to obtain a third characteristic diagram corresponding to the image to be processed;
and obtaining a proposal result of the object in the scene based on the third feature map.
When the proposal result of the object is obtained based on the first feature map and the second feature map, the two feature maps may be fused into one feature map (the third feature map), where the third feature map includes the three-dimensional geometric features of the first feature map as well as the two-dimensional pixel features of the second feature map.
In an alternative of the present invention, since the points in the point cloud data are organized in image form (each point corresponds to a pixel of the depth image), the first feature map and the second feature map may be concatenated together to obtain the third feature map.
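Purely as an illustration of this kind of fusion (the tensor shapes and channel counts below are assumptions, not values from the embodiment), two pixel-aligned feature maps can be concatenated along the channel dimension:

```python
import torch

# Assumed shapes: pixel-aligned feature maps of the same spatial size H x W.
first_feature_map = torch.randn(1, 256, 60, 80)    # point cloud branch (3D geometric features)
second_feature_map = torch.randn(1, 256, 60, 80)   # color image branch (2D pixel features)

# Concatenation along the channel dimension yields the third feature map,
# which carries both the 3D geometric features and the 2D pixel features.
third_feature_map = torch.cat([first_feature_map, second_feature_map], dim=1)  # (1, 512, 60, 80)
```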
In an alternative of the present invention, the proposal result of the object in the scene may be obtained, based on the third feature map, as the output of a neural network model, and the neural network model is obtained through training in the following manner: obtaining sample images, where each sample image includes a depth image and a color image of the same scene and is annotated with a labeling result for each object, the labeling result representing the three-dimensional detection result of each object in the sample image; determining the third feature map corresponding to each sample image based on the second feature map of the color image and the first feature map of the depth image of each sample image; and training an initial network model based on the third feature map corresponding to each sample image until the loss function of the initial network model converges, and taking the trained model as the neural network model, where the value of the loss function represents the degree of difference between the prediction result and the labeling result of each sample image.
It is to be understood that the neural network model may be trained based on actual requirements. For example, the labeling result includes at least one of a region image, a two-dimensional image region segmentation result, a bounding box, or a keypoint corresponding to each object in the image, and accordingly, the output of the neural network model may include at least one of a region image, a two-dimensional image region segmentation result, a bounding box, or a keypoint corresponding to each object in the image to be processed. The proposal result of the object can then be obtained based on the output of the neural network model.
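The training procedure described above is a conventional supervised loop. The following sketch is only a generic illustration under assumed placeholders; the model, data loader, and cross-entropy loss are stand-ins, not the disclosed network or loss:

```python
import torch
from torch import nn, optim

# Generic training-loop sketch for "train until the loss function converges".
def train_until_converged(model: nn.Module, loader, epochs: int = 50, tol: float = 1e-4):
    optimiser = optim.Adam(model.parameters(), lr=1e-3)
    previous = float("inf")
    for _ in range(epochs):
        running = 0.0
        for third_feature_map, labels in loader:       # per-sample third feature maps
            optimiser.zero_grad()
            prediction = model(third_feature_map)      # proposal prediction
            loss = nn.functional.cross_entropy(prediction, labels)
            loss.backward()
            optimiser.step()
            running += loss.item()
        if abs(previous - running) < tol:              # crude convergence check
            break
        previous = running
    return model
```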
In an alternative aspect of the present invention, obtaining a proposed result of an object in a scene based on the third feature map may include:
segmenting the image to be processed corresponding to the third feature map to obtain at least two sub-images;
determining a proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature map corresponding to the adjacent sub-image of each sub-image;
and fusing the proposal results corresponding to the sub-images to obtain the proposal result of the object in the scene.
Each partial image area (sub-image) of the image to be processed has a corresponding sub-feature map, and each sub-image corresponds to a proposal result of an object in the scene, that is, a proposal result corresponding to its sub-feature map. In the solution of the present invention, if the image includes a plurality of objects, different sub-images may correspond to different objects, and the proposal results corresponding to a plurality of sub-images may correspond to the same object or to different objects.
For the third feature map, each object in the image to be processed has a corresponding sub-feature map, that is, a part of the third feature map; all the sub-feature maps together make up one complete third feature map. The proposal result corresponding to a sub-image therefore represents the proposal result of the object corresponding to that sub-image's sub-feature map. The proposal results corresponding to the sub-images are then fused to obtain the proposal result of the object in the image to be processed (the proposal result of the object in the scene).
It is understood that in the solution of the present invention, if the proposed result of the object in the scene is determined based on the first feature map, the sub-image is determined based on the first feature map. If the proposed result of the object in the scene is determined based on the second feature map, the sub-image is determined based on the second feature map.
In an alternative aspect of the present invention, determining a proposed result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature maps corresponding to adjacent sub-images of each sub-image may include:
determining a weight for each sub-image;
and determining a proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature map corresponding to the adjacent sub-image of each sub-image and the weight corresponding to each sub-image.
For an object, the probability that each sub-image belongs to the object can be represented by a weight: the greater the weight, the greater the probability that the sub-image belongs to the object. When the proposal results corresponding to the sub-images are fused, taking the weight of each sub-image into account makes the determined proposal result corresponding to each sub-image more accurate.
The weight corresponding to each sub-image can be determined through the neural network model, that is, in the process of training the model, the model can be trained based on the proposal result corresponding to each sub-image and the corresponding weight, and the weight of each sub-image can be determined based on the trained model.
In an alternative aspect of the invention, determining the weight for each sub-image comprises any of:
first, the weight of each sub-image is determined based on the sub-feature map corresponding to each sub-image.
The weight of each sub-image can be determined from the features in its corresponding sub-feature map. Within a sub-feature map, the probability that each feature belongs to each object is different; therefore, the weight of a sub-image can be determined based on the features of that sub-image, and the weight represents the probability that the sub-image belongs to a certain object.
Secondly, determining candidate points of the image to be processed, and determining the weight corresponding to each sub-image based on the candidate points corresponding to each sub-image.
The candidate points are points which can represent the positions of the objects, the positions of the objects in the image can be determined through the candidate points, and the probability that each candidate point belongs to each object is different.
Thirdly, determining the weight corresponding to each sub-image based on the sub-feature map corresponding to the candidate point corresponding to each sub-image.
The probability that each sub-feature map belongs to each object is different, and the weight of the corresponding sub-image can be determined based on the sub-feature map corresponding to the candidate point.
In an alternative of the present invention, determining candidate points of the image to be processed may include any one of:
firstly, each pixel point in the image to be processed is taken as a candidate point of the image to be processed.
The positions of the objects in the image to be processed can be accurately reflected based on the pixel points, and then the pixel points are used as candidate points, so that the proposal result of the objects can be accurately determined.
And secondly, determining candidate points corresponding to each sub-image based on the pixel points corresponding to each sub-image.
The candidate point corresponding to each sub-image can be determined based on the pixel points of that sub-image; one candidate point may correspond to a plurality of pixel points or to a single pixel point.
Taking a pixel point corresponding to one sub-image as an example, based on the pixel point corresponding to the sub-image, one implementation manner for determining the candidate point corresponding to the sub-image may be as follows:
and taking the pixel point positioned at the middle position in the pixel points corresponding to the sub-image as a candidate point of the sub-image.
And thirdly, sampling the image to be processed to obtain at least two sampling points, dividing the image to be processed according to the at least two sampling points to obtain at least two corresponding sub-images, and taking the sampling point corresponding to each sub-image as a candidate point.
The sampling point may be a pixel point, and sampling may be performed according to a set sampling rule, for example, sampling every N pixel points. The sampling rule may be set based on actual requirements, and the present invention is not limited to the above sampling rule.
Wherein, the at least two sub-images may include the following cases:
in the first case, the plurality of sampling points corresponds to one sub-image. For example, in at least two sampling points, the distance between two adjacent sampling points is smaller than the set value, which indicates that the two sampling points may correspond to the same object, and then the area corresponding to the two sampling points may be used as a sub-image.
In the second case, one sample point corresponds to one sub-image. Namely, at least two sampling points obtained by sampling are several, and several sub-images are obtained by correspondingly dividing.
In the first case, any one of the plurality of sampling points corresponding to the sub-image may be used as a candidate point of the sub-image. In the second case, since one sampling point corresponds to one sub-image, the sampling point corresponding to the sub-image can be directly used as a candidate point.
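As a toy illustration of the sampling-based option, the sketch below samples one pixel every `stride` pixels and treats each sampling point as the candidate point of the surrounding block; the stride value and function name are assumptions:

```python
import numpy as np

# Sample one pixel every `stride` pixels; each sampling point is both the
# candidate point and the centre of the stride x stride sub-image around it.
def sample_candidate_points(height: int, width: int, stride: int = 16) -> np.ndarray:
    candidates = []
    for y in range(stride // 2, height, stride):
        for x in range(stride // 2, width, stride):
            candidates.append((y, x))
    return np.array(candidates)

points = sample_candidate_points(480, 640, stride=16)
print(points.shape)  # (1200, 2): one candidate point per sub-image
```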
In an alternative of the present invention, determining the weight corresponding to each sub-image based on the candidate points corresponding to each sub-image may include:
for the candidate point corresponding to each sub-image, determining the similarity relation between the candidate point and the candidate points of the adjacent sub-images; and determining the weight corresponding to each sub-image based on the similarity relation between each candidate point and the candidate point of the adjacent sub-image.
For the adjacent sub-images, considering that the adjacent sub-images may correspond to the same object, the weight corresponding to each sub-image may be determined based on the similarity relationship between the objects corresponding to the adjacent sub-images. The similarity between objects corresponding to adjacent sub-images can be represented by the similarity between candidate points corresponding to the sub-images in the adjacent sub-images.
In an alternative of the present invention, each candidate point may be represented by a vector, and the similarity relationship between one candidate point and its neighboring candidate points may be represented by the inner product of their vectors: if the value of the inner product is greater than a threshold, the two candidate points are similar; otherwise they are not similar. For one candidate point and its adjacent candidate points, the number of similar candidate points corresponding to each candidate point is determined; different numbers correspond to different weights, and the greater the number, the greater the possibility that the candidate point belongs to a certain object and the greater the corresponding weight. After the weight corresponding to each of the candidate point and its adjacent candidate points is determined, these weights can be fused (for example, averaged), and the fused weight is used as the weight corresponding to that candidate point. In this way, when the weight corresponding to one sub-image is determined, the similarity relationship between the sub-image and its adjacent sub-images is taken into account, so that the weight of the sub-image can be determined more accurately.
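A minimal sketch of this neighbour-similarity weighting follows; the similarity threshold and the mapping from similar-neighbour counts to weights are illustrative assumptions:

```python
import numpy as np

# Two candidate points are "similar" if their inner product exceeds a
# threshold; a point with more similar neighbours gets a larger weight; the
# final weight of a sub-image averages the weights of its candidate point and
# the neighbouring candidate points.
def sub_image_weight(center_vec, neighbour_vecs, threshold=0.5, count_to_weight=None):
    count_to_weight = count_to_weight or {0: 0.1, 1: 0.5, 2: 1.0}  # assumed table
    vectors = [center_vec] + list(neighbour_vecs)
    per_point_weights = []
    for i, v in enumerate(vectors):
        similar = sum(1 for j, u in enumerate(vectors)
                      if j != i and float(np.dot(v, u)) > threshold)
        per_point_weights.append(count_to_weight.get(similar, 1.0))
    return float(np.mean(per_point_weights))   # fused (averaged) weight

x, y, z = np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.8, 0.2])
print(sub_image_weight(x, [y, z]))  # all three points are mutually similar -> 1.0
```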
In an alternative of the present invention, each candidate point may correspond to a score, where the score represents the probability that the candidate point belongs to an object of a certain class: the higher the score, the higher the probability.
In an alternative, the probability value may be further normalized, and whether the candidate point belongs to that class of object is identified through the normalization result. For example, for a candidate point whose probability is greater than a set value, the normalization result is 1, indicating that the candidate point belongs to the class of object; for a candidate point whose probability is not greater than the set value, the normalization result is 0, indicating that the candidate point does not belong to the class of object.
As an example, take determining the weight corresponding to sub-image A. The neighboring sub-images of sub-image A are sub-image B and sub-image C; the candidate point corresponding to sub-image A is candidate point a, that of sub-image B is candidate point b, and that of sub-image C is candidate point c. Each candidate point corresponds to a vector: the vector of candidate point a is x, that of candidate point b is y, and that of candidate point c is z. The vector inner product between each pair of candidate points is calculated. Suppose the number of similar candidate points corresponding to candidate point a is 2, that is, both candidate point b and candidate point c are similar to candidate point a; the number for candidate point b is 1 (candidate point a), and the number for candidate point c is also 1. If the weight corresponding to 2 similar candidate points is w1 and the weight corresponding to 1 similar candidate point is w2, then the weight of sub-image A corresponding to candidate point a is (w1 + w2 + w2)/3. Similarly, the weights of the other sub-images can be determined based on the above method, and are not described herein again.
When training a neural network that implements the above method, each candidate point corresponds to a loss, and the fused prediction of the sub-images also corresponds to a loss. When the gradient is back-propagated, the gradient corresponding to each sub-image is constrained to avoid an excessively large gradient; one way to do this is to multiply the gradient by a factor less than 1.
In the solution of the present invention, the category and position of the object can be predicted based on the feature at the center point of the sub-image (which may be referred to as an anchor point), i.e., the feature at the center position of the sub-image. However, objects in natural scenes present various challenges, such as occlusion and deformation. Previous single-step anchor-based methods use the features at the center of an anchor point to predict the class and location of an object, so the appearance that implicitly represents the entire object is used for the prediction. Because it is difficult for the training data to contain all possible partial occlusions, the trained model has difficulty learning the appearance in all cases; when the object features fall in an occluded area, the detection accuracy may degrade. To solve this problem, we use multiple adjacent grids (which may be referred to as sub-images) for each particular anchor point to make the prediction. Each adjacent grid mainly represents the features of a part of the object (which may be referred to as a sub-feature map), i.e., the appearance of that part of the object. Through the predictions of the non-occluded regions we can still obtain robust detection. Our model is based on RefineDet; however, RefineDet makes only one prediction per anchor point, while our method makes multiple predictions. In this way, our method can be more robust to partial occlusions.
As shown in the network structure diagram of Fig. 2, our network uses the anchor refinement module and the transfer connection module, as in RefineDet, and uses the following feature maps (P3, P4, P5, P6) for detection. For each of the four feature maps, a multi-prediction mode is adopted. During the training phase, the multiple predictions provide multiple prediction losses. In the testing stage, based on the weight of each grid, we combine the results of the multiple predictions as the final prediction result (which may be referred to as the proposal result of the object).
Detection using multi-region prediction. We perform detection on four feature maps: P3, P4, P5, and P6. For each updated anchor point, the class label and location are represented by a vector, and the class label and location are predicted simultaneously. To obtain a position-sensitive prediction, for each anchor point we use not only the middle grid but also the surrounding grids; for convenience, the middle grid and the surrounding grids are referred to herein as nearby grids. After obtaining a combined feature map such as P3, for each anchor point we obtain its prediction by aggregating the predictions of multiple grids. As shown in Fig. 2, for each feature map a multi-region prediction module is used to obtain the corresponding prediction. In each multi-region prediction module, K offset convolutions are performed on the feature map (such as P3) to obtain the prediction outputs of K adjacent grids. Meanwhile, a grid prediction module is used to obtain the weight of each grid. The information is then fused by a prediction fusion module to obtain the final prediction output. Each grid prediction corresponds to one loss; meanwhile, the output of the prediction fusion module also corresponds to a loss. This can reduce overfitting.
In this example, the number of classes is defined as N and the number of nearby grids is defined as K. Assume that a feature layer has M anchor points; the dimension of the prediction output within one layer is therefore (N + 4) × M × K, where each position is represented by a 4-dimensional vector. In this context we use 5 nearby grids, but other numbers of grids may be used. Different regions have different reliabilities, so we provide two ways to infer the reliability of a grid and combine the predictions of the grids based on this reliability. Define $a_k$ as the weight of grid $k$ ($k = 1, \ldots, K$), $p_k$ as the prediction corresponding to grid $k$, and let "s.t." denote "subject to", i.e., that the stated constraint is satisfied. The combined prediction result $\hat{p}$ is defined as:

$$\hat{p} = \sum_{k=1}^{K} a_k\, p_k, \qquad \text{s.t.}\ \sum_{k=1}^{K} a_k = 1 \tag{1}$$

where $0 \le a_k \le 1$. The bounding box of the final object on a feature map may then be obtained by non-maximum suppression based on the combined prediction result.
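A minimal sketch of the combined prediction in equation (1), assuming each of the K grids outputs an (N + 4)-dimensional vector of class scores and box offsets:

```python
import torch

# Weighted combination of the K grid predictions for one anchor; the weights
# are non-negative and sum to 1. Tensor shapes are illustrative assumptions.
def combine_grid_predictions(preds: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """preds: (K, N + 4) class scores and box offsets from the K nearby grids
       a:     (K,)       grid weights with 0 <= a_k <= 1 and sum(a) == 1"""
    assert torch.isclose(a.sum(), torch.tensor(1.0)), "weights must sum to 1"
    return (a.unsqueeze(1) * preds).sum(dim=0)

K, N = 5, 20
combined = combine_grid_predictions(torch.randn(K, N + 4), torch.full((K,), 1.0 / K))
print(combined.shape)  # torch.Size([24])
```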
For each of the K nearby grids, we define a predictor. Each predictor uses only the information of its corresponding grid; for example, the predictor of the upper grid only utilizes the feature information around the upper grid, and likewise for the other grid predictors. In general, regional features can be used to infer information about the whole: for example, given the area of the head, we can infer where the whole object is. Thus, the detectors of nearby grids can infer information about the object at the central grid. Furthermore, when some regions suffer from occlusion, we can still obtain a robust prediction through the predictions of the other regions. The K grids correspond to the same anchor point; that is, they share the same anchor parameters, including position (x, y), width and height.
Anchors come in different sizes. For large anchors, the nearby grids tend to fall within the object region, so each nearby grid tends to represent partial information of the object; that is, our method is similar to segmenting the object into parts. In this case, when a portion of the object is occluded, the entire object can still be detected through the other portions. For small anchor points, the nearby grids tend to contain both part of the object's appearance and nearby environmental information. This strategy is useful for detecting small objects, because environmental information helps distinguish small objects.
Loss function. In this system, there are two loss functions: the classification loss $L_{class}$ and the localization loss $L_{loc}$. The overall loss function is defined as:

$$L = L_{loc} + L_{class} \tag{2}$$

Here, $L_{loc}$ denotes the localization loss of all anchor points and $L_{class}$ denotes the classification loss of all anchor points. For $L_{class}$ we use the softmax loss, and for $L_{loc}$ we use the smooth L1 loss. In the training phase, an independent loss is defined for each nearby grid predictor, so the K nearby grids have K losses; the combined prediction also corresponds to a loss. For the i-th feature map and the k-th nearby grid predictor, define $L_{class}^{i,k}$ and $L_{loc}^{i,k}$ as its classification and localization losses, respectively, and define $L_{class}^{i}$ and $L_{loc}^{i}$ as the classification and localization losses of the combined prediction of the i-th feature map. Define $F$ as the set of feature maps used for prediction. The classification loss and the localization loss are then defined as:

$$L_{class} = \sum_{i \in F} \Big( L_{class}^{i} + \sum_{k=1}^{K} L_{class}^{i,k} \Big) \tag{3}$$

$$L_{loc} = \sum_{i \in F} \Big( L_{loc}^{i} + \sum_{k=1}^{K} L_{loc}^{i,k} \Big) \tag{4}$$
by defining multiple penalties, we add more regularization constraints, which can reduce overfitting.
In the scheme of the invention, we propose two strategies to infer the weight of a grid: one based on the information of the grid itself, and one based on the relationships between grids.
First, the weight of each grid is determined from the information of the grid itself. The weight of a grid is affected by its features: the more discriminative a grid's features are, the more reliable the prediction they support; conversely, if the features are occluded or very noisy, they are less reliable for prediction. Based on the features of the grid, the optimal grid weight is obtained by learning; that is, we can obtain the grid weights by convolution.
As shown in fig. 3, given a feature map, such as P3, it gets its class and location predictions by offset convolution, while it gets its weight by another offset convolution and Sigmoid. And inputting the prediction and the weight of each grid into a prediction fusion layer for fusion to obtain final prediction output.
For grid k, define $a'_k$ as the weight produced by the convolution. We then apply the sigmoid function $\sigma(\cdot)$ to it, and the final weight $a_k$ is obtained as:

$$a_k = \frac{\sigma(a'_k)}{\sum_{i=1}^{K} \sigma(a'_i)} \tag{5}$$

where $i = 1, \ldots, K$. In this way, the constraint in equation (1) can be satisfied, and more reliable grids are given larger weights.
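A minimal sketch of this sigmoid-plus-normalization step, assuming the raw per-grid scores come from the weight-prediction convolution branch:

```python
import torch

# Squash each raw grid score with a sigmoid, then normalize so that the
# weights sum to 1, satisfying the constraint in equation (1).
def grid_weights(raw_scores: torch.Tensor) -> torch.Tensor:
    """raw_scores: (K,) outputs of the weight-prediction convolution."""
    squashed = torch.sigmoid(raw_scores)
    return squashed / squashed.sum()

a = grid_weights(torch.tensor([1.5, -0.3, 0.0, 2.0, 0.7]))
print(a, a.sum())  # five weights in (0, 1) that sum to 1
```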
Second, the weights of the grids are inferred from the relationships between the grids. In the first strategy we do not use information shared between the grids, but this information is very useful. There are three relationships between the grids that can be used to infer the weight of a grid.
As shown in Fig. 4a, for an anchor point at grid A, there are 5 nearby grids.

As shown in Fig. 4b, taking grid B as an example, the three relationships include: 1) the feature set $F_B$ of B's neighbor grids; 2) the prediction $p_B$ of the anchor point at B; 3) the predictions $p_{N(B) \to A}$ of B's neighbors for the anchor point at A. From these relationships, the weight of grid B when predicting the anchor point at A is defined as:

$$a_B = f\big(F_B,\ p_B,\ p_{N(B) \to A}\big) \tag{6}$$

Here, $p_B$ expresses the object relationship between adjacent grids. For example, suppose the image contains a person riding a horse: horses and riders appear together, so when one grid is judged to be a horse, the grid above it has a high probability of containing a person. However, as shown in Fig. 4(b), when we want to predict the class information of the anchor point at A, we need to know $p_B$, while $p_B$ in turn depends on the predictions of B's own neighbors.

Thus, over a given feature map, these dependencies form a graph structure spanning the entire feature map. Such a graph structure can be handled by a probabilistic graphical model, in which case the grid weights need to be inferred by belief propagation. This makes the model difficult to train end-to-end.
To solve this problem, the present solution does not consider $p_{N(B) \to A}$, so we define:

$$a_B = f\big(F_B,\ p_B\big) \tag{7}$$

In this way we can train end-to-end. This new relationship is shown in Fig. 4(c), which depicts the relationship between the grids when $p_{N(B) \to A}$ is not considered. In this figure, circles represent grids, and a connection between two grids indicates that they are treated as neighbors when inferring the grid weights. Each grid has four edges connecting it to other grids, and the weights of different grids are obtained from different features. For simplicity, we further simplify this relationship (as shown in Fig. 5). For a given feature map, we apply K offset convolutions separately to obtain the prediction of each grid, and these predictions are concatenated into one feature map. At the same time, another feature map is obtained by applying an offset convolution followed by a convolution to the input feature map. The two feature maps are concatenated to express the relationship between the grids, and the weight of each grid is obtained by convolving the concatenated feature map and applying a sigmoid. These pieces of information are then fused by the prediction fusion layer to obtain the final prediction output. That is to say, we concatenate the category predictions and features of the K adjacent grids into one feature map, and then perform convolution and sigmoid operations on it to obtain the weight of each grid.
Regarding the offset convolution layers: in our method, the K adjacent grids are predicted using a common anchor point. For computational efficiency, we propose a new layer for convolving the different adjacent grids. In this layer, for a particular anchor point, the receptive field of the upper grid is shifted by -1 along the vertical direction, and similarly for the receptive fields of the other adjacent grids. In the second way of inferring the grid weights, we select five grids as the receptive field. In this way, the combination of multiple predictions and the subsequent calculation of the loss function are facilitated.
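The exact offset-convolution layer is not spelled out here; the sketch below approximates the idea by shifting the feature map before applying a separate prediction head per adjacent grid. The use of torch.roll and per-grid 3x3 heads are assumptions, not the disclosed layer:

```python
import torch
from torch import nn

# Each adjacent grid (centre, up, down, left, right) gets its own prediction
# head applied to a copy of the feature map whose receptive field is shifted.
class OffsetPredictions(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # (dy, dx) receptive-field shifts for centre, up, down, left, right.
        self.shifts = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
        self.heads = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
            for _ in self.shifts)

    def forward(self, feature_map: torch.Tensor):
        # One prediction map per adjacent grid, K = 5 in total.
        return [head(torch.roll(feature_map, shifts=s, dims=(2, 3)))
                for head, s in zip(self.heads, self.shifts)]

preds = OffsetPredictions(256, 24)(torch.randn(1, 256, 40, 40))
print(len(preds), preds[0].shape)  # 5 prediction maps of shape (1, 24, 40, 40)
```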
The gradient of the convolution branch is constrained. When the gradients are propagated backwards, the gradients of the K adjacent grids are added together, which can be seen as multiplying the gradient by K. Sometimes this causes the gradient to diverge. To solve this problem, we can multiply the gradient by a factor less than 1.
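A minimal sketch of such a gradient constraint using a backward hook; the scaling factor 1/K is an assumption:

```python
import torch

# Scale the gradient flowing into the shared feature map, since the gradients
# of the K adjacent grids are summed on it during back-propagation.
K = 5
feature_map = torch.randn(1, 256, 40, 40, requires_grad=True)
feature_map.register_hook(lambda grad: grad * (1.0 / K))  # shrink the summed gradient

loss = feature_map.pow(2).sum()   # stand-in for the K grid losses
loss.backward()
print(feature_map.grad.abs().mean())  # gradient already multiplied by 1/K
```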
In an alternative of the present invention, the determining the weight of each sub-image based on the sub-feature map corresponding to each sub-image may include any one of the following:
firstly, for each sub-image, determining a first feature vector corresponding to the center position of the sub-image and a second feature vector corresponding to a sub-feature map corresponding to the sub-image; determining the weight of each sub-image based on the first characteristic vector and the second characteristic vector corresponding to each sub-image;
the probability that the feature corresponding to the central position of each sub-image belongs to a certain class of objects is the highest, the feature corresponding to the central position can be represented by a feature vector (first feature vector), the probability that the sub-feature map belongs to the certain class of objects can be determined by the sub-feature map corresponding to each sub-image, the sub-feature map can also be represented by a feature vector (second feature vector), for the same sub-image, the weight of the sub-image is represented by the weight based on the inner product between the first feature vector and the second feature vector, and the probability that the sub-image belongs to the certain class of objects can be determined more accurately. The second feature vector may be determined by the neural network model.
Secondly, for the sub-feature map corresponding to each sub-image, the sub-feature map corresponds to at least one probability value, where each probability value represents the probability that the sub-feature map belongs to the corresponding object; the highest of the at least one probability value is used as the weight of the sub-image.
Each sub-image corresponds to one sub-feature map. For each sub-feature map, there is a probability value for each object, so each sub-feature map may correspond to at least one probability value, where one probability value represents the probability that the sub-feature map belongs to a certain class of object. The maximum probability value indicates the class of object that the sub-feature map most likely belongs to, so this maximum probability value may be used as the weight of the sub-image.
The following further describes the scheme of obtaining the proposed result of the object based on the depth image and the color image with reference to fig. 6 and 7:
as shown in fig. 6, the method is divided into two parts, namely a model prediction part and a model training part. The model prediction part mainly describes the scheme of determining the proposal result of an object in the image to be processed based on the image to be processed, where the image to be processed includes a depth image and a color image corresponding to the same scene. The model training part mainly describes the scheme of training the MLP encoder, where the trained MLP encoder can be used to extract features of the three-dimensional point cloud data (the 3D point cloud shown in fig. 6).
In this embodiment, the MLP encoder and the neural network model are trained first, and the specific training process is as described above and will not be described herein again. In the training process of the MLP encoder, as described above, the parameters of the MLP encoder may be adjusted based on the three-dimensional detection result of the sample image, and the specific process is as follows: and comparing the prediction result (the prediction three-dimensional detection result of the sample image) with the labeling result (the labeling three-dimensional detection result of the sample image), if the difference between the prediction result and the labeling result does not meet the convergence condition, adjusting the parameters of the MLP encoder until the difference between the prediction result and the labeling result meets the convergence condition, and taking the trained model as the MLP encoder.
The prediction results may include a three-dimensional object box (three-dimensional box detection in fig. 7), a three-dimensional key point (3D key point estimation in fig. 7), and a three-dimensional object segmentation result (three-dimensional shape segmentation shown in fig. 7). It is understood that the above three-dimensional detection results can be used in combination during training. And determining whether the extracted features of the trained MLP encoder are accurate or not according to the three-dimensional detection result.
The neural network model comprises the convolutional neural network and the object proposal neural network shown in fig. 6, and based on the trained neural network model, a proposal result (the object proposal shown in fig. 6) of an object in the image to be processed can be obtained based on the third feature map.
For the color image, its features are extracted through a convolutional neural network to obtain the second feature map, which contains pixel-by-pixel image features, i.e., two-dimensional features.
For a depth image, the depth image is converted into three-dimensional point cloud data (3D point cloud shown in fig. 6), and then feature extraction is performed on the 3D point cloud through a trained MLP encoder to obtain a first feature map, where the first feature map is a point-by-point three-dimensional feature, and the three-dimensional feature can describe a three-dimensional structural feature of an object in a three-dimensional space.
The first feature map and the second feature map are fused to obtain the third feature map; the third feature map is input into a convolutional neural network for further processing, and the output of the convolutional neural network is input into the object proposal neural network, through which the object proposal is obtained. As shown in fig. 7, the output of the object proposal neural network may include at least one of a region image corresponding to an object in the image to be processed (the object region proposal shown in fig. 7), a bounding box, a two-dimensional image region segmentation result, or keypoints (the semantic keypoint estimation shown in fig. 7). The object proposal can be determined based on the output of the object proposal neural network.
Because the image to be processed comprises a depth image and a color image, if the proposal result is an image with an object area identifier, the depth image and the color image respectively correspond to a proposal result, namely the proposal result corresponding to the depth image is a depth image with an object area identifier, and the proposal result corresponding to the color image is a color image with an object area identifier.
In an alternative aspect of the present invention, the method may further comprise:
and determining a three-dimensional detection result of the object in the image to be processed based on the proposal result, wherein the three-dimensional detection result comprises at least one of a three-dimensional posture result and a three-dimensional segmentation result.
After the proposal result of the object in the image to be processed is determined, further processing may be performed based on it, for example, determining a three-dimensional detection result of the object in the image to be processed. The three-dimensional pose result represents the pose of the object in the image, such as the rotation angle, translation distance, etc. of the object. The three-dimensional segmentation result indicates which objects are segmented from the image; for example, if the image includes a bed and a sofa, the three-dimensional segmentation result means the bed and the sofa are segmented separately, and the segmentation result is three-dimensional, that is, it can show the three-dimensional geometric characteristics of the objects.
In the alternative scheme of the invention, the three-dimensional detection result comprises a three-dimensional posture result and a three-dimensional segmentation result; based on the proposal result, determining the three-dimensional detection result of the object in the image to be processed may include:
extracting three-dimensional point cloud characteristics and two-dimensional image characteristics corresponding to the proposal result;
splicing the three-dimensional point cloud characteristic and the two-dimensional image characteristic to obtain a fourth characteristic diagram;
and determining a three-dimensional detection result of the object in the image to be processed based on the fourth feature map.
When the three-dimensional detection result of the object is determined, feature extraction can be performed on the proposal result, and since the proposal result is obtained based on the depth image and the color image, a three-dimensional point cloud feature (a feature corresponding to the depth image) and a two-dimensional image feature (a feature corresponding to the color image) can be extracted from the proposal result, and the three-dimensional detection result of the object can be more accurately determined based on the three-dimensional point cloud feature and the two-dimensional image feature.
In an alternative of the present invention, if the three-dimensional detection result includes a three-dimensional segmentation result, and the image to be processed includes an object having an incomplete shape, obtaining a proposed result of the object in the scene based on the three-dimensional point cloud data, may include:
performing shape completion on the three-dimensional point cloud data corresponding to the incomplete object on the basis of the incomplete object to obtain the three-dimensional point cloud data after completion;
and obtaining a proposal result of the object in the scene based on the supplemented three-dimensional point cloud data.
When the image is shot, objects in the image may not be completely shot due to shooting reasons or other reasons, for example, the depth image is shot based on a depth sensor, and the shape of some object in the shot image may be incomplete and have a missing part due to occlusion or reflection on the surface of the object. Then, in order to make the corresponding object in the proposed result of the object a complete-shape object, the shape of the incomplete-shape object may be supplemented.
In an alternative aspect of the present invention, the shape completion of the three-dimensional point cloud data corresponding to an incompletely-shaped object may be performed by an object three-dimensional shape completion network composed of an MLP encoder and an MLP decoder. The three-dimensional point cloud data corresponding to the incomplete object is input into the object three-dimensional shape completion network, and the completed three-dimensional point cloud data is output. The object three-dimensional shape completion network is obtained by training an initial model based on the three-dimensional point cloud data corresponding to complete objects and the three-dimensional point cloud data corresponding to incomplete objects, with the difference between the prediction result and the labeling result (the three-dimensional point cloud data corresponding to the complete object) as the loss function; when the loss function converges, the corresponding initial model is the object three-dimensional shape completion network. The Earth Mover's Distance (EMD) between the point set corresponding to the prediction result and the point set corresponding to the labeling result can be used to represent the difference between the two: when the EMD is smaller than a set distance, the loss function has converged, and when it is not smaller than the set distance, the loss function has not converged.
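For small, equally sized point sets the EMD can be computed exactly as an optimal one-to-one matching; the sketch below uses the Hungarian algorithm from SciPy and an assumed convergence threshold (practical implementations usually rely on faster approximations):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# EMD between two equally sized point sets via optimal matching, used here
# only as a convergence test; the threshold value is an assumption.
def emd(pred_points: np.ndarray, gt_points: np.ndarray) -> float:
    cost = cdist(pred_points, gt_points)          # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)      # optimal one-to-one matching
    return float(cost[rows, cols].mean())

def loss_converged(pred_points, gt_points, set_distance: float = 0.05) -> bool:
    return emd(pred_points, gt_points) < set_distance

completed = np.random.rand(1024, 3)               # completed point cloud (prediction)
labelled = completed + np.random.normal(0, 0.01, completed.shape)
print(loss_converged(completed, labelled))
```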
The testing process of the three-dimensional shape completion network can be as shown in fig. 8. In this process, the proposal result of the object is an image including the object region: the proposal result of the object in the color image is the first image, and the proposal result of the object in the depth image is the second image. Based on the first image and the second image, the second image is converted into three-dimensional point cloud data (the point cloud shown in fig. 8), three-dimensional object segmentation is then performed on the point cloud data to segment the points belonging to the object, and feature extraction is performed on the segmented point cloud data through the MLP encoder to obtain the feature map (three-dimensional point cloud features) corresponding to the second image. Based on this feature map, the shape of the incomplete object is completed through the three-dimensional shape completion network composed of the MLP encoder and the MLP decoder; the completed feature map is used as the prediction result, and the difference between the prediction result and the labeling result corresponding to the incomplete object is determined. If this difference is smaller than a first set value, the loss function (the three-dimensional segmentation loss function shown in fig. 8) has converged; if it is not smaller than the first set value, the three-dimensional segmentation loss function has not converged, and the parameters of the three-dimensional shape completion network need to be adjusted so that the loss function converges.
Similarly, feature extraction is performed on the first image through a convolutional neural network to obtain the feature map (two-dimensional image features) corresponding to the first image. The feature map corresponding to the first image and the feature map corresponding to the second image are spliced to obtain a spliced feature map (the fourth feature map), and the spliced feature map is passed through a convolutional neural network to obtain the three-dimensional pose result of the incomplete object. This three-dimensional pose result is used as the prediction result, and the difference between the prediction result and the labeling result corresponding to the incomplete object is determined. If the difference is smaller than a second set value, the three-dimensional pose estimation loss function has converged; if it is not smaller than the second set value, the three-dimensional pose estimation loss function has not converged, and the parameters of the three-dimensional shape completion network need to be adjusted so that the loss function converges.
In the process of training the three-dimensional shape completion network, not only the three-dimensional posture result of the object can be used as a prediction result, but also at least one of the three-dimensional key point estimation result, the shape completion result and the three-dimensional shape matching result of the object can be used as a prediction result, and parameters of the three-dimensional shape completion network are adjusted through corresponding loss functions based on the prediction result and the corresponding labeling result.
As shown in fig. 9, which illustrates a schematic diagram of training a three-dimensional shape completion network by using other prediction results, a result corresponding to the optional design in fig. 9 may be used as the prediction result, a loss function corresponding to the three-dimensional keypoint estimation result (the three-dimensional keypoint estimation shown in fig. 9) is a 3D euclidean distance loss function, a loss function corresponding to the shape completion result (the completion shown in fig. 9) is also a 3D euclidean distance loss function, and a loss function corresponding to the three-dimensional shape matching result (the three-dimensional shape matching shown in fig. 9) is a shape matching loss function. Based on any of the above prediction results and the corresponding loss function, the parameters of the three-dimensional shape completion network can be adjusted in the above manner.
In an alternative scheme of the invention, based on the first characteristic diagram, the proposal result of the object in the scene is obtained through the output of a neural network model, and the neural network model is obtained through the following training:
obtaining sample images, wherein each sample image comprises a depth image of a scene, each sample image is marked with a marking result of each object, and the marking result represents a proposal result of each object in the sample image;
training the initial network model based on the feature map of the corresponding depth image in each sample image until the loss function of the initial network model converges, and taking the model after training as a neural network model;
the value of the loss function represents the difference degree between the prediction result and the labeling result of each sample image.
The proposed result of the object in the scene obtained based on the first feature map may be obtained through output of the neural network model, that is, the input of the neural network model is the first feature map, and the output may be at least one of the region image, the bounding box, the two-dimensional image region segmentation result, or the keypoint corresponding to the object in the to-be-processed image described above. The proposed result of the object in the image to be processed can be obtained based on the output of the neural network model.
It is understood that the proposed result of obtaining the object in the scene based on the third feature map as described above can also be obtained from the output of the neural network model, and the input of the neural network model is the third feature map, and the output is consistent.
Correspondingly, the training of the neural network model can also be obtained based on the same training mode, and details are not repeated here.
In an alternative scheme of the present invention, the sample image includes at least two objects, the labeling result further includes a spatial position relationship between each pair of objects in the at least two objects, the prediction result is a proposed result of each object in the at least two objects, and a spatial position relationship between each pair of objects in the at least two objects, each pair of objects includes two adjacent objects; the spatial position relationship characterizes the overlapping volume between two adjacent objects;
the loss function of the initial network model includes a first loss function and a second loss function, a value of the first loss function represents a degree of difference between a prediction result of each object in the sample image and an annotation result corresponding to each object, and a value of the second loss function represents a degree of difference between a prediction result corresponding to each object pair in each object pair of the at least two objects and a corresponding annotation result.
Adjacent objects may appear in the scene, and two adjacent objects may overlap or may not overlap. The position relationship between two objects may affect the proposed result of the object, for example, in a scene, a part of a chair is placed under a desktop, that is, there is an overlapping volume between the chair and the desk, and when the proposed results of the desk and the chair are determined separately, if the three-dimensional position relationship between the two objects is considered, the obtained proposed result may be more accurate.
Based on the above, in the process of training the neural network model, the loss function not only includes the degree of difference between the prediction result of each individual object and the corresponding labeling result of each object, but also takes into account the degree of difference between the prediction result corresponding to each pair of objects in each pair of objects and the corresponding labeling result. The spatial position relationship may be determined based on a three-dimensional bounding box of each object in the object pair, and based on the three-dimensional bounding boxes corresponding to the two objects, it may be determined whether there is an overlapping volume between the two objects.
In one alternative, the second loss function may be represented by the following equation (8):
loss_s = (1 - s) · overlap(3Dbox_1, 3Dbox_2) + s · margin · (1 - t)    (8)
where loss_s is the second loss function, 3Dbox_1 represents the three-dimensional bounding box of one object, 3Dbox_2 represents the three-dimensional bounding box of the other object, overlap(3Dbox_1, 3Dbox_2) represents the overlapping volume between the two objects, and s is the ground truth (GT), i.e. the labeling result corresponding to the two objects, where s ∈ {0, 1}. margin is a constant greater than 0 and can be configured based on actual requirements, for example, greater than the maximum possible overlapping volume. When the second loss function equals margin, it has not converged; when it equals 0, it has converged.
If overlap(3Dbox_1, 3Dbox_2) > 0, then t = 1; if overlap(3Dbox_1, 3Dbox_2) = 0, then t = 0. t = 1 indicates that the two objects overlap, and t = 0 indicates that they do not overlap.
From the formula of the second loss function, when overlap(3Dbox_1, 3Dbox_2) > 0 and t = 1, loss_s = (1 - s) · overlap(3Dbox_1, 3Dbox_2); when overlap(3Dbox_1, 3Dbox_2) = 0 and t = 0, loss_s = s · margin. When s = 1, the two objects overlap in the labeling result. In this case, if the prediction gives overlap(3Dbox_1, 3Dbox_2) > 0 (t = 1), then loss_s = (1 - s) · overlap(3Dbox_1, 3Dbox_2) = 0, indicating no difference between the prediction result and the labeling result, so the second loss function converges; otherwise, if the prediction gives overlap(3Dbox_1, 3Dbox_2) = 0 (t = 0), then loss_s = s · margin = margin, and the second loss function does not converge.
Similarly, if s = 0, the two objects do not overlap in the labeling result. In this case, if the prediction gives overlap(3Dbox_1, 3Dbox_2) = 0 (t = 0), then loss_s = s · margin · (1 - t) = 0, indicating no difference between the prediction result and the labeling result, so the second loss function converges. If the prediction gives overlap(3Dbox_1, 3Dbox_2) > 0 (t = 1), then loss_s = (1 - s) · overlap(3Dbox_1, 3Dbox_2) = overlap(3Dbox_1, 3Dbox_2) > 0, and the second loss function does not converge.
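A direct transcription of equation (8) as a sketch, assuming axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and an assumed margin value:

```python
import torch

# Second (spatial) loss of equation (8): penalize predicted overlap when the
# ground truth says the objects do not overlap, and penalize missing overlap
# (by `margin`) when the ground truth says they do.
def overlap_volume(box1: torch.Tensor, box2: torch.Tensor) -> torch.Tensor:
    lo = torch.maximum(box1[:3], box2[:3])
    hi = torch.minimum(box1[3:], box2[3:])
    return torch.clamp(hi - lo, min=0).prod()   # intersection volume

def spatial_loss(box1, box2, s: float, margin: float = 10.0) -> torch.Tensor:
    ov = overlap_volume(box1, box2)
    t = 1.0 if ov.item() > 0 else 0.0
    return (1 - s) * ov + s * margin * (1 - t)  # equation (8)

b1 = torch.tensor([0., 0., 0., 1., 1., 1.])
b2 = torch.tensor([0.5, 0.5, 0.5, 1.5, 1.5, 1.5])
print(spatial_loss(b1, b2, s=1.0))  # boxes overlap and GT says overlap -> loss 0
```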
In the training process of the model, for a pair of object three-dimensional pose results obtained from the proposal results of adjacent objects, the second loss function may be calculated as described above. As shown in fig. 10, during training the second loss function updates the parameters of the model through back-propagation, thereby enabling the model to learn to use the spatial relationship of adjacent objects in three-dimensional space. Specifically, in the schematic diagram of model training based on the second loss function shown in fig. 10, the proposal results of two adjacent objects are object proposal 1 and object proposal 2. Based on the two object proposals, the three-dimensional poses of the corresponding objects are determined; the determination process is the same as that of determining the three-dimensional pose in fig. 8 and is not repeated here. The two obtained three-dimensional poses are used as prediction results, the degree of difference between the prediction result corresponding to object proposal 1 and its labeling result and the degree of difference between the prediction result corresponding to object proposal 2 and its labeling result are determined, and the parameters of the model are updated based on the two degrees of difference and the second loss function (the spatial loss function shown in fig. 10), so that the model learns to use the spatial relationship of adjacent objects in three-dimensional space.
As an example, in the spatial position relationship between two images shown in fig. 11, case 1 indicates that there is an overlapping volume between the object corresponding to 3Dbox _1 and the object corresponding to 3Dbox _2, where S is 1; case 2 indicates that there is no overlapping volume between the object corresponding to 3Dbox _1 and the object corresponding to 3Dbox _2, and S is 0. The spatial position relationship between the three-dimensional bounding boxes corresponding to the objects is shown in fig. 12, and as shown in fig. 12, the three-dimensional bounding boxes corresponding to three objects are shown in fig. 12, and the three bounding boxes can respectively correspond to three different objects, and the three bounding boxes do not overlap with each other, which corresponds to the above case 2.
It can be understood that, in the process of training the neural network model, if the first loss function is the three-dimensional pose estimation loss function, the loss function of the initial model is the three-dimensional pose estimation loss function and the spatial loss function shown in fig. 9.
In an alternative aspect of the present invention, determining a three-dimensional detection result of an object in an image to be processed based on a proposed result may include:
determining an initial three-dimensional detection result of an object in the image to be processed based on the proposal result;
determining an original image corresponding to an object in an image to be processed, wherein the original image is an image corresponding to the object in a reference posture;
determining difference information corresponding to the initial three-dimensional detection result of each object based on the initial three-dimensional detection result of each object and the corresponding original image;
and updating the initial three-dimensional detection result of the corresponding object based on the difference information corresponding to the initial three-dimensional detection result of each object to obtain the three-dimensional detection result of each object in the image to be processed.
In the process of determining the three-dimensional detection result of each object in the image to be processed based on the proposal result, in order to improve the accuracy of the three-dimensional detection result, the initial three-dimensional detection result can be adjusted based on the original image corresponding to each object, namely, the initial three-dimensional detection result is refined based on the original image, so that the initial three-dimensional detection result is more accurate. Whether the initial three-dimensional detection result is accurate or not is represented by difference information corresponding to the initial three-dimensional detection result, if the difference information corresponding to the initial three-dimensional detection result meets a set condition, the initial three-dimensional detection result is relatively accurate and does not need to be updated, and if the difference information corresponding to the initial three-dimensional detection result does not meet the set condition, the initial three-dimensional detection result is not accurate enough and needs to be updated. Wherein the set condition may be configured based on the actual demand.
In an alternative embodiment of the present invention, the original image may be an image in a CAD model of the object, and the reference pose may be any pose of the object.
It can be understood that, the determining of the three-dimensional detection result of the object in the image to be processed based on the proposed result can also be realized by the neural network model, and in the process of training the neural network model, the parameters of the neural network model can be updated in a manner of updating the initial three-dimensional detection result of the corresponding object according to the difference information corresponding to the initial three-dimensional detection result, that is, when the difference information does not satisfy the set condition, the model parameters are updated until the difference information corresponding to the updated initial three-dimensional detection result satisfies the set condition, the updating of the model parameters is stopped, and based on the neural network model obtained at this time, a more accurate three-dimensional detection result can be obtained.
In an optional scheme, the determining an original image corresponding to an object in the image to be processed, where the initial three-dimensional detection result includes an initial three-dimensional segmentation result, may include:
determining an object type of each object based on the initial three-dimensional segmentation result of each object;
and determining an original image corresponding to each object based on the object class of each object.
Different objects have different object types, and the original image corresponding to the object can be more accurately determined according to the object types. The original image may be a three-dimensional computer-aided design CAD image.
In an alternative aspect of the present invention, the determining the difference information corresponding to the initial three-dimensional detection result of each object based on the initial three-dimensional detection result of each object and the corresponding original image may include:
based on the initial three-dimensional attitude result of each object, performing attitude transformation on the corresponding original image to obtain a transformed image corresponding to each object;
and determining difference information corresponding to the initial three-dimensional detection result of each object based on the initial three-dimensional detection result of each object and the corresponding transformed image.
Based on the initial three-dimensional detection result of each object and the corresponding original image, the difference information corresponding to the initial three-dimensional detection result of each object can be determined by means of alignment estimation. Specifically, the initial three-dimensional detection result of each object includes pose information corresponding to the object, namely the initial three-dimensional pose result. Based on the pose information of each object, the pose of the corresponding original image is transformed so that the object in the transformed image has the same pose as the object corresponding to the initial three-dimensional pose result. Based on the transformed image and the corresponding initial three-dimensional detection result, the difference information between each object and the transformed image, namely the difference information corresponding to the initial three-dimensional detection result of each object, can be determined. The difference information may include at least one of difference information corresponding to the initial three-dimensional pose result or difference information corresponding to the initial three-dimensional segmentation result. That is, if the determined difference information corresponds to the initial three-dimensional pose result, the corresponding initial three-dimensional pose result may be updated based on it; if the determined difference information corresponds to the initial three-dimensional segmentation result, the corresponding initial three-dimensional segmentation result may be updated based on it.
The difference information may include missing points, error points, and the like corresponding to the initial three-dimensional segmentation result, and a three-dimensional pose error corresponding to the initial three-dimensional pose result.
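As a minimal sketch of the alignment estimation described above (assuming the initial three-dimensional pose result is expressed as a rotation matrix R and a translation t, and using a simple mean nearest-neighbor distance as the alignment measure, neither of which is mandated by the disclosure):

```python
import numpy as np

def transform_cad_points(cad_points: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Apply the estimated pose (R, t) to the CAD point cloud so its pose matches the detection."""
    return cad_points @ R.T + t  # (N, 3)

def alignment_error(segmented_points: np.ndarray, transformed_cad: np.ndarray) -> float:
    """Mean nearest-neighbor distance between the segmented object points and the posed CAD points.
    Brute-force pairwise distances for clarity; a KD-tree would be used for large clouds."""
    d = np.linalg.norm(segmented_points[:, None, :] - transformed_cad[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())
```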
In an alternative, in the process of determining difference information corresponding to the initial three-dimensional detection result of each object based on the initial three-dimensional detection result of each object and the corresponding original image, difference information corresponding to the initial three-dimensional detection result of each object may be determined based on three-dimensional point cloud data corresponding to the initial three-dimensional detection result of each object and three-dimensional point cloud data corresponding to the corresponding original image.
In an optional scheme, in the process of determining difference information corresponding to the initial three-dimensional detection result of each object based on the three-dimensional point cloud data corresponding to the initial three-dimensional detection result of each object and the three-dimensional point cloud data corresponding to the corresponding original image, for convenience of processing, normalization processing may be performed on the three-dimensional point cloud data corresponding to the initial three-dimensional detection result and the three-dimensional point cloud data corresponding to the corresponding original image, and then difference information corresponding to the initial three-dimensional detection result of each object is determined based on the three-dimensional point cloud data corresponding to the normalized original image and the three-dimensional point cloud data corresponding to the normalized initial three-dimensional detection result.
In an optional scheme, a normalization processing manner is: and sampling the three-dimensional point cloud data corresponding to the original image, so that the three-dimensional point cloud data corresponding to the original image and the three-dimensional point cloud data corresponding to the initial three-dimensional detection result have the same point cloud density.
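A minimal sketch of the sampling-based normalization mentioned above (random sampling is one possible choice; the disclosure does not fix the sampling strategy):

```python
import numpy as np

def match_point_density(cad_points: np.ndarray, target_count: int, seed: int = 0) -> np.ndarray:
    """Sample the CAD point cloud so that it has the same number of points as the point
    cloud corresponding to the initial three-dimensional detection result."""
    rng = np.random.default_rng(seed)
    replace = cad_points.shape[0] < target_count  # up-sample with replacement if too sparse
    idx = rng.choice(cad_points.shape[0], size=target_count, replace=replace)
    return cad_points[idx]
```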
As an example, fig. 13 shows a schematic diagram of a refinement method for a three-dimensional segmentation result and a three-dimensional pose result. Based on a color image and a depth image (the color-depth input shown in fig. 13), an object proposal is determined based on the scheme, described earlier, for determining a proposal result of an object in an image; the object proposal comprises a proposal result of the object in the depth image and a proposal result of the object in the color image. Based on the object proposal, a three-dimensional detection result (the initial three-dimensional detection result) is determined, which comprises a three-dimensional segmentation result and a three-dimensional pose result (the three-dimensional segmentation and pose estimation shown in fig. 13). Based on the initial three-dimensional segmentation result, the object class of the object in the image and the point cloud data corresponding to the object (corresponding to the segmented object point cloud in fig. 13) are determined, and based on the initial three-dimensional pose result, the three-dimensional pose of the object is determined.
Based on the object class, an original image corresponding to that class is retrieved from a CAD database (corresponding to the object CAD model retrieval shown in fig. 13). Based on the three-dimensional pose of the object, pose transformation is performed on the original image so that the pose of the object in the original image is consistent with the three-dimensional pose, yielding a transformed image. Alignment estimation is then performed between the three-dimensional point cloud data of the transformed image and the three-dimensional point cloud data of the object corresponding to the three-dimensional segmentation result (corresponding to the CAD-point cloud pose alignment estimation in fig. 13), and an alignment error (the difference information) is obtained.
Based on this, the alignment error is compared with a set threshold. If the alignment error is smaller than the set threshold, the alignment error is small enough, the initial three-dimensional detection result does not need to be updated, and it is taken as the final three-dimensional detection result, which comprises a final three-dimensional posture and a final three-dimensional segmentation. On the contrary, if the alignment error is not smaller than the set threshold, the alignment error is not small enough and the initial three-dimensional detection result needs to be updated: if the alignment error is an error corresponding to error points and missing points, the initial three-dimensional segmentation result may be updated until the alignment error corresponding to the updated three-dimensional segmentation result is smaller than the set threshold, and the three-dimensional segmentation result at that time is taken as the final three-dimensional segmentation result; if the alignment error is a posture error, the initial three-dimensional posture result may be updated until the alignment error corresponding to the updated three-dimensional posture result is smaller than the set threshold, and the three-dimensional posture result at that time is taken as the final three-dimensional posture result.
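The update loop described above can be sketched as follows; the injected callables compute_error, update_segmentation, and update_pose are hypothetical placeholders whose concrete implementations are not specified by the disclosure:

```python
def refine_detection(segmentation, pose, compute_error, update_segmentation, update_pose,
                     threshold: float, max_iters: int = 10):
    """Iteratively refine the initial 3D detection result until the alignment error
    (difference information) satisfies the set condition, i.e. falls below the threshold.

    compute_error returns (segmentation_error, pose_error); the update callables
    stand in for whatever refinement step is actually used."""
    for _ in range(max_iters):
        seg_error, pose_error = compute_error(segmentation, pose)
        if max(seg_error, pose_error) < threshold:
            break                                   # accurate enough: keep as final result
        if seg_error >= threshold:                  # error/missing points -> update segmentation
            segmentation = update_segmentation(segmentation, seg_error)
        if pose_error >= threshold:                 # pose error -> update pose
            pose = update_pose(pose, pose_error)
    return segmentation, pose
```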
In an optional scheme, the difference information corresponding to the initial three-dimensional detection result of each object can be determined through two MLP networks.
As an example, fig. 14 shows a schematic diagram of a three-dimensional detection result refinement scheme based on alignment of a CAD image and a point cloud. In fig. 14, the initial three-dimensional detection result includes an initial three-dimensional segmentation result and an initial three-dimensional pose result. Point cloud normalization is performed on the three-dimensional point cloud data corresponding to the initial three-dimensional segmentation result, and feature extraction is performed on the normalized three-dimensional point cloud data through an MLP encoder to obtain a first feature. An original image corresponding to the object is determined from the CAD model, pose transformation is performed on the original image based on the initial three-dimensional pose result (the three-dimensional pose shown in fig. 13) to obtain a transformed image, point cloud normalization is performed on the three-dimensional point cloud data corresponding to the object in the transformed image, and feature extraction is likewise performed on the normalized three-dimensional point cloud data through an MLP encoder to obtain a second feature. Difference information corresponding to the initial three-dimensional detection result of the object is then determined, through an MLP, based on the first feature and the second feature, where the difference information comprises error points and missing points corresponding to the initial three-dimensional segmentation result and a pose error corresponding to the initial three-dimensional pose result. Finally, the initial three-dimensional segmentation result is updated based on the error points and the missing points (corresponding to the three-dimensional segmentation updating shown in fig. 14), and the initial three-dimensional pose result is updated based on the pose error (corresponding to the three-dimensional pose updating shown in fig. 14), until the difference information corresponding to the updated three-dimensional detection result of each object meets the set condition; the updating is then stopped to obtain the final three-dimensional detection result.
Fig. 14 includes two MLP networks, one MLP network is used to process three-dimensional point cloud data corresponding to a three-dimensional segmentation result, and the other MLP network is used to process three-dimensional point cloud data corresponding to a transformed image.
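A minimal PyTorch-style sketch of the two-encoder arrangement in fig. 14 (the layer sizes, the PointNet-style max pooling, and the 6-dimensional pose residual output are assumptions; only the pose-error branch of the difference information is sketched, and a per-point head for error/missing points is omitted):

```python
import torch
import torch.nn as nn

class MLPEncoder(nn.Module):
    """Per-point MLP followed by max pooling, producing one global feature per point cloud."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:  # points: (B, N, 3)
        return self.mlp(points).max(dim=1).values              # (B, out_dim)

class DifferenceHead(nn.Module):
    """Predicts a pose residual from the concatenated features of the two encoders."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.seg_encoder = MLPEncoder(feat_dim)  # encodes the segmented object point cloud
        self.cad_encoder = MLPEncoder(feat_dim)  # encodes the posed CAD point cloud
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 6),                   # e.g. rotation + translation residual
        )

    def forward(self, seg_points, cad_points):
        f1 = self.seg_encoder(seg_points)        # first feature
        f2 = self.cad_encoder(cad_points)        # second feature
        return self.head(torch.cat([f1, f2], dim=-1))
```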
Based on the scheme described above, the scheme is further explained with reference to fig. 15:
as shown in fig. 15, which is a schematic flowchart of obtaining a three-dimensional detection result of an object based on a color image and a depth image, in fig. 15, a proposal result of the object in the image (corresponding to the object proposal based on color and depth features shown in fig. 15) is first determined based on the depth image and the color image, and then, based on the proposal result, a three-dimensional detection result of the object is determined, the three-dimensional detection result including a three-dimensional segmentation result and a three-dimensional pose result (corresponding to the joint three-dimensional segmentation and pose estimation shown in fig. 15). Then, based on the three-dimensional detection result and the original image corresponding to the object (corresponding to the object three-dimensional shape information shown in fig. 15), the three-dimensional detection result is refined, including refining the three-dimensional segmentation result and the three-dimensional pose result (corresponding to the three-dimensional segmentation and pose estimation refinement shown in fig. 15), to obtain a refined three-dimensional detection result (corresponding to the object three-dimensional segmentation and object three-dimensional pose shown in fig. 15).
Fig. 16 shows a schematic flowchart of an image processing method provided by the present invention, and as shown in fig. 16, the method includes steps S210 and S220, wherein:
Step S210: obtaining deformation information applied by the virtual object to the real object (also referred to as the object to be deformed) in the image to be processed.
Step S220: deforming the real object based on the deformation information to obtain a deformed image to be processed.
Based on the deformation information, the real object in the image to be processed can be deformed, so that the virtual object and the real object are interacted.
In an alternative scheme of the present invention, deforming a real object based on deformation information to obtain a deformed image to be processed includes:
determining an original image corresponding to the object to be deformed, wherein the original image is an image corresponding to the object to be deformed in a reference posture;
determining a transformation relation between a deformed image corresponding to the object to be deformed and an image before deformation based on a three-dimensional posture result corresponding to the object to be deformed, deformation information and an original image corresponding to the object to be deformed, wherein the image before deformation is an image corresponding to the object to be deformed in the image to be processed;
determining a deformed image corresponding to the object to be deformed based on the transformation relation and the image corresponding to the object to be deformed;
and determining the deformed image to be processed based on the deformed image corresponding to the object to be deformed.
The object to be deformed refers to an object that can be deformed, such as a bed or a sofa. The deformation request refers to a request for deforming the object to be deformed, which may be triggered by a user through a specified identifier on a user interface. In an alternative of the present invention, if a virtual object is included in the image to be processed, the virtual object may be a virtual object implemented by augmented reality technology, and the deformation request may also be triggered based on motion information of the virtual object acting on the object to be deformed; in this case, the deformation information may be determined based on the motion information, and the deformation information includes a deformation direction and a deformation displacement of the object.
The deformation information in each deformation request may be different or the same. The deformation information may also be pre-configured; for example, objects of different object types may correspond to different deformation information.
In order to enable the object to be deformed to be deformed correspondingly based on the deformation information, a transformation relation is determined based on the deformation information. The transformation relation represents the correspondence between the deformed image corresponding to the object to be deformed and the image before deformation; that is, the image corresponding to the object to be deformed in the image to be processed is the image before deformation, the image obtained by deforming it based on the deformation information is the deformed image, and the deformed image can be obtained from the image before deformation through the transformation relation. Because the object to be deformed has a corresponding posture in the image to be processed (the posture corresponding to the three-dimensional posture result), the three-dimensional posture result of the object to be deformed can be combined when the transformation relation is determined, so that the determined transformation relation is more accurate.
It is understood that the above-mentioned image to be processed may be the image to be processed in the scheme shown in fig. 1, and the three-dimensional pose result may also be the three-dimensional pose result based on the scheme described above.
In an alternative of the invention, the object to be deformed is determined based on the result of the three-dimensional segmentation of the image to be processed.
Each object in the image to be processed has a corresponding three-dimensional segmentation result, the object to be deformed is any object in the image to be processed, each object in the image to be processed can be distinguished based on the three-dimensional segmentation result, and the object to be deformed in the image to be processed can be accurately determined based on the three-dimensional segmentation result. Since the image to be processed includes the depth image and the color image, the image of the object to be deformed corresponding to the image to be processed may be the color image or the depth image.
In an alternative of the present invention, the three-dimensional detection result includes a three-dimensional segmentation result, and determining an original image corresponding to the object to be deformed may include:
determining the object type of the object to be deformed based on the three-dimensional segmentation result of the object to be deformed;
and determining an original image corresponding to the object to be deformed based on the object type of the object to be deformed.
Objects of different object categories correspond to different original images, and the original image corresponding to an object can be determined more accurately according to its object category.
In an alternative of the present invention, determining a transformation relationship between a deformed image corresponding to an object to be deformed and an image before deformation based on a three-dimensional posture result corresponding to the object to be deformed, deformation information, and an original image corresponding to the object to be deformed may include:
determining deformed deformation points corresponding to the object to be deformed in the original image based on the original image, the deformation information and the corresponding relation of the object to be deformed, wherein the corresponding relation is established based on the corresponding deformation points of the object in the sample image before and after deformation under different deformation information;
and determining a transformation relation between the deformed image corresponding to the object to be deformed and the image before deformation based on the deformed deformation point corresponding to the object to be deformed, the deformed point of the object to be deformed before deformation and the three-dimensional posture result corresponding to the object to be deformed.
The corresponding relationship may be established in advance based on the sample image, the object in the sample image may also be a deformable object, and the sample image may be an original image. For the object in the original image, the corresponding relation between the deformation points of different objects before and after deformation can be determined based on different deformation information. Based on the corresponding relation, the deformation point of the object to be deformed after deformation can be determined under different deformation information. After the deformed deformation point corresponding to the object to be deformed in the original image is determined, the transformation relation can be determined by combining the deformed point before the object to be deformed is deformed and the three-dimensional posture result corresponding to the object to be deformed.
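As an illustrative sketch only (the table keyed by object class and a discretized displacement bucket is an assumption; the disclosure only requires that the deformed deformation points can be obtained from the pre-built correspondence):

```python
import numpy as np

# Hypothetical correspondence table built offline from sample images:
# (object_class, displacement_bucket) -> per-control-point offsets after deformation.
CORRESPONDENCE = {
    ("bed", 1): np.array([[0.0, -0.01, 0.0], [0.0, -0.02, 0.0]]),
    ("bed", 2): np.array([[0.0, -0.02, 0.0], [0.0, -0.04, 0.0]]),
}

def lookup_deformed_points(control_points: np.ndarray, object_class: str,
                           displacement: float) -> np.ndarray:
    """Return the deformed control points by applying the offsets recorded for the
    nearest displacement bucket (illustrative 1 cm discretization)."""
    bucket = max(1, int(round(displacement * 100)))
    offsets = CORRESPONDENCE[(object_class, bucket)]
    return control_points + offsets
```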
In an alternative of the present invention, after the deformed deformation point corresponding to the object to be deformed in the original image is determined, since the original image is a three-dimensional image and the three-dimensional posture result is three-dimensional data, before the transformation relationship is determined, the three-dimensional data can be converted into two-dimensional data, and the obtained transformation relationship is also obtained based on the two-dimensional data. Wherein converting the three-dimensional data into the two-dimensional data may be based on a projection relationship between the three-dimensional data and the two-dimensional data.
In an alternative of the present invention, determining a transformation relationship between a deformed image corresponding to an object to be deformed and an image before deformation based on a deformed deformation point corresponding to the object to be deformed, a deformation point before deformation of the object to be deformed, and a three-dimensional posture result corresponding to the object to be deformed may include:
determining the weight of each deformation point in the deformation points corresponding to the object to be deformed;
and determining a transformation relation between the deformed image corresponding to the object to be deformed and the image before deformation based on the weight of each deformation point, the deformed deformation point corresponding to the object to be deformed, the deformation point before deformation of the object to be deformed and the three-dimensional posture result corresponding to the object to be deformed.
For the deformation points corresponding to the object to be deformed, the deformation effect of the object is composed of the deformation effect of each deformation point. In practical applications, because of the stress point of the object to be deformed or the force-applying object (for example, the virtual object), different deformation points of the object may correspond to different deformation strengths. For example, the deformation strength corresponding to the stress point of the object is greater than the deformation strength corresponding to the other points around the stress point, so that the deformation effect of the object to be deformed is more realistic.
In order to make the deformation effect of the object to be deformed more realistic, before the object to be deformed is deformed, the weight of each deformation point among the deformation points corresponding to the object can be determined. The magnitude of the weight represents the deformation strength of the deformation point: the larger the weight, the greater the deformation strength. Therefore, based on the different weights corresponding to the deformation points, a more realistic deformation effect can be obtained when the object to be deformed is deformed.
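A hedged sketch of the weighted formulation described above, assuming a Gaussian fall-off of the weights around the stress point and a weighted least-squares affine warp as the transformation relation (the disclosure fixes neither the weighting rule nor the warp model):

```python
import numpy as np

def control_point_weights(points: np.ndarray, stress_point: np.ndarray,
                          sigma: float = 0.1) -> np.ndarray:
    """Illustrative weighting: points closer to the stress (contact) point deform more strongly."""
    d = np.linalg.norm(points - stress_point, axis=1)
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))

def estimate_affine_warp(src: np.ndarray, dst: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted least-squares 2D affine transform mapping pre-deformation control points
    src (N, 2) to post-deformation points dst (N, 2); heavier points dominate the fit."""
    A = np.hstack([src, np.ones((src.shape[0], 1))])  # homogeneous source points
    sw = np.sqrt(weights)[:, None]
    X, *_ = np.linalg.lstsq(sw * A, sw * dst, rcond=None)
    return X.T                                        # (2, 3) affine matrix
```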
In an alternative scheme of the invention, the deformed image to be processed is determined based on the deformed image corresponding to the object to be deformed, and the method comprises at least one of the following steps:
replacing the image before deformation in the image to be processed with the deformed image corresponding to the object to be deformed to obtain the deformed image to be processed;
determining a difference image based on the deformed image corresponding to the object to be deformed and the image before deformation corresponding to the object to be deformed, and determining the deformed image to be processed based on the difference image.
When the deformed image to be processed is determined based on the deformed image corresponding to the object to be deformed, at least one of the following two ways may be used:
First, an image replacement mode is adopted: the image before deformation is replaced with the deformed image, that is, the deformed object to be deformed replaces the object to be deformed before deformation in the image.
Second, an image fusion mode is adopted: a difference image is determined based on the image before deformation and the deformed image, the difference image reflects the change of the image to be processed before and after deformation, and the image to be processed before deformation is then processed based on the difference image to obtain the deformed image to be processed.
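A minimal sketch of the two ways described above (the array layouts and the clipping to 8-bit values are implementation assumptions):

```python
import numpy as np

def replace_region(image: np.ndarray, warped_region: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """First way: overwrite the pre-deformation pixels of the object with the deformed ones."""
    out = image.copy()
    out[mask] = warped_region[mask]
    return out

def fuse_with_difference(image: np.ndarray, before: np.ndarray, after: np.ndarray) -> np.ndarray:
    """Second way: form a difference image from the images before and after deformation and
    add it to the image to be processed."""
    diff = after.astype(np.int16) - before.astype(np.int16)
    return np.clip(image.astype(np.int16) + diff, 0, 255).astype(np.uint8)
```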
In an alternative of the present invention, the image to be processed may be an image in a video, and based on a processing manner of an object to be deformed in the image to be processed, the same processing may be performed on the associated frame image related to the object to be deformed in the video, so that the object to be deformed in the associated frame image also has a corresponding deformation effect, and based on the processing, the deformation effect of the object to be deformed in the video may be obtained.
In an alternative of the present invention, the acquiring the deformation request for the object to be deformed in the image to be processed, where the image to be processed is an image in a video, may include:
determining an image and deformation information corresponding to the motion information based on the motion information of the object to be deformed by the virtual object in the video;
and generating a deformation request for the image corresponding to the motion information based on the image corresponding to the motion information and the deformation information.
The image corresponding to the motion information may include continuous multi-frame images in the video. The motion information is information such as the motion direction and motion strength of the virtual object, and based on the motion information of the virtual object, the deformation information of the object to be deformed can be determined. The larger the volume of the virtual object, the greater the corresponding motion strength; the larger the distance between the virtual object and the object to be deformed, the greater the corresponding motion strength. The greater the motion strength, the greater the corresponding deformation strength.
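A hedged sketch of deriving deformation information from the virtual object's motion information (the proportionality constant and the clamping value are illustrative assumptions; the disclosure only states that the deformation strength grows with the motion strength):

```python
import numpy as np

def deformation_from_motion(motion_direction: np.ndarray, motion_strength: float,
                            max_displacement: float = 0.05) -> tuple[np.ndarray, float]:
    """Map the virtual object's motion to (deformation direction, deformation displacement).
    The linear scaling and the clamp are illustrative choices, not taken from the disclosure."""
    direction = motion_direction / np.linalg.norm(motion_direction)
    displacement = min(max_displacement, 0.01 * motion_strength)
    return direction, displacement
```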
In order to better understand the above solution, the following describes the solution of the embodiment of the present invention in further detail with reference to the example of a specific application scenario.
Fig. 17 shows a flow diagram of a method for interaction between a virtual object and a real deformable object. First, based on the image to be processed, which includes a color image and a depth image, the three-dimensional detection result of the object in the image to be processed is determined using the method described above; the three-dimensional detection result includes a three-dimensional segmentation result and a three-dimensional pose result (corresponding to the estimation of the three-dimensional pose of the object shown in fig. 17).
The user triggers a deformation request for an object to be deformed in the image to be processed through an augmented reality (AR) controller; specifically, the deformation request may be triggered based on a virtual object in the scene corresponding to the image to be processed, and the deformation request includes deformation information.
The object type of the object to be deformed is determined based on its three-dimensional segmentation result (corresponding to the object detection in fig. 17), and the original image corresponding to the object to be deformed is retrieved from the three-dimensional CAD model based on that object type; the deformed deformation points of the object to be deformed in the original image are then determined based on the deformation information, the original image corresponding to the object to be deformed, and the correspondence relationship. The correspondence relationship is established based on the corresponding deformation points of objects in sample images before and after deformation under different deformation information: the object deformable surface control points at time t0 are the deformation points before deformation, the object deformable surface control points at time t1 are the deformation points after deformation, and the correspondence relationship can be established based on the deformation points before and after deformation (corresponding to the deformable model mesh generation shown in fig. 17).
Since the original image is a three-dimensional image and the three-dimensional pose result is three-dimensional data, the three-dimensional data and the three-dimensional image are converted into two-dimensional data through a projection relationship between three-dimensional data and two-dimensional data (the 3D-2D projection shown in fig. 17). After the conversion, posture transformation is performed on the deformed deformation points of the object to be deformed in the original image based on the three-dimensional posture result of the object to be deformed, so that the object to be deformed in the transformed original image has the same posture as in the three-dimensional posture result. Then, based on the deformed deformation points of the object to be deformed in the original image after posture transformation and the deformation points of the object to be deformed before deformation, a transformation relationship between the deformed image corresponding to the object to be deformed and the image before deformation is determined (corresponding to the generated image deformation map shown in fig. 17).
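The 3D-2D projection step can be sketched with a standard pinhole model (the availability of the camera intrinsic matrix K from the RGB-D camera calibration is an assumption):

```python
import numpy as np

def project_points(points_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project 3D points in camera coordinates to 2D pixel coordinates with a pinhole
    intrinsic matrix K (3x3)."""
    p = points_3d @ K.T            # (N, 3)
    return p[:, :2] / p[:, 2:3]    # divide by depth -> (N, 2) pixel coordinates
```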
Based on the transformation relation, the image of the object to be deformed after being deformed can be determined based on the image of the object to be deformed before being deformed. Then, for the image to be processed, an image corresponding to the object to be deformed may be cut from the color image based on the object to be deformed (corresponding to the color image cut shown in fig. 17); then, based on the established transformation relationship, the image corresponding to the object to be deformed is subjected to image deformation, so as to obtain a deformed image (corresponding to the deformed color image shown in fig. 17).
Determining the deformed image to be processed based on the deformed image corresponding to the object to be deformed may be done in two ways. The first way: based on the deformed image, the object to be deformed before deformation in the image to be processed is replaced with the deformed object (corresponding to the object in the substitute video shown in fig. 17) according to the video see-through (video transmission) principle applied in AR systems, where the image to be processed may be an image in a video. The second way: a difference image is determined based on the deformed image and the image before deformation (the image corresponding to the object to be deformed in the color image) (corresponding to the difference image of the images before and after deformation shown in fig. 17); based on the optical see-through (optical transmission) principle applied in AR systems, the deformed image to be processed is determined based on the difference image, and the difference image can specifically be added into the augmented reality light path, so that the object to be deformed in the image to be processed exhibits the deformation effect.
Specifically, with reference to the schematic diagram of the deformation process of the virtual object to deform the real object in the image shown in fig. 18, the scene corresponding to the color image and the depth image in the image to be processed is a bedroom, and the objects in the bedroom include a bed (bed), a sofa (sofa), a pillow (pillow), a curtain (curtain), and the like, where the bed, the sofa, the pillow, and the curtain are deformable objects and can be used as the objects to be deformed.
Based on the depth image and the color image, the three-dimensional detection result of each object in the image to be processed can be determined based on the scheme described above; the three-dimensional detection result comprises a three-dimensional segmentation result (corresponding to the three-dimensional object segmentation shown in fig. 18) and a three-dimensional posture result (three-dimensional object posture). As can be seen from the schematic diagram corresponding to the three-dimensional object segmentation in fig. 18, the bed, the sofa, the pillow, and the curtain in the image to be processed all have corresponding segmentation results, and as can be seen from the schematic diagram corresponding to the three-dimensional object posture, they all have corresponding posture results. The object to be deformed has deformable surface control points, that is, deformation points at which the surface of the object can be deformed; for example, in the schematic diagram of the segmentation result, the mesh corresponding to each object to be deformed may represent its surface deformable points, and the surface deformable points of the bed may be the mesh on the upper surface of the bed.
When the virtual object is to interact with the bed in the image to be processed, the original image corresponding to the bed is determined from the object CAD model based on the three-dimensional segmentation result of the bed. As can be seen from the object CAD model shown in fig. 18, the model includes original images corresponding to objects of different object categories, including the original image of the bed, the original image of the sofa, and the original images of other objects (other).
Based on the deformation information applied by the virtual object to the bed and the original image corresponding to the bed, the image corresponding to the bed is deformed in the manner described above; that is, the three-dimensional mesh corresponding to the bed is deformed (corresponding to the deformation of the three-dimensional mesh shown in fig. 18) to obtain the deformed deformation points of the bed in the original image. Since the deformed deformation points of the bed are three-dimensional data, the three-dimensional data after the deformation of the bed are converted into two-dimensional data (corresponding to the deformation of the two-dimensional image shown in fig. 18) by 3D-2D projection. After the conversion, posture transformation is performed on the deformed deformation points of the bed in the original image based on the three-dimensional posture result of the bed, so that the posture of the bed in the transformed original image is the same as the posture of the corresponding bed in the three-dimensional posture result.
A transformation relation between the deformed image corresponding to the bed and the image before deformation is determined based on the deformation points before deformation and the deformation points after deformation in the original image after posture transformation; image deformation is performed on the two-dimensional image (the image corresponding to the bed in the image to be processed) based on the transformation relation to obtain the deformed image corresponding to the bed; and finally the deformed image to be processed is determined based on the deformed image corresponding to the bed. As shown by the AR effect in fig. 18, in the image to be processed the virtual object deforms the bed in the image, thereby achieving interaction between the virtual object and the object to be deformed in the image to be processed.
Based on the scheme, the same deformation processing can be carried out on the deformable objects such as sofas, curtains and the like in the scene. Fig. 19a shows the effect diagram before the sofa is deformed, in which the sphere with SAIT characters represents a virtual object, and it can be seen from fig. 19a that the surface of the sofa is in a flat state and is not deformed, i.e. the virtual object does not interact with the sofa. Fig. 19b shows the effect diagram after the sofa is deformed, where SAIT represents a virtual object, and it can be seen from fig. 19b that there is a deformed position on the surface of the sofa, which is in a concave state, i.e. the virtual object interacts with the sofa.
Based on the same principle as the method shown in fig. 1, an embodiment of the present invention further provides an image processing apparatus 20, as shown in fig. 20, the image processing apparatus 20 may include an image acquisition module 210, a three-dimensional point cloud data determination module 220, and a proposal result determination module 230, wherein:
an image obtaining module 210, configured to obtain an image to be processed, where the image to be processed includes a depth image of a scene;
a three-dimensional point cloud data determining module 220, configured to determine, based on the depth image, three-dimensional point cloud data corresponding to the depth image;
and a proposal result determining module 230, configured to obtain a proposal result of the object in the scene based on the three-dimensional point cloud data.
Optionally, when obtaining a proposal result of an object in a scene based on the three-dimensional point cloud data, the proposal result determining module 230 is specifically configured to:
converting the three-dimensional point cloud data into a matrix based on the three-dimensional point cloud data;
determining a first feature map based on the matrix;
and obtaining a proposal result of the object in the scene based on the first feature map.
Optionally, when the proposed result determining module 230 determines the matrix corresponding to the three-dimensional point cloud data based on the three-dimensional point cloud data, it is specifically configured to:
determining point cloud data belonging to an object in the three-dimensional point cloud data;
and determining a matrix corresponding to the three-dimensional point cloud data based on the point cloud data belonging to the object in the three-dimensional point cloud data.
Optionally, the image to be processed further includes a color image of the scene, and the apparatus further includes:
the characteristic extraction module is used for extracting the characteristics of the color image to obtain a second characteristic diagram;
the proposal result determining module is specifically configured to, when obtaining a proposal result of an object in a scene based on the first feature map:
and obtaining a proposal result of the object in the scene based on the first feature map and the second feature map.
Optionally, when obtaining the proposed result of the object in the scene based on the first feature map and the second feature map, the proposed result determining module 230 is specifically configured to:
fusing the first characteristic diagram and the second characteristic diagram to obtain a third characteristic diagram corresponding to the image to be processed;
and obtaining a proposal result of the object in the scene based on the third feature map.
Optionally, when obtaining the proposed result of the object in the scene based on the third feature map, the proposed result determining module 230 is specifically configured to:
segmenting an image to be processed to obtain at least two sub-images;
determining a proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature map corresponding to the adjacent sub-image of each sub-image;
and fusing the proposal results corresponding to the sub-images to obtain the proposal result of the object in the scene.
Optionally, when determining the proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature map corresponding to the neighboring sub-image of each sub-image, the proposal result determining module 230 is specifically configured to:
determining a weight for each sub-image;
and determining a proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature map corresponding to the adjacent sub-image of each sub-image and the weight corresponding to each sub-image.
Optionally, the proposal result determining module 230 determines the weight of each sub-image by any one of the following methods:
determining the weight of each sub-image based on the sub-feature map corresponding to each sub-image;
and determining candidate points of the image to be processed, and determining the weight corresponding to each sub-image based on the candidate points corresponding to each sub-image or the sub-feature maps corresponding to the candidate points corresponding to each sub-image.
Optionally, when the proposed result determining module 230 determines the weight corresponding to each sub-image based on the candidate point corresponding to each sub-image, it is specifically configured to:
for the candidate point corresponding to each sub-image, determining the similarity relation between the candidate point and the candidate points of the adjacent sub-images; determining the weight corresponding to each sub-image based on the similarity relation between each candidate point and the candidate points of the adjacent sub-images;
the proposal result determining module 230 determines the weight of each sub-image based on the sub-feature map corresponding to each sub-image by any one of the following methods:
for each sub-image, determining a first feature vector corresponding to the center position of the sub-image and a second feature vector corresponding to a sub-feature map corresponding to the sub-image; determining the weight of each sub-image based on the first characteristic vector and the second characteristic vector corresponding to each sub-image;
for the corresponding sub-feature map of each sub-image, the sub-feature map corresponds to at least one probability value, and each probability value represents the probability that the sub-feature map belongs to the corresponding object; the highest probability value of the at least one probability value is used as the weight of the subimage.
Optionally, the apparatus further comprises:
and the three-dimensional detection result determining module is used for determining a three-dimensional detection result of the object in the image to be processed based on the proposal result, wherein the three-dimensional detection result comprises at least one of a three-dimensional posture result and a three-dimensional segmentation result.
Optionally, the three-dimensional detection result includes a three-dimensional posture result and a three-dimensional segmentation result;
the three-dimensional detection result determining module is specifically configured to, when determining the three-dimensional detection result of the object in the image to be processed based on the proposal result:
extracting three-dimensional point cloud characteristics and two-dimensional image characteristics corresponding to the proposal result;
splicing the three-dimensional point cloud characteristic and the two-dimensional image characteristic to obtain a fourth characteristic diagram;
and determining a three-dimensional detection result of the object in the image to be processed based on the fourth feature map.
Optionally, the three-dimensional detection result determining module is specifically configured to, when determining the three-dimensional detection result of the object in the image to be processed based on the proposal result:
determining an initial three-dimensional detection result of an object in the image to be processed based on the proposal result;
determining an original image corresponding to an object in an image to be processed;
determining difference information corresponding to the initial three-dimensional detection result of each object based on the initial three-dimensional detection result of each object and the corresponding original image;
and updating the initial three-dimensional detection result of the corresponding object based on the difference information corresponding to the initial three-dimensional detection result of each object to obtain the three-dimensional detection result of each object in the image to be processed.
Based on the same principle as the method shown in fig. 16, an embodiment of the present invention also provides an image processing apparatus 30, as shown in fig. 21, the image processing apparatus 30 may include a deformation information obtaining module 310 and an image deformation module 320, wherein:
a deformation information obtaining module 310, configured to obtain deformation information of a virtual object on a real object in an image to be processed;
and the image deformation module 320 is configured to deform the real object based on the deformation information to obtain a deformed to-be-processed image.
Optionally, the image deformation module 320 is specifically configured to, when deforming the real object based on the deformation information to obtain the deformed to-be-processed image:
determining an original image corresponding to a real object;
determining a transformation relation between a deformed image corresponding to the real object and an image before deformation based on a three-dimensional attitude result corresponding to the real object, deformation information and an original image corresponding to the real object, wherein the image before deformation is an image corresponding to the real object in the image to be processed;
determining a deformed image corresponding to the real object based on the transformation relation and the image corresponding to the real object;
and determining the deformed image to be processed based on the deformed image corresponding to the real object.
Optionally, when determining the transformation relationship between the deformed image corresponding to the object to be deformed and the image before deformation based on the three-dimensional posture result, the deformation information, and the original image corresponding to the object to be deformed, the image deformation module 320 is specifically configured to:
determining deformed deformation points corresponding to the object to be deformed in the original image based on the original image, the deformation information and the corresponding relation of the object to be deformed, wherein the corresponding relation is established based on the corresponding deformation points of the object in the sample image before and after deformation under different deformation information;
and determining a transformation relation between the deformed image corresponding to the object to be deformed and the image before deformation based on the deformed deformation point corresponding to the object to be deformed, the deformed point of the object to be deformed before deformation and the three-dimensional posture result corresponding to the object to be deformed.
Optionally, the image deformation module 320 is specifically configured to, when determining the transformation relationship between the deformed image corresponding to the object to be deformed and the image before deformation based on the deformed deformation point corresponding to the object to be deformed, the deformation point before deformation of the object to be deformed, and the three-dimensional posture result corresponding to the object to be deformed:
determining the weight of each deformation point in the deformation points corresponding to the object to be deformed;
and determining a transformation relation between the deformed image corresponding to the object to be deformed and the image before deformation based on the weight of each deformation point, the deformed deformation point corresponding to the object to be deformed, the deformation point before deformation of the object to be deformed and the three-dimensional posture result corresponding to the object to be deformed.
Optionally, when determining the deformed to-be-processed image based on the deformed image corresponding to the to-be-deformed object, the image deformation module 320 determines by at least one of the following methods:
replacing the image before deformation in the image to be processed with the deformed image corresponding to the object to be deformed to obtain the deformed image to be processed;
determining a difference image based on the deformed image corresponding to the object to be deformed and the image before deformation corresponding to the object to be deformed, and determining the deformed image to be processed based on the difference image.
Since the image processing apparatus provided in the embodiment of the present invention is an apparatus capable of executing the image processing method in the embodiment of the present invention, a person skilled in the art can understand a specific implementation manner of the image processing apparatus in the embodiment of the present invention and various modifications thereof based on the image processing method provided in the embodiment of the present invention, and therefore, how to implement the image processing method in the embodiment of the present invention by the image processing apparatus is not described in detail herein. The image processing apparatus used by those skilled in the art to implement the image processing method in the embodiments of the present invention is within the scope of the present application.
Based on the same principle as the image processing method and the image processing apparatus provided by the embodiment of the present invention, an embodiment of the present invention also provides an electronic device, which may include a processor and a memory. Wherein the memory has stored therein readable instructions, which when loaded and executed by the processor, may implement the method shown in any of the embodiments of the present invention.
As an example, fig. 22 shows a schematic structural diagram of an electronic device 4000 to which the solution of the embodiment of the present application is applied, and as shown in fig. 22, the electronic device 4000 may include a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 22, but this does not indicate only one bus or one type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. The processor 4001 is configured to execute application code stored in the memory 4003 to implement the scheme shown in any one of the foregoing method embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that, for those skilled in the art, various improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (20)

1. An image processing method, comprising:
acquiring an image to be processed, wherein the image to be processed comprises a depth image of a scene;
determining three-dimensional point cloud data corresponding to the depth image based on the depth image;
and obtaining a proposal result of the object in the scene based on the three-dimensional point cloud data.
2. The method of claim 1, wherein obtaining a proposal result of the object in the scene based on the three-dimensional point cloud data comprises:
converting the three-dimensional point cloud data into a matrix based on the three-dimensional point cloud data;
determining a first feature map based on the matrix;
and obtaining a proposal result of the object in the scene based on the first feature map.
3. The method of claim 2, wherein determining a corresponding matrix for the three-dimensional point cloud data based on the three-dimensional point cloud data comprises:
determining point cloud data belonging to an object in the three-dimensional point cloud data;
and determining a matrix corresponding to the three-dimensional point cloud data based on the point cloud data belonging to the object in the three-dimensional point cloud data.
4. The method according to any one of claims 1 to 3, wherein the image to be processed further comprises a color image of the scene, the method further comprising:
performing feature extraction on the color image to obtain a second feature map;
the obtaining of the proposal result of the object in the scene based on the first feature map comprises:
and obtaining a proposal result of the object in the scene based on the first feature map and the second feature map.
5. The method according to claim 4, wherein obtaining a proposed result of the object in the scene based on the first feature map and the second feature map comprises:
fusing the first characteristic diagram and the second characteristic diagram to obtain a third characteristic diagram corresponding to the image to be processed;
and obtaining a proposal result of the object in the scene based on the third feature map.
6. The method according to claim 5, wherein the obtaining a proposed result of the object in the scene based on the third feature map comprises:
segmenting the image to be processed to obtain at least two sub-images;
determining a proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature map corresponding to the adjacent sub-image of each sub-image;
and fusing the proposal results corresponding to the sub-images to obtain the proposal results of the objects in the scene.
7. The method according to claim 6, wherein the determining the proposed result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature maps corresponding to adjacent sub-images of each sub-image comprises:
determining a weight for each sub-image;
and determining a proposal result corresponding to each sub-image based on the third feature map corresponding to each sub-image and/or the third feature map corresponding to the adjacent sub-image of each sub-image and the weight corresponding to each sub-image.
8. The method of claim 7, wherein determining the weight for each sub-image comprises any of:
determining the weight of each sub-image based on the sub-feature map corresponding to each sub-image;
and determining candidate points of the image to be processed, and determining the weight corresponding to each sub-image based on the candidate points corresponding to the sub-image or on the sub-feature map corresponding to those candidate points.
9. The method of claim 8, wherein determining the weight corresponding to each sub-image based on the candidate points corresponding to each sub-image comprises:
for the candidate point corresponding to each sub-image, determining the similarity relation between the candidate point and the candidate points of the adjacent sub-images; determining the weight corresponding to each sub-image based on the similarity relation between each candidate point and the candidate points of the adjacent sub-images;
the determining the weight of each sub-image based on the sub-feature map corresponding to each sub-image comprises any one of the following steps:
for each sub-image, determining a first feature vector corresponding to the center position of the sub-image and a second feature vector corresponding to a sub-feature map corresponding to the sub-image; determining the weight of each sub-image based on the first feature vector and the second feature vector corresponding to each sub-image;
for the sub-feature map corresponding to each sub-image, the sub-feature map corresponds to at least one probability value, each probability value representing the probability that the sub-feature map belongs to a corresponding object; and taking the maximum of the at least one probability value as the weight of the sub-image.
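Of the weighting alternatives listed in claim 9, the last (taking the maximum per-object probability of the sub-feature map) is the simplest to illustrate. The sketch below assumes the probabilities are obtained by a softmax over raw class scores; that normalisation step is an assumption.

```python
import numpy as np

def sub_image_weight(class_scores):
    """Weight a sub-image by the largest per-object probability of its
    sub-feature map (the last alternative listed in claim 9).

    class_scores: 1-D array of raw object scores; the softmax step that
    turns them into probabilities is an illustrative assumption.
    """
    scores = np.asarray(class_scores, dtype=float)
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    probs /= probs.sum()
    return float(probs.max())
```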
10. The method according to any one of claims 1 to 3, further comprising:
and determining a three-dimensional detection result of the object in the image to be processed based on the proposal result, wherein the three-dimensional detection result comprises at least one of a three-dimensional pose result and a three-dimensional segmentation result.
11. The method of claim 10, wherein the three-dimensional detection result comprises a three-dimensional pose result and a three-dimensional segmentation result;
and determining the three-dimensional detection result of the object in the image to be processed based on the proposal result comprises:
extracting a three-dimensional point cloud feature and a two-dimensional image feature corresponding to the proposal result;
concatenating the three-dimensional point cloud feature and the two-dimensional image feature to obtain a fourth feature map;
and determining the three-dimensional detection result of the object in the image to be processed based on the fourth feature map.
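Claim 11 concatenates the per-proposal point cloud feature and image feature into a fourth feature map before the detection head. A minimal sketch of that concatenation is shown below, under the assumption that each proposal is represented by a flat feature vector.

```python
import numpy as np

def build_fourth_feature(point_cloud_feat, image_feat):
    """Concatenate the 3D point-cloud feature and the 2D image feature of a
    single proposal into one vector (the 'fourth feature map' of claim 11).

    Both inputs are assumed to be per-proposal feature vectors; any pooling
    that produces them is outside this sketch.
    """
    return np.concatenate([np.ravel(point_cloud_feat), np.ravel(image_feat)])
```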
12. The method according to claim 10, wherein the determining a three-dimensional detection result of the object in the image to be processed based on the proposal result comprises:
determining an initial three-dimensional detection result of an object in the image to be processed based on the proposal result;
determining an original image corresponding to an object in the image to be processed;
determining difference information corresponding to the initial three-dimensional detection result of each object based on the initial three-dimensional detection result of each object and the corresponding original image;
and updating the initial three-dimensional detection result of the corresponding object based on the difference information corresponding to the initial three-dimensional detection result of each object to obtain the three-dimensional detection result of each object in the image to be processed.
13. An image processing method, comprising:
acquiring deformation information of a virtual object with respect to a real object in an image to be processed;
and deforming the real object based on the deformation information to obtain a deformed image to be processed.
14. The method according to claim 13, wherein deforming the real object based on the deformation information to obtain the deformed image to be processed comprises:
determining an original image corresponding to the real object;
determining a transformation relation between a deformed image corresponding to the real object and an image before deformation based on a three-dimensional pose result corresponding to the real object, the deformation information, and the original image corresponding to the real object, wherein the image before deformation is the image corresponding to the real object in the image to be processed;
determining the deformed image corresponding to the real object based on the transformation relation and the image corresponding to the real object;
and determining the deformed image to be processed based on the deformed image corresponding to the real object.
15. The method according to claim 14, wherein determining the transformation relation between the deformed image and the image before deformation corresponding to the object to be deformed based on the three-dimensional pose result corresponding to the object to be deformed, the deformation information, and the original image corresponding to the object to be deformed comprises:
determining deformation points, after deformation, corresponding to the object to be deformed in the original image based on the original image of the object to be deformed, the deformation information, and a corresponding relationship, wherein the corresponding relationship is established based on deformation points of objects in sample images before and after deformation under different deformation information;
and determining the transformation relation between the deformed image corresponding to the object to be deformed and the image before deformation based on the deformation points after deformation corresponding to the object to be deformed, the deformation points before deformation of the object to be deformed, and the three-dimensional pose result corresponding to the object to be deformed.
16. The method according to claim 15, wherein determining the transformation relation between the deformed image and the image before deformation corresponding to the object to be deformed based on the deformation points after deformation corresponding to the object to be deformed, the deformation points before deformation of the object to be deformed, and the three-dimensional pose result corresponding to the object to be deformed comprises:
determining a weight of each deformation point among the deformation points corresponding to the object to be deformed;
and determining the transformation relation between the deformed image corresponding to the object to be deformed and the image before deformation based on the weight of each deformation point, the deformation points after deformation corresponding to the object to be deformed, the deformation points before deformation of the object to be deformed, and the three-dimensional pose result corresponding to the object to be deformed.
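Claims 15 and 16 estimate a transformation between the image before and after deformation from pairs of deformation points, optionally weighting each pair. One concrete choice, assumed here rather than stated in the claims, is a weighted least-squares affine fit between the two 2-D point sets:

```python
import numpy as np

def fit_weighted_affine(src_pts, dst_pts, weights):
    """Estimate a 2-D affine transform A (2 x 3) mapping src_pts to dst_pts
    by weighted least squares, i.e. minimising
    sum_i w_i * || A [x_i, y_i, 1]^T - dst_i ||^2.

    The affine model and the least-squares criterion are illustrative
    assumptions; the claims only require 'a transformation relation'.
    """
    src = np.asarray(src_pts, dtype=float)          # (N, 2) points before deformation
    dst = np.asarray(dst_pts, dtype=float)          # (N, 2) points after deformation
    w = np.sqrt(np.asarray(weights, dtype=float))   # per-point weights
    ones = np.ones((len(src), 1))
    X = np.hstack([src, ones]) * w[:, None]         # weighted homogeneous coordinates
    Y = dst * w[:, None]
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)       # solves X @ A = Y, A is (3, 2)
    return A.T                                      # (2, 3) affine matrix
```

A similarity or thin-plate-spline model could replace the affine fit without changing the weighting idea.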
17. The method according to claim 16, wherein determining the deformed image to be processed based on the deformed image corresponding to the object to be deformed comprises at least one of:
replacing the image before deformation in the image to be processed with the deformed image corresponding to the object to be deformed to obtain the deformed image to be processed;
determining a difference image based on the deformed image corresponding to the object to be deformed and the image before deformation corresponding to the object to be deformed, and determining the deformed image to be processed based on the difference image.
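Claim 17 gives two ways to write the deformed object back into the image to be processed; the difference-image variant is sketched below. Treating the difference as a pixel-wise subtraction that is added back at the object's location is an assumption about one straightforward reading of the claim; the (top, left) placement and the 0-255 clipping range are likewise illustrative.

```python
import numpy as np

def apply_difference_image(full_image, deformed_obj, original_obj, top, left):
    """Add the difference between the deformed and pre-deformation object
    images back onto the full image at the object's location.

    The integer (top, left) placement and the 0-255 clipping range are
    illustrative assumptions.
    """
    result = full_image.astype(float)
    diff = deformed_obj.astype(float) - original_obj.astype(float)
    h, w = diff.shape[:2]
    result[top:top + h, left:left + w] += diff
    return np.clip(result, 0, 255).astype(full_image.dtype)
```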
18. An image processing apparatus, comprising:
a deformation information acquisition module for acquiring deformation information of a virtual object with respect to a real object in an image to be processed;
and an image deformation module for deforming the real object based on the deformation information to obtain a deformed image to be processed.
19. An electronic device, comprising a processor and a memory;
the memory stores readable instructions which, when loaded and executed by the processor, implement the method of any one of claims 1 to 17.
20. A computer-readable storage medium, characterized in that it stores at least one computer program which is loaded and executed by a processor to implement the method of any of claims 1 to 17.
CN201911115151.0A 2019-11-14 2019-11-14 Image processing method, image processing device, electronic equipment and computer storage medium Pending CN112802202A (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201911115151.0A CN112802202A (en) 2019-11-14 2019-11-14 Image processing method, image processing device, electronic equipment and computer storage medium
KR1020200108091A KR20210058638A (en) 2019-11-14 2020-08-26 Apparatus and method for image processing
US17/095,784 US11645756B2 (en) 2019-11-14 2020-11-12 Image processing apparatus and method
US18/126,042 US11900610B2 (en) 2019-11-14 2023-03-24 Image processing apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911115151.0A CN112802202A (en) 2019-11-14 2019-11-14 Image processing method, image processing device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN112802202A (en) 2021-05-14

Family

ID=75803775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911115151.0A Pending CN112802202A (en) 2019-11-14 2019-11-14 Image processing method, image processing device, electronic equipment and computer storage medium

Country Status (2)

Country Link
KR (1) KR20210058638A (en)
CN (1) CN112802202A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436273A (en) * 2021-06-28 2021-09-24 南京冲浪智行科技有限公司 3D scene calibration method, calibration device and calibration application thereof

Also Published As

Publication number Publication date
KR20210058638A (en) 2021-05-24

Similar Documents

Publication Publication Date Title
US20210150747A1 (en) Depth image generation method and device
Ma et al. Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera
Yang et al. Unsupervised learning of geometry from videos with edge-aware depth-normal consistency
US10803546B2 (en) Systems and methods for unsupervised learning of geometry from images using depth-normal consistency
US10839543B2 (en) Systems and methods for depth estimation using convolutional spatial propagation networks
Yang et al. Unsupervised learning of geometry with edge-aware depth-normal consistency
US11645756B2 (en) Image processing apparatus and method
WO2019153245A1 (en) Systems and methods for deep localization and segmentation with 3d semantic map
US20160019711A1 (en) Contour completion for augmenting surface reconstructions
Saxena et al. PWOC-3D: Deep occlusion-aware end-to-end scene flow estimation
Wang et al. 3d lidar and stereo fusion using stereo matching network with conditional cost volume normalization
Chen et al. Fixing defect of photometric loss for self-supervised monocular depth estimation
CN112950645B (en) Image semantic segmentation method based on multitask deep learning
KR20210058683A (en) Depth image generation method and device
Cheng et al. Geometry-aware recurrent neural networks for active visual recognition
dos Santos Rosa et al. Sparse-to-continuous: Enhancing monocular depth estimation using occupancy maps
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN113850900A (en) Method and system for recovering depth map based on image and geometric clue in three-dimensional reconstruction
Rosu et al. Semi-supervised semantic mapping through label propagation with semantic texture meshes
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
Liao et al. Adaptive depth estimation for pyramid multi-view stereo
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN114494395A (en) Depth map generation method, device and equipment based on plane prior and storage medium
Huang et al. A bayesian approach to multi-view 4d modeling
Lv et al. Memory‐augmented neural networks based dynamic complex image segmentation in digital twins for self‐driving vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination