CN114663502A - Object posture estimation and image processing method and related equipment
- Publication number: CN114663502A
- Application number: CN202011446331.XA
- Authority: CN (China)
- Prior art keywords: information, point cloud, image, determining, point
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/70 — Determining position or orientation of objects or cameras
- G06T7/73 — Determining position or orientation of objects or cameras using feature-based methods
- G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
- G06N3/045 — Neural networks; combinations of networks
- G06N3/047 — Neural networks; probabilistic or stochastic networks
- G06N3/08 — Neural networks; learning methods
- G06T7/11 — Region-based segmentation
- G06T7/194 — Segmentation; edge detection involving foreground-background segmentation
- G06T7/50 — Depth or shape recovery
- G06T7/90 — Determination of colour characteristics
- G06V10/40 — Extraction of image or video features
- G06T2207/10028 — Range image; depth image; 3D point clouds
- G06T2207/20081 — Training; learning
Abstract
The application provides an object posture estimation method, an image processing method and related equipment, belonging to the technical fields of image processing and artificial intelligence. The object posture estimation method comprises the following steps: acquiring image features corresponding to the point cloud of an input image; determining semantic segmentation information, example mask information and key point information of the object based on the image features; and estimating the object posture based on the semantic segmentation information, the example mask information and the key point information. The method provided by the application can effectively reduce the time required for object posture estimation. Moreover, the object posture estimation and image processing methods executed by the electronic equipment can be performed using an artificial intelligence model.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an object posture estimation method, an image processing method and related equipment.
Background
With the development of artificial intelligence technology, technologies such as augmented reality, computer vision and map navigation play an increasingly important role in people's life and work. Posture estimation can be used to estimate the posture of an object in an image captured by a camera, and is applied in many artificial intelligence technologies.
Posture estimation involves an instance segmentation task and a key point detection task. In the prior art, each task is generally performed by means of clustering, but clustering is time-consuming and cannot meet the requirements of scenes with high real-time demands.
Disclosure of Invention
The purpose of the application is to provide an object posture estimation method, an image processing method and related devices, so as to reduce the time required for image processing. The solutions provided by the embodiments of the application are as follows:
in a first aspect, the present application provides an object pose estimation method, including:
acquiring image characteristics corresponding to point clouds of an input image;
determining semantic segmentation information, example mask information and key point information of the object based on the image features;
and estimating the object posture based on the semantic segmentation information, the example mask information and the key point information.
In a second aspect, the present application provides an image processing method, including:
acquiring image characteristics corresponding to point clouds of an input image;
determining example mask information of the image, with the point cloud corresponding to the center of the object as a reference, through a multi-layer perceptron network for example mask segmentation based on the image characteristics;
image processing is performed based on the example mask information.
In a third aspect, the present application provides an object posture estimation device, including:
the first acquisition module is used for acquiring image characteristics corresponding to point clouds of an input image;
the first determination module is used for determining semantic segmentation information, example mask information and key point information of the object based on the image characteristics;
and the attitude estimation module is used for estimating the attitude of the object based on the semantic segmentation information, the example mask information and the key point information.
In a fourth aspect, the present application provides an image processing apparatus comprising:
the second acquisition module is used for acquiring image characteristics corresponding to point clouds of the input image;
the second determining module is used for determining example mask information of the image, with the point cloud corresponding to the center of the object as a reference, through a multi-layer perceptron network for example mask segmentation based on the image characteristics;
and the processing module is used for carrying out image processing based on the example mask information.
In a fifth aspect, the present application provides an electronic device comprising a memory and a processor; the memory has a computer program stored therein; and the processor is used for executing the object posture estimation method or the image processing method provided by the embodiment of the application when the computer program is run.
In a sixth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the object posture estimation method or the image processing method provided in the embodiments of the present application.
The advantageous effects brought by the technical solutions provided by the embodiments of the present application will be described in detail in the following description of the specific embodiments with reference to various alternative embodiments, and will not be further described herein.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a flow chart of a method for estimating an object pose according to an embodiment of the present application;
FIG. 2 is a flow chart of object pose estimation according to an example of the present application;
FIG. 3 is a diagram of a network architecture for performing pose estimation of an object in an example of the present application;
FIG. 4 is a flow chart of object pose estimation according to an example of the present application;
FIG. 5 is a diagram of a network architecture for performing pose estimation of an object in an example of the present application;
FIG. 6 is a flow chart of object pose estimation in an example of the present application;
FIG. 7 is a flow chart of object pose estimation according to an example of the present application;
FIG. 8 illustrates an example mask segmentation schematic based on location awareness according to the present application;
FIG. 9 is a schematic diagram illustrating another example mask segmentation based on location awareness according to the present application;
FIG. 10 is a diagram of a network architecture for performing pose estimation of an object in an example of the present application;
FIG. 11 is a flow chart illustrating object pose estimation according to an example of the present application;
FIG. 12 is a schematic diagram illustrating an application scenario of the present application;
FIG. 13 is a flowchart of an image processing method according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of an object posture estimation apparatus according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present application, and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
For better understanding and description of the solutions provided by the embodiments of the present application, the related art to which the present application relates will be described first.
The application belongs to the technical field of augmented reality, and particularly relates to learning-based multi-modal (color and depth) image processing and recognition, object recognition, instance segmentation, and 6-degree-of-freedom object posture estimation.
Augmented Reality (AR) is a technology that calculates the position and angle of a camera image in real time and adds a corresponding image, and superimposes a virtual object generated by a computer or information about a real object on a scene of the real world to realize the enhancement of the real world. Augmented reality technology provides a user with a real information experience by adding virtual content in the real scene in front of the user. In a three-dimensional space, an augmented reality system needs to have high-precision real-time processing and understanding on the three-dimensional state of a surrounding object so as to achieve the effect of presenting high-quality virtual-real fusion in front of a user.
Posture estimation represents the structure and shape of an object using a geometric model or structure, establishes a correspondence between the geometric model and the image by extracting features of the object, and then estimates the spatial posture of the object by geometric or other methods. Posture estimation technology can comprise an example segmentation task and a key point detection task, wherein the example segmentation task comprises a semantic segmentation task, which separates the target object from the background to determine the category of the target object, and an example mask segmentation task, which determines the pixel points belonging to each target object; the key point detection task is used to determine the position of the target object in the image. The posture estimation method provided by the embodiments of the application can be applied to the technical field of augmented reality.
6DoF pose estimation from an image: for a given image containing color and depth information, the 6DoF pose (also referred to as the six-degree-of-freedom pose) of a target object in the image, comprising a 3-dimensional position and a 3-dimensional spatial orientation, is estimated.
In the prior art, an object 6DoF pose estimation method based on RGBD images (Red + Green + Blue + Depth, i.e., images containing the three primary colors red, green and blue together with depth information) has been proposed for object posture estimation. The method predicts voting information for the key points of an object using a deep Hough voting network, then determines the 3D key points of the object by clustering and estimates the posture of the object by least-squares fitting. The method is an extension of the 2D key point method and can make full use of the depth information in the depth image. However, this method relies on clustering for object instance segmentation and key point detection, which is very time-consuming.
In the prior art, object instance segmentation methods generally adopt detection branches to extract objects, or rely on grouping/aggregation to gather the points of the same instance in order to find objects. However, the detection-branch-based approach cannot ensure that the instance labels of each point are consistent, and the grouping approach requires parameter adjustment and is computationally expensive. At present, the less time-consuming schemes among object instance segmentation methods rely on two-dimensional images; however, two-dimensional projection may introduce artifacts, such as occlusion and overlap between the image regions of multiple objects.
In order to solve at least one of the above problems, the present application provides an object posture estimation method, which performs example segmentation and key point regression of an object based on position information in three-dimensional space, thereby effectively avoiding the use of a clustering method and reducing the time consumed for estimating the object posture. The application also provides an image processing method, in which a multi-layer perceptron network for object example mask segmentation determines the example mask information of the object based on the point cloud corresponding to the center of the object, avoiding the clustering method while reducing the time consumed for processing the image.
In order to make the objects, technical solutions and advantages of the present application clearer, various alternative embodiments of the present application and how the technical solutions of the embodiments of the present application solve the above technical problems will be described in detail below with reference to specific embodiments and drawings. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings. Fig. 1 illustrates an object pose estimation method provided by an embodiment of the present application, where the method includes the following steps S101 to S103:
step S101: and acquiring image characteristics corresponding to point clouds of the input image.
Specifically, the object is an object having six degrees of freedom in three-dimensional space, also referred to as a six-degree-of-freedom object, and is an object belonging to the real world, such as a mouse, a fan or a desk.
The input image of the object may be an RGBD image (Red + Green + Blue + Depth, an image containing the three primary colors red, green and blue together with depth information) acquired by an image acquisition device, comprising a color image and a depth image; in some embodiments, the image of the object may also be a grayscale image. Alternatively, the image of the object may include only the depth image, or may include a color image or a grayscale image together with the depth image. The color image, the grayscale image and the depth image are images corresponding to the same scene and containing the same objects; the same image may include a plurality of different types of objects, and the types and number of objects are not limited herein. The color image, the grayscale image and the corresponding depth image are acquired by an image acquisition device or video acquisition device having the functions of acquiring the color image, the grayscale image and the depth image, which are not limited in the embodiments of the present application.
Specifically, whether a color image or a grayscale image is used may be configured according to the actual application requirements; for example, a color image may be used to obtain a better posture estimation effect, while a grayscale image may be used if the efficiency of posture estimation is critical. A grayscale image is an image with only one sampled color per pixel, with many levels of intensity between black and white.
In practical applications, continuous real-time object posture estimation is required in many scenes. In this case, optionally, an RGBD video of the object may be acquired by a video acquisition device, where each frame in the video is an RGBD image, and a color image and a depth image of the object can be extracted from the same video frame. For each video frame, a corresponding color image and depth image are obtained; the objects in the color image and the depth image corresponding to the same video frame are consistent, and the number of objects may be one or more. The posture can be estimated in real time from the color image and depth image corresponding to each acquired video frame based on the scheme provided by the embodiments of the application.
The point cloud may be a collection of a large number of points representing the surface characteristics of the object. The point cloud can be obtained according to a laser measurement principle and a photogrammetry principle, wherein the point cloud obtained according to the laser measurement principle comprises three-dimensional coordinates (XYZ) and laser reflection Intensity (Intensity); a point cloud obtained according to photogrammetry principles, comprising three-dimensional coordinates (XYZ) and color information (RGB); and combining laser measurement and photogrammetry principles to obtain a point cloud comprising three-dimensional coordinates (XYZ), laser reflection Intensity (Intensity) and color information (RGB). After obtaining the spatial coordinate information of each sampling Point on the surface of the object, a set of points is obtained, which is referred to as "Point Cloud" in this application.
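As an illustrative aid, the following Python sketch back-projects a depth image into such a point cloud under an assumed pinhole camera model; the intrinsics fx, fy, cx, cy and all variable names are hypothetical and are not taken from the disclosure.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an H x W depth map (in meters) into an N x 3 point cloud.

    Assumes a pinhole camera model; pixels with zero depth are discarded.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                            # horizontal direction
    y = (v - cy) * z / fy                            # vertical direction
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    valid = points[:, 2] > 0                         # keep pixels with measured depth
    return points[valid]

# Example: a synthetic 4x4 depth map with assumed intrinsics
depth = np.full((4, 4), 0.8)
cloud = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=2.0, cy=2.0)
print(cloud.shape)  # (16, 3)
```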
Step S102: semantic segmentation information, instance mask information, and keypoint information of the object are determined based on the image features.
Specifically, the semantic segmentation task refers to separating the object from the background: a semantic segmentation probability map is obtained from the input image features, and the predicted category of each point cloud can be determined based on this probability map.
Example mask segmentation (object detection) refers to determining the position of each object in the image: a probability map over the different positions used for example segmentation can be obtained from the input image features, and the position of the object to which each point cloud belongs can be determined based on the probability map.
Optionally, since the example mask segmentation task involves processing of the object position, the example mask segmentation is performed in the present application based primarily on point cloud information provided in the image features in combination with other image feature information. The point cloud information may include information characterizing spatial coordinate values of the point cloud.
Specifically, the key point information may be determined based on example mask information of the object, or may also be determined based on semantic segmentation information of the object and example mask information; optionally, the spatial coordinate value of the key point is obtained in step S102.
The key points of the object may be artificially defined key points, or key points calculated by using a farthest point sampling method (FPS).
Step S103: and estimating the object posture based on the semantic segmentation information, the example mask information and the key point information.
Specifically, based on the semantic segmentation information, the example mask information and the key point information, together with a geometric model of the same object (which may be a computer-aided design (CAD) model), the six-degree-of-freedom pose of the object, i.e. the three-dimensional rotation and three-dimensional translation transformation between the two sets of three-dimensional key points, is calculated by a least-squares fitting method.
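The least-squares fitting step can be illustrated with the standard SVD-based (Kabsch-style) solution for the rotation and translation between two sets of corresponding 3D key points; the sketch below shows this generic technique and is not necessarily the exact fitting routine used by the embodiments.

```python
import numpy as np

def fit_pose_least_squares(model_kps, pred_kps):
    """Least-squares fit of rotation R and translation t such that
    R @ model_kps[i] + t ~= pred_kps[i] (standard SVD-based solution)."""
    mu_m = model_kps.mean(axis=0)
    mu_p = pred_kps.mean(axis=0)
    H = (model_kps - mu_m).T @ (pred_kps - mu_p)    # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_p - R @ mu_m
    return R, t

# Example with K = 4 key points and a known ground-truth motion
rng = np.random.default_rng(0)
model = rng.normal(size=(4, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.5])
pred = model @ R_true.T + t_true
R, t = fit_pose_least_squares(model, pred)
print(np.allclose(R, R_true), np.allclose(t, t_true))  # True True
```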
Alternatively, the above steps S101-S103 may be implemented by a network as shown in fig. 2. As shown in fig. 2, the network comprises a feature extraction module, a semantic segmentation module, an example mask segmentation module, a key point detection module and a posture estimation module. The image of the object is input into the feature extraction module to extract the image features corresponding to the point cloud; the image features are respectively input into the semantic segmentation module and the example mask segmentation module to determine the semantic segmentation information (such as the semantic category of each point cloud) and the example mask information of the object; the image features are input into the key point detection module, and the key points (such as the spatial coordinate information of the key points) are obtained by combining the example mask information, or the example mask information and the semantic segmentation information; the key points, or the key points and the semantic segmentation information, are then input into the posture estimation module to determine the object and the posture information corresponding to the object. The semantic segmentation module, the example mask segmentation module and the key point detection module may respectively form a location-aware semantic segmentation network, an example mask segmentation network and a location-aware 3D key point regression network based on multi-layer perceptron (MLP) networks.
The image features (including point cloud information) corresponding to the point cloud can be extracted from the image of the object. Since points and surfaces of objects in 3D space do not overlap or occlude each other, using 3D position information to perceive the position and completeness of an object plays an important role. According to the embodiments of the application, based on the semantic segmentation information and the example mask segmentation information of the object perceived from the 3D position information, the key point information of each object can be quickly calculated by regression without a complex clustering step, which reduces the calculation time and improves the efficiency of key point detection; the six-degree-of-freedom pose of the object can then be calculated by least-squares fitting using the predicted key point information and the key point information of the three-dimensional model of the object. In addition, the object posture estimation method provided by the application also helps to improve the efficiency, precision and robustness of the system in augmented reality applications; the example segmentation and 3D key point regression networks may be connected to form an end-to-end trainable network, avoiding tedious parameter adjustment in multiple post-processing (voting and clustering) steps.
Various alternative embodiments provided by the present application are described in detail below.
In an optional embodiment of the present application, the step S101 of extracting the image features including the point cloud information based on the image of the object includes at least one of the following steps A1-A2:
step A1: and extracting point cloud features based on the input depth image, and confirming the extracted point cloud features as image features corresponding to the point cloud.
Step A2: extracting a first image feature based on the input color image and/or grayscale image; extracting point cloud features based on the input depth image; and fusing the first image characteristic and the point cloud characteristic to obtain an image characteristic corresponding to the point cloud.
The scheme provided by the embodiment can extract image features based on the depth image, and can also extract image features based on the color image and/or the gray image and the depth image, wherein the extraction of the image features can be performed in an image feature extraction network, an image feature extraction algorithm and the like.
In this scheme, feature extraction is performed based on a color image and/or grayscale image together with a depth image, and image feature extraction is performed through an image feature extraction network; that is, the color image and/or grayscale image and the depth image are input to the image feature extraction network. When the input image is a color image, the input of the image feature extraction network is an H × W 4-channel image obtained by pixel-by-pixel concatenation of the color image and the depth image, where H is the image height, W is the image width, and the 4 channels are the three channels corresponding to the RGB data of the color image plus the channel corresponding to the depth data of the depth image. The output of the image feature extraction network includes an image feature vector for each pixel.
Optionally, the point cloud features are extracted from the depth image: the depth image may first be converted into a point cloud to obtain the point cloud data corresponding to the depth image (for example, converting the two-dimensional image into three-dimensional points), and the point cloud features are then extracted based on the point cloud data to obtain the point cloud features corresponding to the depth image. The point cloud feature extraction may be performed through a point cloud feature extraction network, whose input is the point cloud data and whose output includes a point cloud feature vector for each three-dimensional point, from which the point cloud feature vector of each pixel is then obtained.
In one embodiment, step A2 may be implemented using the network architecture shown in fig. 3 and the flowchart shown in fig. 4. The feature extraction network layer comprises an image convolution network (CNN), a point cloud feature extraction network and a fusion network. Other possible configurations are indicated by the dashed lines in fig. 4; for example, the example position probability map may also be used as input data for the 3D key point offset estimation. Specifically, the extracting of the first image feature based on the input color image and/or grayscale image in step A2 includes: Step A21: extracting the first image feature through a convolutional neural network based on the input color image and/or grayscale image.
Optionally, the image convolution network may be a deep learning network based on convolutional neurons. The input data of the network may be a 3-channel color image I of size H × W, where H is the image height, W is the image width, and the 3 channels are the three RGB color channels of the color image; the output data of the network may comprise a map of image feature vectors (the first image features) for each pixel, of size H × W × F1. In an embodiment, the network structure may be a multi-layer convolutional neural network built according to the actual scene requirements, or the part before the fully connected layers of an existing neural network, such as AlexNet, VGGNet (Visual Geometry Group network), ResNet (deep residual network), and the like.
Specifically, the extracting of point cloud features based on the input depth image in step A2 includes: Step A22: extracting point cloud features through a multi-layer perceptron network based on the input depth image.
Alternatively, the point cloud feature extraction network may be a multi-layer perceptron (MLP) network, such as PointNet++ (a point cloud segmentation network). The input data of the network may be the point cloud obtained by two-dimensional-to-three-dimensional projection of the depth image, or may further include other features of each point cloud, such as RGB colors, normal vectors and the like; the input size is N × M, where N is the number of point clouds and M is the feature length of each point cloud. The output data of the network may include a point cloud feature vector of size N × F2 for each three-dimensional point.
In an embodiment, the fusing of the first image feature and the point cloud feature in step A2 to obtain the image features corresponding to the point cloud includes: Step A23: fusing the first image feature and the point cloud feature pixel by pixel to obtain the image features corresponding to the point cloud.
Optionally, according to the one-to-one correspondence between the depth image and the three-dimensional point cloud projection, the image pixel corresponding to each three-dimensional point is known. For each pixel in the image, the image pixel feature vector (first image feature) obtained from the image convolution network and the point cloud feature vector (point cloud feature) obtained from the point cloud feature extraction network are densely fused pixel by pixel through the fusion network. As an example, the fusion may be performed by concatenating the image pixel feature vector and the point cloud feature vector, and the size of the fused feature (the image feature corresponding to the point cloud) is N × (F1 + F2) = N × F.
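A minimal sketch of such pixel-by-pixel dense fusion is given below, with the fusion implemented as simple concatenation as described above; the tensor shapes and the helper name dense_fuse are illustrative assumptions.

```python
import torch

def dense_fuse(cnn_feat, point_feat, pixel_index):
    """Per-point dense fusion of image features and point cloud features.

    cnn_feat:    (F1, H, W)  per-pixel image features from the CNN branch
    point_feat:  (N, F2)     per-point features from the point cloud branch
    pixel_index: (N,)        flat pixel index (v * W + u) of each 3D point
    Returns a (N, F1 + F2) fused feature, i.e. N x F in the text.
    """
    F1, H, W = cnn_feat.shape
    flat = cnn_feat.reshape(F1, H * W).transpose(0, 1)   # (H*W, F1)
    per_point_img_feat = flat[pixel_index]               # gather (N, F1)
    return torch.cat([per_point_img_feat, point_feat], dim=1)

# Example with assumed sizes: H = W = 4, F1 = 8, F2 = 16, N = 10 points
cnn_feat = torch.randn(8, 4, 4)
point_feat = torch.randn(10, 16)
pixel_index = torch.randint(0, 16, (10,))
fused = dense_fuse(cnn_feat, point_feat, pixel_index)
print(fused.shape)  # torch.Size([10, 24])
```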
In the embodiment of the present application, the image features including point cloud information obtained by dense fusion are sent to a plurality of parallel multi-layer perceptron (MLP) network structures to extract features with stronger expressive power, and the tasks shown in fig. 3 are then performed (which may be implemented using the network structure shown in fig. 4), including: semantic segmentation estimation and example mask segmentation estimation of the object, offset estimation of the three-dimensional key points of the object, and coordinate prediction (key point regression) of the three-dimensional key points of the object. The four tasks each have a corresponding loss function during training; the semantic segmentation task, the example mask segmentation task and key point detection (including offset estimation and key point regression) are parallel tasks. An example of performing each task based on the image features corresponding to the point cloud obtained in step S101 is described below.
In one embodiment, step S102 determines semantic segmentation information, instance mask information, and keypoint information for an object based on image features, including the following steps B1-B3:
step B1: and determining semantic segmentation information corresponding to the point cloud based on the image characteristics.
Specifically, the semantic segmentation task may be performed based on the multi-layer perceptron 1; the semantic segmentation task of the object is used to estimate the semantic category of each point cloud. As an embodiment, assuming that there are C semantic categories (preset object categories), i.e. C kinds of objects, the task aims to output a semantic segmentation probability map of size N × C, where each value represents the probability that the n-th point cloud belongs to the c-th category. During network training, the ground truth (i.e. the real semantic information of each point cloud) is constructed from the real semantic labeling information, and the value of a preset loss function (which may be the dice loss, a metric function for evaluating the similarity of two samples, the softmax loss, and the like) is then determined based on the network output and the labeling information, so as to update the network parameters based on the loss function. Through continuous supervision with data carrying real labeling information, the network parameters for semantic segmentation are learned when the network converges. At test time, the RGBD image can be fed into the convolutional neural network and the point cloud feature extraction network to obtain the corresponding image features, which are then input into the multi-layer perceptron 1 to obtain the semantic segmentation probability map. Finally, for each point cloud, the category with the maximum probability in the probability map is taken as the predicted semantic category of that point cloud, i.e. the semantic segmentation information corresponding to the point cloud.
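The following sketch illustrates one possible form of the multi-layer perceptron 1 as a shared per-point MLP head; the layer sizes are arbitrary assumptions, and a cross-entropy (softmax-style) loss stands in for the preset loss function mentioned above.

```python
import torch
import torch.nn as nn

class SemanticSegHead(nn.Module):
    """Shared MLP applied to each point's fused feature, producing an
    N x C semantic score map (a stand-in for multi-layer perceptron 1)."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, fused):            # fused: (N, F)
        return self.mlp(fused)           # logits: (N, C)

# Example: N = 10 points, F = 24 fused features, C = 5 categories
head = SemanticSegHead(feat_dim=24, num_classes=5)
logits = head(torch.randn(10, 24))
labels = torch.randint(0, 5, (10,))                 # per-point ground-truth categories
loss = nn.CrossEntropyLoss()(logits, labels)        # softmax-style loss
pred_class = logits.argmax(dim=1)                   # predicted semantic category per point
print(loss.item(), pred_class.shape)
```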
Step B2: and creating a three-dimensional grid based on the point cloud information corresponding to the input image, and determining example mask information of the object based on the three-dimensional grid.
Here, creating the three-dimensional grid can be understood as dividing the three-dimensional space in which the object is located into very small three-dimensional cells, as shown in fig. 8. If the center of an object falls into a certain cell of the three-dimensional grid, example segmentation and three-dimensional key point detection of that object are carried out using the information obtained for the corresponding cell.
In particular, the multi-layer perceptron 2 may be employed for the example mask segmentation task, which perceives and predicts the location of each instance by means of 3D location information. As an embodiment, as shown in fig. 3, a three-dimensional space is first created according to the positions of all input point clouds (of size N × 3, where N is the number of point clouds and 3 represents the three-dimensional spatial coordinates of each point cloud in the horizontal direction x, the vertical direction y and the depth direction z), and a preset three-dimensional grid division strategy is adopted to divide this space into a three-dimensional grid. For example, when the single-branch equal-interval division strategy is adopted, during network training the network node corresponding to the cell where an object center is located is set to 1 and the rest are set to 0, and the value of a loss function (which may be the dice loss, a metric function for evaluating the similarity of two samples, the softmax loss, and the like) is then determined based on the network output and the labeling information, so as to update the network parameters through the loss function. Through continuous supervision with data carrying real labeling information, the network parameters for example segmentation are learned when the network finally converges. At test time, the image features obtained from the input RGBD image are fed into the network to obtain probability maps over the different positions used for example segmentation, i.e. the example position probability map. Finally, for each point cloud, the position with the maximum probability in the probability map is taken as the position of the object to which that point cloud belongs, and this constitutes the example mask information. All point clouds at the same cell position can be regarded as belonging to the same example object.
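A comparable sketch of the multi-layer perceptron 2 is shown below: each point receives a score over the D × D × D cells, and the per-point argmax gives the cell (i.e. the object position) used as example mask information. Layer sizes and class names are again assumptions.

```python
import torch
import torch.nn as nn

class InstanceMaskHead(nn.Module):
    """Location-aware example mask head (a stand-in for multi-layer perceptron 2):
    for each point it outputs a score over the D*D*D grid cells; the cell with
    the highest score is taken as the position of the object the point belongs to."""
    def __init__(self, feat_dim, D):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, D ** 3),
        )

    def forward(self, fused):             # fused: (N, F)
        return self.mlp(fused)            # example position scores: (N, D^3)

# Example: N = 10 points, F = 24, D = 4 -> 64 candidate cells
head = InstanceMaskHead(feat_dim=24, D=4)
scores = head(torch.randn(10, 24))
cell_of_point = scores.argmax(dim=1)      # points sharing a cell form one example object
print(scores.shape, cell_of_point.shape)  # torch.Size([10, 64]) torch.Size([10])
```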
In one embodiment, the example mask information characterizes the grid information corresponding to the point cloud in the three-dimensional grid; the grid information corresponding to each point cloud of the object is determined based on the grid information corresponding to the point cloud at the center of the object.
Specifically, in a three-dimensional grid divided based on the point cloud information, the grid information corresponding to each point cloud of the object in the three-dimensional grid is known (the point clouds of an object may be scattered over a plurality of cells of the three-dimensional grid). In the embodiment of the application, the target cell in which the point cloud corresponding to the center of the object is located is determined, and all point clouds of the object are then uniformly regarded as corresponding to this target cell. For example: in a three-dimensional grid (horizontal direction x, vertical direction y, depth direction z), the object corresponds to 3 point clouds (in practice there may be many; this is only for illustration): point cloud 1, point cloud 2 and point cloud 3; point cloud 1 corresponds to cell (1,3,5), point cloud 2 corresponds to cell (6,7,8), and point cloud 3 corresponds to cell (2,4,7). If the center of the object corresponds to point cloud 2, the cell (6,7,8) is regarded as the cell corresponding to all 3 point clouds (the channel corresponding to this cell is used for predicting the position of the object).
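The relabelling described in this example can be written compactly as follows; the helper assign_center_cell is hypothetical and simply copies the center's cell to every point cloud of the object.

```python
import numpy as np

def assign_center_cell(point_cells, center_index):
    """Training-target construction for the example mask: every point cloud of
    the object is relabelled with the cell of the point cloud at the object's center.

    point_cells:  (N, 3) integer cell coordinates (i, j, k) of each point cloud
    center_index: index of the point cloud corresponding to the object center
    """
    target_cell = point_cells[center_index]
    return np.tile(target_cell, (len(point_cells), 1))   # same cell for all points

# The worked example from the text: 3 point clouds in cells (1,3,5), (6,7,8), (2,4,7),
# with the object center corresponding to point cloud 2 (index 1).
cells = np.array([[1, 3, 5], [6, 7, 8], [2, 4, 7]])
print(assign_center_cell(cells, center_index=1))
# [[6 7 8]
#  [6 7 8]
#  [6 7 8]]
```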
Step B3: the keypoint information is determined based on the instance mask information, or based on the semantic segmentation information and the instance mask information.
Specifically, after obtaining the corresponding semantic segmentation information (the predicted semantic category corresponding to each point cloud) and the example mask information (the object position where each point cloud is located) based on steps B1 and B2, the key point information may be determined by the image features and the example mask information, or may be determined by the image features in combination with the semantic segmentation information and the example mask information. A specific process of determining the keypoint information will be described in the following embodiments.
In one embodiment, as shown in fig. 9, the preset three-dimensional mesh partition strategy includes a single-branch method for equal-interval partition, a multi-branch method for cell size change, and a multi-branch method for start position change. Specifically, the step B2 of creating the three-dimensional mesh based on the point cloud information corresponding to the input image may include at least one of the following steps B21-B23:
step B21: and dividing a three-dimensional space corresponding to the point cloud information at equal intervals to obtain a three-dimensional grid.
Specifically, the three-dimensional space corresponding to the point cloud information can be understood as the three-dimensional physical space in which the object is located. When the space is divided, it is divided into D equal parts at equal intervals in each direction, yielding D × D × D cells. In this case, the number of nodes in the network of the multi-layer perceptron 2 is N × D³, where N is the number of point clouds. If the center of the object is located in cell (i, j, k), the object corresponds to the node at position idx = i × D × D + j × D + k of this layer of the neural network. In this division method, if any two objects are not to fall in the same cell, the size of a divided cell must be smaller than that of any single object; if the objects are very small, the value of D becomes large, which increases the number of network parameters. Therefore, in order to solve this problem, the application also proposes two methods, implemented by the following steps B22 and B23.
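Assuming the flattened node index is idx = i × D × D + j × D + k for cell (i, j, k), the per-point cell indices for an equal-interval division can be computed as in the following sketch; using the bounding box of the point cloud as the divided space is an additional assumption.

```python
import numpy as np

def point_cell_indices(points, D):
    """Divide the space spanned by the point cloud into D x D x D equal cells
    and return each point's flattened cell index idx = i*D*D + j*D + k."""
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    cell = (hi - lo) / D                                         # equal interval per axis
    ijk = np.clip(((points - lo) / cell).astype(int), 0, D - 1)  # (N, 3) cell coordinates
    return ijk[:, 0] * D * D + ijk[:, 1] * D + ijk[:, 2]         # (N,) flat index in [0, D^3)

# Example: N = 5 random points, D = 4 -> indices in [0, 64)
pts = np.random.default_rng(1).uniform(size=(5, 3))
print(point_cell_indices(pts, D=4))
```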
Step B22: and dividing three-dimensional spaces corresponding to the point cloud information based on a plurality of preset intervals respectively to obtain a plurality of three-dimensional grids.
This may also be called the multi-branch method with varying cell size. Specifically, a different interval is set for each branch in step B22, i.e. different division counts D are obtained. As an embodiment, as shown in fig. 9, two branches (or more) are provided, corresponding to two parallel position-aware layers; the interval of each layer is set differently, i.e. the obtained D differs, and the numbers of network nodes of the two layers are N × D1³ and N × D2³ respectively, where N is the number of point clouds. As can be seen from fig. 9, the first three-dimensional grid formed includes D1 × D1 × D1 cells divided at equal intervals based on preset interval 1, and the second three-dimensional grid formed includes D2 × D2 × D2 cells divided at equal intervals based on preset interval 2.
Step B23: and dividing the three-dimensional space corresponding to the point cloud information to obtain a plurality of three-dimensional grids based on the same division starting points with different intervals.
This may also be called the multi-branch method with varying start position. Specifically, a different starting point is set for the division interval of each branch in step B23. As an embodiment, as shown in fig. 9, two branches (or more) are provided, corresponding to two parallel position-aware layers; the starting points of the layers differ while the intervals are the same, and the number of network nodes of each of the two layers is N × D³. As can be seen from fig. 9, the first and second three-dimensional grids are each divided into a plurality of cells based on the preset interval, and the cells formed by the two divisions are different.
In one embodiment, a key point offset estimation task is performed before the key point regression, and the predicted values used for key point regression can be obtained based on the estimated offsets in combination with the point cloud information. The key point offset estimation task is used to estimate the offset of each point cloud relative to each three-dimensional key point of the object; the three-dimensional key points of the object can be defined based on empirical values or calculated by a relevant sampling algorithm. In the following embodiments, the case where each object in the image has K key points is described as an example.
Specifically, determining the keypoint information based on the example mask information in step B3 includes at least one of the following steps C1-C2:
step C1: estimating a first offset of each point cloud corresponding to the key point based on the image characteristics; and determining key point information of the object in a regression mode based on the first offset and the example mask information.
Specifically, as shown in fig. 3 and 4, in the embodiment of the present application, when K key points correspond to each object in the image, the size of the network layer corresponding to the three-dimensional key point offset estimation task is N × K × 3, where N denotes the number of point clouds, K denotes the number of key points, and 3 denotes the three coordinates of three-dimensional space. When training the network (corresponding to the multi-layer perceptron 3 in fig. 4), the difference between the preset spatial coordinate information of each key point and the spatial coordinate value of each point cloud is taken as the real offset label, and a loss function (for example a Euclidean distance loss) is then determined based on the predicted offset labels output by the network and the real labels, so as to update the network parameters based on the loss function.
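A sketch of the training target and loss for this offset estimation task is given below; for simplicity it supervises the offsets of all points against one object's key points, whereas in practice the supervision would normally be restricted to the points of that object. Function and variable names are assumptions.

```python
import torch

def offset_targets_and_loss(points, keypoints, pred_offsets):
    """Ground-truth offsets and a Euclidean-distance loss for the
    3D key point offset estimation task (first offset, step C1).

    points:       (N, 3)     point cloud coordinates
    keypoints:    (K, 3)     preset key point coordinates of the object
    pred_offsets: (N, K, 3)  offsets predicted by the network
    """
    # Real offset label: key point position minus each point cloud's position
    gt_offsets = keypoints.unsqueeze(0) - points.unsqueeze(1)           # (N, K, 3)
    loss = torch.linalg.norm(pred_offsets - gt_offsets, dim=-1).mean()  # Euclidean distance loss
    return gt_offsets, loss

# Example with N = 6 points and K = 2 key points
points = torch.randn(6, 3)
keypoints = torch.randn(2, 3)
pred = torch.randn(6, 2, 3)
gt, loss = offset_targets_and_loss(points, keypoints, pred)
print(gt.shape, loss.item())
```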
Optionally, in step C1, determining the keypoint information of the object by regression based on the first offset and the example mask information includes: determining an initial predicted value of the key point predicted by taking the point cloud as a reference based on the first offset and the point cloud information; and determining a target predicted value corresponding to the key point in the three-dimensional grid based on the initial predicted value and the example mask information, and determining key point information of the object in a regression mode based on the target predicted value.
The coordinate information (initial predicted value) of each key point predicted from each point cloud is obtained by adding the first offset predicted for each key point (the key points referred to in the embodiments of the present application are all three-dimensional, i.e. 3D, key points) to the original coordinate information of that point cloud (known from the point cloud information). That is, first offset + point cloud information = initial predicted value.
Specifically, since the offset estimation task in step C1 is performed over the entire three-dimensional space, it is necessary to add the example mask information in the key point regression step in order to calculate the spatial coordinate information of the key points of the object in each cell. As shown in fig. 3 and 4, the flowchart shows the example mask segmentation result and the offset estimation result as input data of the regressor, and the network structure diagram shows the regressor connected to the multi-layer perceptron 3 as well as the multi-layer perceptrons 1 and 2. Optionally, the target predicted value of each key point predicted by the point clouds in each cell may be determined based on the initial predicted value and the example mask information.
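The following sketch illustrates this regression step under a simple assumption: the initial predictions (point coordinates plus first offsets) of all points assigned to the same cell are averaged to give that cell's (object's) key points. The per-cell averaging stands in for the regressor and is an assumption, not necessarily the exact regression of the embodiments.

```python
import torch

def regress_keypoints_per_cell(points, pred_offsets, cell_of_point, num_cells):
    """Key point regression using the example mask information (step C1):
    initial prediction = point coordinates + first offset, then the predictions
    of all points assigned to the same cell are averaged (no clustering involved).

    points:        (N, 3)
    pred_offsets:  (N, K, 3)  first offsets
    cell_of_point: (N,)       cell index of each point from the example mask
    Returns: (num_cells, K, 3) averaged key point predictions per cell.
    """
    N, K, _ = pred_offsets.shape
    initial = points.unsqueeze(1) + pred_offsets                 # (N, K, 3) initial predicted values
    sums = torch.zeros(num_cells, K, 3)
    counts = torch.zeros(num_cells, 1, 1)
    sums.index_add_(0, cell_of_point, initial)                   # accumulate per cell
    counts.index_add_(0, cell_of_point, torch.ones(N, 1, 1))
    return sums / counts.clamp(min=1)                            # mean over each occupied cell

# Example: 6 points in 2 occupied cells, K = 2 key points per object
points = torch.randn(6, 3)
offsets = torch.randn(6, 2, 3)
cells = torch.tensor([3, 3, 3, 7, 7, 7])
kps = regress_keypoints_per_cell(points, offsets, cells, num_cells=8)
print(kps.shape)   # torch.Size([8, 2, 3])
```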
Step C2: estimating second offset of key points corresponding to each point cloud in each cell of the three-dimensional grid based on the image characteristics and the example mask information; and determining the key point information of the object by means of regression based on the second offset.
Specifically, as shown in fig. 3 and 4, in the embodiment of the present application, when K key points correspond to each object in the image, the size of the network layer corresponding to the three-dimensional key point offset estimation task is N × D³ × K × 3 (corresponding to calculating the offset of each key point for each point cloud in each cell), where N represents the number of point clouds, K represents the number of key points, and 3 represents the three coordinates of three-dimensional space. When training the network (corresponding to the multi-layer perceptron 3 in fig. 4), the difference between the preset spatial coordinate information of each key point and the spatial coordinate value of each point cloud is taken as the real offset label, and a loss function (for example a Euclidean distance loss) is then determined based on the predicted offset labels output by the network and the real labels, so as to update the network parameters based on the loss function.
Unlike step C1, step C2 performs offset estimation based on each point cloud included in each cell, and step C1 performs offset estimation for all point clouds based on the entire three-dimensional space in which the object is located.
Optionally, in step C2, determining the key point information of the object by regression based on the second offset includes: determining a target predicted value of the key point predicted by taking the point cloud as a reference based on the second offset and the point cloud information; and determining key point information (such as a space coordinate value) of the object by a regression mode based on the target predicted value.
For each cell, the coordinate information (target predicted value) of each key point predicted from each point cloud is obtained by adding the second offset predicted by each point cloud for each key point (three-dimensional, i.e. 3D, key points) to the original coordinate information of that point cloud (known from the point cloud information). That is, second offset + point cloud information = target predicted value.
Comparing the steps C1 and C2, it can be seen that the information indicated by the first offset and the second offset is the predicted offset of each point cloud for each key point, and the difference is that the step C1 is to process all point clouds in the three-dimensional space, and the step C2 is to process the point cloud in each cell in the three-dimensional space; that is, the position perception information (example mask information) is already added in the step C2 when the offset estimation is performed, so that when the keypoint regression is performed, the processing can be directly performed based on the second offset and the target predicted value determined by the point cloud information.
Specifically, since the offset estimation task in step C2 is performed for each cell in the three-dimensional space, there is no need to add example mask information in the key point regression step. In conjunction with the dashed line in fig. 4, the flowchart shows the result of the 3D key point offset estimation as the input data of the regressor, and the network structure diagram shows the regressor connected to the multilayer perceptron 3.
In one embodiment, as shown in FIGS. 5, 6 and 7, determining the keypoint information based on the semantic segmentation information and the instance mask information in step B3 includes the following step C3: determining instance segmentation information based on the semantic segmentation information and the instance mask information; estimating a first offset of each point cloud corresponding to the key point based on the image characteristics; and determining key point information of the object in a regression mode based on the first offset and the example segmentation information.
Specifically, the semantic segmentation information represents a category corresponding to each point cloud, and the example mask information may represent a cell (a cell corresponding to an object) corresponding to each point cloud in the three-dimensional mesh; example segmentation information determined based on the semantic segmentation information and the example mask information may be used to characterize the class of the object and the location information of the object. The semantic segmentation information can be obtained by processing the fused image features by adopting a multilayer perceptron of semantic segmentation calculation shown in fig. 7; the example mask information may be obtained by processing the fused image features using a multi-layered perceptron of location-aware 3D example segmentation as shown in fig. 7.
Alternatively, the process of determining the example segmentation information based on the semantic segmentation information and the example mask information may be understood as a process of removing spurious (redundant) information in the example mask information by means of the semantic segmentation information. Specifically, the result corresponding to the semantic segmentation information may be a matrix of size N × C, the result corresponding to the example mask information may be a matrix of size N × D³, the process of combining the two can be understood as a matrix multiplication, and the result of the processing is the example segmentation information.
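One possible reading of this combination is sketched below. Treating it as a per-point product of the class probabilities (N × C) and the cell probabilities (N × D³) is an assumption made for illustration, not necessarily the exact operation used by the embodiment.

```python
import numpy as np

def instance_segmentation(sem_probs, mask_probs):
    # sem_probs:  (N, C)     semantic segmentation probabilities per point cloud
    # mask_probs: (N, D**3)  example mask probabilities per point cloud and cell
    joint = sem_probs[:, :, None] * mask_probs[:, None, :]   # (N, C, D**3) joint class/cell scores
    cls = joint.max(axis=2).argmax(axis=1)                   # most likely class per point cloud
    cell = joint.max(axis=1).argmax(axis=1)                  # most likely cell per point cloud
    return cls, cell
```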
Alternatively, the specific process of determining the first offset may refer to the content of step C1 described above. The first offset may be obtained by processing the fused image features by using a multi-layer perceptron calculated by using the 3D keypoint offset shown in fig. 7.
In an embodiment, the determining the key point information of the object by regression based on the first offset and the example segmentation information in step C3 may include the following steps: determining an initial predicted value of the key point predicted by taking the point cloud as a reference based on the first offset and the point cloud information; determining a target predicted value of a key point in the three-dimensional grid based on the initial predicted value and the example segmentation information; and determining key point information of the object in a regression mode based on the target predicted value.
Specifically, the determination of the key point information of the object by the regression method may be obtained by processing the first offset and the example segmentation information by using a multilayer perceptron of 3D key point regression for location perception shown in fig. 7.
Specifically, for the specific process of determining the initial predicted value and the target predicted value, reference may be made to the content shown in step C1 above. The difference between step C3 and step C1 is that step C1 determines the target predicted value based on the initial predicted value and the example mask information, while step C3 determines the target predicted value based on the initial predicted value and the example segmentation information. By comparison, since the example segmentation information has already effectively eliminated part of the redundant information, the accuracy of the target predicted value determined in step C3 is higher.
In one embodiment, the determining the key point information of the object by regression based on the target predicted value in the steps C1-C2 includes at least one of the following steps C01-C04:
step C01: and determining the average value of the target predicted values respectively corresponding to the key point and each point cloud as key point information aiming at each key point of the object.
Wherein for each location (which may be understood as each cell), the target prediction value characterizes coordinate information (predicted coordinate values) of each keypoint predicted based on each point cloud. In conjunction with table 1 below, an example is illustrated (assuming that there are currently 3 point clouds, 2 keypoints):
table 1 (target prediction value)
Point cloud \ key point | 1 | 2 |
---|---|---|
1 | (a1, b1, c1) | (d1, e1, f1) |
2 | (a2, b2, c2) | (d2, e2, f2) |
3 | (a3, b3, c3) | (d3, e3, f3) |
With reference to table 1, it can be seen that each point cloud has a corresponding predicted coordinate value (target predicted value) corresponding to each key point.
In one embodiment, step C01 can be represented by the following formula (1):
y = (Σ_i x_i) / N    ......(1)
wherein i = 1, ..., N; i represents the ith point cloud, and there are N point clouds in total; x_i is the target predicted value corresponding to the ith point cloud; y is the key point information (e.g., the spatial coordinate value of the key point).
The spatial coordinate information of the keypoint 1 is a coordinate value obtained by calculating [ (a1, b1, c1) + (a2, b2, c2) + (a3, b3, c3) ]/3.
Step C02: and aiming at each key point of the object, determining a target predicted value corresponding to the key point and each point cloud respectively and a weighted average value of probability values corresponding to each point cloud in the example mask information as key point information.
Alternatively, step C02 may be understood as a weighted regression method that uses the location-aware example mask segmentation confidence M as the weight (if the confidence of the mask prediction for a point cloud is small, its contribution to the key point prediction is small).
In one embodiment, step C02 can be represented by the following equation (2):
y = (Σ_i w_i x_i) / (Σ_i w_i)    ......(2)
wherein i = 1, ..., N; i represents the ith point cloud, and there are N point clouds in total; x_i is the target predicted value corresponding to the ith point cloud; w_i is the mask confidence of the corresponding cell (the probability value of the ith point cloud corresponding to that cell), i.e., w = W_ins[idx], idx = 1, ..., D³, where W_ins is the network output probability map of the example mask segmentation and idx is the index of the spatial location of the cell where the object is located. Alternatively, w may also be denoted as w = M_idx.
Specifically, the target prediction value may be expressed by referring to the example shown in table 1 in step C01 described above. The probability value corresponding to each point cloud in the example mask information is the probability value corresponding to the current position (cell) of each point cloud (the three-dimensional grid has a plurality of cells, and the probability value corresponding to each cell of each point cloud can be predicted when the example mask is segmented).
The spatial coordinate information of the keypoint 1 is a coordinate value obtained by [ w1 (a1, b1, c1) + w2 (a2, b2, c2) + w3 (a3, b3, c3) ]/[ w1+ w2+ w3 ].
Step C03: for each key point of the object, determining the key point information of the object as the weighted average of the target predicted values corresponding to the key point and each of a preset number of point clouds closest to the center point of the object, weighted by the probability values corresponding to these point clouds in the example mask information.
In one embodiment, step C03 can be expressed by the following equation (3):
y = (Σ_i w_i x_i) / (Σ_i w_i)    ......(3)
wherein i = 1, ..., T; i represents the ith point cloud, and there are T point clouds in total. The T point clouds are determined by predicting the offset of each point cloud to the object center (key point), i.e., the predicted offset magnitude √(dx² + dy² + dz²), where dx, dy and dz are the offset predicted values in the x, y and z directions, respectively; the predicted offsets (or target predicted values) are sorted in ascending order, and the first T point clouds are taken as the point clouds used for calculating the key point information in step C03. x_i is the target predicted value corresponding to the ith point cloud; w_i is the mask confidence of the corresponding cell (the probability value of the ith point cloud corresponding to that cell), i.e., w = W_ins[idx], idx = 1, ..., D³, where W_ins is the network output probability map of the example mask segmentation and idx is the index of the spatial location of the cell where the object is located. Alternatively, w may also be denoted as w = M_idx.
Alternatively, the target predicted value may be represented by referring to the example shown in table 1 in step C01 described above. Assume that the first T point clouds (T is 2) in the current 3 point clouds are point cloud 1 and point cloud 3. The spatial coordinate information of the keypoint 1 is a coordinate value obtained by [ w1 (a1, b1, c1) + w3 (a3, b3, c3) ]/[ w1+ w3 ].
Step C04: and aiming at each key point of the object, determining a distance closeness value corresponding to the key point and each point cloud respectively, a target prediction value corresponding to the key point and each point cloud respectively, and a weighted mean value of probability values corresponding to each point cloud in the example mask information respectively as key point information.
Specifically, the physical meaning of the distance closeness value between the key point and a point cloud is to quantify the distance between each point cloud and the key point of the object (the quantified value lies in [0, 1]), which can be expressed by the following formula (4):
p = (d_max - offset) / d_max    ......(4)
wherein d_max may be the farthest Euclidean distance between the key point and all the point clouds on the object (which may be calculated from the target predicted values and the spatial coordinate values of the point clouds, or may be a preset threshold constant); offset is the magnitude of the predicted offset (obtained by performing the offset estimation task), i.e., offset = √(dx² + dy² + dz²), where dx, dy and dz are the offset predicted values in the x, y and z directions, respectively.
Based on the distance closeness value shown in the above equation (4), it can be determined that the point cloud is closer to the key point as the distance closeness value p is larger.
In one embodiment, step C04 can be represented by the following equation (5):
y = (Σ_i w_i p_i x_i) / (Σ_i w_i p_i)    ......(5)
wherein i = 1, ..., N; i represents the ith point cloud, and there are N point clouds in total; x_i is the target predicted value corresponding to the ith point cloud; p_i is the distance closeness value shown in formula (4); w_i is the mask confidence of the corresponding cell (the probability value of the ith point cloud corresponding to that cell), i.e., w = W_ins[idx], idx = 1, ..., D³, where W_ins is the network output probability map of the example mask segmentation and idx is the index of the spatial location of the cell where the object is located. Alternatively, w may also be denoted as w = M_idx.
Specifically, the target prediction value may be expressed by referring to the example shown in table 1 in step C01 described above.
The spatial coordinate information of the key point 1 is a coordinate value obtained by [ w1 × p1 (a1, b1, c1) + w2 × p2 (a2, b2, c2) + w3 × p3 (a3, b3, c3) ]/[ w1 × p1+ w2 × p2+ w3 × p3 ].
Compared with step C01, performing step C04 can further remove the influence of abnormal offset values predicted by the offset estimation task. In a possible embodiment, a module for outputting the distance closeness value p (the output value of the module is a predicted value) may be added before the regressor of the network, so as to reduce the amount of calculation in the regression.
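For illustration, the four regression strategies of steps C01-C04 (equations (1), (2), (3) and (5)) can be sketched as follows; targets is the (N, K, 3) array of target predicted values of table 1, w is the per-point-cloud mask confidence, center_offsets/offsets are the predicted offsets, and d_max is the constant of equation (4). The names, shapes and the numpy formulation are assumptions of the sketch rather than the claimed implementation.

```python
import numpy as np

def regress_mean(targets):                                    # step C01, equation (1)
    return targets.mean(axis=0)                               # (K, 3) key point coordinates

def regress_mask_weighted(targets, w):                        # step C02, equation (2)
    return (w[:, None, None] * targets).sum(axis=0) / w.sum()

def regress_top_t(targets, w, center_offsets, T):             # step C03, equation (3)
    # keep the T point clouds whose predicted offset to the object center is smallest
    order = np.argsort(np.linalg.norm(center_offsets, axis=-1))[:T]
    return regress_mask_weighted(targets[order], w[order])

def regress_distance_weighted(targets, w, offsets, d_max):    # step C04, equations (4)-(5)
    # offsets: (N, K, 3) predicted offset of every point cloud to every key point
    p = (d_max - np.linalg.norm(offsets, axis=-1)) / d_max    # (N, K) distance closeness
    wp = w[:, None] * p                                       # (N, K) combined weights
    return (wp[:, :, None] * targets).sum(axis=0) / wp.sum(axis=0)[:, None]
```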
In an embodiment, the regressor in the network structure performs the key point regression task, specifically the steps shown in steps C01-C04. During network training, the regressor may be added to the entire network structure and trained together with it, or may be kept outside the network training; that is, during network training, only the network parameters of the feature extraction, semantic segmentation, example mask segmentation and key point offset estimation tasks are updated.
Alternatively, considering that extracting only point cloud features containing spatial coordinate information provides limited expressive power when point cloud feature extraction is performed based on the depth image, other features such as RGB colors and normal vectors are generally added. However, calculating such additional feature information is time-consuming. In order to improve the accuracy of pose estimation while ensuring timeliness when other feature information is added, a feature mapping module is further added to the implementation network provided by the present application, as shown in fig. 10 or fig. 11 (a feature mapping module may also be added to fig. 3 and fig. 4; the color image in fig. 11 may also be replaced by a grayscale image). In this case, steps A1 and A2 of extracting point cloud features based on the input depth image may further include the following steps A11-A12:
step A11: acquiring corresponding point cloud information based on the input depth image;
step A12: point cloud features are extracted based on the point cloud information and at least one of the color features and the normal features.
Specifically, in the network training process, when point cloud feature extraction is performed, the point cloud information corresponding to the input depth image is first acquired and input into the point cloud feature extraction network for feature extraction training; the point cloud information together with the other feature information (such as RGB colors and normal vectors) is then input into the point cloud feature extraction network, and the two sets of processed features are aligned by minimizing their Euclidean distance (other distance measurement modes may also be adopted). In this way, when the network is used at test time, the image features extracted from the point cloud alone can already carry the expression of the other features, which reduces the time spent extracting the other feature information.
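A minimal sketch of this alignment, assuming two point cloud feature extraction branches are available (one fed with spatial coordinates only and one additionally fed with color and normal features); the loss below only illustrates the Euclidean-distance alignment mentioned above.

```python
import numpy as np

def feature_alignment_loss(feat_xyz_only, feat_xyz_extra):
    # feat_xyz_only:  (N, F) per-point features extracted from spatial coordinates only
    # feat_xyz_extra: (N, F) per-point features extracted with RGB/normal information added
    # minimizing this distance lets the xyz-only branch imitate the richer features,
    # so the extra features need not be computed at test time
    return float(np.linalg.norm(feat_xyz_only - feat_xyz_extra, axis=1).mean())
```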
In fig. 4, 6 and 11 mentioned in the above embodiments, the blocks with sharp corners represent operation steps or network structures, and the blocks with rounded corners represent processing results. As shown in fig. 4, the block corresponding to dense fusion represents fusing image pixel features and point cloud features, and the semantic segmentation probability map represents the output data of the multilayer perceptron 1.
To further explain the application of the object posture estimation method provided by the embodiment of the present application in an actual scene, the method is described below with reference to the augmented reality scenario corresponding to fig. 12.
Fig. 12 shows a scene in which the object posture estimation method provided in the embodiment of the present application is applied to an augmented reality system. The image of the object is captured in real time by AR glasses worn by the user; the captured content is the video data corresponding to the three-dimensional space in front of the user's eyes when the augmented reality system is used, and estimating the posture of the object can be understood as processing the image data of each frame of the video data. As can be seen from fig. 12, black represents a real existing object and white represents a virtual object. By executing the method provided by the embodiment of the present application, virtual content can be aligned to a real object with the correct posture in real time; for moving objects in the real scene, real-time pose estimation ensures that the virtual objects in the augmented reality system are updated in time, so that lagging or misaligned artifacts relative to the real scene are reduced. As shown in fig. 12, the user perceives the virtual object as being aligned to the real object in real time for display.
In order to better show the effect of the network provided by the embodiment of the present application when implementing the object posture estimation method, the following experimental data shown in table 2 are given:
TABLE 2
As shown by the experimental data in table 2, in the embodiment of the present application, the YCB-Video data set (a pose estimation data set) is used to measure, on a server, the inference time of PVN3D (a 3D key point detection neural network based on Hough voting, in which object instance segmentation and object key point detection are performed by post-processing clustering) and of each of its modules. In the method provided by the present application, the location-aware network uses a single-branch, equal-interval division (here, D is set to 10). The least-squares (LS) fitting is used to calculate the 6-degree-of-freedom pose of the object. Experimental tests show that, compared with the PVN3D method, the method provided by the present application achieves an acceleration of object posture estimation on different types of GPUs.
Based on the same inventive concept, the embodiment of the present application further provides an image processing method, as shown in fig. 13, including the following steps S1-S3:
step S1: and acquiring image characteristics corresponding to point clouds of the input image.
Specifically, for the image features corresponding to the point cloud of the input image obtained in step S1, reference may be made to the content shown in step S101 in the above embodiment. The image may be a depth image, a color image, or a grayscale image.
Step S2: and determining example mask information of the image by taking the point cloud corresponding to the center of the object as a reference through a multilayer perceptron network of example mask segmentation based on the image characteristics.
In particular, one or more objects may be included in the image, and each object may correspond to one or more point clouds. In the embodiment of the present application, the example mask segmentation task is performed by using a multi-layer perceptron network, and the processing procedure of the image features in the network can refer to the content shown in steps B1 and B2 in the above embodiment.
The following describes a specific process of training a multi-layered perceptron network to perform an example mask segmentation task, and specifically, the training step of the multi-layered perceptron network for example mask segmentation comprises the following steps S21-S23:
step S21: acquiring a training data set; the training data set comprises a plurality of training images and object labeling information corresponding to the training images.
Step S22: inputting the training image into the multilayer perceptron network for example mask segmentation, so that the network outputs, based on the three-dimensional grid corresponding to the input training image and taking the cell where the point cloud corresponding to the center of the object is located as a reference, predicted grid information corresponding to each point cloud of the object.
Step S23: determining the parameters of the multilayer perceptron network for example mask segmentation based on the predicted grid information and the object labeling information.
Optionally, the training method of the network may include the following two ways:
(1) The object labeling information may represent the real corresponding position (grid cell) of each point cloud of the object in the three-dimensional space. A training image may include a plurality of objects, and each object corresponds to its own labeling information. The multilayer perceptron network for example mask segmentation may output initial grid information corresponding to each point cloud of an object; then, based on the initial grid information and in combination with the grid information of the cell where the point cloud corresponding to the object center is located, the initial grid information corresponding to the point cloud of the object center is finally used as the predicted grid information corresponding to each point cloud of the object. When the network parameters are determined, the network parameters are updated based on the initial grid information and the labeling information. Optionally, the loss value of each iteration of the network may be determined based on a preset loss function such as dice loss or softmax loss, and the network parameters may then be updated based on the loss value.
(2) The object labeling information can represent the real corresponding position (grid) of the point cloud corresponding to the object center in the three-dimensional space. The multi-layer perceptron network of example mask segmentation can output the mesh information (prediction mesh information) where the point cloud corresponding to the center of the object is located; and then updating network parameters based on the predicted grid information and the labeling information.
That is, during the network training, only the mesh information corresponding to the point cloud corresponding to the object center may be trained, and the mesh information corresponding to the other point clouds of the object may be regarded as the same as the mesh information corresponding to the point cloud corresponding to the object center.
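The labelling convention of option (2) can be sketched as follows; the grid origin, cell size and the flattening of the D × D × D cells into D³ indices are assumptions made for illustration.

```python
import numpy as np

def instance_mask_labels(points, instance_ids, centers, grid_min, cell_size, D):
    # points:       (N, 3) point cloud coordinates
    # instance_ids: (N,)   object id of each point cloud
    # centers:      dict mapping object id -> (3,) coordinates of the object center
    labels = np.zeros(points.shape[0], dtype=np.int64)
    for obj_id, center in centers.items():
        idx3 = np.clip(((center - grid_min) / cell_size).astype(int), 0, D - 1)
        flat = idx3[0] * D * D + idx3[1] * D + idx3[2]   # one of the D**3 cells
        # every point cloud of the object shares the cell label of the object center
        labels[instance_ids == obj_id] = flat
    return labels
```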
Step S3: image processing is performed based on the example mask information.
In an embodiment, the image processing method further includes step S4: and determining semantic segmentation information of the image based on the image characteristics.
Specifically, the specific process of determining the semantic segmentation information of the image based on the image features in step S4 may refer to the content shown in step S102 in the above embodiment.
The step S3 of performing image processing based on the example mask information includes the following steps S31-S32:
step S31: instance segmentation information for the object is determined based on the semantic segmentation information and the instance mask information.
Specifically, the specific process of determining the instance division information of the object based on the semantic division information and the instance mask information in step S31 may refer to that shown in step C3 in the above embodiment.
Step S32: image processing is performed based on the instance segmentation information.
In an embodiment, the image processing method may be applied to an object posture estimation method for object posture estimation.
Corresponding to the object posture estimation method provided by the present application, an embodiment of the present application further provides an object posture estimation device 1400, a schematic structural diagram of which is shown in fig. 14, where the object posture estimation device 1400 includes: a first obtaining module 1401, a first determining module 1402, and an attitude estimating module 1403.
The first obtaining module 1401 is configured to obtain an image feature corresponding to a point cloud of an input image; a first determining module 1402, configured to determine semantic segmentation information, instance mask information, and key point information of an object based on image features; and an attitude estimation module 1403, configured to perform object attitude estimation based on the semantic segmentation information, the example mask information, and the key point information.
Optionally, a first obtaining module 1401 configured to perform at least one of the following:
extracting point cloud features based on the input depth image, and confirming the extracted point cloud features as image features corresponding to the point cloud;
extracting a first image feature based on the input color image and/or grayscale image; extracting point cloud features based on the input depth image; and fusing the first image characteristic and the point cloud characteristic to obtain an image characteristic corresponding to the point cloud.
Optionally, the first obtaining module 1401 is configured to, when performing extracting point cloud features based on an input depth image, further perform:
acquiring corresponding point cloud information based on the input depth image;
extracting point cloud features based on the point cloud information and at least one of the following:
color features and normal features.
Optionally, the first obtaining module 1401 is configured to fuse the first image feature and the point cloud feature to obtain an image feature corresponding to the point cloud, and further configured to perform:
and performing pixel-by-pixel fusion on the first image characteristic and the point cloud characteristic to obtain an image characteristic corresponding to the point cloud.
Optionally, the first obtaining module 1401 is configured to, when performing the first image feature extraction based on the input color image and/or grayscale image, further perform: extracting a first image feature through a convolutional neural network based on an input color image and/or a gray image; and/or
The first obtaining module 1401 is configured to, when performing point cloud feature extraction based on an input depth image, further perform: and extracting point cloud characteristics through a multi-layer perceptron network based on the input depth image.
Optionally, the first determining module 1402 is configured to, when determining semantic segmentation information, instance mask information, and keypoint information of an object based on image features, further perform:
determining semantic segmentation information corresponding to the point cloud based on the image characteristics;
creating a three-dimensional grid based on point cloud information corresponding to the input image, and determining example mask information of the object based on the three-dimensional grid;
the keypoint information is determined based on the instance mask information, or based on the semantic segmentation information and the instance mask information.
Optionally, the example mask information represents mesh information corresponding to the point clouds in the three-dimensional mesh, wherein the mesh information corresponding to each point cloud of the object is determined based on the mesh information corresponding to the point cloud of the center of the object.
Optionally, the first determining module 1402 is configured to, when performing creating a three-dimensional mesh based on point cloud information corresponding to the input image, further perform at least one of the following:
dividing a three-dimensional space corresponding to the point cloud information at equal intervals to obtain a three-dimensional grid;
dividing three-dimensional spaces corresponding to the point cloud information based on a plurality of preset intervals respectively to obtain a plurality of three-dimensional grids;
and dividing the three-dimensional space corresponding to the point cloud information to obtain a plurality of three-dimensional grids based on the same division starting points with different intervals.
Optionally, the first determining module 1402 is configured to, when performing the determining of the keypoint information based on the example mask information, further perform at least one of:
estimating a first offset of each point cloud corresponding to the key point based on the image characteristics; determining key point information of the object in a regression mode based on the first offset and the example mask information;
estimating second offset of key points corresponding to each point cloud in each cell of the three-dimensional grid based on the image characteristics and the example mask information; determining key point information of the object in a regression mode based on the second offset;
optionally, the first determining module 1402, when performing determining the key point information based on the semantic segmentation information and the instance mask information, is further configured to perform:
determining instance segmentation information based on the semantic segmentation information and the instance mask information;
estimating a first offset of each point cloud corresponding to the key point based on the image characteristics;
and determining key point information of the object in a regression mode based on the first offset and the example segmentation information.
Optionally, when the first determining module 1402 is configured to determine the key point information of the object by a regression manner based on the first offset and the example mask information, the first determining module is further configured to: determining an initial predicted value of the key point predicted by taking the point cloud as a reference based on the first offset and the point cloud information; determining target predicted values of key points in the three-dimensional grid based on the initial predicted values and the example mask information; and determining key point information of the object in a regression mode based on the target predicted value.
Optionally, the first determining module 1402 is configured to, when performing the determining of the key point information of the object by regression based on the second offset, further perform: determining a target predicted value of the key point predicted by taking the point cloud as a reference based on the second offset and the point cloud information; determining key point information of the object in a regression mode based on the target predicted value;
optionally, when the first determining module 1402 is configured to determine the key point information of the object in a regression manner based on the first offset and the example segmentation information, the first determining module is further configured to: determining an initial predicted value of the key point predicted by taking the point cloud as a reference based on the first offset and the point cloud information; determining a target predicted value of a key point in the three-dimensional grid based on the initial predicted value and the example segmentation information; and determining key point information of the object in a regression mode based on the target predicted value.
Optionally, the first determining module 1402 is further configured to, when performing regression based on the target predicted value to determine the key point information of the object, at least one of:
determining the key point information as the mean value of the target predicted values respectively corresponding to the key point and each point cloud aiming at each key point of the object;
aiming at each key point of the object, determining a target predicted value corresponding to the key point and each point cloud respectively and a weighted average value of probability values corresponding to each point cloud in the example mask information as key point information;
for each key point of the object, determining a target predicted value corresponding to the key point and a preset numerical value point cloud closest to the center point of the object respectively as key point information by using a weighted average value of probability values corresponding to the preset numerical value point clouds in the example mask information respectively;
and aiming at each key point of the object, determining a distance closeness value corresponding to the key point and each point cloud respectively, a target prediction value corresponding to the key point and each point cloud respectively, and a weighted mean value of probability values corresponding to each point cloud in the example mask information respectively as key point information.
Corresponding to the image processing method provided by the present application, an embodiment of the present application further provides an image processing apparatus 1500, a schematic structural diagram of which is shown in fig. 15, where the image processing apparatus 1500 includes: a second acquisition module 1501, a second determination module 1502, and a processing module 1503.
The second obtaining module 1501 is configured to obtain image features corresponding to the point cloud of an input image; the second determining module 1502 is configured to determine example mask information of the image based on the image features, through a multilayer perceptron network for example mask segmentation, by taking the point cloud corresponding to the object center as a reference; and the processing module 1503 is used for performing image processing based on the example mask information.
Optionally, the training step of the example mask-segmented multi-layered perceptron network comprises:
acquiring a training data set; the training data set comprises a plurality of training images and object labeling information corresponding to the training images;
inputting a training image into the multilayer perceptron network for example mask segmentation, so that the network outputs, based on the three-dimensional grid corresponding to the input training image and taking the cell where the point cloud corresponding to the center of the object is located as a reference, predicted grid information corresponding to each point cloud of the object;
and determining the parameters of the multilayer perceptron network for example mask segmentation based on the predicted grid information and the object labeling information.
Optionally, the apparatus 1500 further includes a third determining module, configured to determine semantic segmentation information of the image based on the image feature.
Optionally, the processing module 1503 is further configured to: determining instance segmentation information for the object based on the semantic segmentation information and the instance mask information; image processing is performed based on the instance segmentation information.
The apparatus according to the embodiment of the present application may execute the method provided by the embodiment of the present application, and the implementation principle is similar, the actions executed by the modules in the apparatus according to the embodiments of the present application correspond to the steps in the method according to the embodiments of the present application, and for the detailed functional description of the modules in the apparatus, reference may be specifically made to the description in the corresponding method shown in the foregoing, and details are not repeated here.
The present application further provides an electronic device comprising a memory and a processor; wherein the memory has stored therein a computer program; the processor is adapted to perform the method provided in any of the alternative embodiments of the present application when running the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method provided in any of the alternative embodiments of the present application.
Alternatively, fig. 16 shows a schematic structural diagram of an electronic device to which the embodiment of the present application is applicable, and as shown in fig. 16, the electronic device 1600 may include a processor 1601 and a memory 1603. The processor 1601 is coupled to the memory 1603, such as via a bus 1602. Optionally, the electronic device 1600 may also include a transceiver 1604. It should be noted that the transceiver 1604 is not limited to one in practical application, and the structure of the electronic device 1600 does not constitute a limitation on the embodiment of the present application.
The Processor 1601 may be a CPU (Central Processing Unit), a general purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 1601 may also be a combination that implements computing functionality, e.g., comprising one or more microprocessors, a combination of DSPs and microprocessors, etc.
The Memory 1603 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 1603 is used for storing application program codes for executing the scheme of the application and is controlled to be executed by the processor 1601. The processor 1601 is adapted to execute application program code (computer program) stored in the memory 1603 to implement the content shown in any one of the method embodiments described above.
In embodiments provided herein, the above object pose estimation method performed by an electronic device may be performed using an artificial intelligence model.
According to an embodiment of the application, the method performed in the electronic device may obtain output data identifying the image or image features in the image by using the image data or video data as input data for an artificial intelligence model. The artificial intelligence model may be obtained through training. Here, "obtained by training" means that a basic artificial intelligence model is trained with a plurality of pieces of training data by a training algorithm to obtain a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose). The artificial intelligence model can include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and the neural network calculation is performed by a calculation between a calculation result of a previous layer and the plurality of weight values.
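Purely as an illustration of this layer-wise computation (the layer sizes and the ReLU activation are assumptions, not the model of the embodiment):

```python
import numpy as np

def layered_forward(x, weights, biases):
    # x: (F,) input features; weights/biases: the weight values of each layer
    for W, b in zip(weights, biases):
        # the calculation of one layer uses the previous layer's result and this layer's weights
        x = np.maximum(W @ x + b, 0.0)
    return x
```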
Visual understanding is a technique for recognizing and processing things like human vision, and includes, for example, object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
The object posture estimation device provided by the application can realize at least one module in a plurality of modules through an AI model. The functions associated with the AI may be performed by the non-volatile memory, the volatile memory, and the processor.
The processor may include one or more processors. In this case, the one or more processors may be general-purpose processors (e.g., a Central Processing Unit (CPU), an Application Processor (AP), etc.), graphics-dedicated processors (e.g., a Graphics Processing Unit (GPU), a Visual Processing Unit (VPU)), and/or AI-dedicated processors (e.g., a Neural Processing Unit (NPU)).
The one or more processors control the processing of the input data according to predefined operating rules or Artificial Intelligence (AI) models stored in the non-volatile memory and the volatile memory. Predefined operating rules or artificial intelligence models are provided through training or learning.
Here, the provision by learning means that a predefined operation rule or an AI model having a desired characteristic is obtained by applying a learning algorithm to a plurality of learning data. This learning may be performed in the device itself in which the AI according to the embodiment is performed, and/or may be implemented by a separate server/system.
The AI model may be composed of multiple neural network layers. Each layer has a plurality of weight values, and the calculation of one layer is performed using the calculation result of the previous layer and the plurality of weights of the current layer. Examples of neural networks include, but are not limited to, Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Bidirectional Recurrent Deep Neural Networks (BRDNNs), Generative Adversarial Networks (GANs), and deep Q networks.
A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data so as to enable the target device to make a determination or prediction, or to control the target device to do so. Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts of the figures may include multiple sub-steps or multiple stages, which are not necessarily performed at the same moment but may be performed at different moments, and are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.
Claims (14)
1. An object pose estimation method, comprising:
acquiring image characteristics corresponding to point clouds of an input image;
determining semantic segmentation information, example mask information and key point information of the object based on the image features;
and estimating the object posture based on the semantic segmentation information, the example mask information and the key point information.
2. The method of claim 1, wherein obtaining image features corresponding to point clouds of the input image comprises at least one of:
extracting point cloud features based on the input depth image, and confirming the extracted point cloud features as image features corresponding to the point cloud;
extracting a first image feature based on the input color image and/or grayscale image; extracting point cloud features based on the input depth image; and fusing the first image characteristic and the point cloud characteristic to obtain an image characteristic corresponding to the point cloud.
3. The method of claim 2, wherein extracting point cloud features based on the input depth image comprises:
acquiring corresponding point cloud information based on the input depth image;
extracting point cloud features based on the point cloud information and at least one of the following:
color features and normal features.
4. The method of claim 2, wherein the fusing the first image feature with the point cloud feature to obtain an image feature corresponding to the point cloud comprises:
and performing pixel-by-pixel fusion on the first image characteristic and the point cloud characteristic to obtain an image characteristic corresponding to the point cloud.
5. The method of claim 1, wherein determining semantic segmentation information, instance mask information, and keypoint information for an object based on the image features comprises:
determining semantic segmentation information corresponding to the point cloud based on the image features;
creating a three-dimensional grid based on point cloud information corresponding to an input image, and determining example mask information of an object based on the three-dimensional grid;
keypoint information is determined based on the instance mask information, or based on the semantic segmentation information and instance mask information.
6. The method of claim 5, wherein the example mask information characterizes mesh information corresponding to point clouds in the three-dimensional mesh, wherein the mesh information corresponding to each point cloud of the object is determined based on the mesh information corresponding to the point cloud at the center of the object.
7. The method of claim 5, wherein creating a three-dimensional mesh based on point cloud information corresponding to the input image comprises at least one of:
dividing a three-dimensional space corresponding to the point cloud information at equal intervals to obtain a three-dimensional grid;
dividing three-dimensional spaces corresponding to the point cloud information based on a plurality of preset intervals respectively to obtain a plurality of three-dimensional grids;
and dividing the three-dimensional space corresponding to the point cloud information to obtain a plurality of three-dimensional grids based on the same division starting points with different intervals.
8. The method of claim 5,
determining keypoint information based on the instance mask information, including at least one of:
estimating a first offset of each point cloud corresponding to the key point based on the image characteristics; determining key point information of the object in a regression mode based on the first offset and the example mask information;
estimating second offset of key points corresponding to each point cloud in each cell of the three-dimensional grid based on the image features and the example mask information; determining key point information of the object in a regression mode based on the second offset;
determining keypoint information based on the semantic segmentation information and the instance mask information, including:
determining instance segmentation information based on the semantic segmentation information and the instance mask information;
estimating a first offset of each point cloud corresponding to the key point based on the image characteristics;
and determining key point information of the object in a regression mode based on the first offset and the example segmentation information.
9. The method of claim 8,
determining key point information of the object by a regression mode based on the first offset and the example mask information, wherein the determining comprises the following steps:
determining an initial predicted value of the key point predicted by taking the point cloud as a reference based on the first offset and the point cloud information; determining target predicted values of key points in the three-dimensional grid based on the initial predicted values and example mask information; determining key point information of the object in a regression mode based on the target predicted value;
the determining the key point information of the object by means of regression based on the second offset includes:
determining a target predicted value of the key point predicted by taking the point cloud as a reference based on the second offset and the point cloud information; determining key point information of the object in a regression mode based on the target predicted value;
the determining, by means of regression, the key point information of the object based on the first offset and the example segmentation information includes:
determining an initial predicted value of the key point predicted by taking the point cloud as a reference based on the first offset and the point cloud information; determining a target predicted value of a key point in the three-dimensional grid based on the initial predicted value and the example segmentation information; and determining key point information of the object in a regression mode based on the target predicted value.
10. The method of claim 9, wherein determining the keypoint information of the object by regression based on the target predicted value comprises at least one of:
determining the key point information as the mean value of the target predicted values respectively corresponding to the key point and each point cloud aiming at each key point of the object;
for each key point of the object, determining a target predicted value corresponding to the key point and each point cloud respectively and a weighted average value of probability values corresponding to each point cloud in the example mask information as key point information;
for each key point of the object, determining a target predicted value corresponding to the key point and a preset numerical value point cloud closest to the object center point respectively and a weighted average value of probability values corresponding to the preset numerical value point clouds in the example mask information respectively as key point information;
and aiming at each key point of the object, determining a distance closeness value corresponding to the key point and each point cloud respectively, a target prediction value corresponding to the key point and each point cloud respectively, and a weighted average value of probability values corresponding to each point cloud in the example mask information respectively as key point information.
11. An image processing method, characterized by comprising:
acquiring image characteristics corresponding to point clouds of an input image;
determining example mask information of the image by taking the point cloud corresponding to the center of the object as a reference through a multilayer perceptron network of example mask segmentation based on the image characteristics;
and performing image processing based on the example mask information.
12. The method of claim 11, further comprising the step of: determining semantic segmentation information of the image based on the image features;
the image processing based on the instance mask information includes:
determining instance segmentation information for the object based on the semantic segmentation information and instance mask information;
image processing is performed based on the instance segmentation information.
13. An electronic device comprising a memory and a processor;
the memory has stored therein a computer program;
the processor, when running the computer program, is configured to perform the method of any of claims 1 to 10 or 11 to 12.
14. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 10 or 11 to 12.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011446331.XA CN114663502A (en) | 2020-12-08 | 2020-12-08 | Object posture estimation and image processing method and related equipment |
KR1020210139573A KR20220081261A (en) | 2020-12-08 | 2021-10-19 | Method and apparatus for object pose estimation |
US17/545,333 US20220180548A1 (en) | 2020-12-08 | 2021-12-08 | Method and apparatus with object pose estimation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011446331.XA CN114663502A (en) | 2020-12-08 | 2020-12-08 | Object posture estimation and image processing method and related equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114663502A true CN114663502A (en) | 2022-06-24 |
Family
ID=81987396
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011446331.XA Pending CN114663502A (en) | 2020-12-08 | 2020-12-08 | Object posture estimation and image processing method and related equipment |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR20220081261A (en) |
CN (1) | CN114663502A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115100382A (en) * | 2022-07-19 | 2022-09-23 | 上海人工智能创新中心 | Nerve surface reconstruction system and method based on mixed characterization |
CN117575985A (en) * | 2024-01-19 | 2024-02-20 | 大连亚明汽车部件股份有限公司 | Method, device, equipment and medium for supervising casting of automobile parts |
CN117576217A (en) * | 2024-01-12 | 2024-02-20 | 电子科技大学 | Object pose estimation method based on single-instance image reconstruction |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20240054780A (en) * | 2022-10-19 | 2024-04-26 | 네이버랩스 주식회사 | Method and system for correcting object pose |
KR20240067757A (en) * | 2022-11-09 | 2024-05-17 | 주식회사 스파이더코어 | Method and system for detecting dangerous postures |
WO2024147487A1 (en) * | 2023-01-03 | 2024-07-11 | 삼성전자 주식회사 | Electronic device for processing image and method for operating same |
CN116580084B (en) * | 2023-02-24 | 2024-01-16 | 江苏共知自动化科技有限公司 | Industrial part rapid pose estimation method based on deep learning and point cloud |
CN117214860B (en) * | 2023-08-14 | 2024-04-19 | 北京科技大学顺德创新学院 | Laser radar odometer method based on twin feature pyramid and ground segmentation |
CN116968758B (en) * | 2023-09-19 | 2024-08-16 | 江西五十铃汽车有限公司 | Vehicle control method and device based on three-dimensional scene representation |
CN118115588A (en) * | 2024-02-28 | 2024-05-31 | 内蒙古工业大学 | Laser point cloud and image fusion type offshore wind turbine boarding ladder attitude estimation method |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115100382A (en) * | 2022-07-19 | 2022-09-23 | 上海人工智能创新中心 | Nerve surface reconstruction system and method based on mixed characterization |
CN115100382B (en) * | 2022-07-19 | 2024-05-31 | 上海人工智能创新中心 | Nerve surface reconstruction system and method based on hybrid characterization |
CN117576217A (en) * | 2024-01-12 | 2024-02-20 | 电子科技大学 | Object pose estimation method based on single-instance image reconstruction |
CN117576217B (en) * | 2024-01-12 | 2024-03-26 | 电子科技大学 | Object pose estimation method based on single-instance image reconstruction |
CN117575985A (en) * | 2024-01-19 | 2024-02-20 | 大连亚明汽车部件股份有限公司 | Method, device, equipment and medium for supervising casting of automobile parts |
CN117575985B (en) * | 2024-01-19 | 2024-06-04 | 大连亚明汽车部件股份有限公司 | Method, device, equipment and medium for supervising casting of automobile parts |
Also Published As
Publication number | Publication date |
---|---|
KR20220081261A (en) | 2022-06-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114663502A (en) | Object posture estimation and image processing method and related equipment | |
CN110832501B (en) | System and method for pose invariant facial alignment | |
Huang et al. | Indoor depth completion with boundary consistency and self-attention | |
CN111161349B (en) | Object posture estimation method, device and equipment | |
CN112270249A (en) | Target pose estimation method fusing RGB-D visual features | |
CN110246181B (en) | Anchor point-based attitude estimation model training method, attitude estimation method and system | |
CN112991413A (en) | Self-supervision depth estimation method and system | |
CN107329962B (en) | Image retrieval database generation method, and method and device for enhancing reality | |
CN113674416B (en) | Three-dimensional map construction method and device, electronic equipment and storage medium | |
EP3905194A1 (en) | Pose estimation method and apparatus | |
KR20220043847A (en) | Method, apparatus, electronic device and storage medium for estimating object pose | |
GB2612029A (en) | Lifted semantic graph embedding for omnidirectional place recognition | |
CN111768415A (en) | Image instance segmentation method without quantization pooling | |
CN112102342B (en) | Plane contour recognition method, plane contour recognition device, computer equipment and storage medium | |
CN116152334A (en) | Image processing method and related equipment | |
CN111531546B (en) | Robot pose estimation method, device, equipment and storage medium | |
US20220180548A1 (en) | Method and apparatus with object pose estimation | |
KR20210018114A (en) | Cross-domain metric learning system and method | |
CN116503282A (en) | Manifold-based excavator construction environment site point cloud denoising method and system | |
CN113592021B (en) | Stereo matching method based on deformable and depth separable convolution | |
CN107341151B (en) | Image retrieval database generation method, and method and device for enhancing reality | |
CN115294433A (en) | Object six-dimensional pose estimation method and system suitable for severe environment | |
CN115273219A (en) | Yoga action evaluation method and system, storage medium and electronic equipment | |
CN111695552B (en) | Multi-feature fusion underwater target modeling and optimizing method | |
CN116228850A (en) | Object posture estimation method, device, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |