CN116152334A - Image processing method and related equipment - Google Patents

Image processing method and related equipment

Info

Publication number
CN116152334A
Authority
CN
China
Prior art keywords
feature information
information
dimensional
characteristic information
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111348242.6A
Other languages
Chinese (zh)
Inventor
汪昊
李炜明
王强
金知姸
张现盛
洪性勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to CN202111348242.6A (CN116152334A)
Priority to KR1020220111206A (KR20230071052A)
Priority to EP22207194.6A (EP4181079A1)
Priority to JP2022182118A (JP2023073231A)
Priority to US17/987,060 (US20230154170A1)
Publication of CN116152334A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • G06T11/60Editing figures and text; Combining figures or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide an image processing method, an image processing apparatus, an electronic device, a computer-readable storage medium and a computer program product, relating to the technical fields of image processing and artificial intelligence. The image processing method comprises the following steps: acquiring three-dimensional feature information and two-dimensional feature information based on a color image and a depth image; fusing the three-dimensional feature information and the two-dimensional feature information based on an attention mechanism to obtain fused feature information; and performing image processing based on the fused feature information. The method performs image processing using multi-modal information, which helps to improve the accuracy of image processing. The above image processing method performed by the electronic device may also be performed using an artificial intelligence model.

Description

Image processing method and related equipment
Technical Field
The present application relates to the field of image processing and artificial intelligence, and in particular, to an image processing method, an apparatus, an electronic device, a computer readable storage medium, and a computer program product.
Background
In image processing technology, pose estimation, image segmentation, object recognition and the like are important research directions. In the prior art, image processing is generally performed using information of only a single modality. However, for complex scenes, image processing that relies on single-modality information often yields results of low accuracy.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, electronic equipment, a computer readable storage medium and a computer program product, which can solve the technical problem of low accuracy of image processing results in the related technology. The technical scheme is as follows:
according to an aspect of the embodiments of the present application, there is provided an image processing method, including:
based on the color image and the depth image, acquiring three-dimensional characteristic information and two-dimensional characteristic information;
based on an attention mechanism, fusing the three-dimensional characteristic information and the two-dimensional characteristic information to obtain fused characteristic information;
and performing image processing based on the fusion characteristic information.
According to another aspect of the embodiments of the present application, there is provided an image processing apparatus including:
the acquisition module is used for acquiring three-dimensional characteristic information and two-dimensional characteristic information based on the color image and the depth image;
the fusion module is used for fusing the three-dimensional characteristic information and the two-dimensional characteristic information based on an attention mechanism to obtain fused characteristic information;
and the processing module is used for processing the image based on the fusion characteristic information.
According to another aspect of the embodiments of the present application, there is provided an electronic device including:
One or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs being configured to perform the above image processing method.
According to still another aspect of the embodiments of the present application, there is provided a computer-readable storage medium for storing computer instructions that, when executed on a computer, enable the computer to perform the above-described image processing method.
According to an aspect of embodiments of the present application, there is provided a computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the above-described image processing method.
The beneficial effects brought by the technical solutions provided in the embodiments of the present application are as follows:
the application provides an image processing method, an image processing device, electronic equipment, a computer readable storage medium and a computer program product, and specifically, aiming at an input image, three-dimensional characteristic information and two-dimensional characteristic information are firstly obtained based on a color image and a depth image of the input image, and then the three-dimensional characteristic information and the two-dimensional characteristic information are subjected to characteristic fusion to obtain fusion characteristic information, wherein the characteristic fusion is realized by adopting an attention mechanism; further, image processing is performed based on the fusion feature information; the implementation of the scheme obtains multi-mode fusion characteristic information through characteristic fusion so as to realize image processing based on multi-mode information, and compared with the image processing based on single-mode information, the implementation of the scheme is beneficial to improving the accuracy of image processing. In addition, in some specific scenes, such as application scenes of augmented reality, the implementation of the scheme is also beneficial to improving the perception capability of three-dimensional information, so that the processing efficiency and the robustness of the system are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flow chart of an image processing method according to an embodiment of the present application;
fig. 2 is a flowchart of an image processing method according to an embodiment of the present application;
FIG. 3a is a block flow diagram of a class-level method for estimating the 6D pose and size of an object based on color and depth images according to an embodiment of the present application;
FIG. 3b is a block flow diagram of an image processing method based on a color image and a depth image according to an embodiment of the present application;
FIG. 3c is a block diagram of an ARF-Net provided in an embodiment of the present application;
FIG. 3d is a block diagram of another ARF-Net provided by an embodiment of the present application;
FIG. 4 is a block flow diagram of a method for estimating a multi-scale fused object pose based on an attention mechanism according to an embodiment of the present application;
FIG. 5 is a block flow diagram of a depth feature fusion method based on an attention mechanism according to an embodiment of the present application;
FIG. 6 is a block diagram of a design of an attention mechanism provided by an embodiment of the present application;
FIG. 7 is a block diagram of a design of an attention mechanism provided by an embodiment of the present application;
FIG. 8 is a block diagram of a design of an attention mechanism provided by an embodiment of the present application;
FIG. 9 is a block flow diagram of an end-to-end object pose estimation method in combination with multi-modal fusion according to an embodiment of the present application;
FIG. 10 is a block flow diagram of a pose estimation method that jointly performs object shape reconstruction and segmentation tasks according to an embodiment of the present application;
FIG. 11a is a schematic diagram of an operating environment according to an embodiment of the present application;
FIG. 11b is a schematic diagram of an input image according to an embodiment of the present application;
FIG. 11c is a schematic diagram of an image processing result according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as would be understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" indicates at least one of the items it defines, e.g., "A and/or B" indicates implementation as "A" or as "A and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the related art to which the present application relates:
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason and make decisions.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning. This application may involve computer vision techniques.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to recognize, track and measure targets, and further performs graphics processing so that the result is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, intelligent transportation, and the like, as well as common biometric technologies such as face recognition and fingerprint recognition.
Specifically, the image processing method and the related device provided by the embodiments of the present application may be applied to example scenes of augmented reality (Augmented Reality, AR), image processing, image recognition, object recognition, image segmentation, 6D pose estimation, and the like. In the scene of augmented reality, a real scene experience is provided for a user by adding virtual content in a real scene in front of the user; in three-dimensional space, system processing relying on augmented reality technology requires high-precision real-time processing and understanding of the three-dimensional state of surrounding objects to accomplish the presentation of high-quality virtual-real fusion effects in front of the user.
In the related art, image processing is generally performed using image data of only a single modality, for example performing 6D pose estimation using only depth images while RGB images are used only for object detection. On this technical basis, image processing is limited by factors such as sensor noise and object occlusion, which may make the predicted pose ambiguous. In addition, intra-class shape variation is a great challenge for class-level pose estimation tasks, and such variation easily leads to inaccurate prediction and localization of objects.
The embodiment of the application provides an image processing method, an image processing device, electronic equipment, a computer readable storage medium and a computer program product; specifically, the implementation of the application performs image processing by inputting an image comprising color and depth information, which is beneficial to improving the efficiency and robustness of the system in applications such as augmented reality and the like; in addition, color features and depth features are fused at the same time, so that the perception capability of the model on three-dimensional information is improved, and the problem of shape and scale change of class-level objects is better processed.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
An image processing method is provided in the embodiments of the present application, as shown in fig. 1 and fig. 2; fig. 1 shows a schematic flow chart of the image processing method provided in the embodiments of the present application, and fig. 2 shows a block flow chart of the image processing method provided in the embodiments of the present application. The method may be executed by any electronic device; as shown in fig. 11a, the electronic device may be a user terminal 100 or a server 200. The user terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted device, an AR device, or the like. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data and artificial intelligence platforms. Communication may be performed between the terminal 100 and the server 200.
Specifically, as shown in fig. 1, the image processing method provided in the embodiment of the present application may include the following steps S101 to S103:
step S101: based on the color image and the depth image, three-dimensional feature information and two-dimensional feature information are acquired.
Specifically, the color image and the depth image may be the images corresponding respectively to the color information and the depth information contained in one input image. The input image may be a depth image with color, such as the superposition of a color image (RGB image) and a depth image (i.e., an RGB-D image), or the superposition of a grayscale image and a depth image. The input image may appear as shown in fig. 11b.
The three-dimensional feature information may be extracted based on the depth image, and the two-dimensional feature information may be extracted based on the color image or the grayscale image. Specifically, the input image is subjected to object detection by an object detector to obtain an object region, and the depth image and the color image are cropped to the object region to obtain a color image and a depth image containing the object.
Image feature extraction proceeds from low-level to high-level features. As shown in fig. 11b, features extracted in the lower layers of the network may be contour features of a table, while features extracted as the network goes deeper may be higher-level feature information such as table legs and table drawers; features of different scales are the features extracted at different scales of the network. Optionally, in the present application, the processing may be performed on three-dimensional feature information and two-dimensional feature information of a single scale, or on three-dimensional feature information and two-dimensional feature information of multiple scales.
Step S102: and based on an attention mechanism, fusing the three-dimensional characteristic information and the two-dimensional characteristic information to obtain fused characteristic information.
The feature fusion process combines the single-modality features extracted from the image into multi-modal features that are more discriminative than the input features. The embodiments of the present application use an attention mechanism to fuse the three-dimensional feature information and the two-dimensional feature information. It can be understood that in the fused feature information obtained after fusing the three-dimensional feature information and the two-dimensional feature information, the features complement each other, reducing the influence of the inherent defects of single-modality features.
Optionally, when three-dimensional feature information and two-dimensional feature information of multiple scales are extracted, the fused feature information obtained by fusion is the feature information of multi-scale fusion.
Step S103: and performing image processing based on the fusion characteristic information.
Specifically, image processing based on the fused feature information can solve image processing problems without an accurate three-dimensional model, and can therefore better cope with complex application scenarios in practice (according to the embodiments of the present application, accurate AR interaction can be performed in a real-world scene using RGB-D data without requiring a known model of the object). Fig. 11c shows a result image obtained after image processing by the method provided by the embodiments of the present application; the virtual object rendered in the augmented reality system can be controlled based on this result, so that the real object and the virtual object interact in a real and natural way.
By fusing and fully utilizing three-dimensional features and two-dimensional features, the method and apparatus of the embodiments of the present application can process three-dimensional objects efficiently on mobile platforms with limited computing, storage and energy resources, while meeting the accuracy and robustness requirements of three-dimensional object processing and understanding.
To implement the image processing method provided by the embodiments of the present application, an ARF-Net (Attention-guided RGB-D Fusion Net) is provided; the model fuses multi-modal information using a Transformer mechanism. Specifically, ARF-Net can adaptively fuse two-dimensional features (apparent features extracted from the RGB image) with three-dimensional features (three-dimensional features extracted from the depth image or point cloud) through an attention mechanism, and can fully explore object structural features to distinguish the shapes of different instances. The ARF network provided in this application fuses RGB features with point cloud features and achieves performance improvements with a variety of feature extractors.
The following describes a process of acquiring three-dimensional feature information and two-dimensional feature information based on an input image.
In one embodiment, the step S101 of acquiring three-dimensional feature information and two-dimensional feature information of at least one scale of the input image includes the following steps A1-A2:
Step A1: converting the input depth image into point cloud characteristic information in a three-dimensional space, and encoding based on the point cloud characteristic information to obtain three-dimensional characteristic information of at least one scale.
Specifically, as shown in fig. 3a, the depth image, combined with the camera intrinsic parameters, may be converted into point cloud feature information (also referred to as point cloud data) in three-dimensional space; the point cloud feature information is then used as the input of a three-dimensional feature encoder, and the three-dimensional feature information may be obtained through the processing of the three-dimensional feature encoder.
The three-dimensional feature extraction may be implemented using a three-dimensional feature extraction network (three-dimensional feature encoder), such as a multi-layer perceptron (MLP) encoder or a three-dimensional voxel network, to extract the three-dimensional feature vector corresponding to each point. The three-dimensional feature extraction network describes the three-dimensional structural features of the depth image in three-dimensional space.
Step A2: and encoding based on the input color image or color depth image to obtain two-dimensional characteristic information of at least one scale.
Specifically, as shown in fig. 3a, one of a color image, a grayscale image, a color depth image, a grayscale depth image, and the like may be used as the input of a two-dimensional feature encoder and processed by the two-dimensional feature encoder to obtain the two-dimensional feature information. The two-dimensional feature encoder may be implemented using a deep convolutional neural network, and the two-dimensional apparent features of the input image can be extracted by it.
In the embodiments of the present application, for an aligned RGB-D scene, an instance segmentation method may first be used to detect and segment object instances; the RGB image of each instance is cropped by its object bounding box, and the masked depth information is converted, using the camera intrinsic parameters, into the point cloud data of the instance; this data is used as the input of the ARF network. As shown in fig. 3c and 3d, ARF-Net first uses a convolutional neural network (RGB network) to extract RGB features and a point cloud feature network (point cloud network) to extract point cloud features; the features of the two modalities are then fused by an attention-guided RGB-D fusion module (ARF) to further enhance the network's ability to learn the canonical shape of the object. Based on this fusion module, the object appearance features can be adaptively fused into the object geometry features. In the shape decoding stage, the NOCS (Normalized Object Coordinate Space) representation can be reconstructed using an MLP-based decoder that takes the fused features as input. Finally, the predicted NOCS coordinates are matched to the observed points using a similarity transformation, such as the Umeyama algorithm, to obtain the 6D pose and size of the object.
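As an illustration only, the overall data flow described above can be sketched in Python/PyTorch as follows; the class and argument names are hypothetical stand-ins for the RGB network, point cloud network, ARF fusion module and NOCS decoder mentioned above, not a concrete implementation of this application.

import torch.nn as nn

class ARFNetSketch(nn.Module):
    # Hypothetical wrapper reflecting the described pipeline:
    # cropped RGB patch -> RGB network, instance point cloud -> point network,
    # attention-guided fusion -> MLP decoder predicting NOCS coordinates.
    def __init__(self, rgb_backbone, point_backbone, fusion_module, nocs_decoder):
        super().__init__()
        self.rgb_backbone = rgb_backbone
        self.point_backbone = point_backbone
        self.fusion = fusion_module
        self.nocs_decoder = nocs_decoder

    def forward(self, rgb_crop, points):
        f_rgb = self.rgb_backbone(rgb_crop)   # (B, C, H, W) appearance features
        f_pts = self.point_backbone(points)   # (B, N, C) geometric features
        f_fused = self.fusion(f_pts, f_rgb)   # (B, N, C) multi-modal features
        return self.nocs_decoder(f_fused)     # (B, N, 3) NOCS coordinates

The 6D pose and size would then be recovered outside the network by aligning the predicted NOCS coordinates with the observed points via a similarity transform (e.g., the Umeyama algorithm).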
The specific processing procedure for feature fusion is described below.
Specifically, as shown in fig. 3a, in the embodiment of the present application, three-dimensional feature information extracted by a three-dimensional feature encoder and two-dimensional feature information extracted by a two-dimensional feature encoder may be input to a multiscale fusion module based on an attention mechanism to realize feature fusion, and fused feature information is output. That is, the embodiment of the application may fuse the three-dimensional feature information of at least one scale and the two-dimensional feature information of at least one scale based on the attention mechanism to obtain the fused feature information.
In an embodiment, as shown in fig. 4, the step S102 of fusing the three-dimensional feature information and the two-dimensional feature information based on the attention mechanism to obtain fused feature information includes the step B1:
step B1: the following operations are performed for three-dimensional feature information and two-dimensional feature information of any scale: and carrying out feature fusion on the three-dimensional feature information of the current scale and the two-dimensional feature information of the current scale based on an attention mechanism to obtain fusion feature information.
The three-dimensional feature information of the current scale is determined according to the fused feature information of the previous scale and the three-dimensional feature information of the previous scale; the two-dimensional feature information of the current scale is determined according to the two-dimensional feature information of the previous scale.
Specifically, multi-scale feature fusion fuses the three-dimensional feature information and the two-dimensional feature information extracted at different scales. The fusion may use an attention-based fusion method, and fusion across multiple scales may be realized in a cascaded manner. Optionally, the features fused at each scale may be concatenated with the three-dimensional feature information of the previous scale and used as the input of the three-dimensional feature encoder of the next scale, while the two-dimensional feature information of the previous scale may be used as the input of the two-dimensional feature encoder of the next scale.
Take the 3-scale fusion approach shown in fig. 4 as an example:
For scale 1, first three-dimensional feature information at this scale is extracted based on the point cloud data, first two-dimensional feature information at this scale is extracted based on the color image, and the attention-based fusion module Fusion 1 fuses the first three-dimensional feature information and the first two-dimensional feature information to obtain first fused feature information;
For scale 2, second three-dimensional feature information at this scale can be extracted based on the first three-dimensional feature information and the first fused feature information, second two-dimensional feature information at this scale can be extracted based on the first two-dimensional feature information, and the attention-based fusion module Fusion 2 fuses the second three-dimensional feature information and the second two-dimensional feature information to obtain second fused feature information;
For scale 3, third three-dimensional feature information at this scale can be extracted based on the second three-dimensional feature information and the second fused feature information, third two-dimensional feature information at this scale can be extracted based on the second two-dimensional feature information, and the attention-based fusion module Fusion 3 fuses the third three-dimensional feature information and the third two-dimensional feature information to obtain third fused feature information (i.e., the finally obtained fused features).
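A minimal sketch of this cascade, assuming the per-scale point cloud encoders, RGB encoders and attention-based fusion modules are given as lists of callables (all names are illustrative, not from the application):

import torch

def cascaded_fusion(points, rgb, pc_encoders, rgb_encoders, fuse_modules):
    # pc_encoders / rgb_encoders / fuse_modules: one module per scale
    f3d, f2d, fused = points, rgb, None
    for pc_enc, rgb_enc, fuse in zip(pc_encoders, rgb_encoders, fuse_modules):
        # the 3D input of the next scale is the previous 3D features
        # concatenated with the previous fused features (when available)
        f3d_in = f3d if fused is None else torch.cat([f3d, fused], dim=-1)
        f3d = pc_enc(f3d_in)      # three-dimensional features at this scale
        f2d = rgb_enc(f2d)        # two-dimensional features at this scale
        fused = fuse(f3d, f2d)    # attention-based fusion at this scale
    return fused                  # fused features of the last scale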
In one embodiment, the step S102 of fusing the three-dimensional feature information and the two-dimensional feature information based on the attention mechanism to obtain fused feature information includes the following steps B2-B4:
step B2: and acquiring point cloud voxel characteristic information and/or voxel position characteristic information according to the three-dimensional characteristic information.
Specifically, the three-dimensional feature information can be directly voxelized and then converted into point cloud voxel feature information.
As shown in fig. 5, it may first be determined whether the three-dimensional feature information is a voxel feature; if so, voxelization is performed and the voxel position feature information and point cloud voxel feature information are then obtained through voxel feature encoding; if not, the three-dimensional feature information is directly converted into voxel position feature information and point cloud voxel feature information. Here, "voxel" is short for volume element (volume pixel), and voxelization is the conversion of a geometric representation of an object into the voxel representation closest to that object.
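A rough illustration of such a voxelization step, assuming a uniform grid and mean-pooling of point features per voxel (the grid size and pooling choice are assumptions; the application does not fix them):

import torch

def voxelize(points, feats, voxel_size=0.05):
    # points: (N, 3) 3D coordinates, feats: (N, C) per-point features
    coords = torch.floor(points / voxel_size).long()               # voxel indices
    uniq, inverse = torch.unique(coords, dim=0, return_inverse=True)
    voxel_feats = torch.zeros(uniq.size(0), feats.size(1))
    counts = torch.zeros(uniq.size(0), 1)
    voxel_feats.index_add_(0, inverse, feats)                       # sum per voxel
    counts.index_add_(0, inverse, torch.ones(feats.size(0), 1))
    voxel_feats = voxel_feats / counts.clamp(min=1)                 # mean per voxel
    # uniq plays the role of the voxel position information,
    # voxel_feats the role of the point cloud voxel feature information
    return uniq, voxel_feats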
Step B3: and acquiring the voxel characteristic information of the first image according to the two-dimensional characteristic information.
Specifically, since the RGB-D image is aligned, the image pixels and the three-dimensional points of the point cloud have a one-to-one correspondence; based on this, the two-dimensional feature information (image features) can be projected, through the known 2D-3D positional relationship, into a voxel space consistent with the point cloud to obtain the image voxel feature information.
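Because pixels and points correspond one-to-one in the aligned RGB-D frame, lifting the two-dimensional features into the point/voxel space amounts to a gather over known pixel positions; a minimal sketch with assumed shapes and names:

import torch

def gather_image_features(feat_map, uv):
    # feat_map: (C, H, W) two-dimensional feature map
    # uv: (N, 2) integer pixel coordinates of the N projected 3D points
    C, H, W = feat_map.shape
    u = uv[:, 0].clamp(0, W - 1)
    v = uv[:, 1].clamp(0, H - 1)
    return feat_map[:, v, u].transpose(0, 1)   # (N, C) per-point image features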
Step B4: and based on an attention mechanism, carrying out feature fusion according to the point cloud voxel feature information, the voxel position feature information and/or the first image voxel feature information to obtain fusion feature information.
Specifically, the point cloud voxel characteristic information, the first image voxel characteristic information and the voxel position characteristic information can be used as input of an attention module, so that the fusion processing of the characteristic information can be realized by adopting an attention mechanism.
In a possible embodiment, considering the appearance features of RGB and the geometric features of the point cloud, the ARF-Net proposed in the embodiments of the present application uses a cross-attention module to establish the correlation between the RGB features and the point cloud features. The fusion module may adaptively select representative apparent features through cross-modal correlation computation to enhance the corresponding point cloud features. ARF-Net uses a self-attention module to extract the internal spatial relationships between points of the object point cloud and to describe the global spatial structural relationships among local parts of the object.
Specifically, in the fusion module, structure-aware geometric features may first be obtained with self-attention, and relation-aware fused RGB features may then be obtained with cross-attention. The fusion module can be used alone, or several fusion modules can be stacked.
For the structure-aware point cloud features, a self-attention module is used to establish the dependency relationships between points of the point cloud. In order to collect multi-scale point cloud features, the low-level multi-scale point cloud features can be upsampled to the same resolution and concatenated. As shown in fig. 3c and 3d, after the multi-scale features are concatenated, a multi-layer perceptron is used to compress them into a fixed feature dimension.
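A simple sketch of this multi-scale collection step, assuming nearest-neighbour upsampling along the point dimension and a 1x1 convolution as the shared MLP (both choices are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

def aggregate_multiscale(feats, n_points, out_dim):
    # feats: list of (B, C_i, N_i) point features extracted at different scales
    up = [F.interpolate(f, size=n_points, mode='nearest') for f in feats]
    cat = torch.cat(up, dim=1)                  # (B, sum(C_i), n_points)
    mlp = nn.Conv1d(cat.size(1), out_dim, 1)    # shared MLP compressing channels
    return mlp(cat)                             # (B, out_dim, n_points)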
The self-attention module takes the point cloud features F_p as input and projects them through linear operations to generate a query, a key and a value; specifically, this can be expressed by the following formulas (1) to (4):

Q_m = F_p · W_m^Q    (1)
K_m = F_p · W_m^K, V_m = F_p · W_m^V    (2)
A_m = softmax(Q_m · K_m^T / √d)    (3)
F_m = A_m · V_m    (4)

where m indexes the attention heads; in the multi-head attention module, the attention operations are computed in parallel over multiple heads. In each head, an attention map A_m is computed between the local features in the projected embedding space, and the attention map is multiplied with the value V_m to obtain the enhanced point cloud features of the instance; the per-head point cloud features F_m are then concatenated (concat) together to model the object structure as a whole.
Here softmax() is the activation function, Q_m is the query vector, K_m is the key vector, V_m is the value vector, and W_m^Q, W_m^K and W_m^V are the weight matrices of the query, key and value, respectively; T denotes transposition and d is a dimension parameter involved in the attention computation.
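Formulas (1) to (4) correspond to standard multi-head self-attention over the point features; a compact sketch, with head count and dimensions as assumptions:

import math
import torch
import torch.nn as nn

class PointSelfAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.h, self.d = heads, dim // heads
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, f_p):                               # f_p: (B, N, dim)
        B, N, _ = f_p.shape
        split = lambda x: x.view(B, N, self.h, self.d).transpose(1, 2)
        Q, K, V = split(self.q(f_p)), split(self.k(f_p)), split(self.v(f_p))     # (1)-(2)
        A = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d), dim=-1)   # (3)
        F_m = (A @ V).transpose(1, 2).reshape(B, N, -1)   # (4) plus concatenation of heads
        return self.out(F_m)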
For the relation-aware RGB features, in order to enhance the three-dimensional representation, the RGB features relevant to each point of the point cloud are adaptively selected. Since the RGB-D images are aligned, the RGB feature corresponding to each point can be obtained from the location of the observed object point. A multi-head attention scheme is also used when establishing the correlation model between the RGB features and the point cloud features. For example, the three-dimensional point cloud can be used to sample the low-level multi-scale RGB features into a point-level representation; after the multi-scale features are concatenated, a shared multi-layer perceptron compresses them to the same feature dimension as the point cloud features. Since the points are sparse, context cues from neighboring pixels are exploited: a max-pooling operation may be applied to the RGB feature map around each pixel to aggregate the context features before sampling. Specifically, this is expressed by formula (5) below:

F_r = MultiheadAttention(F_p, F_r)    (5)

Here the multi-head attention operation (MultiheadAttention) is similar to the attention computation described above, but the inputs differ: the point-level RGB features F_r are used as the key and value, and the point cloud features F_p are used as the query.
On the other hand, each element of the attention map learned in the cross-attention operation represents a relation score between the appearance feature of the i-th point and the geometric feature of the j-th point. A higher correlation means a greater contribution of the corresponding appearance feature to that point. Thus, the learned correlation serves as a guide for highlighting important appearance information. The enhanced F_r and F_p are combined and then fed into a feed-forward network (FFN) consisting of one linear transformation layer to obtain the complete multi-modal features of the object instance. Specifically, formula (6) is as follows:

F_p = FFN(F_p + F_r)    (6)
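Formulas (5) and (6) can be sketched with a standard multi-head cross-attention layer in which the point features serve as the query and the point-level RGB features as key and value (layer sizes and the FFN design are assumptions):

import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f_p, f_r):              # both (B, N, dim), point-aligned
        # formula (5): F_r = MultiheadAttention(query=F_p, key=F_r, value=F_r)
        f_r_enh, _ = self.cross_attn(f_p, f_r, f_r)
        # formula (6): combine the enhanced RGB features with the point features
        # and pass them through a feed-forward network
        return self.ffn(f_p + f_r_enh)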
In this embodiment, two multi-head attention modules are used to extract 3D features from the point features and the RGB features. The RGB-D fusion module handles the feature fusion, enhancing the geometric features with semantically rich appearance features and exploring global structural information. In this way, the network can utilize local and global multi-modal information to improve the learning of the geometric representation and achieve accurate pose estimation.
Specifically, in step B4, feature fusion is performed according to the point cloud voxel feature information, voxel position feature information and/or first image voxel feature information based on an attention mechanism to obtain fusion feature information, which includes one of the following steps B41-B44:
step B41: and aiming at the voxel characteristic information of the first image and aiming at the voxel position characteristic information, the point cloud voxel characteristic information and the characteristic information output after the image voxel characteristic information is processed based on a self-attention mechanism, carrying out characteristic fusion through a cross-attention mechanism to obtain fusion characteristic information.
Specifically, as shown in fig. 5, the self-attention module takes voxel position feature information and feature information obtained by stitching point cloud voxel feature information and first image voxel feature information as inputs, and outputs the processed feature information to the cross-attention module. The cross-attention module takes as input the feature information and the first image voxel feature output from the attention module.
Optionally, after the cross attention module fuses the inputs, the fused features are input into the forward feature extraction network to be processed and fused feature information is output.
Step B42: and aiming at the first image voxel characteristic information and the characteristic information output after the self-attention mechanism is processed aiming at the point cloud voxel characteristic information, carrying out characteristic fusion through a cross-attention mechanism to obtain fusion characteristic information.
Specifically, as shown in fig. 6, the self-attention module takes the point cloud voxel characteristics as input, and outputs the processed characteristic information to the cross-attention module. The cross-attention module takes as input the feature information output from the attention module and the first image voxel feature.
Optionally, the output of the cross-attention module is input into the forward feature extraction network, and finally the fusion feature information is output.
Step B43: and aiming at the first image voxel characteristic information and the characteristic information output after the point cloud voxel characteristic information is processed based on a cross attention mechanism, carrying out characteristic fusion through a self attention mechanism to obtain fusion characteristic information.
Specifically, as shown in fig. 7, the cross-attention module takes as input the point cloud voxel characteristics and outputs the processed characteristic information to the self-attention module. The self-attention module takes as input the feature information output by the cross-attention module and the first image voxel feature.
Optionally, the output of the self-attention module is input into the forward feature extraction network, and finally the fusion feature information is output.
Step B44: and carrying out feature fusion on the feature information output after the processing of the first image voxel feature information based on a self-attention mechanism and the feature information output after the processing of the point cloud voxel feature information and the first image voxel feature information based on a cross-attention mechanism to obtain fusion feature information.
Specifically, as shown in fig. 8, the cross-attention module takes as input the point cloud voxel feature and the first image voxel feature, and the self-attention module takes as input the first image voxel feature.
Optionally, after feature stitching is performed on the feature information output by the cross attention module and the feature information output by the self attention module, the feature information is input into a forward feature extraction network, and finally the fused feature information is output.
The cross-attention module has two input features, which can be treated as the key or the query respectively. Alternatively, the feature mapping in the two attention modules may be implemented with MLPs, or a graph-convolution-based approach may be used to model the structural information between voxels. After processing by N attention modules, the fused three-dimensional features (fused feature information) can be output through a forward feature extraction network and used as the feature input of the image processing module (also called the prediction module).
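Reusing the two sketch modules above, one of the listed arrangements (the one of step B42 / fig. 6, chosen only as an example) could look like this:

import torch.nn as nn

class FusionBlockB42(nn.Module):
    # self-attention over point-cloud voxel features, then cross-attention
    # against image voxel features; PointSelfAttention / CrossModalFusion are
    # the illustrative modules sketched earlier, not modules of this application
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_attn = PointSelfAttention(dim, heads)
        self.cross = CrossModalFusion(dim, heads)

    def forward(self, point_voxel_feats, image_voxel_feats):
        f = self.self_attn(point_voxel_feats)      # structure-aware geometry
        return self.cross(f, image_voxel_feats)    # fuse RGB appearance cues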
The following describes specific contents of image processing in the embodiment of the present application.
In one embodiment, the image processing based on the fusion characteristic information in step S103 includes at least one of the following steps C1-C2:
Step C1: and carrying out attitude estimation and/or size estimation based on the fusion characteristic information.
Specifically, pose estimation determines the orientation of a given three-dimensional target object. In the embodiments of the present application, the structure and shape of the object are represented based on the fused feature information, and a correspondence is established between the model and the image by extracting object features, so as to estimate the spatial pose of the object. Pose estimation can output a 6-degree-of-freedom pose, i.e., a three-dimensional rotation and a three-dimensional translation.
Specifically, the size estimation is used to estimate the actual size of the object. The three-dimensional size of the object can be output after the size estimation.
Fig. 11c shows the effect after pose estimation and size estimation.
Optionally, in addition to processing based on the fused feature information, processing may be performed in combination with the three-dimensional feature information to better perceive the three-dimensional space when performing pose estimation and/or size estimation.
Step C2: and carrying out shape reconstruction and/or segmentation based on the fusion characteristic information.
Specifically, shape reconstruction and segmentation may be performed by a shape decoder; in the embodiments of the present application, the processing flow of the shape decoder may serve as an auxiliary task branch, and after processing based on the fused feature information, a shape reconstruction result and an object segmentation result may be output.
In one embodiment, the step C1 of estimating the pose and/or estimating the size based on the fused feature information includes the steps of C11-C13:
step C11: and detecting the three-dimensional object based on the fusion characteristic information, and determining detection information of each object.
Specifically, as shown in fig. 9, when performing object pose estimation, the RGB-D image of the entire scene may be taken as the input of the two-dimensional feature encoder.
When the point cloud data is extracted, no object detector is required to detect object regions in the input image; that is, an end-to-end model can capture the relationship between the global spatial scene context and the objects, instead of processing each object region and only the spatial relationships within a single object.
Specifically, from the fused feature information obtained by attention-based multi-scale fusion, the three-dimensional objects present in the scene can first be detected by a three-dimensional object detector. The three-dimensional object detector can locate a three-dimensional object and identify its class. Alternatively, the three-dimensional object detector may be built from a plurality of Transformer modules, thereby learning the spatial relationships of objects in the scene.
Step C12: and cutting and sampling the fusion characteristic information based on the detection information to obtain the three-dimensional object characteristics.
Specifically, the fused feature information can be cropped based on the detected three-dimensional objects, and the cropped fused feature information can be sampled into regular three-dimensional object features.
Step C13: and carrying out attitude estimation and/or size estimation based on the three-dimensional object features.
Wherein the detection information includes location information and category information.
In one embodiment, the pose estimation and/or size estimation based on the three-dimensional object features in step C13 comprises steps C131-C132:
step C131: and projecting, cutting and sampling the two-dimensional characteristic information, and then converting the two-dimensional characteristic information into second image voxel characteristic information consistent with the space corresponding to the fusion characteristic information.
Specifically, as shown by a dotted line in fig. 9, the two-dimensional feature information is projected to a three-dimensional space, and is subjected to clipping and sampling processing to obtain an image voxel feature consistent with the space in which the fused feature information is located.
Step C132: and carrying out attitude estimation and/or size estimation based on the features obtained by splicing the three-dimensional object features and the second image voxel feature information.
Specifically, the second image voxel feature information obtained in step C131 may be combined with the cropped and sampled fused feature information and then subjected to object pose estimation and size estimation.
The cropped and sampled feature information can be used as the input of the object pose feature extractor. The extracted features are input to a pose estimator and a size estimator, which output the 6D pose and three-dimensional size of the object. Alternatively, the object pose feature extractor may be built from multiple Transformer modules, thereby learning the partial relationships between objects.
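A minimal sketch of the pose estimator and size estimator heads fed by the pose feature extractor; the output parameterization (a 3x3 rotation flattened to 9 values, a 3-vector translation, a 3-vector size) is an assumption made only for illustration:

import torch.nn as nn

class PoseSizeHeads(nn.Module):
    def __init__(self, dim):
        super().__init__()
        def head(out):   # small MLP head
            return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, out))
        self.rot_head, self.trans_head, self.size_head = head(9), head(3), head(3)

    def forward(self, obj_feat):      # (B, dim) pooled object pose feature
        return self.rot_head(obj_feat), self.trans_head(obj_feat), self.size_head(obj_feat)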
In one embodiment, the shape reconstruction and/or segmentation in step C2 based on the fusion feature information includes the following step C21:
step C21: based on the fusion characteristic information, performing shape reconstruction and/or segmentation to obtain reconstructed shape information and/or segmentation information.
In particular, for the proposed multi-scale features, a cascaded shape decoder may be used to achieve shape reconstruction and segmentation. The case shown in fig. 10, containing three scales, is described: the fused feature information of the three scales is used as the input of three shape decoders respectively. Shape decoder 1 takes only the fused feature information of scale 1 as input; the input of shape decoder 2 includes the output of shape decoder 1 and the fused feature information of scale 2; and the input of shape decoder 3 includes the output of shape decoder 2 and the fused feature information of scale 3. Shape decoder 3, as the last shape decoder, outputs the shape and segmentation result as the final network output.
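A sketch of this cascaded decoding, where decoder k receives the fused features of scale k together with the output of decoder k-1 (the concatenation scheme and names are assumptions):

import torch

def cascaded_shape_decoding(fused_per_scale, decoders):
    # fused_per_scale: fused feature tensors for scales 1..K
    # decoders: shape decoders 1..K
    prev = None
    for fused, decoder in zip(fused_per_scale, decoders):
        inp = fused if prev is None else torch.cat([fused, prev], dim=-1)
        prev = decoder(inp)
    return prev    # shape / segmentation output of the last decoder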
Optionally, as shown in fig. 3b, the embodiments of the present application propose a structure-aware attention fusion network for capturing the spatial dependencies and structural details between regions. Object shape reconstruction and segmentation, as an auxiliary task branch, may be used to guide the network in learning the internal structure of the object. In the main pose estimation task, pose estimation and size estimation may be performed based on the fused feature information and the two-dimensional feature information, as shown in fig. 3b.
In a possible embodiment, as shown in fig. 3c and 3d, the auxiliary task is designed to progressively increase the confidence of shape reconstruction and reduce shape deviation. By adding this branch, the multi-modal features learned by the backbone pose estimation network become more robust for understanding the shapes of objects within a class, and features that are more discriminative of object pose and size can be learned. Optionally, the branch corresponding to the shape decoder may be used selectively as an auxiliary task in the embodiments of the present application; for example, in scenarios where the object shape and segmentation result do not need to be output, the computation of this branch can be skipped during network inference to ensure the processing efficiency of the system.
As shown in fig. 3c and 3d, N denotes the number of ARF modules employed; in one embodiment, N may be 3. Instance segmentation can be realized through Mask R-CNN.
The ARF-Net proposed by the embodiments of the present application can be used for RGB-D-based class-level 6D pose estimation. Specifically, it includes a structure-aware fusion model for capturing spatial dependencies and structural details; it may also include an auxiliary task branch for shape reconstruction and image segmentation, to better guide the network in learning the internal structure of the object and to improve the accuracy and efficiency of network processing. In addition, ARF-Net can serve as an end-to-end attention fusion network for class-level 6D pose and size estimation.
An embodiment of the present application provides an image processing apparatus, as shown in fig. 12, the image processing apparatus 1200 may include: an acquisition module 1201, a fusion module 1202 and a processing module 1203.
The acquiring module 1201 is configured to acquire three-dimensional feature information and two-dimensional feature information based on the color image and the depth image; a fusion module 1202, configured to fuse the three-dimensional feature information and the two-dimensional feature information based on an attention mechanism, to obtain fused feature information; and the processing module 1203 is used for performing image processing based on the fusion characteristic information.
In an embodiment, when fusing the three-dimensional feature information and the two-dimensional feature information based on an attention mechanism to obtain fused feature information, the fusion module 1202 is specifically configured to:
fuse, based on the attention mechanism, the three-dimensional feature information of at least one scale and the two-dimensional feature information of at least one scale to obtain the fused feature information.
In an embodiment, when fusing the three-dimensional feature information and the two-dimensional feature information based on an attention mechanism to obtain fused feature information, the fusion module 1202 is specifically configured to:
perform the following operations for the three-dimensional feature information and the two-dimensional feature information of any scale: performing feature fusion on the three-dimensional feature information of the current scale and the two-dimensional feature information of the current scale based on an attention mechanism to obtain fused feature information of the current scale;
wherein the three-dimensional feature information of the current scale is determined according to the fused feature information of the previous scale and the three-dimensional feature information of the previous scale; and
the two-dimensional feature information of the current scale is determined according to the two-dimensional feature information of the previous scale. A sketch of this multi-scale recurrence is given below.
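The multi-scale recurrence described in this embodiment might be sketched as follows, under the simplifying assumptions that every scale uses the same feature dimension and number of elements and that a single cross-attention block stands in for the fusion step; the names ARFBlock, update3d and multi_scale_fusion are placeholders, not the patent's actual modules.

```python
import torch
import torch.nn as nn

class ARFBlock(nn.Module):
    """Placeholder attention-based fusion block: 3D (point) features attend to 2D (image) features."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat3d, feat2d):
        attended, _ = self.cross_attn(query=feat3d, key=feat2d, value=feat2d)
        return self.norm(feat3d + attended)

def multi_scale_fusion(feats3d, feats2d, blocks, update3d):
    """feats3d / feats2d: per-scale tensors of shape (B, N, C); blocks / update3d: one module per scale."""
    fused_all, prev_fused, prev_3d = [], None, None
    for k, (f3d, f2d) in enumerate(zip(feats3d, feats2d)):
        if prev_fused is not None:
            # Current-scale 3D features are derived from the previous scale's fused and 3D features.
            f3d = update3d[k](torch.cat([prev_fused, prev_3d], dim=-1))
        fused = blocks[k](f3d, f2d)
        fused_all.append(fused)
        prev_fused, prev_3d = fused, f3d
    return fused_all

# Example wiring for three scales with a shared feature dimension of 128.
dim, scales = 128, 3
blocks = nn.ModuleList([ARFBlock(dim) for _ in range(scales)])
update3d = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(scales)])
feats3d = [torch.rand(2, 256, dim) for _ in range(scales)]
feats2d = [torch.rand(2, 256, dim) for _ in range(scales)]
fused_per_scale = multi_scale_fusion(feats3d, feats2d, blocks, update3d)
```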
In an embodiment, when fusing the three-dimensional feature information and the two-dimensional feature information based on an attention mechanism to obtain fused feature information, the fusion module 1202 is specifically configured to:
acquire point cloud voxel feature information and/or voxel position feature information according to the three-dimensional feature information;
acquire first image voxel feature information according to the two-dimensional feature information; and
perform, based on an attention mechanism, feature fusion according to the point cloud voxel feature information, the voxel position feature information and/or the first image voxel feature information to obtain the fused feature information.
In an embodiment, when performing feature fusion based on the attention mechanism according to the point cloud voxel feature information, the voxel position feature information and/or the first image voxel feature information to obtain the fused feature information, the fusion module 1202 is specifically configured to perform one of the following:
performing, through a cross-attention mechanism, feature fusion on the first image voxel feature information and the feature information output after the voxel position feature information, the point cloud voxel feature information and the first image voxel feature information are processed based on a self-attention mechanism, to obtain the fused feature information;
performing, through a cross-attention mechanism, feature fusion on the first image voxel feature information and the feature information output after the point cloud voxel feature information is processed based on a self-attention mechanism, to obtain the fused feature information;
performing, through a self-attention mechanism, feature fusion on the first image voxel feature information and the feature information output after the point cloud voxel feature information is processed based on a cross-attention mechanism, to obtain the fused feature information; and
performing feature fusion on the feature information output after the first image voxel feature information is processed based on a self-attention mechanism and the feature information output after the point cloud voxel feature information and the first image voxel feature information are processed based on a cross-attention mechanism, to obtain the fused feature information. A sketch of the second of these variants is given after this paragraph.
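For concreteness, the second variant above (the point cloud voxel feature information refined by self-attention, then fused with the first image voxel feature information through cross-attention) might look roughly as follows; the feature dimensions, the residual/normalization scheme and the optional injection of voxel position features are assumptions for illustration, not the patent's actual design.

```python
import torch
import torch.nn as nn

class SelfThenCrossFusion(nn.Module):
    """Sketch of the second variant: self-attention over point cloud voxel features,
    then cross-attention with the image voxel features as the query."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_voxel_feat, pc_voxel_feat, voxel_pos_feat=None):
        if voxel_pos_feat is not None:
            # Assumed way of injecting voxel position feature information, for illustration only.
            pc_voxel_feat = pc_voxel_feat + voxel_pos_feat
        pc_refined, _ = self.self_attn(pc_voxel_feat, pc_voxel_feat, pc_voxel_feat)
        fused, _ = self.cross_attn(query=img_voxel_feat, key=pc_refined, value=pc_refined)
        return self.norm(img_voxel_feat + fused)

# Dummy usage: batch of 2, 512 voxels, 128 channels per voxel.
fusion = SelfThenCrossFusion(dim=128)
img_v, pc_v = torch.rand(2, 512, 128), torch.rand(2, 512, 128)
fused_feature_information = fusion(img_v, pc_v)   # shape (2, 512, 128)
```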
In an embodiment, when performing image processing based on the fused feature information, the processing module 1203 is specifically configured to perform at least one of the following:
performing pose estimation and/or size estimation based on the fused feature information; and
performing shape reconstruction and/or segmentation based on the fused feature information.
In an embodiment, when performing shape reconstruction and/or segmentation based on the fused feature information, the processing module 1203 is specifically configured to:
perform shape reconstruction and/or segmentation based on the fused feature information to obtain reconstructed shape information and/or segmentation information.
The apparatus of the embodiments of the present application may perform the method provided by the embodiments of the present application, and their implementation principles are similar. The actions performed by the modules of the apparatus in the embodiments of the present application correspond to the steps of the method in the embodiments of the present application; for the detailed functional description of each module of the apparatus, reference may be made to the corresponding method described above, which is not repeated here.
The embodiment of the application provides an electronic device, which includes a memory, a processor and a computer program stored on the memory; the processor executes the computer program to implement the steps of the image processing method. Compared with the prior art, the following can be realized: for an input image, three-dimensional feature information and two-dimensional feature information of at least one scale are first acquired based on the color image and the depth image of the input image, and the three-dimensional feature information and the two-dimensional feature information are then fused, using an attention mechanism, to obtain fused feature information; image processing is further performed based on the fused feature information. This scheme obtains multi-modal fused feature information through feature fusion, so that image processing is performed based on multi-modal information; compared with image processing based on single-modal information, this helps to improve the accuracy of image processing. In addition, in some specific scenarios, such as augmented reality applications, the scheme also helps to improve the perception of three-dimensional information, thereby improving the processing efficiency and robustness of the system.
In an alternative embodiment, an electronic device is provided. As shown in fig. 13, the electronic device 1300 includes a processor 1301 and a memory 1303, where the processor 1301 is connected to the memory 1303, for example via a bus 1302. Optionally, the electronic device 1300 may further include a transceiver 1304, which may be used for data interaction between this electronic device and other electronic devices, such as transmitting and/or receiving data. It should be noted that, in practical applications, the transceiver 1304 is not limited to one, and the structure of the electronic device 1300 does not constitute a limitation on the embodiments of the present application.
The processor 1301 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 1301 may also be a combination that implements computing functionality, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
The bus 1302 may include a path for transferring information between the above components. The bus 1302 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 1302 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 13, but this does not mean that there is only one bus or one type of bus.
The memory 1303 may be a ROM (Read Only Memory) or other type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium, another magnetic storage device, or any other medium that can be used to carry or store a computer program and that can be read by a computer, without limitation.
The memory 1303 is used to store the computer program for executing the embodiments of the present application, and its execution is controlled by the processor 1301. The processor 1301 is configured to execute the computer program stored in the memory 1303 to implement the steps shown in the foregoing method embodiments.
Electronic devices include, but are not limited to: smart phones, tablet computers, notebook computers, smart speakers, smart watches, vehicle-mounted devices, and the like.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, may implement the steps and corresponding content of the foregoing method embodiments.
The embodiments of the present application also provide a computer program product, which includes a computer program, where the computer program can implement the steps of the foregoing method embodiments and corresponding content when executed by a processor.
In the embodiments provided herein, the above-described pose estimation method in an electronic device may be performed using an artificial intelligence model.
According to embodiments of the present application, the method performed in the electronic device may obtain output data that identifies an image, or image features in an image, by using image data or video data as input data of an artificial intelligence model. The artificial intelligence model may be obtained through training. Here, "obtained through training" means that a basic artificial intelligence model is trained with a plurality of training data by a training algorithm to obtain a predefined operating rule or an artificial intelligence model configured to perform a desired feature (or purpose). The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and the computation of a layer is performed through a calculation between the computation result of the previous layer and the plurality of weight values.
Visual understanding is a technique for recognizing and processing things in a manner similar to human vision, and includes, for example, object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
The image processing apparatus provided by the present application may implement at least one of its plurality of modules through an AI model. The functions associated with AI may be performed by a non-volatile memory, a volatile memory, and a processor.
The processor may include one or more processors. The one or more processors may be general-purpose processors (e.g., central processing units (CPUs), application processors (APs), etc.), graphics-only processing units (e.g., graphics processing units (GPUs), vision processing units (VPUs)), and/or AI-specific processors (e.g., neural processing units (NPUs)).
The one or more processors control the processing of input data according to a predefined operating rule or an artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning.
Here, providing through learning means that a predefined operating rule or an AI model having desired characteristics is obtained by applying a learning algorithm to a plurality of learning data. The learning may be performed in the device itself in which the AI according to the embodiment is executed, and/or may be implemented by a separate server/system.
The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and the computation of one layer is performed based on the computation result of the previous layer and the plurality of weights of the current layer. Examples of neural networks include, but are not limited to, convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), restricted Boltzmann machines (RBMs), deep belief networks (DBNs), bidirectional recurrent deep neural networks (BRDNNs), generative adversarial networks (GANs), and deep Q-networks.
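As a minimal, generic illustration of the layer-by-layer computation just described (not tied to any particular model in this application), each layer combines the previous layer's result with its own weights:

```python
import torch

x = torch.rand(1, 16)                               # input, i.e. the "previous layer's result"
weights = [torch.rand(16, 32), torch.rand(32, 8)]   # one weight matrix per layer
for w in weights:
    x = torch.relu(x @ w)                           # current layer = f(previous result, current weights)
print(x.shape)                                      # torch.Size([1, 8])
```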
A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of such learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
It should be understood that, although the flowcharts of the embodiments of the present application indicate the respective operation steps by arrows, the order in which these steps are performed is not limited to the order indicated by the arrows. Unless explicitly stated herein, in some implementation scenarios of the embodiments of the present application, the steps in the flowcharts may be performed in other orders as required. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages depending on the actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same moment, or each of them may be performed at a different moment. Where the execution times differ, the execution order of the sub-steps or stages may be flexibly configured according to requirements, which is not limited in the embodiments of the present application.
The foregoing is merely an optional implementation of some implementation scenarios of the present application. It should be noted that, for those skilled in the art, other similar implementations based on the technical idea of the present application, adopted without departing from that technical idea, also fall within the protection scope of the embodiments of the present application.

Claims (11)

1. An image processing method, comprising:
acquiring three-dimensional feature information and two-dimensional feature information based on the color image and the depth image;
fusing the three-dimensional feature information and the two-dimensional feature information based on an attention mechanism to obtain fused feature information; and
performing image processing based on the fused feature information.
2. The method according to claim 1, wherein the fusing the three-dimensional feature information and the two-dimensional feature information based on the attention mechanism to obtain the fused feature information comprises:
fusing, based on the attention mechanism, the three-dimensional feature information of at least one scale and the two-dimensional feature information of at least one scale to obtain the fused feature information.
3. The method according to claim 2, wherein the fusing, based on the attention mechanism, the three-dimensional feature information of at least one scale and the two-dimensional feature information of at least one scale to obtain the fused feature information comprises:
performing the following operations for the three-dimensional feature information and the two-dimensional feature information of any scale:
performing feature fusion on the three-dimensional feature information of the current scale and the two-dimensional feature information of the current scale based on an attention mechanism to obtain fused feature information of the current scale, wherein the three-dimensional feature information of the current scale is determined according to the fused feature information of the previous scale and the three-dimensional feature information of the previous scale; and
the two-dimensional feature information of the current scale is determined according to the two-dimensional feature information of the previous scale.
4. The method according to claim 1, wherein the fusing the three-dimensional feature information and the two-dimensional feature information based on the attention mechanism to obtain the fused feature information comprises:
acquiring point cloud voxel feature information and/or voxel position feature information according to the three-dimensional feature information;
acquiring first image voxel feature information according to the two-dimensional feature information; and
performing, based on an attention mechanism, feature fusion according to the point cloud voxel feature information, the voxel position feature information and/or the first image voxel feature information to obtain the fused feature information.
5. The method according to claim 4, wherein the performing, based on the attention mechanism, feature fusion according to the point cloud voxel feature information, the voxel position feature information and/or the first image voxel feature information to obtain the fused feature information comprises one of the following:
performing, through a cross-attention mechanism, feature fusion on the first image voxel feature information and the feature information output after the voxel position feature information, the point cloud voxel feature information and the first image voxel feature information are processed based on a self-attention mechanism, to obtain the fused feature information;
performing, through a cross-attention mechanism, feature fusion on the first image voxel feature information and the feature information output after the point cloud voxel feature information is processed based on a self-attention mechanism, to obtain the fused feature information;
performing, through a self-attention mechanism, feature fusion on the first image voxel feature information and the feature information output after the point cloud voxel feature information is processed based on a cross-attention mechanism, to obtain the fused feature information; and
performing feature fusion on the feature information output after the first image voxel feature information is processed based on a self-attention mechanism and the feature information output after the point cloud voxel feature information and the first image voxel feature information are processed based on a cross-attention mechanism, to obtain the fused feature information.
6. The method of claim 1, wherein the image processing based on the fused feature information comprises at least one of the following:
performing pose estimation and/or size estimation based on the fused feature information; and
performing shape reconstruction and/or segmentation based on the fused feature information.
7. The method of claim 6, wherein the performing shape reconstruction and/or segmentation based on the fused feature information comprises:
performing shape reconstruction and/or segmentation based on the fused feature information to obtain shape information and/or segmentation information.
8. An image processing apparatus, comprising:
an acquisition module, configured to acquire three-dimensional feature information and two-dimensional feature information based on the color image and the depth image;
a fusion module, configured to fuse the three-dimensional feature information and the two-dimensional feature information based on an attention mechanism to obtain fused feature information; and
a processing module, configured to perform image processing based on the fused feature information.
9. An electronic device, the electronic device comprising:
one or more processors;
a memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs configured to: performing the method according to any one of claims 1 to 7.
10. A computer readable storage medium for storing computer instructions which, when run on a computer, cause the computer to perform the method of any one of the preceding claims 1 to 7.
11. A computer program product comprising a computer program or instructions which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202111348242.6A 2021-11-15 2021-11-15 Image processing method and related equipment Pending CN116152334A (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN202111348242.6A CN116152334A (en) 2021-11-15 2021-11-15 Image processing method and related equipment
KR1020220111206A KR20230071052A (en) 2021-11-15 2022-09-02 Apparatus and method for image processing
EP22207194.6A EP4181079A1 (en) 2021-11-15 2022-11-14 Method and apparatus with multi-modal feature fusion
JP2022182118A JP2023073231A (en) 2021-11-15 2022-11-14 Method and device for image processing
US17/987,060 US20230154170A1 (en) 2021-11-15 2022-11-15 Method and apparatus with multi-modal feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111348242.6A CN116152334A (en) 2021-11-15 2021-11-15 Image processing method and related equipment

Publications (1)

Publication Number Publication Date
CN116152334A true CN116152334A (en) 2023-05-23

Family

ID=86351076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111348242.6A Pending CN116152334A (en) 2021-11-15 2021-11-15 Image processing method and related equipment

Country Status (2)

Country Link
KR (1) KR20230071052A (en)
CN (1) CN116152334A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117853695A (en) * 2024-03-07 2024-04-09 成都信息工程大学 3D perception image synthesis method and device based on local spatial self-attention
CN117853695B (en) * 2024-03-07 2024-05-03 成都信息工程大学 3D perception image synthesis method and device based on local spatial self-attention

Also Published As

Publication number Publication date
KR20230071052A (en) 2023-05-23


Legal Events

Date Code Title Description
PB01 Publication