CN114550162A - Three-dimensional object identification method combining view importance network and self-attention mechanism

Info

Publication number: CN114550162A
Application number: CN202210143670.3A
Authority: CN
Inventors: 马伟; 徐儒常
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2022-02-16
Filing date: 2022-02-16
Publication date: 2022-05-27
Anticipated expiration: 2042-02-16
Also published as: CN114550162B

Abstract

The invention discloses a three-dimensional object recognition method combining a view importance network and a self-attention mechanism. The method comprises the following steps: projecting a three-dimensional object to be recognized from n different viewing angles to obtain n different two-dimensional views, where n is greater than or equal to two; extracting features of the n views with a base CNN model to obtain a feature map for each view; using the view importance network to judge how important each of the n views is to three-dimensional object recognition, and strengthening the features to different degrees according to that importance to obtain view-enhanced feature maps; processing the view-enhanced feature maps with a self-attention mechanism to obtain a three-dimensional shape descriptor; and inputting the three-dimensional shape descriptor into a fully connected network to perform multi-view object recognition, thereby realizing three-dimensional object recognition. The method highlights the important views that benefit three-dimensional object recognition, suppresses the interference of unimportant views, and improves recognition accuracy.

Description

Three-dimensional object identification method combining view importance network and self-attention mechanism

Technical Field

The invention belongs to the technical field of computer vision, and relates to a three-dimensional object identification method combining a view importance network and a self-attention mechanism.

Background

With the development of indoor robots and computer vision in recent years, it has become practical for indoor robots to actively find and grab objects indoors for people, and accurately recognizing three-dimensional objects is one of the basic problems in this field. The open-source release of the Princeton University ModelNet project has provided researchers with a comprehensive and clean set of three-dimensional object models, and a variety of three-dimensional object recognition methods have since been developed. These methods can be divided into three categories according to the type of input data: point-cloud-based three-dimensional object recognition, voxel-based three-dimensional object recognition, and multi-view-based three-dimensional object recognition.

Point-cloud-based three-dimensional object recognition methods generally apply convolution directly to the unordered point cloud collected by a data acquisition device to obtain the category of the three-dimensional object; voxel-based methods generally partition the unordered point cloud into blocks to form voxel data and then obtain the category of the object by convolution. Both approaches suffer from expensive data acquisition equipment, high data dimensionality, and high processing cost, and are therefore difficult to apply widely in daily life. Multi-view-based methods have attracted more attention because their data are easy to obtain and convenient to process. Because their CNN models can be pre-trained on large-scale datasets such as ImageNet, multi-view-based three-dimensional object recognition methods achieve the best recognition results and have become the mainstream approach.

Multi-view-based three-dimensional object recognition methods generally render a three-dimensional object model from multiple viewing angles to obtain multiple views of the object to be recognized, and then classify those views with a convolutional network. For example, Su et al. proposed a multi-view-based three-dimensional object recognition method named MVCNN, which outperforms most point-cloud-based and voxel-based methods. However, MVCNN uses max pooling, which discards most of the view information of the three-dimensional object, so multi-view-based three-dimensional object recognition needs further study.
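The information loss attributed to max pooling can be illustrated with a small sketch. The per-view descriptors below are made-up toy values, and the code shows only the element-wise view pooling step, not MVCNN itself.

```python
import numpy as np

# Toy per-view descriptors: 3 views, 4 feature channels (values are made up).
views = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.2, 0.8, 0.1, 0.0],
    [0.1, 0.2, 0.3, 0.1],
])

# Element-wise max pooling over the view axis keeps one value per channel.
pooled = views.max(axis=0)

# For each channel, the index of the view whose value survived.
winner = views.argmax(axis=0)

# Fraction of the input activations that reach the pooled descriptor.
kept_fraction = pooled.size / views.size

print(pooled)         # [0.9 0.8 0.3 0.1]
print(winner)         # [0 1 2 2]
print(kept_fraction)  # one third here; 1/n for n views in general
```

With n views, only one activation in n survives per channel, which is the loss that the weighting and attention steps of the invention aim to avoid.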

Disclosure of Invention

To overcome the shortcomings of existing multi-view-based three-dimensional object recognition methods, the invention provides a three-dimensional object recognition method combining a view importance network and a self-attention mechanism. The method first computes an importance score for each of the multiple views through the view importance network and enhances the features of each view to a degree determined by its score, thereby strengthening the expression of the views that benefit three-dimensional object recognition. It then fuses non-local information among the different views through the self-attention mechanism to further enhance the multi-view feature expression. Experimental results show that recognition and classification using the enhanced multi-view features effectively improves accuracy, which demonstrates the good performance of the method.

To achieve this aim, the technical scheme of the invention is as follows: step 1, projecting a three-dimensional object to be recognized from n different viewing angles to obtain n different two-dimensional views, where n is greater than or equal to two; step 2, extracting features of the n views with a base CNN model to obtain a feature map for each view; step 3, outputting importance scores of the n views for three-dimensional object recognition through a view importance network, where a higher score indicates that the view contains richer key information for object recognition, and strengthening the features according to the importance scores to obtain view-enhanced feature maps; step 4, processing the view-enhanced feature maps with a self-attention mechanism to obtain cross-view enhanced feature maps; and step 5, inputting the three-dimensional shape descriptor derived from the cross-view enhanced feature maps into a fully connected network to perform multi-view object recognition, thereby realizing three-dimensional object recognition.

The three-dimensional object identification method combining the view importance network and the self-attention mechanism provided by the invention can also have the following characteristics: wherein, step 1 includes: projecting the three-dimensional object model from n viewing angles to acquire n rendered views $V = \{v_1, v_2, \dots, v_n\}$ of the object, where $v_i$ is the i-th view of the object.

The three-dimensional object identification method combining the view importance network and the self-attention mechanism provided by the invention can also have the following characteristics: wherein, step 2 includes: passing the rendered views $V = \{v_1, v_2, \dots, v_n\}$ through a base CNN model to extract the initial visual feature maps $Z = \{z_1, z_2, \dots, z_n\}$ of the n views, where $z_i$ is the initial visual feature map of the i-th view of the object, $z_i \in \mathbb{R}^{C \times H \times W}$ and $Z \in \mathbb{R}^{n \times C \times H \times W}$; n represents the number of views, C represents the number of channels of each visual feature map, H represents the height of each visual feature map, and W represents the width of each visual feature map.

The three-dimensional object identification method combining the view importance network and the self-attention mechanism provided by the invention can also have the following characteristics: wherein, step 3 includes: the initial visual feature maps $Z = \{z_1, z_2, \dots, z_n\}$ of the n views are input into the view importance network, which scores each view as in formula (1),

$$\mathrm{Score} = \mathrm{Softmax}\{f(z_1), f(z_2), \dots, f(z_n)\} \qquad (1)$$

in formula (1), f represents a network layer that scores the importance of a view; through training, this layer learns to score a view according to the information richness of its features, so that features carrying rich visual information are highlighted; the Softmax function ensures that the importance scores of the views sum to 1 and prevents excessively large differences between the scores; the initial feature map of each view is multiplied by its importance score and added to the initial feature map, as in formula (2),

$$p_i = z_i + \mathrm{Score}_i \cdot z_i \qquad (2)$$

in formula (2), $z_i$ is the initial visual feature map of the i-th view of the object, and $\mathrm{Score}_i$ is the score that the view importance network assigns to the importance of the i-th view. Multiplying the initial feature map of each view by its importance score and adding the initial feature map yields the n view-enhanced feature maps $P = \{p_1, p_2, \dots, p_n\}$ of the three-dimensional object, with $p_i \in \mathbb{R}^{C \times H \times W}$ and $P \in \mathbb{R}^{n \times C \times H \times W}$.
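Formulas (1) and (2) can be sketched in NumPy as follows. The source describes f only as a network layer, so the choice here of global average pooling followed by a single linear unit (hypothetical weights `w`, bias `b`) is an assumption made purely for illustration, as are the tensor sizes.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def view_importance(Z, w, b=0.0):
    """Apply formulas (1) and (2) to feature maps Z of shape (n, C, H, W).

    ASSUMPTION: f is global average pooling followed by one linear unit
    with weights w of shape (C,) and bias b. The source does not define f.
    """
    pooled = Z.mean(axis=(2, 3))              # (n, C) per-view channel means
    logits = pooled @ w + b                   # (n,)   f(z_i) for every view
    score = softmax(logits)                   # formula (1): scores sum to 1
    P = Z + score[:, None, None, None] * Z    # formula (2): residual weighting
    return score, P

rng = np.random.default_rng(0)
n, C, H, W = 12, 8, 7, 7                      # illustrative sizes only
Z = rng.standard_normal((n, C, H, W))
w = rng.standard_normal(C)

score, P = view_importance(Z, w)
print(score.sum())   # approximately 1.0
print(P.shape)       # (12, 8, 7, 7)
```

Because of the residual term, no view is ever scaled below its original magnitude; the score only decides how much extra emphasis a view receives.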

The three-dimensional object identification method combining the view importance network and the self-attention mechanism provided by the invention can also have the following characteristics: wherein, step 4 comprises the following substeps:

step 4-1, the view-enhanced feature maps $P = \{p_1, p_2, \dots, p_n\}$ are input into three separate convolutional networks to generate the new feature mappings $P_q$, $P_k$ and $P_v$, with $P_q, P_k, P_v \in \mathbb{R}^{n \times C \times H \times W}$. $P_k$ is transposed and matrix-multiplied with $P_q$ to obtain the spatial correlation between the feature maps, as in formula (3),

in formula (3), S represents the similarity, i and m are view indices with $i, m \in [1, n]$, n is the number of views, and $L = W^2$ represents all spatial positions in a single view feature map. In summary, $S_{im}$ captures the relationship between the view-enhanced feature at any spatial position in any view and the features at every spatial position in all views; the stronger the correlation, the larger the corresponding weight in the matrix.

Step 4-2, $S_{im}$ is matrix-multiplied with $P_v$ to obtain the cross-view enhanced feature maps $A = \{a_1, a_2, \dots, a_n\}$, with $a_i \in \mathbb{R}^{C \times H \times W}$ and $A \in \mathbb{R}^{n \times C \times H \times W}$. The self-attention mechanism breaks the locality of the features and realizes non-local feature enhancement across views, so that the feature representation at any spatial position of any view becomes richer and the expression of the view features is effectively enhanced.
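Steps 4-1 and 4-2 can be sketched as below. Formula (3) itself does not appear in this text, so the row-wise softmax normalization of the similarity matrix is an assumption, as is modelling each of the three convolutional networks as a single 1 × 1 convolution that keeps C channels. The names `conv1x1`, `cross_view_attention`, `Wq`, `Wk` and `Wv` are hypothetical.

```python
import numpy as np

def conv1x1(X, Wt):
    # A 1x1 convolution mixes channels independently at every position.
    # X: (n, C, H, W), Wt: (C_out, C) -> (n, C_out, H, W)
    return np.einsum("oc,nchw->nohw", Wt, X)

def cross_view_attention(P, Wq, Wk, Wv):
    n, C, H, W = P.shape

    def tokens(X):
        # (n, C, H, W) -> (n*H*W, C): one row per spatial position per view.
        return X.transpose(0, 2, 3, 1).reshape(n * H * W, -1)

    Q, K, V = (tokens(conv1x1(P, Wm)) for Wm in (Wq, Wk, Wv))

    # Similarity between every position of every view and all others.
    logits = Q @ K.T                               # (n*H*W, n*H*W)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    S = np.exp(logits)
    S /= S.sum(axis=1, keepdims=True)              # ASSUMED softmax over rows

    # Weighted sum of value features, reshaped back to (n, C, H, W).
    A = (S @ V).reshape(n, H, W, C).transpose(0, 3, 1, 2)
    return S, A

rng = np.random.default_rng(0)
n, C, H, W = 3, 4, 2, 2                            # illustrative sizes only
P = rng.standard_normal((n, C, H, W))
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))

S, A = cross_view_attention(P, Wq, Wk, Wv)
print(S.shape, A.shape)   # (12, 12) (3, 4, 2, 2)
```

Flattening the view and spatial axes into one token axis is what lets a position in one view attend to positions in every other view, which matches the non-local, cross-view behaviour described above.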

The three-dimensional object identification method combining the view importance network and the self-attention mechanism provided by the invention can also have the following characteristics: wherein, step 5 includes:

the cross-view enhanced feature maps $A = \{a_1, a_2, \dots, a_n\}$ are reduced in dimension by a 1 × 1 convolution, where the 1 × 1 convolution extracts features across views and avoids the information loss caused by max pooling. The dimension-reduced features are input into the fully connected layer for classification, thereby realizing recognition of the three-dimensional object.
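A minimal sketch of step 5 follows. The source does not state the tensor layout of the 1 × 1 convolution, so this example assumes that the n views act as input channels and are reduced to a single output channel, which amounts to a learned weighted sum over views. The weights `view_w`, `fc_w` and `fc_b` are random stand-ins for trained parameters.

```python
import numpy as np

def fuse_views(A, view_w):
    # ASSUMPTION: the 1x1 convolution treats the n views as input channels
    # and produces one output channel (a learned weighted sum over views).
    # A: (n, C, H, W), view_w: (n,) -> (C, H, W)
    return np.tensordot(view_w, A, axes=(0, 0))

def classify(A, view_w, fc_w, fc_b):
    # Flatten the fused map into a shape descriptor, then apply one
    # fully connected layer and pick the highest-scoring class.
    descriptor = fuse_views(A, view_w).reshape(-1)    # (C*H*W,)
    logits = fc_w @ descriptor + fc_b                 # (num_classes,)
    return descriptor, int(np.argmax(logits))

rng = np.random.default_rng(0)
n, C, H, W, num_classes = 12, 8, 7, 7, 40             # illustrative sizes
A = rng.standard_normal((n, C, H, W))
view_w = np.full(n, 1.0 / n)                          # uniform weights as a stand-in
fc_w = rng.standard_normal((num_classes, C * H * W))
fc_b = np.zeros(num_classes)

descriptor, label = classify(A, view_w, fc_w, fc_b)
print(descriptor.shape)   # (392,)
```

Unlike max pooling, every view contributes to every element of the descriptor, with the contribution controlled by the learned view weights.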

Advantageous effects

1) Different views are weighted through the view importance network, so that the views that benefit three-dimensional object recognition are highlighted while the expression of unimportant views is suppressed; 2) non-local information among different views is fused through the self-attention mechanism to achieve cross-view spatial feature enhancement, which further enhances the multi-view feature expression; 3) a 1 × 1 convolution replaces the max pooling operation for feature dimension reduction, which avoids the drop in recognition accuracy caused by information loss.

Drawings

FIG. 1 is a schematic diagram of a network framework for the method of the present invention;

FIG. 2 is an example experimental result of a view importance network proposed by the present invention;

FIG. 3 is a schematic diagram of a self-attention mechanism in an embodiment of the present invention;

Detailed Description

The method is implemented with the open-source deep learning tool PyTorch, and the network model is trained on an NVIDIA GeForce RTX 3090 GPU.

The modules of the method are further described below with reference to the accompanying drawings and the detailed description. It is to be understood that the detailed description is provided for illustration only and does not limit the scope of the invention, which is defined by the appended claims.

The composition and flow of the network framework of the invention are shown in fig. 1, and the invention specifically comprises the following steps:

Step 1, the three-dimensional object model is projected from n viewing angles to acquire n rendered views $V = \{v_1, v_2, \dots, v_n\}$ of the object, where $v_i$ is the i-th view of the object. In this experiment n is set to 12, i.e., 12 views are used for three-dimensional object recognition.
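The source fixes only the number of views (n = 12) and does not describe the camera layout. As a sketch under that caveat, the code below assumes the layout used by MVCNN: cameras spaced evenly around the object every 30 degrees, elevated 30 degrees above the ground plane, and pointing at the object center. The function name and parameters are hypothetical.

```python
import math

def camera_positions(n=12, elevation_deg=30.0, radius=1.0):
    """Return n camera positions on a circle around the origin.

    ASSUMPTION: evenly spaced azimuths and a fixed elevation; the source
    does not specify how its 12 viewing angles are chosen.
    """
    elev = math.radians(elevation_deg)
    positions = []
    for i in range(n):
        azim = 2.0 * math.pi * i / n          # 30-degree steps when n = 12
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)
        positions.append((x, y, z))
    return positions

cams = camera_positions()
print(len(cams))   # 12
```

Each position would then be handed to a renderer of choice to produce the view $v_i$; the rendering step itself is outside this sketch.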

Step 2, the rendered views $V = \{v_1, v_2, \dots, v_n\}$ are passed through a base CNN model to extract the initial visual feature maps $Z = \{z_1, z_2, \dots, z_n\}$ of the n views, where $z_i$ is the initial visual feature map of the i-th view of the object. Specifically, a VGG network pre-trained for object recognition on single images is adopted; its last fully connected layer is removed and the remaining layers are retained to extract the initial visual feature maps.

Step 3, the initial visual feature maps $Z = \{z_1, z_2, \dots, z_n\}$ of the n views are input into the view importance network, which scores each view as in formula (1),

$$\mathrm{Score} = \mathrm{Softmax}\{f(z_1), f(z_2), \dots, f(z_n)\} \qquad (1)$$

in formula (1), f represents a network layer that scores the importance of a view; through training, this layer learns to score a view according to the information richness of its features, so that features carrying rich visual information are highlighted; the Softmax function ensures that the importance scores of the views sum to 1 and prevents excessively large differences between the scores; the initial feature map of each view is multiplied by its importance score and added to the initial feature map, as in formula (2),

$$p_i = z_i + \mathrm{Score}_i \cdot z_i \qquad (2)$$

in formula (2), $z_i$ is the initial visual feature map of the i-th view of the object, and $\mathrm{Score}_i$ is the score that the view importance network assigns to the importance of the i-th view. Multiplying the initial feature map of each view by its importance score and adding the initial feature map yields the n view-enhanced feature maps $P = \{p_1, p_2, \dots, p_n\}$ of the three-dimensional object, with $p_i \in \mathbb{R}^{C \times H \times W}$ and $P \in \mathbb{R}^{n \times C \times H \times W}$.
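Formula (2) can also be read as a per-view rescaling. The short derivation below uses only formulas (1) and (2); the numeric cases at the end are illustrative values, not results from the source.

```latex
% Factor z_i out of formula (2):
p_i = z_i + \mathrm{Score}_i \cdot z_i = (1 + \mathrm{Score}_i)\, z_i

% Softmax over n >= 2 views gives 0 < Score_i < 1 and \sum_i Score_i = 1, so
1 < 1 + \mathrm{Score}_i < 2,
\qquad
\sum_{i=1}^{n} (1 + \mathrm{Score}_i) = n + 1

% Illustrative cases for n = 12:
% uniform scores:        Score_i = 1/12  =>  scale factor 1 + 1/12 \approx 1.083
% one dominant view:     Score_i = 0.5   =>  scale factor 1.5
```

Every view therefore keeps at least its original features, and the total amount of extra emphasis distributed across the n views is fixed at one unit.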

Fig. 2 shows a sample of the importance scores that the view importance network assigns to twelve different views rendered from a three-dimensional airplane model. The view importance network directs more attention to views that benefit three-dimensional object recognition and contain rich information about the object, while assigning lower weights to views that lack salient features of the object, thereby reducing interference.

Step 4 comprises the following substeps:

step 4-1, the view-enhanced feature maps $P = \{p_1, p_2, \dots, p_n\}$ are input into three separate convolutional networks to generate the new feature mappings $P_q$, $P_k$ and $P_v$, with $P_q, P_k, P_v \in \mathbb{R}^{n \times C \times H \times W}$. $P_k$ is transposed and matrix-multiplied with $P_q$ to obtain the spatial correlation between the feature maps, as in formula (3),

Step 4-2, $S_{im}$ is matrix-multiplied with $P_v$ to obtain the cross-view enhanced feature maps $A = \{a_1, a_2, \dots, a_n\}$, with $a_i \in \mathbb{R}^{C \times H \times W}$ and $A \in \mathbb{R}^{n \times C \times H \times W}$. The self-attention mechanism breaks the locality of the features and realizes non-local feature enhancement across views, so that the feature representation at any spatial position of any view becomes richer and the expression of the view features is effectively enhanced.

Fig. 3 shows a sample of the cross-view non-local feature enhancement performed by the self-attention mechanism. The features of the n input views are fed into three convolutional layer networks, including θ and g, which perform feature mapping to obtain $P_q$, $P_k$ and $P_v$ respectively. $P_q$ is matrix-multiplied with the transposed $P_k$ to obtain a similarity matrix, which contains the relationship between the feature at each spatial position and the features at all other spatial positions. Multiplying the similarity matrix by $P_v$ realizes non-local feature enhancement across views and outputs the features of the n views.

Step 5, the cross-view enhanced feature maps $A = \{a_1, a_2, \dots, a_n\}$ are reduced in dimension by a 1 × 1 convolution, where the 1 × 1 convolution extracts features across views and avoids the information loss caused by max pooling. The dimension-reduced features are input into the fully connected layer for classification, thereby realizing recognition of the three-dimensional object.

In this embodiment, comparison experiments are also performed on the three-dimensional object recognition method combining the view importance network and the self-attention mechanism to evaluate its classification and recognition performance. We selected Princeton University's ModelNet40 dataset, which is commonly used for three-dimensional object recognition, for experiments and evaluation. ModelNet40 contains 12,311 three-dimensional object models in 40 categories, of which 9,843 form the training set and 2,468 form the test set. Because the number of samples differs between classes, we follow other works and report two metrics, average instance accuracy (Inst Acc) and average class accuracy (Class Acc): Inst Acc is the percentage of correct predictions over all samples, and Class Acc is the mean of the per-class accuracies.
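The difference between the two metrics can be made concrete with a small sketch. The labels and predictions below are made-up toy data, not experimental results, and the function name is hypothetical.

```python
from collections import defaultdict

def instance_and_class_accuracy(y_true, y_pred):
    """Return (Inst Acc, Class Acc) for two equally long label sequences."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    # Inst Acc: correct predictions over all samples.
    inst_acc = sum(correct.values()) / len(y_true)
    # Class Acc: unweighted mean of the per-class accuracies.
    class_acc = sum(correct[c] / total[c] for c in total) / len(total)
    return inst_acc, class_acc

# Imbalanced toy data: 8 chairs (all correct), 2 bowls (1 correct).
y_true = ["chair"] * 8 + ["bowl"] * 2
y_pred = ["chair"] * 8 + ["chair", "bowl"]

inst_acc, class_acc = instance_and_class_accuracy(y_true, y_pred)
print(inst_acc)    # 0.9
print(class_acc)   # 0.75
```

The large class dominates Inst Acc, while Class Acc exposes the weaker result on the small class, which is why both are reported for an imbalanced dataset such as ModelNet40.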

Claims

1. A three-dimensional object identification method combining a view importance network and a self-attention mechanism is characterized in that:

the step 1 comprises the following steps: projecting a three-dimensional object model from n viewing angles to acquire n rendered views $V = \{v_1, v_2, \dots, v_n\}$ of the object, where $v_i$ is the i-th view of the object;

the step 2 comprises the following steps: passing the rendered views $V = \{v_1, v_2, \dots, v_n\}$ through a base CNN model to extract the initial visual feature maps $Z = \{z_1, z_2, \dots, z_n\}$ of the n views, where $z_i$ is the initial visual feature map of the i-th view of the object, $z_i \in \mathbb{R}^{C \times H \times W}$ and $Z \in \mathbb{R}^{n \times C \times H \times W}$, wherein n represents the number of views, C represents the number of channels of each visual feature map, H represents the height of each visual feature map, and W represents the width of each visual feature map;

the step 3 comprises the following steps: the initial visual feature maps $Z = \{z_1, z_2, \dots, z_n\}$ of the n views are input into the view importance network, which scores each view as in formula (1),

$$\mathrm{Score} = \mathrm{Softmax}\{f(z_1), f(z_2), \dots, f(z_n)\} \qquad (1)$$

in formula (1), f represents a network layer that scores the importance of a view; the Softmax function ensures that the importance scores of the views sum to 1 and prevents excessively large differences between the scores; the initial feature map of each view is multiplied by its importance score and added to the initial feature map, as in formula (2),

$$p_i = z_i + \mathrm{Score}_i \cdot z_i \qquad (2)$$

in formula (2), $z_i$ is the initial visual feature map of the i-th view of the object, and $\mathrm{Score}_i$ is the score that the view importance network assigns to the importance of the i-th view; multiplying the initial feature map of each view by its importance score and adding the initial feature map yields the n view-enhanced feature maps $P = \{p_1, p_2, \dots, p_n\}$ of the three-dimensional object, with $p_i \in \mathbb{R}^{C \times H \times W}$ and $P \in \mathbb{R}^{n \times C \times H \times W}$;

Step 4 comprises the following substeps:

step 4-1, the view-enhanced feature maps $P = \{p_1, p_2, \dots, p_n\}$ are input into three separate convolutional networks to generate the new feature mappings $P_q$, $P_k$ and $P_v$, with $P_q, P_k, P_v \in \mathbb{R}^{n \times C \times H \times W}$; $P_k$ is transposed and matrix-multiplied with $P_q$ to obtain the spatial correlation between the feature maps, as in formula (3),

in formula (3), S represents the similarity, i and m are view indices with $i, m \in [1, n]$, n is the number of views, and $L = W^2$ represents all spatial positions in a single view feature map;

step 4-2, $S_{im}$ is matrix-multiplied with $P_v$ to obtain the cross-view enhanced feature maps $A = \{a_1, a_2, \dots, a_n\}$, with $a_i \in \mathbb{R}^{C \times H \times W}$ and $A \in \mathbb{R}^{n \times C \times H \times W}$; through the self-attention mechanism, the locality of the features is broken and non-local feature enhancement across views is realized;

the step 5 comprises the following steps:

the cross-view enhanced feature maps $A = \{a_1, a_2, \dots, a_n\}$ are reduced in dimension by a 1 × 1 convolution, wherein the 1 × 1 convolution extracts features across views, and the dimension-reduced features are input into a fully connected layer for classification, thereby realizing recognition of the three-dimensional object.