CN115601745A - Multi-view three-dimensional object identification method facing application end
- Publication number
- CN115601745A (application number CN202211102704.0A)
- Authority
- CN
- China
- Legal status: Pending
Classifications
- G06V 20/64 — Scenes; scene-specific elements; type of objects; three-dimensional objects
- G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V 10/774 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V 10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V 10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention provides a multi-view three-dimensional object identification method oriented to the application end, and relates to the technical field of three-dimensional object identification. The method first groups the multi-view features on the basis of feature differences, so that features likely to come from similar viewing angles fall into the same group and are fused into several group features; divergent multi-view data are thereby converted into similar intermediate-layer features. A teacher model is then trained with the complete view set, and knowledge distillation is used so that the teacher model, trained on the complete multi-view data set, guides the training of a student model; the student model thus acquires the ability to cope with few views and uncertain viewing angles. In practical application tasks, the student model obtains a good three-dimensional recognition result for arbitrary multi-view input, requires only view information, and is lightweight. The invention helps to resolve the degradation of three-dimensional recognition caused by small view counts and insufficient information in practical applications.
Description
Technical Field
The invention relates to the technical field of three-dimensional object identification, and in particular to a multi-view three-dimensional object identification method oriented to the application end.
Background
In recent years, with vigorous development in fields such as intelligent robots, automatic driving, virtual reality and medical imaging, three-dimensional object recognition has become a new research focus. In the deep-learning era, deep neural networks of various kinds are widely applied to three-dimensional object recognition; among the many approaches, multi-view-based methods have attracted the most attention because their data are easy to acquire and convenient to process. With a CNN model pre-trained on a large-scale data set such as ImageNet, multi-view three-dimensional object recognition leads in recognition accuracy and has become the current mainstream approach.
MVCNN (Multi-view CNN) combines multiple two-dimensional projection features learned by a convolutional neural network (CNN) in an end-to-end trainable manner; the method became a milestone of three-dimensional shape recognition and achieved the then state-of-the-art performance. Since the birth of MVCNN, many multi-view three-dimensional recognition methods have appeared, and this line of research mainly focuses on how to perform efficient feature fusion or reduce information redundancy so as to improve recognition accuracy. However, an important factor affecting the three-dimensional recognition result is often ignored by researchers: the reliability of the data set. At present, a multi-view data set is mainly obtained from a known three-dimensional object by rendering single views in turn from a number of preset viewing angles according to a fixed rule. In a real scene, however, owing to occlusion, uncertain viewpoint positions and other factors, the multi-view data actually obtained are often far from this ideal situation. At the application end, because of equipment limitations and the requirements of the specific scene, multi-view data frequently suffer from a small number of views, uncertain viewing angles and similar problems, all of which can seriously degrade the three-dimensional recognition accuracy of an object.
Many current methods have attempted to solve the above problems; although some results have been achieved, each still has its own drawbacks. For example, MVCNN can fuse the features of an arbitrary set of views by maximum pooling: the prior art discloses a three-dimensional model identification method based on visual-saliency sharing, which first obtains the three-dimensional model to be retrieved, derives a two-dimensional view sequence from it and extracts the visual feature vectors of that sequence; the visual features are then fed into an MVCNN branch and a visual-saliency branch, and the complex features of the MVCNN branch are fused with the saliency features of the visual-saliency branch to form fused features; finally, the three-dimensional model is retrieved or classified through the fused features. However, a large amount of information is lost in the pooling, so the effect is poor. The RotationNet method, in turn, requires the viewpoint information of the camera to complete three-dimensional recognition, which departs seriously from actual requirements. A three-dimensional recognition method in practical application must therefore fulfil two requirements: 1. it obtains good results from arbitrary multi-view data; 2. it needs no information other than the view data. At the application end, these two requirements must be met simultaneously.
Disclosure of Invention
In order to solve the problems of a small number of views and uncertain viewing angles of the object to be recognized in practical three-dimensional object recognition scenarios, the invention provides an application-end-oriented multi-view three-dimensional object recognition method; it resolves the poor machine-learning training effect caused by the input-data differences of real scenes, requires only view information, and is lightweight.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
an application-end-oriented multi-view three-dimensional object recognition method comprises the following steps:
s1, extracting all view features of a multi-view data set for each three-dimensional object, and selecting a plurality of features from the multi-view features as group representative features based on feature difference distance measurement;
s2, dividing the rest of characteristics except the group representative characteristics into groups where the group representative characteristics closest to the group representative characteristics are located, and performing characteristic fusion on all characteristics in each group to obtain a plurality of group fusion characteristics;
s3, inputting the group fusion characteristics into a graph convolution network, carrying out global transmission of local information to obtain graph convolution characteristics with global information, and fusing all the graph convolution characteristics into a three-dimensional characteristic descriptor;
s4, constructing a teacher model and a student model with the functions of the steps S1-S3, training the teacher model by using the complete multi-view data set, guiding and training the student model by using the defect multi-view data set and the trained teacher model, and obtaining a trained student model;
and S5, inputting any multi-view data of the actual to-be-recognized three-dimensional object of the application end into the trained student model to obtain a three-dimensional recognition result.
This technical scheme groups the multi-view features by their feature differences, so that features likely to come from similar viewing angles fall into one group and are fused into several group features; multi-view data with large differences are thus converted into similar intermediate-layer features, which alleviates the poor training effect caused by differences in the input data. Feature fusion of the group features is then performed through a graph convolution network to form the object descriptor used for three-dimensional identification. To address the poor three-dimensional recognition caused by few views and insufficient information, knowledge distillation is used: a teacher model trained on complete multi-view data guides the training of a student model, and the student model simulates the real task by training on arbitrary multi-view data from that task. Finally, for any multi-view input the student model obtains a good three-dimensional recognition result, needs only view information, and is lightweight.
Preferably, in step S1, all the angular views of all the three-dimensional objects form the multi-view data set, and all the angular views of each three-dimensional object are input to a pre-trained feature extraction network to obtain all the single-view image features of the multi-view data set.

Preferably, for each three-dimensional object, all of its single-view image features are combined into a multi-view feature F, expressed as F = {f_1, f_2, ..., f_N}, where N represents the number of single-view image features of the three-dimensional object and f_i represents the i-th single-view image feature in the multi-view feature F, i = 1, 2, ..., N.
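As an illustration of this step, a minimal sketch of per-view feature extraction follows. The choice of a torchvision ResNet-18 backbone and its 512-dimensional output are assumptions made for the example; the patent only requires some pre-trained feature extraction network.

```python
import torch
import torchvision.models as models

def extract_multiview_features(views: torch.Tensor) -> torch.Tensor:
    """Map N rendered views (N, 3, 224, 224) to the multi-view feature set
    F = {f_1, ..., f_N}, one feature vector per single-view image."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()   # keep the pooled 512-d features
    backbone.eval()
    with torch.no_grad():
        return backbone(views)          # (N, 512)

# Usage: twelve views of one object -> F with one row per feature f_i
F = extract_multiview_features(torch.randn(12, 3, 224, 224))
print(F.shape)  # torch.Size([12, 512])
```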
Preferably, in step S2, for each three-dimensional object, a feature difference metric V between the single-view image features in its multi-view feature F is calculated, and the group representative features are selected as:

F_G = argmax(max(V(F; θ_i)))

where θ_i represents the feature parameters of each single-view image and V(·) denotes the feature difference metric, obtained by calculating the sum of squared differences of the corresponding features. The representatives are extracted by a recursive representative-view method: a feature f_i is first extracted at random; the feature f_j whose sum of feature difference metrics to the previously extracted features is largest is then extracted; and these steps are repeated until M features F_G = {f_g1, f_g2, ..., f_gM} have been extracted as the group representative features of the respective groups, where M is the set number of groups and is not greater than N.
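A sketch of this recursive representative-view extraction, assuming the difference metric is the sum of squared differences between feature vectors as stated above; the function name and seeded random start are illustrative.

```python
import numpy as np

def select_group_representatives(F: np.ndarray, M: int, seed: int = 0) -> list:
    """Pick M group-representative features from F of shape (N, D): start from
    one random feature, then repeatedly add the feature whose summed squared
    difference to all previously chosen representatives is largest."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(F)))]
    while len(chosen) < M:
        diffs = ((F[:, None, :] - F[chosen][None, :, :]) ** 2).sum(axis=(1, 2))
        diffs[chosen] = -np.inf          # never re-pick a representative
        chosen.append(int(diffs.argmax()))
    return chosen                        # indices of F_G = {f_g1, ..., f_gM}
```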
preferably, in step S3, the distances of the group representative feature from the remaining features are calculated, and the expression is:
d(f gi -f j )=||f gi -f j || 2
and (3) dividing the rest of the characteristics into groups with the minimum distance from the characteristics to represent the characteristics, namely:
G l =argmin(min(d(F G ;f j )),l=0,1,2...M-1,j=0,1,2,..N-1
wherein,finally, the characteristic group G is obtained l ,G l A characteristic group containing a plurality of characteristics is set as the M groups; performing maximum pooling operation on all the features in each group to realize local feature fusion and obtain a plurality of group fusion features, wherein the expression is as follows:wherein maxpool represents the maximum pooling operation, N l Indicates the number of features, G, contained in each group l,i The feature F obtained by fusing the ith feature of the ith n ={f 1 ,f 2 ,...f M Contains M fused set features.
Grouping by the feature differences of the multi-view features converts violently changing multi-view data into similar intermediate-layer features, avoids the poor training effect caused by data variation, and has a certain positive effect on extracting highly discriminative three-dimensional object descriptors in practical applications.
Preferably, when the distances between a feature and the group representative features are calculated, if the distances to several group representative features are equal, the feature is simultaneously divided into the groups of those representative features.
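The grouping and local fusion described above can be sketched as follows; note that, unlike the tie rule of the preceding paragraph, this sketch assigns a tied feature only to the first nearest group.

```python
import numpy as np

def fuse_groups(F: np.ndarray, rep_idx: list) -> np.ndarray:
    """Assign every feature to its nearest group representative (L2 distance)
    and max-pool each group into one fused feature, giving F_n of shape (M, D)."""
    reps = F[rep_idx]                                             # (M, D)
    d = np.linalg.norm(F[:, None, :] - reps[None, :, :], axis=2)  # (N, M)
    assign = d.argmin(axis=1)    # ties resolved to the first nearest group
    return np.stack([F[assign == l].max(axis=0) for l in range(len(rep_idx))])
```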
Preferably, the process of inputting the group fusion features into the graph convolution network for the global transfer of local information is as follows (illustrative code sketches are given after step S45):
s41, group characteristics F n Group fusion feature f in (1) i As nodes of the graph structure, and obtaining an adjacency matrix S representing the neighborhood of the graph nodes through an intermediate layer comprising a plurality of layers of MLPs i,j :S i,j =φ(d ij ;θ s )
Wherein d is ij =[f i ,f j ,f i -f j ,||f i -f j || 2 ]∈R 10 Representing spatial relationships between two group-fused featuresPhi (□) denotes that it contains multiple layers of MLP and vectorized fusion of elements in group features, theta s A representing parameter representing a correspondence between the two sets of features;
s42, determining the group characteristics in the nearest neighbor range of each group characteristic through a KNN algorithm, only keeping the relevant edges of the K nearest neighbor group characteristics, and obtaining a sparse connection matrix A i,j :
A i,j =S i,j ·C{f ni ∈K(f nj )};
Wherein, C (-) represents the nearest neighbor operation for judging whether the group feature belongs to another group feature, and multiplication represents the sparsification of the original adjacent matrix;
s43, carrying out graph convolution on the graph structure to obtain graph convolution structure characteristics of the level
Wherein A is l A adjacency matrix representing the l-th layer diagram, F G l Is the original set of features of the class diagram, W l Is the learnable weight matrix, θ, of the layer map l For the parameters of the linear activation function, Ψ is a nonlinear transformation function when the original set of features F is input G l First by the adjacency matrix A l Propagating, and updating group feature nodes by linear transformation of a learnable weight matrix to obtain local graph convolution features of each levelIt is subject to global messaging:
wherein i, j =0,1,2.. M-1,the space relation between two nodes is represented by sigma, which is a relation function between the two nodes, and the practical meaning is that node pairs output by multiple layers of MLPs (multi-level MLPs) gather content messages, and each layer of MLPs comprises a plurality of convolution units and nonlinear activation;
s44, according to the newly obtained spatial relations of all node pairs, fusing the space relations into the local graph convolution characteristics to obtain new local graph convolution characteristics with global information
Wherein omega is a single-layer MLP, and the new local graph convolution characteristics obtained by fusing global node pair information and the original group characteristics in the graph are output through batch normalization fusion characteristics;
s45, obtaining local graph convolution and carrying out global information transmission through the same mode for each level graph structure, gradually eliminating group features with minimum comprehensive distance measurement for each level graph structure based on feature difference distance measurement, and finally obtaining M graph convolution featuresThe local map convolution features of the M levels are fused together by a max-pooling method to form a global descriptor F representing the three-dimensional object GCN :
Here, the adjacent matrix is thinned, and the operation efficiency of the graph structure can be improved.
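The adjacency construction and KNN sparsification of steps S41-S42 might be sketched as follows. The hidden width of the relation MLP φ is an assumption; with D-dimensional node features the relation vector d_ij has 3D + 1 dimensions (10 when D = 3, matching the text).

```python
import torch

def sparse_adjacency(Fn: torch.Tensor, phi: torch.nn.Module, k: int) -> torch.Tensor:
    """Dense adjacency S_ij = phi(d_ij) from pairwise relation vectors, masked
    to the K nearest neighbours to give the sparse connection matrix A_ij."""
    M, D = Fn.shape
    fi = Fn[:, None, :].expand(M, M, D)
    fj = Fn[None, :, :].expand(M, M, D)
    dist = (fi - fj).norm(dim=2, keepdim=True)         # ||f_i - f_j||_2
    d_ij = torch.cat([fi, fj, fi - fj, dist], dim=2)   # (M, M, 3D+1)
    S = phi(d_ij).squeeze(-1)                          # (M, M) dense scores
    knn = dist.squeeze(-1).topk(k, largest=False).indices
    mask = torch.zeros(M, M).scatter_(1, knn, 1.0)     # C{f_i in K(f_j)}
    return S * mask                                    # sparsified adjacency

D = 3  # 3-d node features give the 10-d relation vector of the text
phi = torch.nn.Sequential(torch.nn.Linear(3 * D + 1, 64),
                          torch.nn.ReLU(), torch.nn.Linear(64, 1))
A = sparse_adjacency(torch.randn(8, D), phi, k=3)
```

Steps S43-S45 under the same reading; σ and Ω are reduced to single linear layers for brevity, and the rule that drops the last node at each level is a placeholder for the comprehensive-distance-metric elimination described in S45.

```python
import torch
import torch.nn as nn

class GraphLevel(nn.Module):
    """One level: propagate by A, update by W, then exchange global messages."""
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim)                 # learnable weight matrix W^l
        self.sigma = nn.Linear(2 * dim, dim)         # relation function sigma
        self.omega = nn.Sequential(nn.Linear(2 * dim, dim), nn.BatchNorm1d(dim))

    def forward(self, A: torch.Tensor, F: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.W(A @ F))                # local graph convolution h^l
        M = h.size(0)
        pairs = torch.cat([h[:, None].expand(M, M, -1),
                           h[None, :].expand(M, M, -1)], dim=2)
        r = self.sigma(pairs).mean(dim=1)            # aggregated node-pair messages
        return self.omega(torch.cat([h, r], dim=1))  # fused feature with global info

def global_descriptor(levels, A: torch.Tensor, F: torch.Tensor) -> torch.Tensor:
    """Run the level pyramid, dropping one node per level ('minus-one' style),
    and max-pool the per-level features into the global descriptor F_GCN."""
    outs = []
    for level in levels:
        F = level(A, F)
        outs.append(F.max(dim=0).values)
        keep = torch.arange(F.size(0) - 1)   # placeholder elimination rule
        A, F = A[keep][:, keep], F[keep]
    return torch.stack(outs).max(dim=0).values

levels = [GraphLevel(64) for _ in range(3)]
F_GCN = global_descriptor(levels, torch.eye(6), torch.randn(6, 64))
```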
Preferably, in step S5, the constructed teacher model and student model each comprise an image feature extraction module, a feature grouping and fusion module, and a graph convolution module;

The image feature extraction module extracts all single-view image features of the multi-view data set of the three-dimensional object and combines them into multi-view features; the feature grouping and fusion module, for each three-dimensional object, selects several features from the multi-view features as group representative features on the basis of the feature difference distance metric, divides each remaining feature into the group of its nearest group representative feature, and performs feature fusion on all features in each group to obtain the group fusion features; the graph convolution module performs the global transfer of local information to obtain graph convolution features with global information, and fuses all the graph convolution features into the three-dimensional feature descriptor.
Preferably, the complete multi-view data set in step S5 is a standard public data set, and the defective multi-view data set is the data set obtained from the complete multi-view data set by reducing the number of views and scrambling their order.
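Such a defective set might be derived from a complete view set as in the following sketch; the lower bound of three views is an illustrative assumption.

```python
import random

def make_defective_view_set(views: list, min_views: int = 3) -> list:
    """Simulate application-end input: keep a random subset of the complete
    view set and shuffle its order (view count and ordering both uncertain)."""
    n = random.randint(min_views, len(views))
    subset = random.sample(views, n)   # fewer views from arbitrary viewpoints
    random.shuffle(subset)             # unknown viewing order
    return subset
```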
Preferably, the teacher model is trained with the complete multi-view data set, and the trained teacher model then guides the training of the student model on the defective multi-view data set, as follows: the logits-layer outputs of the teacher model and the student model are denoted x_i and y_i respectively, and their feature difference is measured by the MSE (mean square error):

L_logits = (1/n) Σ_i (x_i - y_i)^2

while the differences between the prediction results and the true labels in the teacher and student networks are each expressed by a cross-entropy function:

L_CE = -Σ q log p

where N represents the number of all views, r represents the number of groups, N_i represents the number of view features of a certain group, s^2 represents the sample correction variance, and p and q represent the prediction result and the true label respectively.

These loss functions are added to measure the similarity and the prediction accuracy of the output layers of the teacher model and the student model; optimizing this sum makes the student model learn the generalization ability of the output layer of the teacher model:

L_out = L_logits + L_CE

The MSE distance between the corresponding intermediate-layer features of the teacher model and the student model is taken as a further part of the loss function, and this part L_hidden is minimized so that the function and structure of the two intermediate layers approach each other:

L_hidden = (1/(nM)) Σ_m Σ_i ||F_T^(m,i) - F_S^(m,i)||_2^2

where F_T^(m,i) and F_S^(m,i) are the corresponding intermediate-layer features of a given layer of the graph structure in the teacher model and the student model respectively, n is the total BatchSize and M is the total number of levels of the graph-structure features; the two parts are combined to form the total loss function required for knowledge distillation:

L_total = L_out + λ·L_hidden

When the difference between the teacher model and the student model is too large, the excess knowledge delivered by the teacher model is not conducive to a good training result; a temperature hyper-parameter T is therefore introduced to adjust the intermediate-layer feature difference and determine the amount of knowledge carried in the distillation, balancing the relationship between the teacher-student difference and the amount of transferred knowledge, smoothing the training curve and improving the training result. The hyper-parameter λ is set so that the order of magnitude of the logits-layer loss is close to that of the hidden-layer loss, and the total loss function is optimized until it converges, giving the trained student model.
Knowledge distillation is thus introduced into the field of multi-view three-dimensional recognition: the complete information learned by the trained teacher model is distilled into the student model, so that with only a few views the student model attains a three-dimensional recognition result close to that of the complete view set. A high-precision three-dimensional recognition model is obtained for arbitrary multi-view input, needing no information other than the views and remaining lightweight.
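A hedged sketch of the combined distillation loss described above; the default values of T and λ, and the equal weighting of the first two terms, are illustrative choices rather than values fixed by the patent.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_mid, teacher_mid,
                      labels, T: float = 4.0, lam: float = 0.1) -> torch.Tensor:
    """Cross-entropy on the student's predictions, MSE between temperature-scaled
    logits, and MSE between corresponding intermediate-layer feature levels."""
    ce = F.cross_entropy(student_logits, labels)
    logit_mse = F.mse_loss(student_logits / T, teacher_logits / T)
    hidden = sum(F.mse_loss(s, t) for s, t in zip(student_mid, teacher_mid))
    hidden = hidden / max(len(student_mid), 1)
    return ce + logit_mse + lam * hidden
```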
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides an application-end-oriented multi-view three-dimensional object recognition method, which utilizes the characteristic difference of multi-view characteristics to group, divides the characteristics possibly from similar visual angles into one group and fuses the characteristics into a plurality of group characteristics, converts the multi-view data with difference into similar intermediate layer characteristics, constructs a teacher model and a student model with the same structure as the teacher model based on the idea, and can realize the functions.
Drawings
Fig. 1 is a schematic flowchart of an application-oriented multi-view three-dimensional object recognition method according to embodiment 1 of the present invention;
fig. 2 is a diagram showing a structure of a teacher model or a student model proposed in embodiment 2 of the present invention;
fig. 3 is a diagram showing a process of processing multi-view data by a teacher model and a student model according to embodiment 2 of the present invention;
fig. 4 is a graph comparing the classification accuracy of the method of the present application with that of other methods on the ModelNet40 dataset under view uncertainty, as set forth in embodiment 3 of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for better illustration of the present embodiment, certain parts of the drawings may be omitted, enlarged or reduced, and do not represent actual dimensions;
it will be understood by those skilled in the art that certain descriptions of well-known structures in the drawings may be omitted.
The technical solution of the present invention is further described with reference to the drawings and the embodiments.
The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
example 1
As shown in fig. 1, the present embodiment provides an application-oriented multi-view three-dimensional object recognition method, and referring to fig. 1, the method includes the following steps:
s1, extracting all view features of a multi-view data set for each three-dimensional object, and selecting a plurality of features from the multi-view features as group representative features based on feature difference distance measurement;
s2, dividing the rest of characteristics except the group representative characteristics into groups where the group representative characteristics closest to the group representative characteristics are located, and performing characteristic fusion on all characteristics in each group to obtain a plurality of group fusion characteristics;
s3, inputting the group fusion characteristics into a graph convolution network, carrying out global transmission of local information to obtain graph convolution characteristics with global information, and fusing all the graph convolution characteristics into a three-dimensional characteristic descriptor;
s4, constructing a teacher model and a student model with the functions of the steps S1-S3, training the teacher model by using the complete multi-view data set, and guiding to train the student model by using the defect multi-view data set and the trained teacher model to obtain a trained student model;
and S5, inputting any multi-view data of the actual to-be-recognized three-dimensional object of the application end into the trained student model to obtain a three-dimensional recognition result.
The method provided by this embodiment groups the multi-view features by their feature differences, placing features likely to come from similar viewing angles into one group and fusing them into several group features, so that divergent multi-view data are converted into similar intermediate-layer features; the group features are finally fused through a graph convolution network into the object descriptor used for three-dimensional recognition. A teacher model realizing these functions and a student model of identical structure are constructed; by the knowledge-distillation method, the teacher model is trained with the complete multi-view data set and then guides the training of the student model, which simulates the real task and trains on arbitrary multi-view data from that task. Finally, for any multi-view input, the student model obtains a good three-dimensional recognition result.
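Putting the pieces together, application-end inference reduces to the following sketch, reusing the helpers sketched in the disclosure above; the interface in which the student returns its logits together with intermediate features is an assumption of the example.

```python
import torch

def recognize(student, views: torch.Tensor) -> int:
    """Application-end inference: any number of views, any order, views only."""
    student.eval()
    with torch.no_grad():
        logits, _ = student(views)      # assumed (logits, mid-features) interface
    return int(logits.argmax(dim=-1))   # predicted object class
```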
In step S1, all the angular views of all the three-dimensional objects form the multi-view data set, and all the angular views of each three-dimensional object are input to a pre-trained feature extraction network to obtain all the single-view image features of the multi-view data set.

For each three-dimensional object, all of its single-view image features are combined into a multi-view feature F, expressed as F = {f_1, f_2, ..., f_N}, where N represents the number of single-view image features of the three-dimensional object and f_i represents the i-th single-view image feature in the multi-view feature F, i = 1, 2, ..., N.

In step S2, for each three-dimensional object, a feature difference metric V between the single-view image features in its multi-view feature F is calculated, and the group representative features are selected as:

F_G = argmax(max(V(F; θ_i)))

where θ_i represents the feature parameters of each single-view image and V(·) denotes the feature difference metric, obtained by calculating the sum of squared differences of the corresponding features. The representatives are extracted by a recursive representative-view method: a feature f_i is first extracted at random; the feature f_j with the largest sum of feature difference metrics to the previously extracted features is then extracted; and these steps are repeated until M features F_G = {f_g1, f_g2, ..., f_gM} have been extracted, where M is the set number of groups, not greater than N.

In step S3, the distance between each group representative feature and each of the remaining features is calculated as:

d(f_gi, f_j) = ||f_gi - f_j||_2

and each remaining feature is divided into the group whose representative feature is at minimum distance from it, that is:

G_l = argmin(min(d(F_G; f_j))), l = 0, 1, 2, ..., M-1, j = 0, 1, 2, ..., N-1

finally yielding the feature groups G_l, where G_l denotes the M feature groups, each containing several features; by calculating the distances between the features in this way, the view features F are divided into M groups. A maximum pooling operation is performed on all features within each group to realize local feature fusion and obtain the group fusion features:

f_l = maxpool(G_l,1, G_l,2, ..., G_l,N_l)

where maxpool denotes the maximum pooling operation, N_l denotes the number of features contained in each group and G_l,i denotes the i-th feature of the l-th group; the fused set F_n = {f_1, f_2, ..., f_M} contains the M fused group features. When the distances between a feature and the group representative features are calculated, if the distances to several representatives are equal, the feature is simultaneously divided into the groups of those representatives. Grouping by the feature differences of the multi-view features converts violently changing multi-view data into similar intermediate-layer features, avoids the poor training effect caused by data variation, and has a certain positive effect on extracting highly discriminative three-dimensional object descriptors in practical applications.
The process of inputting the group fusion characteristics into the graph convolution network for global transmission of local information is as follows:
s41, group characteristics F n Group fusion feature f in (1) i As nodes of the graph structure, and obtaining an adjacency matrix S representing the neighborhood of the graph nodes through an intermediate layer comprising a plurality of layers of MLPs i,j :S i,j =φ(d ij ;θ s )
Wherein d is ij =[f i ,f j ,f i -f j ,||f i -f j || 2 ]∈R 10 Representing the spatial relationship between two group-fused features, phi (□) represents the inclusion of multiple layers of MLP, and vectorized fusion of elements in the group features, theta s Representing parameters representing the correspondence between the two sets of features;
s42, determining the group characteristics in the nearest neighbor range of each group characteristic through a KNN algorithm, only keeping the relevant edges of the K nearest neighbor group characteristics, and obtaining a sparse connection matrix A i,j :
A i,j =S i,j ·C{f ni ∈K(f nj )};
Wherein, C (-) represents the nearest neighbor operation for judging whether the group feature belongs to another group feature, and multiplication represents the sparsification of the original adjacent matrix;
s43, carrying out graph convolution on the graph structure to obtain graph convolution structure characteristics of the level
Wherein A is l A adjacency matrix representing the l-th layer diagram, F G l Is the original set of features of the class diagram, W l Is the learnable weight matrix, θ, of the layer map l For the parameters of the linear activation function, Ψ is a nonlinear transformation function when the original set of features F is input G l First by the adjacency matrix A l Propagating, and updating group feature nodes by linear transformation of a learnable weight matrix to obtain local graph convolution features of each levelIt is subject to global messaging:
wherein i, j =0,1,2.. M-1,the space relation between two nodes is represented by sigma, which is a relation function between the two nodes, and the practical meaning is that node pairs output by multiple layers of MLPs (multi-level MLPs) gather content messages, and each layer of MLPs comprises a plurality of convolution units and nonlinear activation;
s44, according to the newly obtained spatial relations of all node pairs, fusing the space relations into the convolution characteristics of the local graph to obtain the graph with global informationNew local graph convolution feature
Wherein omega is a single-layer MLP, and the new local graph convolution characteristics fused by global node pair information and the original group characteristics in the graph are output through batch normalization fusion characteristics;
s45, obtaining local graph convolution and carrying out global information transmission through the same mode for each level graph structure, gradually eliminating group features with minimum comprehensive distance measurement for each level graph structure based on feature difference distance measurement, and finally obtaining M graph convolution featuresThe local map convolution features of the M levels are fused together by a max-pooling method to form a global descriptor F representing the three-dimensional object GCN :The adjacent matrix is thinned, so that the operation efficiency of the graph structure can be improved. The model uses the distance measurement designed when selecting the group to represent the feature, and gradually eliminates the group feature with the minimum comprehensive distance measurement at each level, which can be called as 'minus one sampling method' (M at the first level, M-1 at the second level, and so on), and M image volume features can be obtained by feature sampling and local image volume processing method
Example 2
In this embodiment, the structures of the teacher model and the student model are shown in fig. 2. Referring to fig. 2, the constructed teacher model and student model each comprise an image feature extraction module, a feature grouping and fusion module, and a graph convolution module;

Specifically, the image feature extraction module extracts all single-view image features of the multi-view data set of the three-dimensional object and combines them into multi-view features; the feature grouping and fusion module, for each three-dimensional object, selects several features from the multi-view features as group representative features on the basis of the feature difference distance metric, divides each remaining feature into the group of its nearest representative, and fuses all features in each group to obtain the group fusion features; the graph convolution module performs the global transfer of local information to obtain graph convolution features with global information and fuses all of them into the three-dimensional feature descriptor. The teacher model and the student model can thus realize the functions of steps S1-S4; the processing procedure is shown in fig. 3. When the teacher model and the student model are trained, the complete multi-view data set used by the teacher model is a standard public data set, and the defective multi-view data set is the data set obtained from it by reducing the number of views and scrambling their order.
The teacher model is trained with the complete multi-view data set, and the trained teacher model guides the training of the student model on the defective multi-view data set, as follows: the logits-layer outputs of the teacher model and the student model are denoted x_i and y_i respectively, and their feature difference is measured by the MSE (mean square error):

L_logits = (1/n) Σ_i (x_i - y_i)^2

while the differences between the prediction results and the true labels in the teacher and student networks are each expressed by a cross-entropy function:

L_CE = -Σ q log p

where N represents the number of all views, r represents the number of groups, N_i represents the number of view features of a certain group, s^2 represents the sample correction variance, and p and q represent the prediction result and the true label respectively.

These loss functions are added to measure the similarity and the prediction accuracy of the output layers of the teacher model and the student model; optimizing this sum makes the student model learn the generalization ability of the output layer of the teacher model:

L_out = L_logits + L_CE

The MSE distance between the corresponding intermediate-layer features of the teacher model and the student model is taken as a further part of the loss function, and this part L_hidden is minimized so that the function and structure of the two intermediate layers approach each other:

L_hidden = (1/(nM)) Σ_m Σ_i ||F_T^(m,i) - F_S^(m,i)||_2^2

where F_T^(m,i) and F_S^(m,i) are the corresponding intermediate-layer features of a given layer of the graph structure in the teacher model and the student model respectively, n is the total BatchSize and M is the total number of levels of the graph-structure features; the two parts are combined to form the total loss function required for knowledge distillation:

L_total = L_out + λ·L_hidden

When the difference between the teacher model and the student model is too large, the excess knowledge delivered by the teacher model is not conducive to a good training result; a temperature hyper-parameter T is therefore introduced to adjust the intermediate-layer feature difference and determine the amount of knowledge carried in the distillation, balancing the relationship between the teacher-student difference and the amount of transferred knowledge, smoothing the training curve and improving the training result. The hyper-parameter λ is set so that the order of magnitude of the logits-layer loss is close to that of the hidden-layer loss, and the total loss function is optimized until it converges, giving the trained student model. The method introduces knowledge distillation into the field of multi-view three-dimensional recognition, distilling the complete information learned by the trained teacher model into the student model, so that with only a few views the student model attains a three-dimensional recognition result close to that of the complete view set; a high-precision three-dimensional recognition model is obtained for arbitrary multi-view input, needing no information other than the views and remaining lightweight.
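One guided training step might then look as follows, reusing a combined loss like the sketch in the disclosure above; the interface in which each model returns its logits together with its intermediate-layer features is an assumption of the example.

```python
import torch

def train_student_step(teacher, student, full_views, defect_views, labels,
                       optimizer, loss_fn) -> float:
    """One step: the frozen teacher sees the complete view set, the student sees
    the defective one, and the student is optimized against the combined loss."""
    teacher.eval()
    with torch.no_grad():
        t_logits, t_mid = teacher(full_views)   # assumed (logits, mid-features)
    s_logits, s_mid = student(defect_views)
    loss = loss_fn(s_logits, t_logits, s_mid, t_mid, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```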
Example 3
In this embodiment, the ModelNet40 and ModelNet10 data sets are used to evaluate the method of the present invention, and the effect is further illustrated by the following simulation experiments.

The ModelNet40 and ModelNet10 datasets used in the experiments are multi-view datasets of three-dimensional objects. The ModelNet40 multi-view dataset holds multi-view data (12 or 20 views) from 12311 three-dimensional objects in 40 classes; the dataset is split as follows: 9843 objects form the training set and 2468 objects form the test set, and when ModelNet40 is tested, the three-dimensional recognition results of 20 views and of 12 views are tested separately. The ModelNet10 dataset is much smaller than ModelNet40: it holds multi-view data (12 or 20 views) from 4899 three-dimensional objects in 10 classes, of which 3991 objects serve as the training set and 908 as the test set; the invention likewise tests the three-dimensional recognition effect with 20 views and with 12 views.
The comparison methods include multi-view three-dimensional recognition methods such as MVCNN, GVCNN, MHBN, MLVCNN, RotationNet, View-GCN and CAR-Net. The main comparison indices are the classification and retrieval accuracy in three-dimensional recognition. Classification accuracy is the ratio of correctly classified samples to the total number of samples; for retrieval, the mAP is computed by ranking the L_2 distances between features, taking the three-dimensional objects at the smallest distances as the prediction result, and finally calculating the mean average retrieval precision. The comparison results are shown in Table 1.
TABLE 1
As can be seen from Table 1, the method achieves a good three-dimensional recognition result when the multi-view data are input completely. Table 2 shows the model sizes of the method of the present application, the MVCNN method and View-GCN.
TABLE 2

Method | The method of the present invention | MVCNN | View-GCN
Model size | 63.76 MB | 491.84 MB | 129.48 MB
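For reference, the two evaluation indices described above can be sketched as follows; the leave-one-out query protocol in the mAP computation is an assumption of the example.

```python
import numpy as np

def classification_accuracy(pred: np.ndarray, labels: np.ndarray) -> float:
    """Ratio of correctly classified samples to the total number of samples."""
    return float((pred == labels).mean())

def retrieval_map(descriptors: np.ndarray, labels: np.ndarray) -> float:
    """Mean average precision with L2-distance ranking over object descriptors."""
    aps = []
    for i in range(len(descriptors)):
        d = np.linalg.norm(descriptors - descriptors[i], axis=1)
        order = np.argsort(d)[1:]                 # exclude the query itself
        rel = (labels[order] == labels[i]).astype(float)
        if rel.sum() == 0:
            continue
        prec = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append(float((prec * rel).sum() / rel.sum()))
    return float(np.mean(aps))
```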
Assume the training images are numbered 1-20 in order; under out-of-order input, the input multi-view data may take forms such as 8 views whose order and source are 13, 7, 2, 14, 3, 6, 7, 9. In fig. 4 the abscissa is the number of views and the ordinate is the classification accuracy. As can be seen from fig. 4, the model still obtains a good three-dimensional object recognition result when the views are out of order and few in number, verifying the effectiveness of the method provided by the present invention.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.
Claims (10)
1. An application-end-oriented multi-view three-dimensional object identification method is characterized by comprising the following steps:
s1, extracting all view features of a multi-view data set for each three-dimensional object, and selecting a plurality of features from the multi-view features as group representative features based on feature difference distance measurement;
s2, dividing the rest of characteristics except the group representative characteristics into groups where the group representative characteristics closest to the group representative characteristics are located, and performing characteristic fusion on all characteristics in each group to obtain a plurality of group fusion characteristics;
s3, inputting the group fusion characteristics into a graph convolution network, carrying out global transmission of local information to obtain graph convolution characteristics with global information, and fusing all the graph convolution characteristics into a three-dimensional characteristic descriptor;
s4, constructing a teacher model and a student model with the functions of the steps S1-S3, training the teacher model by using the complete multi-view data set, guiding and training the student model by using the defect multi-view data set and the trained teacher model, and obtaining a trained student model;
and S5, inputting any multi-view data of the actual to-be-recognized three-dimensional object of the application end into the trained student model to obtain a three-dimensional recognition result.
2. The application-oriented multi-view three-dimensional object recognition method of claim 1, wherein in step S1, all the angular views of all the three-dimensional objects form the multi-view data set, and all the angular views of each three-dimensional object are input to a pre-trained feature extraction network to obtain all the single-view image features of the multi-view data set.
3. The application-oriented multi-view three-dimensional object recognition method of claim 2, wherein for each three-dimensional object, all of its single-view image features are combined into a multi-view feature F, expressed as F = {f_1, f_2, ..., f_N}, where N represents the number of single-view image features contained in the three-dimensional object and f_i represents the i-th single-view image feature in the multi-view feature F, i = 1, 2, ..., N.
4. The method for identifying multi-view three-dimensional objects facing the application end according to claim 1, wherein in step S2, for each three-dimensional object, a feature difference metric V between the single-view image features in its multi-view feature F is calculated, and the group representative features are selected as:

F_G = argmax(max(V(F; θ_i)))

where θ_i represents the feature parameters of each single-view image and V(·) denotes the feature difference metric, obtained by calculating the sum of squared differences of the corresponding features; the representatives are extracted by a recursive representative-view method: a feature f_i is first extracted at random, the feature f_j with the largest sum of feature difference metrics to the previously extracted features is then extracted, and these steps are repeated until M features F_G = {f_g1, f_g2, ..., f_gM} have been extracted, where M is the set number of groups, not greater than N.
5. the method for identifying a multi-view three-dimensional object facing an application end according to claim 4, wherein in step S3, the distances between the group representative feature and the rest of the features are calculated, and the expression is as follows:
d(f gi -f j )=||f gi -f j || 2
and (3) classifying the rest characteristics into the group with the minimum distance from the characteristics to represent the characteristics, namely satisfying the following conditions:
G l =argmin(min(d(F G ;f j )),l=0,1,2...M-1,j=0,1,2,..N-1
wherein,finally, the characteristic group G is obtained l ,G l A characteristic group containing a plurality of characteristics is set as the M groups; performing maximum pooling operation on all the features in each group to realize local feature fusion and obtain a plurality of group fusion features, wherein the expression is as follows:wherein maxpool represents the maximum pooling operation, N l Indicates the number of features, G, contained in each group l,i The feature F obtained by fusing the ith feature of the ith n ={f 1 ,f 2 ,...f M Contains M fused set features.
6. The method for identifying a multi-view three-dimensional object facing the application end according to claim 4, wherein when the distances between a feature and the group representative features are calculated, if the distances to several group representative features are equal, the feature is simultaneously divided into the groups in which those group representative features are located.
7. The application-oriented multi-view three-dimensional object recognition method of claim 5, wherein the process of inputting the group fusion features into the graph convolution network for global transmission of local information comprises:
s41, group characteristics F n Group fusion feature f in (1) i As nodes of the graph structure, and obtaining an adjacency matrix S representing the neighborhood of the graph nodes through an intermediate layer comprising a plurality of layers of MLPs i,j :S i,j =φ(d ij ;θ s )
Wherein d is ij =[f i ,f j ,f i -f j ,||f i -f j || 2 ]∈R 10 Representing the spatial relationship between the two group-fused features,the representation comprises multiple layers of MLPs and vectorized fusion is carried out on elements in the group characteristics, theta s A representing parameter representing a correspondence between the two sets of features;
s42, determining the group characteristics in the nearest neighbor range of each group characteristic through a KNN algorithm, only keeping the relevant edges of the K nearest neighbor group characteristics, and obtaining a sparse connection matrix A i,j :
A i,j =S i,j ·C{f ni ∈K(f nj )};
Wherein, C (-) represents the nearest neighbor operation for judging whether the group feature belongs to another group feature, and multiplication represents the sparsification of the original adjacent matrix;
s43, carrying out graph convolution on the graph structure to obtain graph convolution structure characteristics of the level
Wherein A is l A adjacency matrix representing the l-th layer diagram, F G l Is the original set of features of the class diagram, W l Is the learnable weight matrix, θ, of the layer map l For the parameters of the linear activation function, Ψ is a nonlinear transformation function when the original set of features F is input G l First by the adjacency matrix A l Propagating, and updating group feature nodes by linear transformation of a learnable weight matrix to obtain local graph convolution features of each levelIt is subject to global messaging:
wherein i, j =0,1,2.. M-1,the space relation between two nodes is represented by sigma, which is a relation function between the two nodes, and the practical meaning is that node pairs output by multiple layers of MLPs (multi-level MLPs) gather content messages, and each layer of MLPs comprises a plurality of convolution units and nonlinear activation;
s44, according to the newly obtained spatial relationship of all node pairs, fusing the spatial relationship with the local graph convolution characteristics to obtain new local graph convolution characteristics with global information
Wherein omega is a single-layer MLP, and the new local graph convolution characteristics fused by global node pair information and the original group characteristics in the graph are output through batch normalization fusion characteristics;
s45, obtaining local graph convolution and carrying out global information transmission through the same mode for each level graph structure, gradually eliminating group features with minimum comprehensive distance measurement for each level graph structure based on feature difference distance measurement, and finally obtaining M graph convolution featuresThe local map convolution features of the M levels are fused together by a max-pooling method to form a global descriptor F representing the three-dimensional object GCN :
8. The method for identifying a multi-view three-dimensional object facing the application end according to claim 1, wherein in step S5, the constructed teacher model and student model each comprise an image feature extraction module, a feature grouping and fusion module, and a graph convolution module;

the image feature extraction module extracts all single-view image features of the multi-view data set of the three-dimensional object and combines them into multi-view features; the feature grouping and fusion module, for each three-dimensional object, selects several features from the multi-view features as group representative features on the basis of the feature difference distance metric, divides each remaining feature into the group of its nearest group representative feature, and performs feature fusion on all features in each group to obtain the group fusion features; the graph convolution module performs the global transfer of local information to obtain graph convolution features with global information and fuses all the graph convolution features into the three-dimensional feature descriptor.
9. The application-oriented multi-view three-dimensional object identification method according to claim 8, wherein the complete multi-view data set in step S5 is a standard public data set, and the defective multi-view data set is obtained from the complete multi-view data set by reducing the number of views and shuffling the view order.
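A minimal sketch of constructing such a defective set from a complete one (plain Python; the function name and the `keep` parameter are hypothetical):

```python
import random

def make_defective_views(views: list, keep: int, seed=None) -> list:
    """Sketch: build the defective multi-view sample by dropping views
    down to `keep` and disordering the remaining sequence."""
    rng = random.Random(seed)
    kept = rng.sample(views, keep)  # reduce the number of views
    rng.shuffle(kept)               # shuffle the view order
    return kept
```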
10. The application-oriented multi-view three-dimensional object identification method according to claim 9, wherein the teacher model is trained with the complete multi-view data set, and the student model is trained with the defective multi-view data set under the guidance of the teacher model, as follows: the logits-layer outputs of the teacher model and the student model are extracted and denoted x_i and y_i respectively; their feature difference is measured by the MSE mean squared error, and the cross-entropy function expresses the difference between the prediction result and the true label in the teacher-student network:

\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - y_i)^2, \qquad \mathcal{L}_{\mathrm{CE}} = -\sum_{i} p_i \log q_i

wherein N denotes the number of all views, r the number of groups, and N_i the number of view features in a given group; the factor 1/(N-1) reflects the sample-corrected variance, and p and q denote the prediction result and the true label, respectively;
the above loss functions are added together to measure both the similarity and the prediction accuracy of the output layers of the teacher and student models, and are optimized so that the student model learns the generalization ability of the teacher model's output layer:

\mathcal{L}_{\mathrm{logits}} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{CE}}
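A minimal sketch of this output-layer loss (PyTorch assumed; the default mean reductions are assumptions):

```python
import torch
import torch.nn.functional as F

def logits_distillation_loss(teacher_logits: torch.Tensor,
                             student_logits: torch.Tensor,
                             labels: torch.Tensor) -> torch.Tensor:
    """Sketch of L_logits: MSE between teacher and student logits plus
    cross-entropy between the student's prediction and the true label."""
    mse = F.mse_loss(student_logits, teacher_logits)
    ce = F.cross_entropy(student_logits, labels)
    return mse + ce
```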
the MSE distance between the corresponding intermediate-layer features of the teacher model and the student model is minimized as the other part of the loss function, \mathcal{L}_{\mathrm{hidden}}, so that the function and structure of the two intermediate layers are brought close to each other:

\mathcal{L}_{\mathrm{hidden}} = \frac{1}{nM} \sum_{l=1}^{M} \big\| F_{T}^{l} - F_{S}^{l} \big\|_{2}^{2}

wherein F_T^l and F_S^l denote the corresponding intermediate-layer features of a given graph-structure level in the teacher model and the student model, n is the total BatchSize, and M is the total number of graph-structure levels; the two parts are combined to form the total loss function required by the student model in knowledge distillation:

\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{logits}} + \lambda \, \mathcal{L}_{\mathrm{hidden}}
When the gap between the teacher model and the student model is too large, the excessive knowledge supplied by the teacher model is not beneficial to a good training effect; a temperature hyper-parameter T is therefore introduced to adjust the intermediate-layer feature difference and to set the amount of knowledge carried in the knowledge distillation, balancing the relation between the teacher-student gap and the amount of transferred knowledge, smoothing the training curve and improving the training effect; and a hyper-parameter \lambda is set so that the magnitude of the logits-layer loss is close to that of the hidden-layer loss, and the total loss function is optimized until it converges, yielding the trained student model.
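A sketch of the combined objective, with the temperature T applied to the intermediate-layer feature gap as the claim describes and \lambda balancing the two terms (the default values of T and \lambda are illustrative only, not from the patent):

```python
import torch
import torch.nn.functional as F

def total_distillation_loss(t_logits, s_logits, labels,
                            t_hidden, s_hidden,
                            T: float = 4.0, lam: float = 0.1) -> torch.Tensor:
    """Sketch of L_total = L_logits + lambda * L_hidden, where the
    temperature T softens the intermediate-feature difference."""
    logits_loss = F.mse_loss(s_logits, t_logits) + F.cross_entropy(s_logits, labels)
    # Temperature-scaled MSE over the M corresponding graph-level features
    hidden_loss = sum(F.mse_loss(s / T, t / T)
                      for s, t in zip(s_hidden, t_hidden)) / len(t_hidden)
    return logits_loss + lam * hidden_loss
```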
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211102704.0A CN115601745A (en) | 2022-09-09 | 2022-09-09 | Multi-view three-dimensional object identification method facing application end |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115601745A true CN115601745A (en) | 2023-01-13 |
Family
ID=84843876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211102704.0A Pending CN115601745A (en) | 2022-09-09 | 2022-09-09 | Multi-view three-dimensional object identification method facing application end |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115601745A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117688504A (en) * | 2024-02-04 | 2024-03-12 | 西华大学 | Internet of things abnormality detection method and device based on graph structure learning |
CN117688504B (en) * | 2024-02-04 | 2024-04-16 | 西华大学 | Internet of things abnormality detection method and device based on graph structure learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |