CN115601745A - Multi-view three-dimensional object identification method facing application end
- Publication number
- CN115601745A (application number CN202211102704.0A)
- Authority
- CN
- China
- Legal status: Pending
Classifications
- G06V 20/64 — Scenes; scene-specific elements; type of objects; three-dimensional objects
- G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V 10/774 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V 10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V 10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention provides a multi-view three-dimensional object identification method oriented to the application end, and relates to the technical field of three-dimensional object identification. The method first groups the multi-view features on the basis of feature differences, so that features likely to come from similar viewing angles fall into the same group and are fused into several group features; divergent multi-view data are thereby converted into similar intermediate-layer features. A teacher model is then trained with the complete view set, and knowledge distillation is used so that the teacher model, trained on the complete multi-view data set, guides the training of a student model; the student model thus acquires the ability to cope with few views and uncertain viewing angles. In practical application tasks, the student model obtains a good three-dimensional recognition result for arbitrary multi-view input, requires only view information, and is lightweight. The invention helps to resolve the degradation of three-dimensional recognition caused by small view counts and insufficient information in practical applications.
Description
Technical Field
The invention relates to the technical field of three-dimensional object identification, and in particular to a multi-view three-dimensional object identification method oriented to the application end.
Background
In recent years, with vigorous development in fields such as intelligent robots, automatic driving, virtual reality and medical imaging, three-dimensional object recognition has become a new research focus. In the deep-learning era, deep neural networks of various kinds are widely applied to three-dimensional object recognition; among the many approaches, multi-view-based methods have attracted the most attention because their data are easy to acquire and convenient to process. With a CNN model pre-trained on a large-scale data set such as ImageNet, multi-view three-dimensional object recognition leads in recognition accuracy and has become the current mainstream approach.
MVCNN (Multi-view CNN) combines multiple two-dimensional projection features learned by a convolutional neural network (CNN) in an end-to-end trainable manner; the method became a milestone of three-dimensional shape recognition and achieved the then state-of-the-art performance. Since the birth of MVCNN, many multi-view three-dimensional recognition methods have appeared, and this line of research mainly focuses on how to perform efficient feature fusion or reduce information redundancy so as to improve recognition accuracy. However, an important factor affecting the three-dimensional recognition result is often ignored by researchers: the reliability of the data set. At present, a multi-view data set is mainly obtained from a known three-dimensional object by rendering single views in turn from a number of preset viewing angles according to a fixed rule. In a real scene, however, owing to occlusion, uncertain viewpoint positions and other factors, the multi-view data actually obtained are often far from this ideal situation. At the application end, because of equipment limitations and the requirements of the specific scene, multi-view data frequently suffer from a small number of views, uncertain viewing angles and similar problems, all of which can seriously degrade the three-dimensional recognition accuracy of an object.
Many current methods have attempted to solve the above problems; although some results have been achieved, each still has its own drawbacks. For example, MVCNN can fuse the features of an arbitrary set of views by maximum pooling: the prior art discloses a three-dimensional model identification method based on visual-saliency sharing, which first obtains the three-dimensional model to be retrieved, derives a two-dimensional view sequence from it and extracts the visual feature vectors of that sequence; the visual features are then fed into an MVCNN branch and a visual-saliency branch, and the complex features of the MVCNN branch are fused with the saliency features of the visual-saliency branch to form fused features; finally, the three-dimensional model is retrieved or classified through the fused features. However, a large amount of information is lost in the pooling, so the effect is poor. The RotationNet method, in turn, requires the viewpoint information of the camera to complete three-dimensional recognition, which departs seriously from actual requirements. A three-dimensional recognition method in practical application must therefore fulfil two requirements: 1. it obtains good results from arbitrary multi-view data; 2. it needs no information other than the view data. At the application end, these two requirements must be met simultaneously.
Disclosure of Invention
In order to solve the problems of a small number of views and uncertain viewing angles of the object to be recognized in practical three-dimensional object recognition scenarios, the invention provides an application-end-oriented multi-view three-dimensional object recognition method; it resolves the poor machine-learning training effect caused by the input-data differences of real scenes, requires only view information, and is lightweight.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
an application-end-oriented multi-view three-dimensional object recognition method comprises the following steps:
s1, extracting all view features of a multi-view data set for each three-dimensional object, and selecting a plurality of features from the multi-view features as group representative features based on feature difference distance measurement;
s2, dividing the rest of characteristics except the group representative characteristics into groups where the group representative characteristics closest to the group representative characteristics are located, and performing characteristic fusion on all characteristics in each group to obtain a plurality of group fusion characteristics;
s3, inputting the group fusion characteristics into a graph convolution network, carrying out global transmission of local information to obtain graph convolution characteristics with global information, and fusing all the graph convolution characteristics into a three-dimensional characteristic descriptor;
s4, constructing a teacher model and a student model with the functions of the steps S1-S3, training the teacher model by using the complete multi-view data set, guiding and training the student model by using the defect multi-view data set and the trained teacher model, and obtaining a trained student model;
and S5, inputting any multi-view data of the actual to-be-recognized three-dimensional object of the application end into the trained student model to obtain a three-dimensional recognition result.
This technical scheme groups the multi-view features by their feature differences, so that features likely to come from similar viewing angles fall into one group and are fused into several group features; multi-view data with large differences are thus converted into similar intermediate-layer features, which alleviates the poor training effect caused by differences in the input data. Feature fusion of the group features is then performed through a graph convolution network to form the object descriptor used for three-dimensional identification. To address the poor three-dimensional recognition caused by few views and insufficient information, knowledge distillation is used: a teacher model trained on complete multi-view data guides the training of a student model, and the student model simulates the real task by training on arbitrary multi-view data from that task. Finally, for any multi-view input the student model obtains a good three-dimensional recognition result, needs only view information, and is lightweight.
Preferably, in step S1, all the angular views of all the three-dimensional objects form the multi-view data set, and all the angular views of each three-dimensional object are input to a pre-trained feature extraction network to obtain all the single-view image features of the multi-view data set.

Preferably, for each three-dimensional object, all of its single-view image features are combined into a multi-view feature F, expressed as F = {f_1, f_2, ..., f_N}, where N represents the number of single-view image features of the three-dimensional object and f_i represents the i-th single-view image feature in the multi-view feature F, i = 1, 2, ..., N.
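As an illustration of this step, a minimal sketch of per-view feature extraction follows. The choice of a torchvision ResNet-18 backbone and its 512-dimensional output are assumptions made for the example; the patent only requires some pre-trained feature extraction network.

```python
import torch
import torchvision.models as models

def extract_multiview_features(views: torch.Tensor) -> torch.Tensor:
    """Map N rendered views (N, 3, 224, 224) to the multi-view feature set
    F = {f_1, ..., f_N}, one feature vector per single-view image."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()   # keep the pooled 512-d features
    backbone.eval()
    with torch.no_grad():
        return backbone(views)          # (N, 512)

# Usage: twelve views of one object -> F with one row per feature f_i
F = extract_multiview_features(torch.randn(12, 3, 224, 224))
print(F.shape)  # torch.Size([12, 512])
```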
Preferably, in step S2, for each three-dimensional object, a feature difference metric V between the single-view image features in its multi-view feature F is calculated, and the group representative features are selected as:

F_G = argmax(max(V(F; θ_i)))

where θ_i represents the feature parameters of each single-view image and V(·) denotes the feature difference metric, obtained by calculating the sum of squared differences of the corresponding features. The representatives are extracted by a recursive representative-view method: a feature f_i is first extracted at random; the feature f_j whose sum of feature difference metrics to the previously extracted features is largest is then extracted; and these steps are repeated until M features F_G = {f_g1, f_g2, ..., f_gM} have been extracted as the group representative features of the respective groups, where M is the set number of groups and is not greater than N.
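A sketch of this recursive representative-view extraction, assuming the difference metric is the sum of squared differences between feature vectors as stated above; the function name and seeded random start are illustrative.

```python
import numpy as np

def select_group_representatives(F: np.ndarray, M: int, seed: int = 0) -> list:
    """Pick M group-representative features from F of shape (N, D): start from
    one random feature, then repeatedly add the feature whose summed squared
    difference to all previously chosen representatives is largest."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(F)))]
    while len(chosen) < M:
        diffs = ((F[:, None, :] - F[chosen][None, :, :]) ** 2).sum(axis=(1, 2))
        diffs[chosen] = -np.inf          # never re-pick a representative
        chosen.append(int(diffs.argmax()))
    return chosen                        # indices of F_G = {f_g1, ..., f_gM}
```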
preferably, in step S3, the distances of the group representative feature from the remaining features are calculated, and the expression is:
d(f gi -f j )=||f gi -f j || 2
and (3) dividing the rest of the characteristics into groups with the minimum distance from the characteristics to represent the characteristics, namely:
G l =argmin(min(d(F G ;f j )),l=0,1,2...M-1,j=0,1,2,..N-1
wherein,finally, the characteristic group G is obtained l ,G l A characteristic group containing a plurality of characteristics is set as the M groups; performing maximum pooling operation on all the features in each group to realize local feature fusion and obtain a plurality of group fusion features, wherein the expression is as follows:wherein maxpool represents the maximum pooling operation, N l Indicates the number of features, G, contained in each group l,i The feature F obtained by fusing the ith feature of the ith n ={f 1 ,f 2 ,...f M Contains M fused set features.
Grouping by the feature differences of the multi-view features converts violently changing multi-view data into similar intermediate-layer features, avoids the poor training effect caused by data variation, and has a certain positive effect on extracting highly discriminative three-dimensional object descriptors in practical applications.
Preferably, when the distances between a feature and the group representative features are calculated, if the distances to several group representative features are equal, the feature is simultaneously divided into the groups of those representative features.
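The grouping and local fusion described above can be sketched as follows; note that, unlike the tie rule of the preceding paragraph, this sketch assigns a tied feature only to the first nearest group.

```python
import numpy as np

def fuse_groups(F: np.ndarray, rep_idx: list) -> np.ndarray:
    """Assign every feature to its nearest group representative (L2 distance)
    and max-pool each group into one fused feature, giving F_n of shape (M, D)."""
    reps = F[rep_idx]                                             # (M, D)
    d = np.linalg.norm(F[:, None, :] - reps[None, :, :], axis=2)  # (N, M)
    assign = d.argmin(axis=1)    # ties resolved to the first nearest group
    return np.stack([F[assign == l].max(axis=0) for l in range(len(rep_idx))])
```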
Preferably, the process of inputting the group fusion features into the graph convolution network for the global transfer of local information is as follows (illustrative code sketches are given after step S45):
s41, group characteristics F n Group fusion feature f in (1) i As nodes of the graph structure, and obtaining an adjacency matrix S representing the neighborhood of the graph nodes through an intermediate layer comprising a plurality of layers of MLPs i,j :S i,j =φ(d ij ;θ s )
Wherein d is ij =[f i ,f j ,f i -f j ,||f i -f j || 2 ]∈R 10 Representing spatial relationships between two group-fused featuresPhi (□) denotes that it contains multiple layers of MLP and vectorized fusion of elements in group features, theta s A representing parameter representing a correspondence between the two sets of features;
s42, determining the group characteristics in the nearest neighbor range of each group characteristic through a KNN algorithm, only keeping the relevant edges of the K nearest neighbor group characteristics, and obtaining a sparse connection matrix A i,j :
A i,j =S i,j ·C{f ni ∈K(f nj )};
Wherein, C (-) represents the nearest neighbor operation for judging whether the group feature belongs to another group feature, and multiplication represents the sparsification of the original adjacent matrix;
s43, carrying out graph convolution on the graph structure to obtain graph convolution structure characteristics of the level
Wherein A is l A adjacency matrix representing the l-th layer diagram, F G l Is the original set of features of the class diagram, W l Is the learnable weight matrix, θ, of the layer map l For the parameters of the linear activation function, Ψ is a nonlinear transformation function when the original set of features F is input G l First by the adjacency matrix A l Propagating, and updating group feature nodes by linear transformation of a learnable weight matrix to obtain local graph convolution features of each levelIt is subject to global messaging:
wherein i, j =0,1,2.. M-1,the space relation between two nodes is represented by sigma, which is a relation function between the two nodes, and the practical meaning is that node pairs output by multiple layers of MLPs (multi-level MLPs) gather content messages, and each layer of MLPs comprises a plurality of convolution units and nonlinear activation;
s44, according to the newly obtained spatial relations of all node pairs, fusing the space relations into the local graph convolution characteristics to obtain new local graph convolution characteristics with global information
Wherein omega is a single-layer MLP, and the new local graph convolution characteristics obtained by fusing global node pair information and the original group characteristics in the graph are output through batch normalization fusion characteristics;
s45, obtaining local graph convolution and carrying out global information transmission through the same mode for each level graph structure, gradually eliminating group features with minimum comprehensive distance measurement for each level graph structure based on feature difference distance measurement, and finally obtaining M graph convolution featuresThe local map convolution features of the M levels are fused together by a max-pooling method to form a global descriptor F representing the three-dimensional object GCN :
Here, the adjacent matrix is thinned, and the operation efficiency of the graph structure can be improved.
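The adjacency construction and KNN sparsification of steps S41-S42 might be sketched as follows. The hidden width of the relation MLP φ is an assumption; with D-dimensional node features the relation vector d_ij has 3D + 1 dimensions (10 when D = 3, matching the text).

```python
import torch

def sparse_adjacency(Fn: torch.Tensor, phi: torch.nn.Module, k: int) -> torch.Tensor:
    """Dense adjacency S_ij = phi(d_ij) from pairwise relation vectors, masked
    to the K nearest neighbours to give the sparse connection matrix A_ij."""
    M, D = Fn.shape
    fi = Fn[:, None, :].expand(M, M, D)
    fj = Fn[None, :, :].expand(M, M, D)
    dist = (fi - fj).norm(dim=2, keepdim=True)         # ||f_i - f_j||_2
    d_ij = torch.cat([fi, fj, fi - fj, dist], dim=2)   # (M, M, 3D+1)
    S = phi(d_ij).squeeze(-1)                          # (M, M) dense scores
    knn = dist.squeeze(-1).topk(k, largest=False).indices
    mask = torch.zeros(M, M).scatter_(1, knn, 1.0)     # C{f_i in K(f_j)}
    return S * mask                                    # sparsified adjacency

D = 3  # 3-d node features give the 10-d relation vector of the text
phi = torch.nn.Sequential(torch.nn.Linear(3 * D + 1, 64),
                          torch.nn.ReLU(), torch.nn.Linear(64, 1))
A = sparse_adjacency(torch.randn(8, D), phi, k=3)
```

Steps S43-S45 under the same reading; σ and Ω are reduced to single linear layers for brevity, and the rule that drops the last node at each level is a placeholder for the comprehensive-distance-metric elimination described in S45.

```python
import torch
import torch.nn as nn

class GraphLevel(nn.Module):
    """One level: propagate by A, update by W, then exchange global messages."""
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim)                 # learnable weight matrix W^l
        self.sigma = nn.Linear(2 * dim, dim)         # relation function sigma
        self.omega = nn.Sequential(nn.Linear(2 * dim, dim), nn.BatchNorm1d(dim))

    def forward(self, A: torch.Tensor, F: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.W(A @ F))                # local graph convolution h^l
        M = h.size(0)
        pairs = torch.cat([h[:, None].expand(M, M, -1),
                           h[None, :].expand(M, M, -1)], dim=2)
        r = self.sigma(pairs).mean(dim=1)            # aggregated node-pair messages
        return self.omega(torch.cat([h, r], dim=1))  # fused feature with global info

def global_descriptor(levels, A: torch.Tensor, F: torch.Tensor) -> torch.Tensor:
    """Run the level pyramid, dropping one node per level ('minus-one' style),
    and max-pool the per-level features into the global descriptor F_GCN."""
    outs = []
    for level in levels:
        F = level(A, F)
        outs.append(F.max(dim=0).values)
        keep = torch.arange(F.size(0) - 1)   # placeholder elimination rule
        A, F = A[keep][:, keep], F[keep]
    return torch.stack(outs).max(dim=0).values

levels = [GraphLevel(64) for _ in range(3)]
F_GCN = global_descriptor(levels, torch.eye(6), torch.randn(6, 64))
```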
Preferably, in step S5, the constructed teacher model and student model each comprise an image feature extraction module, a feature grouping and fusion module, and a graph convolution module;

The image feature extraction module extracts all single-view image features of the multi-view data set of the three-dimensional object and combines them into multi-view features; the feature grouping and fusion module, for each three-dimensional object, selects several features from the multi-view features as group representative features on the basis of the feature difference distance metric, divides each remaining feature into the group of its nearest group representative feature, and performs feature fusion on all features in each group to obtain the group fusion features; the graph convolution module performs the global transfer of local information to obtain graph convolution features with global information, and fuses all the graph convolution features into the three-dimensional feature descriptor.
Preferably, the complete multi-view data set in step S5 is a standard public data set, and the defective multi-view data set is the data set obtained from the complete multi-view data set by reducing the number of views and scrambling their order.
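Such a defective set might be derived from a complete view set as in the following sketch; the lower bound of three views is an illustrative assumption.

```python
import random

def make_defective_view_set(views: list, min_views: int = 3) -> list:
    """Simulate application-end input: keep a random subset of the complete
    view set and shuffle its order (view count and ordering both uncertain)."""
    n = random.randint(min_views, len(views))
    subset = random.sample(views, n)   # fewer views from arbitrary viewpoints
    random.shuffle(subset)             # unknown viewing order
    return subset
```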
Preferably, the teacher model is trained with the complete multi-view data set, and the trained teacher model then guides the training of the student model on the defective multi-view data set, as follows: the logits-layer outputs of the teacher model and the student model are denoted x_i and y_i respectively, and their feature difference is measured by the MSE (mean square error):

L_logits = (1/n) Σ_i (x_i - y_i)^2

while the differences between the prediction results and the true labels in the teacher and student networks are each expressed by a cross-entropy function:

L_CE = -Σ q log p

where N represents the number of all views, r represents the number of groups, N_i represents the number of view features of a certain group, s^2 represents the sample correction variance, and p and q represent the prediction result and the true label respectively.

These loss functions are added to measure the similarity and the prediction accuracy of the output layers of the teacher model and the student model; optimizing this sum makes the student model learn the generalization ability of the output layer of the teacher model:

L_out = L_logits + L_CE

The MSE distance between the corresponding intermediate-layer features of the teacher model and the student model is taken as a further part of the loss function, and this part L_hidden is minimized so that the function and structure of the two intermediate layers approach each other:

L_hidden = (1/(nM)) Σ_m Σ_i ||F_T^(m,i) - F_S^(m,i)||_2^2

where F_T^(m,i) and F_S^(m,i) are the corresponding intermediate-layer features of a given layer of the graph structure in the teacher model and the student model respectively, n is the total BatchSize and M is the total number of levels of the graph-structure features; the two parts are combined to form the total loss function required for knowledge distillation:

L_total = L_out + λ·L_hidden

When the difference between the teacher model and the student model is too large, the excess knowledge delivered by the teacher model is not conducive to a good training result; a temperature hyper-parameter T is therefore introduced to adjust the intermediate-layer feature difference and determine the amount of knowledge carried in the distillation, balancing the relationship between the teacher-student difference and the amount of transferred knowledge, smoothing the training curve and improving the training result. The hyper-parameter λ is set so that the order of magnitude of the logits-layer loss is close to that of the hidden-layer loss, and the total loss function is optimized until it converges, giving the trained student model.
Knowledge distillation is thus introduced into the field of multi-view three-dimensional recognition: the complete information learned by the trained teacher model is distilled into the student model, so that with only a few views the student model attains a three-dimensional recognition result close to that of the complete view set. A high-precision three-dimensional recognition model is obtained for arbitrary multi-view input, needing no information other than the views and remaining lightweight.
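A hedged sketch of the combined distillation loss described above; the default values of T and λ, and the equal weighting of the first two terms, are illustrative choices rather than values fixed by the patent.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_mid, teacher_mid,
                      labels, T: float = 4.0, lam: float = 0.1) -> torch.Tensor:
    """Cross-entropy on the student's predictions, MSE between temperature-scaled
    logits, and MSE between corresponding intermediate-layer feature levels."""
    ce = F.cross_entropy(student_logits, labels)
    logit_mse = F.mse_loss(student_logits / T, teacher_logits / T)
    hidden = sum(F.mse_loss(s, t) for s, t in zip(student_mid, teacher_mid))
    hidden = hidden / max(len(student_mid), 1)
    return ce + logit_mse + lam * hidden
```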
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides an application-end-oriented multi-view three-dimensional object recognition method, which utilizes the characteristic difference of multi-view characteristics to group, divides the characteristics possibly from similar visual angles into one group and fuses the characteristics into a plurality of group characteristics, converts the multi-view data with difference into similar intermediate layer characteristics, constructs a teacher model and a student model with the same structure as the teacher model based on the idea, and can realize the functions.
Drawings
Fig. 1 is a schematic flowchart of an application-oriented multi-view three-dimensional object recognition method according to embodiment 1 of the present invention;
fig. 2 is a diagram showing a structure of a teacher model or a student model proposed in embodiment 2 of the present invention;
fig. 3 is a diagram showing a process of processing multi-view data by a teacher model and a student model according to embodiment 2 of the present invention;
fig. 4 is a graph comparing the classification accuracy of the method of the present application with that of other methods on the ModelNet40 dataset under view uncertainty, as set forth in embodiment 3 of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for better illustration of the present embodiment, certain parts of the drawings may be omitted, enlarged or reduced, and do not represent actual dimensions;
it will be understood by those skilled in the art that certain descriptions of well-known structures in the drawings may be omitted.
The technical solution of the present invention is further described with reference to the drawings and the embodiments.
The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
example 1
As shown in fig. 1, the present embodiment provides an application-oriented multi-view three-dimensional object recognition method, and referring to fig. 1, the method includes the following steps:
s1, extracting all view features of a multi-view data set for each three-dimensional object, and selecting a plurality of features from the multi-view features as group representative features based on feature difference distance measurement;
s2, dividing the rest of characteristics except the group representative characteristics into groups where the group representative characteristics closest to the group representative characteristics are located, and performing characteristic fusion on all characteristics in each group to obtain a plurality of group fusion characteristics;
s3, inputting the group fusion characteristics into a graph convolution network, carrying out global transmission of local information to obtain graph convolution characteristics with global information, and fusing all the graph convolution characteristics into a three-dimensional characteristic descriptor;
s4, constructing a teacher model and a student model with the functions of the steps S1-S3, training the teacher model by using the complete multi-view data set, and guiding to train the student model by using the defect multi-view data set and the trained teacher model to obtain a trained student model;
and S5, inputting any multi-view data of the actual to-be-recognized three-dimensional object of the application end into the trained student model to obtain a three-dimensional recognition result.
The method provided by this embodiment groups the multi-view features by their feature differences, placing features likely to come from similar viewing angles into one group and fusing them into several group features, so that divergent multi-view data are converted into similar intermediate-layer features; the group features are finally fused through a graph convolution network into the object descriptor used for three-dimensional recognition. A teacher model realizing these functions and a student model of identical structure are constructed; by the knowledge-distillation method, the teacher model is trained with the complete multi-view data set and then guides the training of the student model, which simulates the real task and trains on arbitrary multi-view data from that task. Finally, for any multi-view input, the student model obtains a good three-dimensional recognition result.
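Putting the pieces together, application-end inference reduces to the following sketch, reusing the helpers sketched in the disclosure above; the interface in which the student returns its logits together with intermediate features is an assumption of the example.

```python
import torch

def recognize(student, views: torch.Tensor) -> int:
    """Application-end inference: any number of views, any order, views only."""
    student.eval()
    with torch.no_grad():
        logits, _ = student(views)      # assumed (logits, mid-features) interface
    return int(logits.argmax(dim=-1))   # predicted object class
```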
In step S1, all the angular views of all the three-dimensional objects form the multi-view data set, and all the angular views of each three-dimensional object are input to a pre-trained feature extraction network to obtain all the single-view image features of the multi-view data set.

For each three-dimensional object, all of its single-view image features are combined into a multi-view feature F, expressed as F = {f_1, f_2, ..., f_N}, where N represents the number of single-view image features of the three-dimensional object and f_i represents the i-th single-view image feature in the multi-view feature F, i = 1, 2, ..., N.

In step S2, for each three-dimensional object, a feature difference metric V between the single-view image features in its multi-view feature F is calculated, and the group representative features are selected as:

F_G = argmax(max(V(F; θ_i)))

where θ_i represents the feature parameters of each single-view image and V(·) denotes the feature difference metric, obtained by calculating the sum of squared differences of the corresponding features. The representatives are extracted by a recursive representative-view method: a feature f_i is first extracted at random; the feature f_j with the largest sum of feature difference metrics to the previously extracted features is then extracted; and these steps are repeated until M features F_G = {f_g1, f_g2, ..., f_gM} have been extracted, where M is the set number of groups, not greater than N.

In step S3, the distance between each group representative feature and each of the remaining features is calculated as:

d(f_gi, f_j) = ||f_gi - f_j||_2

and each remaining feature is divided into the group whose representative feature is at minimum distance from it, that is:

G_l = argmin(min(d(F_G; f_j))), l = 0, 1, 2, ..., M-1, j = 0, 1, 2, ..., N-1

finally yielding the feature groups G_l, where G_l denotes the M feature groups, each containing several features; by calculating the distances between the features in this way, the view features F are divided into M groups. A maximum pooling operation is performed on all features within each group to realize local feature fusion and obtain the group fusion features:

f_l = maxpool(G_l,1, G_l,2, ..., G_l,N_l)

where maxpool denotes the maximum pooling operation, N_l denotes the number of features contained in each group and G_l,i denotes the i-th feature of the l-th group; the fused set F_n = {f_1, f_2, ..., f_M} contains the M fused group features. When the distances between a feature and the group representative features are calculated, if the distances to several representatives are equal, the feature is simultaneously divided into the groups of those representatives. Grouping by the feature differences of the multi-view features converts violently changing multi-view data into similar intermediate-layer features, avoids the poor training effect caused by data variation, and has a certain positive effect on extracting highly discriminative three-dimensional object descriptors in practical applications.
The process of inputting the group fusion characteristics into the graph convolution network for global transmission of local information is as follows:
s41, group characteristics F n Group fusion feature f in (1) i As nodes of the graph structure, and obtaining an adjacency matrix S representing the neighborhood of the graph nodes through an intermediate layer comprising a plurality of layers of MLPs i,j :S i,j =φ(d ij ;θ s )
Wherein d is ij =[f i ,f j ,f i -f j ,||f i -f j || 2 ]∈R 10 Representing the spatial relationship between two group-fused features, phi (□) represents the inclusion of multiple layers of MLP, and vectorized fusion of elements in the group features, theta s Representing parameters representing the correspondence between the two sets of features;
s42, determining the group characteristics in the nearest neighbor range of each group characteristic through a KNN algorithm, only keeping the relevant edges of the K nearest neighbor group characteristics, and obtaining a sparse connection matrix A i,j :
A i,j =S i,j ·C{f ni ∈K(f nj )};
Wherein, C (-) represents the nearest neighbor operation for judging whether the group feature belongs to another group feature, and multiplication represents the sparsification of the original adjacent matrix;
s43, carrying out graph convolution on the graph structure to obtain graph convolution structure characteristics of the level
Wherein A is l A adjacency matrix representing the l-th layer diagram, F G l Is the original set of features of the class diagram, W l Is the learnable weight matrix, θ, of the layer map l For the parameters of the linear activation function, Ψ is a nonlinear transformation function when the original set of features F is input G l First by the adjacency matrix A l Propagating, and updating group feature nodes by linear transformation of a learnable weight matrix to obtain local graph convolution features of each levelIt is subject to global messaging:
wherein i, j =0,1,2.. M-1,the space relation between two nodes is represented by sigma, which is a relation function between the two nodes, and the practical meaning is that node pairs output by multiple layers of MLPs (multi-level MLPs) gather content messages, and each layer of MLPs comprises a plurality of convolution units and nonlinear activation;
s44, according to the newly obtained spatial relations of all node pairs, fusing the space relations into the convolution characteristics of the local graph to obtain the graph with global informationNew local graph convolution feature
Wherein omega is a single-layer MLP, and the new local graph convolution characteristics fused by global node pair information and the original group characteristics in the graph are output through batch normalization fusion characteristics;
s45, obtaining local graph convolution and carrying out global information transmission through the same mode for each level graph structure, gradually eliminating group features with minimum comprehensive distance measurement for each level graph structure based on feature difference distance measurement, and finally obtaining M graph convolution featuresThe local map convolution features of the M levels are fused together by a max-pooling method to form a global descriptor F representing the three-dimensional object GCN :The adjacent matrix is thinned, so that the operation efficiency of the graph structure can be improved. The model uses the distance measurement designed when selecting the group to represent the feature, and gradually eliminates the group feature with the minimum comprehensive distance measurement at each level, which can be called as 'minus one sampling method' (M at the first level, M-1 at the second level, and so on), and M image volume features can be obtained by feature sampling and local image volume processing method
Example 2
In this embodiment, the structures of the teacher model and the student model are shown in fig. 2. Referring to fig. 2, the constructed teacher model and student model each comprise an image feature extraction module, a feature grouping and fusion module, and a graph convolution module;

Specifically, the image feature extraction module extracts all single-view image features of the multi-view data set of the three-dimensional object and combines them into multi-view features; the feature grouping and fusion module, for each three-dimensional object, selects several features from the multi-view features as group representative features on the basis of the feature difference distance metric, divides each remaining feature into the group of its nearest representative, and fuses all features in each group to obtain the group fusion features; the graph convolution module performs the global transfer of local information to obtain graph convolution features with global information and fuses all of them into the three-dimensional feature descriptor. The teacher model and the student model can thus realize the functions of steps S1-S4; the processing procedure is shown in fig. 3. When the teacher model and the student model are trained, the complete multi-view data set used by the teacher model is a standard public data set, and the defective multi-view data set is the data set obtained from it by reducing the number of views and scrambling their order.
The teacher model is trained with the complete multi-view data set, and the trained teacher model guides the training of the student model on the defective multi-view data set, as follows: the logits-layer outputs of the teacher model and the student model are denoted x_i and y_i respectively, and their feature difference is measured by the MSE (mean square error):

L_logits = (1/n) Σ_i (x_i - y_i)^2

while the differences between the prediction results and the true labels in the teacher and student networks are each expressed by a cross-entropy function:

L_CE = -Σ q log p

where N represents the number of all views, r represents the number of groups, N_i represents the number of view features of a certain group, s^2 represents the sample correction variance, and p and q represent the prediction result and the true label respectively.

These loss functions are added to measure the similarity and the prediction accuracy of the output layers of the teacher model and the student model; optimizing this sum makes the student model learn the generalization ability of the output layer of the teacher model:

L_out = L_logits + L_CE

The MSE distance between the corresponding intermediate-layer features of the teacher model and the student model is taken as a further part of the loss function, and this part L_hidden is minimized so that the function and structure of the two intermediate layers approach each other:

L_hidden = (1/(nM)) Σ_m Σ_i ||F_T^(m,i) - F_S^(m,i)||_2^2

where F_T^(m,i) and F_S^(m,i) are the corresponding intermediate-layer features of a given layer of the graph structure in the teacher model and the student model respectively, n is the total BatchSize and M is the total number of levels of the graph-structure features; the two parts are combined to form the total loss function required for knowledge distillation:

L_total = L_out + λ·L_hidden

When the difference between the teacher model and the student model is too large, the excess knowledge delivered by the teacher model is not conducive to a good training result; a temperature hyper-parameter T is therefore introduced to adjust the intermediate-layer feature difference and determine the amount of knowledge carried in the distillation, balancing the relationship between the teacher-student difference and the amount of transferred knowledge, smoothing the training curve and improving the training result. The hyper-parameter λ is set so that the order of magnitude of the logits-layer loss is close to that of the hidden-layer loss, and the total loss function is optimized until it converges, giving the trained student model. The method introduces knowledge distillation into the field of multi-view three-dimensional recognition, distilling the complete information learned by the trained teacher model into the student model, so that with only a few views the student model attains a three-dimensional recognition result close to that of the complete view set; a high-precision three-dimensional recognition model is obtained for arbitrary multi-view input, needing no information other than the views and remaining lightweight.
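One guided training step might then look as follows, reusing a combined loss like the sketch in the disclosure above; the interface in which each model returns its logits together with its intermediate-layer features is an assumption of the example.

```python
import torch

def train_student_step(teacher, student, full_views, defect_views, labels,
                       optimizer, loss_fn) -> float:
    """One step: the frozen teacher sees the complete view set, the student sees
    the defective one, and the student is optimized against the combined loss."""
    teacher.eval()
    with torch.no_grad():
        t_logits, t_mid = teacher(full_views)   # assumed (logits, mid-features)
    s_logits, s_mid = student(defect_views)
    loss = loss_fn(s_logits, t_logits, s_mid, t_mid, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```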
Example 3
In this embodiment, the ModelNet40 and ModelNet10 data sets are used to evaluate the method of the present invention, and the effect is further illustrated by the following simulation experiments.

The ModelNet40 and ModelNet10 datasets used in the experiments are multi-view datasets of three-dimensional objects. The ModelNet40 multi-view dataset holds multi-view data (12 or 20 views) from 12311 three-dimensional objects in 40 classes; the dataset is split as follows: 9843 objects form the training set and 2468 objects form the test set, and when ModelNet40 is tested, the three-dimensional recognition results of 20 views and of 12 views are tested separately. The ModelNet10 dataset is much smaller than ModelNet40: it holds multi-view data (12 or 20 views) from 4899 three-dimensional objects in 10 classes, of which 3991 objects serve as the training set and 908 as the test set; the invention likewise tests the three-dimensional recognition effect with 20 views and with 12 views.
The comparison methods include multi-view three-dimensional recognition methods such as MVCNN, GVCNN, MHBN, MLVCNN, RotationNet, View-GCN and CAR-Net. The main comparison indices are the classification and retrieval accuracy in three-dimensional recognition. Classification accuracy is the ratio of correctly classified samples to the total number of samples; for retrieval, the mAP is computed by ranking the L_2 distances between features, taking the three-dimensional objects at the smallest distances as the prediction result, and finally calculating the mean average retrieval precision. The comparison results are shown in Table 1.
TABLE 1
As can be seen from Table 1, the method achieves a good three-dimensional recognition result when the multi-view data are input completely. Table 2 shows the model sizes of the method of the present application, the MVCNN method and View-GCN.
TABLE 2

Method | The method of the present invention | MVCNN | View-GCN
Model size | 63.76 MB | 491.84 MB | 129.48 MB
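For reference, the two evaluation indices described above can be sketched as follows; the leave-one-out query protocol in the mAP computation is an assumption of the example.

```python
import numpy as np

def classification_accuracy(pred: np.ndarray, labels: np.ndarray) -> float:
    """Ratio of correctly classified samples to the total number of samples."""
    return float((pred == labels).mean())

def retrieval_map(descriptors: np.ndarray, labels: np.ndarray) -> float:
    """Mean average precision with L2-distance ranking over object descriptors."""
    aps = []
    for i in range(len(descriptors)):
        d = np.linalg.norm(descriptors - descriptors[i], axis=1)
        order = np.argsort(d)[1:]                 # exclude the query itself
        rel = (labels[order] == labels[i]).astype(float)
        if rel.sum() == 0:
            continue
        prec = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append(float((prec * rel).sum() / rel.sum()))
    return float(np.mean(aps))
```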
Assume the training images are numbered 1-20 in order; under out-of-order input, the input multi-view data may take forms such as 8 views whose order and source are 13, 7, 2, 14, 3, 6, 7, 9. In fig. 4 the abscissa is the number of views and the ordinate is the classification accuracy. As can be seen from fig. 4, the model still obtains a good three-dimensional object recognition result when the views are out of order and few in number, verifying the effectiveness of the method provided by the present invention.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.
Claims (10)
1. An application-end-oriented multi-view three-dimensional object identification method is characterized by comprising the following steps:
s1, extracting all view features of a multi-view data set for each three-dimensional object, and selecting a plurality of features from the multi-view features as group representative features based on feature difference distance measurement;
s2, dividing the rest of characteristics except the group representative characteristics into groups where the group representative characteristics closest to the group representative characteristics are located, and performing characteristic fusion on all characteristics in each group to obtain a plurality of group fusion characteristics;
s3, inputting the group fusion characteristics into a graph convolution network, carrying out global transmission of local information to obtain graph convolution characteristics with global information, and fusing all the graph convolution characteristics into a three-dimensional characteristic descriptor;
s4, constructing a teacher model and a student model with the functions of the steps S1-S3, training the teacher model by using the complete multi-view data set, guiding and training the student model by using the defect multi-view data set and the trained teacher model, and obtaining a trained student model;
and S5, inputting any multi-view data of the actual to-be-recognized three-dimensional object of the application end into the trained student model to obtain a three-dimensional recognition result.
2. The application-oriented multi-view three-dimensional object recognition method of claim 1, wherein in step S1, all the angular views of all the three-dimensional objects form the multi-view data set, and all the angular views of each three-dimensional object are input to a pre-trained feature extraction network to obtain all the single-view image features of the multi-view data set.
3. The application-oriented multi-view three-dimensional object recognition method of claim 2, wherein for each three-dimensional object, all of its single-view image features are combined into a multi-view feature F, expressed as F = {f_1, f_2, ..., f_N}, where N represents the number of single-view image features contained in the three-dimensional object and f_i represents the i-th single-view image feature in the multi-view feature F, i = 1, 2, ..., N.
4. The method for identifying multi-view three-dimensional objects facing the application end according to claim 1, wherein in step S2, for each three-dimensional object, a feature difference metric V between the single-view image features in its multi-view feature F is calculated, and the group representative features are selected as:

F_G = argmax(max(V(F; θ_i)))

where θ_i represents the feature parameters of each single-view image and V(·) denotes the feature difference metric, obtained by calculating the sum of squared differences of the corresponding features; the representatives are extracted by a recursive representative-view method: a feature f_i is first extracted at random, the feature f_j with the largest sum of feature difference metrics to the previously extracted features is then extracted, and these steps are repeated until M features F_G = {f_g1, f_g2, ..., f_gM} have been extracted, where M is the set number of groups, not greater than N.
5. the method for identifying a multi-view three-dimensional object facing an application end according to claim 4, wherein in step S3, the distances between the group representative feature and the rest of the features are calculated, and the expression is as follows:
d(f gi -f j )=||f gi -f j || 2
and (3) classifying the rest characteristics into the group with the minimum distance from the characteristics to represent the characteristics, namely satisfying the following conditions:
G l =argmin(min(d(F G ;f j )),l=0,1,2...M-1,j=0,1,2,..N-1
wherein,finally, the characteristic group G is obtained l ,G l A characteristic group containing a plurality of characteristics is set as the M groups; performing maximum pooling operation on all the features in each group to realize local feature fusion and obtain a plurality of group fusion features, wherein the expression is as follows:wherein maxpool represents the maximum pooling operation, N l Indicates the number of features, G, contained in each group l,i The feature F obtained by fusing the ith feature of the ith n ={f 1 ,f 2 ,...f M Contains M fused set features.
6. The method for identifying a multi-view three-dimensional object facing the application end according to claim 4, wherein when the distances between a feature and the group representative features are calculated, if the distances to several group representative features are equal, the feature is simultaneously divided into the groups in which those group representative features are located.
7. The application-oriented multi-view three-dimensional object recognition method of claim 5, wherein the process of inputting the group fusion features into the graph convolution network for global transmission of local information comprises:
s41, group characteristics F n Group fusion feature f in (1) i As nodes of the graph structure, and obtaining an adjacency matrix S representing the neighborhood of the graph nodes through an intermediate layer comprising a plurality of layers of MLPs i,j :S i,j =φ(d ij ;θ s )
Wherein d is ij =[f i ,f j ,f i -f j ,||f i -f j || 2 ]∈R 10 Representing the spatial relationship between the two group-fused features,the representation comprises multiple layers of MLPs and vectorized fusion is carried out on elements in the group characteristics, theta s A representing parameter representing a correspondence between the two sets of features;
s42, determining the group characteristics in the nearest neighbor range of each group characteristic through a KNN algorithm, only keeping the relevant edges of the K nearest neighbor group characteristics, and obtaining a sparse connection matrix A i,j :
A i,j =S i,j ·C{f ni ∈K(f nj )};
Wherein, C (-) represents the nearest neighbor operation for judging whether the group feature belongs to another group feature, and multiplication represents the sparsification of the original adjacent matrix;
s43, carrying out graph convolution on the graph structure to obtain graph convolution structure characteristics of the level
Wherein A is l A adjacency matrix representing the l-th layer diagram, F G l Is the original set of features of the class diagram, W l Is the learnable weight matrix, θ, of the layer map l For the parameters of the linear activation function, Ψ is a nonlinear transformation function when the original set of features F is input G l First by the adjacency matrix A l Propagating, and updating group feature nodes by linear transformation of a learnable weight matrix to obtain local graph convolution features of each levelIt is subject to global messaging:
wherein i, j =0,1,2.. M-1,the space relation between two nodes is represented by sigma, which is a relation function between the two nodes, and the practical meaning is that node pairs output by multiple layers of MLPs (multi-level MLPs) gather content messages, and each layer of MLPs comprises a plurality of convolution units and nonlinear activation;
s44, according to the newly obtained spatial relationship of all node pairs, fusing the spatial relationship with the local graph convolution characteristics to obtain new local graph convolution characteristics with global information
Wherein omega is a single-layer MLP, and the new local graph convolution characteristics fused by global node pair information and the original group characteristics in the graph are output through batch normalization fusion characteristics;
s45, obtaining local graph convolution and carrying out global information transmission through the same mode for each level graph structure, gradually eliminating group features with minimum comprehensive distance measurement for each level graph structure based on feature difference distance measurement, and finally obtaining M graph convolution featuresThe local map convolution features of the M levels are fused together by a max-pooling method to form a global descriptor F representing the three-dimensional object GCN :
8. The method for identifying a multi-view three-dimensional object facing the application end according to claim 1, wherein in step S5, the constructed teacher model and student model each comprise an image feature extraction module, a feature grouping and fusion module, and a graph convolution module;

the image feature extraction module extracts all single-view image features of the multi-view data set of the three-dimensional object and combines them into multi-view features; the feature grouping and fusion module, for each three-dimensional object, selects several features from the multi-view features as group representative features on the basis of the feature difference distance metric, divides each remaining feature into the group of its nearest group representative feature, and performs feature fusion on all features in each group to obtain the group fusion features; the graph convolution module performs the global transfer of local information to obtain graph convolution features with global information and fuses all the graph convolution features into the three-dimensional feature descriptor.
9. The application-oriented multi-view three-dimensional object identification method according to claim 8, wherein the complete multi-view data set in step S5 is a standard public data set, and the defective multi-view data set is obtained from the complete multi-view data set by reducing the number of views and shuffling the view order.
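A minimal sketch of constructing such a defective set from a complete one (plain Python; the function name and the `keep` parameter are hypothetical):

```python
import random

def make_defective_views(views: list, keep: int, seed=None) -> list:
    """Sketch: build the defective multi-view sample by dropping views
    down to `keep` and disordering the remaining sequence."""
    rng = random.Random(seed)
    kept = rng.sample(views, keep)  # reduce the number of views
    rng.shuffle(kept)               # shuffle the view order
    return kept
```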
10. The application-oriented multi-view three-dimensional object identification method according to claim 9, wherein the teacher model is trained with the complete multi-view data set, and the student model is trained with the defective multi-view data set under the guidance of the teacher model, as follows: the logits-layer outputs of the teacher model and the student model are extracted and denoted x_i and y_i respectively; their feature difference is measured by the MSE mean squared error, and the cross-entropy function expresses the difference between the prediction result and the true label in the teacher-student network:

\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - y_i)^2, \qquad \mathcal{L}_{\mathrm{CE}} = -\sum_{i} p_i \log q_i

wherein N denotes the number of all views, r the number of groups, and N_i the number of view features in a given group; the factor 1/(N-1) reflects the sample-corrected variance, and p and q denote the prediction result and the true label, respectively;
the above loss functions are added together to measure both the similarity and the prediction accuracy of the output layers of the teacher and student models, and are optimized so that the student model learns the generalization ability of the teacher model's output layer:

\mathcal{L}_{\mathrm{logits}} = \mathcal{L}_{\mathrm{MSE}} + \mathcal{L}_{\mathrm{CE}}
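A minimal sketch of this output-layer loss (PyTorch assumed; the default mean reductions are assumptions):

```python
import torch
import torch.nn.functional as F

def logits_distillation_loss(teacher_logits: torch.Tensor,
                             student_logits: torch.Tensor,
                             labels: torch.Tensor) -> torch.Tensor:
    """Sketch of L_logits: MSE between teacher and student logits plus
    cross-entropy between the student's prediction and the true label."""
    mse = F.mse_loss(student_logits, teacher_logits)
    ce = F.cross_entropy(student_logits, labels)
    return mse + ce
```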
the MSE distance between the corresponding intermediate-layer features of the teacher model and the student model is minimized as the other part of the loss function, \mathcal{L}_{\mathrm{hidden}}, so that the function and structure of the two intermediate layers are brought close to each other:

\mathcal{L}_{\mathrm{hidden}} = \frac{1}{nM} \sum_{l=1}^{M} \big\| F_{T}^{l} - F_{S}^{l} \big\|_{2}^{2}

wherein F_T^l and F_S^l denote the corresponding intermediate-layer features of a given graph-structure level in the teacher model and the student model, n is the total BatchSize, and M is the total number of graph-structure levels; the two parts are combined to form the total loss function required by the student model in knowledge distillation:

\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{logits}} + \lambda \, \mathcal{L}_{\mathrm{hidden}}
When the gap between the teacher model and the student model is too large, the excessive knowledge supplied by the teacher model is not beneficial to a good training effect; a temperature hyper-parameter T is therefore introduced to adjust the intermediate-layer feature difference and to set the amount of knowledge carried in the knowledge distillation, balancing the relation between the teacher-student gap and the amount of transferred knowledge, smoothing the training curve and improving the training effect; and a hyper-parameter \lambda is set so that the magnitude of the logits-layer loss is close to that of the hidden-layer loss, and the total loss function is optimized until it converges, yielding the trained student model.
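A sketch of the combined objective, with the temperature T applied to the intermediate-layer feature gap as the claim describes and \lambda balancing the two terms (the default values of T and \lambda are illustrative only, not from the patent):

```python
import torch
import torch.nn.functional as F

def total_distillation_loss(t_logits, s_logits, labels,
                            t_hidden, s_hidden,
                            T: float = 4.0, lam: float = 0.1) -> torch.Tensor:
    """Sketch of L_total = L_logits + lambda * L_hidden, where the
    temperature T softens the intermediate-feature difference."""
    logits_loss = F.mse_loss(s_logits, t_logits) + F.cross_entropy(s_logits, labels)
    # Temperature-scaled MSE over the M corresponding graph-level features
    hidden_loss = sum(F.mse_loss(s / T, t / T)
                      for s, t in zip(s_hidden, t_hidden)) / len(t_hidden)
    return logits_loss + lam * hidden_loss
```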
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211102704.0A CN115601745A (en) | 2022-09-09 | 2022-09-09 | Multi-view three-dimensional object identification method facing application end |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115601745A true CN115601745A (en) | 2023-01-13 |
Family
ID=84843876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211102704.0A Pending CN115601745A (en) | 2022-09-09 | 2022-09-09 | Multi-view three-dimensional object identification method facing application end |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115601745A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117688504A (en) * | 2024-02-04 | 2024-03-12 | 西华大学 | Internet of things abnormality detection method and device based on graph structure learning |
CN117688504B (en) * | 2024-02-04 | 2024-04-16 | 西华大学 | Internet of things abnormality detection method and device based on graph structure learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |