CN115601745A - Multi-view three-dimensional object identification method facing application end

Multi-view three-dimensional object identification method facing application end

Info

Publication number
CN115601745A
Authority
CN
China
Prior art keywords
features
view
group
feature
dimensional object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211102704.0A
Other languages
Chinese (zh)
Inventor
黄思帆
曹江中
戴青云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202211102704.0A
Publication of CN115601745A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion of extracted features
    • G06V 10/82 Arrangements using neural networks
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/64 Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an application-end-oriented multi-view three-dimensional object recognition method and relates to the technical field of three-dimensional object recognition. The method first groups the multi-view features according to their feature differences, so that features likely to come from similar viewing angles fall into the same group and are fused into several group features; differing multi-view data are thereby converted into similar intermediate-layer features. A teacher model is then trained with the complete view set, and knowledge distillation is used so that the teacher model trained on the complete multi-view data set guides the training of a student model, giving the student model the ability to cope with few views and uncertain viewing angles. In a practical application task, the student model obtains a good three-dimensional recognition result for arbitrary multi-view input, needs only the view information, and is lightweight. The invention helps to solve the degradation of the three-dimensional recognition effect caused by the small number of views and insufficient information in practical applications.

Description

Multi-view three-dimensional object identification method facing application end
Technical Field
The invention relates to the technical field of three-dimensional object identification, in particular to a multi-view three-dimensional object identification method facing an application end.
Background
In recent years, with the rapid development of fields such as intelligent robotics, autonomous driving, virtual reality and medical imaging, three-dimensional object recognition has become a new research focus. In the deep learning era, various deep neural networks have been widely applied to three-dimensional object recognition; among the many approaches, multi-view-based methods have received the most attention because their data are easy to acquire and convenient to process. After a CNN model is pre-trained on a large-scale data set such as ImageNet, multi-view-based three-dimensional object recognition methods lead in recognition accuracy and have become the mainstream approach.
MVCNN (Multi-view CNN) combines multiple two-dimensional projection features learned by a convolutional neural network (CNN) in an end-to-end trainable manner; it has become a milestone of three-dimensional shape recognition and achieved the best performance of its time. Since the appearance of MVCNN, many multi-view three-dimensional recognition methods have been proposed. This research mainly focuses on efficient feature fusion or on reducing information redundancy to improve the three-dimensional recognition accuracy; however, an important factor affecting the recognition effect is often ignored by researchers: the reliability of the data set. At present, a multi-view data set is mainly obtained from known three-dimensional objects by rendering single views in turn from a number of preset viewpoints according to a fixed rule. In real scenes, however, because of occlusion, uncertain viewpoint positions and other factors, the obtained multi-view data are often far from this ideal situation. At the application end, owing to the limitations of the equipment and the requirements of the specific scene, the multi-view data frequently suffer from a small number of views and uncertain viewing angles, and these conditions can seriously degrade the three-dimensional recognition accuracy.
Many existing methods have tried to solve the above problems, and although some results have been achieved, each still has its own drawbacks. For example, MVCNN can fuse the features of an arbitrary set of views by max pooling; a prior-art method for three-dimensional model recognition based on shared visual saliency first obtains the three-dimensional model to be retrieved, derives a two-dimensional view sequence from it and extracts the visual feature vectors of that view sequence, then feeds the visual features into an MVCNN branch and a visual-saliency branch and fuses the complex features of the MVCNN branch with the visual-saliency features to form fused features, and finally retrieves or classifies the three-dimensional model with the fused features; however, a large amount of information is lost and the effect is poor. The RotationNet method requires the camera viewpoint information to complete three-dimensional recognition, which is far removed from practical requirements. Therefore, in practical applications a three-dimensional recognition method must satisfy two requirements: 1. a good result can be obtained with arbitrary multi-view data; 2. no information other than the view data is needed. At the application end, both requirements must be met simultaneously.
Disclosure of Invention
In order to solve the problems of the small number of views and the uncertain viewing angles of the object to be recognized in practical three-dimensional object recognition scenarios, the invention provides an application-end-oriented multi-view three-dimensional object recognition method. The method alleviates the poor machine learning training effect caused by the data input differences of real scenes, requires only view information, and is lightweight.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
An application-end-oriented multi-view three-dimensional object recognition method comprises the following steps:
S1. For each three-dimensional object, extract all view features of the multi-view data set and, based on a feature difference distance metric, select several features from the multi-view features as group representative features;
S2. Assign each remaining feature, other than the group representative features, to the group of the representative feature closest to it, and fuse all features within each group to obtain several group fusion features;
S3. Feed the group fusion features into a graph convolution network, propagate the local information globally to obtain graph convolution features carrying global information, and fuse all graph convolution features into a three-dimensional feature descriptor;
S4. Construct a teacher model and a student model that both implement the functions of steps S1 to S3, train the teacher model with the complete multi-view data set, and use the trained teacher model to guide the training of the student model on the defective multi-view data set, obtaining a trained student model;
S5. Input arbitrary multi-view data of the actual three-dimensional object to be recognized at the application end into the trained student model to obtain the three-dimensional recognition result.
This technical scheme groups the multi-view features by their feature differences, so that features likely to come from similar viewing angles fall into one group and are fused into several group features; differing multi-view data are thereby converted into similar intermediate-layer features, which alleviates the poor machine learning training effect caused by data input differences. The group features are then fused by a graph convolution network to form the object descriptor used for three-dimensional recognition. To address the poor recognition effect caused by a small number of views and insufficient information, knowledge distillation is used: a teacher model trained with the complete multi-view data guides the training of a student model, and the student model simulates the real task and is trained with arbitrary multi-view data from that task. Finally, the student model obtains a good three-dimensional recognition effect for arbitrary multi-view input; the model needs only the view information and is lightweight.
Preferably, in step S1, the views of all three-dimensional objects from all angles form the multi-view data set, and the views of each three-dimensional object from all angles are input to a pre-trained feature extraction network to obtain all single-view image features of the multi-view data set.
Preferably, for each three-dimensional object, all of its single-view image features are combined into a multi-view feature F, expressed as F = {f_1, f_2, ..., f_N}, where N is the number of single-view image features of the three-dimensional object and f_i is the i-th single-view image feature in F, i = 1, 2, ..., N.
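For illustration only (not part of the claimed method), a minimal PyTorch sketch of this step is given below; the ResNet-18 backbone pre-trained on ImageNet, the 512-dimensional feature and the input resolution are assumptions, since the description does not fix a specific feature extraction network.

```python
# Minimal sketch, assuming an ImageNet pre-trained ResNet-18 as the single-view
# feature extractor (the patent does not prescribe a specific backbone).
import torch
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()          # keep the 512-d pooled feature per view
backbone.eval()

@torch.no_grad()
def extract_multiview_feature(views: torch.Tensor) -> torch.Tensor:
    """views: (N, 3, H, W) single-view images of one 3D object.
    Returns F = {f_1, ..., f_N} as an (N, 512) feature matrix."""
    return backbone(views)

F = extract_multiview_feature(torch.randn(12, 3, 224, 224))   # e.g. 12 rendered views
print(F.shape)                                                 # torch.Size([12, 512])
```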
Preferably, in step S2, for each three-dimensional object, a feature difference metric V(F; θ_i) between the single-view image features in its multi-view feature F is calculated, and the group representative features are selected as

F_G = argmax(max(V(F; θ_i)))

where θ_i denotes the feature parameters of each single-view image and V(·) is the feature difference metric obtained by computing the sum of squared differences of the corresponding features. The recursive representative-view extraction proceeds as follows: a feature f_i is first drawn at random; then the feature f_j whose sum of feature difference metrics with respect to the previously extracted features is largest is extracted; these steps are repeated until M features F_G = {f_g1, f_g2, ..., f_gM} have been extracted as the group representative features of the groups, where M is the set number of groups, M is not greater than N, and F_G is a subset of the multi-view feature F.
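A minimal sketch (illustrative only) of the recursive representative-view selection described above, using the sum of squared feature differences as the metric. The tensor shapes follow the previous sketch; the handling of already-selected indices is an implementation assumption.

```python
import torch

def select_group_representatives(F: torch.Tensor, M: int) -> torch.Tensor:
    """F: (N, D) multi-view features. Returns the indices of M group representative
    features F_G, chosen greedily: the first feature is drawn at random, then the
    feature with the largest sum of squared differences to the already selected
    features is added until M representatives are found."""
    N = F.shape[0]
    selected = [torch.randint(N, (1,)).item()]             # f_i drawn at random
    while len(selected) < M:
        diffs = ((F.unsqueeze(1) - F[selected].unsqueeze(0)) ** 2).sum(dim=(1, 2))
        diffs[selected] = -1.0                              # never re-pick a representative
        selected.append(int(diffs.argmax()))                # f_j with the largest metric sum
    return torch.tensor(selected)

rep_idx = select_group_representatives(F, M=4)              # F from the previous sketch
```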
Preferably, in step S3, the distance between each group representative feature and each of the remaining features is calculated as

d(f_gi, f_j) = ||f_gi - f_j||_2

and each remaining feature is assigned to the group of the representative feature at the smallest distance from it, i.e.

G_l = argmin(min(d(F_G; f_j))), l = 0, 1, 2, ..., M-1, j = 0, 1, 2, ..., N-1

which finally yields the feature groups G_l, the M groups each containing several features. Max pooling is then applied to all features within each group to realize local feature fusion and obtain the group fusion features:

f_l = maxpool(G_{l,i}), i = 1, 2, ..., N_l

where maxpool denotes the max pooling operation, N_l is the number of features contained in group l, and G_{l,i} is the i-th feature of group l; the resulting fused feature set F_n = {f_1, f_2, ..., f_M} contains the M group fusion features.
Grouping by the feature differences of the multi-view features converts strongly varying multi-view data into similar intermediate-layer features, avoids the poor training effect caused by data variation, and has a positive effect on extracting highly discriminative three-dimensional object descriptors in practical applications.
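A sketch of the grouping and local fusion step under the same assumptions (F and rep_idx come from the sketches above). Remaining features are assigned to the nearest representative by L2 distance and each group is max-pooled; ties are assigned to a single group here for simplicity, whereas the description allows assigning them to several groups.

```python
import torch

def group_and_fuse(F: torch.Tensor, rep_idx: torch.Tensor) -> torch.Tensor:
    """Assign every feature to the closest group representative (L2 distance) and
    max-pool each group into one group fusion feature. Returns F_n of shape (M, D)."""
    reps = F[rep_idx]                                # (M, D) group representative features
    dist = torch.cdist(F, reps)                      # d(f_j, f_gi) = ||f_j - f_gi||_2
    assign = dist.argmin(dim=1)                      # group index of every feature
    fused = []
    for l in range(reps.shape[0]):
        members = F[assign == l]
        if members.numel() == 0:                     # empty group: fall back to its representative
            members = reps[l:l + 1]
        fused.append(members.max(dim=0).values)      # max pooling over the group
    return torch.stack(fused)                        # F_n = {f_1, ..., f_M}

group_features = group_and_fuse(F, rep_idx)
```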
Preferably, when the distances between a remaining feature and the group representative features are calculated, if the distances to several group representative features are equal, the feature is simultaneously assigned to the groups of those representative features.
Preferably, the process of feeding the group fusion features into the graph convolution network and globally propagating the local information is as follows:

S41. Take the group fusion features f_i in the group feature set F_n as the nodes of a graph structure and, through an intermediate layer consisting of several MLP layers, obtain the adjacency matrix S_{i,j} describing the neighborhood of the graph nodes:

S_{i,j} = φ(d_{ij}; θ_s)

where d_{ij} = [f_i, f_j, f_i - f_j, ||f_i - f_j||_2] ∈ R^10 represents the spatial relationship between two group fusion features, φ(·) denotes the multi-layer MLP together with the vectorized fusion of the elements of the group features, and θ_s is the parameter representing the correspondence between the two group features;

S42. Determine, with a KNN algorithm, the group features within the nearest-neighbor range of each group feature, keep only the edges of the K nearest-neighbor group features, and obtain the sparse connection matrix A_{i,j}:

A_{i,j} = S_{i,j} · C{f_ni ∈ K(f_nj)}

where C(·) is the nearest-neighbor operation that judges whether a group feature belongs to the neighborhood of another group feature, and the multiplication sparsifies the original adjacency matrix;

S43. Apply graph convolution to the graph structure to obtain the graph convolution feature F_local^l of this level:

F_local^l = Ψ(A^l F_G^l W^l; θ^l)

where A^l is the adjacency matrix of the l-th level graph, F_G^l is the original group feature set of that level, W^l is the learnable weight matrix of that level, θ^l are the parameters of the linear activation function, and Ψ is a nonlinear transformation function; when the original group features F_G^l are input, they are first propagated through the adjacency matrix A^l, and the group feature nodes are then updated by the linear transformation of the learnable weight matrix, giving the local graph convolution feature F_local^l of each level, on which global message passing is performed:

r_{i,j} = σ(F_local^l(i), F_local^l(j)), i, j = 0, 1, 2, ..., M-1

where r_{i,j} represents the spatial relation between two nodes and σ is the relation function between the two nodes; in practice, node pairs output by the multi-layer MLPs aggregate the content messages, and each MLP layer contains several convolution units and a nonlinear activation;

S44. According to the newly obtained spatial relations of all node pairs, fuse them into the local graph convolution features to obtain new local graph convolution features F_new^l carrying global information:

F_new^l = Ω(F_local^l, r)

where Ω is a single-layer MLP; the fused features are output through batch normalization as the new local graph convolution features obtained by fusing the global node-pair information with the original group features of the graph;

S45. Obtain the local graph convolution and perform global information transfer in the same way for every level of the graph structure; based on the feature difference distance metric, the group feature with the smallest comprehensive distance metric is gradually eliminated at each level, and M graph convolution features F_new^1, F_new^2, ..., F_new^M are finally obtained. The local graph convolution features of the M levels are fused together by max pooling to form the global descriptor F_GCN representing the three-dimensional object:

F_GCN = maxpool(F_new^1, F_new^2, ..., F_new^M)
Sparsifying the adjacency matrix in this way improves the computational efficiency of the graph structure.
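A condensed, illustrative sketch of one level of the graph module follows: an MLP-predicted adjacency S, KNN sparsification into A, a graph convolution update, and a simplified global message in which a mean over nodes stands in for the pairwise relation function σ. The layer sizes, the single-level simplification and the mean-based aggregation are assumptions, not the patented formulation.

```python
import torch
import torch.nn as nn

class GroupGraphLayer(nn.Module):
    """One level of the graph module: MLP-predicted adjacency, KNN sparsification,
    graph convolution, then fusion of a (simplified) global message."""
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.k = k
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim + 1, 64), nn.ReLU(), nn.Linear(64, 1))
        self.weight = nn.Linear(dim, dim)            # learnable weight matrix W^l
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.BatchNorm1d(dim), nn.ReLU())

    def forward(self, Fg: torch.Tensor) -> torch.Tensor:   # Fg: (M, dim) group fusion features
        M = Fg.shape[0]
        fi = Fg.unsqueeze(1).expand(M, M, -1)
        fj = Fg.unsqueeze(0).expand(M, M, -1)
        d = torch.cat([fi, fj, fi - fj, (fi - fj).pow(2).sum(-1, keepdim=True)], dim=-1)
        S = self.edge_mlp(d).squeeze(-1)                     # dense adjacency S_ij
        knn = torch.cdist(Fg, Fg).topk(min(self.k, M), largest=False).indices
        mask = torch.zeros(M, M).scatter_(1, knn, 1.0)       # keep only the K nearest neighbors
        A = S * mask                                         # sparsified adjacency A_ij
        local = torch.relu(self.weight(A @ Fg))              # graph convolution of this level
        glob = local.mean(dim=0, keepdim=True).expand(M, -1) # simplified global node message
        return self.fuse(torch.cat([local, glob], dim=-1))   # local features enriched with global info

descriptor = GroupGraphLayer(dim=512)(group_features).max(dim=0).values   # F_GCN via max pooling
```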
Preferably, in step S5, the constructed teacher model and the constructed student model each comprise an image feature extraction module, a feature grouping and fusion module, and a graph convolution module;
the image feature extraction module extracts all single-view image features of the multi-view data set of a three-dimensional object and combines them into the multi-view feature; for each three-dimensional object, the feature grouping and fusion module selects several features from the multi-view features as group representative features based on the feature difference distance metric, assigns each remaining feature, other than the group representative features, to the group of the representative feature closest to it, and fuses all features within each group into several group fusion features; the graph convolution module propagates the local information globally to obtain graph convolution features carrying global information and fuses all graph convolution features into the three-dimensional feature descriptor.
Preferably, the complete multi-view data set in step S5 is a standard public data set, and the defective multi-view data set is obtained from the complete multi-view data set by reducing the number of views and scrambling their order.
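A minimal sketch of how such a defective multi-view input could be simulated from a complete view set; drawing the number of retained views uniformly at random is an assumption, since the description only states that views are removed and reordered.

```python
import torch

def make_defective_views(views: torch.Tensor, min_views: int = 2) -> torch.Tensor:
    """views: (N, 3, H, W) complete view set of one object. Returns a randomly
    reduced and order-scrambled subset, simulating the few-view, uncertain-viewpoint
    input seen at the application end."""
    N = views.shape[0]
    n_keep = torch.randint(min_views, N + 1, (1,)).item()   # how many views survive
    keep = torch.randperm(N)[:n_keep]                        # random subset in random order
    return views[keep]

defective = make_defective_views(torch.randn(20, 3, 224, 224))
```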
Preferably, the teacher model is trained with the complete multi-view data set, and the process of training the student model with the defective multi-view data set under the guidance of the teacher model is as follows: the logits-layer outputs of the teacher model and the student model are denoted x_i and y_i respectively; the feature difference between them is measured with the MSE (mean squared error), and the differences between the prediction results and the real labels in the teacher and student networks are expressed with cross-entropy functions, where N denotes the number of all views, r denotes the number of groups, N_i denotes the number of view features of a group, a sample correction variance enters the expression, and p and q denote the prediction result and the real label respectively.

These loss terms are added to measure the similarity and the prediction accuracy of the output layers of the teacher model and the student model; optimizing the sum, denoted L_logits, makes the student model learn the generalization ability of the teacher model's output layer.

The MSE distance between the intermediate-layer features of the teacher model and the corresponding intermediate-layer features of the student model is taken as another part of the loss function, denoted L_hidden; minimizing it makes the functions and structures of the two intermediate layers approach each other:

L_hidden = (1/(n·M)) Σ_{b=1}^{n} Σ_{l=1}^{M} || F_T^l - F_S^l ||^2

where F_T^l and F_S^l are the intermediate-layer features of a given level of the graph structure in the teacher model and the student model respectively, n is the total batch size (BatchSize), and M is the total number of levels of the graph-structure features. The two parts are combined into the total loss function required for knowledge distillation:

L_total = L_logits + λ · L_hidden
When the gap between the teacher model and the student model is too large, feeding the student too much knowledge from the teacher is not conducive to a good training effect. A temperature hyper-parameter T is therefore introduced to adjust the intermediate-layer feature difference and determine how much knowledge the distillation carries; setting it balances the relation between the teacher-student gap and the amount of transferred knowledge, smooths the training curve and improves the training effect. A hyper-parameter λ is set so that the magnitude of the logits-layer loss is close to that of the hidden-layer loss, and the total loss function is optimized until it converges, giving the trained student model.
Knowledge distillation is introduced into the field of multi-view three-dimensional recognition, and the complete information learned by the trained teacher model is distilled into the student model, so that the student model achieves a three-dimensional recognition effect close to that obtained with the complete view set even when the number of views is small. A high-accuracy three-dimensional recognition model is thus obtained for arbitrary multi-view input; the model needs no information other than the views and is lightweight.
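For illustration, a hedged sketch of the distillation objective: MSE plus cross entropy at the logits layer, MSE between corresponding intermediate-layer graph features, a temperature T softening the logits and a weight λ balancing the two parts. The exact place where T enters and the default values of T and λ are assumptions based on the description; teacher and student hidden features must have matching shapes.

```python
import torch
import torch.nn.functional as func

def distillation_loss(t_logits, s_logits, t_hidden, s_hidden, labels, T=4.0, lam=0.1):
    """t_*/s_*: teacher/student logits of shape (B, C) and intermediate graph-structure
    features of identical shape. labels: ground-truth class indices of shape (B,).
    Combines output-layer similarity and prediction accuracy with a hidden-layer MSE term."""
    # logits layer: MSE between temperature-softened logits + cross entropy to the true label
    loss_logits = func.mse_loss(s_logits / T, t_logits / T) + func.cross_entropy(s_logits, labels)
    # intermediate layers: MSE distance between corresponding graph-structure features
    loss_hidden = func.mse_loss(s_hidden, t_hidden)
    return loss_logits + lam * loss_hidden            # total loss used to train the student

# usage sketch: the teacher sees the complete views, the student the defective ones
# t_logits, t_hidden = teacher(full_views)
# s_logits, s_hidden = student(make_defective_views(full_views))
```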
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides an application-end-oriented multi-view three-dimensional object recognition method, which utilizes the characteristic difference of multi-view characteristics to group, divides the characteristics possibly from similar visual angles into one group and fuses the characteristics into a plurality of group characteristics, converts the multi-view data with difference into similar intermediate layer characteristics, constructs a teacher model and a student model with the same structure as the teacher model based on the idea, and can realize the functions.
Drawings
Fig. 1 is a schematic flowchart of an application-oriented multi-view three-dimensional object recognition method according to embodiment 1 of the present invention;
fig. 2 is a diagram showing a structure of a teacher model or a student model proposed in embodiment 2 of the present invention;
fig. 3 is a diagram showing a process of processing multi-view data by a teacher model and a student model according to embodiment 2 of the present invention;
fig. 4 is a graph showing the classification accuracy of a ModelNet40 dataset with view uncertainty using the method of the present application in comparison with other methods as set forth in example 3 of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for better illustration of the present embodiment, certain parts of the drawings may be omitted, enlarged or reduced, and do not represent actual dimensions;
it will be understood by those skilled in the art that certain descriptions of well-known structures in the drawings may be omitted.
The technical solution of the present invention is further described with reference to the drawings and the embodiments.
The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
example 1
As shown in fig. 1, this embodiment provides an application-end-oriented multi-view three-dimensional object recognition method, which comprises the following steps:
S1. For each three-dimensional object, extract all view features of the multi-view data set and, based on a feature difference distance metric, select several features from the multi-view features as group representative features;
S2. Assign each remaining feature, other than the group representative features, to the group of the representative feature closest to it, and fuse all features within each group to obtain several group fusion features;
S3. Feed the group fusion features into a graph convolution network, propagate the local information globally to obtain graph convolution features carrying global information, and fuse all graph convolution features into a three-dimensional feature descriptor;
S4. Construct a teacher model and a student model that both implement the functions of steps S1 to S3, train the teacher model with the complete multi-view data set, and use the trained teacher model to guide the training of the student model on the defective multi-view data set, obtaining a trained student model;
S5. Input arbitrary multi-view data of the actual three-dimensional object to be recognized at the application end into the trained student model to obtain the three-dimensional recognition result.
In the method provided by this embodiment, the multi-view features are grouped by their feature differences, so that features likely to come from similar viewing angles fall into one group and are fused into several group features, converting differing multi-view data into similar intermediate-layer features; feature fusion of the group features is then performed by a graph convolution network to form the object descriptor used for three-dimensional recognition. A teacher model realizing these functions and a student model with the same structure as the teacher model are constructed; using knowledge distillation, the teacher model is trained with the complete multi-view data set and guides the training of the student model, which simulates the real task and is trained with arbitrary multi-view data from that task. Finally, the student model achieves a good three-dimensional recognition effect for arbitrary multi-view input.
In step S1, the views of all three-dimensional objects from all angles form the multi-view data set, and the views of each three-dimensional object from all angles are input to a pre-trained feature extraction network to obtain all single-view image features of the multi-view data set.
For each three-dimensional object, all of its single-view image features are combined into a multi-view feature F, expressed as F = {f_1, f_2, ..., f_N}, where N is the number of single-view image features of the three-dimensional object and f_i is the i-th single-view image feature in F, i = 1, 2, ..., N.
In step S2, for each three-dimensional object, a feature difference metric V(F; θ_i) between the single-view image features in its multi-view feature F is calculated, and the group representative features are selected as

F_G = argmax(max(V(F; θ_i)))

where θ_i denotes the feature parameters of each single-view image and V(·) is the feature difference metric obtained by computing the sum of squared differences of the corresponding features. The recursive representative-view extraction proceeds as follows: a feature f_i is first drawn at random; then the feature f_j whose sum of feature difference metrics with respect to the previously extracted features is largest is extracted; these steps are repeated until M features F_G = {f_g1, f_g2, ..., f_gM} have been extracted as the group representative features, where M is the set number of groups, M is not greater than N, and F_G is a subset of the multi-view feature F.
in step S3, the distances between the group representative feature and the remaining features are calculated, and the expression is:
d(f gi -f j )=||f gi -f j || 2
and (3) classifying the rest characteristics into the group with the minimum distance from the characteristics to represent the characteristics, namely satisfying the following conditions:
G l =argmin(min(d(F G ;f j )),l=0,1,2...M-1,j=0,1,2,..N-1
wherein the content of the first and second substances,
Figure BDA0003841292010000086
finally, the characteristic group G is obtained l ,G l For M groups of feature groups comprising a plurality of features, byCalculating the distance between the features, and dividing the view features F into M groups; performing maximum pooling operation on all the features in each group to realize local feature fusion and obtain a plurality of group fusion features, wherein the expression is as follows:
Figure BDA0003841292010000091
wherein maxpool represents the maximum pooling operation, N l Indicates the number of features, G, contained in each group l,i The feature F obtained by fusing the ith feature of the ith n ={f 1 ,f 2 ,...f M And M fused group characteristics are contained, and when the distances between the group representative characteristics and the rest characteristics are calculated, if the distances between the group representative characteristics and the rest characteristics are the same, the characteristics are simultaneously divided into the groups where the group representative characteristics are located. The characteristic differences of the multi-view characteristics are used for grouping, multi-view data which change violently can be converted into similar intermediate layer characteristics, poor training effect caused by data change is avoided, and the method has a certain positive effect on extracting the three-dimensional object descriptor with high resolution in practical application.
The process of feeding the group fusion features into the graph convolution network and globally propagating the local information is as follows:

S41. Take the group fusion features f_i in the group feature set F_n as the nodes of a graph structure and, through an intermediate layer consisting of several MLP layers, obtain the adjacency matrix S_{i,j} describing the neighborhood of the graph nodes:

S_{i,j} = φ(d_{ij}; θ_s)

where d_{ij} = [f_i, f_j, f_i - f_j, ||f_i - f_j||_2] ∈ R^10 represents the spatial relationship between two group fusion features, φ(·) denotes the multi-layer MLP together with the vectorized fusion of the elements of the group features, and θ_s is the parameter representing the correspondence between the two group features;

S42. Determine, with a KNN algorithm, the group features within the nearest-neighbor range of each group feature, keep only the edges of the K nearest-neighbor group features, and obtain the sparse connection matrix A_{i,j}:

A_{i,j} = S_{i,j} · C{f_ni ∈ K(f_nj)}

where C(·) is the nearest-neighbor operation that judges whether a group feature belongs to the neighborhood of another group feature, and the multiplication sparsifies the original adjacency matrix;

S43. Apply graph convolution to the graph structure to obtain the graph convolution feature F_local^l of this level:

F_local^l = Ψ(A^l F_G^l W^l; θ^l)

where A^l is the adjacency matrix of the l-th level graph, F_G^l is the original group feature set of that level, W^l is the learnable weight matrix of that level, θ^l are the parameters of the linear activation function, and Ψ is a nonlinear transformation function; when the original group features F_G^l are input, they are first propagated through the adjacency matrix A^l, and the group feature nodes are then updated by the linear transformation of the learnable weight matrix, giving the local graph convolution feature F_local^l of each level, on which global message passing is performed:

r_{i,j} = σ(F_local^l(i), F_local^l(j)), i, j = 0, 1, 2, ..., M-1

where r_{i,j} represents the spatial relation between two nodes and σ is the relation function between the two nodes; in practice, node pairs output by the multi-layer MLPs aggregate the content messages, and each MLP layer contains several convolution units and a nonlinear activation;

S44. According to the newly obtained spatial relations of all node pairs, fuse them into the local graph convolution features to obtain new local graph convolution features F_new^l carrying global information:

F_new^l = Ω(F_local^l, r)

where Ω is a single-layer MLP; the fused features are output through batch normalization as the new local graph convolution features obtained by fusing the global node-pair information with the original group features of the graph;

S45. Obtain the local graph convolution and perform global information transfer in the same way for every level of the graph structure; based on the feature difference distance metric, the group feature with the smallest comprehensive distance metric is gradually eliminated at each level, and M graph convolution features F_new^1, F_new^2, ..., F_new^M are finally obtained. The local graph convolution features of the M levels are fused together by max pooling to form the global descriptor F_GCN representing the three-dimensional object:

F_GCN = maxpool(F_new^1, F_new^2, ..., F_new^M)

Sparsifying the adjacency matrix in this way improves the computational efficiency of the graph structure. When selecting the group representative features, the model uses the distance metric designed above, and the group feature with the smallest comprehensive distance metric is gradually eliminated at each level; this can be called a "minus-one sampling" method (M features at the first level, M-1 at the second level, and so on). Through this feature sampling and the local graph convolution processing, the M graph convolution features F_new^1, F_new^2, ..., F_new^M are obtained.
Example 2
In this embodiment, the structure of the teacher model and the student model is shown in fig. 2. Referring to fig. 2, each of the constructed teacher model and student model comprises an image feature extraction module, a feature grouping and fusion module, and a graph convolution module.
Specifically, the image feature extraction module extracts all single-view image features of the multi-view data set of a three-dimensional object and combines them into the multi-view feature; for each three-dimensional object, the feature grouping and fusion module selects several features from the multi-view features as group representative features based on the feature difference distance metric, assigns each remaining feature, other than the group representative features, to the group of the representative feature closest to it, and fuses all features within each group into several group fusion features; the graph convolution module propagates the local information globally to obtain graph convolution features carrying global information and fuses all graph convolution features into the three-dimensional feature descriptor. The teacher model and the student model can therefore realize the functions of steps S1 to S4; the processing procedure is shown in fig. 3. When the teacher model and the student model are trained, the complete multi-view data set used by the teacher model is a standard public data set, and the defective multi-view data set is obtained from the complete multi-view data set by reducing the number of views and scrambling their order.
The teacher model is trained with the complete multi-view data set, and the process of training the student model with the defective multi-view data set under the guidance of the teacher model is as follows: the logits-layer outputs of the teacher model and the student model are denoted x_i and y_i respectively; the feature difference between them is measured with the MSE (mean squared error), and the differences between the prediction results and the real labels in the teacher and student networks are expressed with cross-entropy functions, where N denotes the number of all views, r denotes the number of groups, N_i denotes the number of view features of a group, a sample correction variance enters the expression, and p and q denote the prediction result and the real label respectively.

These loss terms are added to measure the similarity and the prediction accuracy of the output layers of the teacher model and the student model; optimizing the sum, denoted L_logits, makes the student model learn the generalization ability of the teacher model's output layer.

The MSE distance between the intermediate-layer features of the teacher model and the corresponding intermediate-layer features of the student model is taken as another part of the loss function, denoted L_hidden; minimizing it makes the functions and structures of the two intermediate layers approach each other:

L_hidden = (1/(n·M)) Σ_{b=1}^{n} Σ_{l=1}^{M} || F_T^l - F_S^l ||^2

where F_T^l and F_S^l are the intermediate-layer features of a given level of the graph structure in the teacher model and the student model respectively, n is the total batch size (BatchSize), and M is the total number of levels of the graph-structure features. The two parts are combined into the total loss function required for knowledge distillation:

L_total = L_logits + λ · L_hidden
When the gap between the teacher model and the student model is too large, feeding the student too much knowledge from the teacher is not conducive to a good training effect. A temperature hyper-parameter T is therefore introduced to adjust the intermediate-layer feature difference and determine how much knowledge the distillation carries; setting it balances the relation between the teacher-student gap and the amount of transferred knowledge, smooths the training curve and improves the training effect. A hyper-parameter λ is set so that the magnitude of the logits-layer loss is close to that of the hidden-layer loss, and the total loss function is optimized until it converges, giving the trained student model. The method introduces knowledge distillation into the field of multi-view three-dimensional recognition and distills the complete information learned by the trained teacher model into the student model, so that the student model achieves a three-dimensional recognition effect close to that obtained with the complete view set even when the number of views is small; a high-accuracy three-dimensional recognition model is obtained for arbitrary multi-view input, and the model needs no information other than the views and is lightweight.
Example 3
In this embodiment, the ModelNet40 and ModelNet10 data sets are used to evaluate the method of the invention, and the effect is further illustrated by the following simulation experiments.
The ModelNet40 and ModelNet10 data sets used in the experiments are multi-view data sets of three-dimensional objects. The ModelNet40 multi-view data set contains multi-view data (12 or 20 views) of 12311 three-dimensional objects from 40 classes; following the common split, 9843 objects form the training set and 2468 objects form the test set, and when testing on ModelNet40 the three-dimensional recognition results with 20 views and with 12 views are evaluated separately. The ModelNet10 data set is much smaller: it contains multi-view data (12 or 20 views) of 4899 three-dimensional objects from 10 classes, of which 3991 objects are used for training and 908 for testing; the recognition effect with 20 views and with 12 views is likewise tested.
The compared methods include MVCNN, GVCNN, MHBN, MLVCNN, RotationNet, View-GCN, CAR-Net and other multi-view three-dimensional recognition methods. The main comparison indices are the classification and retrieval accuracy in three-dimensional recognition. The classification accuracy is the ratio of correctly predicted samples to the total number of samples; for retrieval, the mAP is computed by sorting the L_2 distances between features, taking the three-dimensional object with the smallest distance as the prediction result, and finally calculating the average retrieval accuracy. The comparative results are shown in Table 1.
TABLE 1 (classification and retrieval accuracy of the compared methods; provided as an image in the original publication)
As can be seen from Table 1, the method achieves a good three-dimensional recognition effect when the complete multi-view data are input. Table 2 compares the model size of the present method with those of MVCNN and View-GCN.
TABLE 2
Method                        Model size
The method of the invention   63.76 MB
MVCNN                         491.84 MB
View-GCN                      129.48 MB
Assume the training views are numbered 1 to 20 in order. With out-of-order input, the multi-view data fed to the model may take, for example, the form of 8 views whose order and sources are 13, 7, 2, 14, 3, 6, 7, 9. In fig. 4 the abscissa is the number of views and the ordinate is the classification accuracy. As can be seen from fig. 4, the model still achieves a good three-dimensional object recognition effect with out-of-order input and a small number of views, which verifies the effectiveness of the method provided by the invention.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. An application-end-oriented multi-view three-dimensional object identification method is characterized by comprising the following steps:
S1. For each three-dimensional object, extract all view features of the multi-view data set and, based on a feature difference distance metric, select several features from the multi-view features as group representative features;
S2. Assign each remaining feature, other than the group representative features, to the group of the representative feature closest to it, and fuse all features within each group to obtain several group fusion features;
S3. Feed the group fusion features into a graph convolution network, propagate the local information globally to obtain graph convolution features carrying global information, and fuse all graph convolution features into a three-dimensional feature descriptor;
S4. Construct a teacher model and a student model that both implement the functions of steps S1 to S3, train the teacher model with the complete multi-view data set, and use the trained teacher model to guide the training of the student model on the defective multi-view data set, obtaining a trained student model;
S5. Input arbitrary multi-view data of the actual three-dimensional object to be recognized at the application end into the trained student model to obtain the three-dimensional recognition result.
2. The application-oriented multi-view three-dimensional object recognition method of claim 1, wherein in step S1, the views of all three-dimensional objects from all angles form the multi-view data set, and the views of each three-dimensional object from all angles are input to a pre-trained feature extraction network to obtain all single-view image features of the multi-view data set.
3. The application-oriented multi-view three-dimensional object recognition method of claim 2, wherein for each three-dimensional object, all of its single-view image features are combined into a multi-view feature F expressed as F = {f_1, f_2, ..., f_N}, where N is the number of single-view image features of the three-dimensional object and f_i is the i-th single-view image feature in F, i = 1, 2, ..., N.
4. The method for identifying multi-view three-dimensional objects facing an application end according to claim 1, wherein in step S2, for each three-dimensional object, a feature difference metric V(F; θ_i) between the single-view image features in its multi-view feature F is calculated, and the group representative features are selected as

F_G = argmax(max(V(F; θ_i)))

where θ_i denotes the feature parameters of each single-view image and V(·) is the feature difference metric obtained by computing the sum of squared differences of the corresponding features; the recursive representative-view extraction first draws a feature f_i at random, then extracts the feature f_j whose sum of feature difference metrics with respect to the previously extracted features is largest, and repeats these steps until M features F_G = {f_g1, f_g2, ..., f_gM} have been extracted, where M is the set number of groups, M is not greater than N, and F_G is a subset of the multi-view feature F.
5. the method for identifying a multi-view three-dimensional object facing an application end according to claim 4, wherein in step S3, the distances between the group representative feature and the rest of the features are calculated, and the expression is as follows:
d(f gi -f j )=||f gi -f j || 2
and (3) classifying the rest characteristics into the group with the minimum distance from the characteristics to represent the characteristics, namely satisfying the following conditions:
G l =argmin(min(d(F G ;f j )),l=0,1,2...M-1,j=0,1,2,..N-1
wherein the content of the first and second substances,
Figure FDA0003841292000000026
finally, the characteristic group G is obtained l ,G l A characteristic group containing a plurality of characteristics is set as the M groups; performing maximum pooling operation on all the features in each group to realize local feature fusion and obtain a plurality of group fusion features, wherein the expression is as follows:
Figure FDA0003841292000000021
wherein maxpool represents the maximum pooling operation, N l Indicates the number of features, G, contained in each group l,i The feature F obtained by fusing the ith feature of the ith n ={f 1 ,f 2 ,...f M Contains M fused set features.
6. The method for identifying a multi-view three-dimensional object oriented to an application end according to claim 4, wherein, when the distances between a remaining feature and the group representative features are calculated, if the distances to several group representative features are equal, the feature is simultaneously assigned to the groups of those representative features.
7. The application-oriented multi-view three-dimensional object recognition method of claim 5, wherein the process of feeding the group fusion features into the graph convolution network and globally propagating the local information comprises:

S41. Take the group fusion features f_i in the group feature set F_n as the nodes of a graph structure and, through an intermediate layer consisting of several MLP layers, obtain the adjacency matrix S_{i,j} describing the neighborhood of the graph nodes:

S_{i,j} = φ(d_{ij}; θ_s)

where d_{ij} = [f_i, f_j, f_i - f_j, ||f_i - f_j||_2] ∈ R^10 represents the spatial relationship between two group fusion features, φ(·) denotes the multi-layer MLP together with the vectorized fusion of the elements of the group features, and θ_s is the parameter representing the correspondence between the two group features;

S42. Determine, with a KNN algorithm, the group features within the nearest-neighbor range of each group feature, keep only the edges of the K nearest-neighbor group features, and obtain the sparse connection matrix A_{i,j}:

A_{i,j} = S_{i,j} · C{f_ni ∈ K(f_nj)}

where C(·) is the nearest-neighbor operation that judges whether a group feature belongs to the neighborhood of another group feature, and the multiplication sparsifies the original adjacency matrix;

S43. Apply graph convolution to the graph structure to obtain the graph convolution feature F_local^l of this level:

F_local^l = Ψ(A^l F_G^l W^l; θ^l)

where A^l is the adjacency matrix of the l-th level graph, F_G^l is the original group feature set of that level, W^l is the learnable weight matrix of that level, θ^l are the parameters of the linear activation function, and Ψ is a nonlinear transformation function; when the original group features F_G^l are input, they are first propagated through the adjacency matrix A^l, and the group feature nodes are then updated by the linear transformation of the learnable weight matrix, giving the local graph convolution feature F_local^l of each level, on which global message passing is performed:

r_{i,j} = σ(F_local^l(i), F_local^l(j)), i, j = 0, 1, 2, ..., M-1

where r_{i,j} represents the spatial relation between two nodes and σ is the relation function between the two nodes; in practice, node pairs output by the multi-layer MLPs aggregate the content messages, and each MLP layer contains several convolution units and a nonlinear activation;

S44. According to the newly obtained spatial relations of all node pairs, fuse them into the local graph convolution features to obtain new local graph convolution features F_new^l carrying global information:

F_new^l = Ω(F_local^l, r)

where Ω is a single-layer MLP; the fused features are output through batch normalization as the new local graph convolution features obtained by fusing the global node-pair information with the original group features of the graph;

S45. Obtain the local graph convolution and perform global information transfer in the same way for every level of the graph structure; based on the feature difference distance metric, the group feature with the smallest comprehensive distance metric is gradually eliminated at each level, and M graph convolution features F_new^1, F_new^2, ..., F_new^M are finally obtained; the local graph convolution features of the M levels are fused together by max pooling to form the global descriptor F_GCN representing the three-dimensional object:

F_GCN = maxpool(F_new^1, F_new^2, ..., F_new^M)
8. The method for identifying the multi-view three-dimensional object facing the application end according to claim 1, wherein in step S5, the constructed teacher model and the constructed student model each comprise an image feature extraction module, a feature grouping and fusion module, and a graph convolution module;
the image feature extraction module extracts all single-view image features of the multi-view data set of a three-dimensional object and combines them into the multi-view feature; for each three-dimensional object, the feature grouping and fusion module selects several features from the multi-view features as group representative features based on the feature difference distance metric, assigns each remaining feature, other than the group representative features, to the group of the representative feature closest to it, and fuses all features within each group into several group fusion features; the graph convolution module propagates the local information globally to obtain graph convolution features carrying global information and fuses all graph convolution features into the three-dimensional feature descriptor.
9. The method for identifying a multi-view three-dimensional object facing the application end according to claim 8, wherein the complete multi-view data set in step S5 is a standard public data set, and the defective multi-view data set is obtained from the complete multi-view data set by reducing the number of views and shuffling the view order.
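A minimal sketch, assuming each object is stored as an ordered list of view images, of how a defective multi-view sample could be derived from a complete one (the keep ratio is an arbitrary illustrative value):

```python
import random

def make_defective_views(views, keep_ratio=0.5, seed=None):
    """Reduce the number of views of one object and shuffle the remaining view order."""
    rng = random.Random(seed)
    keep = max(1, int(len(views) * keep_ratio))  # reduce the view count
    subset = rng.sample(views, keep)             # randomly drop views
    rng.shuffle(subset)                          # disorder the view sequence
    return subset
```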
10. The method for identifying a multi-view three-dimensional object facing the application end according to claim 9, wherein the teacher model is trained with the complete multi-view data set, and the student model is trained with the defective multi-view data set under the guidance of the teacher model as follows: the logits-layer outputs of the teacher model and the student model are denoted x_i and y_i respectively; their feature difference is measured with the mean square error (MSE), and a cross-entropy function expresses the difference between the prediction result and the true label in the teacher-student network:

$$L_{MSE} = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_{i} - y_{i}\right)^{2}$$

where N represents the number of all views, r represents the number of groups, N_i is the number of view features of a certain group, the expression above represents a sample-corrected variance, and p and q represent the prediction result and the true label respectively:

$$L_{CE} = -\sum_{i} q_{i} \log p_{i}$$

The two loss functions are added to measure the similarity and the prediction accuracy of the output layers of the teacher model and the student model, and are optimized so that the student model learns the generalization ability of the teacher model's output layer:

$$L_{logits} = L_{MSE} + L_{CE}$$
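For illustration, the logits-layer part of this distillation objective could be computed as below; the equal weighting of the two terms simply follows the addition described above, and the reduction mode is an assumption:

```python
import torch
import torch.nn.functional as F

def logits_loss(student_logits: torch.Tensor,
                teacher_logits: torch.Tensor,
                labels: torch.Tensor) -> torch.Tensor:
    mse = F.mse_loss(student_logits, teacher_logits)  # teacher-student difference at the logits layer
    ce = F.cross_entropy(student_logits, labels)      # prediction versus true label
    return mse + ce                                   # the two losses are added
```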
The MSE distance between the corresponding intermediate-layer features of the teacher model and the student model is minimized as part of the loss function; this partial loss function L_hidden brings the function and structure of the two intermediate layers close to each other:

$$L_{hidden} = \frac{1}{nM}\sum_{k=1}^{n}\sum_{m=1}^{M}\left\| F_{T}^{(k,m)} - F_{S}^{(k,m)} \right\|_{2}^{2}$$

where F_T^(k,m) and F_S^(k,m) are the corresponding intermediate-layer features of the m-th level graph structure in the teacher model and the student model respectively, n is the total batch size (BatchSize), and M is the total number of levels of the graph-structure features; the two parts are combined to form the total loss function required by the student model in knowledge distillation:

$$L_{total} = \lambda\, L_{logits} + L_{hidden}$$
When the difference between the teacher model and the student model is too large, the excessive knowledge supplied by the teacher model is not conducive to a good training effect; a temperature hyper-parameter T is therefore introduced to adjust the intermediate-layer feature difference and to determine the amount of knowledge carried in the knowledge distillation, balancing the relation between the teacher-student difference and the amount of transferred knowledge, smoothing the training curve and improving the training effect; a hyper-parameter λ is set so that the magnitude of the logits-layer loss is close to that of the hidden-layer loss, and the total loss function is optimized until it converges, yielding the trained student model.
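A minimal sketch of the combined objective, assuming the temperature T softens the intermediate-layer features before the MSE is taken and that λ simply scales the logits-layer term; how exactly T and λ enter the patented loss is not specified here, so this is illustrative only:

```python
import torch
import torch.nn.functional as F

def hidden_loss(student_feats, teacher_feats, temperature: float = 4.0) -> torch.Tensor:
    # student_feats / teacher_feats: M intermediate graph-structure features per model
    losses = [F.mse_loss(s / temperature, t / temperature)
              for s, t in zip(student_feats, teacher_feats)]
    return torch.stack(losses).mean()  # average over the M levels (and implicitly the batch)

def total_loss(logits_term: torch.Tensor, hidden_term: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    # lam brings the logits-layer loss to the same order of magnitude as the hidden-layer loss
    return lam * logits_term + hidden_term
```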
CN202211102704.0A 2022-09-09 2022-09-09 Multi-view three-dimensional object identification method facing application end Pending CN115601745A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211102704.0A CN115601745A (en) 2022-09-09 2022-09-09 Multi-view three-dimensional object identification method facing application end

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211102704.0A CN115601745A (en) 2022-09-09 2022-09-09 Multi-view three-dimensional object identification method facing application end

Publications (1)

Publication Number Publication Date
CN115601745A true CN115601745A (en) 2023-01-13

Family

ID=84843876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211102704.0A Pending CN115601745A (en) 2022-09-09 2022-09-09 Multi-view three-dimensional object identification method facing application end

Country Status (1)

Country Link
CN (1) CN115601745A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688504A (en) * 2024-02-04 2024-03-12 西华大学 Internet of things abnormality detection method and device based on graph structure learning
CN117688504B (en) * 2024-02-04 2024-04-16 西华大学 Internet of things abnormality detection method and device based on graph structure learning

Similar Documents

Publication Publication Date Title
CN109493346B (en) Stomach cancer pathological section image segmentation method and device based on multiple losses
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN110647904B (en) Cross-modal retrieval method and system based on unmarked data migration
CN108090472B (en) Pedestrian re-identification method and system based on multi-channel consistency characteristics
CN109063649B (en) Pedestrian re-identification method based on twin pedestrian alignment residual error network
CN103559504A (en) Image target category identification method and device
CN110516095A (en) Weakly supervised depth Hash social activity image search method and system based on semanteme migration
CN111931505A (en) Cross-language entity alignment method based on subgraph embedding
CN112015868A (en) Question-answering method based on knowledge graph completion
CN114169442B (en) Remote sensing image small sample scene classification method based on double prototype network
CN114241273A (en) Multi-modal image processing method and system based on Transformer network and hypersphere space learning
CN112633382A (en) Mutual-neighbor-based few-sample image classification method and system
CN112801209A (en) Image classification method based on dual-length teacher model knowledge fusion and storage medium
CN116416478B (en) Bioinformatics classification model based on graph structure data characteristics
CN114821342A (en) Remote sensing image road extraction method and system
CN114780777B (en) Cross-modal retrieval method and device based on semantic enhancement, storage medium and terminal
CN114254093A (en) Multi-space knowledge enhanced knowledge graph question-answering method and system
CN115601745A (en) Multi-view three-dimensional object identification method facing application end
CN110796182A (en) Bill classification method and system for small amount of samples
CN113361928A (en) Crowdsourcing task recommendation method based on special-pattern attention network
CN117350330A (en) Semi-supervised entity alignment method based on hybrid teaching
CN112529057A (en) Graph similarity calculation method and device based on graph convolution network
CN116226467A (en) Community discovery method of graph convolution neural network based on node structural features
CN115861664A (en) Feature matching method and system based on local feature fusion and self-attention mechanism
CN113160291B (en) Change detection method based on image registration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination