CN112633301A - Traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning - Google Patents

Traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning

Info

Publication number
CN112633301A
CN112633301A
Authority
CN
China
Prior art keywords
network
depth
feature
samples
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110045201.3A
Other languages
Chinese (zh)
Inventor
李晓光 (Li Xiaoguang)
郭新 (Guo Xin)
卓力 (Zhuo Li)
张辉 (Zhang Hui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110045201.3A priority Critical patent/CN112633301A/en
Publication of CN112633301A publication Critical patent/CN112633301A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 ICT specially adapted for the handling or processing of medical images
    • G16H30/20 ICT specially adapted for the handling or processing of medical images for handling medical images, e.g. DICOM, HL7 or PACS

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning, which constructs a metric space from discriminative depth features and an adaptive feature metric criterion. The method applies a twin network to extract discriminative depth features and uses a neural network to adaptively learn a metric criterion suited to the greasy features at hand. The technology has broad application prospects in the automatic analysis of traditional Chinese medicine tongue images. A new depth metric learning network is designed for the tongue image greasy feature classification task; by learning an adaptive metric criterion, the intra-class compactness and inter-class discriminability of the depth features are effectively enhanced and the quality of the constructed metric space is improved, thereby raising the accuracy of the algorithm in tongue image greasy feature classification.

Description

Traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning
Technical Field
The invention relates to an image classification method, in particular to a traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning.
Background
Traditional Chinese medicine is an important component of Chinese culture, representing the empirical and theoretical summary of the Chinese people's long struggle against disease. With its unique theoretical system it occupies a distinctive place in world medicine and is a valuable asset of human medicine. Tongue diagnosis is an important part of inspection, one of the four diagnostic methods of traditional Chinese medicine: doctors diagnose diseases by observing tongue characteristics such as tongue color, tongue body shape, coating color, and coating quality, a key link in treatment based on syndrome differentiation. For a long time, tongue diagnosis results have been determined mainly by the doctor's visual observation and personal experience, and are therefore influenced by subjective factors such as the doctor's knowledge and experience. This strong subjectivity and the lack of a unified standard seriously hinder the modernization of traditional Chinese medicine. Making tongue diagnosis information objective has thus become an urgent problem in the modernization of traditional Chinese medicine.
Coating greasiness is an important index in tongue diagnosis: from the greasy feature a doctor can infer whether the patient suffers from phlegm-damp or indigestion. A greasy coating is fine, dense, and smooth; a white greasy coating generally indicates damp cold, while a yellow greasy coating indicates damp heat. A curdy (rotten) coating is loose and curd-like, mostly caused by food stagnation and phlegm turbidity. Analyzing the tongue image greasy feature by computer makes the greasy feature information objective, assists the diagnosis of diseases, and therefore has important application value.
Tongue image classification methods fall into two main categories: traditional methods and deep learning methods. Traditional methods require carefully hand-crafted features designed specifically for tongue images; the extracted features have concrete physical meanings, but such design usually demands rich professional experience, and the features are low-level ones such as texture, color, shape, and gradient. Hand-crafted features have limited generalization ability, which leads to low tongue image classification accuracy. In recent years, deep learning has revolutionized many fields thanks to its powerful feature extraction and representation capability. Applying deep learning to the classification of traditional Chinese medicine tongue image greasy features promises more accurate tongue image analysis results.
Two main problems must be solved when using deep learning models for tongue image greasy feature classification. 1) High-quality labeled tongue image samples are costly to acquire: training samples must be annotated by doctors with rich tongue diagnosis experience, so they are expensive and hard to obtain, and tongue image training datasets are therefore generally small. Directly applying a deep learning algorithm with too few training samples leads to difficult training or overfitting. 2) In tongue image data collected in real hospital scenarios, the numbers of samples in the different greasy-feature categories differ markedly, i.e., the classes are imbalanced. Classes with relatively many samples are called majority classes and those with relatively few are called minority classes. Training a deep network directly on a class-imbalanced dataset may yield a model with poor generalization, biasing classification results toward the majority classes and lowering the recognition rate for minority-class samples.
The invention provides a traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning, built around a depth metric learning network that incorporates an adaptive metric criterion. The network has two parts: a feature extraction network, which uses a twin (siamese) network to extract discriminative features, and a feature comparison network, which adaptively learns a metric criterion from the input samples to judge the similarity between features. The method effectively improves the accuracy of tongue image greasy classification and has important application value in the automatic analysis of traditional Chinese medicine tongue images.
Disclosure of Invention
The invention addresses two problems in classifying traditional Chinese medicine tongue images by greasy feature: overfitting caused by an insufficient number of acquired tongue image samples, and the low recognition rate of the trained model on minority classes caused by the class imbalance of greasy features in samples acquired in real scenarios. To this end, a traditional Chinese medicine tongue image greasy feature classification method based on metric learning is provided; by designing a reading mechanism for the training data and adaptively learning the metric criterion from the data, accurate automatic classification of tongue image greasy features can be realized.
The invention is realized by adopting the following technical means:
A traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning. The method relies on a data reading mechanism designed for the characteristics of the tongue image greasy feature dataset and adopts a twin network framework.
1) At each round of training, the invention randomly selects the same number of samples from each category of the dataset by way of data reorganization to form the dataset used in that round, and arranges the randomly selected samples in ordered pairs to form the training data, where one pair of samples constitutes one training datum.
2) A twin network framework is adopted. First, the paired training samples are fed into a feature extraction network, which extracts depth features discriminative for the current tongue image greasy feature classification task. The two extracted depth features are then concatenated along the channel dimension, and the concatenated joint feature is fed into a feature comparison network, which adaptively learns a metric criterion suited to the current tongue image greasy feature classification. The whole network, shown in FIG. 1, is divided into two parts, the feature extraction network and the feature comparison network, shown in FIGS. 2 and 3. The two networks are described in detail below.
Structure of the feature extraction network: a twin network structure is adopted, i.e., a combined structure built from two neural networks. The twin network takes two samples as input and outputs the extracted depth features of the samples so that their similarity can be compared. The two networks share weights, i.e., their parameters are identical, which guarantees that the same kind of depth features are extracted from both input tongue images. The feature extraction network adopts a residual network, composed mainly of residual modules. A residual module adds an identity mapping to the network structure, allowing information from earlier layers to be passed directly to later layers. Because low-level features of the tongue image, such as color and texture, are helpful for greasy feature classification, the residual network lets these useful low-level features propagate into the high-level features, which improves tongue image classification accuracy.
The feature extraction network consists of one convolutional layer and 4 residual modules. The convolutional layer comprises a convolution operation with a 7 × 7 kernel and a max pooling that reduce the spatial size of the image. Each residual module consists of two residual units, and each residual unit contains an identity mapping. A residual unit of the first residual module performs two 3 × 3 convolutions, with batch normalization and a nonlinear activation function (ReLU) between them and a ReLU after them. The last three residual modules differ slightly from the first: because the input and output feature dimensions of their first residual units change, an extra 1 × 1 convolution must be added on the skip connection to adjust its output.
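For illustration only, a minimal PyTorch sketch of such a weight-shared extractor, assuming a standard torchvision ResNet-18 backbone (a 7 × 7 convolutional stem plus four residual stages, matching the structure above); the class name TwinFeatureExtractor and the framework choice are ours, not the patent's:

import torch.nn as nn
from torchvision import models

class TwinFeatureExtractor(nn.Module):
    def __init__(self, pretrained=True):
        super().__init__()
        backbone = models.resnet18(pretrained=pretrained)
        # Keep conv stem (conv1/bn/relu/maxpool) and the 4 residual stages;
        # drop the avgpool and fc head so the 512 x 7 x 7 feature map survives.
        self.body = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, x1, x2):
        # The same weights process both inputs, so the two tongue-image
        # patches are embedded with identical depth features.
        return self.body(x1), self.body(x2)

For two 3 × 224 × 224 inputs, each output feature has shape [B, 512, 7, 7], consistent with the embodiment described later.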
Structure of the feature comparison network: it is composed of a convolutional layer and linear layers. Its input is the depth features extracted by the feature extraction network; the depth features are reduced in dimension by 128 convolution kernels of size 3 × 3, followed by batch normalization and a nonlinear activation function (ReLU), which improves the robustness and expressive capacity of the model. The feature then passes through two linear layers with 10 and 1 neurons respectively, and finally a sigmoid function maps the output similarity value into the interval from 0 to 1.
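Under the same PyTorch assumption, a minimal sketch of this comparison network; the global average pooling before the linear layers follows the embodiment described later (128 × 1 × 1 output), and the class and layer names are illustrative:

import torch
import torch.nn as nn

class ComparisonNet(nn.Module):
    def __init__(self, in_channels=1024):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),   # global average pooling -> 128 x 1 x 1
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128, 10),        # two linear layers: 10 then 1 neurons
            nn.Linear(10, 1),
            nn.Sigmoid(),              # similarity value in (0, 1)
        )

    def forward(self, fa, fb):
        joint = torch.cat([fa, fb], dim=1)   # [B, 2c, h, w] channel concat
        return self.head(self.conv(joint)).squeeze(1)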
The method is divided into two stages: a training stage and a testing stage;
the training stage comprises the following specific steps:
In the first step, a training data queue is established. The tongue image dataset is divided into 3 greasy-feature classes, greasy, slightly greasy, and non-greasy, containing c1, c2, and c3 samples respectively. First, q samples are selected from each class (where q < min(c1, c2, c3)), giving 3q samples in total; the 3q samples are then arranged in ordered pairs, yielding (3q)² sample pairs, where the order of the two samples in a pair matters. All generated sample pairs serve as the data queue for this round of training. During training, the label of a pair is set to 1 if its two samples belong to the same class and to 0 otherwise, and the model is trained with the sample pairs and their labels; the permuted sample pairs are reselected for every round of training, as sketched below.
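For illustration, a minimal Python sketch of this queue-building step; the function name build_pair_queue and the dict-of-lists input format are assumptions, not from the patent:

import random
from itertools import product

def build_pair_queue(samples_by_class, q):
    """samples_by_class: dict mapping class id -> list of images.
    Draws q samples per class, then forms all (3q)^2 ordered pairs."""
    drawn = []
    for cls, samples in samples_by_class.items():
        for img in random.sample(samples, q):     # requires q < class size
            drawn.append((img, cls))
    queue = []
    for (x1, c1), (x2, c2) in product(drawn, repeat=2):  # ordered pairs
        label = 1.0 if c1 == c2 else 0.0          # same class -> 1, else 0
        queue.append((x1, x2, label))
    random.shuffle(queue)
    return queue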
In the second step, features are extracted from the sample pairs by the feature extraction network. The input is the sample pairs in the data queue established in the first step, and the output is the two extracted depth features of the corresponding tongue images. To prevent overfitting, the network is trained by fine-tuning the parameters of the whole network: the parameters of a pre-trained model serve as the initial parameters of the network, and the training data are used to train all network parameters, which effectively improves training efficiency and accuracy.
In the third step, the feature comparison network computes the similarity of the depth features. Its input is the concatenation of the two depth features output by the feature extraction network; specifically, the two depth features are concatenated along the channel dimension. An extracted depth feature has dimensions [c × h × w], where c, h and w denote the number of channels, height and width of the feature map, so the concatenated feature has dimensions [2c × h × w]. The output is the similarity value computed by the feature comparison network. Samples of the same class should have higher similarity and samples of different classes lower similarity; in the invention, the closer the similarity value is to 1, the more similar the two depth features are, and likewise, the closer it is to 0, the less similar they are.
The loss function of the whole training process consists of two parts. The first part constrains whether the input pair consists of same-class samples: during training, the similarity computed by the network between same-class samples should be close to 1, and between different-class samples close to 0. It can be expressed as:
loss1 = -[y*log(p) + (1-y)*log(1-p)]   (1)
where p denotes the similarity value computed by the feature comparison network and y denotes the pair label determined by the two samples' classes: the label is 1 if the two classes are the same and 0 otherwise. The second part requires that the computed similarity of two depth features not depend on their concatenation order. Let a and b denote the two depth features output by the feature extraction network, let a&b denote the feature obtained by concatenating a before b, and let F_R(a&b) denote the computed similarity; then F_R(a&b) = F_R(b&a) should hold, i.e., the similarity values computed for the same sample pair should be identical. To keep loss2 on the same order of magnitude as loss1, cross-entropy is again used for this loss, so it can be expressed as:
loss2 = -[y*log|F_R(a&b) - F_R(b&a)| + (1-y)*log(1 - |F_R(a&b) - F_R(b&a)|)]   (2)
When computing loss2, the target y is always 0 (the two concatenation orders should give the same value), so loss2 reduces to:
loss2 = -log(1 - |F_R(a&b) - F_R(b&a)|)   (3)
The total loss is therefore loss = loss1 + loss2.
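A minimal PyTorch sketch of this two-part loss, assuming the ComparisonNet sketch above plays the role of F_R; the epsilon term is our numerical-stability addition, not part of the patent text:

import torch

def metric_loss(comparator, fa, fb, y, eps=1e-7):
    """fa, fb: depth features of a sample pair; y: float tensor of 0/1 labels."""
    p_ab = comparator(fa, fb)                   # F_R(a & b)
    p_ba = comparator(fb, fa)                   # F_R(b & a)
    loss1 = -(y * torch.log(p_ab + eps)
              + (1 - y) * torch.log(1 - p_ab + eps)).mean()
    diff = (p_ab - p_ba).abs()                  # order difference, target 0
    loss2 = -torch.log(1 - diff + eps).mean()   # cross-entropy with target 0
    return loss1 + loss2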
The test phase comprises the following steps:
In the first step, the trained feature extraction network extracts the corresponding depth features of all samples in the training set, and the class-center depth feature of each class is computed as the mean of the depth features of that class, namely

A = (1/n)(a_1 + a_2 + … + a_n)

where n is the number of samples in the class and a_1, …, a_n are their extracted depth features.
The class-center depth feature represents the position of its class in the metric space, so one class-center depth feature is obtained per class. The depth feature of a test sample is likewise extracted by the feature extraction network.
In the second step, the learned metric criterion, i.e., the comparison network, computes the similarity between the test sample's depth feature and each class-center depth feature, and the test sample is assigned, following the nearest-neighbor idea, to the class with the largest computed similarity value. The invention does not directly classify a test sample into the class of the single most similar training sample: because tongue image data are scarce, the constructed metric space is affected to some extent, and direct nearest-neighbor classification could misclassify a test sample due to outliers or insufficient data. Classification therefore uses the average similarity via class centers, a globally informed choice. Moreover, computing the class-center depth features in advance is clearly faster than direct nearest-neighbor classification over all training samples.
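A minimal PyTorch sketch of this test stage, reusing the extractor and comparator sketches above; the helper names class_centers and classify are illustrative:

import torch

@torch.no_grad()
def class_centers(extractor, loader_by_class):
    """loader_by_class: dict class id -> iterable of image batches."""
    centers = {}
    for cls, loader in loader_by_class.items():
        feats = torch.cat([extractor.body(x) for x in loader], dim=0)
        centers[cls] = feats.mean(dim=0, keepdim=True)  # (1/n) * sum of a_i
    return centers

@torch.no_grad()
def classify(extractor, comparator, x_test, centers):
    f_test = extractor.body(x_test.unsqueeze(0))        # single test image
    sims = {cls: comparator(c, f_test).item() for cls, c in centers.items()}
    return max(sims, key=sims.get)   # class center with largest similarity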
Compared with the prior art, the invention has the following obvious advantages and beneficial effects:
By forming a new data queue through data reorganization at every round of training, the method mitigates, to a certain extent, the class imbalance present in the tongue image greasy features. By adopting the twin network idea and learning the metric criterion with a feature comparison network while the feature extraction network is being learned, the learned metric criterion fits tongue image greasy feature analysis better than preset metrics such as Euclidean distance, cosine similarity, or the Pearson coefficient. Finally, computing the class-center depth features of the training set and determining the class of a test sample from its depth feature and each class-center depth feature improves the robustness of the algorithm to a certain extent.
The invention has the characteristics that:
1. Through data reorganization, partial data with relatively balanced greasy-feature classes are selected from the dataset before each round of training and combined into a data queue, which alleviates, to a certain extent, the scarcity of collected tongue image samples and the class imbalance of the greasy features;
2. A network-learned metric criterion replaces manually preset metric criteria, so the learned criterion better fits the current tongue image greasy feature classification task and improves classification accuracy;
3. At test time, the trained feature extraction network computes the center depth feature of each class, and the similarity between each class-center depth feature and the test sample determines the class of the test sample. Using this averaged similarity improves the performance of the classification model and makes the algorithm more robust.
The following detailed description is presented, in conjunction with examples, to provide a further understanding of the objects, features, and advantages of the invention.
Description of the drawings:
FIG. 1: overall network architecture of the proposed method;
FIG. 2: feature extraction network architecture;
FIG. 3: feature comparison network architecture;
FIG. 4: schematic diagram of the classification procedure in the test phase;
FIG. 5: examples from the tongue image greasy feature dataset;
the specific implementation mode is as follows:
the following detailed description of embodiments of the invention is provided in conjunction with the accompanying drawings:
(1) The tongue images collected on the hospital site have a resolution of 5184 × 3456, which is too large to use directly as training data. To reduce the interference of irrelevant information in the collected images with greasy feature classification, image blocks at key positions of the tongue coating are first cropped from the collected tongue image data; the selected blocks have a resolution of 224 × 224 and no overlapping regions. With this as the standard, a tongue image greasy feature dataset is produced, with the feature classes non-greasy, slightly greasy, and greasy. After data cleaning, in which obviously erroneous data are screened out, missing values are handled, and duplicates are removed, the tongue image greasy feature dataset used in the experiments is obtained, as shown in FIG. 5; it contains 1500 training samples and 443 test samples.
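For illustration, a minimal sketch of cropping non-overlapping 224 × 224 blocks from a full tongue image; selecting the key positions of the tongue coating is in practice an expert, manual step, so the simple grid crop below is only a simplified stand-in:

from PIL import Image

def crop_blocks(path, block=224):
    """Grid-crop non-overlapping block x block patches from a tongue image.
    Simplified stand-in: the patent selects key coating positions by hand."""
    img = Image.open(path)
    w, h = img.size
    return [img.crop((x, y, x + block, y + block))
            for y in range(0, h - block + 1, block)
            for x in range(0, w - block + 1, block)]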
(2) The specific implementation steps of the training are as follows:
a) Constructing the data queue: the total number of training rounds of the network is set to 10000. At each round, 30 samples are randomly re-selected from each class of data, i.e., 90 samples in total across the three greasy-feature classes. Arranged in ordered pairs, the data queue contains 90 × 90 = 8100 sample pairs in total. The label of each pair depends on the classes of its two tongue image samples: it is set to 1 when the two samples belong to the same class and to 0 when they belong to different classes. The data queue is rebuilt at every round of training so that the information in the whole dataset is used more fully.
b) Feature extraction network: the network structure is shown in FIG. 2. A pre-trained 18-layer residual network model serves as the backbone of the feature extraction network. Because tongue image greasy feature classification differs somewhat from natural image classification, the differences between greasy features being small and many fine textures probably carrying important information, the pre-trained model parameters are used as the initial parameters of the backbone. The model is trained with a stochastic gradient descent (SGD) optimizer using a fixed-step decayed learning rate: the initial learning rate is set to 0.005 and reduced to 80% of its value every 500 epochs, and weight decay is set to 1e-5 to prevent overfitting. The feature network is fine-tuned during training. Two tongue images of size 3 × 224 × 224 pixels are input into the feature extraction network, which extracts more discriminative depth features with output dimensions 512 × 7 × 7 (a configuration sketch follows).
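A minimal sketch of this optimizer configuration, reusing the network sketches above; the momentum value is our assumption (the text specifies only SGD, the learning-rate schedule, and the weight decay):

from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

extractor = TwinFeatureExtractor(pretrained=True)   # sketches defined earlier
comparator = ComparisonNet()
params = list(extractor.parameters()) + list(comparator.parameters())
optimizer = SGD(params, lr=0.005, momentum=0.9, weight_decay=1e-5)
scheduler = StepLR(optimizer, step_size=500, gamma=0.8)  # lr *= 0.8 every 500 rounds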
c) Feature comparison network: the network structure is shown in FIG. 3. The two extracted depth features are concatenated along the channel dimension to form a 1024 × 7 × 7 feature, which is fed into the feature comparison network. The concatenated feature first passes through 128 convolution kernels of size 3 × 3 (output dimensions 128 × 7 × 7), followed by batch normalization and a ReLU activation function; global average pooling then yields an output of 128 × 1 × 1, which is flattened and passed through two linear layers with 10 and 1 neurons respectively; finally a sigmoid function maps the output similarity value into the interval from 0 to 1. This value represents the computed similarity between the concatenated features: the closer it is to 1, the more similar the two features are, i.e., the two tongue image samples are very probably of the same class; conversely, a value close to 0 means the two samples are very probably of different classes. The training mode and parameters of the feature comparison network are consistent with those of the feature extraction network, but its parameters are randomly initialized.
d) The feature extraction network and the feature comparison network are trained simultaneously. Eventually the network converges and the loss remains essentially unchanged.
(3) The specific implementation steps of the test stage are as follows:
a) The testing procedure is shown in FIG. 4. The trained feature extraction network extracts depth features for all samples in the training set, and the class-center depth feature A_i of each class is then computed as

A_i = (1/n)(a_1 + a_2 + … + a_n)   (5)

where n is the total number of samples in the class currently being computed, and a_1, a_2, …, a_n are the n depth features extracted from all samples of that class. The class-center depth feature represents the position of its class in the metric space, so one class-center depth feature is obtained per class. The depth feature of the test sample is likewise extracted by the feature extraction network.
b) Classifying the test sample: the feature a_test extracted from the test sample is compared, through the feature comparison network, with each class-center depth feature; the class whose center yields the largest similarity is the class of the test sample:

Max(F_R(A_1 & a_test), F_R(A_2 & a_test), F_R(A_3 & a_test))   (6)

where A_1, A_2, and A_3 denote the three class-center depth features.

Claims (1)

1. A traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning, characterized by comprising the following steps:
1) at each round of training, randomly selecting the same number of samples from each category of the dataset by way of data reorganization to form the dataset used in that round, and arranging the randomly selected samples in ordered pairs to form the training data, where one pair of samples constitutes one training datum;
2) adopting a twin network framework: first feeding the paired training samples into a feature extraction network, extracting depth features discriminative for the current tongue image greasy feature classification task, then concatenating the two extracted depth features along the channel dimension, feeding the concatenated joint feature into a feature comparison network, and adaptively learning a metric criterion suited to the current tongue image greasy feature classification;
the whole network of the method is divided into two parts, a feature extraction network and a feature comparison network, described in detail below;
structure of the feature extraction network: a twin network structure is adopted, i.e., a combined structure built from two neural networks; the twin network takes two samples as input and outputs the extracted depth features of the samples so that their similarity can be compared; the two networks share weights, i.e., their parameters are identical, which guarantees that the same kind of depth features are extracted from both input tongue images; the feature extraction network adopts a residual network, composed mainly of residual modules; a residual module adds an identity mapping to the network structure, allowing information from earlier layers to be passed directly to later layers;
the feature extraction network consists of one convolutional layer and 4 residual modules; the convolutional layer comprises a convolution operation with a 7 × 7 kernel and a max pooling that reduce the spatial size of the image; each residual module consists of two residual units, and each residual unit contains an identity mapping; a residual unit of the first residual module performs two 3 × 3 convolutions, with batch normalization and a nonlinear activation function (ReLU) between them and a ReLU after them; the last three residual modules differ slightly from the first: because the input and output feature dimensions of their first residual units change, an extra 1 × 1 convolution must be added on the skip connection to adjust its output;
structure of the feature comparison network: it is composed of a convolutional layer and linear layers; its input is the depth features extracted by the feature extraction network; the depth features are reduced in dimension by 128 convolution kernels of size 3 × 3, followed by batch normalization and a nonlinear activation function (ReLU), improving the robustness and expressive capacity of the model; the feature then passes through two linear layers with 10 and 1 neurons respectively, and finally a sigmoid function maps the output similarity value into the interval from 0 to 1;
the method is divided into two stages: a training stage and a testing stage;
the training stage comprises the following specific steps:
in the first step, a training data queue is established; the tongue image dataset is divided into 3 greasy-feature classes, greasy, slightly greasy, and non-greasy, containing c1, c2, and c3 samples respectively; first, q samples are selected from each class (where q < min(c1, c2, c3)), giving 3q samples in total; the 3q samples are arranged in ordered pairs, yielding (3q)² sample pairs, where the order of the two samples in a pair matters; all generated sample pairs serve as the data queue for this round of training; during training, the label of a pair is set to 1 if its two samples belong to the same class and to 0 otherwise, and the model is trained with the sample pairs and their labels, the permuted sample pairs being reselected for every round of training;
in the second step, features are extracted from the sample pairs by the feature extraction network; the input is the sample pairs in the data queue established in the first step, and the output is the two extracted depth features of the corresponding tongue images; to prevent overfitting, the parameters of a pre-trained model serve as the initial parameters of the feature extraction network during training, and the training data are used to train the parameters of the whole network;
in the third step, the feature comparison network computes the similarity of the depth features; its input is the concatenation of the two depth features output by the feature extraction network, the two depth features being concatenated along the channel dimension; an extracted depth feature has dimensions [c × h × w], where c, h and w denote the number of channels, height and width of the feature map; the concatenated feature has dimensions [2c × h × w]; the output is the similarity value computed by the feature comparison network; samples of the same class should have higher similarity and samples of different classes lower similarity; in the method, the closer the similarity value is to 1, the more similar the two depth features are; the closer the similarity value is to 0, the less similar the two depth features are;
the loss function of the whole training process consists of two parts; the first part constrains whether the input pair consists of same-class samples: during training, the similarity computed by the network between same-class samples should be close to 1, and between different-class samples close to 0; expressed as:
loss1 = -[y*log(p) + (1-y)*log(1-p)]   (1)
where p denotes the similarity value computed by the feature comparison network and y denotes the pair label determined by the two samples' classes, the label being 1 if the two classes are the same and 0 otherwise; the second part requires that the computed similarity of two depth features not depend on their concatenation order; let a and b denote the two depth features output by the feature extraction network, let a&b denote the feature obtained by concatenating a before b, and let F_R(a&b) denote the computed similarity; then F_R(a&b) = F_R(b&a) should hold, i.e., the similarity values computed for the same sample pair should be identical; to keep loss2 on the same order of magnitude as loss1, cross-entropy is again used for this loss, so it can be expressed as:
loss2 = -[y*log|F_R(a&b) - F_R(b&a)| + (1-y)*log(1 - |F_R(a&b) - F_R(b&a)|)]   (2)
when computing loss2, the target y is always 0, so loss2 finally reduces to:
loss2 = -log(1 - |F_R(a&b) - F_R(b&a)|)   (3)
the total loss is therefore loss = loss1 + loss2;
the test phase comprises the following steps:
in the first step, the trained feature extraction network extracts the corresponding depth features of all samples in the training set, and the class-center depth feature of each class is then computed as the mean of the depth features of that class, namely

A = (1/n)(a_1 + a_2 + … + a_n)

where n is the number of samples in the class and a_1, …, a_n are their extracted depth features;
Using the category center depth feature to represent the position of the category in the measurement space to obtain category center depth features of category quantity; extracting the depth characteristics of the test sample through a characteristic extraction network;
in the second step, the learned metric criterion, i.e., the comparison network, computes the similarity between the test sample's depth feature and each class-center depth feature, and the test sample is assigned, following the nearest-neighbor idea, to the class with the largest computed similarity value.
CN202110045201.3A 2021-01-14 2021-01-14 Traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning Pending CN112633301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110045201.3A CN112633301A (en) 2021-01-14 2021-01-14 Traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110045201.3A CN112633301A (en) 2021-01-14 2021-01-14 Traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning

Publications (1)

Publication Number Publication Date
CN112633301A (en) 2021-04-09

Family

ID=75294125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110045201.3A Pending CN112633301A (en) 2021-01-14 2021-01-14 Traditional Chinese medicine tongue image greasy feature classification method based on depth metric learning

Country Status (1)

Country Link
CN (1) CN112633301A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657449A (en) * 2021-07-15 2021-11-16 北京工业大学 Traditional Chinese medicine tongue picture greasy classification method containing noise labeling data
CN115953392A (en) * 2023-03-09 2023-04-11 四川博瑞客信息技术有限公司 Tongue body coating quality evaluation method based on artificial intelligence



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination