CN116310545A - Cross-domain tongue image classification method based on depth layering optimal transmission

Cross-domain tongue image classification method based on depth layering optimal transmission

Info

Publication number
CN116310545A
CN116310545A
Authority
CN
China
Prior art keywords
domain
tongue
tongue image
representing
image
Prior art date
Legal status
Pending
Application number
CN202310252527.2A
Other languages
Chinese (zh)
Inventor
文贵华
徐映雪
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202310252527.2A priority Critical patent/CN116310545A/en
Publication of CN116310545A publication Critical patent/CN116310545A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-domain tongue image classification method based on depth layering optimal transmission. The method comprises: collecting tongue images; constructing a machine learning classification model that aligns tongue image features from different domains by means of a depth layering optimal transmission model, the depth layering optimal transmission model comprising a two-layer structure in which the first layer realizes optimal transmission between different domains and the second layer realizes optimal transmission between different samples; and classifying the tongue images according to the depth layering optimal transmission model and outputting the tongue image categories. The method aligns tongue images with different distributions while enhancing the ability to classify them.

Description

Cross-domain tongue image classification method based on depth layering optimal transmission
Technical Field
The invention relates to the technical field of tongue image classification for assisting traditional Chinese medicine diagnosis and treatment, in particular to a cross-domain tongue image classification method based on depth layering optimal transmission.
Background
Existing tongue image classification methods based on machine learning are mostly built on supervised learning. Supervised learning generally assumes that the training set and the test set follow the same distribution, so that a model trained on the training set also performs well on the test set. However, this assumption rarely holds in real applications. The main idea for addressing this problem is to map the two data distributions into a common hidden space between the domains through a nonlinear mapping, thereby reducing the drift between the distributions so that they become more similar after the transformation. This process of nonlinear mapping is known as domain adaptation.
Existing machine-learning-based tongue image classification faces exactly these problems. First, tongue images of different people differ, including in edge texture, color, and so on. Second, the tongue image acquisition devices of different hospitals may differ, and the acquired tongue image data are also affected by the acquisition environment, such as viewing angle and illumination. In addition, hospitals are located in different places, so the individuals whose tongue images are acquired also differ by region. These factors lead to relatively large differences in the distribution of tongue image data from hospital to hospital. If the tongue image data of each hospital is regarded as one domain, the data distributions of the different domains differ, and these differences cause a model trained on the collected dataset to suffer severe performance degradation when deployed to other hospitals. Meanwhile, because the labeling cost of medical data is high, the problem becomes even harder when the target domain has no labeled data, i.e., when only the labels of the source domain are available.
To solve this problem, the distributions of the different domains need to be aligned. The mainstream approach to aligning distributions from different domains consists of two steps: first, a nonlinear transformation brings the two distributions closer; then, using the label information of the source domain, a classifier for the target domain is trained on the transformed distribution, so that the model can generalize to the target domain. This is also the process of inter-domain knowledge transfer. Clearly, finding this nonlinear transformation is the key to solving the domain adaptation problem. In recent years, optimal transmission methods have shown great advantages in domain adaptation: they can directly measure the distance between two marginal distributions without any label information. In the vision field, this distance based on optimal transmission is called the EMD (Earth Mover's Distance). On the one hand, it can compute the distance between two distributions directly on discrete empirical distributions (domains). On the other hand, it provides meaningful gradients even when the support sets of the two domains do not significantly overlap, and therefore does not easily lead to training failure. Furthermore, it has good interpretability, enabling explicit modeling of the coupling between domains.
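For intuition only, and not as part of the claimed method, the discrete optimal transmission (EMD) distance between two empirical distributions can be computed with an off-the-shelf solver. The sketch below assumes the open-source POT library is available and uses random feature vectors purely as placeholders:

    # Minimal sketch: discrete optimal transmission between two empirical distributions.
    # Assumes the POT library (https://pythonot.github.io/); data is random placeholder.
    import numpy as np
    import ot  # POT: Python Optimal Transport

    rng = np.random.default_rng(0)
    xs = rng.normal(0.0, 1.0, size=(64, 128))    # hypothetical source-domain features
    xt = rng.normal(0.5, 1.0, size=(64, 128))    # hypothetical target-domain features

    a = np.full(64, 1.0 / 64)                    # uniform weights on source samples
    b = np.full(64, 1.0 / 64)                    # uniform weights on target samples
    C = ot.dist(xs, xt, metric="euclidean")      # pairwise L2 cost matrix

    emd_value = ot.emd2(a, b, C)                 # exact OT (EMD) via linear programming
    gamma = ot.emd(a, b, C)                      # the optimal coupling (transport plan)
    print(emd_value, gamma.shape)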
By minimizing the transmission cost between the feature distributions of the two domains, both the source domain distribution and the target domain distribution can be transformed into a common hidden space at minimal cost, and the features in this hidden space are domain invariant. This process is called domain alignment. A classifier trained on such features has the ability to transfer to the target domain. However, domain alignment is not the final goal; classification is. In optimal transmission, the cost matrix is usually obtained by computing the Euclidean distance (L2 distance) between every pair of samples. In such a metric space, no meaningful distance is provided when the support sets of two samples do not overlap. In vision problems this manifests as follows: when the backgrounds of two samples are too cluttered, or the intra-class appearance variation is large, images of the same class may end up far apart in this metric space. In other words, the L2 distance is then heavily affected by background changes. Although modeling with neural networks can alleviate this, it requires sufficient training data, which is hard to obtain in real scenarios (especially medical ones); instead, the local features of the target region need to be emphasized. Moreover, the L2 distance, computed on a global representation, destroys the spatial structure of the image features and loses local information, yet local information is exactly what provides discriminative and transferable cues, which is important for classification tasks. In particular, in traditional Chinese medicine physical-sign image datasets, the acquisition process is poorly standardized and background or environmental factors vary greatly. For these reasons, in the existing domain alignment process, while domain-invariant features are obtained, the class distinguishability of the features is also blurred, i.e., excessive alignment occurs.
Therefore, how to avoid the excessive alignment phenomenon in the tongue image classification process and improve the accuracy of tongue image classification are technical problems to be solved by those skilled in the art.
Disclosure of Invention
In order to solve the above problems, the invention discloses a cross-domain tongue image classification method based on depth layering optimal transmission, so that a machine learning model can learn invariant features that are more robust to environmental noise, adapt to tongue image data with different distributions, and achieve improved classification accuracy.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a cross-domain tongue image classification method based on depth layering optimal transmission comprises the following steps:
s1, collecting tongue image samples in a plurality of different fields as a training set;
s2, performing feature extraction on source domain tongue image samples in the training set by using a deep neural network, and obtaining a source domain image feature map formed by corresponding source domain tongue image sample features;
extracting features of the target domain tongue image samples in the training set by using a deep neural network, and obtaining a target domain image feature map formed by corresponding target domain tongue image sample features;
s3, partitioning source domain tongue image sample characteristics in the source domain image characteristic diagram to obtain a source domain image characteristic set corresponding to the source domain tongue image sample;
dividing the characteristics of the target domain tongue image sample in the target domain image characteristic map into blocks to obtain a target domain image characteristic set corresponding to the target domain tongue image sample;
s4, calculating an optimized transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to the target domain tongue image sample, and taking the optimized transmission distance as a sample optimized transmission distance between the source domain tongue image sample and the target domain image sample;
s5, taking the sample optimized transmission distance as a cost measure between a source domain and a target domain, and calculating an inter-domain optimized transmission distance between the source domain and the target domain;
s6, calculating softmax cross entropy loss according to the source domain tongue image sample characteristic value extracted in the step S2, and taking the softmax cross entropy loss as a part of a loss function; constructing a classification loss function by taking the inter-domain optimized transmission distance as another part of the loss function, and training a classifier by using the classification loss function;
s7, classifying the tongue image sample to be verified by using the trained classifier.
In the step S4, an optimized transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to a target domain tongue image sample is calculated, which includes the following steps:
the EMD distance and the L2 distance are jointly used as a cost function between the source domain image feature set and the target domain image feature set, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function specifically comprises the following steps:
$c(x_i^s, x_j^t) = \min_{\gamma^{in} \in \Pi(g(x_i^s),\, g(x_j^t))} \langle \gamma^{in}, C^{in} \rangle_F + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$

wherein g represents the feature extractor of the deep neural network; $g(x_i^s)$ represents the source domain image feature map extracted from the i-th source domain tongue image sample $x_i^s$, with $g(x_i^s) \in \mathbb{R}^{H_i \times W_i \times ch}$, where $H_i$ and $W_i$ respectively represent the two spatial dimensions of the source domain image feature map extracted from the i-th source domain tongue image sample; $g(x_j^t)$ represents the target domain image feature map extracted from the j-th target domain tongue image sample $x_j^t$, with $g(x_j^t) \in \mathbb{R}^{H_j \times W_j \times ch}$, where $H_j$ and $W_j$ respectively represent the two spatial dimensions of the target domain image feature map extracted from the j-th target domain tongue image sample; $\Pi\big(g(x_i^s), g(x_j^t)\big)$ represents the set of joint distributions (couplings) of the source domain image feature map and the target domain image feature map; $\gamma^{in}$ represents the optimal transmission scheme between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets, and $C^{in}$ represents the cost matrix between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets; $\langle \gamma^{in}, C^{in} \rangle_F$ represents the Frobenius dot product of $\gamma^{in}$ and $C^{in}$; $\bar{g}(x_i^s) \in \mathbb{R}^{ch}$ represents the result of global average pooling, along the spatial dimensions, of the source domain image feature map extracted from the i-th source domain tongue image sample, $\bar{g}(x_j^t) \in \mathbb{R}^{ch}$ represents the result of global average pooling, along the spatial dimensions, of the target domain image feature map extracted from the j-th target domain tongue image sample, and ch represents the number of channels.
Preferably, in the step S4, an optimized transmission distance between the source domain image feature set corresponding to each source domain tongue image sample and the target domain image feature set corresponding to the target domain tongue image sample is calculated, and the method further includes the following steps:
the SWD distance and the L2 distance are jointly used as a cost function between the source domain image feature set and the target domain image feature set, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function specifically comprises the following steps:
$c(x_i^s, x_j^t) = \min_{P \in \mathcal{P}} \operatorname{tr}\big( (U_i - P U_j)^{T} (U_i - P U_j) \big) + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$

wherein P represents a permutation matrix, $\mathcal{P}$ represents the set of all permutation matrices, $U_i$ represents the features corresponding to the source domain tongue image sample $x_i^s$ transformed into a common high-dimensional hidden layer space, $U_j$ represents the features corresponding to the target domain tongue image sample $x_j^t$ transformed into the common high-dimensional hidden layer space, and the superscript T is the matrix transpose symbol.
Preferably, in the step S4, an optimized transmission distance between the source domain image feature set corresponding to each source domain tongue image sample and the target domain image feature set corresponding to the target domain tongue image sample is calculated, and the method further includes the following steps:
the SWD distance, the L2 distance and the cross entropy of the class condition distribution difference are jointly used as the cost function between the two feature sets, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function is specifically:
$c(x_i^s, x_j^t) = \frac{\lambda_{swd}}{M} \sum_{m=1}^{M} \big\| \operatorname{sort}(Z_i \theta_m) - \operatorname{sort}(Z_j \theta_m) \big\|_2^2 + \lambda_{l2} \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2 + \lambda_{cond}\, H\big( y_i^s, f(g(x_j^t)) \big)$

wherein $\lambda_{swd}$ represents the balance coefficient of the SWD distance, $\lambda_{l2}$ represents the balance coefficient of the L2 distance, and $\lambda_{cond}$ represents the balance coefficient of the cross entropy of the class condition distribution difference; $y_i^s$ represents the label of the source domain tongue image sample $x_i^s$; $H(\cdot, \cdot)$ represents the cross entropy measuring the difference of the class condition distributions, computed between $y_i^s$ and the classification prediction $f(g(x_j^t))$ for the target domain tongue image sample; M represents the total number of projection matrices; $Z_i$ represents the feature matrix obtained by mapping the source domain tongue image sample $x_i^s$ into the hidden layer space Z, and $Z_j$ represents the feature matrix obtained by mapping the target domain tongue image sample $x_j^t$ into the hidden layer space Z; $\theta_m$ represents the corresponding m-th projection matrix that projects the features of the source domain tongue image sample $x_i^s$ or of the target domain tongue image sample $x_j^t$ in the hidden layer space Z; and $\operatorname{sort}(\cdot)$ denotes sorting the projected features.
Preferably, in the step S5, a mini-batch strategy is adopted, which specifically includes randomly extracting a mini-batch of size n from the source domain tongue image samples and from the target domain tongue image samples each time, and calculating the optimal transmission between the two mini-batches as the inter-domain optimal transmission distance:
$OT_n(\hat{\mu}_s^n, \hat{\mu}_t^n) = \min_{\gamma^n \in \Pi(\hat{\mu}_s^n, \hat{\mu}_t^n)} \langle \gamma^n, C^n \rangle_F$

wherein $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$ are the empirical distributions formed by the n source domain tongue image samples and the n target domain tongue image samples of the mini-batch, respectively; $OT_n$ represents the inter-domain optimized transmission distance; $\Pi(\hat{\mu}_s^n, \hat{\mu}_t^n)$ represents the set of joint distributions of $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$; $\gamma^n$ represents the n×n matrix formed by the optimal transmission scheme between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets; $C^n$ represents the n×n matrix formed by the sample optimized transmission distances between any one source domain tongue image sample and any one target domain image sample; and $\langle \gamma^n, C^n \rangle_F$ represents the Frobenius dot product of $\gamma^n$ and $C^n$.
Preferably, in step S5, calculating the inter-domain optimal transmission distance between the source domain and the target domain further includes using unbalanced optimal transmission.
Preferably, in the step S5, calculating the inter-domain optimized transmission distance between the source domain and the target domain further includes adding a classification cross entropy loss function of the source domain by using unbalanced optimal transmission loss.
Compared with the prior art, the invention discloses a cross-domain tongue image classification method based on depth layering optimal transmission, which has the following beneficial effects:
the tongue images with different distributions are aligned through the optimal transmission of depth layering, and meanwhile, the classification capability is enhanced; the unbalanced optimal transmission is adopted in the optimal transmission of the first layer for field alignment, so that the edge constraint of the optimal transmission is relaxed, and the more robust optimization performance can be provided for small-batch training; for optimal transmission of the second layer, SWD is used instead of EMD distance, enhancing the distinguishing characteristics of the samples, SWD being an approximation of EMD distance but less computationally expensive. The accuracy of tongue image classification is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a depth layering image classification method provided by the invention;
fig. 2 is a schematic structural diagram of a depth layering optimal transmission model provided by the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses a cross-domain tongue image classification method based on depth layering optimal transmission, which comprises the following steps:
a cross-domain tongue image classification method based on depth layering optimal transmission comprises the following steps:
s1, collecting tongue image samples in a plurality of different fields as a training set;
s2, respectively carrying out feature extraction on a source domain tongue image sample and a target domain tongue image sample in a training set by using a deep neural network to obtain a source domain image feature map formed by corresponding source domain tongue image sample features and a target domain image feature map formed by target domain tongue image sample features;
namely: performing feature extraction on source domain tongue image samples in a training set by using a deep neural network to obtain a source domain image feature map formed by corresponding source domain tongue image sample features;
extracting features of the target domain tongue image samples in the training set by using a deep neural network, and obtaining a target domain image feature map formed by corresponding target domain tongue image sample features;
s3, partitioning the source domain image feature map to obtain a source domain image feature set corresponding to the source domain tongue image sample; partitioning the target domain image feature map to obtain a target domain image feature set corresponding to the target domain tongue image sample;
s4, calculating an optimized transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to the target domain tongue image sample, and taking the optimized transmission distance as a sample optimized transmission distance between the source domain tongue image sample and the target domain image sample;
according to the invention, two samples in the optimized transmission distance of the sample come from a source domain tongue image sample and a target domain image sample respectively, so that local information is introduced while field alignment is realized, and the distinguishing property of the characteristics is maintained;
s5, taking the sample optimized transmission distance as a cost measure between a source domain and a target domain, and calculating an inter-domain optimized transmission distance between the source domain and the target domain;
s6, calculating softmax cross entropy loss according to the source domain tongue image sample characteristic value extracted in the step S2, and taking the softmax cross entropy loss as a part of a loss function; constructing a classification loss function by taking the inter-domain optimized transmission distance as another part of the loss function, and training a classifier by using the classification loss function;
s7, classifying the tongue image sample to be verified by using the trained classifier.
Assume that $x_i^s$ and $x_j^t$ are samples drawn respectively from the source domain distribution $\mu_s$ and the target domain distribution $\mu_t$, and that $\Pi(\mu_s, \mu_t)$ is the set of joint distributions of the source domain distribution $\mu_s$ and the target domain distribution $\mu_t$. Let the numbers of samples in the two domains be $N_s$ and $N_t$, and let $C \geq 0$, $C \in \mathbb{R}^{N_s \times N_t}$, be the cost matrix between $\mu_s$ and $\mu_t$, each element of which is calculated by $c(x_i^s, x_j^t)$, the cost between the two samples, which measures their difference; c is a cost function that measures the distance between two samples, typically the L2 distance. Taking $c(x_i^s, x_j^t)$ as the cost measure, the optimized transmission distance between the domains can be calculated. The cost $c(x_i^s, x_j^t)$ can be obtained by the following methods.
In one embodiment, the sample optimized transmission distance between the source domain tongue image sample and the target domain image sample in step S4 may use the EMD distance and the L2 distance in combination as a cost function between the source domain image feature set and the target domain image feature set:
first a feature extractor g x-Z of the base depth neural network is designed, which can map the input to a hidden layer space Z. Meanwhile, a classifier f z- & gt y is designed, and can map the hidden layer space to the label space. The image x can be obtained by the feature extractor g
Figure BDA0004128320780000091
Then, the cost function between the source domain image feature set and the target domain image feature set may become:
Figure BDA0004128320780000092
wherein g represents a feature extractor of the deep neural network;
Figure BDA0004128320780000093
representing a source domain image feature map,>
Figure BDA0004128320780000094
representing source domain tongue image samples,>
Figure BDA0004128320780000095
H i w and W i Respectively representing the width and the height of a source domain image feature map;
Figure BDA0004128320780000096
image feature map representing the target domain->
Figure BDA0004128320780000097
Representing a target domain tongue image sample,>
Figure BDA0004128320780000098
H j w and W j Respectively representing the width and the height of the target domain image feature map;
γ in representing an optimal transmission scheme between two samples with respect to an image feature set, C in Cost matrix, gamma, representing the set of features between two samples with respect to the image in ∈R HiWi×HjWj ;C in ∈R HiWi×HjWj ;<γ in ,C in > F Representing gamma in and Cin Frobenius point multiplication of (2);
Figure BDA0004128320780000099
representing the global average pooling result of the source domain image feature map along the spatial dimension,
Figure BDA00041283207800000910
Figure BDA00041283207800000911
representing the global average pooling result of the target domain image feature map along the spatial dimension,
Figure BDA00041283207800000912
ch represents the number of channels.
The feature extractor may be implemented using a convolutional layer of a convolutional neural network.
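As an illustration only (a sketch under stated assumptions, not the claimed implementation), the sample-level cost of Equation (1) can be computed by treating the spatial positions of a convolutional feature map as a set of patch features, solving an inner EMD between the two patch sets, and adding the L2 distance between the globally pooled features. PyTorch, torchvision and the POT library are assumed; the ResNet-50 backbone and all tensor shapes are illustrative choices:

    # Sketch of the Equation (1)-style cost between one source image and one target image.
    import numpy as np
    import ot
    import torch
    import torchvision

    backbone = torchvision.models.resnet50(weights=None)
    g = torch.nn.Sequential(*list(backbone.children())[:-2])   # keep the spatial feature map

    def sample_cost(x_s, x_t):
        """x_s, x_t: 3xHxW image tensors after the usual normalization."""
        with torch.no_grad():
            f_s = g(x_s.unsqueeze(0))[0]                        # (ch, H_i, W_i)
            f_t = g(x_t.unsqueeze(0))[0]                        # (ch, H_j, W_j)
        zs = f_s.flatten(1).T.numpy().astype(np.float64)        # (H_i*W_i, ch) patch features
        zt = f_t.flatten(1).T.numpy().astype(np.float64)        # (H_j*W_j, ch)
        a = np.full(zs.shape[0], 1.0 / zs.shape[0])             # uniform weights over patches
        b = np.full(zt.shape[0], 1.0 / zt.shape[0])
        C_in = ot.dist(zs, zt)                                  # squared-L2 cost between patches
        emd_term = ot.emd2(a, b, C_in)                          # inner optimal transmission term
        l2_term = float(((f_s.mean(dim=(1, 2)) - f_t.mean(dim=(1, 2))) ** 2).sum())
        return emd_term + l2_term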
In order to further optimize the above technical solution, in another embodiment, step S4 calculates a sample optimized transmission distance between the source domain tongue image sample and the target domain image sample by:
the SWD distance and the L2 distance are jointly used as a cost function between the source domain image feature set and the target domain image feature set, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function specifically comprises the following steps:
$c(x_i^s, x_j^t) = \min_{P \in \mathcal{P}} \operatorname{tr}\big( (U_i - P U_j)^{T} (U_i - P U_j) \big) + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$   (2)

wherein P represents a permutation matrix, $\mathcal{P}$ represents the set of all permutation matrices, $U_i$ represents the features of the sample $x_i^s$ transformed into a common high-dimensional hidden layer space, $U_j$ represents the features of the sample $x_j^t$ transformed into the same space, and the superscript T is the matrix transpose symbol.

In a mini-batch of size n, Equation (1) would have to be computed once for every pair consisting of a source domain tongue image sample and a target domain image sample, and the computation cost is too high. Equation (2) therefore uses the SWD distance (Sliced Wasserstein Distance) as an approximation of the EMD distance when computing the cost function between the source domain image feature set and the target domain image feature set. Equation (2) introduces a permutation matrix P to match the different regions of the two images; P encodes the association between the regions of the two images, and $\mathcal{P}$ represents the set of all permutation matrices. $U_i \in \mathbb{R}^{H_i W_i \times d}$ represents the features of the sample $x_i^s$ transformed into a d-dimensional common hidden space.
In order to further optimize the above technical solution, in another embodiment, in step S4, another method for calculating the optimal transmission distance between the source domain tongue image sample and the target domain image sample is as follows:
the matching problem in equation (2) is NP-hard, time-complex too high, in order to solve this problem more efficiently, so equation (2) will be approximated by the following algorithm:
$c(x_i^s, x_j^t) = \frac{1}{M} \sum_{m=1}^{M} \big\| \operatorname{sort}(Z_i \theta_m) - \operatorname{sort}(Z_j \theta_m) \big\|_2^2 + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$   (3)

wherein the projection set $\{\theta_m\}_{m=1}^{M}$ contains the M projection matrices. Here the permutation matrix P does not need to be found explicitly; it suffices to sort the projected regions and compute the distances between the sorted projections. The sample optimal transmission distance between the source domain tongue image sample and the target domain image sample is thus approximately computed by Equation (3), and the cost of each evaluation of $c(x_i^s, x_j^t)$ is reduced from $O(N^3)$ (if solved by linear programming) or $O(JN^2)$ (if solved by the Sinkhorn scaling algorithm, where J is the number of Sinkhorn iterations) to $O(MN)$, where N is the problem size. Since the feature map extracted by the depth feature extractor g is not large and the number of partitioned regions is limited, N is relatively small and the computation cost does not increase much.
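Purely as an illustration of the sliced approximation in Equation (3) (a sketch, with the number of projections and all shapes chosen arbitrarily, assuming PyTorch):

    # Project the two patch-feature sets onto M random directions, sort each projection,
    # average the squared L2 distances, then add the global-average-pooling L2 term.
    import torch

    def sliced_cost(z_i: torch.Tensor, z_j: torch.Tensor, num_proj: int = 32) -> torch.Tensor:
        """z_i: (N_i, ch) patch features of one source image; z_j: (N_j, ch) of one target image."""
        ch = z_i.shape[1]
        theta = torch.randn(ch, num_proj)
        theta = theta / theta.norm(dim=0, keepdim=True)          # M unit projection directions
        proj_i = torch.sort(z_i @ theta, dim=0).values           # sort each 1-D projection
        proj_j = torch.sort(z_j @ theta, dim=0).values
        # If N_i != N_j the sorted sequences would need resampling; equal sizes assumed here.
        swd = ((proj_i - proj_j) ** 2).sum(dim=0).mean()          # average over the M projections
        l2 = ((z_i.mean(dim=0) - z_j.mean(dim=0)) ** 2).sum()     # pooled-feature L2 term
        return swd + l2

    # Example with an equal number of regions per image:
    cost = sliced_cost(torch.randn(49, 2048), torch.randn(49, 2048))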
The SWD distance and the L2 distance here measure the difference of the marginal distributions, while the difference of the class condition distributions is measured by the cross entropy between sample labels. Since the target domain samples $x_j^t$ carry no label information, the predicted label $f(g(x_j^t))$ is used as a proxy. By jointly aligning the marginal distributions and the class condition distributions, more class information is introduced and the distinguishability between classes is improved. Equation (3) is therefore converted into:
$c(x_i^s, x_j^t) = \frac{\lambda_{swd}}{M} \sum_{m=1}^{M} \big\| \operatorname{sort}(Z_i \theta_m) - \operatorname{sort}(Z_j \theta_m) \big\|_2^2 + \lambda_{l2} \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2 + \lambda_{cond}\, H\big( y_i^s, f(g(x_j^t)) \big)$   (4)

wherein $\lambda_{swd}$ represents the balance coefficient of the SWD distance, $\lambda_{l2}$ represents the balance coefficient of the L2 distance, and $\lambda_{cond}$ represents the balance coefficient of the cross entropy of the class condition distribution difference; M represents the total number of projection matrices; $Z_i$ represents the feature matrix obtained by mapping the source domain tongue image sample $x_i^s$ into the hidden layer space Z, and $Z_j$ represents the feature matrix obtained by mapping the target domain tongue image sample $x_j^t$ into the hidden layer space Z; $\theta_m$ represents the corresponding m-th projection matrix that projects the features of the source domain tongue image sample $x_i^s$ or of the target domain tongue image sample $x_j^t$ in the hidden layer space Z; $y_i^s$ represents the label of the source domain tongue image sample $x_i^s$; and $H(\cdot, \cdot)$ represents the cross entropy measuring the difference of the class condition distributions.
Equation (4) contains three terms. The first term is the SWD distance between the source domain tongue image sample $x_i^s$ and the target domain image sample $x_j^t$. Specifically, the features $Z_i$ of $x_i^s$ and the features $Z_j$ of $x_j^t$ are projected into multiple shared spaces; in each shared space, the feature subsets of $Z_i$ and $Z_j$ are sorted and the Euclidean distance between them is computed directly, giving the distance between $Z_i$ and $Z_j$ in that shared space. Finally, the distances computed in the multiple shared spaces are averaged to obtain the SWD distance between $Z_i$ and $Z_j$. The second term, $\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \|_2^2$, adds a global average pooling operation to the function g, which is equivalent to performing global average pooling on $Z_i$ and $Z_j$; this term therefore computes the L2 distance between the global average pooling result of $Z_i$ and that of $Z_j$. The third term computes the cross entropy between the label corresponding to $x_i^s$ and the classification prediction for $x_j^t$, and represents the difference of the class condition distributions. Finally, Equation (4) weights and sums the three terms using three hyper-parameters as balance coefficients.

Notably, the three terms of Equation (4) are complementary: the SWD distance and the L2 distance provide local and global complementary information; the SWD distance and the L2 distance both describe differences of the marginal distributions, while the cross entropy term measures the difference of the class condition distributions. Through the mutual complementarity of the three terms, domain alignment can be performed while the class distinguishability is maintained, thereby improving classification performance.
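For illustration only, the three terms of Equation (4) can be combined as in the sketch below, which reuses the sliced_cost idea above and adds the class-conditional cross-entropy term with the predicted target label as a proxy; the classifier outputs, balance coefficients and shapes are assumptions, not the claimed implementation:

    import torch
    import torch.nn.functional as F

    def combined_cost(z_i, z_j, pooled_i, pooled_j, y_i, logits_j,
                      lam_swd=0.001, lam_l2=0.001, lam_cond=1.0, num_proj=32):
        """z_i, z_j: (N, ch) patch features; pooled_i, pooled_j: (ch,) pooled features;
        y_i: source label (int); logits_j: classifier logits f(g(x_j^t)) for the target image."""
        ch = z_i.shape[1]
        theta = torch.randn(ch, num_proj)
        theta = theta / theta.norm(dim=0, keepdim=True)
        swd = ((torch.sort(z_i @ theta, dim=0).values
                - torch.sort(z_j @ theta, dim=0).values) ** 2).sum(dim=0).mean()
        l2 = ((pooled_i - pooled_j) ** 2).sum()
        cond = F.cross_entropy(logits_j.unsqueeze(0), torch.tensor([y_i]))  # proxy-label term
        return lam_swd * swd + lam_l2 * l2 + lam_cond * cond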
In this embodiment, the network structure of the feature extractor is ResNet-50, and the feature map extracted by the feature extractor g retains its spatial structure for calculating the SWD distance. The feature extractor g is first pre-trained on ImageNet, while the classifier f is trained from scratch, so the learning rate of the classifier is 10 times that of the feature extractor g. The balance coefficients in the loss function are set to $\lambda_{swd} = 0.001$, $\lambda_{l2} = 0.001$ and $\lambda_{cond} = 1.0$.
The optimizer of this embodiment is SGD with a momentum of 0.9. The learning rate follows a linearly changing schedule. The mini-batch size is 65, and training runs for 10000 iterations.
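A sketch of this training setup, under the assumption that PyTorch and torchvision are used (the number of classes and base learning rate below are placeholders, not values from the embodiment):

    import torch
    import torchvision

    num_classes = 4            # hypothetical number of tongue image categories
    backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")   # ImageNet pre-training
    g = torch.nn.Sequential(*list(backbone.children())[:-2])          # keeps the spatial feature map
    f = torch.nn.Linear(2048, num_classes)                            # classifier trained from scratch

    base_lr = 0.001
    optimizer = torch.optim.SGD(
        [{"params": g.parameters(), "lr": base_lr},
         {"params": f.parameters(), "lr": 10 * base_lr}],             # classifier lr is 10x backbone lr
        momentum=0.9,
    )
    # A linearly changing learning-rate schedule, mini-batches of 65 samples and 10000
    # iterations would then be configured around this optimizer in the training loop.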
Step S5, calculating the inter-domain optimized transmission distance between the source domain and the target domain according to the sample optimized transmission distance as a cost measure between the source domain and the target domain;
Optimized transmission (Optimal Transport, OT) is a method for measuring the distance between two probability distributions that can exploit the geometry of the distributions. In general, OT searches over the possible couplings $\gamma \in \Pi(\mu_s, \mu_t)$ of the two distributions $\mu_s$ and $\mu_t$ for the coupling scheme with the minimum transmission cost:

$OT(\mu_s, \mu_t) = \min_{\gamma \in \Pi(\mu_s, \mu_t)} \mathbb{E}_{(x_i^s, x_j^t) \sim \gamma}\big[ c(x_i^s, x_j^t) \big]$   (5)

wherein $x_i^s \sim \mu_s$ and $x_j^t \sim \mu_t$ are any two samples drawn respectively from the source domain distribution $\mu_s$ and the target domain distribution $\mu_t$; $c(x_i^s, x_j^t)$ is the cost between the two samples, which measures their difference; and $\Pi(\mu_s, \mu_t)$ is the set of joint distributions of the marginal distributions $\mu_s$ and $\mu_t$. The discrete form of OT over empirical distributions can be defined as:

$OT(\hat{\mu}_s, \hat{\mu}_t) = \min_{\gamma \in \Pi(\hat{\mu}_s, \hat{\mu}_t)} \langle \gamma, C \rangle_F$   (6)

wherein $\hat{\mu}_s$ and $\hat{\mu}_t$ are positive vectors (empirical distributions), and $\langle \cdot, \cdot \rangle_F$ is the Frobenius dot product. Let the numbers of samples in the two domains be $N_s$ and $N_t$; $C \geq 0$, $C \in \mathbb{R}^{N_s \times N_t}$, is the cost matrix between $\hat{\mu}_s$ and $\hat{\mu}_t$, each element of which is calculated by $c(x_i^s, x_j^t)$. Here c is a cost function that measures the distance between two samples, typically the L2 distance. By optimizing Equation (6), i.e., minimizing the transmission cost, the optimal transmission plan $\gamma^*$ is obtained. Equation (6) can be solved by linear programming.
In one embodiment, step S5, calculating an inter-domain optimal transmission distance between a source domain and a target domain includes adopting a mini-batch policy, specifically including randomly extracting a mini-batch of size n from each domain each time, and calculating an optimal transmission between the two mini-batches as a proxy optimal transmission between domains:
$OT_n(\hat{\mu}_s^n, \hat{\mu}_t^n) = \min_{\gamma^n \in \Pi(\hat{\mu}_s^n, \hat{\mu}_t^n)} \langle \gamma^n, C^n \rangle_F$   (7)

wherein $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$ are the empirical distributions of the n source domain samples and the n target domain samples in the mini-batch, $\gamma^n$ is the n×n transmission plan, and $C^n$ is the n×n cost matrix. Considering the computation cost of Equation (6), the invention randomly extracts a mini-batch of size n from each domain each time and calculates the optimal transmission between the two mini-batches as a proxy for the optimal transmission between the domains, i.e., Equation (6) is converted into Equation (7), in which each element of $C^n$ is itself computed as a sample optimized transmission distance of the form of Equation (6) between the image feature sets of the corresponding two samples, thereby forming a hierarchical optimal transmission model.
As an improved technical solution, in another embodiment, step S5, another method of calculating an inter-domain optimized transmission distance between a source domain and a target domain uses a mini-batch based unbalanced optimal transmission method instead of formula (7):
$UOT_n(\hat{\mu}_s^n, \hat{\mu}_t^n) = \min_{\gamma^n \geq 0} \langle \gamma^n, C^n \rangle_F + \tau\, D_\phi(\gamma^n_1 \,\|\, \hat{\mu}_s^n) + \tau\, D_\phi(\gamma^n_2 \,\|\, \hat{\mu}_t^n) + \epsilon\, \mathrm{KL}\big( \gamma^n \,\|\, \hat{\mu}_s^n \otimes \hat{\mu}_t^n \big)$   (8)

wherein $D_\phi$ is a Csiszár divergence and KL is the Kullback-Leibler divergence; $\gamma^n_1$ and $\gamma^n_2$ are the marginal distributions of $\gamma^n$. Here τ is the marginal penalization coefficient (Marginal Penalization) and ε ≥ 0 is the regularization coefficient (Regularization Coefficient); specifically, ε = 0.01 and τ = 0.5 can be set.
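As an illustration of the mini-batch unbalanced optimal transmission in Equation (8), a sketch using the unbalanced Sinkhorn solver of the POT library is given below; the choice of solver is an assumption about tooling, not the patent's implementation:

    import numpy as np
    import ot

    def minibatch_uot(C_n, eps=0.01, tau=0.5):
        n = C_n.shape[0]
        a = np.full(n, 1.0 / n)
        b = np.full(n, 1.0 / n)
        # Relaxed-marginal OT plan; reg is the entropic coefficient, reg_m the marginal penalty.
        gamma_n = ot.unbalanced.sinkhorn_knopp_unbalanced(a, b, C_n, reg=eps, reg_m=tau)
        return float(np.sum(gamma_n * C_n)), gamma_n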
Thus, in each mini-batch, each domain is a collection of samples. Equation (8) serves as the first-layer optimal transmission between the source domain and the target domain, while each element of the cost matrix $C^n$ in Equation (8) is calculated by Equation (4) from the corresponding pair consisting of a source domain tongue image sample and a target domain image sample; the second layer is thus an optimal transmission between a source domain tongue image sample and a target domain image sample, where each sample is a collection of spatial regions of its image feature map. The two layers of optimal transmission together form the depth hierarchical optimal transmission model (Deep Hierarchical Optimal Transport, DeepHOT), which is optimized as shown in fig. 2. Thus, for a given mini-batch, the target problem of DeepHOT is:

$UOT_n(\hat{\mu}_s^n, \hat{\mu}_t^n) \ \text{with} \ C^n_{ij} = c(x_i^s, x_j^t) \ \text{as defined in Equation (4)}$   (9)

wherein $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$ are the empirical distributions of the source domain mini-batch and the target domain mini-batch, and each element $C^n_{ij}$ of the cost matrix is the sample optimized transmission distance between the i-th source domain tongue image sample and the j-th target domain image sample.
as an improved technical solution, in another embodiment, step S5, another method calculates an inter-domain optimized transmission distance between the source domain and the target domain, and increases the cross-class entropy loss function of the source domain by using unbalanced optimal transmission loss:
Figure BDA0004128320780000146
the objective is to avoid the problem of "catastrophic forgetting" (Catastrophic Forgetting) on the source domain, and the final optimization objective includes a classification cross entropy loss L of the source domain.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A cross-domain tongue image classification method based on depth layering optimal transmission, which is characterized by comprising the following steps:
s1, collecting tongue image samples in a plurality of different fields as a training set;
s2, performing feature extraction on source domain tongue image samples in the training set by using a deep neural network, and obtaining a source domain image feature map formed by corresponding source domain tongue image sample features;
extracting features of the target domain tongue image samples in the training set by using a deep neural network, and obtaining a target domain image feature map formed by corresponding target domain tongue image sample features;
s3, partitioning source domain tongue image sample characteristics in the source domain image characteristic diagram to obtain a source domain image characteristic set corresponding to the source domain tongue image sample;
dividing the characteristics of the target domain tongue image sample in the target domain image characteristic map into blocks to obtain a target domain image characteristic set corresponding to the target domain tongue image sample;
s4, calculating an optimized transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to the target domain tongue image sample, and taking the optimized transmission distance as a sample optimized transmission distance between the source domain tongue image sample and the target domain image sample;
s5, taking the sample optimized transmission distance as a cost measure between a source domain and a target domain, and calculating an inter-domain optimized transmission distance between the source domain and the target domain;
s6, calculating softmax cross entropy loss according to the source domain tongue image sample characteristic value extracted in the step S2, and taking the softmax cross entropy loss as a part of a loss function; constructing a classification loss function by taking the inter-domain optimized transmission distance as another part of the loss function, and training a classifier by using the classification loss function;
s7, classifying the tongue image sample to be verified by using the trained classifier.
2. The method for classifying cross-domain tongue images based on depth layering optimal transmission according to claim 1, wherein the step S4 of calculating the optimal transmission distance between the source domain image feature set corresponding to each source domain tongue image sample and the target domain image feature set corresponding to the target domain tongue image sample comprises the following steps:
the EMD distance and the L2 distance are jointly used as a cost function between the source domain image feature set and the target domain image feature set, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function specifically comprises the following steps:
$c(x_i^s, x_j^t) = \min_{\gamma^{in} \in \Pi(g(x_i^s),\, g(x_j^t))} \langle \gamma^{in}, C^{in} \rangle_F + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$

wherein g represents the feature extractor of the deep neural network; $g(x_i^s)$ represents the source domain image feature map extracted from the i-th source domain tongue image sample $x_i^s$, with $g(x_i^s) \in \mathbb{R}^{H_i \times W_i \times ch}$, where $H_i$ and $W_i$ respectively represent the two spatial dimensions of the source domain image feature map extracted from the i-th source domain tongue image sample; $g(x_j^t)$ represents the target domain image feature map extracted from the j-th target domain tongue image sample $x_j^t$, with $g(x_j^t) \in \mathbb{R}^{H_j \times W_j \times ch}$, where $H_j$ and $W_j$ respectively represent the two spatial dimensions of the target domain image feature map extracted from the j-th target domain tongue image sample; $\Pi\big(g(x_i^s), g(x_j^t)\big)$ represents the set of joint distributions (couplings) of the source domain image feature map and the target domain image feature map; $\gamma^{in}$ represents the optimal transmission scheme between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets, and $C^{in}$ represents the cost matrix between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets; $\langle \gamma^{in}, C^{in} \rangle_F$ represents the Frobenius dot product of $\gamma^{in}$ and $C^{in}$; $\bar{g}(x_i^s) \in \mathbb{R}^{ch}$ represents the result of global average pooling, along the spatial dimensions, of the source domain image feature map extracted from the i-th source domain tongue image sample, $\bar{g}(x_j^t) \in \mathbb{R}^{ch}$ represents the result of global average pooling, along the spatial dimensions, of the target domain image feature map extracted from the j-th target domain tongue image sample, and ch represents the number of channels.
3. The method for classifying cross-domain tongue images based on depth layering optimal transmission according to claim 2, wherein in step S4, an optimal transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to a target domain tongue image sample is calculated, and further comprising the following steps:
the SWD distance and the L2 distance are jointly used as a cost function between the source domain image feature set and the target domain image feature set, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function specifically comprises the following steps:
$c(x_i^s, x_j^t) = \min_{P \in \mathcal{P}} \operatorname{tr}\big( (U_i - P U_j)^{T} (U_i - P U_j) \big) + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$

wherein P represents a permutation matrix, $\mathcal{P}$ represents the set of all permutation matrices, $U_i$ represents the features corresponding to the source domain tongue image sample $x_i^s$ transformed into a common high-dimensional hidden layer space, $U_j$ represents the features corresponding to the target domain tongue image sample $x_j^t$ transformed into the common high-dimensional hidden layer space, and the superscript T is the matrix transpose symbol.
4. A method for classifying a cross-domain tongue image based on depth layering optimal transmission according to claim 3, wherein in the step S4, an optimal transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to a target domain tongue image sample is calculated, and further comprising the following steps:
the SWD distance, the L2 distance and the cross entropy of the class condition distribution difference are jointly used as the cost function between the two feature sets, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function is specifically:
$c(x_i^s, x_j^t) = \frac{\lambda_{swd}}{M} \sum_{m=1}^{M} \big\| \operatorname{sort}(Z_i \theta_m) - \operatorname{sort}(Z_j \theta_m) \big\|_2^2 + \lambda_{l2} \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2 + \lambda_{cond}\, H\big( y_i^s, f(g(x_j^t)) \big)$

wherein $\lambda_{swd}$ represents the balance coefficient of the SWD distance, $\lambda_{l2}$ represents the balance coefficient of the L2 distance, and $\lambda_{cond}$ represents the balance coefficient of the cross entropy of the class condition distribution difference; $y_i^s$ represents the label of the source domain tongue image sample $x_i^s$; $H(\cdot, \cdot)$ represents the cross entropy measuring the difference of the class condition distributions, computed between $y_i^s$ and the classification prediction $f(g(x_j^t))$ for the target domain tongue image sample; M represents the total number of projection matrices; $Z_i$ represents the feature matrix obtained by mapping the source domain tongue image sample $x_i^s$ into the hidden layer space Z, and $Z_j$ represents the feature matrix obtained by mapping the target domain tongue image sample $x_j^t$ into the hidden layer space Z; and $\theta_m$ represents the corresponding m-th projection matrix that projects the features of the source domain tongue image sample $x_i^s$ or of the target domain tongue image sample $x_j^t$ in the hidden layer space Z.
5. The method for classifying a cross-domain tongue image based on depth layering optimal transmission according to claim 4, wherein in step S5, an inter-domain optimal transmission distance between a source domain and a target domain is calculated, including using a mini-batch policy, specifically including,
each time randomly extracting a mini-batch of size n from the source domain tongue image samples and from the target domain tongue image samples, and calculating the optimal transmission between the two mini-batches as the inter-domain optimal transmission distance:
$OT_n(\hat{\mu}_s^n, \hat{\mu}_t^n) = \min_{\gamma^n \in \Pi(\hat{\mu}_s^n, \hat{\mu}_t^n)} \langle \gamma^n, C^n \rangle_F$

wherein $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$ are the empirical distributions formed by the n source domain tongue image samples and the n target domain tongue image samples of the mini-batch, respectively; $OT_n$ represents the inter-domain optimized transmission distance; $\Pi(\hat{\mu}_s^n, \hat{\mu}_t^n)$ represents the set of joint distributions of $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$; $\gamma^n$ represents the n×n matrix formed by the optimal transmission scheme between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets; $C^n$ represents the n×n matrix formed by the sample optimized transmission distances between any one source domain tongue image sample and any one target domain image sample; and $\langle \gamma^n, C^n \rangle_F$ represents the Frobenius dot product of $\gamma^n$ and $C^n$.
6. A method for classifying a tongue image across domains based on depth layering optimal transmission according to claim 5, wherein in step S5, calculating an inter-domain optimal transmission distance between the source domain and the target domain further comprises using unbalanced optimal transmission.
7. A method for classifying a cross-domain tongue image based on depth layering optimal transmission according to claim 1, wherein in step S5, calculating an inter-domain optimal transmission distance between the source domain and the target domain further comprises adding a classification cross entropy loss function of the source domain using an unbalanced optimal transmission loss.
CN202310252527.2A 2023-03-16 2023-03-16 Cross-domain tongue image classification method based on depth layering optimal transmission Pending CN116310545A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310252527.2A CN116310545A (en) 2023-03-16 2023-03-16 Cross-domain tongue image classification method based on depth layering optimal transmission

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310252527.2A CN116310545A (en) 2023-03-16 2023-03-16 Cross-domain tongue image classification method based on depth layering optimal transmission

Publications (1)

Publication Number Publication Date
CN116310545A true CN116310545A (en) 2023-06-23

Family

ID=86781093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310252527.2A Pending CN116310545A (en) 2023-03-16 2023-03-16 Cross-domain tongue image classification method based on depth layering optimal transmission

Country Status (1)

Country Link
CN (1) CN116310545A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116566743A (en) * 2023-07-05 2023-08-08 北京理工大学 Account alignment method, equipment and storage medium
CN116566743B (en) * 2023-07-05 2023-09-08 北京理工大学 Account alignment method, equipment and storage medium

Similar Documents

Publication Publication Date Title
Lu et al. Class-agnostic counting
Krebs et al. Unsupervised probabilistic deformation modeling for robust diffeomorphic registration
CN107704877B (en) Image privacy perception method based on deep learning
Xu et al. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering
Boyda et al. Deploying a quantum annealing processor to detect tree cover in aerial imagery of California
Papa et al. Efficient supervised optimum-path forest classification for large datasets
US11494616B2 (en) Decoupling category-wise independence and relevance with self-attention for multi-label image classification
CN110516095B (en) Semantic migration-based weak supervision deep hash social image retrieval method and system
Wang Online Learning Behavior Analysis Based on Image Emotion Recognition.
Gao et al. Small sample classification of hyperspectral image using model-agnostic meta-learning algorithm and convolutional neural network
Gong et al. A coupling translation network for change detection in heterogeneous images
Liu et al. Generative self-training for cross-domain unsupervised tagged-to-cine mri synthesis
Shu et al. LVC-Net: Medical image segmentation with noisy label based on local visual cues
CN113298129B (en) Polarized SAR image classification method based on superpixel and graph convolution network
CN111126464A (en) Image classification method based on unsupervised domain confrontation field adaptation
Ning et al. Conditional generative adversarial networks based on the principle of homologycontinuity for face aging
CN114692732A (en) Method, system, device and storage medium for updating online label
Alshehri A content-based image retrieval method using neural network-based prediction technique
CN116310545A (en) Cross-domain tongue image classification method based on depth layering optimal transmission
Franchi et al. Latent discriminant deterministic uncertainty
Huynh et al. Joint age estimation and gender classification of Asian faces using wide ResNet
Chen et al. A robust automatic clustering algorithm for probability density functions with application to categorizing color images
Huang et al. An evidential combination method with multi-color spaces for remote sensing image scene classification
CN114612658A (en) Image semantic segmentation method based on dual-class-level confrontation network
Yun et al. Land cover classification based on tolerant rough set

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination