CN116310545A - Cross-domain tongue image classification method based on depth layering optimal transmission

Cross-domain tongue image classification method based on depth layering optimal transmission

Info

Publication number
CN116310545A
CN116310545A
Authority
CN
China
Prior art keywords
domain
tongue
tongue image
representing
image
Prior art date
Legal status
Pending
Application number
CN202310252527.2A
Other languages
Chinese (zh)
Inventor
文贵华
徐映雪
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202310252527.2A priority Critical patent/CN116310545A/en
Publication of CN116310545A publication Critical patent/CN116310545A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-domain tongue image classification method based on depth layering optimal transmission. The method comprises: collecting tongue images; constructing a machine learning classification model that aligns tongue image features from different domains by means of a depth layering optimal transmission model, the depth layering optimal transmission model comprising a two-layer structure in which the first layer realizes optimal transmission between different domains and the second layer realizes optimal transmission between different samples; and classifying the tongue images according to the depth layering optimal transmission model and outputting the tongue image categories. The method aligns tongue images with different distributions while enhancing the ability to classify them.

Description

Cross-domain tongue image classification method based on depth layering optimal transmission
Technical Field
The invention relates to the technical field of tongue image classification for assisting traditional Chinese medicine diagnosis and treatment, in particular to a cross-domain tongue image classification method based on depth layering optimal transmission.
Background
Existing tongue image classification methods based on machine learning are mostly built on supervised learning. Supervised learning generally assumes that the training set and the test set follow the same distribution, so that a model trained on the training set also performs well on the test set. However, this assumption rarely holds in real applications. The main idea for addressing this problem is to map the two data distributions into a common hidden space between the domains through a nonlinear mapping, thereby reducing the drift between the distributions so that they become more similar after the transformation. This process of nonlinear mapping is known as domain adaptation.
Existing machine-learning-based tongue image classification faces exactly these problems. First, tongue images of different people differ, including in edge texture, color, and so on. Second, the tongue image acquisition devices of different hospitals may differ, and the acquired tongue image data are also affected by the acquisition environment, such as viewing angle and illumination. In addition, hospitals are located in different places, so the individuals whose tongue images are acquired also differ by region. These factors lead to relatively large differences in the distribution of tongue image data from hospital to hospital. If the tongue image data of each hospital is regarded as one domain, the data distributions of the different domains differ, and these differences cause a model trained on the collected dataset to suffer severe performance degradation when deployed to other hospitals. Meanwhile, because the labeling cost of medical data is high, the problem becomes even harder when the target domain has no labeled data, i.e., when only the labels of the source domain are available.
To solve this problem, the distributions of the different domains need to be aligned. The mainstream approach to aligning distributions from different domains consists of two steps: first, a nonlinear transformation brings the two distributions closer; then, using the label information of the source domain, a classifier for the target domain is trained on the transformed distribution, so that the model can generalize to the target domain. This is also the process of inter-domain knowledge transfer. Clearly, finding this nonlinear transformation is the key to solving the domain adaptation problem. In recent years, optimal transmission methods have shown great advantages in domain adaptation: they can directly measure the distance between two marginal distributions without any label information. In the vision field, this distance based on optimal transmission is called the EMD (Earth Mover's Distance). On the one hand, it can compute the distance between two distributions directly on discrete empirical distributions (domains). On the other hand, it provides meaningful gradients even when the support sets of the two domains do not significantly overlap, and therefore does not easily lead to training failure. Furthermore, it has good interpretability, enabling explicit modeling of the coupling between domains.
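For intuition only, and not as part of the claimed method, the discrete optimal transmission (EMD) distance between two empirical distributions can be computed with an off-the-shelf solver. The sketch below assumes the open-source POT library is available and uses random feature vectors purely as placeholders:

    # Minimal sketch: discrete optimal transmission between two empirical distributions.
    # Assumes the POT library (https://pythonot.github.io/); data is random placeholder.
    import numpy as np
    import ot  # POT: Python Optimal Transport

    rng = np.random.default_rng(0)
    xs = rng.normal(0.0, 1.0, size=(64, 128))    # hypothetical source-domain features
    xt = rng.normal(0.5, 1.0, size=(64, 128))    # hypothetical target-domain features

    a = np.full(64, 1.0 / 64)                    # uniform weights on source samples
    b = np.full(64, 1.0 / 64)                    # uniform weights on target samples
    C = ot.dist(xs, xt, metric="euclidean")      # pairwise L2 cost matrix

    emd_value = ot.emd2(a, b, C)                 # exact OT (EMD) via linear programming
    gamma = ot.emd(a, b, C)                      # the optimal coupling (transport plan)
    print(emd_value, gamma.shape)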
By minimizing the transmission cost between the feature distributions of the two domains, both the source domain distribution and the target domain distribution can be transformed into a common hidden space at minimal cost, and the features in this hidden space are domain invariant. This process is called domain alignment. A classifier trained on such features has the ability to transfer to the target domain. However, domain alignment is not the final goal; classification is. In optimal transmission, the cost matrix is usually obtained by computing the Euclidean distance (L2 distance) between every pair of samples. In such a metric space, no meaningful distance is provided when the support sets of two samples do not overlap. In vision problems this manifests as follows: when the backgrounds of two samples are too cluttered, or the intra-class appearance variation is large, images of the same class may end up far apart in this metric space. In other words, the L2 distance is then heavily affected by background changes. Although modeling with neural networks can alleviate this, it requires sufficient training data, which is hard to obtain in real scenarios (especially medical ones); instead, the local features of the target region need to be emphasized. Moreover, the L2 distance, computed on a global representation, destroys the spatial structure of the image features and loses local information, yet local information is exactly what provides discriminative and transferable cues, which is important for classification tasks. In particular, in traditional Chinese medicine physical-sign image datasets, the acquisition process is poorly standardized and background or environmental factors vary greatly. For these reasons, in the existing domain alignment process, while domain-invariant features are obtained, the class distinguishability of the features is also blurred, i.e., excessive alignment occurs.
Therefore, how to avoid the excessive alignment phenomenon in the tongue image classification process and improve the accuracy of tongue image classification are technical problems to be solved by those skilled in the art.
Disclosure of Invention
In order to solve the above problems, the invention discloses a cross-domain tongue image classification method based on depth layering optimal transmission, so that a machine learning model can learn invariant features that are more robust to environmental noise, adapt to tongue image data with different distributions, and achieve improved classification accuracy.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a cross-domain tongue image classification method based on depth layering optimal transmission comprises the following steps:
s1, collecting tongue image samples in a plurality of different fields as a training set;
s2, performing feature extraction on source domain tongue image samples in the training set by using a deep neural network, and obtaining a source domain image feature map formed by corresponding source domain tongue image sample features;
extracting features of the target domain tongue image samples in the training set by using a deep neural network, and obtaining a target domain image feature map formed by corresponding target domain tongue image sample features;
s3, partitioning source domain tongue image sample characteristics in the source domain image characteristic diagram to obtain a source domain image characteristic set corresponding to the source domain tongue image sample;
dividing the characteristics of the target domain tongue image sample in the target domain image characteristic map into blocks to obtain a target domain image characteristic set corresponding to the target domain tongue image sample;
s4, calculating an optimized transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to the target domain tongue image sample, and taking the optimized transmission distance as a sample optimized transmission distance between the source domain tongue image sample and the target domain image sample;
s5, taking the sample optimized transmission distance as a cost measure between a source domain and a target domain, and calculating an inter-domain optimized transmission distance between the source domain and the target domain;
s6, calculating softmax cross entropy loss according to the source domain tongue image sample characteristic value extracted in the step S2, and taking the softmax cross entropy loss as a part of a loss function; constructing a classification loss function by taking the inter-domain optimized transmission distance as another part of the loss function, and training a classifier by using the classification loss function;
s7, classifying the tongue image sample to be verified by using the trained classifier.
In the step S4, an optimized transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to a target domain tongue image sample is calculated, which includes the following steps:
the EMD distance and the L2 distance are jointly used as a cost function between the source domain image feature set and the target domain image feature set, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function specifically comprises the following steps:
$c(x_i^s, x_j^t) = \min_{\gamma^{in} \in \Pi(g(x_i^s),\, g(x_j^t))} \langle \gamma^{in}, C^{in} \rangle_F + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$

wherein g represents the feature extractor of the deep neural network; $g(x_i^s)$ represents the source domain image feature map extracted from the i-th source domain tongue image sample $x_i^s$, with $g(x_i^s) \in \mathbb{R}^{H_i \times W_i \times ch}$, where $H_i$ and $W_i$ respectively represent the two spatial dimensions of the source domain image feature map extracted from the i-th source domain tongue image sample; $g(x_j^t)$ represents the target domain image feature map extracted from the j-th target domain tongue image sample $x_j^t$, with $g(x_j^t) \in \mathbb{R}^{H_j \times W_j \times ch}$, where $H_j$ and $W_j$ respectively represent the two spatial dimensions of the target domain image feature map extracted from the j-th target domain tongue image sample; $\Pi\big(g(x_i^s), g(x_j^t)\big)$ represents the set of joint distributions (couplings) of the source domain image feature map and the target domain image feature map; $\gamma^{in}$ represents the optimal transmission scheme between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets, and $C^{in}$ represents the cost matrix between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets; $\langle \gamma^{in}, C^{in} \rangle_F$ represents the Frobenius dot product of $\gamma^{in}$ and $C^{in}$; $\bar{g}(x_i^s) \in \mathbb{R}^{ch}$ represents the result of global average pooling, along the spatial dimensions, of the source domain image feature map extracted from the i-th source domain tongue image sample, $\bar{g}(x_j^t) \in \mathbb{R}^{ch}$ represents the result of global average pooling, along the spatial dimensions, of the target domain image feature map extracted from the j-th target domain tongue image sample, and ch represents the number of channels.
Preferably, in the step S4, an optimized transmission distance between the source domain image feature set corresponding to each source domain tongue image sample and the target domain image feature set corresponding to the target domain tongue image sample is calculated, and the method further includes the following steps:
the SWD distance and the L2 distance are jointly used as a cost function between the source domain image feature set and the target domain image feature set, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function specifically comprises the following steps:
$c(x_i^s, x_j^t) = \min_{P \in \mathcal{P}} \operatorname{tr}\big( (U_i - P U_j)^{T} (U_i - P U_j) \big) + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$

wherein P represents a permutation matrix, $\mathcal{P}$ represents the set of all permutation matrices, $U_i$ represents the features corresponding to the source domain tongue image sample $x_i^s$ transformed into a common high-dimensional hidden layer space, $U_j$ represents the features corresponding to the target domain tongue image sample $x_j^t$ transformed into the common high-dimensional hidden layer space, and the superscript T is the matrix transpose symbol.
Preferably, in the step S4, an optimized transmission distance between the source domain image feature set corresponding to each source domain tongue image sample and the target domain image feature set corresponding to the target domain tongue image sample is calculated, and the method further includes the following steps:
the SWD distance, the L2 distance and the cross entropy of the class condition distribution difference are jointly used as the cost function between the two feature sets, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function is specifically:
$c(x_i^s, x_j^t) = \frac{\lambda_{swd}}{M} \sum_{m=1}^{M} \big\| \operatorname{sort}(Z_i \theta_m) - \operatorname{sort}(Z_j \theta_m) \big\|_2^2 + \lambda_{l2} \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2 + \lambda_{cond}\, H\big( y_i^s, f(g(x_j^t)) \big)$

wherein $\lambda_{swd}$ represents the balance coefficient of the SWD distance, $\lambda_{l2}$ represents the balance coefficient of the L2 distance, and $\lambda_{cond}$ represents the balance coefficient of the cross entropy of the class condition distribution difference; $y_i^s$ represents the label of the source domain tongue image sample $x_i^s$; $H(\cdot, \cdot)$ represents the cross entropy measuring the difference of the class condition distributions, computed between $y_i^s$ and the classification prediction $f(g(x_j^t))$ for the target domain tongue image sample; M represents the total number of projection matrices; $Z_i$ represents the feature matrix obtained by mapping the source domain tongue image sample $x_i^s$ into the hidden layer space Z, and $Z_j$ represents the feature matrix obtained by mapping the target domain tongue image sample $x_j^t$ into the hidden layer space Z; $\theta_m$ represents the corresponding m-th projection matrix that projects the features of the source domain tongue image sample $x_i^s$ or of the target domain tongue image sample $x_j^t$ in the hidden layer space Z; and $\operatorname{sort}(\cdot)$ denotes sorting the projected features.
Preferably, in the step S5, a mini-batch strategy is adopted, which specifically includes randomly extracting a mini-batch of size n from the source domain tongue image samples and from the target domain tongue image samples each time, and calculating the optimal transmission between the two mini-batches as the inter-domain optimal transmission distance:
$OT_n(\hat{\mu}_s^n, \hat{\mu}_t^n) = \min_{\gamma^n \in \Pi(\hat{\mu}_s^n, \hat{\mu}_t^n)} \langle \gamma^n, C^n \rangle_F$

wherein $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$ are the empirical distributions formed by the n source domain tongue image samples and the n target domain tongue image samples of the mini-batch, respectively; $OT_n$ represents the inter-domain optimized transmission distance; $\Pi(\hat{\mu}_s^n, \hat{\mu}_t^n)$ represents the set of joint distributions of $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$; $\gamma^n$ represents the n×n matrix formed by the optimal transmission scheme between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets; $C^n$ represents the n×n matrix formed by the sample optimized transmission distances between any one source domain tongue image sample and any one target domain image sample; and $\langle \gamma^n, C^n \rangle_F$ represents the Frobenius dot product of $\gamma^n$ and $C^n$.
Preferably, in step S5, calculating the inter-domain optimal transmission distance between the source domain and the target domain further includes using unbalanced optimal transmission.
Preferably, in the step S5, calculating the inter-domain optimized transmission distance between the source domain and the target domain further includes adding a classification cross entropy loss function of the source domain by using unbalanced optimal transmission loss.
Compared with the prior art, the invention discloses a cross-domain tongue image classification method based on depth layering optimal transmission, which has the following beneficial effects:
the tongue images with different distributions are aligned through the optimal transmission of depth layering, and meanwhile, the classification capability is enhanced; the unbalanced optimal transmission is adopted in the optimal transmission of the first layer for field alignment, so that the edge constraint of the optimal transmission is relaxed, and the more robust optimization performance can be provided for small-batch training; for optimal transmission of the second layer, SWD is used instead of EMD distance, enhancing the distinguishing characteristics of the samples, SWD being an approximation of EMD distance but less computationally expensive. The accuracy of tongue image classification is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a depth layering image classification method provided by the invention;
fig. 2 is a schematic structural diagram of a depth layering optimal transmission model provided by the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses a cross-domain tongue image classification method based on depth layering optimal transmission, which comprises the following steps:
a cross-domain tongue image classification method based on depth layering optimal transmission comprises the following steps:
s1, collecting tongue image samples in a plurality of different fields as a training set;
s2, respectively carrying out feature extraction on a source domain tongue image sample and a target domain tongue image sample in a training set by using a deep neural network to obtain a source domain image feature map formed by corresponding source domain tongue image sample features and a target domain image feature map formed by target domain tongue image sample features;
namely: performing feature extraction on source domain tongue image samples in a training set by using a deep neural network to obtain a source domain image feature map formed by corresponding source domain tongue image sample features;
extracting features of the target domain tongue image samples in the training set by using a deep neural network, and obtaining a target domain image feature map formed by corresponding target domain tongue image sample features;
s3, partitioning the source domain image feature map to obtain a source domain image feature set corresponding to the source domain tongue image sample; partitioning the target domain image feature map to obtain a target domain image feature set corresponding to the target domain tongue image sample;
s4, calculating an optimized transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to the target domain tongue image sample, and taking the optimized transmission distance as a sample optimized transmission distance between the source domain tongue image sample and the target domain image sample;
according to the invention, two samples in the optimized transmission distance of the sample come from a source domain tongue image sample and a target domain image sample respectively, so that local information is introduced while field alignment is realized, and the distinguishing property of the characteristics is maintained;
s5, taking the sample optimized transmission distance as a cost measure between a source domain and a target domain, and calculating an inter-domain optimized transmission distance between the source domain and the target domain;
s6, calculating softmax cross entropy loss according to the source domain tongue image sample characteristic value extracted in the step S2, and taking the softmax cross entropy loss as a part of a loss function; constructing a classification loss function by taking the inter-domain optimized transmission distance as another part of the loss function, and training a classifier by using the classification loss function;
s7, classifying the tongue image sample to be verified by using the trained classifier.
Assume that $x_i^s$ and $x_j^t$ are samples drawn respectively from the source domain distribution $\mu_s$ and the target domain distribution $\mu_t$, and that $\Pi(\mu_s, \mu_t)$ is the set of joint distributions of the source domain distribution $\mu_s$ and the target domain distribution $\mu_t$. Let the numbers of samples in the two domains be $N_s$ and $N_t$, and let $C \geq 0$, $C \in \mathbb{R}^{N_s \times N_t}$, be the cost matrix between $\mu_s$ and $\mu_t$, each element of which is calculated by $c(x_i^s, x_j^t)$, the cost between the two samples, which measures their difference; c is a cost function that measures the distance between two samples, typically the L2 distance. Taking $c(x_i^s, x_j^t)$ as the cost measure, the optimized transmission distance between the domains can be calculated. The cost $c(x_i^s, x_j^t)$ can be obtained by the following methods.
In one embodiment, the sample optimized transmission distance between the source domain tongue image sample and the target domain image sample in step S4 may use the EMD distance and the L2 distance in combination as a cost function between the source domain image feature set and the target domain image feature set:
first a feature extractor g x-Z of the base depth neural network is designed, which can map the input to a hidden layer space Z. Meanwhile, a classifier f z- & gt y is designed, and can map the hidden layer space to the label space. The image x can be obtained by the feature extractor g
Figure BDA0004128320780000091
Then, the cost function between the source domain image feature set and the target domain image feature set may become:
Figure BDA0004128320780000092
wherein g represents a feature extractor of the deep neural network;
Figure BDA0004128320780000093
representing a source domain image feature map,>
Figure BDA0004128320780000094
representing source domain tongue image samples,>
Figure BDA0004128320780000095
H i w and W i Respectively representing the width and the height of a source domain image feature map;
Figure BDA0004128320780000096
image feature map representing the target domain->
Figure BDA0004128320780000097
Representing a target domain tongue image sample,>
Figure BDA0004128320780000098
H j w and W j Respectively representing the width and the height of the target domain image feature map;
γ in representing an optimal transmission scheme between two samples with respect to an image feature set, C in Cost matrix, gamma, representing the set of features between two samples with respect to the image in ∈R HiWi×HjWj ;C in ∈R HiWi×HjWj ;<γ in ,C in > F Representing gamma in and Cin Frobenius point multiplication of (2);
Figure BDA0004128320780000099
representing the global average pooling result of the source domain image feature map along the spatial dimension,
Figure BDA00041283207800000910
Figure BDA00041283207800000911
representing the global average pooling result of the target domain image feature map along the spatial dimension,
Figure BDA00041283207800000912
ch represents the number of channels.
The feature extractor may be implemented using a convolutional layer of a convolutional neural network.
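As an illustration only (a sketch under stated assumptions, not the claimed implementation), the sample-level cost of Equation (1) can be computed by treating the spatial positions of a convolutional feature map as a set of patch features, solving an inner EMD between the two patch sets, and adding the L2 distance between the globally pooled features. PyTorch, torchvision and the POT library are assumed; the ResNet-50 backbone and all tensor shapes are illustrative choices:

    # Sketch of the Equation (1)-style cost between one source image and one target image.
    import numpy as np
    import ot
    import torch
    import torchvision

    backbone = torchvision.models.resnet50(weights=None)
    g = torch.nn.Sequential(*list(backbone.children())[:-2])   # keep the spatial feature map

    def sample_cost(x_s, x_t):
        """x_s, x_t: 3xHxW image tensors after the usual normalization."""
        with torch.no_grad():
            f_s = g(x_s.unsqueeze(0))[0]                        # (ch, H_i, W_i)
            f_t = g(x_t.unsqueeze(0))[0]                        # (ch, H_j, W_j)
        zs = f_s.flatten(1).T.numpy().astype(np.float64)        # (H_i*W_i, ch) patch features
        zt = f_t.flatten(1).T.numpy().astype(np.float64)        # (H_j*W_j, ch)
        a = np.full(zs.shape[0], 1.0 / zs.shape[0])             # uniform weights over patches
        b = np.full(zt.shape[0], 1.0 / zt.shape[0])
        C_in = ot.dist(zs, zt)                                  # squared-L2 cost between patches
        emd_term = ot.emd2(a, b, C_in)                          # inner optimal transmission term
        l2_term = float(((f_s.mean(dim=(1, 2)) - f_t.mean(dim=(1, 2))) ** 2).sum())
        return emd_term + l2_term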
In order to further optimize the above technical solution, in another embodiment, step S4 calculates a sample optimized transmission distance between the source domain tongue image sample and the target domain image sample by:
the SWD distance and the L2 distance are jointly used as a cost function between the source domain image feature set and the target domain image feature set, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function specifically comprises the following steps:
$c(x_i^s, x_j^t) = \min_{P \in \mathcal{P}} \operatorname{tr}\big( (U_i - P U_j)^{T} (U_i - P U_j) \big) + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$   (2)

wherein P represents a permutation matrix, $\mathcal{P}$ represents the set of all permutation matrices, $U_i$ represents the features of the sample $x_i^s$ transformed into a common high-dimensional hidden layer space, $U_j$ represents the features of the sample $x_j^t$ transformed into the same space, and the superscript T is the matrix transpose symbol.

In a mini-batch of size n, Equation (1) would have to be computed once for every pair consisting of a source domain tongue image sample and a target domain image sample, and the computation cost is too high. Equation (2) therefore uses the SWD distance (Sliced Wasserstein Distance) as an approximation of the EMD distance when computing the cost function between the source domain image feature set and the target domain image feature set. Equation (2) introduces a permutation matrix P to match the different regions of the two images; P encodes the association between the regions of the two images, and $\mathcal{P}$ represents the set of all permutation matrices. $U_i \in \mathbb{R}^{H_i W_i \times d}$ represents the features of the sample $x_i^s$ transformed into a d-dimensional common hidden space.
In order to further optimize the above technical solution, in another embodiment, in step S4, another method for calculating the optimal transmission distance between the source domain tongue image sample and the target domain image sample is as follows:
the matching problem in equation (2) is NP-hard, time-complex too high, in order to solve this problem more efficiently, so equation (2) will be approximated by the following algorithm:
$c(x_i^s, x_j^t) = \frac{1}{M} \sum_{m=1}^{M} \big\| \operatorname{sort}(Z_i \theta_m) - \operatorname{sort}(Z_j \theta_m) \big\|_2^2 + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$   (3)

wherein the projection set $\{\theta_m\}_{m=1}^{M}$ contains the M projection matrices. Here the permutation matrix P does not need to be found explicitly; it suffices to sort the projected regions and compute the distances between the sorted projections. The sample optimal transmission distance between the source domain tongue image sample and the target domain image sample is thus approximately computed by Equation (3), and the cost of each evaluation of $c(x_i^s, x_j^t)$ is reduced from $O(N^3)$ (if solved by linear programming) or $O(JN^2)$ (if solved by the Sinkhorn scaling algorithm, where J is the number of Sinkhorn iterations) to $O(MN)$, where N is the problem size. Since the feature map extracted by the depth feature extractor g is not large and the number of partitioned regions is limited, N is relatively small and the computation cost does not increase much.
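Purely as an illustration of the sliced approximation in Equation (3) (a sketch, with the number of projections and all shapes chosen arbitrarily, assuming PyTorch):

    # Project the two patch-feature sets onto M random directions, sort each projection,
    # average the squared L2 distances, then add the global-average-pooling L2 term.
    import torch

    def sliced_cost(z_i: torch.Tensor, z_j: torch.Tensor, num_proj: int = 32) -> torch.Tensor:
        """z_i: (N_i, ch) patch features of one source image; z_j: (N_j, ch) of one target image."""
        ch = z_i.shape[1]
        theta = torch.randn(ch, num_proj)
        theta = theta / theta.norm(dim=0, keepdim=True)          # M unit projection directions
        proj_i = torch.sort(z_i @ theta, dim=0).values           # sort each 1-D projection
        proj_j = torch.sort(z_j @ theta, dim=0).values
        # If N_i != N_j the sorted sequences would need resampling; equal sizes assumed here.
        swd = ((proj_i - proj_j) ** 2).sum(dim=0).mean()          # average over the M projections
        l2 = ((z_i.mean(dim=0) - z_j.mean(dim=0)) ** 2).sum()     # pooled-feature L2 term
        return swd + l2

    # Example with an equal number of regions per image:
    cost = sliced_cost(torch.randn(49, 2048), torch.randn(49, 2048))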
The SWD distance and the L2 distance here measure the difference of the marginal distributions, while the difference of the class condition distributions is measured by the cross entropy between sample labels. Since the target domain samples $x_j^t$ carry no label information, the predicted label $f(g(x_j^t))$ is used as a proxy. By jointly aligning the marginal distributions and the class condition distributions, more class information is introduced and the distinguishability between classes is improved. Equation (3) is therefore converted into:
$c(x_i^s, x_j^t) = \frac{\lambda_{swd}}{M} \sum_{m=1}^{M} \big\| \operatorname{sort}(Z_i \theta_m) - \operatorname{sort}(Z_j \theta_m) \big\|_2^2 + \lambda_{l2} \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2 + \lambda_{cond}\, H\big( y_i^s, f(g(x_j^t)) \big)$   (4)

wherein $\lambda_{swd}$ represents the balance coefficient of the SWD distance, $\lambda_{l2}$ represents the balance coefficient of the L2 distance, and $\lambda_{cond}$ represents the balance coefficient of the cross entropy of the class condition distribution difference; M represents the total number of projection matrices; $Z_i$ represents the feature matrix obtained by mapping the source domain tongue image sample $x_i^s$ into the hidden layer space Z, and $Z_j$ represents the feature matrix obtained by mapping the target domain tongue image sample $x_j^t$ into the hidden layer space Z; $\theta_m$ represents the corresponding m-th projection matrix that projects the features of the source domain tongue image sample $x_i^s$ or of the target domain tongue image sample $x_j^t$ in the hidden layer space Z; $y_i^s$ represents the label of the source domain tongue image sample $x_i^s$; and $H(\cdot, \cdot)$ represents the cross entropy measuring the difference of the class condition distributions.
Equation (4) contains three terms. The first term is the SWD distance between the source domain tongue image sample $x_i^s$ and the target domain image sample $x_j^t$. Specifically, the features $Z_i$ of $x_i^s$ and the features $Z_j$ of $x_j^t$ are projected into multiple shared spaces; in each shared space, the feature subsets of $Z_i$ and $Z_j$ are sorted and the Euclidean distance between them is computed directly, giving the distance between $Z_i$ and $Z_j$ in that shared space. Finally, the distances computed in the multiple shared spaces are averaged to obtain the SWD distance between $Z_i$ and $Z_j$. The second term, $\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \|_2^2$, adds a global average pooling operation to the function g, which is equivalent to performing global average pooling on $Z_i$ and $Z_j$; this term therefore computes the L2 distance between the global average pooling result of $Z_i$ and that of $Z_j$. The third term computes the cross entropy between the label corresponding to $x_i^s$ and the classification prediction for $x_j^t$, and represents the difference of the class condition distributions. Finally, Equation (4) weights and sums the three terms using three hyper-parameters as balance coefficients.

Notably, the three terms of Equation (4) are complementary: the SWD distance and the L2 distance provide local and global complementary information; the SWD distance and the L2 distance both describe differences of the marginal distributions, while the cross entropy term measures the difference of the class condition distributions. Through the mutual complementarity of the three terms, domain alignment can be performed while the class distinguishability is maintained, thereby improving classification performance.
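For illustration only, the three terms of Equation (4) can be combined as in the sketch below, which reuses the sliced_cost idea above and adds the class-conditional cross-entropy term with the predicted target label as a proxy; the classifier outputs, balance coefficients and shapes are assumptions, not the claimed implementation:

    import torch
    import torch.nn.functional as F

    def combined_cost(z_i, z_j, pooled_i, pooled_j, y_i, logits_j,
                      lam_swd=0.001, lam_l2=0.001, lam_cond=1.0, num_proj=32):
        """z_i, z_j: (N, ch) patch features; pooled_i, pooled_j: (ch,) pooled features;
        y_i: source label (int); logits_j: classifier logits f(g(x_j^t)) for the target image."""
        ch = z_i.shape[1]
        theta = torch.randn(ch, num_proj)
        theta = theta / theta.norm(dim=0, keepdim=True)
        swd = ((torch.sort(z_i @ theta, dim=0).values
                - torch.sort(z_j @ theta, dim=0).values) ** 2).sum(dim=0).mean()
        l2 = ((pooled_i - pooled_j) ** 2).sum()
        cond = F.cross_entropy(logits_j.unsqueeze(0), torch.tensor([y_i]))  # proxy-label term
        return lam_swd * swd + lam_l2 * l2 + lam_cond * cond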
In this embodiment, the network structure of the feature extractor is ResNet-50, and the feature map extracted by the feature extractor g retains its spatial structure for calculating the SWD distance. The feature extractor g is first pre-trained on ImageNet, while the classifier f is trained from scratch, so the learning rate of the classifier is 10 times that of the feature extractor g. The balance coefficients in the loss function are set to $\lambda_{swd} = 0.001$, $\lambda_{l2} = 0.001$ and $\lambda_{cond} = 1.0$.
The optimizer of this embodiment is SGD with a momentum of 0.9. The learning rate follows a linearly changing schedule. The mini-batch size is 65, and training runs for 10000 iterations.
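A sketch of this training setup, under the assumption that PyTorch and torchvision are used (the number of classes and base learning rate below are placeholders, not values from the embodiment):

    import torch
    import torchvision

    num_classes = 4            # hypothetical number of tongue image categories
    backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")   # ImageNet pre-training
    g = torch.nn.Sequential(*list(backbone.children())[:-2])          # keeps the spatial feature map
    f = torch.nn.Linear(2048, num_classes)                            # classifier trained from scratch

    base_lr = 0.001
    optimizer = torch.optim.SGD(
        [{"params": g.parameters(), "lr": base_lr},
         {"params": f.parameters(), "lr": 10 * base_lr}],             # classifier lr is 10x backbone lr
        momentum=0.9,
    )
    # A linearly changing learning-rate schedule, mini-batches of 65 samples and 10000
    # iterations would then be configured around this optimizer in the training loop.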
Step S5, calculating the inter-domain optimized transmission distance between the source domain and the target domain according to the sample optimized transmission distance as a cost measure between the source domain and the target domain;
Optimized transmission (Optimal Transport, OT) is a method for measuring the distance between two probability distributions that can exploit the geometry of the distributions. In general, OT searches over the possible couplings $\gamma \in \Pi(\mu_s, \mu_t)$ of the two distributions $\mu_s$ and $\mu_t$ for the coupling scheme with the minimum transmission cost:

$OT(\mu_s, \mu_t) = \min_{\gamma \in \Pi(\mu_s, \mu_t)} \mathbb{E}_{(x_i^s, x_j^t) \sim \gamma}\big[ c(x_i^s, x_j^t) \big]$   (5)

wherein $x_i^s \sim \mu_s$ and $x_j^t \sim \mu_t$ are any two samples drawn respectively from the source domain distribution $\mu_s$ and the target domain distribution $\mu_t$; $c(x_i^s, x_j^t)$ is the cost between the two samples, which measures their difference; and $\Pi(\mu_s, \mu_t)$ is the set of joint distributions of the marginal distributions $\mu_s$ and $\mu_t$. The discrete form of OT over empirical distributions can be defined as:

$OT(\hat{\mu}_s, \hat{\mu}_t) = \min_{\gamma \in \Pi(\hat{\mu}_s, \hat{\mu}_t)} \langle \gamma, C \rangle_F$   (6)

wherein $\hat{\mu}_s$ and $\hat{\mu}_t$ are positive vectors (empirical distributions), and $\langle \cdot, \cdot \rangle_F$ is the Frobenius dot product. Let the numbers of samples in the two domains be $N_s$ and $N_t$; $C \geq 0$, $C \in \mathbb{R}^{N_s \times N_t}$, is the cost matrix between $\hat{\mu}_s$ and $\hat{\mu}_t$, each element of which is calculated by $c(x_i^s, x_j^t)$. Here c is a cost function that measures the distance between two samples, typically the L2 distance. By optimizing Equation (6), i.e., minimizing the transmission cost, the optimal transmission plan $\gamma^*$ is obtained. Equation (6) can be solved by linear programming.
In one embodiment, step S5, calculating an inter-domain optimal transmission distance between a source domain and a target domain includes adopting a mini-batch policy, specifically including randomly extracting a mini-batch of size n from each domain each time, and calculating an optimal transmission between the two mini-batches as a proxy optimal transmission between domains:
$OT_n(\hat{\mu}_s^n, \hat{\mu}_t^n) = \min_{\gamma^n \in \Pi(\hat{\mu}_s^n, \hat{\mu}_t^n)} \langle \gamma^n, C^n \rangle_F$   (7)

wherein $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$ are the empirical distributions of the n source domain samples and the n target domain samples in the mini-batch, $\gamma^n$ is the n×n transmission plan, and $C^n$ is the n×n cost matrix. Considering the computation cost of Equation (6), the invention randomly extracts a mini-batch of size n from each domain each time and calculates the optimal transmission between the two mini-batches as a proxy for the optimal transmission between the domains, i.e., Equation (6) is converted into Equation (7), in which each element of $C^n$ is itself computed as a sample optimized transmission distance of the form of Equation (6) between the image feature sets of the corresponding two samples, thereby forming a hierarchical optimal transmission model.
As an improved technical solution, in another embodiment, step S5, another method of calculating an inter-domain optimized transmission distance between a source domain and a target domain uses a mini-batch based unbalanced optimal transmission method instead of formula (7):
$UOT_n(\hat{\mu}_s^n, \hat{\mu}_t^n) = \min_{\gamma^n \geq 0} \langle \gamma^n, C^n \rangle_F + \tau\, D_\phi(\gamma^n_1 \,\|\, \hat{\mu}_s^n) + \tau\, D_\phi(\gamma^n_2 \,\|\, \hat{\mu}_t^n) + \epsilon\, \mathrm{KL}\big( \gamma^n \,\|\, \hat{\mu}_s^n \otimes \hat{\mu}_t^n \big)$   (8)

wherein $D_\phi$ is a Csiszár divergence and KL is the Kullback-Leibler divergence; $\gamma^n_1$ and $\gamma^n_2$ are the marginal distributions of $\gamma^n$. Here τ is the marginal penalization coefficient (Marginal Penalization) and ε ≥ 0 is the regularization coefficient (Regularization Coefficient); specifically, ε = 0.01 and τ = 0.5 can be set.
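As an illustration of the mini-batch unbalanced optimal transmission in Equation (8), a sketch using the unbalanced Sinkhorn solver of the POT library is given below; the choice of solver is an assumption about tooling, not the patent's implementation:

    import numpy as np
    import ot

    def minibatch_uot(C_n, eps=0.01, tau=0.5):
        n = C_n.shape[0]
        a = np.full(n, 1.0 / n)
        b = np.full(n, 1.0 / n)
        # Relaxed-marginal OT plan; reg is the entropic coefficient, reg_m the marginal penalty.
        gamma_n = ot.unbalanced.sinkhorn_knopp_unbalanced(a, b, C_n, reg=eps, reg_m=tau)
        return float(np.sum(gamma_n * C_n)), gamma_n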
Thus, in each mini-batch, each domain is a collection of samples. Equation (8) serves as the first-layer optimal transmission between the source domain and the target domain, while each element of the cost matrix $C^n$ in Equation (8) is calculated by Equation (4) from the corresponding pair consisting of a source domain tongue image sample and a target domain image sample; the second layer is thus an optimal transmission between a source domain tongue image sample and a target domain image sample, where each sample is a collection of spatial regions of its image feature map. The two layers of optimal transmission together form the depth hierarchical optimal transmission model (Deep Hierarchical Optimal Transport, DeepHOT), which is optimized as shown in fig. 2. Thus, for a given mini-batch, the target problem of DeepHOT is:

$UOT_n(\hat{\mu}_s^n, \hat{\mu}_t^n) \ \text{with} \ C^n_{ij} = c(x_i^s, x_j^t) \ \text{as defined in Equation (4)}$   (9)

wherein $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$ are the empirical distributions of the source domain mini-batch and the target domain mini-batch, and each element $C^n_{ij}$ of the cost matrix is the sample optimized transmission distance between the i-th source domain tongue image sample and the j-th target domain image sample.
as an improved technical solution, in another embodiment, step S5, another method calculates an inter-domain optimized transmission distance between the source domain and the target domain, and increases the cross-class entropy loss function of the source domain by using unbalanced optimal transmission loss:
Figure BDA0004128320780000146
the objective is to avoid the problem of "catastrophic forgetting" (Catastrophic Forgetting) on the source domain, and the final optimization objective includes a classification cross entropy loss L of the source domain.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A cross-domain tongue image classification method based on depth layering optimal transmission, which is characterized by comprising the following steps:
s1, collecting tongue image samples in a plurality of different fields as a training set;
s2, performing feature extraction on source domain tongue image samples in the training set by using a deep neural network, and obtaining a source domain image feature map formed by corresponding source domain tongue image sample features;
extracting features of the target domain tongue image samples in the training set by using a deep neural network, and obtaining a target domain image feature map formed by corresponding target domain tongue image sample features;
s3, partitioning source domain tongue image sample characteristics in the source domain image characteristic diagram to obtain a source domain image characteristic set corresponding to the source domain tongue image sample;
dividing the characteristics of the target domain tongue image sample in the target domain image characteristic map into blocks to obtain a target domain image characteristic set corresponding to the target domain tongue image sample;
s4, calculating an optimized transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to the target domain tongue image sample, and taking the optimized transmission distance as a sample optimized transmission distance between the source domain tongue image sample and the target domain image sample;
s5, taking the sample optimized transmission distance as a cost measure between a source domain and a target domain, and calculating an inter-domain optimized transmission distance between the source domain and the target domain;
s6, calculating softmax cross entropy loss according to the source domain tongue image sample characteristic value extracted in the step S2, and taking the softmax cross entropy loss as a part of a loss function; constructing a classification loss function by taking the inter-domain optimized transmission distance as another part of the loss function, and training a classifier by using the classification loss function;
s7, classifying the tongue image sample to be verified by using the trained classifier.
2. The method for classifying cross-domain tongue images based on depth layering optimal transmission according to claim 1, wherein the step S4 of calculating the optimal transmission distance between the source domain image feature set corresponding to each source domain tongue image sample and the target domain image feature set corresponding to the target domain tongue image sample comprises the following steps:
the EMD distance and the L2 distance are jointly used as a cost function between the source domain image feature set and the target domain image feature set, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function specifically comprises the following steps:
$c(x_i^s, x_j^t) = \min_{\gamma^{in} \in \Pi(g(x_i^s),\, g(x_j^t))} \langle \gamma^{in}, C^{in} \rangle_F + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$

wherein g represents the feature extractor of the deep neural network; $g(x_i^s)$ represents the source domain image feature map extracted from the i-th source domain tongue image sample $x_i^s$, with $g(x_i^s) \in \mathbb{R}^{H_i \times W_i \times ch}$, where $H_i$ and $W_i$ respectively represent the two spatial dimensions of the source domain image feature map extracted from the i-th source domain tongue image sample; $g(x_j^t)$ represents the target domain image feature map extracted from the j-th target domain tongue image sample $x_j^t$, with $g(x_j^t) \in \mathbb{R}^{H_j \times W_j \times ch}$, where $H_j$ and $W_j$ respectively represent the two spatial dimensions of the target domain image feature map extracted from the j-th target domain tongue image sample; $\Pi\big(g(x_i^s), g(x_j^t)\big)$ represents the set of joint distributions (couplings) of the source domain image feature map and the target domain image feature map; $\gamma^{in}$ represents the optimal transmission scheme between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets, and $C^{in}$ represents the cost matrix between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets; $\langle \gamma^{in}, C^{in} \rangle_F$ represents the Frobenius dot product of $\gamma^{in}$ and $C^{in}$; $\bar{g}(x_i^s) \in \mathbb{R}^{ch}$ represents the result of global average pooling, along the spatial dimensions, of the source domain image feature map extracted from the i-th source domain tongue image sample, $\bar{g}(x_j^t) \in \mathbb{R}^{ch}$ represents the result of global average pooling, along the spatial dimensions, of the target domain image feature map extracted from the j-th target domain tongue image sample, and ch represents the number of channels.
3. The method for classifying cross-domain tongue images based on depth layering optimal transmission according to claim 2, wherein in step S4, an optimal transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to a target domain tongue image sample is calculated, and further comprising the following steps:
the SWD distance and the L2 distance are jointly used as a cost function between the source domain image feature set and the target domain image feature set, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function specifically comprises the following steps:
$c(x_i^s, x_j^t) = \min_{P \in \mathcal{P}} \operatorname{tr}\big( (U_i - P U_j)^{T} (U_i - P U_j) \big) + \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2$

wherein P represents a permutation matrix, $\mathcal{P}$ represents the set of all permutation matrices, $U_i$ represents the features corresponding to the source domain tongue image sample $x_i^s$ transformed into a common high-dimensional hidden layer space, $U_j$ represents the features corresponding to the target domain tongue image sample $x_j^t$ transformed into the common high-dimensional hidden layer space, and the superscript T is the matrix transpose symbol.
4. A method for classifying a cross-domain tongue image based on depth layering optimal transmission according to claim 3, wherein in the step S4, an optimal transmission distance between a source domain image feature set corresponding to each source domain tongue image sample and a target domain image feature set corresponding to a target domain tongue image sample is calculated, and further comprising the following steps:
the SWD distance, the L2 distance and the cross entropy of the class condition distribution difference are jointly used as the cost function between the two feature sets, and the optimal transmission distance between the two feature sets is calculated, wherein the cost function is specifically:
$c(x_i^s, x_j^t) = \frac{\lambda_{swd}}{M} \sum_{m=1}^{M} \big\| \operatorname{sort}(Z_i \theta_m) - \operatorname{sort}(Z_j \theta_m) \big\|_2^2 + \lambda_{l2} \big\| \bar{g}(x_i^s) - \bar{g}(x_j^t) \big\|_2^2 + \lambda_{cond}\, H\big( y_i^s, f(g(x_j^t)) \big)$

wherein $\lambda_{swd}$ represents the balance coefficient of the SWD distance, $\lambda_{l2}$ represents the balance coefficient of the L2 distance, and $\lambda_{cond}$ represents the balance coefficient of the cross entropy of the class condition distribution difference; $y_i^s$ represents the label of the source domain tongue image sample $x_i^s$; $H(\cdot, \cdot)$ represents the cross entropy measuring the difference of the class condition distributions, computed between $y_i^s$ and the classification prediction $f(g(x_j^t))$ for the target domain tongue image sample; M represents the total number of projection matrices; $Z_i$ represents the feature matrix obtained by mapping the source domain tongue image sample $x_i^s$ into the hidden layer space Z, and $Z_j$ represents the feature matrix obtained by mapping the target domain tongue image sample $x_j^t$ into the hidden layer space Z; and $\theta_m$ represents the corresponding m-th projection matrix that projects the features of the source domain tongue image sample $x_i^s$ or of the target domain tongue image sample $x_j^t$ in the hidden layer space Z.
5. The method for classifying a cross-domain tongue image based on depth layering optimal transmission according to claim 4, wherein in step S5, an inter-domain optimal transmission distance between a source domain and a target domain is calculated, including using a mini-batch policy, specifically including,
each time randomly extracting a mini-batch of size n from the source domain tongue image samples and from the target domain tongue image samples, and calculating the optimal transmission between the two mini-batches as the inter-domain optimal transmission distance:
$OT_n(\hat{\mu}_s^n, \hat{\mu}_t^n) = \min_{\gamma^n \in \Pi(\hat{\mu}_s^n, \hat{\mu}_t^n)} \langle \gamma^n, C^n \rangle_F$

wherein $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$ are the empirical distributions formed by the n source domain tongue image samples and the n target domain tongue image samples of the mini-batch, respectively; $OT_n$ represents the inter-domain optimized transmission distance; $\Pi(\hat{\mu}_s^n, \hat{\mu}_t^n)$ represents the set of joint distributions of $\hat{\mu}_s^n$ and $\hat{\mu}_t^n$; $\gamma^n$ represents the n×n matrix formed by the optimal transmission scheme between any one source domain tongue image sample and any one target domain tongue image sample with respect to the corresponding image feature sets; $C^n$ represents the n×n matrix formed by the sample optimized transmission distances between any one source domain tongue image sample and any one target domain image sample; and $\langle \gamma^n, C^n \rangle_F$ represents the Frobenius dot product of $\gamma^n$ and $C^n$.
6. A method for classifying a tongue image across domains based on depth layering optimal transmission according to claim 5, wherein in step S5, calculating an inter-domain optimal transmission distance between the source domain and the target domain further comprises using unbalanced optimal transmission.
7. A method for classifying a cross-domain tongue image based on depth layering optimal transmission according to claim 1, wherein in step S5, calculating an inter-domain optimal transmission distance between the source domain and the target domain further comprises adding a classification cross entropy loss function of the source domain using an unbalanced optimal transmission loss.
CN202310252527.2A 2023-03-16 2023-03-16 Cross-domain tongue image classification method based on depth layering optimal transmission Pending CN116310545A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310252527.2A CN116310545A (en) 2023-03-16 2023-03-16 Cross-domain tongue image classification method based on depth layering optimal transmission

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310252527.2A CN116310545A (en) 2023-03-16 2023-03-16 Cross-domain tongue image classification method based on depth layering optimal transmission

Publications (1)

Publication Number Publication Date
CN116310545A true CN116310545A (en) 2023-06-23

Family

ID=86781093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310252527.2A Pending CN116310545A (en) 2023-03-16 2023-03-16 Cross-domain tongue image classification method based on depth layering optimal transmission

Country Status (1)

Country Link
CN (1) CN116310545A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116566743A (en) * 2023-07-05 2023-08-08 北京理工大学 Account alignment method, equipment and storage medium
CN116566743B (en) * 2023-07-05 2023-09-08 北京理工大学 Account alignment method, equipment and storage medium

Similar Documents

Publication Publication Date Title
Lu et al. Class-agnostic counting
Krebs et al. Unsupervised probabilistic deformation modeling for robust diffeomorphic registration
CN107704877B (en) Image privacy perception method based on deep learning
Xu et al. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering
Boyda et al. Deploying a quantum annealing processor to detect tree cover in aerial imagery of California
Papa et al. Efficient supervised optimum-path forest classification for large datasets
US11494616B2 (en) Decoupling category-wise independence and relevance with self-attention for multi-label image classification
CN110516095B (en) Semantic migration-based weak supervision deep hash social image retrieval method and system
Wang Online Learning Behavior Analysis Based on Image Emotion Recognition.
Gao et al. Small sample classification of hyperspectral image using model-agnostic meta-learning algorithm and convolutional neural network
Gong et al. A coupling translation network for change detection in heterogeneous images
Liu et al. Generative self-training for cross-domain unsupervised tagged-to-cine mri synthesis
Shu et al. LVC-Net: Medical image segmentation with noisy label based on local visual cues
CN113298129B (en) Polarized SAR image classification method based on superpixel and graph convolution network
CN111126464A (en) Image classification method based on unsupervised domain confrontation field adaptation
Ning et al. Conditional generative adversarial networks based on the principle of homologycontinuity for face aging
CN114692732A (en) Method, system, device and storage medium for updating online label
Alshehri A content-based image retrieval method using neural network-based prediction technique
CN116310545A (en) Cross-domain tongue image classification method based on depth layering optimal transmission
Franchi et al. Latent discriminant deterministic uncertainty
Huynh et al. Joint age estimation and gender classification of Asian faces using wide ResNet
Chen et al. A robust automatic clustering algorithm for probability density functions with application to categorizing color images
Huang et al. An evidential combination method with multi-color spaces for remote sensing image scene classification
CN114612658A (en) Image semantic segmentation method based on dual-class-level confrontation network
Yun et al. Land cover classification based on tolerant rough set

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination