CN113808075A - Two-stage tongue picture identification method based on deep learning

Two-stage tongue picture identification method based on deep learning

Info

Publication number
CN113808075A
CN113808075A
Authority
CN
China
Prior art keywords
module
tongue
transformer
network
stage
Prior art date
Legal status
Pending
Application number
CN202110889480.1A
Other languages
Chinese (zh)
Inventor
田应仲
卜雪虎
李龙
胡慧娟
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202110889480.1A
Publication of CN113808075A
Legal status: Pending

Classifications

    • G06T7/0012 Biomedical image inspection
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T7/10 Segmentation; Edge detection
    • G06T2207/30004 Biomedical image processing

Abstract

The invention discloses a two-stage tongue picture identification method based on deep learning that comprises two stages: tongue body segmentation based on a Transformer model and a cross-attention network model, and tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. The first stage, tongue body segmentation based on a Transformer model and a cross-attention network model, comprises feature extraction by a deep residual network, global information modeling by a Transformer module, and feature fusion in a skip-connection module. The second stage, tongue picture feature detection based on a Swin Transformer tongue picture recognition network model, comprises feature extraction by a backbone network module, global information modeling by a Transformer module, and prediction of feature categories and bounding boxes by a prediction module. The invention realizes the recognition of tongue picture features, and the designed network model provides an effective technical means for tongue picture feature recognition in traditional Chinese medicine.

Description

Two-stage tongue picture identification method based on deep learning
Technical Field
The invention belongs to the technical field of traditional Chinese medicine tongue diagnosis and treatment assistance, and particularly relates to a two-stage tongue picture identification method based on deep learning, which addresses the low diagnostic accuracy of deep learning in the computerization of the traditional Chinese medicine tongue diagnosis method.
Background
Tongue diagnosis is an important part of the four diagnostic methods of traditional Chinese medicine: inspection, listening and smelling, inquiry, and palpation. Because the tongue is connected with the internal organs through the meridians, observing the tongue condition can qualitatively reveal pathological changes and their degree, and reflect the abundance or deficiency of pathogenic factors and of the patient's vital qi, blood, and body fluids. With the rapid development of artificial intelligence and computer vision and the application of convolutional neural networks to image processing, tongue diagnosis has the following application requirements:
(1) quantitative analysis of tongue color, coating color, tongue cracks, tooth marks and the like, so that tongue diagnosis can be quantified;
(2) identification, through computer image processing, of color differences that are difficult for the human eye to observe;
(3) reduced interference from external environmental factors and improved accuracy of tongue diagnosis.
Deep learning and convolutional neural networks are widely applied to image recognition by virtue of their strong feature learning and expression capabilities, so research on applying deep learning to the tongue diagnosis method is of great significance for developing traditional Chinese medicine and improving the accuracy of tongue diagnosis.
Disclosure of Invention
The invention provides a two-stage tongue diagnosis identification method based on deep learning. Taking traditional Chinese medicine tongue coating images as the research object, it analyzes the characteristics and difficulties of tongue picture identification in such images and designs a two-stage method of segmentation first and recognition second: a patient's tongue picture collected by a professional tongue picture collector is fed sequentially into a tongue body segmentation network and a tongue picture detection network, and the recognition of tongue diagnosis features is realized with the image segmentation and object detection methods of deep learning. This work simulates the traditional Chinese medicine diagnostic process, standardizes and computerizes traditional Chinese medicine tongue diagnosis, and provides medical staff with a real-time diagnosis and treatment scheme and decision support.
To achieve this purpose, the invention adopts the following inventive concept:
A two-stage tongue diagnosis identification method based on deep learning realizes the task modularly, designing a tongue body segmentation network model and a tongue picture recognition network model. The whole process is divided into two stages:
In the first stage, a tongue segmentation network model based on the Transformer and an attention mechanism is used.
First, a deep residual network module extracts high-level semantic features: the input tongue picture image is converted into high-dimensional tensor data from which the low-level features are extracted. Then a Transformer module attends to global information and performs global information modeling on the extracted high-level features. Finally, during upsampling, an attention mechanism applies weights to the low-level features carried by the skip connections, removing non-tongue interference information and ultimately achieving fine tongue body segmentation.
In the second stage, a tongue picture recognition network model based on the Swin Transformer is used.
A Transformer module is embedded into a deep residual network with a pyramid structure to form the backbone of the detection network. The tongue picture image is fed into the backbone; while the deep residual network downsamples the image and extracts features, the global information of the image features is attended to at multiple scales, so that the backbone extracts the high-level semantic feature information of the tongue picture. The high-level semantic features are then sent to a Transformer module that continues to attend to their global information and performs global information modeling. Finally, a prediction network module generates category and bounding-box information, realizing the recognition of tongue picture features.
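As an illustration only, the two-stage flow can be sketched as follows; the model objects seg_net and det_net, the 0.5 mask threshold, and the tensor shapes are hypothetical placeholders rather than details fixed by the invention.

```python
import torch

def recognize_tongue(image, seg_net, det_net):
    """image: a (1, 3, H, W) tongue picture tensor from the collector."""
    with torch.no_grad():
        # Stage 1: Transformer + cross-attention segmentation network.
        mask = seg_net(image)                       # (1, 1, H, W) tongue probability map
        tongue_only = image * (mask > 0.5)          # suppress non-tongue background
        # Stage 2: Swin-Transformer-based recognition network.
        class_logits, boxes = det_net(tongue_only)  # feature categories + bounding boxes
    return class_logits, boxes
```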
According to the inventive concept, the invention adopts the following technical scheme:
A two-stage tongue picture identification method based on deep learning, characterized in that the recognition of tongue picture features is realized in two stages: the first stage is tongue body segmentation based on a Transformer model and a cross-attention network model, and the second stage is tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. The method comprises the following steps:
The first stage, tongue body segmentation based on a Transformer model and a cross-attention network model, comprises the following specific steps:
Step 1, feature extraction by the deep residual network:
constructing a deep residual network module with a skip connection structure, so that the network gradient does not vanish, tensor data can propagate into the deep layers of the network, and high-level semantic features are extracted efficiently;
Step 2, global information modeling by the Transformer module:
constructing a Transformer module from multi-head self-attention (MHSA), multi-layer perceptron (MLP), residual connection, and layer normalization (LN) computing units; applying a 1 × 1 convolution to the high-level semantic features extracted by the deep residual network module to compress the number of channels, converting the result into serialized data, and feeding the sequence data into the Transformer module, which attends to global information and performs global information modeling on the extracted high-level features;
Step 3, feature fusion in the skip-connection module:
embedding multi-head attention (MHA) into the skip connection structure to construct a skip-connection module; passing the low-level detail features produced by the deep residual module of step 1 into the MHA module through the skip connection while the high-level semantic features produced in step 2 are passed into the MHA module through upsampling; applying weights to the low-level detail features by attending to the high-level semantic features, removing non-tongue interference information, fusing the result with the upsampled features, and reconstructing the feature map to obtain a fine tongue segmentation image;
The second stage, tongue picture feature detection based on the Swin Transformer tongue picture recognition network model, comprises the following specific steps:
Step 4, feature extraction by the backbone network module:
constructing the deep residual module of step 1 and a Transformer block module from the computing units of step 2, and embedding the Transformer block module into the deep residual module in a pyramid structure to form the backbone network module; feeding the segmented tongue body image obtained in step 3 into the backbone network to extract high-level semantic features, attending at multiple scales to the global information of the features extracted by each residual layer;
Step 5, global information modeling by the Transformer module:
constructing a Transformer module as in step 2, feeding the high-dimensional tensor data output in step 4 into the Transformer module to attend to global information, and performing global information modeling on the extracted high-level features;
Step 6, the prediction module obtains feature categories and bounding boxes:
constructing a prediction module from several linear layers, and feeding the high-dimensional tensor data output in step 5 into the prediction module, which outputs category predictions and bounding-box coordinate predictions to obtain the final tongue picture detection feature information.
Preferably, the deep residual network module in step 1 is a ResNet50 structure in which, to obtain a lightweight model, the original five convolution stages are reduced to four and the output dimension is reduced from 2048 to 1024.
Preferably, the Transformer module in step 2 is formed by connecting multiple Transformer layers in series, the number of layers being 12.
Preferably, in step 3, the multi-head attention module MHA applies weights to the low-level features input through the skip connection by attending to the high-level semantic features delivered by upsampling, thereby removing non-tongue interference information; the result is then fused with the upsampled features and the feature map is reconstructed to obtain a fine tongue segmentation image.
Preferably, in step 3, before the low-dimensional downsampled tensor data delivered to the MHA module through the skip connection structure and the high-dimensional upsampled data delivered to the MHA module are passed in, the tensor dimensionality is changed and the feature data is partitioned, which reduces the computational complexity of the MHA module and lightens the network model.
Preferably, in step 4, the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module, attending at multiple scales to the global information of the features extracted by each residual layer.
Preferably, in step 4, the Transformer block module embedded in the deep residual module in a pyramid structure changes the data dimensions before computation and partitions the tensor data into windows to reduce its computational complexity; the partitioned windows alternate between regular windows and shifted windows, realizing cross-window connections and cross-window feature interaction, which compensates for the original limitation of seeing only a single window, enlarges the receptive field, and brings higher efficiency.
Compared with the prior art, the invention has the following obvious and prominent substantive features and notable technical progress:
1. The method makes full use of the high-resolution spatial information of the deep residual network features and the global semantic information encoded by the Transformer to improve the model's learning capability; a multi-head attention module is added to the skip-connection module, and by attending to the upsampled high-level features it applies weights to the low-level features carried by the skip connection, removing interference information and thus obtaining a finer tongue body segmentation;
2. The method embeds the Transformer module into the deep residual network module in a pyramid structure and attends at multiple scales to the global information of the features extracted by each residual layer, which accelerates the convergence of the model and improves the recognition and detection accuracy.
Drawings
FIG. 1 is a flow chart of the two-stage tongue recognition of the present invention.
FIG. 2 is a diagram of a first stage tongue segmentation network framework of the present invention.
Fig. 3 is a block diagram of the first-stage deep residual module of the present invention.
FIG. 4 is a diagram of the first-stage Transformer module of the present invention.
Fig. 5 is a block diagram of the first-stage skip-connection module of the present invention.
FIG. 6 is a diagram of a second stage tongue picture recognition network framework of the present invention.
Fig. 7 is a block diagram of the backbone network in the second stage of the present invention.
FIG. 8 is a diagram of the Transformer block module in the second stage of the present invention.
Detailed Description
The details of the structure and operation of the preferred embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Example one:
Referring to fig. 1, a two-stage tongue picture recognition method based on deep learning realizes the recognition of tongue picture features in two stages. The first stage is tongue body segmentation based on a Transformer model and a cross-attention network model, and the second stage is tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. The method comprises the following steps:
The first stage is as follows:
Step 1, feature extraction by the deep residual network:
constructing a deep residual network module with a skip connection structure, so that the network gradient does not vanish, tensor data can propagate into the deep layers of the network, and high-level semantic features are extracted efficiently;
Step 2, global information modeling by the Transformer module:
constructing a Transformer module from multi-head self-attention (MHSA), multi-layer perceptron (MLP), residual connection, and layer normalization (LN) computing units; applying a 1 × 1 convolution to the high-level semantic features extracted by the deep residual network module to compress the number of channels, converting the result into serialized data, and feeding the sequence data into the Transformer module, which attends to global information and performs global information modeling on the extracted high-level features;
Step 3, feature fusion in the skip-connection module:
embedding multi-head attention (MHA) into the skip connection structure to construct a skip-connection module; passing the low-level detail features produced by the deep residual module of step 1 into the MHA module through the skip connection while the high-level semantic features produced in step 2 are passed into the MHA module through upsampling; applying weights to the low-level detail features by attending to the high-level semantic features, removing non-tongue interference information, fusing the result with the upsampled features, and reconstructing the feature map to obtain a fine tongue segmentation image;
The second stage is as follows:
Step 4, feature extraction by the backbone network module:
constructing the deep residual module of step 1 and a Transformer block module from the computing units of step 2, and embedding the Transformer block module into the deep residual module in a pyramid structure to form the backbone network module; feeding the segmented tongue body image obtained in step 3 into the backbone network to extract high-level semantic features, attending at multiple scales to the global information of the features extracted by each residual layer;
Step 5, global information modeling by the Transformer module:
constructing a Transformer module as in step 2, feeding the high-dimensional tensor data output in step 4 into the Transformer module to attend to global information, and performing global information modeling on the extracted high-level features;
Step 6, the prediction module obtains feature categories and bounding boxes:
constructing a prediction module from several linear layers, and feeding the high-dimensional tensor data output in step 5 into the prediction module, which outputs category predictions and bounding-box coordinate predictions to obtain the final tongue picture detection feature information.
This embodiment realizes the recognition of tongue diagnosis features with the image segmentation and object detection methods of deep learning; by completing this work it simulates the traditional Chinese medicine diagnostic process, standardizes and computerizes traditional Chinese medicine tongue diagnosis, and provides medical staff with a real-time diagnosis and treatment scheme and decision support.
Example two:
This embodiment is substantially the same as example one, with the following particulars:
The deep residual module in step 1 is a modified ResNet50 structure: the original five convolution stages are reduced to four, and the output dimension is reduced from 2048 to 1024.
The Transformer module in step 2 is formed by connecting multiple Transformer layers in series, the number of layers being 12.
In step 3, the multi-head attention module MHA applies weights to the low-level features input through the skip connection by attending to the high-level semantic features delivered by upsampling, removes non-tongue interference information, then fuses the result with the upsampled features and reconstructs the feature map to obtain a fine tongue segmentation image.
In step 3, before the data enters the MHA module, the dimensionality of the low-dimensional downsampled tensor data delivered through the skip connection structure and of the high-dimensional upsampled data is changed and the feature data is partitioned, which reduces the computational complexity of the MHA module and lightens the network model.
In step 4, the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module, attending at multiple scales to the global information of the features extracted by each residual layer.
In step 4, the Transformer block module embedded in the deep residual module in a pyramid structure changes the data dimensions before computation and partitions the tensor data into windows, reducing its computational complexity; the partitioned windows alternate between regular windows and shifted windows, realizing cross-window connections and cross-window feature interaction, which compensates for the original limitation of seeing only a single window, enlarges the receptive field, and brings higher efficiency.
This embodiment makes full use of the high-resolution spatial information of the deep residual network features and the global semantic information encoded by the Transformer to improve the model's learning capability: a multi-head attention module added to the skip-connection module applies weights, through attention to the upsampled high-level features, to the low-level features carried by the skip connection, removing interference information and obtaining a finer tongue body segmentation. Embedding the Transformer module into the deep residual network module in a pyramid structure and attending at multiple scales to the global information of the features extracted by each residual layer accelerates the convergence of the model and improves the recognition and detection accuracy.
Example three:
Referring to fig. 1, the two-stage tongue picture recognition method based on deep learning is divided into two stages: tongue body segmentation based on a Transformer model and a cross-attention network model, and tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. In the first stage, as shown in fig. 2, the tongue segmentation stage extracts high-level semantic features through the deep residual module; a Transformer module then performs global information modeling on the extracted high-level semantic information; finally, during upsampling, multi-head attention (MHA) applies weights to the low-level features carried by the skip connections, removing non-tongue interference information and ultimately achieving fine tongue body segmentation. In the second stage, as shown in fig. 6, the segmented tongue body image is fed into the backbone network; while the deep residual network downsamples the image and extracts features, the global information of the image features is attended to at multiple scales, so that the backbone extracts the high-level semantic feature information of the tongue picture. The high-level semantic features are then sent to a Transformer module that continues to attend to their global information and performs global information modeling. Finally, a prediction network module generates category and bounding-box information, realizing the recognition of tongue picture features.
The specific implementation of each stage is as follows:
Tongue body segmentation based on a Transformer model and a cross-attention network model; the network framework is shown in FIG. 2, and the specific implementation steps are as follows:
Step 1, feature extraction by the deep residual network:
As shown in fig. 3, a deep residual network module is constructed with a skip connection structure, so that the network gradient does not vanish, tensor data can propagate into the deep layers of the network, and high-level semantic features are extracted efficiently. The deep residual module comprises four residual stages, Res-1, Res-2, Res-3 and Res-4; the specific structural parameters are listed in Table 1.
TABLE 1 Deep residual module construction parameters
(Table 1 is reproduced as an image in the original publication.)
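Under the reading that the dropped fifth convolution stage is ResNet50's 2048-channel stage, as suggested by the stated 1024-dimensional output, the truncated backbone could be sketched as follows; the use of torchvision and the mapping onto the Res-1 to Res-4 stages of Table 1 are assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

class TruncatedResNet50(nn.Module):
    """ResNet-50 kept only up to its third residual stage, so the deepest
    features are 1024-D rather than 2048-D (the lightweight design above);
    the correspondence to Table 1's Res-1..Res-4 is approximate."""
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.res_stages = nn.ModuleList([r.layer1, r.layer2, r.layer3])
        # r.layer4, the 2048-channel fifth stage, is deliberately dropped.

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.res_stages:
            x = stage(x)
            feats.append(x)        # keep every scale for the skip connections
        return feats               # feats[-1]: 1024-channel semantic features
```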
Step 2, modeling global information of a Transformer module:
As shown in fig. 4, the Transformer module is constructed from multi-head self-attention (MHSA), multi-layer perceptron (MLP), residual connection, and layer normalization (LN) computing units. The high-level semantic feature data extracted by the deep residual network module is sent into the Transformer module and passes through multiple Transformer layers; the computation from layer k to layer k + 1 proceeds as follows. First, a 1 × 1 convolution is applied to the high-level semantic features F extracted by the deep residual network module to compress the number of channels, a reshape operation then converts the result into serialized data, and finally position encoding is added, as in formula (1):

$$X = \mathrm{reshape}\big(\mathrm{Conv}_{1\times 1}(F)\big) + E_{pos} \tag{1}$$
After the sequence input data X is obtained, the Q, K, V sequences are computed by linear layers, as in formula (2):

$$Q = W_{Q}X,\qquad K = W_{K}X,\qquad V = W_{V}X \tag{2}$$
Then the self-attention matrix and the multi-head self-attention matrix are computed in turn, as in formulas (3) and (4):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{3}$$

$$\mathrm{MHSA}(X) = \mathrm{Concat}\big(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h}\big)\,W_{O} \tag{4}$$
where d is the dimension of Q and K; since the magnitude of $QK^{T}$ grows with d, dividing by $\sqrt{d}$ acts as a normalization, and Concat denotes the matrix concatenation operation.
Then, residual error connection and layer normalization calculation are carried out once, as shown in formula (5):
Figure BDA0003195240340000076
The LN layer is computed as in formula (6):

$$\mathrm{LN}(x_{i}) = \gamma \odot \frac{x_{i} - \mu}{\sigma} + \beta,\qquad \mu = \frac{1}{H}\sum_{i=1}^{H} x_{i},\qquad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\big(x_{i} - \mu\big)^{2}} \tag{6}$$

where μ and σ denote the mean and standard deviation of the feature respectively, ⊙ denotes the element-wise product, γ and β are learnable transformation parameters, and H denotes the number of hidden neurons in the same sequence data.
The result then passes through the MLP network, a feature transformation module located between the self-attention layers; it is essentially a two-layer fully connected network with a ReLU activation function in between, computed as in formula (7):

$$\mathrm{MLP}(X') = W_{2}\,\mathrm{ReLU}\big(W_{1}X' + b_{1}\big) + b_{2} \tag{7}$$
Finally, $X^{k+1}$ is obtained through one more residual connection and layer normalization computation, as in formula (8):

$$X^{k+1} = \mathrm{LN}\big(X' + \mathrm{MLP}(X')\big) \tag{8}$$
This yields the feature data on which the Transformer module has performed global information modeling.
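Formulas (1) to (8) can be sketched as follows; the sizes chosen here (d_model = 256, 8 heads, a 14 × 14 token grid) are illustrative assumptions, with only the depth of 12 layers taken from the description.

```python
import torch
import torch.nn as nn

class TransformerBottleneck(nn.Module):
    def __init__(self, in_ch=1024, d_model=256, heads=8, depth=12, tokens=14 * 14):
        super().__init__()
        self.compress = nn.Conv2d(in_ch, d_model, kernel_size=1)   # 1x1 conv of formula (1)
        self.pos = nn.Parameter(torch.zeros(1, tokens, d_model))   # E_pos of formula (1)
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(d_model, heads, batch_first=True),  # (2)-(4)
                "ln1": nn.LayerNorm(d_model),
                "mlp": nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model)),            # (7)
                "ln2": nn.LayerNorm(d_model),
            }) for _ in range(depth)
        ])

    def forward(self, f):                                  # f: (B, in_ch, H, W)
        x = self.compress(f).flatten(2).transpose(1, 2)    # reshape to a sequence
        x = x + self.pos                                   # formula (1)
        for blk in self.blocks:
            a, _ = blk["attn"](x, x, x)                    # multi-head self-attention
            x = blk["ln1"](x + a)                          # residual + LN, formula (5)
            x = blk["ln2"](x + blk["mlp"](x))              # MLP + residual + LN, formula (8)
        return x
```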
And 3, skipping the feature fusion of the connection modules:
referring to fig. 2 and 5, a multi-head attention MHA is embedded into a skip connection structure to construct a skip connection module, a low-level detail feature Y generated by a deep residual module in step 1 and a high-level semantic feature X generated in step 2 are simultaneously transmitted into the MHA module, weight is applied to the low-level detail feature Y by focusing on the high-level semantic feature X, non-tongue interference information is removed, feature fusion is performed with an up-sampling feature, a feature map is reconstructed, and a fine tongue segmentation image is obtained, wherein the specific data calculation process is as follows:
Y must first be position-encoded so that the position information it carries can be learned, while the upsampled input X needs no position encoding. The skip-connected input data Y and the upsampled data X are each reshaped into sequence input data, and position encoding is then added to Y, as in formula (9):

$$Y = \mathrm{reshape}(Y) + E_{pos} \tag{9}$$
The Q, K, V sequences are then computed from the sequence data X, Y by linear layers, as in formula (10):

$$Q = W_{Q}X,\qquad K = W_{K}Y,\qquad V = W_{V}Y \tag{10}$$
Then the attention matrix attending from the upsampled input X and the multi-head attention matrix are computed in turn, as in formula (11):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V,\qquad \mathrm{MHA}(X, Y) = \mathrm{Concat}\big(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h}\big)\,W_{O} \tag{11}$$
The multi-head attention output sequence data is reshaped to obtain Y′, which is then concatenated with the upsampled data X to obtain the output data O, as in formula (12):

$$O = \mathrm{Concat}\big(Y', X\big) \tag{12}$$
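Read this way, the skip-connection module of formulas (9) to (12) can be sketched as below; taking the upsampled X as the query and the skip-connected Y as key and value follows formula (10), while the channel count and token grid are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionSkip(nn.Module):
    def __init__(self, d_model=256, heads=8, tokens=28 * 28):
        super().__init__()
        self.pos_y = nn.Parameter(torch.zeros(1, tokens, d_model))  # formula (9)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, x_up, y_skip):
        # x_up, y_skip: (B, C, H, W) upsampled and skip-connected feature maps
        b, c, h, w = x_up.shape
        x = x_up.flatten(2).transpose(1, 2)                 # sequence form of X
        y = y_skip.flatten(2).transpose(1, 2) + self.pos_y  # only Y is position-coded
        y_prime, _ = self.attn(query=x, key=y, value=y)     # formulas (10)-(11)
        o = torch.cat([y_prime, x], dim=-1)                 # formula (12)
        return o.transpose(1, 2).reshape(b, 2 * c, h, w)    # back to a feature map
```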
After the first-stage network model, the tongue body region in the tongue image containing a background region is segmented completely and accurately from the image.
In the second stage, tongue picture feature detection is performed by the Swin Transformer based tongue picture recognition network model; the network framework is shown in fig. 6, and the specific implementation steps are as follows:
Step 4, feature extraction by the backbone network module:
Referring to figs. 6 to 8, the Transformer block module is constructed from the computing units described in step 2, as shown in fig. 8. The Transformer block module consists of a Transformer Encode module and a Swin Transformer Encode module: the Transformer Encode module partitions the feature map directly with regular windows and then applies the Transformer computation, whereas the Swin Transformer Encode module first performs a shifted-window rolling operation on the feature map data before partitioning it, after which the Transformer computation is applied. The specific computation is formula (13):

$$\hat{z}^{l} = \text{W-MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1},\qquad z^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}$$

$$\hat{z}^{l+1} = \text{SW-MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l},\qquad z^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1} \tag{13}$$

where W-MSA and SW-MSA denote window-based multi-head self-attention with regular and shifted windows, respectively.
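The alternation between regular and shifted windows in formula (13) can be sketched as follows; per-window attention is reduced to plain multi-head self-attention, the attention mask that a full Swin implementation applies after shifting is omitted, and the dimensions and window size are assumptions.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """(B, H, W, C) -> (B * H/ws * W/ws, ws*ws, C) non-overlapping windows."""
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

def window_reverse(wins, ws, h, w):
    b = wins.shape[0] // ((h // ws) * (w // ws))
    x = wins.view(b, h // ws, w // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, -1)

class AlternatingWindowBlock(nn.Module):
    """One W-MSA (shifted=False) or SW-MSA (shifted=True) block of formula (13)."""
    def __init__(self, dim=256, heads=8, ws=7, shifted=False):
        super().__init__()
        self.ws, self.shift = ws, (ws // 2 if shifted else 0)
        self.ln = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim),
                                 nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                 # x: (B, H, W, C), H and W multiples of ws
        b, h, w, c = x.shape
        shortcut = x
        x = self.ln(x)
        if self.shift:                    # roll so new windows straddle old borders
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        wins = window_partition(x, self.ws)
        wins, _ = self.attn(wins, wins, wins)   # attention within each window
        x = window_reverse(wins, self.ws, h, w)
        if self.shift:                    # undo the roll
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = shortcut + x                  # first residual of formula (13)
        return x + self.mlp(x)            # MLP sub-layer with residual
```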
The deep residual module is constructed as described in step 1, and the Transformer block module is embedded into it in a pyramid structure to form the backbone network module, as shown in fig. 7. The segmented tongue body image obtained in step 3 is fed into the backbone network to extract high-level semantic features, attending at multiple scales to the global information of the features extracted by each residual layer.
Step 5, global information modeling by the Transformer module:
A Transformer module is constructed as in step 2; the high-dimensional tensor data output in step 4 is fed into it to attend to global information, and global information modeling is performed on the extracted high-level features.
Step 6, the prediction module obtains feature categories and bounding boxes:
A prediction module is constructed from several linear layers; the high-dimensional tensor data output in step 5 is fed into the prediction module, which outputs category predictions and bounding-box coordinate predictions to obtain the final tongue picture detection feature information.
To sum up, the above embodiments describe a two-stage tongue picture recognition method based on deep learning, divided into two stages: tongue body segmentation based on a Transformer model and a cross-attention network model, and tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. In the first stage, the tongue segmentation stage extracts high-level semantic features through the deep residual module; the Transformer module then performs global information modeling on the extracted high-level semantic information; finally, during upsampling, multi-head attention (MHA) applies weights to the low-level features carried by the skip connections, removing non-tongue interference information and ultimately achieving fine tongue body segmentation. In the second stage, the segmented tongue body image is fed into the backbone network, which, while the deep residual network downsamples the image and extracts features, attends at multiple scales to the global information of the image features, thereby extracting the high-level semantic feature information of the tongue picture; the high-level semantic features are then sent to a Transformer module that continues to attend to their global information and performs global information modeling; finally, a prediction network module generates category and bounding-box information, realizing the recognition of tongue picture features. The network model designed by the invention provides an effective technical means for tongue picture feature recognition in traditional Chinese medicine.
The embodiments of the present invention have been described with reference to the accompanying drawings, but the present invention is not limited to these embodiments; various changes and modifications can be made according to the purpose of the invention. Any change, modification, substitution, combination, or simplification made according to the spirit and principle of the technical solution of the present invention shall be an equivalent substitution and shall fall within the protection scope of the present invention, as long as it meets the purpose of the invention and does not depart from the technical principle and inventive concept of the present invention.

Claims (7)

1. A two-stage tongue picture identification method based on deep learning, characterized in that the recognition of tongue picture features is realized in two stages, the first stage being tongue body segmentation based on a Transformer model and a cross-attention network model and the second stage being tongue picture feature detection based on a Swin Transformer tongue picture recognition network model, the method comprising the following steps:
the first stage is as follows:
step 1, feature extraction by the deep residual network:
constructing a deep residual network module with a skip connection structure, so that the network gradient does not vanish, tensor data propagate into the deep layers of the network, and high-level semantic features are extracted efficiently;
step 2, global information modeling by the Transformer module:
constructing a Transformer module from multi-head self-attention, multi-layer perceptron, residual connection, and layer normalization computing units; applying a 1 × 1 convolution to the high-level semantic features extracted by the deep residual network module to compress the number of channels, converting the result into serialized data by a reshape operation, and feeding the sequence data into the Transformer module, which attends to global information and performs global information modeling on the extracted high-level features;
step 3, feature fusion in the skip-connection module:
embedding multi-head attention into the skip connection structure to construct a skip-connection module; passing the low-level detail features produced by the deep residual module of step 1 into the multi-head attention MHA module through the skip connection while the high-level semantic features produced in step 2 are passed into the MHA module through upsampling; applying weights to the low-level detail features by attending to the high-level semantic features, removing non-tongue interference information, fusing the result with the upsampled features, and reconstructing the feature map to obtain a fine tongue segmentation image;
the second stage is as follows:
step 4, feature extraction by the backbone network module:
constructing the deep residual module of step 1 and a Transformer block module from the computing units of step 2, and embedding the Transformer block module into the deep residual module in a pyramid structure to form the backbone network module; feeding the segmented tongue body image obtained in step 3 into the backbone network to extract high-level semantic features, attending at multiple scales to the global information of the features extracted by each residual layer;
step 5, global information modeling by the Transformer module:
feeding the high-dimensional tensor data output in step 4 into the Transformer module constructed in step 2 to attend to global information, and performing global information modeling on the extracted high-level features;
step 6, the prediction module obtains feature categories and bounding boxes:
constructing a prediction module from several linear layers, and feeding the high-dimensional tensor data output in step 5 into the prediction module, which outputs category predictions and bounding-box coordinate predictions to obtain the final tongue picture detection feature information.
2. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 1 the deep residual module is a modified ResNet50 structure in which the original five convolution stages are reduced to four and the output dimension is reduced from 2048 to 1024.
3. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 2 the Transformer module is formed by connecting multiple Transformer layers in series, the number of layers being 12.
4. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 3 the multi-head attention module MHA applies weights to the low-level features input through the skip connection by attending to the high-level semantic features delivered by upsampling, removes non-tongue interference information, then fuses the result with the upsampled features and reconstructs the feature map to obtain a fine tongue segmentation image.
5. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 3, before transmission, the dimensionality of the low-dimensional downsampled tensor data delivered to the MHA module through the skip connection structure and of the high-dimensional upsampled data delivered to the MHA module is changed and the feature data is partitioned, reducing the computational complexity of the MHA module and lightening the network model.
6. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 4 the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module, attending at multiple scales to the global information of the features extracted by each residual layer.
7. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 4 the Transformer block module embedded in the deep residual module in a pyramid structure changes the data dimensions before computation and partitions the tensor data into windows to reduce its computational complexity; the partitioned windows alternate between regular windows and shifted windows, realizing cross-window connections and cross-window feature interaction.
CN202110889480.1A 2021-08-04 2021-08-04 Two-stage tongue picture identification method based on deep learning Pending CN113808075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110889480.1A CN113808075A (en) 2021-08-04 2021-08-04 Two-stage tongue picture identification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110889480.1A CN113808075A (en) 2021-08-04 2021-08-04 Two-stage tongue picture identification method based on deep learning

Publications (1)

Publication Number Publication Date
CN113808075A true CN113808075A (en) 2021-12-17

Family

ID=78893319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110889480.1A Pending CN113808075A (en) 2021-08-04 2021-08-04 Two-stage tongue picture identification method based on deep learning

Country Status (1)

Country Link
CN (1) CN113808075A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550305A (en) * 2022-03-04 2022-05-27 合肥工业大学 Human body posture estimation method and system based on Transformer
CN114612759A (en) * 2022-03-22 2022-06-10 北京百度网讯科技有限公司 Video processing method, video query method, model training method and model training device
CN116189884A (en) * 2023-04-24 2023-05-30 成都中医药大学 Multi-mode fusion traditional Chinese medicine physique judging method and system based on facial vision
CN117877686A (en) * 2024-03-13 2024-04-12 自贡市第一人民医院 Intelligent management method and system for traditional Chinese medicine nursing data

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378882A (en) * 2019-07-09 2019-10-25 北京工业大学 A kind of Chinese medicine tongue nature method for sorting colors of multi-layer depth characteristic fusion
CN110765966A (en) * 2019-10-30 2020-02-07 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters
CN111223553A (en) * 2020-01-03 2020-06-02 大连理工大学 Two-stage deep migration learning traditional Chinese medicine tongue diagnosis model
CN111723196A (en) * 2020-05-21 2020-09-29 西北工业大学 Single document abstract generation model construction method and device based on multi-task learning
WO2020215697A1 (en) * 2019-08-09 2020-10-29 平安科技(深圳)有限公司 Tongue image extraction method and device, and a computer readable storage medium
CN112349427A (en) * 2020-10-21 2021-02-09 上海中医药大学 Diabetes prediction method based on tongue picture and depth residual convolutional neural network
CN113011436A (en) * 2021-02-26 2021-06-22 北京工业大学 Traditional Chinese medicine tongue color and fur color collaborative classification method based on convolutional neural network
CN113139971A (en) * 2021-03-22 2021-07-20 杭州电子科技大学 Tongue picture identification method and system based on artificial intelligence
US20210232813A1 (en) * 2020-01-23 2021-07-29 Tongji University Person re-identification method combining reverse attention and multi-scale deep supervision

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378882A (en) * 2019-07-09 2019-10-25 北京工业大学 A kind of Chinese medicine tongue nature method for sorting colors of multi-layer depth characteristic fusion
WO2020215697A1 (en) * 2019-08-09 2020-10-29 平安科技(深圳)有限公司 Tongue image extraction method and device, and a computer readable storage medium
CN110765966A (en) * 2019-10-30 2020-02-07 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters
CN111223553A (en) * 2020-01-03 2020-06-02 大连理工大学 Two-stage deep migration learning traditional Chinese medicine tongue diagnosis model
US20210232813A1 (en) * 2020-01-23 2021-07-29 Tongji University Person re-identification method combining reverse attention and multi-scale deep supervision
CN111723196A (en) * 2020-05-21 2020-09-29 西北工业大学 Single document abstract generation model construction method and device based on multi-task learning
CN112349427A (en) * 2020-10-21 2021-02-09 上海中医药大学 Diabetes prediction method based on tongue picture and depth residual convolutional neural network
CN113011436A (en) * 2021-02-26 2021-06-22 北京工业大学 Traditional Chinese medicine tongue color and fur color collaborative classification method based on convolutional neural network
CN113139971A (en) * 2021-03-22 2021-07-20 杭州电子科技大学 Tongue picture identification method and system based on artificial intelligence

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
於张闲; 冒宇清; 胡孔法: "Identification of false health information based on deep learning", 软件导刊 (Software Guide), no. 03, 15 March 2020 (2020-03-15) *
汤一平; 王丽冉; 何霞; 陈朋; 袁公萍: "Research on tongue image classification based on multi-task convolutional neural networks", 计算机科学 (Computer Science), no. 12, 15 December 2018 (2018-12-15) *
王丽冉; 汤一平; 陈朋; 何霞; 袁公萍: "Two-stage convolutional neural network design for tongue body segmentation", 中国图象图形学报 (Journal of Image and Graphics), no. 10, 16 October 2018 (2018-10-16) *
王俊豪; 罗轶凤: "Enriching image descriptions through fine-grained semantic features and Transformer", 华东师范大学学报(自然科学版) (Journal of East China Normal University, Natural Science), no. 05, 25 September 2020 (2020-09-25) *
田应仲 et al.: "Research on a kernelized correlation filter visual target following algorithm fused with convolutional neural networks", 计算测量与控制 (Computer Measurement & Control), vol. 28, no. 12, 31 December 2020 (2020-12-31), pages 176-180 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550305A (en) * 2022-03-04 2022-05-27 合肥工业大学 Human body posture estimation method and system based on Transformer
CN114612759A (en) * 2022-03-22 2022-06-10 北京百度网讯科技有限公司 Video processing method, video query method, model training method and model training device
CN116189884A (en) * 2023-04-24 2023-05-30 成都中医药大学 Multi-mode fusion traditional Chinese medicine physique judging method and system based on facial vision
CN116189884B (en) * 2023-04-24 2023-07-25 成都中医药大学 Multi-mode fusion traditional Chinese medicine physique judging method and system based on facial vision
CN117877686A (en) * 2024-03-13 2024-04-12 自贡市第一人民医院 Intelligent management method and system for traditional Chinese medicine nursing data
CN117877686B (en) * 2024-03-13 2024-05-07 自贡市第一人民医院 Intelligent management method and system for traditional Chinese medicine nursing data

Similar Documents

Publication Publication Date Title
CN113808075A (en) Two-stage tongue picture identification method based on deep learning
CN112651973B (en) Semantic segmentation method based on cascade of feature pyramid attention and mixed attention
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
CN110120055B (en) Fundus fluorography image non-perfusion area automatic segmentation method based on deep learning
CN112150469B (en) Laser speckle contrast image segmentation method based on unsupervised field self-adaption
CN111161200A (en) Human body posture migration method based on attention mechanism
CN107145893A (en) A kind of image recognition algorithm and system based on convolution depth network
CN115018809A (en) Target area segmentation and identification method and system of CT image
CN114445420A (en) Image segmentation model with coding and decoding structure combined with attention mechanism and training method thereof
WO2024104035A1 (en) Long short-term memory self-attention model-based three-dimensional medical image segmentation method and system
CN117274599A (en) Brain magnetic resonance segmentation method and system based on combined double-task self-encoder
CN115526829A (en) Honeycomb lung focus segmentation method and network based on ViT and context feature fusion
CN115271033A (en) Medical image processing model construction and processing method based on federal knowledge distillation
CN117078941A (en) Cardiac MRI segmentation method based on context cascade attention
CN116862891A (en) Double-branch OCT blood vessel hyperfine semantic segmentation method of encoder-decoder structure
Wang et al. Tiny-lesion segmentation in oct via multi-scale wavelet enhanced transformer
CN111543985A (en) Brain control hybrid intelligent rehabilitation method based on novel deep learning model
CN115565671A (en) Atrial fibrillation auxiliary analysis method based on cross-model mutual teaching semi-supervision
CN116416434A (en) Medical image segmentation method based on Swin transducer fused with multi-scale features and multi-attention mechanism
CN115761377A (en) Smoker brain magnetic resonance image classification method based on contextual attention mechanism
CN116309278A (en) Medical image segmentation model and method based on multi-scale context awareness
CN115984560A (en) Image segmentation method based on CNN and Transformer
CN115526898A (en) Medical image segmentation method
CN115222748A (en) Multi-organ segmentation method based on parallel deep U-shaped network and probability density map
CN115147303A (en) Two-dimensional ultrasonic medical image restoration method based on mask guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination