CN113808075A - Two-stage tongue picture identification method based on deep learning

Two-stage tongue picture identification method based on deep learning

Info

Publication number
CN113808075A
CN113808075A
Authority
CN
China
Prior art keywords
module
tongue
transformer
network
stage
Prior art date
Legal status
Pending
Application number
CN202110889480.1A
Other languages
Chinese (zh)
Inventor
田应仲
卜雪虎
李龙
胡慧娟
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202110889480.1A
Publication of CN113808075A
Legal status: Pending

Classifications

    • G06T7/0012 Biomedical image inspection
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T7/10 Segmentation; Edge detection
    • G06T2207/30004 Biomedical image processing

Abstract

The invention discloses a two-stage tongue picture identification method based on deep learning that comprises two stages: tongue body segmentation based on a Transformer model and a cross-attention network model, and tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. The first stage, tongue body segmentation based on a Transformer model and a cross-attention network model, comprises feature extraction by a deep residual network, global information modeling by a Transformer module, and feature fusion in a skip-connection module. The second stage, tongue picture feature detection based on a Swin Transformer tongue picture recognition network model, comprises feature extraction by a backbone network module, global information modeling by a Transformer module, and prediction of feature categories and bounding boxes by a prediction module. The invention realizes the recognition of tongue picture features, and the designed network model provides an effective technical means for tongue picture feature recognition in traditional Chinese medicine.

Description

Two-stage tongue picture identification method based on deep learning
Technical Field
The invention belongs to the technical field of traditional Chinese medicine tongue diagnosis and treatment assistance, and particularly relates to a two-stage tongue picture identification method based on deep learning, which addresses the low diagnostic accuracy of deep learning in the computerization of the traditional Chinese medicine tongue diagnosis method.
Background
Tongue diagnosis is an important part of the four diagnostic methods of traditional Chinese medicine: inspection, listening and smelling, inquiry, and palpation. Because the tongue is connected with the internal organs through the meridians, observing the tongue condition can qualitatively reveal pathological changes and their degree, and reflect the abundance or deficiency of pathogenic factors and of the patient's vital qi, blood, and body fluids. With the rapid development of artificial intelligence and computer vision and the application of convolutional neural networks to image processing, tongue diagnosis has the following application requirements:
(1) quantitative analysis of tongue color, coating color, tongue cracks, tooth marks and the like, so that tongue diagnosis can be quantified;
(2) identification, through computer image processing, of color differences that are difficult for the human eye to observe;
(3) reduced interference from external environmental factors and improved accuracy of tongue diagnosis.
Deep learning and convolutional neural networks are widely applied to image recognition by virtue of their strong feature learning and expression capabilities, so research on applying deep learning to the tongue diagnosis method is of great significance for developing traditional Chinese medicine and improving the accuracy of tongue diagnosis.
Disclosure of Invention
The invention provides a two-stage tongue diagnosis identification method based on deep learning. Taking traditional Chinese medicine tongue coating images as the research object, it analyzes the characteristics and difficulties of tongue picture identification in such images and designs a two-stage method of segmentation first and recognition second: a patient's tongue picture collected by a professional tongue picture collector is fed sequentially into a tongue body segmentation network and a tongue picture detection network, and the recognition of tongue diagnosis features is realized with the image segmentation and object detection methods of deep learning. This work simulates the traditional Chinese medicine diagnostic process, standardizes and computerizes traditional Chinese medicine tongue diagnosis, and provides medical staff with a real-time diagnosis and treatment scheme and decision support.
To achieve this purpose, the invention adopts the following inventive concept:
A two-stage tongue diagnosis identification method based on deep learning realizes the task modularly, designing a tongue body segmentation network model and a tongue picture recognition network model. The whole process is divided into two stages:
In the first stage, a tongue segmentation network model based on the Transformer and an attention mechanism is used.
First, a deep residual network module extracts high-level semantic features: the input tongue picture image is converted into high-dimensional tensor data from which the low-level features are extracted. Then a Transformer module attends to global information and performs global information modeling on the extracted high-level features. Finally, during upsampling, an attention mechanism applies weights to the low-level features carried by the skip connections, removing non-tongue interference information and ultimately achieving fine tongue body segmentation.
In the second stage, a tongue picture recognition network model based on the Swin Transformer is used.
A Transformer module is embedded into a deep residual network with a pyramid structure to form the backbone of the detection network. The tongue picture image is fed into the backbone; while the deep residual network downsamples the image and extracts features, the global information of the image features is attended to at multiple scales, so that the backbone extracts the high-level semantic feature information of the tongue picture. The high-level semantic features are then sent to a Transformer module that continues to attend to their global information and performs global information modeling. Finally, a prediction network module generates category and bounding-box information, realizing the recognition of tongue picture features.
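As an illustration only, the two-stage flow can be sketched as follows; the model objects seg_net and det_net, the 0.5 mask threshold, and the tensor shapes are hypothetical placeholders rather than details fixed by the invention.

```python
import torch

def recognize_tongue(image, seg_net, det_net):
    """image: a (1, 3, H, W) tongue picture tensor from the collector."""
    with torch.no_grad():
        # Stage 1: Transformer + cross-attention segmentation network.
        mask = seg_net(image)                       # (1, 1, H, W) tongue probability map
        tongue_only = image * (mask > 0.5)          # suppress non-tongue background
        # Stage 2: Swin-Transformer-based recognition network.
        class_logits, boxes = det_net(tongue_only)  # feature categories + bounding boxes
    return class_logits, boxes
```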
According to the inventive concept, the invention adopts the following technical scheme:
A two-stage tongue picture identification method based on deep learning, characterized in that the recognition of tongue picture features is realized in two stages: the first stage is tongue body segmentation based on a Transformer model and a cross-attention network model, and the second stage is tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. The method comprises the following steps:
The first stage, tongue body segmentation based on a Transformer model and a cross-attention network model, comprises the following specific steps:
Step 1, feature extraction by the deep residual network:
constructing a deep residual network module with a skip connection structure, so that the network gradient does not vanish, tensor data can propagate into the deep layers of the network, and high-level semantic features are extracted efficiently;
Step 2, global information modeling by the Transformer module:
constructing a Transformer module from multi-head self-attention (MHSA), multi-layer perceptron (MLP), residual connection, and layer normalization (LN) computing units; applying a 1 × 1 convolution to the high-level semantic features extracted by the deep residual network module to compress the number of channels, converting the result into serialized data, and feeding the sequence data into the Transformer module, which attends to global information and performs global information modeling on the extracted high-level features;
Step 3, feature fusion in the skip-connection module:
embedding multi-head attention (MHA) into the skip connection structure to construct a skip-connection module; passing the low-level detail features produced by the deep residual module of step 1 into the MHA module through the skip connection while the high-level semantic features produced in step 2 are passed into the MHA module through upsampling; applying weights to the low-level detail features by attending to the high-level semantic features, removing non-tongue interference information, fusing the result with the upsampled features, and reconstructing the feature map to obtain a fine tongue segmentation image;
The second stage, tongue picture feature detection based on the Swin Transformer tongue picture recognition network model, comprises the following specific steps:
Step 4, feature extraction by the backbone network module:
constructing the deep residual module of step 1 and a Transformer block module from the computing units of step 2, and embedding the Transformer block module into the deep residual module in a pyramid structure to form the backbone network module; feeding the segmented tongue body image obtained in step 3 into the backbone network to extract high-level semantic features, attending at multiple scales to the global information of the features extracted by each residual layer;
Step 5, global information modeling by the Transformer module:
constructing a Transformer module as in step 2, feeding the high-dimensional tensor data output in step 4 into the Transformer module to attend to global information, and performing global information modeling on the extracted high-level features;
Step 6, the prediction module obtains feature categories and bounding boxes:
constructing a prediction module from several linear layers, and feeding the high-dimensional tensor data output in step 5 into the prediction module, which outputs category predictions and bounding-box coordinate predictions to obtain the final tongue picture detection feature information.
Preferably, the deep residual network module in step 1 is a ResNet50 structure in which, to obtain a lightweight model, the original five convolution stages are reduced to four and the output dimension is reduced from 2048 to 1024.
Preferably, the Transformer module in step 2 is formed by connecting multiple Transformer layers in series, the number of layers being 12.
Preferably, in step 3, the multi-head attention module MHA applies weights to the low-level features input through the skip connection by attending to the high-level semantic features delivered by upsampling, thereby removing non-tongue interference information; the result is then fused with the upsampled features and the feature map is reconstructed to obtain a fine tongue segmentation image.
Preferably, in step 3, before the low-dimensional downsampled tensor data delivered to the MHA module through the skip connection structure and the high-dimensional upsampled data delivered to the MHA module are passed in, the tensor dimensionality is changed and the feature data is partitioned, which reduces the computational complexity of the MHA module and lightens the network model.
Preferably, in step 4, the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module, attending at multiple scales to the global information of the features extracted by each residual layer.
Preferably, in step 4, the Transformer block module embedded in the deep residual module in a pyramid structure changes the data dimensions before computation and partitions the tensor data into windows to reduce its computational complexity; the partitioned windows alternate between regular windows and shifted windows, realizing cross-window connections and cross-window feature interaction, which compensates for the original limitation of seeing only a single window, enlarges the receptive field, and brings higher efficiency.
Compared with the prior art, the invention has the following obvious and prominent substantive features and notable technical progress:
1. The method makes full use of the high-resolution spatial information of the deep residual network features and the global semantic information encoded by the Transformer to improve the model's learning capability; a multi-head attention module is added to the skip-connection module, and by attending to the upsampled high-level features it applies weights to the low-level features carried by the skip connection, removing interference information and thus obtaining a finer tongue body segmentation;
2. The method embeds the Transformer module into the deep residual network module in a pyramid structure and attends at multiple scales to the global information of the features extracted by each residual layer, which accelerates the convergence of the model and improves the recognition and detection accuracy.
Drawings
FIG. 1 is a flow chart of the two-stage tongue recognition of the present invention.
FIG. 2 is a diagram of a first stage tongue segmentation network framework of the present invention.
Fig. 3 is a block diagram of the first-stage deep residual module of the present invention.
FIG. 4 is a diagram of the first-stage Transformer module of the present invention.
Fig. 5 is a block diagram of the first-stage skip-connection module of the present invention.
FIG. 6 is a diagram of a second stage tongue picture recognition network framework of the present invention.
Fig. 7 is a block diagram of the backbone network in the second stage of the present invention.
FIG. 8 is a diagram of the Transformer block module in the second stage of the present invention.
Detailed Description
The details of the structure and operation of the preferred embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Example one:
Referring to fig. 1, a two-stage tongue picture recognition method based on deep learning realizes the recognition of tongue picture features in two stages. The first stage is tongue body segmentation based on a Transformer model and a cross-attention network model, and the second stage is tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. The method comprises the following steps:
The first stage is as follows:
Step 1, feature extraction by the deep residual network:
constructing a deep residual network module with a skip connection structure, so that the network gradient does not vanish, tensor data can propagate into the deep layers of the network, and high-level semantic features are extracted efficiently;
Step 2, global information modeling by the Transformer module:
constructing a Transformer module from multi-head self-attention (MHSA), multi-layer perceptron (MLP), residual connection, and layer normalization (LN) computing units; applying a 1 × 1 convolution to the high-level semantic features extracted by the deep residual network module to compress the number of channels, converting the result into serialized data, and feeding the sequence data into the Transformer module, which attends to global information and performs global information modeling on the extracted high-level features;
Step 3, feature fusion in the skip-connection module:
embedding multi-head attention (MHA) into the skip connection structure to construct a skip-connection module; passing the low-level detail features produced by the deep residual module of step 1 into the MHA module through the skip connection while the high-level semantic features produced in step 2 are passed into the MHA module through upsampling; applying weights to the low-level detail features by attending to the high-level semantic features, removing non-tongue interference information, fusing the result with the upsampled features, and reconstructing the feature map to obtain a fine tongue segmentation image;
The second stage is as follows:
Step 4, feature extraction by the backbone network module:
constructing the deep residual module of step 1 and a Transformer block module from the computing units of step 2, and embedding the Transformer block module into the deep residual module in a pyramid structure to form the backbone network module; feeding the segmented tongue body image obtained in step 3 into the backbone network to extract high-level semantic features, attending at multiple scales to the global information of the features extracted by each residual layer;
Step 5, global information modeling by the Transformer module:
constructing a Transformer module as in step 2, feeding the high-dimensional tensor data output in step 4 into the Transformer module to attend to global information, and performing global information modeling on the extracted high-level features;
Step 6, the prediction module obtains feature categories and bounding boxes:
constructing a prediction module from several linear layers, and feeding the high-dimensional tensor data output in step 5 into the prediction module, which outputs category predictions and bounding-box coordinate predictions to obtain the final tongue picture detection feature information.
This embodiment realizes the recognition of tongue diagnosis features with the image segmentation and object detection methods of deep learning; by completing this work it simulates the traditional Chinese medicine diagnostic process, standardizes and computerizes traditional Chinese medicine tongue diagnosis, and provides medical staff with a real-time diagnosis and treatment scheme and decision support.
Example two:
This embodiment is substantially the same as example one, with the following particulars:
The deep residual module in step 1 is a modified ResNet50 structure: the original five convolution stages are reduced to four, and the output dimension is reduced from 2048 to 1024.
The Transformer module in step 2 is formed by connecting multiple Transformer layers in series, the number of layers being 12.
In step 3, the multi-head attention module MHA applies weights to the low-level features input through the skip connection by attending to the high-level semantic features delivered by upsampling, removes non-tongue interference information, then fuses the result with the upsampled features and reconstructs the feature map to obtain a fine tongue segmentation image.
In step 3, before the data enters the MHA module, the dimensionality of the low-dimensional downsampled tensor data delivered through the skip connection structure and of the high-dimensional upsampled data is changed and the feature data is partitioned, which reduces the computational complexity of the MHA module and lightens the network model.
In step 4, the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module, attending at multiple scales to the global information of the features extracted by each residual layer.
In step 4, the Transformer block module embedded in the deep residual module in a pyramid structure changes the data dimensions before computation and partitions the tensor data into windows, reducing its computational complexity; the partitioned windows alternate between regular windows and shifted windows, realizing cross-window connections and cross-window feature interaction, which compensates for the original limitation of seeing only a single window, enlarges the receptive field, and brings higher efficiency.
This embodiment makes full use of the high-resolution spatial information of the deep residual network features and the global semantic information encoded by the Transformer to improve the model's learning capability: a multi-head attention module added to the skip-connection module applies weights, through attention to the upsampled high-level features, to the low-level features carried by the skip connection, removing interference information and obtaining a finer tongue body segmentation. Embedding the Transformer module into the deep residual network module in a pyramid structure and attending at multiple scales to the global information of the features extracted by each residual layer accelerates the convergence of the model and improves the recognition and detection accuracy.
Example three:
Referring to fig. 1, the two-stage tongue picture recognition method based on deep learning is divided into two stages: tongue body segmentation based on a Transformer model and a cross-attention network model, and tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. In the first stage, as shown in fig. 2, the tongue segmentation stage extracts high-level semantic features through the deep residual module; a Transformer module then performs global information modeling on the extracted high-level semantic information; finally, during upsampling, multi-head attention (MHA) applies weights to the low-level features carried by the skip connections, removing non-tongue interference information and ultimately achieving fine tongue body segmentation. In the second stage, as shown in fig. 6, the segmented tongue body image is fed into the backbone network; while the deep residual network downsamples the image and extracts features, the global information of the image features is attended to at multiple scales, so that the backbone extracts the high-level semantic feature information of the tongue picture. The high-level semantic features are then sent to a Transformer module that continues to attend to their global information and performs global information modeling. Finally, a prediction network module generates category and bounding-box information, realizing the recognition of tongue picture features.
The specific implementation of each stage is as follows:
Tongue body segmentation based on a Transformer model and a cross-attention network model; the network framework is shown in FIG. 2, and the specific implementation steps are as follows:
Step 1, feature extraction by the deep residual network:
As shown in fig. 3, a deep residual network module is constructed with a skip connection structure, so that the network gradient does not vanish, tensor data can propagate into the deep layers of the network, and high-level semantic features are extracted efficiently. The deep residual module comprises four residual stages, Res-1, Res-2, Res-3 and Res-4; the specific structural parameters are listed in Table 1.
TABLE 1 Deep residual module construction parameters
(Table 1 is reproduced as an image in the original publication.)
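Under the reading that the dropped fifth convolution stage is ResNet50's 2048-channel stage, as suggested by the stated 1024-dimensional output, the truncated backbone could be sketched as follows; the use of torchvision and the mapping onto the Res-1 to Res-4 stages of Table 1 are assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

class TruncatedResNet50(nn.Module):
    """ResNet-50 kept only up to its third residual stage, so the deepest
    features are 1024-D rather than 2048-D (the lightweight design above);
    the correspondence to Table 1's Res-1..Res-4 is approximate."""
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.res_stages = nn.ModuleList([r.layer1, r.layer2, r.layer3])
        # r.layer4, the 2048-channel fifth stage, is deliberately dropped.

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.res_stages:
            x = stage(x)
            feats.append(x)        # keep every scale for the skip connections
        return feats               # feats[-1]: 1024-channel semantic features
```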
Step 2, modeling global information of a Transformer module:
As shown in fig. 4, the Transformer module is constructed from multi-head self-attention (MHSA), multi-layer perceptron (MLP), residual connection, and layer normalization (LN) computing units. The high-level semantic feature data extracted by the deep residual network module is sent into the Transformer module and passes through multiple Transformer layers; the computation from layer k to layer k + 1 proceeds as follows. First, a 1 × 1 convolution is applied to the high-level semantic features F extracted by the deep residual network module to compress the number of channels, a reshape operation then converts the result into serialized data, and finally position encoding is added, as in formula (1):

$$X = \mathrm{reshape}\big(\mathrm{Conv}_{1\times 1}(F)\big) + E_{pos} \tag{1}$$
After the sequence input data X is obtained, the Q, K, V sequences are computed by linear layers, as in formula (2):

$$Q = W_{Q}X,\qquad K = W_{K}X,\qquad V = W_{V}X \tag{2}$$
Then the self-attention matrix and the multi-head self-attention matrix are computed in turn, as in formulas (3) and (4):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{3}$$

$$\mathrm{MHSA}(X) = \mathrm{Concat}\big(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h}\big)\,W_{O} \tag{4}$$
where d is the dimension of Q and K; since the magnitude of $QK^{T}$ grows with d, dividing by $\sqrt{d}$ acts as a normalization, and Concat denotes the matrix concatenation operation.
Then, residual error connection and layer normalization calculation are carried out once, as shown in formula (5):
Figure BDA0003195240340000076
The LN layer is computed as in formula (6):

$$\mathrm{LN}(x_{i}) = \gamma \odot \frac{x_{i} - \mu}{\sigma} + \beta,\qquad \mu = \frac{1}{H}\sum_{i=1}^{H} x_{i},\qquad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\big(x_{i} - \mu\big)^{2}} \tag{6}$$

where μ and σ denote the mean and standard deviation of the feature respectively, ⊙ denotes the element-wise product, γ and β are learnable transformation parameters, and H denotes the number of hidden neurons in the same sequence data.
The result then passes through the MLP network, a feature transformation module located between the self-attention layers; it is essentially a two-layer fully connected network with a ReLU activation function in between, computed as in formula (7):

$$\mathrm{MLP}(X') = W_{2}\,\mathrm{ReLU}\big(W_{1}X' + b_{1}\big) + b_{2} \tag{7}$$
Finally, $X^{k+1}$ is obtained through one more residual connection and layer normalization computation, as in formula (8):

$$X^{k+1} = \mathrm{LN}\big(X' + \mathrm{MLP}(X')\big) \tag{8}$$
This yields the feature data on which the Transformer module has performed global information modeling.
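Formulas (1) to (8) can be sketched as follows; the sizes chosen here (d_model = 256, 8 heads, a 14 × 14 token grid) are illustrative assumptions, with only the depth of 12 layers taken from the description.

```python
import torch
import torch.nn as nn

class TransformerBottleneck(nn.Module):
    def __init__(self, in_ch=1024, d_model=256, heads=8, depth=12, tokens=14 * 14):
        super().__init__()
        self.compress = nn.Conv2d(in_ch, d_model, kernel_size=1)   # 1x1 conv of formula (1)
        self.pos = nn.Parameter(torch.zeros(1, tokens, d_model))   # E_pos of formula (1)
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(d_model, heads, batch_first=True),  # (2)-(4)
                "ln1": nn.LayerNorm(d_model),
                "mlp": nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model)),            # (7)
                "ln2": nn.LayerNorm(d_model),
            }) for _ in range(depth)
        ])

    def forward(self, f):                                  # f: (B, in_ch, H, W)
        x = self.compress(f).flatten(2).transpose(1, 2)    # reshape to a sequence
        x = x + self.pos                                   # formula (1)
        for blk in self.blocks:
            a, _ = blk["attn"](x, x, x)                    # multi-head self-attention
            x = blk["ln1"](x + a)                          # residual + LN, formula (5)
            x = blk["ln2"](x + blk["mlp"](x))              # MLP + residual + LN, formula (8)
        return x
```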
And 3, skipping the feature fusion of the connection modules:
referring to fig. 2 and 5, a multi-head attention MHA is embedded into a skip connection structure to construct a skip connection module, a low-level detail feature Y generated by a deep residual module in step 1 and a high-level semantic feature X generated in step 2 are simultaneously transmitted into the MHA module, weight is applied to the low-level detail feature Y by focusing on the high-level semantic feature X, non-tongue interference information is removed, feature fusion is performed with an up-sampling feature, a feature map is reconstructed, and a fine tongue segmentation image is obtained, wherein the specific data calculation process is as follows:
Y must first be position-encoded so that the position information it carries can be learned, while the upsampled input X needs no position encoding. The skip-connected input data Y and the upsampled data X are each reshaped into sequence input data, and position encoding is then added to Y, as in formula (9):

$$Y = \mathrm{reshape}(Y) + E_{pos} \tag{9}$$
The Q, K, V sequences are then computed from the sequence data X, Y by linear layers, as in formula (10):

$$Q = W_{Q}X,\qquad K = W_{K}Y,\qquad V = W_{V}Y \tag{10}$$
Then the attention matrix attending from the upsampled input X and the multi-head attention matrix are computed in turn, as in formula (11):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V,\qquad \mathrm{MHA}(X, Y) = \mathrm{Concat}\big(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h}\big)\,W_{O} \tag{11}$$
The multi-head attention output sequence data is reshaped to obtain Y′, which is then concatenated with the upsampled data X to obtain the output data O, as in formula (12):

$$O = \mathrm{Concat}\big(Y', X\big) \tag{12}$$
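Read this way, the skip-connection module of formulas (9) to (12) can be sketched as below; taking the upsampled X as the query and the skip-connected Y as key and value follows formula (10), while the channel count and token grid are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionSkip(nn.Module):
    def __init__(self, d_model=256, heads=8, tokens=28 * 28):
        super().__init__()
        self.pos_y = nn.Parameter(torch.zeros(1, tokens, d_model))  # formula (9)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, x_up, y_skip):
        # x_up, y_skip: (B, C, H, W) upsampled and skip-connected feature maps
        b, c, h, w = x_up.shape
        x = x_up.flatten(2).transpose(1, 2)                 # sequence form of X
        y = y_skip.flatten(2).transpose(1, 2) + self.pos_y  # only Y is position-coded
        y_prime, _ = self.attn(query=x, key=y, value=y)     # formulas (10)-(11)
        o = torch.cat([y_prime, x], dim=-1)                 # formula (12)
        return o.transpose(1, 2).reshape(b, 2 * c, h, w)    # back to a feature map
```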
After the first-stage network model, the tongue body region in the tongue image containing a background region is segmented completely and accurately from the image.
In the second stage, tongue picture feature detection is performed by the Swin Transformer based tongue picture recognition network model; the network framework is shown in fig. 6, and the specific implementation steps are as follows:
Step 4, feature extraction by the backbone network module:
Referring to figs. 6 to 8, the Transformer block module is constructed from the computing units described in step 2, as shown in fig. 8. The Transformer block module consists of a Transformer Encode module and a Swin Transformer Encode module: the Transformer Encode module partitions the feature map directly with regular windows and then applies the Transformer computation, whereas the Swin Transformer Encode module first performs a shifted-window rolling operation on the feature map data before partitioning it, after which the Transformer computation is applied. The specific computation is formula (13):

$$\hat{z}^{l} = \text{W-MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1},\qquad z^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}$$

$$\hat{z}^{l+1} = \text{SW-MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l},\qquad z^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1} \tag{13}$$

where W-MSA and SW-MSA denote window-based multi-head self-attention with regular and shifted windows, respectively.
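The alternation between regular and shifted windows in formula (13) can be sketched as follows; per-window attention is reduced to plain multi-head self-attention, the attention mask that a full Swin implementation applies after shifting is omitted, and the dimensions and window size are assumptions.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """(B, H, W, C) -> (B * H/ws * W/ws, ws*ws, C) non-overlapping windows."""
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

def window_reverse(wins, ws, h, w):
    b = wins.shape[0] // ((h // ws) * (w // ws))
    x = wins.view(b, h // ws, w // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, -1)

class AlternatingWindowBlock(nn.Module):
    """One W-MSA (shifted=False) or SW-MSA (shifted=True) block of formula (13)."""
    def __init__(self, dim=256, heads=8, ws=7, shifted=False):
        super().__init__()
        self.ws, self.shift = ws, (ws // 2 if shifted else 0)
        self.ln = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim),
                                 nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                 # x: (B, H, W, C), H and W multiples of ws
        b, h, w, c = x.shape
        shortcut = x
        x = self.ln(x)
        if self.shift:                    # roll so new windows straddle old borders
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        wins = window_partition(x, self.ws)
        wins, _ = self.attn(wins, wins, wins)   # attention within each window
        x = window_reverse(wins, self.ws, h, w)
        if self.shift:                    # undo the roll
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = shortcut + x                  # first residual of formula (13)
        return x + self.mlp(x)            # MLP sub-layer with residual
```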
The deep residual module is constructed as described in step 1, and the Transformer block module is embedded into it in a pyramid structure to form the backbone network module, as shown in fig. 7. The segmented tongue body image obtained in step 3 is fed into the backbone network to extract high-level semantic features, attending at multiple scales to the global information of the features extracted by each residual layer.
Step 5, global information modeling by the Transformer module:
A Transformer module is constructed as in step 2; the high-dimensional tensor data output in step 4 is fed into it to attend to global information, and global information modeling is performed on the extracted high-level features.
Step 6, the prediction module obtains feature categories and bounding boxes:
A prediction module is constructed from several linear layers; the high-dimensional tensor data output in step 5 is fed into the prediction module, which outputs category predictions and bounding-box coordinate predictions to obtain the final tongue picture detection feature information.
To sum up, the above embodiments describe a two-stage tongue picture recognition method based on deep learning, divided into two stages: tongue body segmentation based on a Transformer model and a cross-attention network model, and tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. In the first stage, the tongue segmentation stage extracts high-level semantic features through the deep residual module; the Transformer module then performs global information modeling on the extracted high-level semantic information; finally, during upsampling, multi-head attention (MHA) applies weights to the low-level features carried by the skip connections, removing non-tongue interference information and ultimately achieving fine tongue body segmentation. In the second stage, the segmented tongue body image is fed into the backbone network, which, while the deep residual network downsamples the image and extracts features, attends at multiple scales to the global information of the image features, thereby extracting the high-level semantic feature information of the tongue picture; the high-level semantic features are then sent to a Transformer module that continues to attend to their global information and performs global information modeling; finally, a prediction network module generates category and bounding-box information, realizing the recognition of tongue picture features. The network model designed by the invention provides an effective technical means for tongue picture feature recognition in traditional Chinese medicine.
The embodiments of the present invention have been described with reference to the accompanying drawings, but the present invention is not limited to these embodiments; various changes and modifications can be made according to the purpose of the invention. Any change, modification, substitution, combination, or simplification made according to the spirit and principle of the technical solution of the present invention shall be an equivalent substitution and shall fall within the protection scope of the present invention, as long as it meets the purpose of the invention and does not depart from the technical principle and inventive concept of the present invention.

Claims (7)

1. A two-stage tongue picture identification method based on deep learning, characterized in that the recognition of tongue picture features is realized in two stages, the first stage being tongue body segmentation based on a Transformer model and a cross-attention network model and the second stage being tongue picture feature detection based on a Swin Transformer tongue picture recognition network model, the method comprising the following steps:
the first stage is as follows:
step 1, feature extraction by the deep residual network:
constructing a deep residual network module with a skip connection structure, so that the network gradient does not vanish, tensor data propagate into the deep layers of the network, and high-level semantic features are extracted efficiently;
step 2, global information modeling by the Transformer module:
constructing a Transformer module from multi-head self-attention, multi-layer perceptron, residual connection, and layer normalization computing units; applying a 1 × 1 convolution to the high-level semantic features extracted by the deep residual network module to compress the number of channels, converting the result into serialized data by a reshape operation, and feeding the sequence data into the Transformer module, which attends to global information and performs global information modeling on the extracted high-level features;
step 3, feature fusion in the skip-connection module:
embedding multi-head attention into the skip connection structure to construct a skip-connection module; passing the low-level detail features produced by the deep residual module of step 1 into the multi-head attention MHA module through the skip connection while the high-level semantic features produced in step 2 are passed into the MHA module through upsampling; applying weights to the low-level detail features by attending to the high-level semantic features, removing non-tongue interference information, fusing the result with the upsampled features, and reconstructing the feature map to obtain a fine tongue segmentation image;
the second stage is as follows:
step 4, feature extraction by the backbone network module:
constructing the deep residual module of step 1 and a Transformer block module from the computing units of step 2, and embedding the Transformer block module into the deep residual module in a pyramid structure to form the backbone network module; feeding the segmented tongue body image obtained in step 3 into the backbone network to extract high-level semantic features, attending at multiple scales to the global information of the features extracted by each residual layer;
step 5, global information modeling by the Transformer module:
feeding the high-dimensional tensor data output in step 4 into the Transformer module constructed in step 2 to attend to global information, and performing global information modeling on the extracted high-level features;
step 6, the prediction module obtains feature categories and bounding boxes:
constructing a prediction module from several linear layers, and feeding the high-dimensional tensor data output in step 5 into the prediction module, which outputs category predictions and bounding-box coordinate predictions to obtain the final tongue picture detection feature information.
2. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 1 the deep residual module is a modified ResNet50 structure in which the original five convolution stages are reduced to four and the output dimension is reduced from 2048 to 1024.
3. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 2 the Transformer module is formed by connecting multiple Transformer layers in series, the number of layers being 12.
4. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 3 the multi-head attention module MHA applies weights to the low-level features input through the skip connection by attending to the high-level semantic features delivered by upsampling, removes non-tongue interference information, then fuses the result with the upsampled features and reconstructs the feature map to obtain a fine tongue segmentation image.
5. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 3, before transmission, the dimensionality of the low-dimensional downsampled tensor data delivered to the MHA module through the skip connection structure and of the high-dimensional upsampled data delivered to the MHA module is changed and the feature data is partitioned, reducing the computational complexity of the MHA module and lightening the network model.
6. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 4 the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module, attending at multiple scales to the global information of the features extracted by each residual layer.
7. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein in step 4 the Transformer block module embedded in the deep residual module in a pyramid structure changes the data dimensions before computation and partitions the tensor data into windows to reduce its computational complexity; the partitioned windows alternate between regular windows and shifted windows, realizing cross-window connections and cross-window feature interaction.
CN202110889480.1A 2021-08-04 2021-08-04 Two-stage tongue picture identification method based on deep learning Pending CN113808075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110889480.1A CN113808075A (en) 2021-08-04 2021-08-04 Two-stage tongue picture identification method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110889480.1A CN113808075A (en) 2021-08-04 2021-08-04 Two-stage tongue picture identification method based on deep learning

Publications (1)

Publication Number Publication Date
CN113808075A true CN113808075A (en) 2021-12-17

Family

ID=78893319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110889480.1A Pending CN113808075A (en) 2021-08-04 2021-08-04 Two-stage tongue picture identification method based on deep learning

Country Status (1)

Country Link
CN (1) CN113808075A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550305A (en) * 2022-03-04 2022-05-27 合肥工业大学 Human body posture estimation method and system based on Transformer
CN114612759A (en) * 2022-03-22 2022-06-10 北京百度网讯科技有限公司 Video processing method, video query method, model training method and model training device
CN116189884A (en) * 2023-04-24 2023-05-30 成都中医药大学 Multi-mode fusion traditional Chinese medicine physique judging method and system based on facial vision
CN117877686A (en) * 2024-03-13 2024-04-12 自贡市第一人民医院 Intelligent management method and system for traditional Chinese medicine nursing data

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378882A (en) * 2019-07-09 2019-10-25 北京工业大学 A kind of Chinese medicine tongue nature method for sorting colors of multi-layer depth characteristic fusion
CN110765966A (en) * 2019-10-30 2020-02-07 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters
CN111223553A (en) * 2020-01-03 2020-06-02 大连理工大学 Two-stage deep migration learning traditional Chinese medicine tongue diagnosis model
CN111723196A (en) * 2020-05-21 2020-09-29 西北工业大学 Single document abstract generation model construction method and device based on multi-task learning
WO2020215697A1 (en) * 2019-08-09 2020-10-29 平安科技(深圳)有限公司 Tongue image extraction method and device, and a computer readable storage medium
CN112349427A (en) * 2020-10-21 2021-02-09 上海中医药大学 Diabetes prediction method based on tongue picture and depth residual convolutional neural network
CN113011436A (en) * 2021-02-26 2021-06-22 北京工业大学 Traditional Chinese medicine tongue color and fur color collaborative classification method based on convolutional neural network
CN113139971A (en) * 2021-03-22 2021-07-20 杭州电子科技大学 Tongue picture identification method and system based on artificial intelligence
US20210232813A1 (en) * 2020-01-23 2021-07-29 Tongji University Person re-identification method combining reverse attention and multi-scale deep supervision

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378882A (en) * 2019-07-09 2019-10-25 北京工业大学 A kind of Chinese medicine tongue nature method for sorting colors of multi-layer depth characteristic fusion
WO2020215697A1 (en) * 2019-08-09 2020-10-29 平安科技(深圳)有限公司 Tongue image extraction method and device, and a computer readable storage medium
CN110765966A (en) * 2019-10-30 2020-02-07 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters
CN111223553A (en) * 2020-01-03 2020-06-02 大连理工大学 Two-stage deep migration learning traditional Chinese medicine tongue diagnosis model
US20210232813A1 (en) * 2020-01-23 2021-07-29 Tongji University Person re-identification method combining reverse attention and multi-scale deep supervision
CN111723196A (en) * 2020-05-21 2020-09-29 西北工业大学 Single document abstract generation model construction method and device based on multi-task learning
CN112349427A (en) * 2020-10-21 2021-02-09 上海中医药大学 Diabetes prediction method based on tongue picture and depth residual convolutional neural network
CN113011436A (en) * 2021-02-26 2021-06-22 北京工业大学 Traditional Chinese medicine tongue color and fur color collaborative classification method based on convolutional neural network
CN113139971A (en) * 2021-03-22 2021-07-20 杭州电子科技大学 Tongue picture identification method and system based on artificial intelligence

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
於张闲; 冒宇清; 胡孔法: "Identification of false health information based on deep learning", 软件导刊 (Software Guide), no. 03, 15 March 2020 (2020-03-15) *
汤一平; 王丽冉; 何霞; 陈朋; 袁公萍: "Research on tongue image classification based on multi-task convolutional neural networks", 计算机科学 (Computer Science), no. 12, 15 December 2018 (2018-12-15) *
王丽冉; 汤一平; 陈朋; 何霞; 袁公萍: "Two-stage convolutional neural network design for tongue body segmentation", 中国图象图形学报 (Journal of Image and Graphics), no. 10, 16 October 2018 (2018-10-16) *
王俊豪; 罗轶凤: "Enriching image descriptions through fine-grained semantic features and Transformer", 华东师范大学学报(自然科学版) (Journal of East China Normal University, Natural Science), no. 05, 25 September 2020 (2020-09-25) *
田应仲 et al.: "Research on a kernelized correlation filter visual target following algorithm fused with convolutional neural networks", 计算测量与控制 (Computer Measurement & Control), vol. 28, no. 12, 31 December 2020 (2020-12-31), pages 176-180 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550305A (en) * 2022-03-04 2022-05-27 合肥工业大学 Human body posture estimation method and system based on Transformer
CN114612759A (en) * 2022-03-22 2022-06-10 北京百度网讯科技有限公司 Video processing method, video query method, model training method and model training device
CN116189884A (en) * 2023-04-24 2023-05-30 成都中医药大学 Multi-mode fusion traditional Chinese medicine physique judging method and system based on facial vision
CN116189884B (en) * 2023-04-24 2023-07-25 成都中医药大学 Multi-mode fusion traditional Chinese medicine physique judging method and system based on facial vision
CN117877686A (en) * 2024-03-13 2024-04-12 自贡市第一人民医院 Intelligent management method and system for traditional Chinese medicine nursing data
CN117877686B (en) * 2024-03-13 2024-05-07 自贡市第一人民医院 Intelligent management method and system for traditional Chinese medicine nursing data

Similar Documents

Publication Publication Date Title
CN113808075A (en) Two-stage tongue picture identification method based on deep learning
CN112651973B (en) Semantic segmentation method based on cascade of feature pyramid attention and mixed attention
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
CN110120055B (en) Fundus fluorography image non-perfusion area automatic segmentation method based on deep learning
CN112150469B (en) Laser speckle contrast image segmentation method based on unsupervised field self-adaption
CN111161200A (en) Human body posture migration method based on attention mechanism
CN107145893A (en) A kind of image recognition algorithm and system based on convolution depth network
CN115018809A (en) Target area segmentation and identification method and system of CT image
CN114445420A (en) Image segmentation model with coding and decoding structure combined with attention mechanism and training method thereof
WO2024104035A1 (en) Long short-term memory self-attention model-based three-dimensional medical image segmentation method and system
CN117274599A (en) Brain magnetic resonance segmentation method and system based on combined double-task self-encoder
CN115526829A (en) Honeycomb lung focus segmentation method and network based on ViT and context feature fusion
CN115271033A (en) Medical image processing model construction and processing method based on federal knowledge distillation
CN117078941A (en) Cardiac MRI segmentation method based on context cascade attention
CN116862891A (en) Double-branch OCT blood vessel hyperfine semantic segmentation method of encoder-decoder structure
Wang et al. Tiny-lesion segmentation in oct via multi-scale wavelet enhanced transformer
CN111543985A (en) Brain control hybrid intelligent rehabilitation method based on novel deep learning model
CN115565671A (en) Atrial fibrillation auxiliary analysis method based on cross-model mutual teaching semi-supervision
CN116416434A (en) Medical image segmentation method based on Swin transducer fused with multi-scale features and multi-attention mechanism
CN115761377A (en) Smoker brain magnetic resonance image classification method based on contextual attention mechanism
CN116309278A (en) Medical image segmentation model and method based on multi-scale context awareness
CN115984560A (en) Image segmentation method based on CNN and Transformer
CN115526898A (en) Medical image segmentation method
CN115222748A (en) Multi-organ segmentation method based on parallel deep U-shaped network and probability density map
CN115147303A (en) Two-dimensional ultrasonic medical image restoration method based on mask guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination