CN113808075A - Two-stage tongue picture identification method based on deep learning - Google Patents
- Publication number: CN113808075A (application CN202110889480.1A)
- Authority
- CN
- China
- Prior art keywords: module, tongue, transformer, network, stage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T 7/0012 — Image analysis; biomedical image inspection
- G06N 3/045 — Neural networks; combinations of networks
- G06N 3/08 — Neural networks; learning methods
- G06T 7/10 — Image analysis; segmentation; edge detection
- G06T 2207/30004 — Biomedical image processing
Abstract
The invention discloses a two-stage tongue picture identification method based on deep learning, which comprises two stages: tongue body segmentation based on a Transformer model and a cross-attention mechanism network model, and tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. The first stage, tongue body segmentation, comprises feature extraction by a deep residual network, global information modeling by a Transformer module, and feature fusion by a skip-connection module. The second stage, tongue picture feature detection with the Swin Transformer-based recognition network model, comprises feature extraction by a backbone network module, global information modeling by a Transformer module, and prediction of feature categories and bounding boxes by a prediction module. The invention realizes the recognition of tongue picture features, and the designed network models provide an effective technical means for tongue picture feature recognition in traditional Chinese medicine.
Description
Technical Field
The invention belongs to the technical field of traditional Chinese medicine tongue diagnosis and treatment assistance, and particularly relates to a two-stage tongue picture identification method based on deep learning, which addresses the low diagnostic accuracy of deep learning in the computerization of the traditional Chinese medicine tongue diagnosis method.
Background
Tongue diagnosis is an important part of the four diagnostic methods of traditional Chinese medicine: inspection, listening and smelling, inquiry, and palpation. Through the internal connection between the tongue and the internal organs via the meridians, observing the tongue condition can qualitatively reveal pathological changes and their degree, reflecting the abundance or deficiency of pathogenic factors and of the patient's vital qi, blood and body fluids. With the rapid development of artificial intelligence and computer vision and the application of convolutional neural networks to image processing, tongue diagnosis has the following application requirements:
(1) quantitatively analyzing the tongue color, the tongue fur color, the tongue cracks, the tongue tooth marks and the like to quantify the tongue diagnosis;
(2) identifying color differences which are difficult to observe by human eyes through computer image processing;
(3) the interference of external environmental factors is reduced, and the accuracy of tongue diagnosis is improved. Deep learning and convolutional neural networks are widely applied to image recognition by virtue of their strong feature learning and expression capability, so research on applying deep learning to the tongue diagnosis method is of great significance for developing traditional Chinese medicine and improving tongue diagnosis accuracy.
Disclosure of Invention
The invention provides a two-stage tongue diagnosis identification method based on deep learning. Taking traditional Chinese medicine tongue-coating images as the research object, it analyzes the characteristics and difficulties of tongue picture identification in such images and designs a two-stage method of segmentation first and identification second: a patient's tongue picture collected by a professional tongue picture collector is input in turn into a tongue body segmentation network and a tongue picture detection network, and the recognition of tongue diagnosis features is realized with the image segmentation and object detection methods of deep learning. This work aims to simulate the traditional Chinese medicine diagnosis process, realize the standardization and computerization of traditional Chinese medicine tongue diagnosis, and provide medical staff with a real-time diagnosis and treatment scheme and auxiliary decisions.
In order to achieve the purpose, the invention adopts the following inventive concept:
A two-stage tongue diagnosis identification method based on deep learning realizes the task modularly, designing a tongue body segmentation network model and a tongue picture identification network model. The whole process is divided into two stages:
In the first stage, a tongue segmentation network model based on the Transformer and a cross-attention mechanism is used.
First, a deep residual network module extracts high-level semantic features, converting the input tongue picture image into high-dimensional tensor data to extract bottom-layer features. Then a Transformer module attends to global information, performing global information modeling on the extracted high-level features. Finally, an attention mechanism applies weights to the low-level features transmitted through the skip connections during upsampling, removing non-tongue interference information and realizing fine tongue segmentation.
In the second stage, a tongue picture recognition network model based on the Swin Transformer is used.
A Transformer module is embedded into a deep residual network with a pyramid structure to form the backbone of the detection network. The tongue picture image is sent into the backbone network, where the deep residual network downsamples the image and, during feature extraction, attends at multiple scales to the global information of the image features, so that the backbone extracts the high-level semantic feature information of the tongue picture. The high-level semantic features are then sent to a Transformer module to continue attending to their global information for global information modeling. Finally, a prediction network module generates category and bounding-box information, realizing the identification of tongue picture features.
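The segment-first, identify-second flow described above can be illustrated end to end with a toy numpy sketch. Both stage implementations here are placeholder assumptions (a brightness threshold standing in for the segmentation network, a box fitted to the mask standing in for the detection network), not the patent's models:

```python
import numpy as np

def stage1_segment(image):
    """Stage 1 stub: return a binary tongue mask (toy threshold; the
    patent uses a Transformer + cross-attention segmentation network)."""
    return (image.mean(axis=-1) > 0.5).astype(float)

def stage2_detect(tongue_only):
    """Stage 2 stub: return (class_id, [x0, y0, x1, y1]) for the masked
    image (the patent uses a Swin Transformer detection network)."""
    ys, xs = np.nonzero(tongue_only.sum(axis=-1) > 0)
    if len(xs) == 0:
        return None
    return 0, [xs.min(), ys.min(), xs.max(), ys.max()]

def two_stage_pipeline(image):
    mask = stage1_segment(image)            # segment first ...
    tongue_only = image * mask[..., None]   # keep only tongue pixels
    return stage2_detect(tongue_only)       # ... then detect features

img = np.zeros((32, 32, 3))
img[8:24, 10:22] = 0.9                      # bright "tongue" region
cls, box = two_stage_pipeline(img)
```

The key design point the sketch preserves is that stage 2 only ever sees pixels stage 1 kept, so background interference cannot reach the detector.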
According to the inventive concept, the invention adopts the following technical scheme:
A two-stage tongue picture identification method based on deep learning realizes the recognition of tongue picture features in two stages: the first stage is tongue body segmentation based on a Transformer model and a cross-attention mechanism network model, and the second stage is tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. The method comprises the following steps:
The first stage, tongue body segmentation based on the Transformer model and the cross-attention mechanism network model, comprises the following specific steps:
Step 1, feature extraction by the deep residual network module: a deep residual network module is constructed with a skip connection structure so that the gradient of the network does not vanish, ensuring that tensor data can be transmitted into the deep part of the network and that high-level semantic features are extracted efficiently;
Step 2, global information modeling by the Transformer module: a Transformer module is constructed from Multi-Head Self-Attention (MHSA), Multi-Layer Perceptron (MLP), residual connection and Layer Normalization (LN) calculation units; a 1×1 convolution compresses the channel number of the high-level semantic features extracted by the deep residual network module, which are then converted into serialized data and sent into the Transformer module to attend to global information, performing global information modeling on the extracted high-level features;
Step 3, feature fusion by the skip-connection module:
Multi-Head Attention (MHA) is embedded into the skip connection structure to construct a skip-connection module. The low-level detail features generated by the deep residual module in step 1 are transmitted into the MHA module through the skip connection, while the high-level semantic features generated in step 2 are transmitted into the MHA module through upsampling. Attending to the high-level semantic features applies weights to the low-level detail features and removes non-tongue interference information; feature fusion with the upsampled features and reconstruction of the feature map then yield a fine tongue segmentation image;
The second stage, tongue picture feature detection with the Swin Transformer-based tongue picture recognition network model, comprises the following specific steps:
Step 4, feature extraction by the backbone network module:
The deep residual module of step 1 is constructed, the calculation units of step 2 build a Transformer block module, and the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module. The segmented tongue body image obtained in step 3 is sent into the backbone network to extract high-level semantic features, attending at multiple scales to the global information of the features extracted by each residual layer;
Step 5, global information modeling by the Transformer module: a Transformer module is constructed as in step 2, and the high-dimensional tensor data output in step 4 are sent into it to attend to global information, performing global information modeling on the extracted high-level features;
Step 6, the prediction module obtains feature categories and bounding boxes:
A prediction module is constructed from several linear layers; the high-dimensional tensor data output in step 5 are sent into the prediction module, which outputs category predictions and bounding-box coordinate predictions to obtain the final tongue picture detection feature information.
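Step 6's prediction head can be sketched in numpy. The token count, class count, and sigmoid box parameterization are illustrative assumptions, since the text only specifies linear layers producing category and bounding-box predictions:

```python
import numpy as np

def prediction_head(tokens, w_cls, w_box):
    """Each feature token yields class scores and a (cx, cy, w, h) box;
    a sigmoid keeps box coordinates in [0, 1] relative to the image."""
    logits = tokens @ w_cls                           # category prediction
    boxes = 1.0 / (1.0 + np.exp(-(tokens @ w_box)))   # bounding-box prediction
    return logits, boxes

rng = np.random.default_rng(3)
tokens = rng.standard_normal((10, 32))   # 10 feature tokens, 32 channels
logits, boxes = prediction_head(
    tokens,
    rng.standard_normal((32, 5)) * 0.1,  # assumed 5 tongue-feature classes
    rng.standard_normal((32, 4)) * 0.1)  # 4 box coordinates
```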
Preferably, the deep residual network module in step 1 is a ResNet50 network structure; to obtain a lightweight model, the original five convolution stages are reduced to four, and the output dimension is changed from 2048 to 1024.
Preferably, the Transformer module in step 2 is formed by connecting multiple Transformer layers in series, with 12 layers in total.
Preferably, in step 3, the multi-head attention module MHA applies weights to the low-level features input through the skip connection by attending to the high-level semantic features transmitted by upsampling, removes non-tongue interference information, then performs feature fusion with the upsampled features and reconstructs the feature map, obtaining a fine tongue segmentation image.
Preferably, in step 3, before the low-dimensional downsampled tensor data from the skip connection structure and the high-dimensional upsampled tensor data enter the MHA module, the dimensionality of the tensor data is changed and the feature data are partitioned, reducing the computational complexity of the MHA module and lightening the network model.
Preferably, in step 4, the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module, attending at multiple scales to the global information of the features extracted by each residual layer.
Preferably, in step 4, the Transformer block module embedded in the pyramid structure of the deep residual module changes the data dimensions before calculation and partitions the tensor data into windows to reduce its computational complexity; the partitioned windows alternate between regular windows and moving (shifted) windows, realizing cross-window connection and cross-window feature interaction, compensating for the limitation of seeing only a single window, enlarging the receptive field and bringing higher efficiency.
Compared with the prior art, the invention has the following obvious and prominent substantive characteristics and remarkable technical progress:
1. The method makes full use of the high-resolution spatial information of the deep residual network features and the global semantic information encoded by the Transformer to improve the model's learning capacity. A multi-head attention module is added to the skip-connection module; attending to the upsampled high-level features applies weights to the low-level features transmitted in the skip connection and removes interference information, yielding a finer tongue body segmentation;
2. The method embeds the Transformer module into the deep residual network module in a pyramid structure and attends at multiple scales to the global information of the features extracted by each residual layer, accelerating model convergence and improving recognition and detection accuracy.
Drawings
FIG. 1 is a flow chart of the two-stage tongue recognition of the present invention.
FIG. 2 is a diagram of a first stage tongue segmentation network framework of the present invention.
Fig. 3 is a block diagram of the first stage depth residual module of the present invention.
FIG. 4 is a diagram of the first-stage Transformer module of the present invention.
Fig. 5 is a block diagram of a first stage skip connect module of the present invention.
FIG. 6 is a diagram of a second stage tongue picture recognition network framework of the present invention.
Fig. 7 is a block diagram of the backbone network in the second stage of the present invention.
FIG. 8 is a diagram of the Transformer block module in the second stage of the present invention.
Detailed Description
The details of the structure and operation of the preferred embodiment of the present invention are described in further detail below with reference to the accompanying drawings.
The first embodiment is as follows:
Referring to fig. 1, a two-stage tongue picture recognition method based on deep learning realizes the recognition of tongue picture features in two stages. The first stage is tongue body segmentation based on a Transformer model and a cross-attention mechanism network model, and the second stage is tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. The method comprises the following steps:
the first stage is as follows:
Step 1, feature extraction by the deep residual network module: a deep residual network module is constructed with a skip connection structure so that the gradient of the network does not vanish, ensuring that tensor data can be transmitted into the deep part of the network and that high-level semantic features are extracted efficiently;
Step 2, global information modeling by the Transformer module: a Transformer module is constructed from Multi-Head Self-Attention (MHSA), Multi-Layer Perceptron (MLP), residual connection and Layer Normalization (LN) calculation units; a 1×1 convolution compresses the channel number of the high-level semantic features extracted by the deep residual network module, which are then converted into serialized data and sent into the Transformer module to attend to global information, performing global information modeling on the extracted high-level features;
Step 3, feature fusion by the skip-connection module:
Multi-Head Attention (MHA) is embedded into the skip connection structure to construct a skip-connection module. The low-level detail features generated by the deep residual module in step 1 are transmitted into the MHA module through the skip connection, while the high-level semantic features generated in step 2 are transmitted into the MHA module through upsampling. Attending to the high-level semantic features applies weights to the low-level detail features and removes non-tongue interference information; feature fusion with the upsampled features and reconstruction of the feature map then yield a fine tongue segmentation image;
and a second stage:
Step 4, feature extraction by the backbone network module:
The deep residual module of step 1 is constructed, the calculation units of step 2 build a Transformer block module, and the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module. The segmented tongue body image obtained in step 3 is sent into the backbone network to extract high-level semantic features, attending at multiple scales to the global information of the features extracted by each residual layer;
Step 5, global information modeling by the Transformer module: a Transformer module is constructed as in step 2, and the high-dimensional tensor data output in step 4 are sent into it to attend to global information, performing global information modeling on the extracted high-level features;
Step 6, the prediction module obtains feature categories and bounding boxes:
A prediction module is constructed from several linear layers; the high-dimensional tensor data output in step 5 are sent into the prediction module, which outputs category predictions and bounding-box coordinate predictions to obtain the final tongue picture detection feature information.
This embodiment realizes the recognition of tongue diagnosis features with the image segmentation and object detection methods of deep learning, simulating the traditional Chinese medicine diagnosis process, realizing the standardization and computerization of traditional Chinese medicine tongue diagnosis, and providing medical staff with a real-time diagnosis and treatment scheme and auxiliary decisions.
Example two:
this embodiment is substantially the same as the first embodiment, and is characterized in that:
the depth residual module in the step 1 is a modified structure of a ResNet50 network, an original five-layer convolution module is changed into four layers, and an output dimension is changed from 2048 dimensions to 1024 dimensions.
The Transformer module in step 2 is formed by connecting multiple Transformer layers in series, with 12 layers in total.
In step 3, the multi-head attention module MHA applies weights to the low-level features input through the skip connection by attending to the high-level semantic features transmitted by upsampling, removes non-tongue interference information, then performs feature fusion with the upsampled features and reconstructs the feature map, obtaining a fine tongue segmentation image.
In step 3, before the low-dimensional downsampled tensor data from the skip connection structure and the high-dimensional upsampled tensor data enter the MHA module, the dimensionality of the tensor data is changed and the feature data are partitioned, reducing the computational complexity of the MHA module and lightening the network model.
In step 4, the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module, attending at multiple scales to the global information of the features extracted by each residual layer.
In step 4, the Transformer block module embedded in the pyramid structure of the deep residual module changes the data dimensions before calculation and partitions the tensor data into windows to reduce its computational complexity; the partitioned windows alternate between regular windows and moving (shifted) windows, realizing cross-window connection and cross-window feature interaction, compensating for the limitation of seeing only a single window, enlarging the receptive field and bringing higher efficiency.
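The regular/moving window scheme described above can be sketched in numpy. The window size and the half-window shift amount are conventional Swin-style assumptions, not values stated in the text:

```python
import numpy as np

def window_partition(fmap, ws):
    """Split an (H, W, C) map into non-overlapping ws*ws windows so the
    attention cost grows with window size, not with image size."""
    h, w, c = fmap.shape
    x = fmap.reshape(h // ws, ws, w // ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, c)

def window_reverse(wins, ws, h, w):
    """Inverse of window_partition: reassemble windows into an (H, W, C) map."""
    c = wins.shape[-1]
    x = wins.reshape(h // ws, w // ws, ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

def shifted_windows(fmap, ws):
    """Cyclically shift the map by ws//2 before partitioning, so the next
    block's windows straddle the previous windows' borders (cross-window
    connection)."""
    shifted = np.roll(fmap, (-(ws // 2), -(ws // 2)), axis=(0, 1))
    return window_partition(shifted, ws)

fmap = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
wins = window_partition(fmap, 4)   # 4 windows of 16 tokens each
back = window_reverse(wins, 4, 8, 8)
```

Alternating `window_partition` and `shifted_windows` between consecutive blocks is what lets information propagate across window boundaries without ever computing full-image attention.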
This embodiment makes full use of the high-resolution spatial information of the deep residual network features and the global semantic information encoded by the Transformer to improve the model's learning capability; a multi-head attention module is added to the skip-connection module, and attending to the upsampled high-level features applies weights to the low-level features transmitted in the skip connection, removing interference information and yielding a finer tongue body segmentation. The embodiment also embeds the Transformer module into the deep residual network module in a pyramid structure and attends at multiple scales to the global information of the features extracted by each residual layer, accelerating model convergence and improving recognition and detection accuracy.
Example three:
Referring to fig. 1, the two-stage tongue picture recognition method based on deep learning is divided into two stages: tongue body segmentation based on a Transformer model and a cross-attention mechanism network model, and tongue picture feature detection based on a Swin Transformer tongue picture recognition network model. In the first stage, as shown in fig. 2, the tongue segmentation stage extracts high-level semantic features through the deep residual module; a Transformer module then performs global information modeling on the extracted high-level semantic information; finally, the multi-head attention MHA applies weights to the low-level features transmitted through the skip connections during upsampling, removing non-tongue interference information and realizing fine tongue segmentation. In the second stage, as shown in fig. 6, the segmented tongue body image is sent into the backbone network, where the deep residual network downsamples the image and attends at multiple scales to the global information of the image features during feature extraction, so that the backbone extracts the high-level semantic feature information of the tongue picture; the high-level semantic features are then sent to a Transformer module to continue attending to their global information for global information modeling; finally, a prediction network module generates category and bounding-box information, realizing the identification of tongue picture features.
The specific implementation of each stage is as follows:
Tongue body segmentation is based on the Transformer model and the cross-attention mechanism network model; the network model framework is shown in FIG. 2, and the specific implementation steps are as follows:
As shown in fig. 3, a deep residual network module is constructed with a skip connection structure, so that the gradient of the network does not vanish, tensor data can be transmitted into the deep part of the network, and high-level semantic features are extracted efficiently. The deep residual module comprises four residual stages, Res-1, Res-2, Res-3 and Res-4; the specific structure parameters are shown in Table 1.
TABLE 1. Deep residual module structure parameters
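The skip-connection idea behind the residual structure can be sketched minimally in numpy (a toy fully connected residual block with random placeholder weights, not the actual Res-1 to Res-4 convolution stages):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x): the identity shortcut lets gradients (and
    tensor data) pass unchanged to deep layers, so they do not vanish."""
    h = relu(x @ w1)    # first transform of the residual branch F
    f = h @ w2          # second transform (no activation yet)
    return relu(f + x)  # identity shortcut added before the activation

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))            # 4 feature vectors, 64 channels
w1 = rng.standard_normal((64, 64)) * 0.1
w2 = rng.standard_normal((64, 64)) * 0.1
y = residual_block(x, w1, w2)
```

Note that with all-zero weights the block reduces to `ReLU(x)`: the shortcut guarantees the layer can never do worse than pass its input through, which is the property the text relies on for training very deep extractors.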
As shown in fig. 4, a Transformer module is constructed from the multi-head self-attention MHSA, multilayer perceptron MLP, residual connection and layer normalization LN calculation units. The high-level semantic feature data extracted by the deep residual network module are sent into the Transformer module and pass through multiple Transformer layers; the calculation from the k-th layer to the (k+1)-th layer proceeds as follows. First, a 1×1 convolution compresses the channel number of the high-level semantic features, a reshape operation then converts them into serialized data, and finally position coding is added, as in formula (1):

X = Reshape(Conv_1×1(Z)) + E_pos (1)

where Z is the feature map from the deep residual network and E_pos is the position coding.
After obtaining the sequence input data X, the Q, K, V sequence data are calculated by linear layers, as in formula (2):

Q = W_Q·X, K = W_K·X, V = W_V·X (2)
The self-attention matrix and the multi-head self-attention matrix are then calculated in turn, as in formulas (3) and (4):

Attention(Q, K, V) = softmax(Q·K^T / √d)·V (3)

MHSA(X) = Concat(head_1, …, head_h)·W_O (4)
where d is the dimension of Q and K; since the values of Q·K^T grow with d, dividing by √d acts as a normalization; Concat denotes the matrix concatenation operation.
A residual connection and layer normalization calculation is then performed once, as in formula (5):

X' = LN(X + MHSA(X)) (5)
The LN layer is calculated as in formula (6):

LN(x) = γ ⊙ (x − μ) / σ + β (6)
where μ and σ respectively denote the mean and standard deviation of the feature, ⊙ denotes the element-wise product, γ and β are learnable transformation parameters, and H denotes the number of hidden units in the same sequence data.
The data then pass through an MLP network, a feature transformation module located between the self-attention layers; it is essentially a two-layer fully connected network with a ReLU activation function in between, calculated as in formula (7):

MLP(X') = W_2·ReLU(W_1·X' + b_1) + b_2 (7)
Finally, X^(k+1) is obtained through one more residual connection and layer normalization calculation, as in formula (8):

X^(k+1) = LN(X' + MLP(X')) (8)
This yields the feature data after global information modeling by the Transformer module.
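Formulas (2)–(8) (linear Q/K/V projections, scaled-dot-product attention, LN, MLP, residual connections) can be sketched as a single numpy Transformer layer; dimensions, head count, and weight values here are placeholder assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    """Formula (6): normalize each token by its mean/std, then scale/shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def mhsa(x, wq, wk, wv, wo, heads):
    """Formulas (2)-(4): per-head softmax(QK^T/sqrt(d))V, then concat."""
    n, dm = x.shape
    d = dm // heads
    q, k, v = x @ wq, x @ wk, x @ wv
    # split channels into heads: shape (heads, n, d)
    q, k, v = (t.reshape(n, heads, d).transpose(1, 0, 2) for t in (q, k, v))
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    out = (att @ v).transpose(1, 0, 2).reshape(n, dm)   # Concat of heads
    return out @ wo

def transformer_block(x, p, heads=4):
    h = layer_norm(x + mhsa(x, p["wq"], p["wk"], p["wv"], p["wo"], heads),
                   p["g1"], p["b1"])                    # formula (5)
    mlp = np.maximum(h @ p["w1"], 0.0) @ p["w2"]        # formula (7), ReLU MLP
    return layer_norm(h + mlp, p["g2"], p["b2"])        # formula (8)

rng = np.random.default_rng(1)
n, dm = 16, 32                                          # 16 tokens, 32 channels
p = {k: rng.standard_normal((dm, dm)) * 0.1
     for k in ("wq", "wk", "wv", "wo", "w1", "w2")}
p.update(g1=np.ones(dm), b1=np.zeros(dm), g2=np.ones(dm), b2=np.zeros(dm))
y = transformer_block(rng.standard_normal((n, dm)), p)
```

Stacking this block 12 times, as the preferred embodiment specifies, gives the serial multi-layer Transformer module.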
And 3, skipping the feature fusion of the connection modules:
Referring to fig. 2 and 5, multi-head attention MHA is embedded into the skip connection structure to construct the skip-connection module. The low-level detail features Y generated by the deep residual module in step 1 and the high-level semantic features X generated in step 2 are transmitted into the MHA module together; attending to the high-level semantic features X applies weights to the low-level detail features Y and removes non-tongue interference information, after which feature fusion with the upsampled features and reconstruction of the feature map yield a fine tongue segmentation image. The specific data calculation process is as follows:
Y must first be position-encoded so that the network can learn the positional information Y carries, whereas the up-sampled input X needs no position encoding. The skip-connected input data Y and the up-sampled data X are each reshaped into sequence data, and position codes are then added to Y, as in formula (9):
The sequence data X and Y are then passed through linear layers to compute the Q, K, V sequences, as in formula (10):
Q = W_Q·X, K = W_K·Y, V = W_V·Y (10)
Then, the attention matrix based on the up-sampled input X and the multi-head attention matrix are calculated in sequence, as shown in formula (11):
the multi-head attention output sequence data is subjected to reshape operation to obtain Y', and then is spliced with the up-sampling data X to obtain output data O, as shown in formula (12):
After the first-stage network model, the tongue body region is segmented completely and accurately from the tongue image containing the background region.
In the second stage, tongue feature detection is performed by the Swin Transformer-based tongue picture recognition network model. The network model framework is shown in fig. 2, and the specific implementation steps are as follows:
Step 4, feature extraction by the backbone network module:
referring to fig. 6 to 8, the computing unit constructs a Transformer block module as described in step 2, as shown in fig. 8. The Transformer block module consists of a Transformer Encode module and a Swin Transformer Encode module, wherein the Transformer Encode module is used for calculating the Transformer module by directly using a characteristic diagram segmentation method of a regular window, the Swin Transformer Encode module also needs a moving window method to perform rolling operation on characteristic diagram data before segmenting the characteristic diagram, and then the Transformer module is used for calculating after segmenting, and the specific calculation process is formula (13):
The deep residual module constructed in step 1 and the Transformer block module are then combined: the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module, as shown in fig. 7. The segmented tongue body image obtained in step 3 is sent into the backbone network to extract high-level semantic features, attending at multiple scales to the global information of the features extracted by each layer of the residual network.
and (3) constructing a Transformer module in the step 2, sending the high-dimensional tensor data output in the step 4 into the Transformer module to pay attention to global information, and carrying out global information modeling on the extracted high-level features.
Step 6, the prediction module obtains a feature category and a bounding box:
and (3) constructing a prediction module by using a plurality of linear layers, and sending the high-dimensional tensor data output in the step (5) into the prediction module to output category prediction and boundary frame coordinate prediction to obtain final tongue picture detection characteristic information.
To sum up, the above embodiment is a two-stage tongue picture recognition method based on deep learning, divided into two stages: tongue body segmentation based on a Transformer model and a cross-attention mechanism network model, and tongue feature detection based on a Swin Transformer tongue picture recognition network model. In the first stage, the tongue segmentation stage, high-level semantic features are extracted by the deep residual module; the Transformer module then performs global information modeling on the extracted high-level semantic information; finally, during up-sampling, the multi-head attention MHA applies weights to the low-level features carried by the skip connection, removing non-tongue interference information and achieving fine tongue segmentation. In the second stage, the segmented tongue body image is sent into the backbone network, which, while down-sampling the image and extracting features through the deep residual network, attends to multi-scale global information of the image features and extracts high-level semantic feature information of the tongue image; the high-level semantic features are then sent to the Transformer module, which continues attending to the global information of the high-level features and performs global information modeling; finally, the prediction network module generates category and bounding-box information, realizing the recognition of tongue features. The network model designed by the invention provides an effective technical means for tongue feature identification in traditional Chinese medicine.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to these embodiments; various changes and modifications can be made according to the purpose of the invention, and any changes, modifications, substitutions, combinations or simplifications made according to the spirit and principle of the technical solution of the present invention are equivalent substitutions and fall within the protection scope of the present invention, provided they meet the purpose of the invention and do not depart from its technical principle and inventive concept.
Claims (7)
1. A two-stage tongue picture identification method based on deep learning, characterized in that the identification of tongue picture features is realized in two stages: the first stage is tongue body segmentation based on a Transformer model and a cross-attention mechanism network model, and the second stage is tongue feature detection based on a Swin Transformer tongue picture recognition network model, comprising the following steps:
the first stage is as follows:
step 1, feature extraction by the deep residual network:
constructing a deep residual network module using a skip-connection structure, so that the network gradient does not vanish, tensor data is transmitted to the deep layers of the network, and high-level semantic features are efficiently extracted;
step 2, modeling global information of a Transformer module:
constructing a Transformer module from multi-head self-attention, multi-layer perceptron, residual connection and layer normalization computing units; performing a 1×1 convolution on the high-level semantic features extracted by the deep residual network module to compress the number of channels, converting them into serialized data by a reshape operation, and sending the sequence data into the Transformer module to attend to global information, performing global information modeling on the extracted high-level features;
step 3, feature fusion by the skip-connection module:
embedding multi-head attention into the skip-connection structure to construct a skip-connection module; the low-level detail features generated by the deep residual module in step 1 are transmitted into the multi-head attention MHA module through the skip connection, while the high-level semantic features generated in step 2 are transmitted into the MHA module through up-sampling; attending to the high-level semantic features applies weights to the low-level detail features, removing non-tongue interference information; feature fusion with the up-sampled features then reconstructs the feature map and yields a fine tongue segmentation image;
and a second stage:
step 4, feature extraction by the backbone network module:
constructing the deep residual module of step 1 and a Transformer block module from the computing units of step 2, embedding the Transformer block module into the deep residual module in a pyramid structure to form a backbone network module, sending the segmented tongue body image obtained in step 3 into the backbone network to extract high-level semantic features, and attending at multiple scales to the global information of the features extracted by each layer of the residual network;
step 5, modeling global information of the Transformer module:
for the Transformer module constructed in the step 2, the high-dimensional tensor data output in the step 4 are sent to the Transformer module to pay attention to global information, and global information modeling is carried out on the extracted high-level features;
step 6, the prediction module obtains a feature category and a bounding box:
and (3) constructing a prediction module by using a plurality of linear layers, and sending the high-dimensional tensor data output in the step (5) into the prediction module to output category prediction and boundary frame coordinate prediction to obtain final tongue picture detection characteristic information.
2. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein: in step 1, the deep residual module is a modified structure of the ResNet50 network, in which the original five-layer convolution module is reduced to four layers and the output dimension is changed from 2048 to 1024.
3. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein: in step 2, the Transformer module is formed by connecting multi-layer Transformer structures in series, and the number of layers is 12.
4. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein: in step 3, the multi-head attention module MHA applies weights to the low-level features input through the skip connection by attending to the high-level semantic features transmitted from the up-sampling path, removes non-tongue interference information, then performs feature fusion with the up-sampled features and reconstructs the feature map to obtain a fine tongue segmentation image.
5. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein: in step 3, for both the low-dimensional down-sampled tensor data transmitted into the MHA module through the skip-connection structure and the high-dimensional up-sampled data transmitted into the MHA module, the tensor dimensionality is changed and the feature data is partitioned before transmission, which reduces the computational complexity of the MHA module and makes the network model lightweight.
6. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein: in step 4, the Transformer block module is embedded into the deep residual module in a pyramid structure to form the backbone network module, attending at multiple scales to the global information of the features extracted by each layer of the residual network.
7. The two-stage tongue picture recognition method based on deep learning of claim 1, wherein: in step 4, the Transformer block module embedded into the deep residual module in a pyramid structure changes the data dimensions before calculation and partitions the tensor data into windows, thereby reducing the computational complexity of the Transformer block module; meanwhile, the partitioned windows alternately adopt regular windows and shifted windows, realizing cross-window connections and performing cross-window feature interaction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110889480.1A CN113808075A (en) | 2021-08-04 | 2021-08-04 | Two-stage tongue picture identification method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110889480.1A CN113808075A (en) | 2021-08-04 | 2021-08-04 | Two-stage tongue picture identification method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113808075A true CN113808075A (en) | 2021-12-17 |
Family
ID=78893319
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110889480.1A Pending CN113808075A (en) | 2021-08-04 | 2021-08-04 | Two-stage tongue picture identification method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113808075A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110378882A (en) * | 2019-07-09 | 2019-10-25 | 北京工业大学 | A tongue color classification method for traditional Chinese medicine based on multi-layer deep feature fusion |
CN110765966A (en) * | 2019-10-30 | 2020-02-07 | 哈尔滨工业大学 | One-stage automatic recognition and translation method for handwritten characters |
CN111223553A (en) * | 2020-01-03 | 2020-06-02 | 大连理工大学 | Two-stage deep migration learning traditional Chinese medicine tongue diagnosis model |
CN111723196A (en) * | 2020-05-21 | 2020-09-29 | 西北工业大学 | Single document abstract generation model construction method and device based on multi-task learning |
WO2020215697A1 (en) * | 2019-08-09 | 2020-10-29 | 平安科技(深圳)有限公司 | Tongue image extraction method and device, and a computer readable storage medium |
CN112349427A (en) * | 2020-10-21 | 2021-02-09 | 上海中医药大学 | Diabetes prediction method based on tongue picture and depth residual convolutional neural network |
CN113011436A (en) * | 2021-02-26 | 2021-06-22 | 北京工业大学 | Traditional Chinese medicine tongue color and fur color collaborative classification method based on convolutional neural network |
CN113139971A (en) * | 2021-03-22 | 2021-07-20 | 杭州电子科技大学 | Tongue picture identification method and system based on artificial intelligence |
US20210232813A1 (en) * | 2020-01-23 | 2021-07-29 | Tongji University | Person re-identification method combining reverse attention and multi-scale deep supervision |
Non-Patent Citations (5)
Title |
---|
於张闲; 冒宇清; 胡孔法: "Identification of false health information based on deep learning", Software Guide, no. 03, 15 March 2020 (2020-03-15) *
汤一平; 王丽冉; 何霞; 陈朋; 袁公萍: "Research on tongue image classification based on multi-task convolutional neural networks", Computer Science, no. 12, 15 December 2018 (2018-12-15) *
王丽冉; 汤一平; 陈朋; 何霞; 袁公萍: "Design of a two-stage convolutional neural network for tongue body segmentation", Journal of Image and Graphics, no. 10, 16 October 2018 (2018-10-16) *
王俊豪; 罗轶凤: "Enriching image captions with fine-grained semantic features and Transformer", Journal of East China Normal University (Natural Science), no. 05, 25 September 2020 (2020-09-25) *
田应仲 et al.: "Research on a kernelized correlation filter visual target following algorithm fused with convolutional neural networks", Computer Measurement & Control, vol. 28, no. 12, 31 December 2020 (2020-12-31), pages 176-180 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114550305A (en) * | 2022-03-04 | 2022-05-27 | 合肥工业大学 | Human body posture estimation method and system based on Transformer |
CN114612759A (en) * | 2022-03-22 | 2022-06-10 | 北京百度网讯科技有限公司 | Video processing method, video query method, model training method and model training device |
CN116189884A (en) * | 2023-04-24 | 2023-05-30 | 成都中医药大学 | Multi-mode fusion traditional Chinese medicine physique judging method and system based on facial vision |
CN116189884B (en) * | 2023-04-24 | 2023-07-25 | 成都中医药大学 | Multi-mode fusion traditional Chinese medicine physique judging method and system based on facial vision |
CN117877686A (en) * | 2024-03-13 | 2024-04-12 | 自贡市第一人民医院 | Intelligent management method and system for traditional Chinese medicine nursing data |
CN117877686B (en) * | 2024-03-13 | 2024-05-07 | 自贡市第一人民医院 | Intelligent management method and system for traditional Chinese medicine nursing data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113808075A (en) | Two-stage tongue picture identification method based on deep learning | |
CN112651973B (en) | Semantic segmentation method based on cascade of feature pyramid attention and mixed attention | |
CN111242288B (en) | Multi-scale parallel deep neural network model construction method for lesion image segmentation | |
CN110120055B (en) | Fundus fluorography image non-perfusion area automatic segmentation method based on deep learning | |
CN112150469B (en) | Laser speckle contrast image segmentation method based on unsupervised field self-adaption | |
CN111161200A (en) | Human body posture migration method based on attention mechanism | |
CN107145893A (en) | A kind of image recognition algorithm and system based on convolution depth network | |
CN115018809A (en) | Target area segmentation and identification method and system of CT image | |
CN114445420A (en) | Image segmentation model with coding and decoding structure combined with attention mechanism and training method thereof | |
WO2024104035A1 (en) | Long short-term memory self-attention model-based three-dimensional medical image segmentation method and system | |
CN117274599A (en) | Brain magnetic resonance segmentation method and system based on combined double-task self-encoder | |
CN115526829A (en) | Honeycomb lung focus segmentation method and network based on ViT and context feature fusion | |
CN115271033A (en) | Medical image processing model construction and processing method based on federal knowledge distillation | |
CN117078941A (en) | Cardiac MRI segmentation method based on context cascade attention | |
CN116862891A (en) | Double-branch OCT blood vessel hyperfine semantic segmentation method of encoder-decoder structure | |
Wang et al. | Tiny-lesion segmentation in oct via multi-scale wavelet enhanced transformer | |
CN111543985A (en) | Brain control hybrid intelligent rehabilitation method based on novel deep learning model | |
CN115565671A (en) | Atrial fibrillation auxiliary analysis method based on cross-model mutual teaching semi-supervision | |
CN116416434A (en) | Medical image segmentation method based on Swin Transformer fused with multi-scale features and multi-attention mechanism | |
CN115761377A (en) | Smoker brain magnetic resonance image classification method based on contextual attention mechanism | |
CN116309278A (en) | Medical image segmentation model and method based on multi-scale context awareness | |
CN115984560A (en) | Image segmentation method based on CNN and Transformer | |
CN115526898A (en) | Medical image segmentation method | |
CN115222748A (en) | Multi-organ segmentation method based on parallel deep U-shaped network and probability density map | |
CN115147303A (en) | Two-dimensional ultrasonic medical image restoration method based on mask guidance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |