CN110729045A

CN110729045A - Tongue image segmentation method based on context-aware residual error network

Info

Publication number: CN110729045A
Application number: CN201910969290.3A
Authority: CN
Inventors: 李佐勇; 樊好义; 周常恩
Original assignee: Minjiang University
Current assignee: Minjiang University
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2020-01-24

Abstract

The invention relates to a tongue image segmentation method based on a context-aware residual error network, which utilizes a deep neural network to automatically extract image characteristics; determining a candidate region where the tongue body is located by using a region candidate network based on the extracted feature map; and finally, obtaining a tongue body segmentation result by segmenting the candidate region. The invention can effectively improve the accuracy of tongue image segmentation.

Description

Tongue image segmentation method based on context-aware residual error network

Technical Field

The invention relates to the technical field of image processing, in particular to a tongue image segmentation method based on a context-aware residual error network.

Background

Tongue diagnosis is one of the main contents of inspection in traditional Chinese medicine, and is one of the traditional diagnostic methods with the characteristics of traditional Chinese medicine. The tongue picture is the most sensitive index reflecting the physiological function and pathological change of human body and has important application value in the process of traditional Chinese medicine diagnosis and treatment. The image processing technology is applied to establish an objective quantification and identification method of tongue inspection information, so that the automation of the tongue inspection of the traditional Chinese medicine is realized, and the method has important practical significance for the modernization of the traditional Chinese medicine. In the automatic tongue diagnosis system, after tongue images of a patient are acquired by a digital acquisition instrument (an industrial camera, a camera and the like), a target area (a tongue body) must be automatically segmented, and then tongue body characteristics can be extracted and diagnosed. Therefore, the tongue image is segmented into important links connecting tongue image acquisition and tongue body diagnosis, and the segmentation quality directly influences the accuracy of subsequent diagnosis.

The difficulty of tongue image segmentation is: (1) the color of the tongue body is very close to the color of the face, particularly the color of the lips, and is easy to be confused; (2) the tongue body is used as a soft body without a fixed shape, and the individual difference of the shape of the tongue body is large; (3) the tongue body is not smooth, the tongue coating and tongue quality vary from person to person, and the pathological characteristics have large differences; (4) cracks and color blocks of the tongue coating may affect the accurate segmentation of the tongue.

In view of the difficulties and challenges of tongue image segmentation, it is often difficult to obtain satisfactory segmentation results with a single conventional image segmentation technique. Therefore, a fusion of various conventional segmentation techniques has been studied. Under the framework of fusion of various traditional segmentation technologies, the mainstream method is an Active Contour Model (ACM) based method. The ACM is also called Snake model, is a popular variable shape model, and is widely applied to contour extraction. And giving an initial contour curve, and evolving the initial contour curve towards the real target contour by the active contour model under the combined action of internal and external forces. The segmentation method based on the ACM mainly researches points on acquisition of an initial contour and curve evolution. However, the segmentation effect of the conventional tongue image segmentation method still needs to be improved.

Recently, methods based on the deep Convolutional Neural Network (CNN) have enjoyed significant success in the fields of computer vision and image processing. In the field of medical image segmentation, CNN-based methods are also widely used due to their powerful feature learning and representation capabilities. Among these methods, the Full Convolutional Network (FCN) shows good performance in biological cell and organ segmentation. A U-network (U-Net) was developed from FCN and allows for a jump connection between encoder and decoder by extending the symmetric self-encoder design to better locate targets in the image by combining high resolution features in the encoding path with the up-sampled output. The U-Net network was used to identify and segment the cardiac regions of Drosophila at different stages of development. In addition, convolutional neural networks have also been used to construct a focal stack-based method for the automatic detection of plasmodium falciparum malaria from blood smears. Tongue image segmentation based on deep learning has started in the last two years.

Disclosure of Invention

In view of this, the present invention provides a tongue image segmentation method based on a context-aware residual error network, which can effectively improve the accuracy of tongue image segmentation.

The invention is realized by adopting the following scheme: a tongue image segmentation method based on a context-aware residual error network specifically comprises the following steps:

automatically extracting image features by using a deep neural network;

determining a candidate region where the tongue body is located by using a region candidate network based on the extracted feature map;

and finally, obtaining a tongue body segmentation result by segmenting the candidate region.

Further, the automatic extraction of image features by using the deep neural network specifically includes the following steps:

step S11: establishing a cavity residual error module, wherein the corresponding mapping is as follows:

in the formula, x_iAnd x_i+1Respectively representing the input and output of the ith residual block, D representing a hole convolution operation, G_D(. and F)_D() represents two different non-linear mapping groups, wherein each mapping group consists of a hole convolution operation, a batch normalization operation and a ReLU activation function;and

respectively representing two mapped related parameter sets which are weights to be learned by the neural network;

and

respectively representing different weights assigned to the two mapping groups;

step S12: establishing a characteristic pyramid network by using the cavity residual error module established in the step S1 to realize multi-scale characteristic extraction of the tongue image and obtain a multi-scale characteristic diagram;

the feature pyramid network comprises a bottom-up path module, a transverse connection module and a top-down path module, wherein the bottom-up path module is a feature extraction basic network which is formed by serially constructing five context-aware cavity residual error modules, and the transverse connection module is used for connecting a feature diagram of the bottom-up path module to the top-down path module.

Further, the determining the candidate area where the tongue body is located by using the area candidate network specifically includes: the regional candidate network extracts candidate targets on the multi-scale feature map by using a sliding window, then obtains a 2048-dimensional vector through a standard convolutional layer with the size of 3 multiplied by 3 convolutional kernel, and respectively realizes target classification and position positioning of candidate frames by following two branches of candidate frame classification and candidate frame regression formed by the standard convolutional layer with the size of 1 multiplied by 1 convolutional kernel, thereby respectively generating 2k class probabilities and 4k candidate frame coordinate positions; the category probability comprises tongue probability and non-tongue probability, and the candidate frame coordinate position comprises an x coordinate, a y coordinate, box width and box height.

Further, the tongue segmentation result obtained by segmenting the candidate region specifically includes: firstly, a characteristic diagram corresponding to each candidate area is converted into a candidate characteristic diagram with a fixed size by a RoI alignment module and a bilinear interpolation technology, a plurality of candidate characteristic diagrams are aligned, and then final tongue body positioning and segmentation are realized through a positioning branch network and a segmentation branch network respectively.

Furthermore, the positioning branch network is used for performing position regression operation by taking two full-connection layers as regressors to realize accurate positioning; the segmentation branch network takes two layers of standard convolution layers as a pixel classifier to carry out pixel-level classification, namely tongue segmentation.

Further, the loss function adopted in the training process of the positioning branch network and the splitting branch network is as follows:

L＝L_loc+L_mask；

wherein the content of the first and second substances,

in the formula, t_iIs the tongue position marked manually,

the tongue body positioning branch network predicts the tongue body position, and x, y, w and h respectively represent the abscissa of the upper right corner of the tongue body positioning frame, the ordinate of the upper right corner of the tongue body positioning frame, the length of the tongue body and the width of the tongue body;

wherein the content of the first and second substances,

L_mask＝∑_c(1-TI_c)

in the formula, TI_cIs the Tversky similarity metric defined as follows:

in the formula, p_icIs the probability that the predicted pixel i belongs to the tongue class,

is the predicted probability that pixel i does not belong to the tongue class, g _ic1 indicates that pixel i belongs to the tongue category,

indicating that pixel i does not belong to the tongue class, e is an infinitesimal constant to avoid zero, and α and β are two parameters that control the accuracy and recall balance. Wherein α is 0.3 and β is 0.7.

Compared with the prior art, the invention has the following beneficial effects: the method firstly positions the tongue body area, and then classifies the pixel level in the positioned area, so as to realize the final accurate segmentation and effectively avoid the interference of complex background. In the characteristic learning process, in order to extract more representative characteristics, the invention provides a novel cavity residual error module based on context sensing, and the effective extraction of multi-level scale characteristics can be realized by combining a characteristic pyramid network. The invention can effectively improve the accuracy and robustness of tongue image segmentation.

Drawings

FIG. 1 is a schematic diagram of the method of the embodiment of the present invention.

Fig. 2 shows the original residual block structure in ResNet.

FIG. 3 is a block diagram of a context-aware hole residual module according to an embodiment of the present invention.

FIG. 4 is a feature pyramid network according to an embodiment of the present invention.

FIG. 5 is a sample of a feature pattern of an embodiment of the present invention: output of the context-aware feature pyramid network.

Fig. 6 is a diagram illustrating a candidate area network according to an embodiment of the present invention.

FIG. 7 is a box plot of the performance of the various methods on the three datasets. Wherein, (a) is Precision, (b) is Dice, (c) is mloU, (d) is FPR, (e) is FNR, and (f) is ME.

FIG. 8 is a quantitative comparison of the performance of various algorithm segmentations on three data sets.

Fig. 9 is a comparison of the segmentation results for three randomly selected tongue images on three data sets. Where (a) is dataset TestSet1, (b) is TestSet2, and (c) is TestSet 3.

Detailed Description

The invention is further explained below with reference to the drawings and the embodiments.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.

As shown in fig. 1, the present embodiment provides a tongue image segmentation method based on a context-aware residual error network, an overall network framework of the present embodiment is an end-to-end tongue positioning and segmentation deep neural network, referred to as TongueNet for short, and an entire algorithm flow is composed of three stages: a Feature Extraction Stage (Feature Extraction Stage), a Region candidate Stage (Region Prediction Stage), and a Prediction Stage (Prediction Stage). Firstly, in the feature extraction stage, in order to effectively extract image space information and prior information of the tongue body (such as color, shape, tongue fur texture, and the like), the embodiment proposes a pyramid network module, which is based on a hole Convolution constraint and residual learning (residual learning), and can effectively realize multi-scale feature extraction of the tongue image; then, in the Region candidate stage, based on the feature map extracted in the feature extraction stage, the embodiment utilizes a Region candidate Network (Region pro-potential Network) to realize effective coarse positioning of the tongue candidate Region; finally, in the prediction stage, based on the tongue candidate region and the characteristic diagram thereof located in the region candidate stage, the combined learning of two different learning tasks (segmentation and location) is realized by optimizing the multi-task loss function designed by the invention.

The method specifically comprises the following steps:

automatically extracting image features by using a deep neural network;

Experiments show that the segmentation precision of the tongue image is remarkably improved by the method.

Preferably, an ideal feature extraction network should be a neural network deep enough to achieve efficient extraction of multi-scale features. Inspired by the successful application of ResNet in the task of feature extraction and image classification, the present embodiment provides a new Context-aware void Residual Block (Context-aware Residual Block) based on the Residual Block (Residual Blocks) in ResNet, so as to implement extraction of tongue features with more discriminability. The original residual block is composed of convolution layers with different convolution kernel sizes, and the corresponding mapping is shown as the following formula:

wherein x is_iAnd x_i+1Respectively representing the input and output of the ith residual block, and G (-) and F (-) respectively representing two different nonlinear mapping groups, wherein each mapping group consists of a standard convolution operation (StandardConvolition), a Batch Normalization operation (Batch Normalization) and a ReLU activation function;

and

the sets of relevant parameters representing the two mappings, respectively, are the weights that the neural network needs to learn. As shown in FIG. 2, given a residual block of 3 map groups, each map group consists of a standardConvolution layer, a batch normalization layer and a ReLU activation layer. The resolution of the feature map output by such a residual block is half of the original input, which results in a certain loss of spatial information.

In this embodiment, the automatically extracting image features by using the deep neural network specifically includes the following steps:

step S11: unlike the original residual block structure in ResNet, this embodiment proposes a new context-aware hole residual block, and the corresponding mapping is as follows:

in the formula, x_iAnd x_i+1Respectively representing the input and output of the ith residual block, D representing a hole convolution operation, G_D(. and F)_D() represents two different non-linear mapping groups, wherein each mapping group consists of a hole convolution operation, a batch normalization operation and a ReLU activation function;andrespectively representing two mapped related parameter sets which are weights to be learned by the neural network;and

respectively representing different weights assigned to the two mapping groups; here, the present embodiment employs a weighted Skip Connection (Skip Connection) to realize a weighted residual learning. As shown in FIG. 3, given a context-aware hole residual module consisting of 3 map-groups, each map-group consists of a hole convolution layer, a bulk normalization layer, and a ReLU activation layer. In thatIn the feature extraction process, the resolution of the feature map output by the residual block is consistent with the input resolution, so that the loss of spatial information caused by halving the output resolution of the original residual block in ResNet to the input resolution can be avoided. In addition, a more flexible characteristic learning process is realized through weighted residual learning;

as shown in fig. 4, the feature pyramid network includes a Bottom-up path module (Bottom-up path), which is a feature extraction base network serially constructed by five context-aware hole residual modules (DConv1_ x, DConv2_ x, DConv3_ x, DConv4_ x, and DConv5_ x), a horizontal connection module for connecting the feature map of the Bottom-up path module to the top-down path module, and a top-down path module. Finally, a multi-scale feature pyramid is formed and used for coarse positioning of the tongue body region in the candidate stage of the region and accurate positioning and segmentation of the tongue body region in the prediction stage. FIG. 5 is a diagram illustrating multi-scale features output by the context-aware feature pyramid network.

Preferably, in the region candidate stage, based on the multi-scale feature map extracted in the feature extraction stage, the present embodiment utilizes the region candidate network to achieve effective coarse positioning of the tongue candidate region, and the feature map corresponding to the positioned tongue candidate region is used for accurate positioning and segmentation of the tongue in the prediction stage.

In this embodiment, as shown in fig. 6, the determining the candidate region where the tongue is located by using the region candidate network specifically includes: the regional candidate network utilizes a Sliding Window (Sliding Window) to extract candidate targets on a multi-scale feature map, then obtains a 2048-dimensional vector through a standard convolutional layer with the size of 3 multiplied by 3 convolutional kernel, and respectively realizes target Classification and position positioning of candidate frames by following two branches of candidate frame Classification (Box Classification) and candidate frame regression (Box regression) which are formed by the standard convolutional layer with the size of 1 multiplied by 1 convolutional kernel, thereby respectively generating 2k class probabilities and 4k candidate frame coordinate positions; the category probability comprises tongue probability and non-tongue probability, and the candidate frame coordinate position comprises an x coordinate, a y coordinate, box width and box height.

In this embodiment, the tongue segmentation result obtained by segmenting the candidate region specifically includes: firstly, a feature map corresponding to each candidate area is converted into a candidate feature map with a fixed size by a RoI alignment (Align) module and a bilinear interpolation technology, a plurality of candidate feature maps are aligned, and then final tongue body positioning and segmentation are realized through a positioning Branch network (Localization Branch) and a segmentation Branch network (Mask Branch) respectively.

In this embodiment, the positioning branch network performs position regression operation by using two fully-connected layers as regressors, so as to realize accurate positioning; the segmentation branch network takes two layers of standard convolution layers as a pixel classifier to carry out pixel-level classification, namely tongue segmentation.

In this embodiment, the loss function adopted in the training process of the positioning branch network and the splitting branch network is:

L＝L_loc+L_mask；

wherein the content of the first and second substances,

in the formula, t_iIs the tongue position marked manually,

wherein the content of the first and second substances,

L_mask＝∑_c(1-TI_c)

in the formula, TI_cIs the Tversky similarity metric defined as follows:

in the formula, p_icIs the probability that the predicted pixel i belongs to the tongue class,is the predicted probability that pixel i does not belong to the tongue class, g _ic1 indicates that pixel i belongs to the tongue category,

indicating that the pixel i does not belong to the tongue class, and e is an infinitesimal constant to avoid zero, 10 is selected in this embodiment^-8And alpha and beta are two parameters for balancing control accuracy and recall rate.

Wherein α is 0.3 and β is 0.7.

The embodiment improves the accuracy and robustness of tongue image segmentation by a new end-to-end multitask deep learning framework. The tongue body area is firstly positioned, and then pixel-level classification is carried out in the positioned area, so that the final accurate segmentation is realized, and the interference of a complex background is effectively avoided. In the feature learning process, in order to extract more representative features, the embodiment provides a new context-aware cavity residual error module, and the effective extraction of multi-level scale features is realized by combining a feature pyramid network.

Specifically, in order to evaluate the performance of the tongue image segmentation algorithm, the present embodiment performed ten-fold cross validation experiments on three data sets, TestSet1(300 tongue images with a resolution of 768 × 576), TestSet2(331 tongue images with a resolution of 550 × 650), TestSet3(290 tongue images with a resolution of 600 × 576), and the segmentation performance was measured by 6 common segmentation measures. The first 3 measures, namely Precision (Precision), a Dice coefficient (Dice coefficient) and an mlou (mean interaction over union), are commonly used on the performance measurement of a segmentation model based on deep learning, and the larger the measure value is, the better the segmentation performance is; the last 3 measures, namely False Positive Rate/False alarm Rate (FPR), False Negative Rate (FNR), and Misclassification Error (ME), are commonly used in the performance metric of the conventional segmentation model, and a smaller measurement value indicates better segmentation performance. These measures are defined as:

in the formula, B_gAnd F_gBackground and object representing results of manual standard segmentation, B_pAnd F_pRepresenting the background and the target in the segmentation result corresponding to the automatic segmentation algorithm, and | represents the number of elements in the set. The value ranges of the six measures are all 0-1. Lower values of ME, FPR and FNR represent better segmentation; conversely, higher Precision, Dice, and mlou values represent better segmentation results.

To verify the effectiveness of the method of the present embodiment in tongue image segmentation, the present embodiment compares it with the newly proposed deep learning algorithms FCN, U-Net, SegNet, deep learning, and Mask R-CNN. As shown in the box plot of fig. 7 and the table of fig. 8, the metric results of the algorithm of the present invention (TongueNet) are almost the best at six measurements of the three data sets, with the Precision, Dice, mliou measurement values being significantly higher than the other methods, and the FPR and ME measurement values being significantly lower than the other methods. The only exception is that DeepTongue and U-Net are superior to the present algorithm in terms of FNR measure on partial data sets, but this is due to the more obvious over-segmentation phenomenon of the segmentation results of the two algorithms. The boxplot of FIG. 7 further demonstrates that the algorithm of the present invention is more stable than other methods because its outliers are generally less or deviate less.

Fig. 9 shows the manual segmentation result and the algorithm segmentation result of three randomly selected tongue images on three data sets, respectively, wherein the dotted line represents the manual ideal segmentation result and the solid line represents the algorithm segmentation result. As can be seen from FIG. 9, the algorithm segmentation result of the present invention is usually closest to the manual ideal segmentation result (the dotted line has the highest coincidence degree with the implementation), and the segmentation effect is the best; and the best segmentation effect is basically obtained on three randomly selected tongue images of the three data sets, which shows that the segmentation performance is most stable.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims

1. A tongue image segmentation method based on context-aware residual error network is characterized in that,

automatically extracting image features by using a deep neural network;

2. The tongue image segmentation method based on the context-aware residual error network according to claim 1, wherein the automatic image feature extraction by using the deep neural network specifically comprises the following steps:

in the formula, x_iAnd x_i+1Respectively representing the input and output of the ith residual block, D representing a hole convolution operation, G_D(. and F)_D() represents two different non-linear mapping groups, wherein each mapping group consists of a hole convolution operation, a batch normalization operation and a ReLU activation function;

and

and

respectively representing different weights assigned to the two mapping groups;

3. The tongue image segmentation method based on the context-aware residual error network as claimed in claim 1, wherein the determining the candidate region where the tongue body is located by using the region candidate network specifically comprises: the regional candidate network extracts candidate targets on the multi-scale feature map by using a sliding window, then obtains a 2048-dimensional vector through a standard convolutional layer with the size of 3 multiplied by 3 convolutional kernel, and respectively realizes target classification and position positioning of candidate frames by following two branches of candidate frame classification and candidate frame regression formed by the standard convolutional layer with the size of 1 multiplied by 1 convolutional kernel, thereby respectively generating 2k class probabilities and 4k candidate frame coordinate positions; the category probability comprises tongue probability and non-tongue probability, and the candidate frame coordinate position comprises an x coordinate, a y coordinate, box width and box height.

4. The tongue image segmentation method based on the context-aware residual error network according to claim 1, wherein the tongue segmentation result obtained by segmenting the candidate region is specifically: firstly, a characteristic diagram corresponding to each candidate area is converted into a candidate characteristic diagram with a fixed size by a RoI alignment module and a bilinear interpolation technology, a plurality of candidate characteristic diagrams are aligned, and then final tongue body positioning and segmentation are realized through a positioning branch network and a segmentation branch network respectively.

5. The tongue image segmentation method based on the context-aware residual error network according to claim 4, wherein the positioning branch network is implemented by performing position regression operation by using two fully-connected layers as regressors to realize accurate positioning; the segmentation branch network takes two layers of standard convolution layers as a pixel classifier to carry out pixel-level classification, namely tongue segmentation.

6. The tongue image segmentation method based on context-aware residual error network according to claim 4, wherein the loss function adopted in the training process of the positioning branch network and the segmentation branch network is:

L＝L_loc+L_mask；

wherein the content of the first and second substances,

in the formula, t_iIs the tongue position marked manually,

wherein the content of the first and second substances,

L_mask＝∑_c(1-TI_c)

in the formula, TI_cIs the Tversky similarity metric defined as follows:

is the predicted probability that pixel i does not belong to the tongue class, g_ic1 indicates that pixel i belongs to the tongue category,

indicating that pixel i does not belong to the tongue class, e is an infinitesimal constant to avoid zero, and α and β are two parameters that control the accuracy and recall balance.

7. The tongue image segmentation method based on context-aware residual error network as claimed in claim 6, wherein α -0.3 and β -0.7.