CN114627136A - Tongue picture segmentation and alignment method based on feature pyramid network - Google Patents


Info

Publication number
CN114627136A
Authority
CN
China
Prior art keywords
tongue
picture
tongue picture
segmentation
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210106423.6A
Other languages
Chinese (zh)
Other versions
CN114627136B (en
Inventor
张明川
王莎莎
王琳
郑瑞娟
吴庆涛
朱军龙
冀治航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Science and Technology
Original Assignee
Henan University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Science and Technology filed Critical Henan University of Science and Technology
Priority to CN202210106423.6A priority Critical patent/CN114627136B/en
Publication of CN114627136A publication Critical patent/CN114627136A/en
Application granted granted Critical
Publication of CN114627136B publication Critical patent/CN114627136B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G06T7/337Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30204Marker

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a tongue picture segmentation and alignment method based on a feature pyramid network. A feature pyramid is constructed to fuse the network's low-level detail features with its high-level semantic features at multiple scales; a target detection network crops the tongue-body region, and a mask is then generated for each feature map to complete tongue picture segmentation. The segmented tongue picture is further processed, and alignment is achieved through conformal mapping. The method thus offers an efficient and accurate way to process and diagnose deformed medical images, and contributes to the objective development of tongue diagnosis.

Description

Tongue picture segmentation and alignment method based on feature pyramid network
Technical Field
The invention relates to the fields of computer vision, image processing and tongue diagnosis in traditional Chinese medicine, in particular to a tongue picture segmentation and alignment method based on a characteristic pyramid network.
Background
Traditional Chinese medicine holds that the tongue is a window onto changes in the internal organs, reflecting the nature of pathogenic factors, the progression or remission of disease, and the abundance or decline of vital qi. Tongue diagnosis is an important component of inspection in traditional Chinese medicine: the practitioner judges a person's state of health by observing tongue features such as the tongue body, tongue proper, tongue coating, and sublingual collaterals, and it is one of the main bases of clinical diagnosis in traditional Chinese medicine.
Tongue diagnosis is painless and non-invasive, which makes it one of the most common diagnostic methods in traditional Chinese medicine. The appearance of the tongue body, including its color, texture, shape, and coating, reveals a great deal about a person's health. However, traditional tongue diagnosis depends heavily on the clinician's experience, and different doctors examining the same patient may reach different conclusions, so tongue diagnosis in traditional Chinese medicine lacks objectivity. Computer-aided tongue diagnosis can remedy these defects, but acquired tongue pictures also contain background regions such as the face and lips, and some tongue bodies are tilted, which interferes with subsequent aided diagnosis and treatment. The tongue body therefore needs to be segmented and aligned first.
Existing traditional tongue segmentation methods fall roughly into three categories: edge-based methods, region-based methods, and methods fusing region and edge information. Region-based methods use a watershed algorithm to split the tongue image into many small regions and then merge regions by color similarity to obtain the final tongue segmentation. Edge-based methods require edge initialization before segmenting the final tongue. Fusion methods first extract a tongue region of interest from color information and then segment that region in place of the original image. Although these methods succeed to a degree, they remain limited: they are sensitive to complicated backgrounds and lighting changes, some lips are very close in color to the tongue body and require additional preprocessing that complicates the whole pipeline, and tongues with smooth edges cannot be segmented accurately.
In recent years, with the rapid development of deep learning, methods based on deep convolutional neural networks have emerged that improve the robustness of tongue segmentation. Because deep neural networks have strong representation-learning capability, deep learning methods outperform traditional tongue segmentation methods. The shallow layers of a deep network carry rich detail information, while the high layers carry rich semantic information; most networks, however, segment using only the final high-level features and ignore the shallow detail features, even though that detail can improve segmentation accuracy to a certain extent. A single high-level feature cannot achieve an ideal segmentation.
Disclosure of Invention
In order to solve the problems of the traditional methods and of deep networks in tongue picture segmentation, the invention, drawing on the idea of residual networks, provides a tongue picture segmentation and alignment method based on a feature pyramid network.
In order to achieve the purpose, the invention adopts the technical scheme that: a tongue picture segmentation and alignment method based on a feature pyramid network comprises the following steps:
Step 1: acquiring the required tongue pictures with tongue picture acquisition equipment, and performing data enhancement processing on the pictures;
Step 2: manually marking the tongue bodies in all the picture data;
Step 3: preprocessing the marked pictures, and sending them into the constructed feature pyramid network to extract features and obtain effective feature layers;
Step 4: sending the effective feature layers extracted in Step 3 into an RPN to obtain suggestion boxes;
Step 5: performing a maximum pooling operation on all the suggestion boxes obtained in Step 4;
Step 6: connecting two fully connected network layers after the max-pooled suggestion boxes, judging whether each suggestion box contains an object, and then adjusting the suggestion boxes to obtain prediction boxes;
Step 7: taking the prediction boxes obtained in Step 6 as the region-cropping part of a mask model, and after cropping, classifying the pixel points with the mask model to obtain the semantic segmentation result;
Step 8: calculating the boundary region and Fourier coefficients of the segmented tongue picture;
Step 9: calculating the correspondence between the unit disc and the tongue picture boundary;
Step 10: after the boundary values are determined, calculating the mapping of the interior region using the Cauchy integral formula;
Step 11: constructing the conformal mapping, which updates the real and imaginary parts of the function through iteration;
Step 12: outputting the aligned image.
Further, in step 1, when the data enhancement processing is performed on the pictures, the data set is expanded through rotation and horizontal flipping.
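As a concrete illustration of the augmentation in step 1, the following numpy sketch expands one image by rotation and horizontal flipping. The choice of right-angle rotations is an assumption for illustration; the patent does not specify the rotation angles.

```python
import numpy as np

def augment(image, angles=(90, 180, 270)):
    """Expand a data set by rotation and horizontal flipping.

    `image` is an H x W x C array; returns the original plus its
    horizontally flipped, rotated, and rotated-then-flipped variants.
    """
    variants = [image, np.fliplr(image)]        # original + horizontal flip
    for angle in angles:                        # right-angle rotations
        rotated = np.rot90(image, k=angle // 90)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))     # flipped version of each rotation
    return variants

# A dummy H x W x 3 "tongue picture": one image becomes eight samples.
img = np.arange(2 * 3 * 3).reshape(2, 3, 3)
samples = augment(img)
```

Applied to every acquired picture, this multiplies the training set eightfold without new acquisitions.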
Further, in step 2, the Labelme software is used to manually mark the tongue bodies in all the picture data.
Further, in step 5, all the suggestion boxes are fixed to 7 × 7 features after the maximum pooling operation.
Further, the feature pyramid network is constructed from the outputs of the four middle bottleneck (bneck) blocks of MobileNetV2.
Compared with the prior art, the invention has the following beneficial effects: a feature pyramid is constructed to fuse the network's low-level detail features with its high-level semantic features at multiple scales; a target detection network crops the tongue-body region, and a mask is then generated for each feature map to complete tongue picture segmentation. The segmented tongue picture is further processed, and alignment is achieved through conformal mapping. The method thus offers an efficient and accurate way to process and diagnose deformed medical images, and contributes to the objective development of tongue diagnosis.
Drawings
FIG. 1 is a flowchart illustrating a tongue segmentation and alignment method based on a feature pyramid network according to the present invention;
FIG. 2 is a schematic view showing the shape of a unit disc mentioned in the example;
FIG. 3 is a schematic view showing the shape of a standard tongue image in the example;
fig. 4 is a tongue notation diagram given in the examples.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts belong to the protection scope of the present invention.
In the invention, a tongue picture segmentation and alignment method based on a feature pyramid network is provided. MobileNetV2 is used as the main feature extraction network, and the outputs of four middle bottleneck (bneck) blocks of MobileNetV2 are taken to construct a feature pyramid network; multi-scale feature fusion strengthens the network's attention to detail features, which benefits more accurate segmentation. The segmented image is then aligned with a tongue picture alignment method based on conformal mapping. The main flow of the method is described in detail below.
The invention mainly targets tongue pictures shot in a standard environment, where the shot picture includes parts of the face and lips. To segment the tongue body more conveniently and rapidly, the specific position of the tongue body is detected first, the required tongue part is framed, and the complete tongue body is then segmented. The invention comprises two parts: the first locates and segments the tongue picture, and the second aligns the segmented tongue picture through conformal mapping. The flow of the invention is shown in figure 1, and the concrete steps are as follows:
1. tongue positioning and segmentation
Tongue positioning first detects the position of the tongue body in the image: the tongue body is the foreground, and the rest of the image is background. More accurate prediction boxes are obtained through continuous adjustment, and a mask is generated for each prediction box to realize segmentation of the tongue body. A Mask R-CNN network model is used with its backbone replaced by MobileNetV2, and by combining high-resolution and low-resolution feature maps, the network makes full use of low-level detail information and high-level semantic information to achieve more accurate segmentation. The specific work is as follows:
(1) Extracting basic features: the tongue picture is first input into the network, and a feature map is acquired through a standard convolution. A 1 × 1 convolution expands the channels of the feature map to enrich the features, a 3 × 3 convolution then integrates the features, and a final 1 × 1 convolution compresses them; a skip connection joins the input feature map to the compressed feature map, which enriches the features while reducing the number of parameters. For an extracted high-level feature map, deconvolution raises the spatial resolution to that of the feature map one level below, the high-level and low-level feature maps are fused by addition, and the fused map is convolved to generate a common feature layer; iterating this procedure produces fused feature maps at multiple scales for subsequent use.
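The top-down fusion described above can be sketched in numpy as upsample-then-add. This is a minimal sketch under simplifying assumptions: nearest-neighbour upsampling stands in for the deconvolution, the channel count is already equal across levels, and the 1 × 1 lateral and post-fusion convolutions of a full feature pyramid are omitted.

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a C x H x W feature map
    (a stand-in for the deconvolution used in the patent)."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(features):
    """Fuse feature maps ordered low-level -> high-level.

    Each higher-level map is upsampled to the resolution of the level
    below and merged by element-wise addition, yielding one fused map
    per pyramid level.
    """
    fused = [features[-1]]                    # start from the highest level
    for lower in reversed(features[:-1]):
        merged = lower + upsample2x(fused[0]) # add high-level semantics to low-level detail
        fused.insert(0, merged)
    return fused

# Four pyramid levels, e.g. from four backbone stages (channels fixed at 8 here).
levels = [np.ones((8, 32 // 2**i, 32 // 2**i)) for i in range(4)]
pyramid = top_down_fuse(levels)
```

Each output level now mixes its own detail with everything above it, which is what lets the finest level benefit from high-level semantics.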
(2) Acquiring prediction boxes: the extracted multi-scale effective feature maps are sent, together with the multi-scale effective feature layers, to an ROI Pooling layer, which fixes all region proposals to the same size; the ROIs generated in this process are concatenated (concat) and then sent to the fully connected layers and the mask module. The fully connected layers perform category prediction and bounding-box regression to generate more accurate prediction boxes.
(3) Mask semantic segmentation: while classification and bounding-box regression are performed, the Mask module outputs a binary mask for each region of interest selected by a box. The mask branch first applies a max pooling operation to bring the region of interest to a size of 14 × 14 × 256, then performs two deconvolution operations with 3 × 3 kernels, the final one outputting a 28 × 28 × 80 mask. A sigmoid is applied to each pixel in the region of interest, and the average cross entropy over all pixels of the region is taken as the segmentation loss. This part yields the classification probability of each pixel and realizes semantic segmentation; the pixels of the tongue image are divided into 2 classes, the tongue body region and the background region.
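The per-pixel sigmoid and averaged cross entropy of the mask branch can be sketched directly; this illustrates only the loss computation, not the convolutional mask head itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_loss(logits, target):
    """Average binary cross-entropy over all pixels of one ROI mask.

    `logits` are the raw mask outputs for the region of interest;
    `target` is the 0/1 ground-truth mask of the same shape.
    """
    p = sigmoid(logits)                      # per-pixel foreground probability
    eps = 1e-7                               # numerical safety for log(0)
    bce = -(target * np.log(p + eps) + (1 - target) * np.log(1 - p + eps))
    return bce.mean()                        # average over the region of interest

# A 28 x 28 mask, matching the mask head's output resolution in the patent.
rng = np.random.default_rng(0)
logits = rng.normal(size=(28, 28))
target = (rng.random((28, 28)) > 0.5).astype(float)
loss = mask_loss(logits, target)
```

Thresholding the per-pixel probabilities at 0.5 then gives the two classes, tongue body and background.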
2. Tongue picture alignment
For the segmented tongue body image, a self-adaptive alignment method is used, and standard alignment can be realized for the segmented tongue images with different shapes. Firstly, a Fourier descriptor is utilized to construct a region mapping of a segmented tongue body boundary, then a Cauchy integral and a finite difference method are utilized to combine and expand the mapping to an internal region, and finally the mapping from a segmented oblique tongue image to a standard tongue image is realized through Riemann mapping. Fig. 2 and 3 are schematic views of the shapes of the unit disc and the standard tongue image, respectively, as referred to herein.
In the whole mapping process, two mappings are obtained, the first is the mapping of the segmented tongue image to the unit circle, and the second is the mapping of the segmented tongue image to the standard tongue image. In Riemann mapping, there is a mapping from the original tongue picture to the unit disc, and there is also a mapping from the standard tongue picture to the unit disc, at which time there will be an inverse mapping from the unit disc to the standard tongue picture, and by this inverse mapping, a composite mapping from the split original tongue picture to the standard tongue picture is obtained.
Suppose the tongue picture region is Ω1, the unit disc is D, and the standard tongue picture region is Ω2. First the mappings from D to Ω1 and from D to Ω2 are solved; the mapping from Ω1 to Ω2, which is the mapping finally required, is then obtained. The method mainly comprises the following steps:
the first step is as follows: first, to calculate the corresponding tongue boundaries, the boundary regions are calculated using the fourier descriptors by equation (1):
Figure BDA0003494096520000071
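A small numpy sketch of boundary Fourier descriptors follows. The FFT-based form here is an assumption for illustration (the source renders formula (1) only as an image); boundary points are encoded as complex numbers ψ_k = u_k + j·v_k.

```python
import numpy as np

def fourier_descriptors(boundary):
    """Fourier coefficients c_n of a closed boundary.

    `boundary` is a length-K complex array of boundary samples
    psi_k = u_k + j*v_k traversed once around the contour.
    """
    return np.fft.fft(boundary) / len(boundary)

def reconstruct(coeffs):
    """Invert the descriptors back to the boundary points."""
    return np.fft.ifft(coeffs) * len(coeffs)

# A toy closed contour (an ellipse) sampled at K points.
K = 64
t = 2 * np.pi * np.arange(K) / K
psi = 2.0 * np.cos(t) + 1j * np.sin(t)
coeffs = fourier_descriptors(psi)
psi_back = reconstruct(coeffs)
```

Keeping only the low-order coefficients smooths the contour, which is one reason Fourier descriptors are convenient for boundary correspondence.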
the second step is that: on the basis of obtaining the boundary correspondence, further calculating the mapping of the internal area, wherein the Cauchy integral formula is shown as a formula (2), and when gamma is a unit circle, the Cauchy integral formula is changed into a formula (3):
Figure BDA0003494096520000081
Figure BDA0003494096520000082
wherein z and z0Points on the boundary and interior, f (z), respectively0) And F (z)0) Maps representing boundaries and interior regions, respectively:
Figure BDA0003494096520000083
wherein z isk=e2jπ(k/K)Representing the form of a point on a unit circle, #kRepresenting the corresponding tongue border.
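The discrete Cauchy evaluation can be checked numerically. The sketch below assumes the trapezoidal-quadrature form of the unit-circle Cauchy integral (the source shows formulas (2)-(4) only as images); the sanity check uses the fact that an identity boundary map must reproduce the interior point.

```python
import numpy as np

def interior_map(psi, z0):
    """Discrete Cauchy-integral evaluation of the interior map.

    `psi` holds the boundary correspondence psi_k at the unit-circle
    nodes z_k = exp(2j*pi*k/K); `z0` is an interior point (|z0| < 1).
    """
    K = len(psi)
    zk = np.exp(2j * np.pi * np.arange(K) / K)   # quadrature nodes on the circle
    return np.mean(psi * zk / (zk - z0))         # (1/K) * sum psi_k z_k / (z_k - z0)

# Sanity check: if the boundary map is the identity (psi_k = z_k),
# the Cauchy integral reproduces the interior point itself.
K = 256
zk = np.exp(2j * np.pi * np.arange(K) / K)
z0 = 0.3 + 0.2j
F = interior_map(zk, z0)
```

For smooth boundary data this quadrature converges very quickly in K, so a few hundred boundary samples suffice in practice.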
The third step: constructing a conformal mapping, wherein the function of the conformal mapping is as follows:
F=U+j·V (5)
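The patent states that F = U + j·V is built by iteratively updating the real and imaginary parts with finite differences. As a minimal sketch of such an iteration, the code below performs a Jacobi relaxation of the Laplace equation, filling in interior values of one component from its boundary values. This is an illustrative harmonic extension only, not the patent's exact scheme: a true conformal map would additionally couple U and V through the Cauchy-Riemann equations.

```python
import numpy as np

def harmonic_extend(grid, mask, iters=2000):
    """Jacobi iteration for the Laplace equation on a square grid.

    `grid` holds fixed boundary values where `mask` is True; interior
    cells (mask False) are repeatedly replaced by the average of their
    four neighbours, converging to the discrete harmonic extension.
    Applied to U and V separately, this fills in F = U + j*V.
    """
    u = grid.copy()
    for _ in range(iters):
        avg = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                      np.roll(u, 1, 1) + np.roll(u, -1, 1))
        u = np.where(mask, grid, avg)   # keep boundary values fixed
    return u

# Boundary condition U(x, y) = x on the edge of the grid; the harmonic
# extension of a linear function is the function itself.
n = 17
x = np.linspace(0.0, 1.0, n)
U = np.tile(x, (n, 1))
mask = np.zeros((n, n), dtype=bool)
mask[0, :] = mask[-1, :] = mask[:, 0] = mask[:, -1] = True
grid = np.where(mask, U, 0.0)
U_ext = harmonic_extend(grid, mask)
```

Jacobi relaxation is slow but transparent; in practice a direct sparse solve or multigrid would replace the loop.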
the fourth step: calculate the mapping of the original tongue image to the unit circle:
suppose the original tongue picture is represented as I0[u,v,w]Original image mapping to unitsThe image on the circle is denoted J [ x, y, w ]]It can be expressed by the following formula:
J[x,y,w]=I0[U(x,y),V(x,y),w] (6)
wherein (u, v) ∈ omega1(original tongue picture region), (x, y) e D (unit circle), w 1,2,3 representing three channels RGB.
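Formula (6) is a pull-back resampling: every target pixel looks up a source pixel through (U, V). A dependency-free sketch with nearest-neighbour sampling (an assumption; the patent does not name the interpolation scheme):

```python
import numpy as np

def warp(image, U, V):
    """Pull-back resampling J[x, y, w] = I0[U(x, y), V(x, y), w].

    `U` and `V` give, for every target pixel (x, y), the source row
    and column coordinates in the original image; nearest-neighbour
    rounding keeps the sketch simple.
    """
    h, w = image.shape[:2]
    ui = np.clip(np.rint(U).astype(int), 0, h - 1)   # clamp to valid rows
    vi = np.clip(np.rint(V).astype(int), 0, w - 1)   # clamp to valid columns
    return image[ui, vi]

# With the identity map the warp returns the original image unchanged.
img = np.arange(4 * 5 * 3).reshape(4, 5, 3)
U, V = np.meshgrid(np.arange(4), np.arange(5), indexing="ij")
out = warp(img, U, V)
```

Replacing the rounding with bilinear interpolation would give smoother aligned images at little extra cost.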
The fifth step: obtaining the mapping from Ω1 to Ω2. Let I0[u, v, w] denote the original tongue image and I1[ũ, ṽ, w] denote the standard tongue image after alignment. For any (ũ, ṽ) ∈ Ω2, the corresponding pixel position on the unit circle is denoted (x̃, ỹ) ∈ D. The mapping from the unit circle to the standard tongue is thus:

I1[ũ, ṽ, w] = J[x̃, ỹ, w] (7)

The mapping from the original image to the unit disc is computed by formula (6); substituting formula (6) into formula (7) gives:

I1[ũ, ṽ, w] = I0[U(x̃, ỹ), V(x̃, ỹ), w] (8)

Thus the mapping from the original image I0 to the standard tongue picture I1 is obtained from formula (8).
The specific implementation of this example is as follows:
Step 1: acquiring the required tongue pictures with tongue picture acquisition equipment, performing data enhancement processing on the pictures, and expanding the data set through rotation and horizontal flipping;
Step 2: manually marking the tongue bodies in all the picture data using the Labelme software; the tongue annotation given in this example is shown in fig. 4;
Step 3: preprocessing the marked pictures, and sending them into the constructed feature pyramid network to extract features and obtain effective feature layers;
the feature pyramid network is constructed from the outputs of the four middle bottleneck (bneck) blocks of MobileNetV2;
Step 4: sending the effective feature layers extracted in Step 3 into an RPN to obtain suggestion boxes;
Step 5: performing a maximum pooling operation on all the suggestion boxes obtained in Step 4, and fixing all suggestion boxes to 7 × 7 features after the pooling;
Step 6: connecting two fully connected network layers after the max-pooled suggestion boxes, judging whether each suggestion box contains an object, and then adjusting the suggestion boxes to obtain prediction boxes;
Step 7: taking the prediction boxes obtained in Step 6 as the region-cropping part of a mask model, and after cropping, classifying the pixel points with the mask model to obtain the semantic segmentation result;
Step 8: calculating the boundary region and Fourier coefficients of the segmented tongue picture;
Step 9: calculating the correspondence between the unit disc and the tongue picture boundary;
Step 10: after the boundary values are determined, calculating the mapping of the interior region using the Cauchy integral formula;
Step 11: constructing the conformal mapping, which updates the real and imaginary parts of the function through iteration;
Step 12: outputting the aligned image; the algorithm is finished.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A tongue picture segmentation and alignment method based on a feature pyramid network is characterized by comprising the following steps:
step 1: acquiring a required tongue picture by using tongue picture acquisition equipment, and performing data enhancement processing on the picture;
step 2: manually marking the tongue bodies in all the picture data;
step 3: preprocessing the marked pictures, and sending them into the constructed feature pyramid network to extract features and obtain effective feature layers;
step 4: sending the effective feature layers extracted in step 3 into an RPN to obtain suggestion boxes;
step 5: performing a maximum pooling operation on all the suggestion boxes obtained in step 4;
step 6: connecting two fully connected network layers after the max-pooled suggestion boxes, judging whether each suggestion box contains an object, and then adjusting the suggestion boxes to obtain prediction boxes;
step 7: taking the prediction boxes obtained in step 6 as the region-cropping part of a mask model, and after cropping, classifying the pixel points with the mask model to obtain the semantic segmentation result;
step 8: calculating the boundary region and Fourier coefficients of the segmented tongue picture;
step 9: calculating the correspondence between the unit disc and the tongue picture boundary;
step 10: after the boundary values are determined, calculating the mapping of the interior region using the Cauchy integral formula;
step 11: constructing the conformal mapping, which updates the real and imaginary parts of the function through iteration;
step 12: outputting the aligned image.
2. The tongue picture segmentation and alignment method based on the feature pyramid network as claimed in claim 1, wherein in step 1 the data set is expanded through rotation and horizontal flipping during the data enhancement processing of the pictures.
3. The method as claimed in claim 1, wherein step 2 uses the Labelme software for manually marking the tongue bodies in all the picture data.
4. The method for tongue picture segmentation and alignment based on a feature pyramid network as claimed in claim 1, wherein in step 5 all the suggestion boxes are fixed to 7 × 7 features after the maximum pooling operation.
5. The method as claimed in claim 1, wherein the feature pyramid network is constructed from the outputs of the four middle bottleneck (bneck) blocks of MobileNetV2.
CN202210106423.6A 2022-01-28 2022-01-28 Tongue image segmentation and alignment method based on feature pyramid network Active CN114627136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210106423.6A CN114627136B (en) 2022-01-28 2022-01-28 Tongue image segmentation and alignment method based on feature pyramid network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210106423.6A CN114627136B (en) 2022-01-28 2022-01-28 Tongue image segmentation and alignment method based on feature pyramid network

Publications (2)

Publication Number Publication Date
CN114627136A true CN114627136A (en) 2022-06-14
CN114627136B CN114627136B (en) 2024-02-27

Family

ID=81899071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210106423.6A Active CN114627136B (en) 2022-01-28 2022-01-28 Tongue image segmentation and alignment method based on feature pyramid network

Country Status (1)

Country Link
CN (1) CN114627136B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908464A (en) * 2023-01-09 2023-04-04 智慧眼科技股份有限公司 Tongue image segmentation method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107316307A (en) * 2017-06-27 2017-11-03 北京工业大学 A kind of Chinese medicine tongue image automatic segmentation method based on depth convolutional neural networks
CN107977671A (en) * 2017-10-27 2018-05-01 浙江工业大学 A kind of tongue picture sorting technique based on multitask convolutional neural networks
CN108109160A (en) * 2017-11-16 2018-06-01 浙江工业大学 It is a kind of that interactive GrabCut tongue bodies dividing method is exempted from based on deep learning
WO2020029915A1 (en) * 2018-08-06 2020-02-13 深圳市前海安测信息技术有限公司 Artificial intelligence-based device and method for tongue image splitting in traditional chinese medicine, and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王丽冉; 汤一平; 陈朋; 何霞; 袁公萍: "面向舌体分割的两阶段卷积神经网络设计" [Two-stage convolutional neural network design for tongue body segmentation], 中国图象图形学报 [Journal of Image and Graphics], no. 10, 16 October 2018 (2018-10-16) *


Also Published As

Publication number Publication date
CN114627136B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
CN110599448B (en) Migratory learning lung lesion tissue detection system based on MaskScoring R-CNN network
Maqsood et al. Multi-modal medical image fusion based on two-scale image decomposition and sparse representation
CN107977671B (en) Tongue picture classification method based on multitask convolutional neural network
CN110889852B (en) Liver segmentation method based on residual error-attention deep neural network
CN107610087B (en) Tongue coating automatic segmentation method based on deep learning
Chan et al. Texture-map-based branch-collaborative network for oral cancer detection
CN111476292A (en) Small sample element learning training method for medical image classification processing artificial intelligence
WO2023137914A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN111951288B (en) Skin cancer lesion segmentation method based on deep learning
WO2021136368A1 (en) Method and apparatus for automatically detecting pectoralis major region in molybdenum target image
CN110334566B (en) OCT (optical coherence tomography) internal and external fingerprint extraction method based on three-dimensional full-convolution neural network
CN111488912B (en) Laryngeal disease diagnosis system based on deep learning neural network
CN113379764B (en) Pathological image segmentation method based on domain antagonism self-supervision learning
CN112132166A (en) Intelligent analysis method, system and device for digital cytopathology image
CN112348059A (en) Deep learning-based method and system for classifying multiple dyeing pathological images
Liu et al. Extracting lungs from CT images via deep convolutional neural network based segmentation and two-pass contour refinement
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
Celebi et al. Guest editorial skin lesion image analysis for melanoma detection
CN111179193A (en) Dermatoscope image enhancement and classification method based on DCNNs and GANs
CN114627136B (en) Tongue image segmentation and alignment method based on feature pyramid network
CN107590806B (en) Detection method and system based on brain medical imaging
CN111666813B (en) Subcutaneous sweat gland extraction method of three-dimensional convolutional neural network based on non-local information
CN112927215A (en) Automatic analysis method for digestive tract biopsy pathological section
CN110910497B (en) Method and system for realizing augmented reality map
CN117036288A (en) Tumor subtype diagnosis method for full-slice pathological image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant