CN116152500A - Full-automatic tooth CBCT image segmentation method based on deep learning - Google Patents

Full-automatic tooth CBCT image segmentation method based on deep learning

Info

Publication number
CN116152500A
Authority
CN
China
Prior art keywords: image, edge, tooth, stage, segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310221468.2A
Other languages
Chinese (zh)
Inventor
秦红星
张子南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202310221468.2A
Publication of CN116152500A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V10/40: Extraction of image or video features
    • G06V10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/422: Global feature extraction for representing the structure of the pattern or shape of an object
    • G06V10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V10/70: Recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Classification, e.g. of video objects, using pattern recognition or machine learning
    • G06V10/766: Regression, e.g. by projecting features on hyperplanes
    • G06V10/77: Processing image or video features in feature spaces; data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion of extracted features
    • G06V10/82: Recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a full-automatic tooth CBCT image segmentation method based on deep learning, and belongs to the field of computer vision. The method is divided into an edge image acquisition stage, a multi-frame input stage and an image segmentation stage. In the edge image acquisition stage, an edge image is first extracted from the tooth CBCT image and then added to the original image. In the multi-frame input stage, the edge-enhanced images are stacked along the channel dimension, so that single-channel images are combined into three-channel images. In the image segmentation stage, features are first extracted from the images produced by the multi-frame input stage through a backbone network, and the extracted feature maps are then sent to a candidate region generation module, completing the segmentation of the teeth in the tooth CBCT image. The invention can segment variable tooth shapes more accurately, extract tooth roots precisely, and effectively reduce the error of segmenting bones in the oral cavity as teeth.

Description

Full-automatic tooth CBCT image segmentation method based on deep learning
Technical Field
The invention belongs to the field of computer vision, and relates to a full-automatic tooth CBCT image segmentation method based on deep learning.
Background
Oral treatment mainly includes orthodontics, dental implantation and tooth extraction. Cone Beam Computed Tomography (CBCT) is a diagnostic imaging technique widely used in the study of dental problems. With accurate tooth segmentation, a physician can make more precise treatment decisions and plans, and accurate segmentation of the teeth is crucial for reconstructing a 3D tooth model. Currently this is done manually by a professional operator; however, manual marking is time consuming and the accuracy of the result depends on the operator. Automatic and accurate segmentation of individual teeth from CBCT images is therefore critical for building effective computer-aided diagnostic systems for orthodontics, dental implant simulation and other dental treatments. Accurately identifying and segmenting teeth from CBCT images acquired under widely varying conditions remains difficult, mainly for the following reasons:
(1) Low image contrast: the gray levels of different tissues in a dental CBCT image are similar. For example, the gray level of a tooth is close to that of the alveolar bone and jawbone, which blurs the tooth boundaries in the image.
(2) Gray-level non-uniformity: a tooth consists of three parts, namely enamel, dentin and pulp. These three parts image with markedly different gray levels, so the gray level varies within an individual tooth, which increases the segmentation difficulty.
(3) Complexity of individual tissues and organs: the topology of individual teeth varies greatly, and a single complete tooth seen in the crown region may split into 2-5 branches in the root region.
(4) Shared boundaries in the image: adjacent teeth share boundaries, and a shared boundary may be segmented incorrectly, so that two teeth are identified as the same tooth.
(5) Low resolution: although CBCT images are large, each tooth is small, approximately 20-40 pixels in diameter.
To address the above problems, various methods have been proposed; they fall into two categories: conventional methods, which require hand-crafted features, and deep learning methods, which typically require a large number of samples. Conventional methods are generally semi-automatic and require manual interaction, which cannot meet the needs of patients with different conditions. Most state-of-the-art methods adopt a Fully Convolutional Network (FCN) as the main component for hierarchical feature learning and then integrate local-to-global features into tooth contours in the form of instance segmentation or semantic segmentation. Instance segmentation first detects or locates the positions of the different teeth and then segments each target; such an approach may assign adjacent teeth to the same class because they look similar in CBCT images. Semantic segmentation methods perform dense pixel- or voxel-wise predictions over the whole tooth region; however, these networks cannot accurately label individual teeth into multiple categories because of their limited receptive field relative to the large input images. Furthermore, owing to the similar gray levels, these methods may segment bone as teeth. The performance of current deep learning methods on the specific task of tooth segmentation is limited, mainly because they simply reuse generic network architectures. Therefore, in the field of deep-learning-based CBCT image segmentation, a method capable of accurately segmenting tooth CBCT images is still needed.
Disclosure of Invention
Accordingly, the present invention aims to provide a full-automatic tooth CBCT image segmentation method based on deep learning, which can accurately and fully automatically segment various types of teeth in a CBCT image, reduce bone segmentation errors in the image, and improve segmentation accuracy.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a full-automatic tooth CBCT image segmentation method based on deep learning is divided into an edge image acquisition stage, a multi-frame input stage and an image segmentation stage.
In the edge image acquisition stage, an edge image is first extracted from the tooth CBCT image, and the obtained edge image is added to the original image to enhance the edge features of the image.
In the multi-frame input stage, the edge-enhanced images produced in the edge image acquisition stage are stacked along the channel dimension, so that single-channel images are combined into three-channel images; this yields a more accurate tooth root segmentation result.
In the image segmentation stage, features are first extracted from the images produced in the multi-frame input stage through a backbone network, and the extracted feature maps are then sent to a candidate region generation module. The candidate region generation module generates nine candidate boxes for each pixel of the feature map according to three aspect ratios and three sizes; each candidate box is classified and its coordinates regressed by a classifier and a regressor. The resulting feature maps are sent to a classification branch, a bounding box branch and a mask branch to produce the corresponding outputs, and finally a mask scoring branch evaluates the mask quality to obtain a more accurate mask, completing the segmentation of the teeth in the tooth CBCT image.
Further, the specific implementation mode of the edge image acquisition stage is as follows:
Firstly, an RGB image is converted into a grayscale image, and the grayscale image is fed into an edge extraction network to obtain an edge image of the teeth. The edge extraction network adopts the U-shaped structure of U-Net, which consists of an encoder, a decoder and skip connections. The encoder part applies convolution and pooling to the image, and four pooling operations yield four features of different sizes. In the skip connection part, the feature maps of the different encoder layers are fused with the feature maps of the corresponding decoder layers to obtain the edge image. The resulting edge image is then added to the grayscale image to enhance the edge features of the image.
Further, each encoder block includes two 3×3 convolutional layers with a ReLU layer, followed by one 2×2 max-pooling layer; each decoder block includes one up-sampling convolution layer and two 3×3 convolutional layers.
The loss function adopted by the edge extraction network is a cross entropy loss function:
CE(p) = -y·log(p) - (1-y)·log(1-p)
where y ∈ {0,1} is the true pixel label and p ∈ (0,1) is the pixel label probability predicted by the network.
Further, in the image segmentation stage, the backbone network adopts the combination of ResNet50 and the feature pyramid network, and a convolution attention mechanism module is added into the backbone network to improve the effectiveness of the backbone network.
Further, the feature pyramid network includes three parts: a bottom-up branch, a top-down branch and lateral connections. The bottom-up branch is a feature-size reduction process in which the features of different sizes are the outputs of the respective residual blocks; the top-down branch is an expansion process performed by nearest-neighbor up-sampling; and the lateral connections align the channel number of the bottom-up features with the output channels of the top-down pathway.
The invention has the beneficial effects that: the method uses the obtained edge map to enhance the edge features of the original image, which improves the model's recognition of edges so that variable tooth shapes can be segmented more accurately. Meanwhile, stacking adjacent slices allows the information of neighboring frames in the image sequence to be exploited, so the tooth root can be extracted more accurately. Finally, the introduced attention mechanism module amplifies the weight of the tooth regions of the image and suppresses the weight of other irrelevant regions, which effectively reduces the error of segmenting bones in the oral cavity as teeth.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and preferred embodiments, in which:
FIG. 1 is a structural frame diagram of the present invention;
FIG. 2 is a graph comparing the segmentation results of the present invention with the CNN-based method;
FIG. 3 is a graph comparing the segmentation results of the present invention with the RCNN-based method;
FIG. 4 is a graph comparing the segmentation results of the present invention with the Mask Scoring RCNN method, wherein (a) and (b) are the Mask Scoring RCNN segmentation results, and (c) and (d) are the segmentation results of the present invention.
Detailed Description
The following describes embodiments of the present invention with reference to specific examples; other advantages and effects of the present invention will become apparent to those skilled in the art from this disclosure. The invention may also be practiced or applied in other, different embodiments, and the details of this description may be modified or varied in various ways without departing from the spirit of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention, and the following embodiments and the features in the embodiments may be combined with each other provided there is no conflict.
The drawings are for illustrative purposes only; they are schematic rather than physical and are not intended to limit the invention. For the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product. It will be appreciated by those skilled in the art that certain well-known structures in the drawings, and their descriptions, may be omitted.
The same or similar reference numbers in the drawings of the embodiments of the invention correspond to the same or similar components. In the description of the present invention, terms such as "upper", "lower", "left", "right", "front" and "rear" that indicate an orientation or positional relationship are based on the orientation or positional relationship shown in the drawings; they are used only for convenience of description and simplification, and do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore merely exemplary, should not be construed as limiting the present invention, and their specific meaning can be understood by those of ordinary skill in the art according to the specific circumstances.
The invention provides a full-automatic tooth CBCT image segmentation method based on deep learning, which comprises three stages, namely an edge image acquisition stage, a multi-frame input stage and an image segmentation stage. The embodiments of each stage are described in detail below.
1. Edge image acquisition stage
The edge image of the teeth in the oral CBCT image is first acquired, and the obtained edge image is then added to the original image to enhance the edge features of the image.
Firstly, converting an RGB image in a data set into a gray image, wherein the formula for converting the gray image is as follows:
I(x,y) = 0.3*R(x,y) + 0.59*G(x,y) + 0.11*B(x,y)
where I(x,y) is the pixel value of the grayscale image, and R(x,y), G(x,y) and B(x,y) are the red, green and blue channel values of the original image. Converting the RGB image to a grayscale image removes redundant information, speeds up model training and facilitates subsequent processing.
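By way of illustration only (this code is not part of the patent), a minimal Python sketch of the above per-pixel conversion is given below; the array layout (H×W×3, values already loaded as floats) and the function name rgb_to_gray are assumptions.

import numpy as np

def rgb_to_gray(rgb):
    # Weighted sum of the R, G and B channels with the 0.3/0.59/0.11 weights from the formula above.
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.3 * r + 0.59 * g + 0.11 * b

gray = rgb_to_gray(np.random.rand(512, 512, 3))   # example: a random 512 x 512 RGB slice
print(gray.shape)                                 # (512, 512)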
The resulting grayscale image is then fed into an edge extraction network to obtain an edge image of the teeth. The edge image acquisition network adopts the U-shaped structure of U-Net, which can reach a useful level of accuracy even when trained with a small number of images. The edge acquisition network is a fully convolutional neural network consisting of three parts: an encoder, a decoder and skip connections. The encoder forms the left half of the network; each encoder block consists of two 3×3 convolutional layers with a ReLU layer plus a 2×2 max-pooling layer. The encoder applies convolution and pooling to the image, and four pooling operations yield four features of different sizes. The decoder forms the right half of the network; each decoder block consists of one up-sampling convolution layer and two 3×3 convolutional layers. In the skip connection part, the feature maps of the different encoder layers are fused with the feature maps of the corresponding decoder layers. Because the image is up-sampled by linear interpolation during decoding, part of the image information is lost; fusing the original encoder features with the decoded feature maps through the skip connections restores this lost information.
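For readers who prefer code, a minimal PyTorch sketch of such a U-shaped edge extraction network is given below; it is an illustration under assumptions rather than the patented implementation: the channel widths (64 to 1024), the bilinear-upsampling-plus-convolution realization of the up-sampling convolution layer, the sigmoid output head, and the class name EdgeUNet are all choices the text does not fix.

import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by a ReLU, as in every encoder/decoder block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class EdgeUNet(nn.Module):
    # U-shaped edge extraction network: encoder, decoder and skip connections.
    def __init__(self, in_ch=1, base=64):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8, base * 16]
        self.enc = nn.ModuleList(
            [double_conv(in_ch, chs[0])] +
            [double_conv(chs[i], chs[i + 1]) for i in range(4)])
        self.pool = nn.MaxPool2d(2)                      # 2x2 max pooling, applied four times
        self.up = nn.ModuleList([
            nn.Sequential(nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                          nn.Conv2d(chs[i + 1], chs[i], 3, padding=1))
            for i in range(3, -1, -1)])
        self.dec = nn.ModuleList([double_conv(chs[i] * 2, chs[i]) for i in range(3, -1, -1)])
        self.head = nn.Conv2d(chs[0], 1, 1)              # per-pixel edge probability

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < 4:                                    # keep encoder features for the skip connections
                skips.append(x)
                x = self.pool(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))     # fuse decoder and encoder features
        return torch.sigmoid(self.head(x))               # edge map with values in (0, 1)

edge_map = EdgeUNet()(torch.rand(1, 1, 256, 256))
print(edge_map.shape)                                    # torch.Size([1, 1, 256, 256])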
Edge extraction is effectively a binary classification task: the tooth edges in the image are the target and are set to white, while the rest of the background is set to black. The most commonly used loss function for binary classification is the cross entropy (CE) loss, defined as follows:
CE(p) = -y·log(p) - (1-y)·log(1-p)
where y ∈ {0,1} is the true pixel label and p ∈ (0,1) is the pixel label probability predicted by the network. The resulting edge image is added to the grayscale image to enhance the edge features of the image.
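In PyTorch this loss is available as nn.BCELoss; the short sketch below is illustrative only and uses random stand-in tensors for the sigmoid edge map and the binary edge label.

import torch
import torch.nn as nn

criterion = nn.BCELoss()                             # CE(p) = -y*log(p) - (1-y)*log(1-p), averaged over pixels
pred = torch.rand(1, 1, 64, 64)                      # stand-in for the network's sigmoid edge map
target = (torch.rand(1, 1, 64, 64) > 0.5).float()    # stand-in for the binary edge label
loss = criterion(pred, target)
print(loss.item())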
2. Multi-frame input stage
The edge extraction network enhances the edge features of the teeth in the CBCT image, which aids accurate segmentation of the tooth edges and improves the smoothness of the segmentation. In the input stage, the invention stacks the single-channel grayscale images along the channel dimension: the current slice and its two adjacent slices are superimposed to form a three-channel image that serves as the network input. A CBCT scan is a continuous sequence of image slices, and adjacent slices carry strongly correlated information; the root information in the current slice is sometimes not clear enough while the target is clear in the adjacent slices, so this stacked input yields a more accurate root segmentation result.
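A minimal sketch of this stacking step is shown below; the (num_slices, H, W) volume layout, the function name stack_adjacent_slices and the clamping of the first and last slices to their single neighbor are assumptions, since the text does not specify how boundary slices are handled.

import numpy as np

def stack_adjacent_slices(volume, k):
    # Build a three-channel input from slice k and its two neighboring slices.
    prev_idx = max(k - 1, 0)
    next_idx = min(k + 1, volume.shape[0] - 1)
    return np.stack([volume[prev_idx], volume[k], volume[next_idx]], axis=0)   # (3, H, W)

frames = stack_adjacent_slices(np.random.rand(120, 512, 512), k=60)
print(frames.shape)                                  # (3, 512, 512)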
3. Image segmentation stage
In this stage the image first undergoes feature extraction through a backbone network. The Mask Scoring RCNN backbone typically uses ResNet101, i.e. a network with 101 layers; however, too many layers reduce the efficiency of the network. Compared with natural images, CBCT images have simpler background information and fewer target types, and a network with fewer layers can achieve excellent results in their place, so the present invention uses ResNet50 as the backbone for feature extraction. In addition, because the topology of teeth is complex and their shapes and sizes differ (at the crown the teeth are prominent in the image, while at the root they split into multiple branches and the targets become small), a single convolutional neural network cannot extract all the image features well. The backbone of the present invention therefore combines a feature pyramid network (FPN) with ResNet50. The FPN addresses the multi-scale problem of extracting target objects from images by building a feature pyramid from a single-scale input using a top-down pathway with lateral connections. The FPN consists of three parts: a bottom-up branch, a top-down branch and lateral connections. The bottom-up branch is a feature-size reduction process in which the features of different sizes are the outputs of the respective residual blocks; the top-down branch is an expansion process performed by nearest-neighbor up-sampling; and the lateral connections align the channel number of the bottom-up features with the output channels of the top-down pathway. This structure needs fewer parameters and has stronger robustness and adaptability, and the FPN handles small targets better, so the combination of ResNet50 and FPN can better extract the feature information of tooth roots in CBCT images.
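As an illustration of this backbone choice, the sketch below builds a ResNet50+FPN feature extractor with torchvision's resnet_fpn_backbone helper; the exact keyword names depend on the torchvision version (older releases use pretrained=False instead of weights=None), and the CBAM modules described next would still have to be inserted by hand between the ResNet stages and the FPN.

import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone(backbone_name='resnet50', weights=None)
features = backbone(torch.rand(1, 3, 512, 512))      # one three-channel slice from the multi-frame stage
for name, f in features.items():
    print(name, tuple(f.shape))                      # multi-scale feature maps, 256 channels each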
To further improve the effectiveness of the backbone network, the present invention introduces an attention mechanism by adding a convolutional block attention module (CBAM) to the backbone, i.e. a CBAM module is inserted between the ResNet50 and the FPN. The CBAM captures high-level semantic information in the image and adjusts the channel and spatial weights of the feature map, increasing the weight of important features and reducing the weight of unimportant parts, thereby strengthening feature extraction and improving the segmentation accuracy of the model.
CBAM is an attention module for convolutional networks that combines a channel attention module and a spatial attention module, achieving better results than SENet, which uses channel attention only. Each channel of a feature map can be regarded as a feature detector, so channel attention focuses on which features are meaningful. The channel attention module summarizes the global features with two methods, global average pooling and global max pooling, in order to exploit different information. The input feature map is pooled by max pooling and average pooling over the image width and height, and the two pooled results are fed to a fully connected layer (MLP) whose parameters are shared. The module then adds the two MLP outputs and activates the weight coefficients with a sigmoid function. Finally, the original feature map is multiplied by the weight coefficients to obtain a new feature map.
Expressed by the formula:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
where F represents the input feature map, σ represents the sigmoid function, MLP represents the fully connected layer, avgPool represents the average pooling, and MaxPool represents the maximum pooling.
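A compact PyTorch sketch of this channel attention branch is given below; the reduction ratio of 16 in the shared MLP is an assumption (the text does not state it), and ChannelAttention is an illustrative name rather than the patent's module.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Channel attention of CBAM: a shared MLP applied to globally avg- and max-pooled features.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))             # global average pooling -> shared MLP
        mx = self.mlp(x.amax(dim=(2, 3)))              # global max pooling -> shared MLP
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)   # M_c(F)
        return x * w                                   # reweight the input feature map

out = ChannelAttention(64)(torch.rand(2, 64, 32, 32))
print(out.shape)                                       # torch.Size([2, 64, 32, 32])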
After the channel attention module, CBAM introduces a spatial attention module to focus on where the meaningful features are located. For a given feature map, the spatial attention module first performs max pooling and average pooling along the channel dimension, producing two single-channel feature maps. These two maps are then concatenated and fed into a convolution layer with a 7×7 kernel to obtain the spatial weight coefficients, and the weight coefficients are multiplied by the feature map to obtain a new feature map.
Expressed by the formula:
M_s(F) = σ(f([AvgPool(F); MaxPool(F)]))
where f represents the 7×7 convolution layer and [ · ; · ] denotes channel-wise concatenation.
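The corresponding spatial attention branch can be sketched as follows; again this is only an illustrative reading of the formula, with SpatialAttention as a hypothetical name.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Spatial attention of CBAM: a 7x7 convolution over channel-wise pooled maps.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)              # average pooling along the channel dimension
        mx = x.amax(dim=1, keepdim=True)               # max pooling along the channel dimension
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))   # M_s(F)
        return x * w                                   # reweight the input feature map

out = SpatialAttention()(torch.rand(2, 64, 32, 32))
print(out.shape)                                       # torch.Size([2, 64, 32, 32])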
In a tooth CBCT image the positions of the teeth are generally fixed. By introducing the attention module, the model can increase the weight of the regions where teeth are located and reduce the weight of other irrelevant regions, which reduces the error of segmenting bones in the oral cavity as teeth and improves the robustness of the model.
The feature maps extracted by the backbone network are sent to the candidate region generation module. The candidate region generation module generates nine candidate boxes for each pixel of the feature map according to three aspect ratios and three sizes; each candidate box is classified and its coordinates regressed by a classifier and a regressor. The resulting feature maps are sent to a classification branch, a bounding box branch and a mask branch to produce the corresponding outputs, and finally a mask scoring branch evaluates the mask quality to obtain a more accurate mask, completing the segmentation of the teeth in the tooth CBCT image.
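For orientation only, the sketch below wires a ResNet50+FPN backbone and a nine-anchor-per-location region proposal network into torchvision's standard MaskRCNN; the anchor pixel sizes are illustrative rather than taken from the patent, and torchvision's model provides only the classification, bounding box and mask branches, so the mask scoring (MaskIoU) branch of Mask Scoring RCNN would have to be added separately.

import torch
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone(backbone_name='resnet50', weights=None)

# Three sizes x three aspect ratios = nine candidate boxes per feature-map location;
# the FPN backbone exposes five feature maps, so one tuple is given per pyramid level.
anchor_generator = AnchorGenerator(
    sizes=((16, 32, 64),) * 5,
    aspect_ratios=((0.5, 1.0, 2.0),) * 5,
)

model = MaskRCNN(backbone, num_classes=2,              # background + tooth
                 rpn_anchor_generator=anchor_generator)
model.eval()
with torch.no_grad():
    pred = model([torch.rand(3, 512, 512)])            # one three-channel slice
print(pred[0].keys())                                  # boxes, labels, scores, masks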
In this embodiment, the method provided by the invention is compared with other deep learning segmentation methods through a test data set, and the results are shown in the following table:
table 1 comparison results of each deep learning segmentation method
(Table 1 is reproduced as an image in the original publication; its numerical values are not available here.)
The higher the Dice similarity coefficient and the Jaccard similarity coefficient, the closer the result is to the correct segmentation; the table shows that the proposed method achieves the best values on both indices, i.e. the most accurate segmentation. Fig. 2 compares the results of the present invention with a CNN-based method, showing the segmentation results for three cases: non-uniform tooth gray levels, tooth gray levels similar to the tooth socket, and the tooth root. Fig. 3 compares the results of the present invention with an RCNN-based method, showing the segmentation results when the tooth topology changes greatly. Fig. 4 compares the present invention with the Mask Scoring RCNN-based method on slices containing bone; it can be seen that Mask Scoring RCNN segments bone parts in the mouth as roots, while the present invention detects the roots accurately and reduces the error of segmenting bone.
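For reference, the two indices can be computed for binary masks as in the short sketch below; these are the standard definitions, and the per-tooth versus per-slice evaluation protocol used for Table 1 is not specified in the text.

import numpy as np

def dice_and_jaccard(pred, gt):
    # Dice and Jaccard similarity coefficients for two binary masks.
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)
    jaccard = inter / (union + 1e-8)
    return dice, jaccard

d, j = dice_and_jaccard(np.ones((64, 64), dtype=np.uint8), np.ones((64, 64), dtype=np.uint8))
print(d, j)                                            # both close to 1.0 for identical masks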
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (5)

1. A full-automatic tooth CBCT image segmentation method based on deep learning is characterized in that: the method comprises an edge image acquisition stage, a multi-frame input stage and an image segmentation stage;
in the edge image acquisition stage, firstly acquiring an edge image from the tooth CBCT image, and adding the obtained edge image to the original image to enhance the edge features of the image;
in the multi-frame input stage, stacking the edge-enhanced images from the edge image acquisition stage along the channel dimension, so that single-channel images are combined into three-channel images to obtain an accurate root segmentation result;
in the image segmentation stage, firstly extracting features from the images processed in the multi-frame input stage through a backbone network, and then sending the extracted feature maps to a candidate region generation module; the candidate region generation module generates nine candidate boxes for each pixel of the feature map according to three aspect ratios and three sizes; each candidate box is classified and its coordinates regressed by a classifier and a regressor; the resulting feature maps are sent to a classification branch, a bounding box branch and a mask branch to produce the corresponding outputs, and finally a mask scoring branch evaluates the mask quality to obtain an accurate mask, completing the segmentation of the teeth in the tooth CBCT image.
2. The method for segmenting a CBCT image of a tooth according to claim 1, wherein: in the edge image acquisition stage, the specific mode is as follows:
firstly, converting an RGB image into a grayscale image, and feeding the grayscale image into an edge extraction network to obtain an edge image of the teeth; the edge extraction network adopts the U-shaped structure of U-Net, which consists of three parts: an encoder, a decoder and skip connections; the encoder part applies convolution and pooling to the image, and four pooling operations yield four features of different sizes; in the skip connection part, the feature maps of the different encoder layers are fused with the feature maps of the corresponding decoder layers to obtain the edge image;
the resulting edge image is then added to the gray scale image to enhance the edge features of the image.
3. The method for segmenting a CBCT image of a tooth according to claim 2, wherein: the encoder section includes two 3×3 convolutional layers with a ReLU layer, and one 2×2 max-pooling layer; the decoder section includes one up-sampling convolution layer and two 3×3 convolutional layers;
the loss function employed by the edge extraction network is a cross entropy loss function:
CE(p) = -y·log(p) - (1-y)·log(1-p)
where y ∈ {0,1} is the true pixel label and p ∈ (0,1) is the pixel label probability predicted by the network.
4. The method for segmenting a CBCT image of a tooth according to claim 1, wherein: in the image segmentation stage, the backbone network adopts the combination of ResNet50 and a characteristic pyramid network, and a convolution attention mechanism module is added into the backbone network to improve the effectiveness of the backbone network.
5. The method for segmenting a CBCT image of a tooth according to claim 4, wherein: the feature pyramid network comprises three parts, namely a bottom-up branch, a top-down branch and lateral connections; the bottom-up branch is a feature-size reduction process in which the features of different sizes are the outputs of the respective residual blocks; the top-down branch is an expansion process performed by nearest-neighbor up-sampling; and the lateral connections align the channel number of the bottom-up features with the output channels of the top-down pathway.
CN202310221468.2A 2023-03-09 2023-03-09 Full-automatic tooth CBCT image segmentation method based on deep learning Pending CN116152500A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310221468.2A CN116152500A (en) 2023-03-09 2023-03-09 Full-automatic tooth CBCT image segmentation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310221468.2A CN116152500A (en) 2023-03-09 2023-03-09 Full-automatic tooth CBCT image segmentation method based on deep learning

Publications (1)

Publication Number Publication Date
CN116152500A true CN116152500A (en) 2023-05-23

Family

ID=86350691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310221468.2A Pending CN116152500A (en) 2023-03-09 2023-03-09 Full-automatic tooth CBCT image segmentation method based on deep learning

Country Status (1)

Country Link
CN (1) CN116152500A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503790A (en) * 2023-06-27 2023-07-28 深圳市遐拓科技有限公司 Image description enhancement algorithm model-based fire scene rescue image processing method
CN116994070A (en) * 2023-09-25 2023-11-03 四川大学 Tooth image processing method and device based on measurable subspace dynamic classifier
CN116994070B (en) * 2023-09-25 2023-12-01 四川大学 Tooth image processing method and device based on measurable subspace dynamic classifier

Similar Documents

Publication Publication Date Title
US11810271B2 (en) Domain specific image quality assessment
CN111968120B (en) Tooth CT image segmentation method for 3D multi-feature fusion
CN116152500A (en) Full-automatic tooth CBCT image segmentation method based on deep learning
US10991091B2 (en) System and method for an automated parsing pipeline for anatomical localization and condition classification
CN112258488A (en) Medical image focus segmentation method
CN113012172A (en) AS-UNet-based medical image segmentation method and system
CN108764342B (en) Semantic segmentation method for optic discs and optic cups in fundus image
JP6853419B2 (en) Information processing equipment, information processing methods, computer programs
CN114757960B (en) Tooth segmentation and reconstruction method based on CBCT image and storage medium
CN115578406B (en) CBCT jaw bone region segmentation method and system based on context fusion mechanism
CN113643297A (en) Computer-aided age analysis method based on neural network
CN113344940A (en) Liver blood vessel image segmentation method based on deep learning
CN112950599B (en) Large intestine cavity area and intestine content labeling method based on deep learning
CN115760875A (en) Full-field medical picture region segmentation method based on self-supervision learning
US20220122261A1 (en) Probabilistic Segmentation of Volumetric Images
CN115359071A (en) 3D semantic segmentation network construction method, CBCT3D tooth instance segmentation method and system
CN114387259A (en) Method and device for predicting missing tooth coordinates and training method of recognition model
CN112927225A (en) Wisdom tooth growth state auxiliary detection system based on artificial intelligence
CN112598603A (en) Oral cavity caries image intelligent identification method based on convolution neural network
CN113628223B (en) Dental CBCT three-dimensional tooth segmentation method based on deep learning
CN114241407B (en) Close-range screen monitoring method based on deep learning
CN117197166B (en) Polyp image segmentation method and imaging method based on edge and neighborhood information
CN117576396A (en) Dental structuring instance segmentation method based on diffusion priori restoration
JP7129754B2 (en) Image processing device
CN115953490A (en) Automatic colorizing method for CT image

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination