CN112785609B - CBCT tooth segmentation method based on deep learning - Google Patents
CBCT tooth segmentation method based on deep learning
- Publication number: CN112785609B
- Application number: CN202110180002.3A
- Authority: CN (China)
- Prior art keywords: feature map, network, tooth, cbct, image
- Prior art date: 2021-02-07
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/11 — Image analysis; Segmentation; Region-based segmentation
- G06N3/02, G06N3/08 — Computing arrangements based on biological models; Neural networks; Learning methods
- G06T9/002 — Image coding using neural networks
- G06T2207/20081 — Special algorithmic details; Training; Learning
- G06T2207/20084 — Special algorithmic details; Artificial neural networks [ANN]
- G06T2207/20112, G06T2207/20132 — Image segmentation details; Image cropping
- G06T2207/30036 — Biomedical image processing; Dental; Teeth
Abstract
The invention relates to a CBCT tooth segmentation method based on deep learning, and belongs to the field of computer graphics. The method comprises the following steps: S1: preprocessing the cone beam CT (CBCT) image using prior knowledge of tooth images and extracting the tooth part to obtain a region of interest; S2: performing feature extraction on the image through a ResNet-FPN network to obtain a feature map; S3: compressing the spatial and channel dimensions of the feature map with a CBAM module, thereby encoding the importance of the feature map; S4: extracting candidate regions from the feature map with a Region Proposal Network (RPN); S5: pooling the corresponding region of the feature map to a fixed size, according to the position coordinates of the preselected boxes, using ROI Align; S6: performing classification, bounding-box regression, segmentation, and segmentation scoring on the candidate regions. The invention simplifies the segmentation steps and improves tooth segmentation accuracy in CBCT images.
Description
Technical Field
The invention belongs to the field of computer graphics, and relates to a CBCT tooth segmentation method based on deep learning.
Background
At present, oral medicine mainly covers orthodontics, dental implantation, and the diagnosis of oral, maxillofacial and joint diseases. Taking dental implants as an example, doctors currently draw up surgical plans from clinical experience, based on the patient's X-ray panoramic and periapical radiographs. Because a panoramic view cannot faithfully reflect the spatial position of the teeth in the oral cavity and lacks accurate root information, the success rate of the operation is affected and the risk of surgical failure increases.
The development of computer vision and graphics in recent years has made digital oral medicine a reality. The key to digital oral medicine is acquiring and segmenting a complete 3D dental model. Three-dimensional information such as tooth shape and root position is very important for clinical operations such as maxillofacial surgery, root canal therapy, and therapy simulation. However, obtaining a complete 3D model of the teeth is difficult. Currently, two mainstream techniques exist for obtaining 3D dental models: (1) intraoral or desktop scanning; (2) cone beam computed tomography (CBCT). Intraoral or desktop scanning is a convenient way to capture the geometry of the crown surface, but in many cases it cannot obtain the root information needed for accurate diagnosis and treatment. Compared with conventional CT, CBCT offers a smaller radiation dose, shorter scanning time and higher spatial resolution, and it provides comprehensive 3D volume information of all oral tissues, including teeth. Segmenting the teeth from the CBCT image therefore yields a more complete and accurate tooth model. However, CBCT dental images have the following characteristics, which make tooth segmentation very challenging: (1) the gaps between adjacent teeth are small, so the contour lines where adjacent teeth touch are lost in the image; (2) tooth density differs from crown to root, so the gray scale of a single tooth in the CBCT image is not uniform; (3) the root is embedded in the alveolar bone, whose density is similar to that of the root, so the edge is unclear; (4) the contour of the tooth varies topologically between the root and crown portions.
In recent years, many experts and scholars at home and abroad have studied CBCT tooth segmentation intensively, and many algorithms now exist, such as level-set segmentation, threshold segmentation and region growing. However, these methods all require the operator to have strong prior knowledge, and a good segmentation result can only be obtained if the algorithm is initialized very well.
Disclosure of Invention
In view of the above, the present invention provides a CBCT tooth segmentation method based on deep learning. It solves the problem that conventional tooth segmentation methods require good initialization to segment teeth accurately, and realizes end-to-end CBCT tooth segmentation with deep learning: teeth are segmented fully automatically, without user labeling or subsequent post-processing steps, so the segmentation procedure is simplified while teeth are segmented from CBCT images more accurately.
In order to achieve the purpose, the invention provides the following technical scheme:
a CBCT tooth segmentation method based on deep learning comprises the following steps:
S1: preprocessing a cone beam computed tomography (CBCT) image using prior knowledge of tooth images, and extracting the tooth part to obtain a region of interest;
S2: extracting features from the image through a deep residual network (ResNet) and a feature pyramid network (FPN) to obtain a feature map;
S3: sequentially compressing the spatial and channel dimensions of the feature map with a Convolutional Block Attention Module (CBAM), thereby encoding the importance of the feature map;
S4: extracting candidate regions from the feature map with a Region Proposal Network (RPN);
S5: pooling the corresponding regions of the feature map into a fixed-size feature map for subsequent operations, according to the position coordinates of the preselected boxes, using an ROI Align network;
S6: performing classification, bounding-box regression, segmentation, and segmentation scoring on the candidate regions.
Further, in step S1, the CBCT image includes dental images and skull information;
preprocessing the CBCT image includes: cropping the image using prior knowledge of the teeth, the cropped size being 384 × 320, and converting the CBCT scan into a jpg-format image with the MicroDicom medical image processing software to facilitate subsequent processing.
Further, step S2 specifically includes: learning the residual representation between input and output with multiple parameter layers of the ResNet network, which mitigates the gradient vanishing, gradient explosion and network degradation caused by increasing the number of network layers; feeding both the deep and shallow tooth feature maps of the ResNet network into the FPN network, which integrates the feature maps efficiently through bottom-up, top-down and lateral connections, improving accuracy without greatly increasing detection time. Passing the CBCT dental picture through the ResNet and FPN networks yields an optimal combined set of CNN features of the dental picture, and a feature map is output.
Further, in step S3, the CBAM model is an attention-based model that encodes the importance of the feature map, and is divided into two parts: a channel attention module (CAM) and a spatial attention module (SAM).
Further, in step S3, the feature map F obtained in step S2 is input into the CBAM model. The CAM module compresses the feature map F spatially to obtain a channel weight coefficient Mc; multiplying the coefficient Mc by the feature map F gives a new feature map F′. The feature map F′ then enters the SAM module, which compresses F′ along the channel dimension to obtain a spatial weight coefficient Ms; multiplying the coefficient Ms by the feature map F′ gives the corrected feature map F″. The corrected feature map F″ incorporates both channel and spatial weight coefficients, so the network can learn the features of the teeth better.
Further, step S3 specifically includes: compressing the feature map F spatially in the CAM module, specifically: performing spatial compression using max pooling MaxPool(F) and average pooling AvgPool(F) to obtain two different spatial descriptors F^c_max and F^c_avg; a shared network composed of a multi-layer perceptron (MLP) operates on these two descriptors to obtain the channel weight coefficient Mc, calculated as:

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W1(W0(F^c_avg)) + W1(W0(F^c_max)))

where σ(·) denotes the sigmoid function, W1 ∈ R^(C/r×C) and W0 ∈ R^(C×C/r), C being the number of channels and r the reduction ratio;

the obtained channel weight coefficient Mc and the feature map F give the feature map F′, calculated as:

F′ = Mc(F) ⊗ F

where ⊗ denotes element-wise multiplication;

the feature map F′ then enters the SAM module, where max pooling and average pooling compress F′ along the channel dimension to obtain two different spatial descriptors F′^s_avg and F′^s_max; these are concatenated and convolved by a standard convolution layer to generate the two-dimensional spatial weight coefficient Ms, calculated as:

Ms(F′) = σ(f^(7×7)([F′^s_avg; F′^s_max]))

where f^(7×7) denotes a convolution operation with a kernel size of 7 × 7;

the obtained spatial weight coefficient Ms and the feature map F′ give the feature map F″, calculated as:

F″ = Ms(F′) ⊗ F′

where F″ is the feature map corrected by channel attention and spatial attention, i.e., the feature map after importance encoding.
Further, in step S4, the importance-encoded feature map F″ from step S3 is input into an RPN network. By sliding a window over the shared feature map, the network generates K target boxes (anchors) with preset aspect ratios and areas for each position; by default K = 9, the initial target boxes covering three areas (128 × 128, 256 × 256, 512 × 512), each with three aspect ratios (1:1, 1:2, 2:1). The RPN is essentially a tree structure whose trunk is a 3 × 3 convolution layer and whose branches are two 1 × 1 convolution layers. The first 1 × 1 convolution layer produces the foreground/background output: each position corresponds to K target boxes, and each target box has a foreground score and a background score, so the output is 2 × K values. The other 1 × 1 convolution layer produces the box-correction output: each position corresponds to K anchors, and each anchor corresponds to 4 correction coordinates, so the output is 4 × K values. In this network, candidate bounding boxes are extracted for the tooth feature map F″ and bounding-box correction is performed.
Further, in step S5, the result of step S4 is input into the ROI Align network: the ROI region of the original tooth image is mapped onto the tooth feature map, and the extracted region is then pooled and normalized to the input size of the convolutional network. ROI Align uses bilinear interpolation directly for the mapping from the original image to the feature-map ROI, which reduces error, so the pooled result corresponds to the original image more accurately. Given floating-point coordinates (X, Y), the four nearest surrounding points are interpolated twice in the Y direction and once more in the X direction to obtain the new value, and the shape of the ROI is not changed.
Further, in step S6, during segmentation, the fixed-size feature map generated by the ROI Align network undergoes 4 convolution operations to produce a feature map of size 14 × 14; upsampling then produces a feature map of size 28 × 28; finally, a convolution operation produces a feature map of size 28 × 28 with depth 80;
For the segmentation score, a correction network with MaskIoU scoring is used; the MaskIoU head aims to regress IoU between the predicted tooth mask and its ground-truth mask. The input of MaskIoU consists of two parts: the RoI feature map obtained by ROI Align and the mask output by the mask branch. After the two are concatenated, MaskIoU is output through 3 convolution layers and 2 fully connected layers. This network module regresses MaskIoU from the instance features and the corresponding predicted mask; it can accurately evaluate the score of the tooth segmentation task and improves the segmentation results.
The invention has the following beneficial effects: the invention is based on a deep learning method and uses a two-stage, deeply supervised neural network. A Convolutional Block Attention Module (CBAM) is introduced to compress the spatial and channel dimensions of the feature map in turn, thereby encoding the importance of the feature map; a network that scores the segmentation result is adopted, regressing IoU between the predicted mask and its ground-truth mask; and the final score is determined jointly by the classification score and the segmentation score, so that more accurately segmented teeth are obtained.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of the deep learning-based CBCT tooth segmentation method according to the present invention;
FIG. 2 is a schematic overall framework diagram of the deep learning-based CBCT tooth segmentation method according to the present invention;
FIG. 3 is an image of a dental crown segmented from CBCT data in accordance with the present invention;
FIG. 4 is an image of a tooth root segmented from CBCT data according to the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for the purpose of illustrating the invention only and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, it should be understood that terms such as "upper", "lower", "left", "right", "front" and "rear" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplification of description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Terms describing positional relationships in the drawings are therefore illustrative only and are not to be construed as limiting the present invention; their specific meaning can be understood by those skilled in the art according to the specific situation.
Referring to fig. 1 to 4, the CBCT tooth segmentation method based on deep learning according to the present invention includes the following steps:
S1: preprocessing a cone beam computed tomography (CBCT) image using prior knowledge of tooth images, extracting the tooth part and obtaining a region of interest, specifically: the whole cone beam computed tomography image contains not only the tooth image but also skull information; the image is cropped using prior knowledge of the teeth, the cropped size being 384 × 320, and the CBCT scan is converted into a jpg-format picture with the MicroDicom medical image processing software to facilitate subsequent processing;
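For illustration, the preprocessing step can be sketched as follows. The patent uses the MicroDicom GUI for format conversion; this sketch instead assumes programmatic access with pydicom and Pillow, and the crop_box coordinates are hypothetical placeholders — only the 384 × 320 crop size comes from the description.

```python
import numpy as np
import pydicom                      # assumption: DICOM slices are read with pydicom
from PIL import Image

def preprocess_slice(dicom_path, out_path, crop_box=(60, 100, 444, 420)):
    """Crop a CBCT slice to the tooth region and save it as JPEG.

    crop_box is a hypothetical (left, top, right, bottom) window chosen from
    prior knowledge of where the teeth lie; it yields a 384 x 320 crop.
    """
    ds = pydicom.dcmread(dicom_path)
    img = ds.pixel_array.astype(np.float32)
    # Normalize the CT intensities to 0-255 for JPEG export.
    img = (img - img.min()) / (img.max() - img.min() + 1e-8) * 255.0
    left, top, right, bottom = crop_box
    crop = img[top:bottom, left:right]          # 320 rows x 384 columns
    Image.fromarray(crop.astype(np.uint8)).save(out_path, "JPEG")
```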
S2: in extracting tooth features with a deep residual network (ResNet), multiple parameter layers learn the residual representation between input and output, which mitigates the gradient vanishing, gradient explosion and network degradation caused by increasing the number of network layers. The feature pyramid network (FPN) uses not only the deep tooth feature maps of the ResNet network but also the shallow ones, and integrates the feature maps efficiently through bottom-up, top-down and lateral connections, improving accuracy without greatly increasing detection time. Passing the CBCT dental picture through the ResNet and FPN networks yields an optimal combined set of CNN features of the dental picture, and a feature map is output.
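A minimal sketch of this backbone stage, assuming PyTorch/torchvision; the ResNet depth (resnet50) and the use of torchvision's ready-made resnet_fpn_backbone are assumptions, since the patent only names ResNet and FPN generically.

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-50 with an FPN on top; depth is an assumption. Depending on the
# torchvision version, weights=None may be required instead of pretrained=False.
backbone = resnet_fpn_backbone("resnet50", pretrained=False)

x = torch.randn(1, 3, 320, 384)      # one preprocessed 384 x 320 tooth slice
feature_maps = backbone(x)           # OrderedDict of multi-scale feature maps
for name, fmap in feature_maps.items():
    print(name, tuple(fmap.shape))   # e.g. '0' -> (1, 256, 80, 96), ..., 'pool'
```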
S3: compressing the feature map successively along its spatial and channel dimensions with the Convolutional Block Attention Module (CBAM), thereby encoding the importance of the feature map. The feature map obtained in the previous step (hereinafter denoted F) is input into the CBAM model, an attention-based model that encodes the importance of the feature map and is divided into two parts: the channel attention module (CAM) and the spatial attention module (SAM). In the CAM module, the feature map is compressed spatially using max pooling and average pooling, yielding two different spatial descriptors F^c_max and F^c_avg; a shared network consisting of a multi-layer perceptron (MLP) operates on the two descriptors to obtain the channel weight coefficient Mc, calculated as:

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W1(W0(F^c_avg)) + W1(W0(F^c_max)))

where σ(·) denotes the sigmoid function, W1 ∈ R^(C/r×C) and W0 ∈ R^(C×C/r), C being the number of channels and r the reduction ratio;

the obtained channel weight coefficient Mc and the feature map F give the feature map F′, calculated as:

F′ = Mc(F) ⊗ F

The feature map F′ then enters the SAM spatial attention module, which compresses the feature map along the channel dimension using max pooling and average pooling, yielding two different spatial descriptors F′^s_avg and F′^s_max; these are concatenated and convolved by a standard convolution layer to produce the two-dimensional spatial weight coefficient Ms, calculated as:

Ms(F′) = σ(f^(7×7)([F′^s_avg; F′^s_max]))

where f^(7×7) denotes a convolution operation with a kernel size of 7 × 7.

The obtained spatial weight coefficient Ms and the feature map F′ give the feature map F″, calculated as:

F″ = Ms(F′) ⊗ F′

where F″ is the feature map corrected by channel attention and spatial attention, i.e., the feature map after importance encoding. The positions of teeth in computed tomography images are relatively fixed, and encoding the importance of the tooth features over space and channels lets the network learn tooth features better.
S4: extracting candidate regions from the feature map with a Region Proposal Network (RPN);
The importance-encoded feature map F″ from the previous step is input into the region proposal network. By sliding a window over the shared feature map, the network generates K bounding boxes (anchors) with preset aspect ratios and areas for each position. The default K is 9, but because tooth targets are small, K is set to 1 in this implementation; the initial bounding boxes cover three areas (128 × 128, 256 × 256, 512 × 512), each with three aspect ratios (1:1, 1:2, 2:1). The RPN network is essentially a tree structure whose trunk is one 3 × 3 convolution layer and whose branches are two 1 × 1 convolution layers. The first 1 × 1 convolution layer produces the foreground/background output: each position corresponds to 1 target box, and each target box has a foreground score and a background score, so the output is 2 values. The other 1 × 1 convolution layer produces the box-correction output: each position corresponds to 1 anchor, and each anchor corresponds to 4 correction coordinates, so the output is 4 values. In this network, candidate bounding boxes are extracted for the tooth feature map F″ and bounding-box correction is performed.
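The tree-structured RPN head described here (one 3 × 3 trunk convolution and two 1 × 1 branch convolutions) can be sketched as follows; the channel width of 256 is an assumed value, and k matches the anchor count per position (9 by default, 1 in this embodiment).

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 'trunk' convolution with two 1x1 'branch' convolutions:
    one for foreground/background scores, one for box corrections."""

    def __init__(self, in_channels=256, k=1):
        super().__init__()
        self.trunk = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, 2 * k, 1)   # 2 scores per anchor
        self.reg = nn.Conv2d(in_channels, 4 * k, 1)   # 4 corrections per anchor

    def forward(self, fmap):
        t = torch.relu(self.trunk(fmap))
        return self.cls(t), self.reg(t)
```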
S5: pooling the corresponding regions of the feature map into a fixed-size feature map for subsequent operations, according to the position coordinates of the preselected boxes, using ROI Align;
The result of the previous step is input into the ROI Align network. The network first maps the ROI region of the original image onto the feature-map ROI, and then pools the region, normalizing it to the input size of the convolutional network. ROI Align uses bilinear interpolation directly for the mapping from the original image to the feature-map ROI, without rounding, so the error is small and the pooled result corresponds to the original image more accurately. Given floating-point coordinates (X, Y), the four nearest surrounding points are interpolated twice in the Y direction and once more in the X direction to obtain the new value, and the shape of the ROI is not changed.
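As a usage sketch, torchvision's roi_align performs this quantization-free bilinear pooling; the feature shape, proposal coordinates and stride below are illustrative values, not ones taken from the patent.

```python
import torch
from torchvision.ops import roi_align

# fmap: one FPN level, e.g. stride-4 features of a 384 x 320 slice.
fmap = torch.randn(1, 256, 80, 96)
# One proposal in original-image coordinates: (batch_index, x1, y1, x2, y2).
rois = torch.tensor([[0, 120.0, 80.0, 200.0, 180.0]])
# spatial_scale maps image coordinates onto this feature map (1/4 here);
# bilinear sampling is used internally, with no coordinate rounding.
pooled = roi_align(fmap, rois, output_size=(7, 7), spatial_scale=0.25,
                   sampling_ratio=2, aligned=True)
print(pooled.shape)   # torch.Size([1, 256, 7, 7])
```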
S6: performing classification, bounding-box regression, segmentation, and segmentation scoring on the candidate regions.
Multi-class classification, bounding-box regression, instance segmentation and segmentation scoring are performed on the tooth ROI region. During instance segmentation, the fixed-size feature map of the tooth ROI generated by the ROI Align operation undergoes 4 convolution operations to produce a feature map of size 14 × 14; upsampling then produces a feature map of size 28 × 28; finally, a convolution operation produces a feature map of size 28 × 28 with depth 80. For the segmentation score, a correction network with MaskIoU scoring is used; the MaskIoU head aims to regress IoU between the predicted mask and its ground-truth mask. The input of MaskIoU consists of two parts: the RoI feature map obtained by ROI Align and the mask output by the mask branch. After the two are concatenated, MaskIoU is output through 3 convolution layers and 2 fully connected layers. This network module regresses MaskIoU from the instance features and the corresponding predicted mask. The module comes in three types: the first learns MaskIoU only for the target class and ignores the other classes in the proposal; the second learns MaskIoU for all classes — if a class does not appear in the RoI, its target MaskIoU is set to 0, and this setup uses only regression to predict MaskIoU, requiring the regressor to be aware of irrelevant classes; the third learns MaskIoU for all foreground classes, where a foreground class is one that appears in the RoI region, and the remaining classes in the proposal are ignored. For tooth segmentation we need the third type, because in instance segmentation different teeth belong to different classes that must be segmented.
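A sketch of the two heads just described, assuming PyTorch; the RoI feature width (256), the max-pool used to shrink the predicted mask to 14 × 14, and the fully connected width (1024) are assumptions beyond what the patent specifies (4 convolutions + upsampling + a 28 × 28 × 80 output for the mask branch; 3 convolutions + 2 fully connected layers for MaskIoU).

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Mask branch: four 3x3 convolutions on 14x14 RoI features, a 2x
    upsampling to 28x28, then a final convolution producing 80-channel
    (per-class) mask logits."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        convs = []
        for _ in range(4):
            convs += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                      nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*convs)
        self.up = nn.ConvTranspose2d(in_channels, in_channels, 2, stride=2)
        self.logits = nn.Conv2d(in_channels, num_classes, 1)

    def forward(self, roi_feats):                  # (N, 256, 14, 14)
        x = self.convs(roi_feats)
        x = torch.relu(self.up(x))                 # (N, 256, 28, 28)
        return self.logits(x)                      # (N, 80, 28, 28)

class MaskIoUHead(nn.Module):
    """MaskIoU branch: RoI features concatenated with the (downsampled)
    predicted mask pass through 3 convolutions and 2 fully connected
    layers to regress a per-class IoU."""

    def __init__(self, in_channels=257, num_classes=80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(256 * 7 * 7, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes),
        )

    def forward(self, roi_feats, mask):            # (N,256,14,14), (N,1,28,28)
        mask_small = nn.functional.max_pool2d(mask, 2)   # -> (N, 1, 14, 14)
        x = torch.cat([roi_feats, mask_small], dim=1)    # 257 channels
        return self.fc(self.convs(x))              # per-class MaskIoU
```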
Let s_mask be defined as the score of the predicted tooth mask. The ideal s_mask equals the pixel-level IoU between the predicted mask and its matched ground-truth mask — the MaskIoU mentioned earlier. The ideal s_mask should be positive only for the ground-truth class and zero for all other classes, because a mask belongs to only one class. This requires the mask score s_mask to do well on two tasks: classifying the mask into the correct category, and regressing the proposal's MaskIoU for the foreground object class. It is difficult to train both tasks with a single objective function, so the mask score learning task is decomposed into mask classification and IoU regression for all object classes, expressed as s_mask = s_cls · s_iou. Here s_cls focuses on classifying which category a proposal belongs to, which is done in the classification task of the R-CNN stage, so the corresponding classification score can be taken directly; s_iou focuses on regressing MaskIoU and is obtained from the MaskIoU scoring correction network described above. Determining the score of the whole tooth segmentation from the classification score and the segmentation score together makes the score more accurate and improves segmentation accuracy.
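Numerically, the decomposition is a simple product; the values below are hypothetical:

```python
# Hypothetical scores for one detected tooth: classification confidence from
# the R-CNN head and the regressed MaskIoU from the scoring head.
s_cls, s_iou = 0.92, 0.85
s_mask = s_cls * s_iou   # final mask score: 0.782
```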
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.
Claims (5)
1. A CBCT tooth segmentation method based on deep learning is characterized by comprising the following steps:
S1: preprocessing a cone beam computed tomography (CBCT) image using prior knowledge of tooth images, and extracting the tooth part to obtain a region of interest;
S2: extracting features from the image through a deep residual network (ResNet) and a feature pyramid network (FPN) to obtain a feature map, specifically: learning residual representations between inputs and outputs using multiple parameter layers of the ResNet network; feeding the deep and shallow tooth feature maps of the ResNet network into the FPN network, integrating the feature maps through bottom-up, top-down and lateral connections to obtain an optimal combined set of CNN features of the tooth picture, and outputting the feature map;
S3: sequentially compressing the feature map with a Convolutional Block Attention Module (CBAM) to encode the importance of the feature map, the CBAM model being divided into two parts: a channel attention module (CAM) and a spatial attention module (SAM);
inputting the feature map F obtained in step S2 into the CBAM model; performing spatial compression on the feature map F in the CAM module to obtain a channel weight coefficient Mc(F); multiplying the coefficient Mc(F) by the feature map F to obtain a new feature map F′; the feature map F′ then entering the SAM module, where channel compression of F′ yields a spatial weight coefficient Ms(F′); multiplying the coefficient Ms(F′) by the feature map F′ to obtain the corrected feature map F″;
S4: extracting candidate regions from the feature map compressed in step S3 with a Region Proposal Network (RPN);
S5: pooling the corresponding regions of the feature map processed in step S4 into a fixed-size feature map, according to the position coordinates of the preselected boxes, using the ROI Align network;
S6: performing classification, bounding-box regression, segmentation, and segmentation scoring on the candidate regions;
during segmentation, performing 4 convolution operations on the fixed-size feature map generated by the ROI Align network, then upsampling, and finally performing a convolution operation;
for the segmentation score, using a correction network with MaskIoU scoring to regress IoU between the tooth mask and its ground-truth mask; the input of MaskIoU consisting of two parts, namely the RoI feature map obtained by ROI Align and the mask output by the mask branch; after the two are concatenated, MaskIoU being output through 3 convolution layers and 2 fully connected layers.
2. The CBCT tooth segmentation method based on deep learning of claim 1, wherein in step S1 the CBCT image includes dental images and skull information;
preprocessing the CBCT image includes: cropping the image using prior knowledge of the teeth, and converting the CBCT scan into a jpg-format image.
3. The CBCT tooth segmentation method based on deep learning of claim 1, wherein step S3 specifically includes: compressing the feature map F spatially in the CAM module, specifically: performing spatial compression using max pooling MaxPool(F) and average pooling AvgPool(F) to obtain two different spatial descriptors F^c_max and F^c_avg; a shared network composed of a multi-layer perceptron (MLP) operating on the two descriptors to obtain the channel weight coefficient Mc(F), calculated as:

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W1(W0(F^c_avg)) + W1(W0(F^c_max)))

where σ(·) denotes the sigmoid function, W1 ∈ R^(C/r×C) and W0 ∈ R^(C×C/r), C being the number of channels and r the reduction ratio;

performing element-wise multiplication of the obtained channel weight coefficient Mc(F) with the feature map F to obtain the feature map F′, calculated as:

F′ = Mc(F) ⊗ F

the feature map F′ entering the SAM module, where max pooling and average pooling compress F′ along the channel dimension to obtain two different spatial descriptors F′^s_avg and F′^s_max, which are concatenated and convolved by a standard convolution layer to generate the two-dimensional spatial weight coefficient Ms(F′), calculated as:

Ms(F′) = σ(f^(7×7)([F′^s_avg; F′^s_max]))

where f^(7×7) denotes a convolution operation with a kernel size of 7 × 7;

performing element-wise multiplication of the obtained spatial weight coefficient Ms(F′) with the feature map F′ to obtain the feature map F″, calculated as:

F″ = Ms(F′) ⊗ F′

wherein F″ is the feature map corrected by channel attention and spatial attention.
4. The deep-learning-based CBCT tooth segmentation method of claim 1, wherein in step S4 the importance-encoded feature map F″ from step S3 is input into the RPN network; by sliding a window over the shared feature map, the network generates, for each position, K target boxes (anchors) with preset aspect ratios and areas, the initial target boxes covering three areas, each with three aspect ratios (1:1, 1:2, 2:1); the RPN network has a tree structure, the trunk being one 3 × 3 convolution layer and the branches being two 1 × 1 convolution layers.
5. The CBCT tooth segmentation method based on deep learning of claim 4, wherein in step S5 the result of step S4 is input into the ROI Align network; the ROI region of the original tooth image is first mapped onto the feature-map ROI, and the extracted region is then pooled and normalized to the input size of the convolutional network; ROI Align uses bilinear interpolation directly for the mapping from the original image to the feature-map ROI; given floating-point coordinates (X, Y), the four nearest surrounding points are interpolated twice in the Y direction and once more in the X direction to obtain a new value, and the shape of the ROI is not changed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110180002.3A CN112785609B (en) | 2021-02-07 | 2021-02-07 | CBCT tooth segmentation method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110180002.3A CN112785609B (en) | 2021-02-07 | 2021-02-07 | CBCT tooth segmentation method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112785609A CN112785609A (en) | 2021-05-11 |
CN112785609B true CN112785609B (en) | 2022-06-03 |
Family
ID=75761447
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110180002.3A Active CN112785609B (en) | 2021-02-07 | 2021-02-07 | CBCT tooth segmentation method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112785609B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113344950A (en) * | 2021-07-28 | 2021-09-03 | 北京朗视仪器股份有限公司 | CBCT image tooth segmentation method combining deep learning with point cloud semantics |
CN114332107A (en) * | 2021-12-01 | 2022-04-12 | 石家庄铁路职业技术学院 | Improved tunnel lining water leakage image segmentation method |
CN114332463A (en) * | 2021-12-31 | 2022-04-12 | 成都工业职业技术学院 | MR brain tumor image example segmentation method, device, equipment and storage medium |
CN114187293B (en) * | 2022-02-15 | 2022-06-03 | 四川大学 | Oral cavity palate part soft and hard tissue segmentation method based on attention mechanism and integrated registration |
CN114549559A (en) * | 2022-03-01 | 2022-05-27 | 上海博恩登特科技有限公司 | Post-processing method and system for segmenting tooth result based on CBCT (Cone Beam computed tomography) data AI (Artificial Intelligence) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110930421A (en) * | 2019-11-22 | 2020-03-27 | 电子科技大学 | Segmentation method for CBCT (Cone Beam computed tomography) tooth image |
CN111968120A (en) * | 2020-07-15 | 2020-11-20 | 电子科技大学 | Tooth CT image segmentation method for 3D multi-feature fusion |
CN112017196A (en) * | 2020-08-27 | 2020-12-01 | 重庆邮电大学 | Three-dimensional tooth model mesh segmentation method based on local attention mechanism |
CN112150472A (en) * | 2020-09-24 | 2020-12-29 | 北京羽医甘蓝信息技术有限公司 | Three-dimensional jaw bone image segmentation method and device based on CBCT (cone beam computed tomography) and terminal equipment |
CN112257758A (en) * | 2020-09-27 | 2021-01-22 | 浙江大华技术股份有限公司 | Fine-grained image recognition method, convolutional neural network and training method thereof |
CN112308867A (en) * | 2020-11-10 | 2021-02-02 | 上海商汤智能科技有限公司 | Tooth image processing method and device, electronic equipment and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100158332A1 (en) * | 2008-12-22 | 2010-06-24 | Dan Rico | Method and system of automated detection of lesions in medical images |
US8761493B2 (en) * | 2011-07-21 | 2014-06-24 | Carestream Health, Inc. | Method and system for tooth segmentation in dental images |
US11645746B2 (en) * | 2018-11-28 | 2023-05-09 | Orca Dental AI Ltd. | Dental image segmentation and registration with machine learning |
- 2021-02-07: application CN202110180002.3A filed in China; granted as CN112785609B (status: Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110930421A (en) * | 2019-11-22 | 2020-03-27 | 电子科技大学 | Segmentation method for CBCT (Cone Beam computed tomography) tooth image |
CN111968120A (en) * | 2020-07-15 | 2020-11-20 | 电子科技大学 | Tooth CT image segmentation method for 3D multi-feature fusion |
CN112017196A (en) * | 2020-08-27 | 2020-12-01 | 重庆邮电大学 | Three-dimensional tooth model mesh segmentation method based on local attention mechanism |
CN112150472A (en) * | 2020-09-24 | 2020-12-29 | 北京羽医甘蓝信息技术有限公司 | Three-dimensional jaw bone image segmentation method and device based on CBCT (cone beam computed tomography) and terminal equipment |
CN112257758A (en) * | 2020-09-27 | 2021-01-22 | 浙江大华技术股份有限公司 | Fine-grained image recognition method, convolutional neural network and training method thereof |
CN112308867A (en) * | 2020-11-10 | 2021-02-02 | 上海商汤智能科技有限公司 | Tooth image processing method and device, electronic equipment and storage medium |
Non-Patent Citations (5)
Title |
---|
Yuma Miki et al., "Classification of teeth in cone-beam CT using deep convolutional neural network," Computers in Biology and Medicine, vol. 80, 2017. *
Zhiming Cui et al., "ToothNet: Automatic Tooth Instance Segmentation and Identification from Cone Beam CT Images," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. *
Wang Ge et al., "Tooth CT image segmentation technology based on level sets," Journal of Computer Applications, vol. 36, no. 3, 2016. *
Gou Miao, "Research on segmentation based on dental CT image data," China Master's Theses Full-text Database (Medicine & Health Sciences), 2020. *
Qian Jiahong et al., "Fast CBCT image tooth segmentation algorithm based on a regional energy function," Journal of Computer-Aided Design & Computer Graphics, vol. 30, no. 6, 2018. *
Also Published As
Publication number | Publication date |
---|---|
CN112785609A (en) | 2021-05-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112785609B (en) | CBCT tooth segmentation method based on deep learning | |
JP7451406B2 (en) | Automatic 3D root shape prediction using deep learning methods | |
US11735306B2 (en) | Method, system and computer readable storage media for creating three-dimensional dental restorations from two dimensional sketches | |
US11464467B2 (en) | Automated tooth localization, enumeration, and diagnostic system and method | |
US20200350059A1 (en) | Method and system of teeth alignment based on simulating of crown and root movement | |
Tian et al. | DCPR-GAN: dental crown prosthesis restoration using two-stage generative adversarial networks | |
US11443423B2 (en) | System and method for constructing elements of interest (EoI)-focused panoramas of an oral complex | |
Zanjani et al. | Mask-MCNet: Tooth instance segmentation in 3D point clouds of intra-oral scans | |
US20220084267A1 (en) | Systems and Methods for Generating Quick-Glance Interactive Diagnostic Reports | |
Tian et al. | Efficient computer-aided design of dental inlay restoration: a deep adversarial framework | |
WO2022213654A1 (en) | Ultrasonic image segmentation method and apparatus, terminal device, and storage medium | |
CN107203998A (en) | A kind of method that denture segmentation is carried out to pyramidal CT image | |
CN115205469A (en) | Tooth and alveolar bone reconstruction method, equipment and medium based on CBCT | |
CN114757960A (en) | Tooth segmentation and reconstruction method based on CBCT image and storage medium | |
US20220358740A1 (en) | System and Method for Alignment of Volumetric and Surface Scan Images | |
Du et al. | Mandibular canal segmentation from CBCT image using 3D convolutional neural network with scSE attention | |
Tian et al. | Efficient tooth gingival margin line reconstruction via adversarial learning | |
Chen et al. | Detection of various dental conditions on dental panoramic radiography using Faster R-CNN | |
US20230252748A1 (en) | System and Method for a Patch-Loaded Multi-Planar Reconstruction (MPR) | |
CN113393470A (en) | Full-automatic tooth segmentation method | |
US20230419631A1 (en) | Guided Implant Surgery Planning System and Method | |
CN112201349A (en) | Orthodontic operation scheme generation system based on artificial intelligence | |
CN116797731A (en) | Artificial intelligence-based oral cavity CBCT image section generation method | |
CN116421341A (en) | Orthognathic surgery planning method, orthognathic surgery planning equipment, orthognathic surgery planning storage medium and orthognathic surgery navigation system | |
Anusree et al. | A Deep Learning Approach to Generating Flattened CBCT Volume Across Dental Arch From 2D Panoramic X-ray for 3D Oral Cavity Reconstruction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |