CN109003267B - Computer-implemented method and system for automatically detecting target object from 3D image


Info

Publication number
CN109003267B
Authority
CN
China
Prior art keywords
image
target object
computer
learning network
network
Prior art date
Legal status
Active
Application number
CN201810789942.0A
Other languages
Chinese (zh)
Other versions
CN109003267A (en)
Inventor
宋麒
孙善辉
陈翰博
白军杰
高峰
尹游兵
Current Assignee
Shenzhen Keya Medical Technology Corp
Original Assignee
Shenzhen Keya Medical Technology Corp
Priority date
Filing date
Publication date
Priority claimed from U.S. patent application No. 15/996,434 (US10867384B2)
Application filed by Shenzhen Keya Medical Technology Corp
Publication of CN109003267A
Application granted
Publication of CN109003267B

Classifications

    • G06T7/0012: Biomedical image inspection
    • G06T17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T7/11: Region-based segmentation
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/10081: Computed x-ray tomography [CT]
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/20112: Image segmentation details
    • G06T2207/20164: Salient point detection; Corner detection
    • G06T2207/30064: Lung nodule


Abstract

The present disclosure relates to a computer-implemented method and system for automatically detecting a target object from a 3D image. The method may include receiving a 3D image acquired by an imaging device. The method may further include detecting, by the processor, a plurality of bounding boxes containing the target object using a 3D learning network. The learning network may be trained to generate a plurality of feature maps at different scales based on the 3D image. The method may also include determining, by the processor, a set of parameters identifying each detected bounding box using the 3D learning network, and locating, by the processor, the target object based on the set of parameters. The method enables fast, accurate and automatic detection of target objects from 3D images by means of a 3D learning network.

Description

Computer-implemented method and system for automatically detecting target object from 3D image
Cross Reference to Related Applications
This application claims priority from U.S. provisional application No. 62/542,890, filed on August 9, 2017, which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates generally to image processing and analysis. More particularly, the present disclosure relates to methods and systems for automatically locating and detecting target objects from 3D images.
Background
The accuracy of diagnosis and the effectiveness of treatment depend on the quality of medical image analysis, in particular on the detection of target objects (such as organs, tissues, and target sites). Volumetric (3D) imaging, such as volumetric CT, can capture more valuable medical information than conventional two-dimensional imaging, thereby facilitating more accurate diagnosis. However, the target object is typically detected by experienced medical personnel (such as radiologists) rather than by a machine, which makes the process cumbersome, time consuming, and prone to error.
One example is the detection of lung nodules from images of the lungs. Fig. 1 shows an example of an axial plane image from a volumetric chest CT. The high-density mass within the white bounding box corresponds to a lung nodule. To detect such lung nodules, a radiologist must screen hundreds or even thousands of images from a volumetric CT scan. Because 2D images lack 3D spatial information, identifying nodules from 2D images alone is not a simple task. Distinguishing small nodules from blood vessels in 2D images is not easy, because blood vessels in 2D axial views are also circular or elliptical and therefore look like nodules. Typically, radiologists need to examine neighboring images to mentally reconstruct the 3D spatial relationships and/or to examine sagittal or coronal views (of lower resolution) for reference. Therefore, detection of lung nodules depends entirely on the experience of the radiologist.
Although some basic machine learning methods have been introduced for detection, these methods typically rely on hand-crafted features, and their detection accuracy is therefore low. Furthermore, such machine learning is generally limited to learning on 2D images; because of the resulting lack of 3D spatial information and the considerable computational resources required for 3D learning, the target object cannot be detected directly in the 3D image.
The present disclosure provides a method and system that can quickly, accurately, and automatically detect a target object from a 3D image by means of a 3D learning network. Such detection may include, but is not limited to, locating the target object, determining the size of the target object, and identifying the type of the target object (e.g., a blood vessel or lung nodule).
Disclosure of Invention
In one aspect, the present disclosure is directed to a computer-implemented method for automatically detecting a target object from a 3D image. The method may include receiving a 3D image acquired by an imaging device. The method may further include detecting, by the processor, a plurality of bounding boxes containing the target object using a 3D learning network. The learning network may be trained to generate a plurality of feature maps at different scales based on the 3D image. The method may also include determining, by the processor, a set of parameters identifying each detected bounding box using the 3D learning network, and locating, by the processor, the target object based on the set of parameters.
In some embodiments, the set of parameters includes coordinates identifying the location of individual bounding boxes in the 3D image.
In some embodiments, the set of parameters includes dimensions that identify the size of the respective bounding box.
In some embodiments, the 3D learning network is trained to perform regression on the set of parameters.
In some embodiments, the computer-implemented method further comprises associating a plurality of anchor boxes with the 3D image, wherein the set of parameters indicates an offset of each bounding box relative to the respective anchor box.
In some embodiments, each anchor box is associated with a grid cell of the feature map.
In some embodiments, the anchor block is scaled according to a scale of the feature map.
In some embodiments, the plurality of feature maps have varying image sizes.
In some embodiments, the plurality of feature maps use sliding windows of variable size.
In some embodiments, the computer-implemented method further comprises creating an initial bounding box, wherein detecting the plurality of bounding boxes containing the target object comprises classifying the initial bounding box as being associated with a plurality of labels.
In some embodiments, the computer-implemented method further comprises applying non-maximum suppression to the detected bounding box.
In some embodiments, the computer-implemented method further comprises: segmenting the 3D image to obtain a convex hull and using the convex hull to constrain detection of the plurality of bounding boxes.
In some embodiments, the learning network is further trained to segment the target object within each detected bounding box.
In some embodiments, the imaging device is a computed tomography imaging system.
In some embodiments, the target object is a lung nodule.
In some embodiments, the learning network is a fully convolutional neural network.
In another aspect, the present disclosure is also directed to a system for automatically detecting a target object from a 3D image. The system may include an interface configured to receive a 3D image acquired by an imaging device. The system may also include a processor configured to detect a plurality of bounding boxes containing the target object using a 3D learning network. The learning network may be trained to generate a plurality of feature maps at different scales based on the 3D image. The processor may be further configured to determine a set of parameters identifying each detected bounding box using a 3D learning network, and locate the target object based on the set of parameters.
In some embodiments, the processor comprises a graphics processing unit.
In yet another aspect, the present disclosure also relates to a non-transitory computer-readable medium having instructions stored thereon. The instructions, when executed by the processor, may perform a method for automatically detecting a target object from a 3D image. The method may include receiving a 3D image acquired by an imaging device. The method may further include detecting a plurality of bounding boxes containing the target object using a 3D learning network. The learning network may be trained to generate a plurality of feature maps at different scales based on the 3D image. The method may further include determining a set of parameters identifying each detected bounding box using a 3D learning network, and locating the target object based on the set of parameters.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may depict like parts in different views. Like numbers with letter suffixes or different letter suffixes may represent different instances of similar components. The drawings illustrate various embodiments generally by way of example and not by way of limitation, and together with the description and claims serve to explain the disclosed embodiments. The same reference numbers will be used throughout the drawings to refer to the same or like parts, where appropriate. Such embodiments are illustrative, and are not intended to be exhaustive or exclusive embodiments of the present method, system, or non-transitory computer-readable medium having instructions thereon for carrying out the method.
FIG. 1 shows an exemplary axial image produced with chest volume computed tomography;
FIG. 2 illustrates an exemplary nodule detection system according to an embodiment of the present disclosure;
FIG. 3 illustrates an exemplary transition from a fully connected layer to a fully convolutional layer in accordance with an embodiment of the disclosure;
fig. 4 depicts a block diagram illustrating an exemplary medical image processing device according to an embodiment of the present disclosure;
fig. 5 shows a schematic diagram of a 3D learning network according to an embodiment of the present disclosure;
FIG. 6 shows a flow diagram of an exemplary process for training a convolutional neural network model, in accordance with an embodiment of the present disclosure;
FIG. 7 shows a flowchart of an exemplary process for identifying a target object, according to an embodiment of the present disclosure;
FIG. 8 illustrates an exemplary process for automatically detecting a target object from a 3D image according to an embodiment of the present disclosure;
FIG. 9 illustrates an exemplary nodule detection process using an n-scale 3D learning network in accordance with an embodiment of the disclosure; and
FIG. 10 illustrates an exemplary nodule segmentation process using a scale of a 3D learning network according to an embodiment of the disclosure.
Detailed Description
The term "target object" as used herein may refer to any anatomical structure in a subject's body, such as a tissue, a portion of an organ, or a target site. For example, the target object may be a lung nodule. In the following embodiments, the lung nodule is taken as an example of the "target object" for illustration and not limitation, but those skilled in the art can easily replace the lung nodule in the following embodiments with other types of "target objects".
Fig. 2 illustrates an exemplary nodule detection system 200 for automatically detecting a target object from a 3D image according to an embodiment of the present disclosure. In this embodiment, the lung nodule is the target object. The lung nodules can become target sites (target areas) for treatments such as radiation therapy. As shown in fig. 2, nodule detection system 200 includes: a nodule detection model training unit 202 for training a detection model; and a nodule detection unit 204 for detecting the location and classification of a nodule object using the trained detection model. The trained detection model may be transmitted from the nodule detection model training unit 202 to the nodule detection unit 204 such that the nodule detection unit 204 may obtain and apply the trained detection model to a 3D medical image, e.g., acquired from a 3D medical image database 206. In some embodiments, the detection model may be a 3D learning network.
For example, the location of a nodule object may be identified by the center of the nodule object and its extent. If desired, the classification of a nodule object may be identified by a label selected from (n+1) nodule labels, such as, but not limited to, non-nodule, first-size nodule, ..., and n-th-size nodule. As another example, the location may include a plurality of bounding boxes containing nodule objects. Alternatively or additionally, the position of the nodule object may comprise a set of parameters identifying the respective detected bounding box. The set of parameters may include coordinates identifying the location (e.g., center) of the respective bounding box in the 3D medical image. The set of parameters may also include dimensions that identify the size of the respective bounding box in the 3D medical image. Based on the detected bounding box and/or the set of parameters identifying it, a nodule object may be located.
The training samples may be stored in the training image database 201 and may be acquired by the nodule detection model training unit 202 to train the detection model. Each training sample includes medical images and locations of nodule objects in the respective medical images as well as classification information.
In some embodiments, the output results of the nodule detection unit 204 (including the location and classification of nodule objects) may be visualized using a heat map overlaid with the original medical 3D image (e.g., the original volumetric CT image). In some embodiments, the detection results may be transmitted to the training image database 201 over the network 205 and added as additional training samples with the corresponding images. In this way, the training image database 201 may be continuously updated by including new detection results. In some embodiments, the nodule detection model training unit 202 may periodically train the detection model with updated training samples to improve the detection accuracy of the trained detection model.
The 3D learning network may be implemented by various neural networks. In some embodiments, the 3D learning network may be a feed-forward 3D convolutional neural network. Such a feed-forward 3D convolutional neural network, when applied to a lung volume CT image, may generate a plurality of feature maps, each feature map corresponding to a respective class of nodule object, such as non-nodule, first-size nodule, ..., and n-th-size nodule. In some embodiments, the respective grid cells of a feature map may indicate the presence state of the respective type of nodule object in corresponding regions of the lung volume CT image. Based on the plurality of feature maps, a plurality of bounding boxes and a score for the presence state of the nodule object in each bounding box are generated. For example, a score of 1.0 may indicate the presence of the respective type of nodule object in the bounding box, a score of 0.0 may indicate the absence of the respective type of nodule object in the bounding box, and a score between 0.0 and 1.0 may indicate the probability that the respective type of nodule object is present in the bounding box. In some embodiments, the feed-forward 3D convolutional neural network may be followed by a non-maximum suppression layer to produce the final detection result. Alternatively, on top of the 3D convolutional neural network, an auxiliary fully connected layer or an auxiliary fully convolutional layer may be added as a detection layer. In some embodiments, the bounding box may be 3D and identified by a set of parameters. For example, it can be identified by the coordinates (x, y, z) of the center of the bounding box and the box size (size_x, size_y, size_z) along the x-axis, y-axis, and z-axis, respectively. In some embodiments, the 3D convolutional neural network may be trained to perform regression on the set of parameters, so the result of the 3D convolutional neural network may include the classification result of the nodule object (the type of the detected nodule object) and 6 regressed parameter values for the respective detected type of nodule object.
In one embodiment, the feed-forward 3D convolutional neural network may include a base network, and the feature map of scale 1 (the first scale) may be derived from the base network. For example, the base network may include three convolutional blocks, each consisting of two 3 × 3 × 3 convolutional layers, a ReLU layer, and a 2 × 2 × 2 max pooling layer, as well as three detection layers fc1, fc2, and fc3. Convolutional layer 1 and convolutional layer 2 in convolutional block 1 have 64 feature maps, convolutional layer 1 and convolutional layer 2 in convolutional block 2 have 128 feature maps, and convolutional layer 1 and convolutional layer 2 in convolutional block 3 have 256 feature maps. In some embodiments, fc1, fc2, and fc3 may be auxiliary fully connected layers for the classification task. In one embodiment, fc1 has 512 neurons followed by a ReLU layer, fc2 has 128 neurons followed by a ReLU layer, and fc3 has a number of neurons that depends on the classification. For example, if the nodule objects are classified into 10 classes (e.g., non-nodule, size-1 nodule, size-2 nodule, …, size-9 nodule), the number of neurons in the fc3 layer is 10.
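By way of illustration only, the following is a minimal PyTorch sketch of such a base network; the padded 3 × 3 × 3 convolutions and the 32 × 32 × 32 single-channel input patch are assumptions for illustration, not the patented implementation.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # two 3x3x3 convolutions with ReLU, followed by 2x2x2 max pooling
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=2),
    )

class BaseNetwork(nn.Module):
    def __init__(self, num_classes=10, patch_size=32):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 64),     # convolutional block 1: 64 feature maps
            conv_block(64, 128),   # convolutional block 2: 128 feature maps
            conv_block(128, 256),  # convolutional block 3: 256 feature maps
        )
        feat = patch_size // 8     # three 2x2x2 poolings halve each dimension three times
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * feat ** 3, 512), nn.ReLU(inplace=True),  # fc1
            nn.Linear(512, 128), nn.ReLU(inplace=True),              # fc2
            nn.Linear(128, num_classes),                             # fc3
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = BaseNetwork()(torch.randn(1, 1, 32, 32, 32))
print(logits.shape)  # torch.Size([1, 10]): one score per nodule class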
In another embodiment, the base network may be modified by transforming the above-described fully connected layers fc1, fc2, and fc3 into fully convolutional layers fc1-conv, fc2-conv, and fc3-conv, respectively. Because convolution operations over the whole image can be computed very efficiently, calculations based on the modified base network can be accelerated. FIG. 3 illustrates an exemplary transition from a fully connected layer to a fully convolutional layer, in accordance with an embodiment of the disclosure. In some embodiments, the kernel size of fc1-conv may be the same as the size of the feature map output from convolutional block 3 (pooled, if desired), while fc2-conv and fc3-conv both have a kernel size of 1 × 1 × 1. In some embodiments, the number of feature maps of the three fully convolutional layers is the same as the number of feature maps of the corresponding fully connected layers. As shown in fig. 3, the weights w00, w01, w10, and w11 of the convolution kernel are translated from the respective weights w00, w01, w10, and w11 of the corresponding fully connected layer. In some embodiments, the weights of the fully connected layer may be reshaped according to the convolution kernel size and the number of feature maps.
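As an illustration of this weight reshaping, the following sketch converts a fully connected layer into an equivalent 3D convolutional layer; the 256 × 4 × 4 × 4 feature-map size is an assumed example rather than a value taken from the embodiment.

import torch
import torch.nn as nn

fc1 = nn.Linear(256 * 4 * 4 * 4, 512)          # trained fully connected layer fc1
fc1_conv = nn.Conv3d(256, 512, kernel_size=4)  # kernel size equals the feature-map size

with torch.no_grad():
    # reshape the fully connected weights into convolution kernels (out, in, D, H, W)
    fc1_conv.weight.copy_(fc1.weight.view(512, 256, 4, 4, 4))
    fc1_conv.bias.copy_(fc1.bias)

x = torch.randn(1, 256, 4, 4, 4)               # pooled output of convolutional block 3
out_fc = fc1(x.flatten(1))
out_conv = fc1_conv(x).flatten(1)
print(torch.allclose(out_fc, out_conv, atol=1e-4))  # True: the two layers are equivalent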
In one embodiment, the base network described above, or a modified version thereof, may be used as a 3D learning network to directly detect bounding boxes. The base network may be applied to the input 3D image to generate a plurality of feature maps, each feature map corresponding to a particular class of object. In some embodiments, each grid cell of a feature map corresponds to a respective block in the 3D image. For example, for the i-th feature map corresponding to the i-th class of object, the value of a grid cell may indicate the probability that an object of the i-th class exists in the corresponding block of the 3D image. In some embodiments, objects within the respective block may be classified based on the values of the corresponding grid cells of the respective feature maps. Further, by transforming the coordinates of the grid cells from the feature map space to the 3D image space, an initial bounding box may be generated and labeled with the classification result. In some embodiments, for convolution operations without padding, the transformation may be performed using equation (1) (equation image not reproduced), where x_f is a coordinate in the predicted feature map, x_ori is the corresponding coordinate in image space, s_1, s_2, and s_3 are scale factors, ⌊·⌋ denotes the floor (lower rounding) operation, and c_i_j (i = 1, 2, 3; j = 1, 2) is the convolution kernel size of the j-th convolutional layer in the i-th convolutional block, with c_1_1 = c_1_2 = c_2_1 = c_2_2 = c_3_1 = c_3_2 = 3, c_4 = 8, c_5 = 1, and c_6 = 1. In some embodiments, for convolution operations using padding, the transformation may be performed using equation (2) (equation image not reproduced).
In some embodiments, the 3D learning network may have several scales. For example, a convolutional network based on multi-scale feature maps may be used. The number of scales may be determined based on the detection task. The multi-scale nature of the convolutional network can be implemented in various ways. For example, multiple feature maps may have the same scale size but be obtained using sliding windows of different sizes. As another example, the feature maps may be downsampled to different scales using a convolution filter or a pooling filter, or both, while using a sliding window of the same size. As yet another example, the feature maps may also be downsampled to different scales using downsampling layers, and so on. Using the disclosed convolutional network based on multi-scale feature maps can speed up the computation, making detection based on a 3D learning network clinically applicable while enabling detection of objects over a wide range of sizes.
In some embodiments, the 3D learning network uses a series of fully convolutional filters (also referred to as fully connected layers) at each scale to produce a fixed number of detection results. In some embodiments, the 3D learning network may return a plurality of bounding boxes, each bounding box associated with two parts: an object classification and a set of parameters identifying the corresponding bounding box. The object classification may have c classes. In one embodiment, c = 3, where the three object classes are non-lung background, non-nodule lung tissue, and nodule, respectively. For the nodule class, a bounding box encloses the corresponding nodule object. In some embodiments, multiple anchor boxes may be introduced for each feature map grid cell in order to allow the detected bounding boxes to better track the target object. For example, the set of parameters identifying a bounding box may be the regressed offsets in coordinates (centered on the evaluated grid cell) and in size relative to the corresponding anchor box. For the nodule class, the offsets may be relative values (dx, dy, dz, dsize_x, dsize_y, and dsize_z) with respect to the coordinates and size of the corresponding anchor box. If k anchor boxes are associated with each grid cell of a feature map at each scale, then a total of k × s bounding boxes (and k × s anchor boxes) are obtained for each corresponding grid cell across the s scales of feature maps. Then, for each of the k × s bounding boxes, the score for each of the c categories and its 6 offsets from the corresponding anchor box may be computed. This results in a total of (c + 6) × k × s filters applied around each position in the feature map. In the case where the size of each feature map is m × n × d, (c + 6) × k × s × m × n × d outputs may be generated. In some embodiments, the anchor boxes are 3D and are associated with feature maps of different scales. In some embodiments, the anchor boxes may be scaled based on the scale of the corresponding feature map. Alternatively, the position and size of an anchor box may be adjusted by a regression algorithm based on the image information.
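The following sketch illustrates one possible detection head for a single feature-map scale, producing (c + 6) × k output channels per grid cell as described above; c = 3 classes and k = 3 anchors per cell, as well as the 256-channel 8 × 8 × 8 feature map, are assumptions for illustration only.

import torch
import torch.nn as nn

c, k = 3, 3
head = nn.Conv3d(256, (c + 6) * k, kernel_size=3, padding=1)

feat = torch.randn(1, 256, 8, 8, 8)                     # an m x n x d feature map of one scale
out = head(feat)                                        # (1, (c+6)*k, m, n, d)
out = out.permute(0, 2, 3, 4, 1).reshape(1, -1, c + 6)  # one row per (grid cell, anchor) pair
scores, offsets = out[..., :c], out[..., c:]            # class scores and (dx, dy, dz, dsize_*) offsets
print(scores.shape, offsets.shape)                      # (1, 1536, 3) and (1, 1536, 6)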
Fig. 4 depicts a block diagram illustrating an exemplary medical image processing device 300 adapted for automatic detection of a target object from a 3D image according to an embodiment of the present disclosure. The medical image processing device 300 may comprise a network interface 328, by means of which network interface 328 the medical image processing device 300 may be connected to a network (not shown), such as, but not limited to, a local area network in a hospital or the internet. The network may connect the medical image processing apparatus 300 with an external apparatus such as an image acquisition apparatus (not shown), a medical image database 325, and an image data storage 326. The image acquisition apparatus may be any apparatus for acquiring an image of an object, such as a DSA imaging device, an MRI imaging device, a CT imaging device, a PET imaging device, an ultrasound device, a fluoroscopy device, a SPECT imaging device or other medical imaging device for obtaining a medical image of a patient. For example, the imaging device may be a pulmonary CT imaging device or the like.
In some embodiments, the medical image processing device 300 may be a dedicated smart device or a general-purpose smart device. For example, the apparatus 300 may be a computer customized for image data acquisition and image data processing tasks, or a server in the cloud. For example, the apparatus 300 may be integrated into an image acquisition device. Optionally, the apparatus may comprise or cooperate with a 3D reconstruction unit for reconstructing a 3D image based on the 2D image acquired by the image acquisition device.
The medical image processing apparatus 300 may include an image processor 321 and a memory 322, and may additionally include at least one of an input/output 327 and an image display 329.
The image processor 321 may be a processing device that includes one or more general-purpose processing devices, such as a microprocessor, a central processing unit (CPU), a graphics processing unit (GPU), and the like. More specifically, the image processor 321 may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor running other instruction sets, or processors running a combination of instruction sets. The image processor 321 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), a system on a chip (SoC), or the like. As will be appreciated by those skilled in the art, in some embodiments, the image processor 321 may be a dedicated processor rather than a general-purpose processor. The image processor 321 may include one or more known processing devices, such as the Pentium™, Core™, Xeon™, or Itanium™ series of microprocessors manufactured by Intel Corporation, the Turion™, Athlon™, Sempron™, Opteron™, FX™, or Phenom™ family of microprocessors manufactured by AMD, or any of various processors manufactured by Sun Microsystems. The image processor 321 may also include graphics processing units, such as GPU series manufactured by Nvidia Corporation, the GMA and Iris™ GPU series manufactured by Intel, or the Radeon™ GPU series manufactured by AMD. The image processor 321 may also include accelerated processing units, such as the Desktop A-4 (6, 8) series manufactured by AMD or the Xeon Phi™ series manufactured by Intel. The disclosed embodiments are not limited to any type of processor or processor circuit otherwise configured to meet the computing requirements of identifying, analyzing, maintaining, generating, and/or providing large amounts of imaging data, or of manipulating such imaging data to detect and locate a target object from a 3D image, or of manipulating any other type of data consistent with the disclosed embodiments. In addition, the terms "processor" or "image processor" may include more than one processor, for example, a multi-core design or multiple processors each having a multi-core design. The image processor 321 may execute sequences of computer program instructions stored in the memory 322 to perform the various operations, processes, and methods disclosed herein.
The image processor 321 may be communicatively coupled to the memory 322 and configured to execute computer-executable instructions stored therein. Memory 322 may include Read Only Memory (ROM), flash memory, Random Access Memory (RAM), Dynamic Random Access Memory (DRAM) such as synchronous DRAM (sdram) or Rambus DRAM, static memory (e.g., flash memory, static random access memory), etc., on which computer-executable instructions are stored in any format. In some embodiments, memory 322 may store computer-executable instructions for one or more image processing programs 223. The computer program instructions may be accessed by image processor 321, read from ROM or any other suitable storage location, and loaded into RAM for execution by image processor 321. For example, memory 322 may store one or more software applications. The software applications stored in memory 322 may include, for example, an operating system (not shown) for a general purpose computer system and a soft control device. In addition, the memory 322 may store the entire software application or only a portion of the software application (e.g., the image processing program 223) to be executable by the image processor 321. Additionally, memory 322 may store a plurality of software modules for implementing the various steps of a method of automatically detecting a target object from a 3D image or a process of training a 3D learning network consistent with the present disclosure. Further, the memory 322 may store data generated/cached when executing the computer program, such as medical image data 324, which includes medical images transmitted from an image acquisition device, a medical image database 325, an image data storage 326, and the like. Such medical image data 324 may comprise a received 3D medical image on which an automatic detection of a target object is to be performed. Furthermore, the medical image data 324 may also include 3D medical images along with target object detection results thereof.
The image processor 321 may execute the image processing program 223 to implement a method for automatically detecting a target object from a 3D image. In some embodiments, when executing the image processing program 223, the image processor 321 may associate the corresponding 3D image with the detection results including the object classification and the detected bounding box, and store the 3D image to the memory 322 along with (e.g., with the labeling of) the detection results. Alternatively, the memory 322 may communicate with the medical image database 325 to obtain an image therefrom (with an object to be detected therein) or to send a 3D image to the medical image database 325 along with the detection result.
In some embodiments, the 3D learning network may be stored in memory 322. Alternatively, the 3D learning network may be stored in a remote device, a separate database (such as the medical image database 325), a distributed device, and may be used by the image processing program 223. The 3D images along with the detection results may be stored as training samples in the medical image database 325.
The input/output 327 may be configured to allow the medical image processing apparatus 300 to receive and/or transmit data. Input/output 327 may include one or more digital and/or analog communication devices that allow device 300 to communicate with a user or other machines and devices. For example, input/output 327 may include a keyboard and mouse that allow a user to provide input.
Network interface 328 may include a network adapter, cable connector, serial connector, USB connector, parallel connector, high speed data transmission adapter such as fiber optic, USB 3.0, lightning, wireless network adapter such as WiFi adapter, telecommunications (3G, 4G/LTE, etc.) adapter. The apparatus 300 may connect to a network through a network interface 328. The network may provide the functionality of a Local Area Network (LAN), a wireless network, a cloud computing environment (e.g., as software for a service, as a platform for a service, as an infrastructure for a service, etc.), a client server, a Wide Area Network (WAN), etc.
In addition to displaying medical images, the image display 329 may also display other information, such as classification results and detected bounding boxes. For example, image display 329 may be an LCD, CRT, or LED display.
Various operations or functions are described herein that may be implemented as or defined as software code or instructions. Such content may be source code or difference code ("delta" or "block" code) that is directly executable ("object" or "executable" form). The software code or instructions may be stored in a computer-readable storage medium and, when executed, may cause a machine to perform the functions or operations described, and include any mechanism for storing information in a form accessible by a machine (e.g., a computing device, an electronic system, etc.), such as recordable or non-recordable media (e.g., Read Only Memory (ROM), Random Access Memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.).
As described above, a 3D learning network according to embodiments of the present disclosure may operate in an end-to-end manner and directly predict nodule classes and bounding boxes.
In some embodiments, to reduce computational and storage costs, a phased training scheme may be used. The training regimen can be divided into three phases: (1) unsupervised learning; (2) training a classification network based on the small image blocks; and (3) large image patch based detection network training. In some embodiments, stages (1) and (2) may be used to train a classification network (part of a detection network), such as the basic network disclosed herein, to produce good network initialization for the entire detection network. In some embodiments, stage (3) may perform end-to-end training on large image patches.
In some embodiments, if the original 3D image is too large to fit into the memory of the image processor 321 (such as a modern GPU), it may be divided into a plurality of large image blocks sized according to the memory of the image processor 321. By dividing the original 3D image into small image blocks and large image blocks, and by using a staged training scheme that includes unsupervised training, training the classification network within the 3D detection network on the small image blocks, and then training the 3D detection network on the basis of the trained classification network, the total computation required for training can be reduced enough to be handled by modern GPUs.
In one embodiment, the initial network weights may be generated using a 3D convolutional auto-encoder, as shown in fig. 5. In some embodiments, the encoder portion consists of cascaded convolutional blocks (e.g., 3 convolutional blocks), and the decoder portion consists of cascaded deconvolution blocks corresponding to the 3 convolutional blocks of the encoder portion. Within a deconvolution block, the deconvolution layer consists of an upsampling layer followed by a convolutional layer. As shown in fig. 5, an image block is input, a convolution operation is performed by an encoder part, and a deconvolution operation is performed by a decoder part, and then a predicted image block is output. The 3D convolutional auto-encoder may be trained such that the output image blocks are the same as the input image blocks (target image blocks). In one embodiment, noise, such as gaussian noise, may be added to the input image block but not to the target output image block in order to make the learning robust. In some embodiments, both the input image block and the target output image block may be transformed, such as rotated, deformed, scaled up/down, etc., to make the learning more robust. The supervised training process is then initialized with the network weights of the trained encoder (i.e., the base network).
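A minimal sketch of such a 3D convolutional auto-encoder is given below, with simplified layer counts and widths (assumptions, not the embodiment's exact architecture); Gaussian noise is added to the input patch but not to the target patch, as described above.

import torch
import torch.nn as nn

def enc_block(i, o):
    return nn.Sequential(nn.Conv3d(i, o, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool3d(2))

def dec_block(i, o):
    # deconvolution block: an upsampling layer followed by a convolutional layer
    return nn.Sequential(nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
                         nn.Conv3d(i, o, 3, padding=1), nn.ReLU(inplace=True))

class ConvAutoEncoder3D(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(enc_block(1, 64), enc_block(64, 128), enc_block(128, 256))
        self.decoder = nn.Sequential(dec_block(256, 128), dec_block(128, 64), dec_block(64, 32),
                                     nn.Conv3d(32, 1, 3, padding=1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoEncoder3D()
target = torch.randn(2, 1, 32, 32, 32)               # clean target image patches
noisy = target + 0.1 * torch.randn_like(target)      # Gaussian noise on the input only
loss = nn.functional.mse_loss(model(noisy), target)  # reconstruct the clean target
loss.backward()                                      # the trained encoder then initializes the base network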
The detection network comprises the classification network. In one embodiment, the classification network may be trained prior to training the detection network. FIG. 6 illustrates a flow diagram of a convolutional neural network training process 400 for training the classification network. The process 400 begins at step 450, where a 3D training image and its associated classification results are received. In some embodiments, at step 452, the 3D training image may be divided into image blocks, for example, using a sliding window. Then, at step 454, a single image block is input into the classification network together with its classification result as training data. In some embodiments, the weights of the classification network may already have been initialized. At step 456, classifier parameters of the classification network may be determined based on the training data. At step 458, the determination of the classifier parameters may include validating against a loss function. In some embodiments, steps 456 and 458 may also be integrated into a single step, wherein the classifier parameters of the classification network are optimized against the loss function on a per image block basis. In some embodiments, the optimization may be performed by any of a number of commonly used algorithms, including but not limited to gradient descent, Newton's method, conjugate gradient, quasi-Newton methods, and the Levenberg-Marquardt algorithm. At step 460, it is determined whether all image blocks have been processed; if so, at step 462, the trained classification network with the currently optimized classifier parameters is output as the trained model. Otherwise, the process returns to step 454 to process subsequent image blocks until all image blocks have been processed. In one embodiment, the last several fully connected layers of the convolutional network serving as the classification network (or detection network) may be converted to fully convolutional layers, as explained above with reference to fig. 3. The stride of the convolution operation is then equivalent to the stride of the sliding window, and a substantial acceleration is obtained due to fast convolution computation on the GPU. A sliding-window sketch is given below.
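For illustration, the sliding-window division of step 452 might look like the following sketch; the patch size and stride are assumptions.

import torch

def sliding_window_patches(volume, patch=32, stride=16):
    # volume: (D, H, W) tensor; yields each patch together with its origin (z, y, x)
    D, H, W = volume.shape
    for z in range(0, D - patch + 1, stride):
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                yield (z, y, x), volume[z:z + patch, y:y + patch, x:x + patch]

ct = torch.randn(96, 96, 96)
patches = list(sliding_window_patches(ct))
print(len(patches), patches[0][1].shape)  # 125 torch.Size([32, 32, 32])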
In some embodiments, based on a trained classification network, a detection network may be constructed and trained on large image patches. For example, the training process 400 of the classification network may be adapted to train the detection network. The differences are as follows: the training data input at step 454 is the large image patch and information about the bounding box detected therein containing the target object, such as a set of parameters identifying the corresponding detected bounding box. Such a set of parameters may include, but is not limited to, a tag (associated with a target object in a large image patch) and a set of location parameters of a detected bounding box. As an example, a label 0 may represent that the 3D patch does not contain nodules (i.e., where the detected bounding box contains a non-nodule target object), and labels 1-9 may represent that the 3D patch contains nodules of different sizes 1-9, respectively (i.e., where the detected bounding box contains a nodule target object of size n, n being an integer in the range of 1-9). A distinction is also made in steps 456 and 458, wherein the parameters to be optimized for the loss function are parameters belonging to the detection network.
In some embodiments, to train the detection network, labels may be assigned to the various anchor boxes. In one embodiment, if the intersection-over-union (IoU) ratio of an anchor box with any ground truth box is above a certain threshold (e.g., 0.7), the label corresponding to that ground truth box may be assigned to the anchor box. Note that a single ground truth box (e.g., a tight bounding box containing a nodule) may assign its label to several anchor boxes. For example, if the IoU ratio of a non-nodule anchor box with all extra-pulmonary ground truth boxes is below 0.3, an intra-pulmonary non-nodule label may be assigned to that anchor box. Conversely, if the IoU ratio of the non-nodule anchor box with all intra-pulmonary ground truth boxes is below 0.3, an extra-pulmonary non-nodule label may be assigned to it.
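The IoU computation and thresholding described above can be sketched as follows; boxes are represented as (x, y, z, size_x, size_y, size_z) with (x, y, z) at the box center, and the 0.7/0.3 thresholds are taken from the example in the text.

import numpy as np

def iou_3d(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    inter = np.prod(np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None))
    union = np.prod(a[3:]) + np.prod(b[3:]) - inter
    return inter / union if union > 0 else 0.0

anchor = (10, 10, 10, 8, 8, 8)
gt_box = (11, 10, 10, 8, 8, 8)
iou = iou_3d(anchor, gt_box)
# assign the ground-truth label when IoU > 0.7; treat the anchor as negative when IoU < 0.3
label = 'nodule' if iou > 0.7 else ('non-nodule' if iou < 0.3 else 'ignore')
print(round(iou, 3), label)  # 0.778 nodule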
In one embodiment, the loss function used to train the detection network may be a multi-task loss function covering both the classification task and the bounding box prediction task. For example, the multi-task loss function may be defined by equation (3):

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)      (3)

where i is the index of an anchor box in the training mini-batch and p_i is the predicted probability that anchor box i is a nodule. The ground truth label is p_i*, t_i represents the 6 parameterized coordinates of the predicted bounding box, and t_i* is the parameterized coordinates of the ground truth box associated with a nodule anchor box. L_cls is the cross entropy loss and L_reg is a robust loss function. In some embodiments, N_cls and N_reg, the numbers of corresponding boxes in the mini-batch, may be used for normalization, respectively. λ is a weighting parameter between the classification task and the regression task.
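A sketch consistent with the definitions above (not the patented implementation) might compute this multi-task loss as follows, with cross entropy for the classification term and a smooth L1 loss standing in for the robust regression loss; the tensor shapes and the label convention (class 2 = nodule) are illustrative assumptions.

import torch
import torch.nn.functional as F

def detection_loss(cls_logits, box_pred, labels, box_targets, lam=1.0):
    # cls_logits: (N, C); box_pred, box_targets: (N, 6); labels: (N,) integer class indices
    n_cls = max(labels.numel(), 1)
    loss_cls = F.cross_entropy(cls_logits, labels, reduction='sum') / n_cls

    pos = labels == 2                 # regression applies only to nodule anchors (assumed label 2)
    n_reg = max(int(pos.sum()), 1)
    loss_reg = F.smooth_l1_loss(box_pred[pos], box_targets[pos], reduction='sum') / n_reg
    return loss_cls + lam * loss_reg

cls_logits, box_pred = torch.randn(16, 3), torch.randn(16, 6)
labels, box_targets = torch.randint(0, 3, (16,)), torch.randn(16, 6)
print(detection_loss(cls_logits, box_pred, labels, box_targets))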
As an example, a 6-parameter regression may be adopted for the bounding box regression, and these 6 parameters may be defined by equation (4):

t_x = (x − x_a) / w_a,   t_y = (y − y_a) / h_a,   t_z = (z − z_a) / d_a,
t_w = log(w / w_a),      t_h = log(h / h_a),      t_d = log(d / d_a),
t_x* = (x* − x_a) / w_a, t_y* = (y* − y_a) / h_a, t_z* = (z* − z_a) / d_a,
t_w* = log(w* / w_a),    t_h* = log(h* / h_a),    t_d* = log(d* / d_a)      (4)

where x, y, z, w, h, and d denote the center coordinates of a bounding box and its width, height, and depth. The variables x, x_a, and x* refer to the predicted bounding box, the anchor box, and the ground truth box, respectively (and the same applies to y, z, w, h, and d).
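For illustration, the encoding and decoding implied by equation (4) can be sketched as follows (a hedged example, not the patented code):

import numpy as np

def encode_box(box, anchor):
    # box, anchor: (x, y, z, w, h, d), with (x, y, z) the box center
    box, anchor = np.asarray(box, float), np.asarray(anchor, float)
    t_center = (box[:3] - anchor[:3]) / anchor[3:]
    t_size = np.log(box[3:] / anchor[3:])
    return np.concatenate([t_center, t_size])

def decode_box(t, anchor):
    anchor = np.asarray(anchor, float)
    center = np.asarray(t[:3]) * anchor[3:] + anchor[:3]
    size = anchor[3:] * np.exp(t[3:])
    return np.concatenate([center, size])

anchor = np.array([10.0, 10.0, 10.0, 8.0, 8.0, 8.0])
gt = np.array([12.0, 9.0, 10.0, 10.0, 8.0, 6.0])
t = encode_box(gt, anchor)
print(np.allclose(decode_box(t, anchor), gt))  # True: encoding and decoding are inverses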
Fig. 7 shows a flow diagram of an exemplary process 500 for identifying a target object in a 3D image scan. The target object identification process 500 begins at step 512, where a trained nodule detection model is received. At step 514, a 3D medical image, possibly including a target object, is received. Then, at step 452, the 3D medical image may be divided into image blocks using a sliding window. At step 516, a plurality of bounding boxes are detected from an image block using the detection network. At step 518, a label identifying each bounding box and a set of location parameters are determined. Optionally, steps 516 and 518 may be integrated into a single step. The bounding boxes may then be classified using the labels at step 520, and the classified bounding boxes and their location parameters may be used to locate a target object such as a nodule. At step 460, it is determined whether all image blocks have been processed; if so, the nodule detection results of the respective image blocks are integrated to obtain and output a complete nodule detection result at step 462. If not, process 500 returns to step 516. Although the target object identification process 500 shown in fig. 7 is based on a sliding window, it is not limited thereto. In one embodiment, the last several fully connected layers of the convolutional network serving as the classification network (or detection network) may be converted to fully convolutional layers, as described above with reference to fig. 3. The stride of the convolution operation then acts like the stride of the sliding window, making the two approaches equivalent, and a substantial acceleration is obtained due to fast convolution computation on the GPU.
Fig. 8 illustrates an exemplary process for automatically detecting a target object from a 3D image according to another embodiment of the present disclosure. As shown in fig. 8, the 3D lung volume CT image may be input to a detection system that uses a trained model to detect nodules therein and identifies corresponding bounding boxes containing each nodule, including the location and size of the corresponding bounding box.
In some embodiments, the 3D image may be divided into smaller chunks along various directions (such as, but not limited to, the z-direction), and then the detection network and its associated algorithms may be applied to the respective chunks to obtain respective detection results for the target object. The detection results of the individual chunks may be aggregated and the chunks with the respective detection results may be integrated to produce a complete detection result for the entire 3D image.
FIG. 9 illustrates an exemplary nodule detection process using an n-scale 3D learning network. As shown in fig. 9, the input 3D medical image is a W × H × Z volumetric CT scan. To load the detection network and feature maps into GPU memory, the input CT scan is split into smaller chunks. Note that fig. 9 illustrates the detection process using two chunks, a W × H × Z1 CT scan and a W × H × Z2 CT scan, as an example; in practice, the number of chunks may be chosen as desired to suit the capability of the GPU. For example, the base network corresponds to scale 1, and the learning networks corresponding to the other scales, including scales 2 through n, may be implemented by rescaling operations on the base network, including but not limited to convolution and max pooling. Each chunk may have 3 category labels for its bounding boxes. Bounding boxes within each chunk may be detected using the feature maps of the various scales for the 3 classes. The detection results may include the class label of each bounding box and the 6 regressed offset parameters (dx, dy, dz, dsize_x, dsize_y, and dsize_z) relative to the anchor box. The detected bounding boxes of all chunks, i.e., the multi-box predictions shown in fig. 9, can be combined and transformed into the original image coordinate system, and 3D non-maximum suppression is then performed to obtain the final detection result. Through 3D non-maximum suppression, redundant bounding boxes may be eliminated to simplify and clarify the detection/localization result for the target object in the 3D medical image. For example, as a result of 3D non-maximum suppression, a single detected bounding box may be determined for each nodule.
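A minimal sketch of greedy, IoU-based 3D non-maximum suppression over the merged predictions follows; the 0.5 overlap threshold and the example boxes are assumptions. Boxes are (x, y, z, size_x, size_y, size_z) in the original image coordinate system, each with a confidence score.

import numpy as np

def iou_3d(a, b):
    lo = np.maximum(a[:3] - a[3:] / 2, b[:3] - b[3:] / 2)
    hi = np.minimum(a[:3] + a[3:] / 2, b[:3] + b[3:] / 2)
    inter = np.prod(np.clip(hi - lo, 0, None))
    return inter / (np.prod(a[3:]) + np.prod(b[3:]) - inter)

def nms_3d(boxes, scores, iou_thresh=0.5):
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    order = np.argsort(-scores)      # process boxes from highest to lowest score
    keep = []
    while order.size:
        i, order = order[0], order[1:]
        keep.append(int(i))
        if order.size:
            ious = np.array([iou_3d(boxes[i], boxes[j]) for j in order])
            order = order[ious < iou_thresh]   # drop boxes that overlap the kept box too much
    return keep

boxes = np.array([[10, 10, 10, 8, 8, 8], [11, 10, 10, 8, 8, 8], [40, 40, 40, 6, 6, 6]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms_3d(boxes, scores))  # [0, 2]: the lower-scoring duplicate of box 0 is suppressed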
Alternatively, segmentation may be performed before running the detection algorithm, so as to constrain the detection algorithm to potential regions of the 3D medical image instead of the entire 3D medical image. In this way, detection accuracy can be improved and the amount of computation required by the detection network can be reduced.
Taking the lung nodule as an example of a target object, it is known that lung nodules are always inside the lungs. In one embodiment, lung segmentation may be performed in advance to further remove false alarms outside the lungs. In particular, lung segmentation may be performed in advance to generate a lung convex hull, which is then used to constrain nodule detection. The lung segmentation may be performed by various means, including but not limited to convolutional networks, active contour models, watershed segmentation algorithms, and the like. In some embodiments, such lung segmentation may be performed on a lower resolution scan and the results upsampled to the original resolution, so that the 3D learning network and feature maps can be loaded into GPU memory while the segmentation process is accelerated.
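One simple way to use such a lung mask to constrain detection is sketched below: only boxes whose centers fall inside the binary lung mask are kept. The toy mask and the box format are assumptions for illustration only.

import numpy as np

def filter_boxes_by_lung_mask(boxes, lung_mask):
    # boxes: (N, 6) as (x, y, z, size_x, size_y, size_z); lung_mask: binary volume indexed [z, y, x]
    kept = []
    for b in boxes:
        x, y, z = (int(round(v)) for v in b[:3])
        if lung_mask[z, y, x]:
            kept.append(b)
    return np.array(kept)

mask = np.zeros((64, 64, 64), dtype=bool)
mask[16:48, 16:48, 16:48] = True                    # toy lung region
boxes = np.array([[30, 30, 30, 8, 8, 8], [5, 5, 5, 8, 8, 8]], float)
print(filter_boxes_by_lung_mask(boxes, mask))       # only the first box survives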
In a clinical setting, radiologists often need to perform quantitative analysis of lung nodules. For example, in addition to the detected bounding box, they require the boundary of the nodule, the exact size of the detected nodule as determined by nodule segmentation, and so on. In some embodiments, segmentation may be performed based on the detected bounding box. For example, segmentation may be performed within the bounding box of a detected lung nodule using a 3D convolutional network. Thus, a segmentation model/learning network may be trained on smaller nodule image patches and applied to image regions within the detected bounding boxes.
In one embodiment, as shown in FIG. 10, nodule segmentation and detection may be integrated into a single nodule segmentation process to enable end-to-end detection and segmentation. Although in fig. 10 the input W × H × Z 3D scan is divided into two CT scans, it is contemplated that the input 3D scan may be divided into any suitable number of CT scans. For each of the divided CT scans, various scaling operations, including convolution and max pooling operations, may be performed on the base network to obtain detection results at the respective scales. Once a bounding box is detected, it may be scaled back to the feature map space (e.g., realigned to the last feature layer), and ROI (region of interest) pooling may then be applied to it, thereby generating the ROI region. A segmentation algorithm may be performed on each ROI to directly generate the nodule segmentation. As an example, such a segmentation algorithm may be implemented by segmentation layers of a fully convolutional network. Such a segmentation algorithm may also be implemented by a series of deconvolution or upsampling layers following the convolutional layers. In some embodiments, the nodule segmentation results of the individual divided CT scans may be integrated to obtain the complete nodule segmentation result for the original input CT scan. In one embodiment, the pooling uses bilinear interpolation and resampling to make the ROIs the same size in order to speed up GPU computations.
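The ROI handling described above can be sketched as follows: each detected box is mapped back to the last feature map, cropped, and resampled to a fixed size so that a small segmentation head can be applied to same-sized ROIs. The downsampling factor, ROI size, and feature-map shape are assumptions, and trilinear interpolation is used here as the 3D counterpart of the bilinear resampling mentioned above.

import torch
import torch.nn.functional as F

def roi_pool_3d(feature_map, box, scale=8, out_size=(8, 8, 8)):
    # feature_map: (1, C, D, H, W); box: (x, y, z, sx, sy, sz) in image coordinates;
    # scale: assumed image-to-feature-map downsampling factor
    x, y, z, sx, sy, sz = [v / scale for v in box]
    z0, z1 = int(max(z - sz / 2, 0)), int(z + sz / 2) + 1
    y0, y1 = int(max(y - sy / 2, 0)), int(y + sy / 2) + 1
    x0, x1 = int(max(x - sx / 2, 0)), int(x + sx / 2) + 1
    roi = feature_map[:, :, z0:z1, y0:y1, x0:x1]
    return F.interpolate(roi, size=out_size, mode='trilinear', align_corners=False)

feat = torch.randn(1, 256, 16, 16, 16)               # last feature layer of one chunk
roi = roi_pool_3d(feat, (64, 64, 64, 24, 24, 24))    # a detected bounding box
seg_head = torch.nn.Conv3d(256, 1, kernel_size=1)    # toy segmentation layer
print(torch.sigmoid(seg_head(roi)).shape)            # torch.Size([1, 1, 8, 8, 8])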
It is envisaged that the nodule segmentation process, as shown in figure 10, may be extended from previous detection networks, where the detection and segmentation stages may share the same detection network, e.g. the base network.
In some embodiments, training of the network used in fig. 10 may proceed as follows. First, the detection network is trained to obtain the weights of the detection network portion. Then, given the weights of the detection network portion, the segmentation network portion is trained using ground truth bounding boxes. Several loss functions may be employed to train the segmentation network portion, including but not limited to normalized cross entropy based on the numbers of foreground and background voxels. The two network portions may be combined to obtain the nodule segmentation result.
The detection network portion and the segmentation network portion may be trained separately, sequentially or simultaneously. In one embodiment, during the training phase, both the detection network portion and the segmentation network portion may be trained simultaneously against a joint loss function, with ground truth bounding boxes and segmentations used to supervise the segmentation. The joint loss function may be defined by equation (5).
L({p_i}, {t_i}, {S_j}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*) + (1/N_seg) Σ_j L_seg(S_j, S_j*)      (5)

where the preceding terms are the same as those in equation (3) and their definitions are therefore omitted. The last term is the loss component for segmentation: N_seg is the number of segmented regions in a mini-batch, L_seg is the voxel-wise loss function within a region, j is the index of a region of interest in the training mini-batch, S_j is the predicted probability map for the region of interest, and S_j* is the ground truth segmentation.
The foregoing description has been presented for purposes of illustration. It is not exhaustive and is not limited to the precise forms or embodiments disclosed. Modifications and variations of the disclosed embodiments will become apparent from consideration of the specification and practice of the disclosed embodiments.
In this document, the terms "a" or "an" are used, as is common in patent documents, to include one or more than one, independent of any other instances or usages of "at least one" or "one or more." Herein, unless otherwise indicated, the term "or" is used to refer to a non-exclusive or, such that "A or B" includes "A but not B," "B but not A," and "A and B." In this document, the terms "including" and "in which" are used as the plain-English equivalents of the respective terms "comprising" and "wherein." Furthermore, in the following claims, the terms "comprising" and "including" are intended to be open-ended; that is, an apparatus, system, device, article, composition, formulation, or process that includes elements in addition to those listed after such a term in a claim is still deemed to fall within the scope of that claim. Moreover, in the following claims, the terms "first," "second," "third," etc. are used merely as labels and are not intended to impose numerical requirements on their objects.
The exemplary methods described herein may be machine or computer-implemented, at least in part. Some examples may include a computer-readable medium or machine-readable medium encoded with instructions operable to configure an electronic device to perform a method as described in the above examples. An implementation of such a method may include software code, such as microcode, assembly language code, higher level language code, or the like. Various programs or program modules may be created using various software programming techniques. For example, program segments or program modules may be designed using Java, Python, C++, assembly language, or any known programming language. One or more of such software portions or modules may be integrated into a computer system and/or computer-readable medium. Such software code may include computer readable instructions for performing various methods. The software code may form part of a computer program product or a computer program module. Further, in one example, the software code can be tangibly stored on one or more volatile, non-transitory, or non-volatile tangible computer-readable media, such as during execution or at other times. Examples of such tangible computer-readable media may include, but are not limited to, hard disks, removable magnetic disks, removable optical disks (e.g., compact disks and digital video disks), magnetic cassettes, memory cards or sticks, Random Access Memories (RAMs), Read Only Memories (ROMs), and the like.
Moreover, although illustrative embodiments have been described herein, the scope includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations, or alterations based on the present disclosure. The elements of the claims are to be interpreted broadly based on the language employed in the claims and are not limited to the examples described in the present specification or during the prosecution of the application, which examples are to be construed as non-exclusive. Further, the steps of the disclosed methods may be modified in any manner, including by reordering steps or inserting or deleting steps. It is intended, therefore, that the description be regarded as examples only, with a true scope being indicated by the following claims and their full scope of equivalents.
The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other. Other embodiments may be utilized by one of ordinary skill in the art in view of the above description. Moreover, in the detailed description above, various features may be combined together to simplify the present disclosure. This should not be interpreted as intending that an unclaimed disclosed feature is essential to any claim. Rather, inventive subject matter may lie in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that these embodiments may be combined with each other in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims (20)

1. A computer-implemented method of automatically detecting a target object from a 3D image, comprising:
receiving a 3D image acquired by an imaging device;
detecting, by a processor, a plurality of bounding boxes containing the target object using a 3D learning network, wherein the 3D learning network is trained to generate a plurality of feature maps having varying scales based on the 3D image, wherein the plurality of feature maps respectively correspond to a plurality of different sizes of the target object; the 3D learning network includes a segmentation network portion and a detection network portion, and detecting the plurality of bounding boxes using the 3D learning network includes: segmenting the 3D image using the segmentation network portion to obtain potential regions, and using the potential regions to constrain detection of the plurality of bounding boxes;
determining, by a processor, a set of parameters that identify each detected bounding box using the 3D learning network; and
locating, by a processor, the target object based on the set of parameters.
2. The computer-implemented method of claim 1, wherein the set of parameters includes coordinates that identify a location of each bounding box in the 3D image.
3. The computer-implemented method of claim 1, wherein the set of parameters includes dimensions that identify a size of each bounding box.
4. The computer-implemented method of claim 1, wherein the 3D learning network is trained to perform regression on the set of parameters.
5. The computer-implemented method of claim 1, further comprising associating a plurality of anchor boxes with the 3D image, wherein the set of parameters indicates an offset of each bounding box relative to a respective anchor box.
6. The computer-implemented method of claim 5, wherein each anchor box is associated with a grid cell of the feature map.
7. The computer-implemented method of claim 6, wherein the anchor box is scaled according to a scale of the feature map.
8. The computer-implemented method of claim 1, wherein the plurality of feature maps have varying image sizes.
9. The computer-implemented method of claim 1, wherein the plurality of feature maps use a sliding window of variable size.
10. The computer-implemented method of claim 1, further comprising creating an initial bounding box, wherein detecting a plurality of bounding boxes containing the target object comprises classifying the initial bounding box as being associated with a plurality of labels.
11. The computer-implemented method of claim 1, further comprising applying non-maximum suppression to the detected bounding box.
12. The computer-implemented method of claim 1, further comprising: segmenting the 3D image to obtain a convex hull and using the convex hull to constrain detection of the plurality of bounding boxes.
13. The computer-implemented method of claim 1, wherein the learning network is further trained to segment the target object within respective detected bounding boxes.
14. The computer-implemented method of claim 1, wherein the imaging device is a computed tomography imaging system.
15. The computer-implemented method of claim 1, wherein the target object is a lung nodule.
16. The computer-implemented method of claim 1, wherein the learning network is a full convolutional neural network.
17. A system for automatically detecting a target object from a 3D image, comprising:
an interface configured to receive a 3D image acquired by an imaging device; and
a processor configured to:
detecting a plurality of bounding boxes containing the target object using a 3D learning network, wherein the 3D learning network is trained to generate a plurality of feature maps having varying scales based on the 3D image, wherein the plurality of feature maps respectively correspond to a plurality of different sizes of the target object; the 3D learning network includes a segmentation network portion and a detection network portion, and detecting the plurality of bounding boxes using the 3D learning network includes: segmenting the 3D image using the segmentation network portion to obtain potential regions, and using the potential regions to constrain detection of the plurality of bounding boxes;
determining a set of parameters identifying each detected bounding box using the 3D learning network; and
locating the target object based on the set of parameters.
18. The system of claim 17, wherein the processor comprises a graphics processing unit.
19. The system of claim 17, wherein the imaging device is a computed tomography imaging system.
20. A non-transitory computer readable medium having instructions stored thereon, wherein the instructions, when executed by a processor, perform a method of automatically detecting a target object from a 3D image, the method comprising:
receiving a 3D image acquired by an imaging device;
detecting a plurality of bounding boxes containing the target object using a 3D learning network, wherein the 3D learning network is trained to generate a plurality of feature maps having varying scales based on the 3D image, wherein the plurality of feature maps respectively correspond to a plurality of different sizes of the target object; the 3D learning network includes a segmentation network portion and a detection network portion, and detecting the plurality of bounding boxes using the 3D learning network includes: segmenting the 3D image using the segmentation network portion to obtain potential regions, and using the potential regions to constrain detection of the plurality of bounding boxes;
determining a set of parameters identifying each detected bounding box using the 3D learning network; and
locating the target object based on the set of parameters.
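To make the geometric operations recited in the claims concrete, the sketch below (NumPy, illustrative only) decodes anchor-relative offsets into absolute 3D boxes, as recited in claims 5-7, and applies greedy non-maximum suppression, as recited in claim 11. The [cx, cy, cz, d, h, w] box parameterization, the log-scale size decoding, and the IoU threshold are assumptions for illustration, not details taken from the patent.

```python
import numpy as np

def decode_boxes(anchors, offsets):
    """Decode predicted offsets into absolute 3D boxes.

    anchors, offsets: (N, 6) arrays [cx, cy, cz, d, h, w]; the offsets follow a
    common center-shift / log-scale parameterization (an illustrative choice).
    """
    centers = anchors[:, :3] + offsets[:, :3] * anchors[:, 3:]
    sizes = anchors[:, 3:] * np.exp(offsets[:, 3:])
    return np.hstack([centers, sizes])

def iou_3d(box, boxes):
    """Intersection-over-union between one 3D box and an array of 3D boxes."""
    lo = np.maximum(box[:3] - box[3:] / 2, boxes[:, :3] - boxes[:, 3:] / 2)
    hi = np.minimum(box[:3] + box[3:] / 2, boxes[:, :3] + boxes[:, 3:] / 2)
    inter = np.prod(np.clip(hi - lo, 0, None), axis=1)
    vol1, vol2 = np.prod(box[3:]), np.prod(boxes[:, 3:], axis=1)
    return inter / (vol1 + vol2 - inter + 1e-8)

def nms_3d(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression over decoded 3D boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        ious = iou_3d(boxes[i], boxes[order[1:]])
        order = order[1:][ious < iou_thresh]
    return keep
```

One way to read the "constrain detection" step is that, given a binary mask from the segmentation network portion, candidate boxes whose centers fall outside the segmented region could be discarded before nms_3d is applied.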
CN201810789942.0A 2017-08-09 2018-07-18 Computer-implemented method and system for automatically detecting target object from 3D image Active CN109003267B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201762542890P 2017-08-09 2017-08-09
US62/542,890 2017-08-09
US15/996,434 US10867384B2 (en) 2017-08-09 2018-06-02 System and method for automatically detecting a target object from a 3D image
US15/996,434 2018-06-02

Publications (2)

Publication Number Publication Date
CN109003267A CN109003267A (en) 2018-12-14
CN109003267B true CN109003267B (en) 2021-07-30

Family

ID=64599870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810789942.0A Active CN109003267B (en) 2017-08-09 2018-07-18 Computer-implemented method and system for automatically detecting target object from 3D image

Country Status (1)

Country Link
CN (1) CN109003267B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766887B (en) * 2019-01-16 2022-11-11 中国科学院光电技术研究所 Multi-target detection method based on cascaded hourglass neural network
US20200258216A1 (en) * 2019-02-13 2020-08-13 Siemens Healthcare Gmbh Continuous learning for automatic view planning for image acquisition
CN110059548B (en) * 2019-03-08 2022-12-06 北京旷视科技有限公司 Target detection method and device
CA3133596A1 (en) * 2019-03-15 2020-09-24 Invista Textiles (U.K.) Limited Yarn quality control
US11257240B2 (en) * 2019-10-29 2022-02-22 International Business Machines Corporation Accelerated object labeling using prior geometric knowledge
CN110930454B (en) * 2019-11-01 2022-11-22 北京航空航天大学 Six-degree-of-freedom pose estimation algorithm based on boundary box outer key point positioning
US11200455B2 (en) * 2019-11-22 2021-12-14 International Business Machines Corporation Generating training data for object detection
EP3828818A1 (en) * 2019-11-29 2021-06-02 Siemens Healthcare GmbH Method and system for identifying pathological changes in follow-up medical images
CN110852314B (en) * 2020-01-16 2020-05-22 江西高创保安服务技术有限公司 Article detection network method based on camera projection model
CN111523452B (en) * 2020-04-22 2023-08-25 北京百度网讯科技有限公司 Method and device for detecting human body position in image
CN112348049A (en) * 2020-09-28 2021-02-09 北京师范大学 Image recognition model training method and device based on automatic coding
CN112785565B (en) * 2021-01-15 2024-01-05 上海商汤智能科技有限公司 Target detection method and device, electronic equipment and storage medium
CN112801068B (en) * 2021-04-14 2021-07-16 广东众聚人工智能科技有限公司 Video multi-target tracking and segmenting system and method
CN113538470B (en) * 2021-06-16 2024-02-23 唯智医疗科技(佛山)有限公司 Image interlayer boundary determining method and device based on neural network


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599939A (en) * 2016-12-30 2017-04-26 深圳市唯特视科技有限公司 Real-time target detection method based on region convolutional neural network
CN106815579A (en) * 2017-01-22 2017-06-09 深圳市唯特视科技有限公司 A kind of motion detection method based on multizone double fluid convolutional neural networks model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images; Shuran Song and Jianxiong Xiao; 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016-12-31; Sections 1, 3, and 5 *
Feature Pyramid Networks for Object Detection; Tsung-Yi Lin et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017-07-26; Sections 3 and 4.1 *

Also Published As

Publication number Publication date
CN109003267A (en) 2018-12-14

Similar Documents

Publication Publication Date Title
CN109003267B (en) Computer-implemented method and system for automatically detecting target object from 3D image
US10867384B2 (en) System and method for automatically detecting a target object from a 3D image
US11568533B2 (en) Automated classification and taxonomy of 3D teeth data using deep learning methods
US20220083804A1 (en) Systems and methods for detecting region of interset in image
CN110838124B (en) Method, system, and medium for segmenting images of objects having sparse distribution
US10769791B2 (en) Systems and methods for cross-modality image segmentation
Al-Antari et al. Deep learning computer-aided diagnosis for breast lesion in digital mammogram
US10346986B2 (en) System and methods for image segmentation using convolutional neural network
US11735322B2 (en) Systems and methods for ossification center detection and bone age assessment
US9947102B2 (en) Image segmentation using neural network method
JP6623265B2 (en) Detection of nodules with reduced false positives
US20230104173A1 (en) Method and system for determining blood vessel information in an image
CN109410188B (en) System and method for segmenting medical images
CN110490927B (en) Method, apparatus and system for generating a centerline for an object in an image
CA3068526A1 (en) Classification and 3d modelling of 3d dento-maxillofacial structures using deep learning methods
CN107077736A (en) System and method according to the Image Segmentation Methods Based on Features medical image based on anatomic landmark
US12094188B2 (en) Methods and systems for training learning network for medical image analysis
CN113947681A (en) Method, apparatus and medium for segmenting medical images
CN112802036A (en) Method, system and device for segmenting target area of three-dimensional medical image
CN110570419A (en) Method and device for acquiring characteristic information and storage medium
CN116013475A (en) Method and device for sketching multi-mode medical image, storage medium and electronic equipment
CN112862785B (en) CTA image data identification method, device and storage medium
Zhao et al. Research on automatic detection algorithm of pulmonary nodules based on deep learning
US20220414882A1 (en) Method for automatic segmentation of coronary sinus
EP4386665A1 (en) Image processing apparatus, method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant