CN110838125B - Target detection method, device, equipment and storage medium for medical image - Google Patents

Target detection method, device, equipment and storage medium for medical image

Info

Publication number
CN110838125B
CN110838125B (application CN201911087819.5A)
Authority
CN
China
Prior art keywords
target
size
center point
pixel
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911087819.5A
Other languages
Chinese (zh)
Other versions
CN110838125A (en)
Inventor
曹世磊
刘小彤
马锴
伍健荣
朱艳春
李仁�
陈景亮
杨昊臻
常佳
郑冶枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Healthcare Shenzhen Co Ltd
Original Assignee
Tencent Healthcare Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Healthcare Shenzhen Co Ltd filed Critical Tencent Healthcare Shenzhen Co Ltd
Priority to CN201911087819.5A priority Critical patent/CN110838125B/en
Publication of CN110838125A publication Critical patent/CN110838125A/en
Application granted granted Critical
Publication of CN110838125B publication Critical patent/CN110838125B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00ICT specially adapted for the handling or processing of medical images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10072Tomographic images
    • G06T2207/10081Computed x-ray tomography [CT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Radiology & Medical Imaging (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

A method, apparatus, device and storage medium for target detection in medical images are disclosed. The method comprises the following steps: determining a feature map of the medical image; determining a target center point prediction map from the feature map, wherein the target center point prediction map comprises a plurality of first pixel points, and the value of each first pixel point indicates the likelihood that the first pixel point is a predicted target center point; determining a target size prediction map from the feature map, wherein the target size prediction map comprises a plurality of second pixel points, each second pixel point corresponds to one first pixel point, and the value of each second pixel point indicates the predicted size applicable to the first pixel point corresponding to that second pixel point; determining a predicted target center point and a predicted target size from the target center point prediction map and the target size prediction map; and determining a position and a size of a target object in the medical image based on the predicted target center point and the predicted target size.

Description

Target detection method, device, equipment and storage medium for medical image
Technical Field
The present invention relates to the field of image processing, and in particular to a method, an apparatus, a device and a storage medium for target detection in medical images.
Background
In existing target detection pipelines, a target present in the image to be detected must be matched to a preset reference frame (anchor), and the position and size of the target object are determined from the target's offset and size change relative to that reference frame. The accuracy of such detection depends on how well the reference frames are chosen. To obtain a good detection result, detection must therefore be run against a sufficiently large number of reference frames, which greatly increases the computational cost of target detection.
Disclosure of Invention
The present application aims to provide a target detection method, apparatus, device and storage medium for medical images.
According to one aspect of the present application, there is provided a target detection method for a medical image, comprising: determining a feature map of the medical image; determining a target center point prediction map from the feature map, wherein the target center point prediction map comprises a plurality of first pixel points, and the value of each first pixel point indicates the likelihood that the first pixel point is a predicted target center point; determining a target size prediction map from the feature map, wherein the target size prediction map comprises a plurality of second pixel points, each second pixel point corresponds to one first pixel point, and the value of each second pixel point indicates the predicted size applicable to the first pixel point corresponding to that second pixel point; determining a predicted target center point and a predicted target size from the target center point prediction map and the target size prediction map; and determining a position and a size of a target object in the medical image based on the predicted target center point and the predicted target size.
In some embodiments, determining a target center point prediction map from the feature map comprises: performing convolution processing on the feature map to determine a convolved feature map, wherein the convolved feature map comprises a plurality of third pixel points; and, for each third pixel point in the convolved feature map, determining a maximum horizontal pixel value of the convolved feature map in the horizontal direction, determining a maximum vertical pixel value of the convolved feature map in the vertical direction, and determining the value of a first pixel point in the target center point prediction map based on the maximum horizontal pixel value and the maximum vertical pixel value.
In some embodiments, determining the maximum horizontal pixel value of the convolved feature map in the horizontal direction comprises: searching the values of all third pixel points between the third pixel point and a first horizontal edge of the feature map along the horizontal direction and determining the maximum of the searched values as a first maximum horizontal pixel value, and/or searching the values of all third pixel points between the third pixel point and a second horizontal edge of the feature map along the horizontal direction and determining the maximum of the searched values as a second maximum horizontal pixel value; determining the maximum vertical pixel value of the convolved feature map in the vertical direction comprises: searching the values of all third pixel points between the third pixel point and a first vertical edge of the feature map along the vertical direction and determining the maximum of the searched values as a first maximum vertical pixel value, and/or searching the values of all third pixel points between the third pixel point and a second vertical edge of the feature map along the vertical direction and determining the maximum of the searched values as a second maximum vertical pixel value; and determining the value of the first pixel point in the target center point prediction map based on the maximum horizontal pixel value and the maximum vertical pixel value comprises: summing the first maximum horizontal pixel value and/or the second maximum horizontal pixel value with the first maximum vertical pixel value and/or the second maximum vertical pixel value to determine the value of the first pixel point corresponding to the third pixel point in the target center point prediction map.
In some embodiments, the target center point prediction map includes at least one channel, and the value of a first pixel point in each channel indicates the likelihood that the first pixel point is a predicted target center point for a target of the corresponding class.
In some embodiments, determining a target size prediction map from the feature map comprises: performing convolution processing on the feature map to obtain the target size prediction map, wherein the target size prediction map comprises at least one channel for each class of predicted target center point, and the value of a second pixel point in each channel indicates the predicted length, in a preset size direction, applicable to the first pixel point corresponding to that second pixel point when taken as a predicted target center point of that class.
In some embodiments, determining a predicted target center point and a predicted target size from the target center point prediction map and the target size prediction map comprises: determining, from the value of each first pixel point in the target center point prediction map, a probability value that the first pixel point belongs to a predicted target center point; determining a first pixel point whose probability value is greater than a preset probability threshold as the predicted target center point; and determining, from the coordinates of the predicted target center point in the target center point prediction map, the predicted target size corresponding to the predicted target center point in the target size prediction map.
In some embodiments, determining the position and size of the target object in the medical image based on the predicted target center point and the predicted target size comprises: determining a mapping relationship between the feature map and the medical image; determining a position of the target object and a candidate target size in the medical image based on the predicted target center point, the predicted target size and the mapping relationship; determining an offset for the candidate target size from the feature map; and determining the size of the target object in the medical image based on the offset and the candidate target size.
In some embodiments, the predicted target center point and the predicted target size comprise a predicted target center point for at least one class of target objects and a predicted target size for that class of target objects.
In some embodiments, the medical image is a 3D CT image, and the method further comprises a preprocessing step, prior to determining the feature map of the medical image, the preprocessing comprising normalizing the CT value for each pixel point in the medical image.
In some embodiments, determining the feature map of the medical image comprises: processing the medical image with at least one convolution layer to obtain a convolved medical image; and performing at least one downsampling and at least one upsampling of the convolved medical image with at least one downsampling layer and at least one upsampling layer to determine the feature map of the medical image, wherein the feature map is smaller in size than the medical image.
In some embodiments, the feature map, the target center point prediction map and the target size prediction map are generated by at least one convolutional network trained by: determining a training data set, wherein the training data set comprises at least one training image marked with the real position and real size of a real target object; for each training image: determining a training feature map of the training image; determining a training target center point prediction map from the training feature map; determining a training target size prediction map from the training feature map; determining a training predicted target center point and a training predicted target size from the training target center point prediction map and the training target size prediction map; and determining a training position and a training size of a training target object in the training image based on the training predicted target center point and the training predicted target size; and adjusting parameters of the at least one convolutional network to minimize a loss between the training position and training size of the training target object in the training image and the real position and real size of the real target object.
In some embodiments, the loss includes at least one of: a difference between the training position and the true position; and a difference between the training size and the real size.
According to another aspect of the present application, there is also provided a target detection apparatus for a medical image, comprising: a feature map determining unit configured to determine a feature map of the medical image; a center point prediction unit configured to determine a target center point prediction map from the feature map, wherein the target center point prediction map includes a plurality of first pixel points, the value of each first pixel point indicating the likelihood that the first pixel point is a predicted target center point; a size prediction unit configured to determine a target size prediction map from the feature map, wherein the target size prediction map includes a plurality of second pixel points, each second pixel point corresponds to one first pixel point, and the value of each second pixel point indicates the predicted size applicable to the first pixel point corresponding to that second pixel point; a target prediction unit configured to determine a predicted target center point and a predicted target size from the target center point prediction map and the target size prediction map; and a target determination unit configured to determine a position and a size of a target object in the medical image based on the predicted target center point and the predicted target size.
In some embodiments, the center point prediction unit is configured to: perform convolution processing on the feature map to determine a convolved feature map, wherein the convolved feature map comprises a plurality of third pixel points; and, for each third pixel point in the convolved feature map, determine a maximum horizontal pixel value of the convolved feature map in the horizontal direction, determine a maximum vertical pixel value of the convolved feature map in the vertical direction, and determine the value of a first pixel point in the target center point prediction map based on the maximum horizontal pixel value and the maximum vertical pixel value.
In some embodiments, the center point prediction unit is configured such that, for each third pixel point in the convolved feature map, determining the maximum horizontal pixel value of the convolved feature map in the horizontal direction comprises: searching the values of all third pixel points between the third pixel point and a first horizontal edge of the feature map along the horizontal direction and determining the maximum of the searched values as a first maximum horizontal pixel value, and/or searching the values of all third pixel points between the third pixel point and a second horizontal edge of the feature map along the horizontal direction and determining the maximum of the searched values as a second maximum horizontal pixel value; determining the maximum vertical pixel value of the convolved feature map in the vertical direction comprises: searching the values of all third pixel points between the third pixel point and a first vertical edge of the feature map along the vertical direction and determining the maximum of the searched values as a first maximum vertical pixel value, and/or searching the values of all third pixel points between the third pixel point and a second vertical edge of the feature map along the vertical direction and determining the maximum of the searched values as a second maximum vertical pixel value; and determining the value of the first pixel point in the target center point prediction map based on the maximum horizontal pixel value and the maximum vertical pixel value comprises: summing the first maximum horizontal pixel value and/or the second maximum horizontal pixel value with the first maximum vertical pixel value and/or the second maximum vertical pixel value to determine the value of the first pixel point corresponding to the third pixel point in the target center point prediction map.
In some embodiments, the target center point prediction map includes at least one channel, and the value of a first pixel point in each channel indicates the likelihood that the first pixel point is a predicted target center point for a target of the corresponding class.
In some embodiments, the size prediction unit is configured to: perform convolution processing on the feature map to obtain the target size prediction map, wherein the target size prediction map comprises at least one channel for each class of predicted target center point, and the value of a second pixel point in each channel indicates the predicted length, in a preset size direction, applicable to the first pixel point corresponding to that second pixel point when taken as a predicted target center point of that class.
In some embodiments, the target prediction unit is configured to: determine, from the value of each first pixel point in the target center point prediction map, a probability value that the first pixel point belongs to a predicted target center point; determine a first pixel point whose probability value is greater than a preset probability threshold as the predicted target center point; and determine, from the coordinates of the predicted target center point in the target center point prediction map, the predicted target size corresponding to the predicted target center point in the target size prediction map.
In some embodiments, the target determination unit is configured to: determine a mapping relationship between the feature map and the medical image; determine a position of the target object and a candidate target size in the medical image based on the predicted target center point, the predicted target size and the mapping relationship; determine an offset for the candidate target size from the feature map; and determine the size of the target object in the medical image based on the offset and the candidate target size.
In some embodiments, the predicted target center point and the predicted target size comprise a predicted target center point for at least one class of target objects and a predicted target size for that class of target objects.
In some embodiments, the medical image is a 3D CT image, and the object detection device further comprises a preprocessing unit configured to normalize the CT value for each pixel point in the medical image.
In some embodiments, the feature map determining unit is configured to: process the medical image with at least one convolution layer to obtain a convolved medical image; and perform at least one downsampling and at least one upsampling of the convolved medical image with at least one downsampling layer and at least one upsampling layer to determine the feature map of the medical image, wherein the feature map is smaller in size than the medical image.
According to still another aspect of the present application, there is also provided an object detection apparatus for medical image, including: one or more processors; and one or more memories, wherein the memories have stored therein computer readable code which, when executed by the one or more processors, performs the object detection method as described previously.
According to yet another aspect of the present application, there is also provided a computer readable storage medium having stored thereon instructions that, when executed by a processor, cause the processor to perform the object detection method as described above.
With the target detection method, apparatus, device and storage medium provided by the present application, the position and size of a target object present in an image can be determined by detecting the target center point and the size of the target frame, without predefining reference frames. Because only a single-stage detection of key points is required, the target detection method provided by the present application is fast, saves a large amount of computing resources, and can still obtain good target detection results.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. The following drawings are not intended to be drawn to scale, with emphasis instead being placed upon illustrating the principles of the present application.
FIG. 1 illustrates an exemplary scenario of an image processing system for implementing target detection according to the present application;
FIG. 2 shows a schematic flow chart of a method of object detection of a medical image according to an embodiment of the present application;
FIG. 3 shows a schematic diagram of an hourglass network stacked twice;
FIG. 4A shows a schematic diagram of a center point prediction network according to an embodiment of the present application;
FIG. 4B shows a schematic diagram of another center point prediction process according to an embodiment of the present application;
FIG. 5 shows a schematic block diagram of an object detection device of a medical image according to an embodiment of the present application;
FIG. 6 shows a schematic block diagram of a medical electronic device according to an embodiment of the present application;
FIG. 7A illustrates an exemplary process of detecting medical images according to an embodiment of the present application;
FIG. 7B illustrates an example of dividing a medical image 710 to obtain a plurality of image blocks;
FIG. 8 illustrates an example of performing a target detection process with an electronic device according to an embodiment of the present application;
FIG. 9A shows an exemplary process for target detection based on a reference frame;
FIG. 9B shows an illustrative process for target detection of lung nodules using 3D segmentation; and
Fig. 10 illustrates an architecture of a computing device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning understood by a person of ordinary skill in the art to which this invention belongs. The terms "first", "second" and the like used herein do not denote any order, quantity or importance, but are merely used to distinguish one element from another. Likewise, words such as "comprising" or "comprises" mean that the element or item preceding the word encompasses the elements or items listed after the word and their equivalents, without excluding other elements or items. Terms such as "connected" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right" and the like are used merely to indicate relative positional relationships, which may change when the absolute position of the described object changes.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
Convolutional neural networks perform well in fields such as computer vision because of their strong feature learning capability. Thanks to properties such as sparse connectivity and weight sharing, the convolution operation effectively reduces the complexity of a deep neural network model and thus helps prevent the overfitting caused by an overly complex model.
To avoid the computational cost of detecting against a large number of reference frames during target detection, the present application provides a target detection method that does not rely on reference frames.
FIG. 1 illustrates an exemplary scenario of an image processing system for implementing target detection according to the present application. As shown in fig. 1, the image processing system 100 may include a user terminal 110, a network 120, a server 130 and a database 140.
The user terminal 110 may be, for example, a computer 110-1, a mobile phone 110-2 as shown in fig. 1. It will be appreciated that in fact, the user terminal may be any other type of electronic device capable of performing data processing, which may include, but is not limited to, a desktop computer, a notebook computer, a tablet computer, a smart phone, a smart home device, a wearable device, an in-vehicle electronic device, a monitoring device, etc. The user terminal may also be any equipment provided with electronic devices, such as vehicles, robots, etc.
The user terminal provided by the application can be used for receiving the image to be processed and realizing target detection by using the method provided by the application. For example, the user terminal may acquire an image to be processed by an image acquisition device (e.g., a camera, a video camera, etc.) provided on the user terminal. For another example, the user terminal may also receive an image to be processed from an image acquisition device that is separately provided. As another example, the user terminal may also receive the image to be processed from the server via the network. The image to be processed may be a single image or a frame in a video. In case the image to be processed is a medical image, the user terminal may also be connected to the medical electronic device and receive the image to be processed from the medical electronic device. The medical image may be acquired by CT, MRI, ultrasound, X-ray, nuclide imaging (such as SPECT, PET), or the like, or may be an image showing physiological information of a human body such as electrocardiogram, electroencephalogram, or optical photography.
In some embodiments, the method for detecting an object provided herein may be performed by a processing unit of a user terminal. In some implementations, the user terminal may perform the target detection method using an application built into the user terminal. In other implementations, the user terminal may execute the target detection method provided herein by invoking an application program stored external to the user terminal.
In other embodiments, the user terminal transmits the received image to be processed to the server 130 via the network 120, and the server 130 performs the object detection method. In some implementations, the server 130 may perform the target detection method using an application built into the server. In other implementations, the server 130 may perform the target detection method by invoking an application program stored external to the server.
Network 120 may be a single network or a combination of at least two different networks. For example, network 120 may include, but is not limited to, one or a combination of several of a local area network, a wide area network, a public network, a private network, and the like.
The server 130 may be a single server or a group of servers, each server within the group being connected via a wired or wireless network. A server farm may be centralized, such as a data center, or distributed. The server 130 may be local or remote.
Database 140 may refer broadly to a device having a storage function. The database 140 is mainly used to store various data utilized, generated and output during the operation of the user terminal 110 and the server 130. Database 140 may be local or remote. The database 140 may include various memories, such as random access memory (RAM) and read-only memory (ROM). The above-mentioned storage devices are merely examples, and the storage devices that may be used by the system are not limited thereto.
Database 140 may be interconnected or in communication with server 130 or a portion thereof via network 120, or directly with server 130, or a combination thereof.
In some embodiments, the database 140 may be a stand-alone device. In other embodiments, the database 140 may also be integrated in at least one of the user terminal 110 and the server 130. For example, the database 140 may be provided on the user terminal 110 or on the server 130. As another example, the database 140 may be distributed, with one portion provided on the user terminal 110 and another portion provided on the server 130.
The flow of the target detection method provided in the present application will be described in detail below.
Fig. 2 shows a schematic flow chart of a method of object detection of a medical image according to an embodiment of the present application. The medical image may be acquired by CT, MRI, ultrasound, X-ray, nuclide imaging (such as SPECT, PET), or the like, or may be an image showing physiological information of a human body such as electrocardiogram, electroencephalogram, or optical photography. The principles of the present application will be described below taking the example of a medical image being a 3D lung CT image, for example, the medical image may be a 3D CT image having a length and width dimension of 256 pixels and a number of slices of 16. It will be appreciated by those skilled in the art that any type and size of medical image may be processed using the method shown in fig. 2 without departing from the principles of the present application. The medical image may comprise at least one channel. In case the medical image is a gray scale image, the medical image may comprise only one channel. In case the medical image is a color image, the medical image may comprise three channels of RGB.
As shown in fig. 2, in step S202, a feature map of the medical image may be determined. In some embodiments, the medical image may be processed using a feature determination network to obtain a feature map of the medical image.
In some embodiments, depending on the computing power of the computing device, the medical image may be divided into a plurality of image blocks, and the target detection method shown in fig. 2 may be performed for each image block. When the medical image is a 3D image, the 3D medical image may be divided into a plurality of 3D image blocks. The divided image blocks may have overlapping portions, and the detection results of the overlapping portions of the image blocks may be weighted-averaged to obtain the final target detection result, as sketched below.
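As a rough illustration of this block-wise processing, the sketch below (Python/NumPy; the block and stride values are illustrative assumptions, not values from this application) splits a 3D volume into overlapping blocks whose per-block detections can later be mapped back to volume coordinates and averaged where they overlap.

```python
import numpy as np

def split_into_blocks(volume, block=(16, 256, 256), stride=(12, 224, 224)):
    """Split a 3D CT volume (depth, height, width) into overlapping blocks.

    `block` and `stride` are illustrative values only; a stride smaller than the
    block size produces the overlapping portions mentioned in the text. The volume
    is assumed to be at least one block large in every dimension.
    Returns (start_index, block) pairs so that per-block detections can be mapped
    back to volume coordinates and overlapping results averaged.
    """
    def starts(size, b, s):
        # Start offsets along one axis, clamped so every block stays inside the volume.
        return sorted({min(i, size - b) for i in range(0, size, s)})

    D, H, W = volume.shape
    blocks = []
    for z in starts(D, block[0], stride[0]):
        for y in starts(H, block[1], stride[1]):
            for x in starts(W, block[2], stride[2]):
                blocks.append(((z, y, x),
                               volume[z:z + block[0], y:y + block[1], x:x + block[2]]))
    return blocks
```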
In some embodiments, prior to determining the feature map of the medical image, the method 200 further comprises a preprocessing step (not shown) that normalizes the CT value of each pixel point in the medical image.
The CT values in a CT image have a physical meaning: different CT values represent different types of biological tissue, such as bone, muscle, fat, lung, kidney, and the like. The principles of the present disclosure are described here in the context of the lung. To reduce irrelevant information in the image, the data in the CT image can be clipped to a value range such as [-1200, 600], i.e., CT values below -1200 are set to -1200 and CT values above 600 are set to 600. The CT values in the image may then be normalized to the range [0, 1].
It will be appreciated that for medical images other than CT images, other means of normalizing pixels in the image may be employed.
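For the lung CT case above, a minimal preprocessing sketch (NumPy assumed) could look as follows; the [-1200, 600] window comes from the description, while the function name is illustrative.

```python
import numpy as np

def preprocess_ct(volume, lower=-1200.0, upper=600.0):
    """Clip CT values to [lower, upper] and rescale them to the range [0, 1]."""
    clipped = np.clip(volume.astype(np.float32), lower, upper)
    return (clipped - lower) / (upper - lower)
```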
In some implementations, the feature determination network may be an artificial neural network such as an Hourglass network or a U-net network. The artificial neural network may be a convolutional neural network comprising one or more convolutional layers. The result output by any layer of the artificial neural network (e.g., the last layer or any intermediate layer) may be used as the feature map of the medical image.
In one example, the medical image may be processed using an hourglass network stacked twice.
Fig. 3 shows a schematic diagram of such a twice-stacked hourglass network. The first-layer hourglass network 310 and the second-layer hourglass network 320 may form a U-shaped structure, wherein the first-layer hourglass network 310 may include at least one downsampling layer and at least one convolution layer, and the second-layer hourglass network 320 may include at least one upsampling layer and at least one convolution layer. Convolving and downsampling the medical image through the first-layer hourglass network filters redundant information out of the medical image and thus retains the information of the important features in the image. Further convolving and upsampling the information output by the first-layer hourglass network 310 then recovers detailed information in the medical image.
In one implementation, the first-layer hourglass network 310 may include two downsampling layers, each performing 2x downsampling, so that an input 256 x 256 medical image is downsampled to 64 x 64 feature information. The second-layer hourglass network 320 may include one upsampling layer performing 2x upsampling, so that a feature map of size 128 x 128 is output. In this case, the size of the feature map is 1/4 of the size of the medical image (i.e., 1/2 of it in each of the height and width directions).
Those skilled in the art will appreciate that the above-described configuration of the first tier hourglass network 310 and the second tier hourglass network 320 is merely an exemplary illustration. Depending on the actual situation, one skilled in the art may set different numbers of downsampling and upsampling layers in the first tier hourglass network 310 and the second tier hourglass network 320, so that different sized feature maps can be output.
For example, the first-layer hourglass network 310 may include three or more downsampling layers and the second-layer hourglass network 320 may include one upsampling layer, so that the feature map output by the second-layer hourglass network 320 is 1/16 or less of the size of the medical image. As another example, the first-layer hourglass network 310 may include two downsampling layers and the second-layer hourglass network 320 may include two or more upsampling layers, so that the second-layer hourglass network 320 outputs a feature map that is the same size as, or even larger than, the medical image.
The smaller the feature map, the less information of the medical image it retains, but the less computation the target detection process requires; the larger the feature map, the more information of the medical image it retains, but the more computation may be required. Those skilled in the art can therefore determine the network structure according to the actual situation and the desired feature map.
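For illustration only, the following PyTorch sketch builds a small U-shaped feature extractor with two 2x downsamplings and one 2x upsampling, matching the example configuration above; the channel widths, the use of max pooling for downsampling, and the layer counts are assumptions rather than the exact architecture of this application.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Conv3d -> BatchNorm -> ReLU: the basic unit assumed for this sketch.
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

class FeatureNet(nn.Module):
    """U-shaped feature extractor: two 2x downsamplings then one 2x upsampling,
    following the example configuration in the text (256 -> 64 -> 128 in-plane)."""

    def __init__(self, in_ch=1, width=32):
        super().__init__()
        self.down = nn.Sequential(
            conv_block(in_ch, width),
            nn.MaxPool3d(kernel_size=2),          # first 2x downsampling (assumed max pooling)
            conv_block(width, 2 * width),
            nn.MaxPool3d(kernel_size=2),          # second 2x downsampling
            conv_block(2 * width, 2 * width),
        )
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
            conv_block(2 * width, width),          # one 2x upsampling
        )

    def forward(self, x):                          # x: (N, 1, D, H, W)
        return self.up(self.down(x))               # feature map, half the input size per axis
```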
Referring back to fig. 2, in step S204, a target center point prediction map may be determined from the feature map, wherein the target center point prediction map includes a plurality of first pixel points, and the value of each first pixel point indicates the likelihood that the first pixel point is a predicted target center point. A first pixel point may refer to any pixel point in the target center point prediction map; it is understood that the pixel points of the target center point prediction map described below are equivalent to the first pixel points described herein.
In some embodiments, the center point of the target may be determined by extracting the pixel maxima of the image in the horizontal and vertical directions. The feature map output at step S202 may be processed using a center point prediction network to determine the center point of the target object present in the medical image. The center point prediction network may be implemented as a convolutional neural network including at least one convolutional layer.
The feature map may be convolved with a convolution layer in the center point prediction network to determine a convolved feature map. In one implementation, a maximum horizontal pixel value of the convolved feature map in the horizontal direction and a maximum vertical pixel value of the convolved feature map in the vertical direction may be determined. For example, for each row of pixels in the convolved feature map, the maximum horizontal pixel value of that row may be determined, and for each column of pixels, the maximum vertical pixel value of that column may be determined.
The value of a pixel in the target center point prediction map may be determined based on the maximum horizontal pixel value and the maximum vertical pixel value, such that the value of the pixel indicates the likelihood that the pixel is a predicted target center point. In one implementation, the larger the value of the pixel, the more likely the pixel is a predicted target center point. The predicted target center point may be determined in the target center point prediction map using a preset prediction threshold; for example, a pixel point whose value is greater than the prediction threshold may be determined as a predicted target center point. In other implementations, the probability that a pixel point is a predicted target center point may also be determined based on the values of the pixel points in the target center point prediction map.
Fig. 4A shows a schematic diagram of a center point prediction network according to an embodiment of the present application.
As shown in fig. 4A, the center point prediction network 400 may include a vertical prediction module 410 and a horizontal prediction module 420.
The vertical prediction module 410 may include a first convolution unit 411, a first vertical pooling unit 412 (the upper pooling unit in the illustration), and a second vertical pooling unit 413 (the lower pooling unit in the illustration). The first convolution unit 411 may include at least one convolution layer, a batch normalization layer, and a rectified linear unit (ReLU) layer. The feature map output in step S202 may be processed by the first convolution unit to obtain the values of a plurality of third pixels included in the convolved feature map.
For each third pixel point in the convolved feature map obtained by the first convolution unit, the first vertical pooling unit 412 may be configured to find values of all pixel points between the pixel point and a first vertical edge (e.g., an upper edge) of the feature map in a vertical direction, and determine a maximum value among the found values as a first maximum vertical pixel value. The second vertical pooling unit 413 may be configured to search values of all pixel points between the pixel point and a second vertical edge (e.g., a lower edge) of the feature map in the vertical direction, and determine a maximum value among the searched values as a second maximum vertical pixel value. In some embodiments, the first vertical edge and the second vertical edge may be straight lines, curved lines, or any line capable of constituting an edge. In some implementations, the first and second vertical edges may be identical in shape and parallel to each other.
The horizontal prediction module 420 may include a second convolution unit 421, a first horizontal pooling unit 422 (the left pooling unit in the illustration), and a second horizontal pooling unit 423 (the right pooling unit in the illustration). The second convolution unit 421 may include at least one convolution layer, a batch normalization layer, and a rectified linear unit (ReLU) layer. The feature map output in step S202 may be processed by the second convolution unit to obtain the values of a plurality of third pixels included in the convolved feature map. The parameters of the first convolution unit 411 and the second convolution unit 421 may be the same or different.
For each third pixel point in the convolved feature map obtained by the second convolution unit, the first horizontal pooling unit 422 may be configured to find the values of all pixel points between that pixel point and a first horizontal edge (e.g., a left edge) of the feature map in the horizontal direction, and determine the maximum of the found values as a first maximum horizontal pixel value. The second horizontal pooling unit 423 may be configured to find the values of all pixel points between that pixel point and a second horizontal edge (e.g., a right edge) of the feature map in the horizontal direction, and determine the maximum of the found values as a second maximum horizontal pixel value. In some embodiments, the first horizontal edge and the second horizontal edge may be straight lines, curves, or any lines capable of constituting an edge. In some implementations, the first horizontal edge and the second horizontal edge may be identical in shape and parallel to each other.
In one implementation, determining the value of a pixel point in the target center point prediction map based on the maximum horizontal pixel value and the maximum vertical pixel value may include: summing the first maximum horizontal pixel value, the second maximum horizontal pixel value, the first maximum vertical pixel value and the second maximum vertical pixel value to determine the value of the corresponding pixel point in the target center point prediction map.
In other embodiments, the center point prediction may also be implemented using other combinations of the first horizontal pooling unit, the second horizontal pooling unit, the first vertical pooling unit, and the second vertical pooling unit in fig. 4A.
Fig. 4B shows a schematic diagram of another center point prediction process according to an embodiment of the present application. As shown in fig. 4B, for one pixel point in the convolved feature map, a lower pooling unit may be used to find values of all pixel points between the pixel point and the lower edge of the feature map in the vertical direction, and determine the maximum value of the found values as the first maximum vertical pixel value. The right pooling unit may be used to find values of all pixel points between the pixel point and the right edge of the feature map in the horizontal direction, and determine a maximum value of the found values as a first maximum horizontal pixel value. The value of the corresponding pixel point in the target center point prediction graph may be determined based on the sum of the first maximum horizontal pixel value and the first maximum vertical pixel value.
It is understood that the scope of the present application is not limited to the two implementations provided in fig. 4A and 4B; for each pixel point in the feature map, the value of the corresponding pixel point in the target center point prediction map may be determined from the sum of the first maximum horizontal pixel value and/or the second maximum horizontal pixel value and the first maximum vertical pixel value and/or the second maximum vertical pixel value.
Further, while the principles of the present application are described in fig. 4A, 4B by taking the example of finding the maximum value in the feature map in the horizontal direction and the vertical direction, the scope of the present application is not limited thereto. In fact, the maximum value of the pixels in the feature map can also be found in any direction in the feature map.
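The directional pooling of FIG. 4A can be sketched with cumulative maxima, as below (PyTorch, written for a 2D feature map for clarity; a 3D volume would add a depth axis). The combination of the four pooled maps follows the summation described above, while tensor shapes and function names are assumptions.

```python
import torch

def directional_max(x, dim, reverse=False):
    """Cumulative maximum of feature map x along `dim`.

    For each pixel this gives the maximum value between that pixel and one edge of
    the map (e.g. the left edge when dim is the width axis and reverse=False),
    which is the per-pixel search described for the pooling units.
    """
    if reverse:
        x = torch.flip(x, dims=[dim])
    out, _ = torch.cummax(x, dim=dim)
    if reverse:
        out = torch.flip(out, dims=[dim])
    return out

def center_point_map(horiz_feat, vert_feat):
    """Combine the four pooled maps as in FIG. 4A (illustrative 2D version).

    horiz_feat / vert_feat: outputs of the two convolution units, shape (N, C, H, W).
    """
    left = directional_max(horiz_feat, dim=3, reverse=False)    # max towards the left edge
    right = directional_max(horiz_feat, dim=3, reverse=True)    # max towards the right edge
    top = directional_max(vert_feat, dim=2, reverse=False)      # max towards the upper edge
    bottom = directional_max(vert_feat, dim=2, reverse=True)    # max towards the lower edge
    return left + right + top + bottom
```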
The value of each pixel point in the target center point prediction map can be determined in the above manner. In some embodiments, the size of the target center point prediction map may be equal to the size of the feature map. Each point in the target center point prediction map may represent the probability that at least one pixel in the medical image corresponding to that point is the center point of a target. In one implementation, the target center point prediction map may include at least one channel, where the number of channels refers to the dimension of the target center point prediction map in a third direction in addition to the width and height. That is, the target center point prediction map may be expressed as a tensor of H x W x C, where H represents the height of the target center point prediction map, W represents its width, and C represents its number of channels. The value of a pixel in each channel indicates the probability that the pixel belongs to a predicted target center point of the target of the corresponding class. Those skilled in the art may designate the target class corresponding to each channel during training, so that the method provided by the present application can perform target detection for multiple types of targets. For example, for a lung CT image, target detection may be performed for multiple types of lung lesions such as lung nodules, lung cords, lymph node calcifications, arteriosclerosis, and the like. For example, the target center point prediction map may include three channels, where the value of a pixel in the first channel indicates the probability that the pixel belongs to the predicted target center point of a lung nodule, the value of a pixel in the second channel indicates the probability that the pixel belongs to the predicted target center point of a lung cord, and the value of a pixel in the third channel indicates the probability that the pixel belongs to the predicted target center point of a lymph node calcification.
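Purely to illustrate the per-class channel layout of the center point prediction map, a sketch of a simple prediction head is given below; it omits the pooling modules of FIG. 4A, and its layer choices are assumptions.

```python
import torch.nn as nn

class CenterHead(nn.Module):
    """Center point prediction head with one output channel per target class
    (e.g. 3 channels for lung nodule, lung cord and lymph node calcification)."""

    def __init__(self, in_ch=32, num_classes=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(in_ch, num_classes, kernel_size=1),  # one channel per class
        )

    def forward(self, feat):
        # Raw scores; applying a sigmoid later turns them into probabilities.
        return self.head(feat)
```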
Referring back to fig. 2, in step S206, the values of a plurality of second pixel points included in a target size prediction map may be determined from the feature map, wherein each second pixel point corresponds to one first pixel point, and the value of each second pixel point indicates the predicted size applicable to the first pixel point corresponding to that second pixel point. A second pixel point may refer to any pixel point in the target size prediction map; it is understood that the pixel points of the target size prediction map described below correspond to the second pixel points described herein.
In some embodiments, the feature map output by step S202 may be processed using a size prediction network to determine the size of the target present in the medical image. The size prediction network may be implemented as an artificial neural network; for example, it may include at least one convolution layer. The feature map may be convolved with a convolution layer in the size prediction network, and the result output by the size prediction network may be used as the target size prediction map. The size of the target size prediction map may be the same as that of the center point prediction map, and each point in the size prediction map represents the size for the pixel point at the same location in the center point prediction map.
In one implementation, for each class of targets, the size prediction map has at least one channel, and the value of a pixel in each channel indicates the predicted length, in a predetermined size direction, applicable to the pixel at the corresponding location in the center point prediction map when taken as a predicted target center point of that class. For example, if a target is represented as a cuboid and the center point prediction map has n channels, the size prediction map may have 3*n channels, each representing the size in the length, width or height direction for a given class of target. As another example, if a target is represented as a sphere and the center point prediction map has n channels, the size prediction map may have n channels, each representing the size of the corresponding class of target in the radial direction.
It will be appreciated that when the object is represented as other geometric shapes, the predetermined dimensional direction may be any dimensional direction suitable for representing the shape and size of the object.
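A corresponding sketch of a size prediction head is shown below: for n target classes and a cuboid representation it outputs 3*n channels, and for a spherical representation it would output n channels. The layer choices are again assumptions.

```python
import torch.nn as nn

class SizeHead(nn.Module):
    """Size prediction head: num_classes * dims_per_class output channels,
    e.g. 3 * n for cuboid targets (length, width, height per class) or n for spheres."""

    def __init__(self, in_ch=32, num_classes=3, dims_per_class=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(in_ch, num_classes * dims_per_class, kernel_size=1),
        )

    def forward(self, feat):
        # Output has the same spatial size as the center point prediction map.
        return self.head(feat)
```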
In step S208, a predicted target center point and a predicted target size may be determined from the target center point prediction map and the target size prediction map.
In some embodiments, when the center point prediction map includes at least one channel indicating at least one class of targets, the predicted target center point and the predicted target size include a predicted target center point for the at least one class of targets and a predicted target size for that class.
In some embodiments, the probability value that each pixel in the target center point prediction map belongs to a predicted target center point may be determined from the value of that pixel. For example, the center point prediction map may be processed with a sigmoid activation function to obtain a center point prediction heat map, where the value of each pixel point represents the probability that the pixel point belongs to a predicted target center point. Pixels in the target center point prediction map whose probability value is greater than a preset probability threshold may be determined as predicted target center points.
The predicted size for the predicted target center point may be determined in the size prediction map based on the position coordinates of the predicted target center point in the center point prediction map. Since the size of the size prediction map and the size of the center point prediction map are the same, the value of a pixel point in the size prediction map having the same position coordinates as the prediction target center point can be determined as the prediction size for the prediction target center point.
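The decoding just described can be sketched as follows, assuming a center point prediction map of shape (C, D, H, W) and a cuboid size prediction map of shape (3*C, D, H, W); the 0.5 threshold is illustrative, since the text only requires a preset probability threshold.

```python
import torch

def decode_predictions(center_map, size_map, prob_threshold=0.5):
    """Turn a center point prediction map and a size prediction map into
    (class, coordinates, size) triples.

    center_map: (C, D, H, W) raw center scores; size_map: (C*3, D, H, W).
    """
    heat = torch.sigmoid(center_map)                        # probability of being a center
    detections = []
    for cls in range(heat.shape[0]):
        coords = torch.nonzero(heat[cls] > prob_threshold)  # predicted center points
        for z, y, x in coords.tolist():
            # Read the size at the same position coordinates as the predicted center.
            size = size_map[cls * 3:(cls + 1) * 3, z, y, x]
            detections.append((cls, (z, y, x), size.tolist()))
    return detections
```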
In step S210, a position and a size of a target object may be determined in the medical image based on the predicted target center point and the predicted target size.
In some embodiments, step S210 may include determining a mapping relationship between the feature map and the medical image. As described above, the feature map output by the feature determination network has a predetermined size mapping relationship with the original medical image (for example, the size of the feature map is 1/4 of the size of the original medical image). Thus, according to this size mapping relationship, the target position in the medical image corresponding to the predicted target center point determined using the center point prediction map can be determined.
Step S210 may further include determining a location of the target object and a candidate target size in the medical image based on the predicted target center point, the predicted target size, and the mapping relationship.
For example, in the case where the size of the feature map is 1/4 of the size of the original medical image, the size of the center point prediction map may also be 1/4 of the size of the medical image. If a point with coordinates (2, 5) in the center point prediction map is determined as a predicted center point, then according to the mapping relationship between the feature map and the original medical image, the coordinates of the center position of the target in the medical image can be determined as 4 times the predicted center point coordinates, i.e., (8, 20). Accordingly, the size of the target can be determined from the values at coordinates (2, 5) in all channels of the size prediction map. For example, if the length, width and height dimensions in the size prediction map corresponding to the predicted center point are 15, 21 and 19, the dimensions of the target may be determined to be 4 times the results in the size prediction map, i.e., 60, 84 and 76. Here each dimension refers to the distance from the center position of the target to the edge of the target detection frame along the length, width and height directions, respectively. Thus, the final target detection frame has a length of 120, a width of 168 and a height of 152.
An approximation may exist in mapping the original medical image into a feature map. For example, when the size of the original medical image is not divisible by the coefficients of the downsampling layer (e.g., 2 times downsampling), the feature map size needs to be rounded during the downsampling process. In this case, when the position and size of the predicted target center point are mapped directly back to the medical image in the above-described manner, there may be a deviation in the obtained result. In this case, the target size obtained directly using the mapping relationship may be determined as a candidate target size, and the candidate target size may be adjusted according to the deviation to determine a final target size.
Thus, in one implementation, step S210 may further include determining an offset for the candidate target size from the feature map. In some implementations, an offset for the candidate target size may be determined using an offset prediction network. The offset prediction network may include at least one convolutional layer. The feature map may be convolved with the offset prediction network to obtain an offset prediction map, where the offset prediction map may have the same size and the same number of channels as the size prediction map. The value of each pixel point in the offset prediction map may indicate an offset for the corresponding dimension in the size prediction map.
Accordingly, after the coordinates of the predicted center point and the predicted size for the predicted center point are determined based on the foregoing method, the size offset for the center point can be determined based on the coordinates of the predicted center point. The offset may be summed with the candidate target size so that the candidate target size can be adjusted to determine a final target size.
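A minimal sketch of this refinement step, assuming the offsets are read from the offset prediction map at the predicted center coordinates and simply added to the candidate sizes (the function name and tensor shapes are illustrative assumptions):

```python
import torch

def refine_sizes_with_offsets(candidate_sizes, offset_map, centers):
    """Sketch: adjust candidate target sizes with values from an offset
    prediction map (same spatial size and channel count as the size map).

    candidate_sizes: (N, 3) candidate sizes mapped back to the image
    offset_map:      (3, D, H, W) predicted offsets, one channel per size direction
    centers:         (N, 3) predicted center coordinates (long tensor) in the prediction map
    """
    zs, ys, xs = centers[:, 0], centers[:, 1], centers[:, 2]
    offsets = offset_map[:, zs, ys, xs].T  # (N, 3), read at the center coordinates
    # The offset is summed with the candidate size to obtain the final size.
    return candidate_sizes + offsets
```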
As previously described, the feature map, the target center point prediction map, and the target size prediction map are generated by at least one convolution network. The at least one convolutional network may be trained using the training method described below. For example, the parameters in the above-described feature map determination network, center point prediction network, size prediction network, and offset prediction network may be trained using the following methods.
For example, a training dataset may be determined, wherein the training dataset comprises at least one training image marked with a true position and a true size of a true target object.
In the training process provided in the present application, each training image may be processed using the method shown in fig. 2 to obtain a target detection training result for that training image.
During the training process, a training feature map of the training image may be determined. Then, a training target center point prediction graph, a training target size prediction graph, and a training offset prediction graph may be determined from the training feature map, and a training predicted target center point and a training predicted target size may be determined from these prediction graphs. The training position and the training size of the training target object can then be determined in the training image based on the training predicted target center point and the training predicted target size. Parameters of the at least one convolution network may be adjusted such that the losses between the training position and training size of the training target object and the real position and real size of the real target object in the training image are minimized. Further, the real offset of the predicted target in the training image may be determined using the training size and the real size.
In some embodiments, the loss between the training position and training size and the true position and true size may be represented by equation (1):
L = L_det + λ × L_size + γ × L_off    (1)

wherein L_det represents the loss between the training position and the true position of the target, L_size represents the loss between the training size and the true size of the target, and L_off represents the loss between the training offset and the true offset of the target. λ and γ are predefined hyper-parameters.
Wherein L_det can be represented by formula (2):

wherein N is the total number of targets to be detected in the training image, C is the number of categories of the targets to be detected, i, j and k are index numbers of pixel points, c is the index number of the target category, and W, H, and Z are the length, width, and height of the training image, respectively. p_cijk is the probability value, determined using the center point prediction graph, that the point P_ijk (i.e., the point with length index i, width index j, and height index k) is the center point of a class-c target, and y_cijk is the label value of the point P_ijk as the center point of a true class-c target. α and β are predefined hyper-parameters.
It will be appreciated that for each target in the training image, there is only one true center point. In practice, however, a point near the true center point can still yield a good target prediction result. Therefore, when the pixel points in the training image are labeled, the label value of the real center point can be set to 1, and the label values of points near the real center point can be set to values between 0 and 1, so that the problem of unbalanced positive and negative samples can be alleviated.
L_size can be represented by formula (3):

where N is the total number of objects to be detected in the training image, k is the index number of the object, S*_k represents the true size of the kth object, and S_k represents the predicted size of the kth target.
L_off can be represented by formula (4):

where N is the total number of objects to be detected in the training image, m represents the index number of the object, p*_m represents the true coordinates of the mth target detection frame, p_m represents the predicted coordinates of the mth target detection frame, and R represents the mapping relationship between the training feature map and the training image; the true offset of the predicted target can therefore be expressed in terms of p*_m and R. The predicted offset for the mth target is obtained using the offset prediction network. When the target is represented by dimensions in the three directions of length, width, and height, p is a vector including three elements, each representing the predicted dimension of the predicted target in one of the length, width, and height directions.
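Since formulas (2)-(4) are rendered as images in the original text, the following Python sketch only approximates them, assuming a CenterNet-style focal loss for L_det and L1 losses for L_size and L_off; the exact per-term definitions in the patent may differ in detail:

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_heatmap, gt_heatmap, pred_sizes, gt_sizes,
                   pred_offsets, gt_offsets, alpha=2.0, beta=4.0,
                   lam=0.1, gamma=1.0):
    """Sketch of the combined loss L = L_det + lambda * L_size + gamma * L_off.

    pred_heatmap / gt_heatmap: (C, D, H, W) per-class center point probabilities
                               and soft labels in [0, 1] (1 at true centers)
    pred_sizes / gt_sizes:     (N, 3) per-target predicted and true sizes
    pred_offsets / gt_offsets: (N, 3) per-target predicted and true offsets
    """
    num_targets = max(pred_sizes.shape[0], 1)
    pos = gt_heatmap.eq(1.0)

    # Focal-style center point loss: positives at true centers, with negatives
    # near them down-weighted by the soft labels between 0 and 1.
    p = pred_heatmap.clamp(1e-6, 1 - 1e-6)
    pos_loss = ((1 - p) ** alpha * torch.log(p))[pos].sum()
    neg_loss = (((1 - gt_heatmap) ** beta) * (p ** alpha) * torch.log(1 - p))[~pos].sum()
    l_det = -(pos_loss + neg_loss) / num_targets

    # Size and offset losses averaged over the N targets.
    l_size = F.l1_loss(pred_sizes, gt_sizes, reduction="sum") / num_targets
    l_off = F.l1_loss(pred_offsets, gt_offsets, reduction="sum") / num_targets

    return l_det + lam * l_size + gamma * l_off
```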
By using the target detection method provided by the present application, the position and the size of the target object present in the image can be determined by detecting the center point and the size of the target frame, without predefining a reference frame. Because only single-stage detection of key points is needed, the target detection method provided by the present application is fast, saves a large amount of computing resources, and still obtains good target detection results.
Fig. 5 shows a schematic block diagram of an object detection device of a medical image according to an embodiment of the present application. As shown in fig. 5, the object detection apparatus 500 may include a feature map determination unit 510, a center point prediction unit 520, a size prediction unit 530, an object prediction unit 540, and an object determination unit 550.
The feature map determining unit 510 may be configured to determine a feature map of the medical image. In some embodiments, the medical image may be processed using a feature determination network to obtain a feature map of the medical image.
In some embodiments, depending on the computing power of the computing device, the medical image may be divided into a plurality of image blocks, and target detection may be performed on each image block. When the medical image is a 3D image, the 3D medical image may be divided into a plurality of 3D image blocks. The divided image blocks may have overlapping portions. The detection results of the overlapping portions of the image blocks may be weighted-averaged to obtain the final target detection result.
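A minimal sketch of the block-splitting step follows (the block size and overlap values are arbitrary assumptions, and edge blocks may come out smaller than the nominal block size); the starting offsets returned for each block allow per-block detections to be mapped back and fused, e.g. by weighted averaging in the overlapping regions:

```python
import numpy as np

def split_into_blocks(volume, block_size=(128, 128, 128), overlap=(32, 32, 32)):
    """Sketch: divide a 3D medical image into overlapping 3D image blocks.
    Returns each block together with its starting offset in the volume.
    """
    blocks = []
    steps = [b - o for b, o in zip(block_size, overlap)]
    for z in range(0, max(volume.shape[0] - overlap[0], 1), steps[0]):
        for y in range(0, max(volume.shape[1] - overlap[1], 1), steps[1]):
            for x in range(0, max(volume.shape[2] - overlap[2], 1), steps[2]):
                block = volume[z:z + block_size[0],
                               y:y + block_size[1],
                               x:x + block_size[2]]  # edge blocks may be smaller
                blocks.append(((z, y, x), block))
    return blocks
```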
In some embodiments, the object detection device 500 may further comprise a preprocessing unit (not shown) that may be configured to normalize the CT value for each pixel point in the medical image.
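For illustration, a simple normalization of CT values might look like the following sketch; the Hounsfield-unit window used here is an assumed example rather than a value specified in the text:

```python
import numpy as np

def normalize_ct(volume_hu, hu_min=-1000.0, hu_max=400.0):
    """Sketch: normalize the CT value (Hounsfield units) of each pixel point
    to [0, 1] before feature extraction."""
    clipped = np.clip(volume_hu, hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)
```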
In some implementations, the feature determination network may be an artificial neural network such as an Hourglass network, a U-net network, or the like. The artificial neural network may be a convolutional neural network comprising one or more convolutional layers. The result output by any one layer (e.g., the last layer or any intermediate layer) of the artificial neural network may be used as a feature map of the medical image.
In one example, the medical image may be processed using an hourglass network stacked twice. The medical image may be processed with at least one convolution layer to obtain a convolved medical image. A feature map of the medical image may then be determined by at least one downsampling and at least one upsampling of the convolved medical image with at least one downsampling layer and at least one upsampling layer, where the feature map may be smaller in size than the medical image.
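The sketch below shows the general shape of such a feature determination network in PyTorch, reduced to plain convolution, downsampling, and upsampling layers (class name, channel counts, and the 1/4 output scale are assumptions for illustration); a real stacked hourglass or U-net would add more levels and skip connections:

```python
import torch.nn as nn

class SimpleFeatureNet(nn.Module):
    """Sketch of a feature determination network: convolution, downsampling and
    upsampling layers producing a feature map smaller than the input image
    (here 1/4 of the input size)."""
    def __init__(self, in_channels=1, features=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, features, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(2),                                   # 1/2 of the input size
            nn.Conv3d(features, features, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(2),                                   # 1/4
            nn.Conv3d(features, features, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(2),                                   # 1/8
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False),
            nn.Conv3d(features, features, 3, padding=1), nn.ReLU(inplace=True),
        )                                                      # back up to 1/4

    def forward(self, x):
        return self.decoder(self.encoder(x))
```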
When the feature map is smaller, it contains less information from the medical image, but the amount of calculation required for the target detection process is correspondingly reduced. When the feature map is larger, it contains more information from the medical image, but the amount of calculation required for the target detection process increases. Thus, one skilled in the art can determine the structure of the network based on the actual situation and the desired feature map.
The center point prediction unit 520 may be configured to determine a target center point prediction graph according to the feature graph, wherein the target center point prediction graph comprises a plurality of first pixel points, and a value of each first pixel point indicates a likelihood that the first pixel point is a predicted target center point.
In some embodiments, the center point of the target may be determined by extracting the pixel maxima of the image in the horizontal and vertical directions. The feature map output by the feature map determination unit 510 may be processed using a center point prediction network to determine the center point of a target object present in the medical image. The center point prediction network may be implemented as a convolutional neural network including at least one convolutional layer.
The feature map may be convolved with a convolution layer in the central point prediction network to determine a convolved feature map. In one implementation, a maximum horizontal pixel value of the convolved feature map in a horizontal direction and a maximum vertical pixel value of the convolved feature map in a vertical direction may be determined. For example, for each row of pixels in the convolved feature map, a maximum horizontal pixel value for that row may be determined. For each column of pixels in the convolved feature map, a maximum vertical pixel value for that column may be determined.
Pixel values of pixel points in the target center point prediction graph may be determined based on the maximum horizontal pixel value and the maximum vertical pixel value, so that the value of a pixel in the target center point prediction graph indicates the likelihood that the pixel is a predicted target center point. In one implementation, the larger the value of the pixel, the more likely the pixel is a predicted target center point. The predicted target center point may be determined in the target center point prediction graph using a preset prediction threshold. For example, a pixel point in the target center point prediction map whose value is greater than the prediction threshold may be determined as a predicted target center point. In other implementations, the probability that a pixel point is a predicted target center point may also be determined based on the values of the pixel points in the target center point prediction graph.
In some embodiments, for each pixel point, the center point prediction unit 520 may search, along the horizontal direction, the values of all pixels from the pixel point to the first horizontal edge of the feature map to determine a first maximum horizontal pixel value; search, along the horizontal direction, the values of all pixels from the pixel point to the second horizontal edge of the feature map to determine a second maximum horizontal pixel value; search, along the vertical direction, the values of all pixels from the pixel point to the first vertical edge of the feature map to determine a first maximum vertical pixel value; and search, along the vertical direction, the values of all pixels from the pixel point to the second vertical edge of the feature map to determine a second maximum vertical pixel value. The first maximum horizontal pixel value, the second maximum horizontal pixel value, the first maximum vertical pixel value, and the second maximum vertical pixel value may then be summed to determine the value of the corresponding pixel point in the target center point prediction graph.
For each point in the feature map, the value of the corresponding pixel point in the target center point prediction map may thus be determined according to the sum of the first maximum horizontal pixel value and/or the second maximum horizontal pixel value and the first maximum vertical pixel value and/or the second maximum vertical pixel value.
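A compact way to realize this directional maximum search is with cumulative maxima along each axis, as in the following 2D sketch (the function name is illustrative, and a 3D version would simply add a third axis):

```python
import torch

def center_pooling_2d(x):
    """Sketch of the directional maximum search: for each pixel, take the
    maximum value towards each horizontal edge and each vertical edge and sum
    them, so a pixel responds strongly when it sits at the crossing of
    horizontally and vertically salient responses.

    x: (H, W) convolved feature map (a single channel, 2D for simplicity)
    """
    # Running maxima from each pixel towards the left/right edges.
    left = torch.cummax(x, dim=1).values
    right = torch.cummax(x.flip(1), dim=1).values.flip(1)
    # Running maxima from each pixel towards the top/bottom edges.
    top = torch.cummax(x, dim=0).values
    bottom = torch.cummax(x.flip(0), dim=0).values.flip(0)
    return left + right + top + bottom
```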
Further, while the principles of the present application have been described above by taking the example of finding the maximum value in the feature map in the horizontal direction and the vertical direction, the scope of the present application is not limited thereto. In fact, the maximum value of the pixels in the feature map can also be found in any direction in the feature map.
The value of each pixel point in the target center point prediction graph can be determined by the above method. In some embodiments, the size of the target center point prediction graph and the size of the feature graph may be equal. Each point in the target center point prediction graph may be used to represent the probability that the at least one pixel in the medical image corresponding to that point is the center point of a target. In one implementation, the target center point prediction graph may include at least one channel, where the value of a pixel in each channel indicates the probability that the pixel belongs to a predicted target center point of a target of the corresponding class. One skilled in the art may designate the target class corresponding to each channel during the training process, so that the method provided by the present application can perform target detection for multiple types of targets. For example, for a lung CT image, target detection may be performed for multiple types of lung lesions such as lung nodules, cord-like lung shadows (fibrous streaks), lymph node calcifications, arteriosclerosis, and the like.
The size prediction unit 530 may be configured to determine values of a plurality of second pixels included in the target size prediction graph according to the feature graph, wherein each second pixel corresponds to one of the first pixels, and the value of each second pixel indicates a predicted size to which the first pixel corresponding to the second pixel is applicable. In some embodiments, the feature map output by the feature map determination unit 510 may be processed with a size prediction network to determine the size of the object present in the medical image. The size prediction network may be implemented as an artificial neural network. For example, the size prediction network may include at least one convolution layer. The feature map may be convolved with a convolution layer in the dimension prediction network, and the result output by the dimension prediction network may be used as the dimension prediction map. Wherein the size of the size prediction map may be the same as the size of the center point prediction map, each point in the size prediction map representing the size for a corresponding pixel point at the same location in the center point prediction map.
In one implementation, for each class of objects, the size prediction graph has at least one channel, the value of a pixel of each channel indicating the prediction length, in a predetermined size direction, applicable to the pixel in the center point prediction graph corresponding to that pixel when it serves as a prediction target center point of the class. For example, if the object is represented as a cuboid and the center point prediction graph has n channels, the size prediction graph may have 3×n channels, representing the size of the corresponding type of object in the length, width, and height directions. For another example, if the object is represented as a sphere and the center point prediction graph has n channels, the size prediction graph may have n channels, each channel representing the size of the corresponding type of object in the radial direction.
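The channel arithmetic can be captured in a small helper such as the following sketch (the names and the choice of 1×1×1 convolutions are assumptions for illustration):

```python
import torch.nn as nn

def build_prediction_heads(feature_channels=64, num_classes=4, box_mode="cuboid"):
    """Sketch: build center point and size prediction heads on top of the
    feature map. With n target classes, the center head outputs n channels;
    the size head outputs 3*n channels for cuboid targets (length/width/height
    per class) or n channels for spherical targets (one radius per class)."""
    size_channels = 3 * num_classes if box_mode == "cuboid" else num_classes
    center_head = nn.Conv3d(feature_channels, num_classes, kernel_size=1)
    size_head = nn.Conv3d(feature_channels, size_channels, kernel_size=1)
    return center_head, size_head
```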
The target prediction unit 540 may be configured to determine a predicted target center point and a predicted target size from the target center point prediction map and the target size prediction map.
In some embodiments, when the center point prediction graph includes at least one channel indicating at least one category of targets, the predicted target center point and the predicted target size include a predicted target center point for the at least one category of targets and a predicted target size for the category.
In some embodiments, a probability value for each pixel in the target center point prediction graph that belongs to the predicted target center point may be determined from the value of the pixel. For example, the central point prediction map may be processed with an activation function sigmoid to obtain a central point prediction heat map, where each pixel point in the central point prediction heat map represents a probability value that the pixel point belongs to a prediction target central point. Pixels in the target center point prediction graph having a probability value greater than a preset probability threshold may be determined as prediction target center points.
The predicted size for the predicted target center point may be determined in the size prediction map based on the position coordinates of the predicted target center point in the center point prediction map. Since the size of the size prediction map and the size of the center point prediction map are the same, the value of a pixel point in the size prediction map having the same position coordinates as the prediction target center point can be determined as the prediction size for the prediction target center point.
The target determination unit 550 may be configured to determine a position and a size of a target object in the medical image based on the predicted target center point and the predicted target size.
In some embodiments, the target determination unit 550 may be configured to determine a mapping relationship between the feature map and the medical image. As described above, the feature map output through the feature determination network has a predetermined size mapping relationship with the original medical image (for example, the size of the feature map is 1/4 of the size of the original medical image). Thus, according to this size mapping relationship, the target position in the medical image corresponding to the predicted target center point determined using the center point prediction map can be determined.
The target determination unit 550 may be further configured to determine a position of the target object and a candidate target size in the medical image based on the predicted target center point, the predicted target size, and the mapping relationship.
An approximation may exist in mapping the original medical image into a feature map. For example, when the size of the original medical image is not divisible by the coefficients of the downsampling layer (e.g., 2 times downsampling), the feature map size needs to be rounded during the downsampling process. In this case, when the position and size of the predicted target center point are mapped directly back to the medical image in the above-described manner, there may be a deviation in the obtained result. In this case, the target size obtained directly using the mapping relationship may be determined as a candidate target size, and the candidate target size may be adjusted according to the deviation to determine a final target size.
Thus, in one implementation, the target determination unit 550 may be further configured to determine an offset for the candidate target size from the feature map. In some implementations, an offset for the candidate target size may be determined using an offset prediction network. The offset prediction network may include at least one convolutional layer. The feature map may be convolved with the offset prediction network to obtain an offset prediction map, where the offset prediction map may have the same size and the same number of channels as the size prediction map. The value of each pixel point in the offset prediction map may indicate an offset for the corresponding dimension in the size prediction map.
Accordingly, after the coordinates of the predicted center point and the predicted size for the predicted center point are determined based on the foregoing method, the size offset for the center point can be determined based on the coordinates of the predicted center point. The offset may be added to the candidate target size so that the candidate target size can be adjusted to determine the final target size.
By using the target detection device provided by the present application, the position and the size of the target object present in the image can be determined by detecting the center point and the size of the target frame, without predefining a reference frame. Because only single-stage detection of key points is needed, the target detection device provided by the present application performs target detection quickly, saves a large amount of computing resources, and can obtain good target detection results.
Fig. 6 shows a schematic block diagram of a medical electronic device according to an embodiment of the present application. As shown in fig. 6, the medical electronic device 600 may include an image acquisition unit 610, a feature map determination unit 620, a center point prediction unit 630, a size prediction unit 640, a target prediction unit 650, and a target determination unit 660.
The image acquisition unit 610 may be used for acquiring medical images. The medical image may be acquired by CT, MRI, ultrasound, X-ray, nuclide imaging (such as SPECT, PET), or the like, or may be an image showing physiological information of a human body such as electrocardiogram, electroencephalogram, or optical photography.
The feature map determining unit 620, the center point predicting unit 630, the size predicting unit 640, the target predicting unit 650, and the target determining unit 660 may be implemented as the feature map determining unit 510, the center point predicting unit 520, the size predicting unit 530, the target predicting unit 540, and the target determining unit 550 shown in fig. 5, and will not be described again.
In some implementations, the medical electronic device provided herein may be any medical imaging device such as a CT, MRI, ultrasound, or X-ray instrument. The image acquisition unit 610 may be implemented as an imaging unit of the medical imaging device described above, and the feature map determination unit 620, the center point prediction unit 630, the size prediction unit 640, the target prediction unit 650, and the target determination unit 660 may be implemented by an internal processing unit (e.g., a processor) of the medical imaging device.
Fig. 7A illustrates an exemplary process of detecting medical images according to an embodiment of the present application. As shown in fig. 7A, the image 710 is a medical image to be detected. A feature map 720 for the image 710 can be determined after the image 710 is processed using a feature map determination network (Hourglass Net).
Feature map 720 may be processed with a center point prediction network, a size prediction network, and an offset prediction network, respectively, to obtain a target center point prediction map 730, a size prediction map 740, and an offset prediction map 750. By combining the information in the target center point prediction map 730, the size prediction map 740, and the offset prediction map 750, the target detection result of at least one type of target can be determined in the image 710.
Fig. 7B illustrates one example of dividing the medical image 710 to obtain a plurality of image blocks. The medical image 710 may be divided into a plurality of image blocks having overlapping portions, and the object detection process illustrated in fig. 7A is performed for each image block. The target detection result for the entire medical image 710 may then be obtained by integrating the target detection results for the individual image blocks.
Fig. 8 shows an example of performing a target detection process with an electronic device according to an embodiment of the present application. As shown in fig. 8, the front end 810 may be used to acquire and provide medical images, and the back end 820 may be used to perform the object detection methods provided herein. For example, the back end 820 may be implemented as the object detection device of fig. 5. The back end 820 may then return the target detection results (e.g., lesion localization results) to the front end 810, which presents them to the user (e.g., a doctor or patient). The front end 810 may present the results of the target detection by any means such as images, text, video, audio, etc.
Fig. 9A illustrates an exemplary process of performing object detection based on a reference frame. As shown in fig. 9A, candidate regions may be determined using a preset reference frame, and the candidate regions may then be identified using an RCNN network. Therefore, the target detection method based on the process shown in fig. 9A is a two-stage method, which requires a large amount of computing resources.
Fig. 9B shows an exemplary process for target detection of lung nodules using 3D segmentation. In the detection process shown in fig. 9B, the detection result depends on fine-grained annotation data for lung nodule segmentation, and target detection cannot be performed for other lesion features.
Furthermore, methods or apparatus according to embodiments of the present application may also be implemented by way of the architecture of the computing device shown in fig. 10. Fig. 10 illustrates an architecture of such a computing device. As shown in fig. 10, the computing device 1000 may include a bus 1010, one or more CPUs 1020, a read-only memory (ROM) 1030, a random access memory (RAM) 1040, a communication port 1050 connected to a network, an input/output component 1060, a hard disk 1070, and the like. A storage device in the computing device 1000, such as the ROM 1030 or the hard disk 1070, may store various data or files used for processing and/or communication of the object detection method provided herein, as well as program instructions executed by the CPU. The computing device 1000 may also include a user interface 1080. Of course, the architecture shown in fig. 10 is merely exemplary, and one or more components of the computing device shown in fig. 10 may be omitted as practically needed when implementing different devices.
Embodiments of the present application may also be implemented as a computer-readable storage medium. A computer readable storage medium according to an embodiment of the present application has computer readable instructions stored thereon. The computer readable instructions, when executed by a processor, may perform a method according to embodiments of the present application described with reference to the above figures. The computer-readable storage medium includes, but is not limited to, for example, volatile memory and/or nonvolatile memory. The volatile memory may include, for example, random Access Memory (RAM) and/or cache memory (cache), and the like. The non-volatile memory may include, for example, read Only Memory (ROM), hard disk, flash memory, and the like.
Those skilled in the art will appreciate that various modifications and improvements may be made to the disclosure herein. For example, the various devices or components described above may be implemented in hardware, or may be implemented in software, firmware, or a combination of some or all of the three.
Furthermore, as shown in the present application and in the claims, unless the context clearly dictates otherwise, the words "a," "an," and/or "the" are not specific to the singular and may also include the plural. In general, the terms "comprises" and "comprising" merely indicate that explicitly identified steps and elements are included; they do not constitute an exclusive list, as a method or apparatus may also include other steps or elements.
Furthermore, although the present application makes various references to certain elements in a system according to embodiments of the present application, any number of different elements may be used and run on a client and/or server. The units are merely illustrative and different aspects of the systems and methods may use different units.
Furthermore, flowcharts are used in this application to describe the operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously. Also, other operations may be added to or removed from these processes.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the following claims. It is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the claims and their equivalents.

Claims (13)

1. A method of object detection of a medical image, comprising:
determining a feature map of the medical image;
determining a target central point prediction graph according to the feature graph, wherein the target central point prediction graph comprises a plurality of first pixel points, and the value of each first pixel point indicates the possibility that the first pixel point is a prediction target central point;
determining a target size prediction graph according to the feature graph, wherein the target size prediction graph comprises a plurality of second pixel points, each second pixel point corresponds to one first pixel point, and the value of each second pixel point indicates the prediction size applicable to the first pixel point corresponding to the second pixel point;
determining a predicted target center point and a predicted target size according to the target center point prediction graph and the target size prediction graph; and
determining a position and a size of a target object in the medical image based on the predicted target center point and the predicted target size,
wherein determining a predicted target center point and a predicted target size according to the target center point prediction graph and the target size prediction graph comprises:
determining a probability value of each first pixel point belonging to the predicted target center point according to the value of the first pixel point in the target center point predicted graph;
determining a first pixel point with a probability value larger than a preset probability threshold value as the predicted target center point; and
determining a predicted target size corresponding to the predicted target center point in the target size prediction graph according to the coordinates of the predicted target center point in the target center point prediction graph; and
wherein determining the position and size of the target object in the medical image based on the predicted target center point and the predicted target size comprises:
determining a mapping relationship between the feature map and the medical image;
determining a position of the target object and a candidate target size in the medical image based on the predicted target center point, the predicted target size, and the mapping relationship;
determining an offset for the candidate target size according to the feature map; and
a size of the target is determined in the medical image based on the offset and the candidate target size.
2. The object detection method according to claim 1, wherein determining an object center point prediction graph from the feature graph comprises:
performing convolution processing on the feature map to determine a convolved feature map, wherein the convolved feature map comprises a plurality of third pixel points;
for each third pixel point in the convolved feature map,
determining a maximum horizontal pixel value of the convolved feature map in the horizontal direction,
determining a maximum vertical pixel value of the convolved feature map in a vertical direction, and
a value of a first pixel point in the target center point prediction graph is determined based on the maximum horizontal pixel value and the maximum vertical pixel value.
3. The object detection method according to claim 2, wherein,
determining a maximum horizontal pixel value of the convolved feature map in a horizontal direction includes:
searching the values of all the third pixel points from the third pixel point to the first horizontal edge of the feature map along the horizontal direction, determining the maximum value in the searched values as the first maximum horizontal pixel value, and/or
searching the values of all the third pixel points from the third pixel point to the second horizontal edge of the feature map along the horizontal direction, determining the maximum value in the searched values as the second maximum horizontal pixel value,
determining a maximum vertical pixel value of the convolved feature map in a vertical direction includes:
searching values of all third pixel points from the third pixel point to the first vertical edge of the feature map along the vertical direction, determining the maximum value in the searched values as the first maximum vertical pixel value, and/or
searching values of all third pixel points from the third pixel point to the second vertical edge of the feature map along the vertical direction, determining the maximum value in the searched values as a second maximum vertical pixel value,
determining a value of a first pixel point in the target center point prediction graph based on the maximum horizontal pixel value and the maximum vertical pixel value includes:
and summing the first maximum horizontal pixel value and/or the second maximum horizontal pixel value with the first maximum vertical pixel value and/or the second maximum vertical pixel value, and determining the value of a first pixel point corresponding to the third pixel point in the target central point prediction graph.
4. The object detection method of claim 1, wherein the object center point prediction graph includes at least one channel, the value of the first pixel in each channel indicating a likelihood that the first pixel is a predicted object center point for an object of a corresponding class.
5. The object detection method according to claim 1, wherein determining an object size prediction graph from the feature graph comprises:
and carrying out convolution processing on the feature map to obtain the target size prediction map, wherein the target size prediction map comprises at least one channel for each category of prediction target center point, and the value of the second pixel point of each channel indicates the prediction length in the preset size direction, which is applicable to the first pixel point corresponding to the second pixel point serving as the prediction target center point of the category.
6. The target detection method of claim 1, wherein the predicted target center point and the predicted target size comprise a predicted target center point for at least one class of target objects and a predicted target size for the class of target objects.
7. The object detection method according to claim 1, wherein the medical image is a 3D CT image, the method further comprising a preprocessing step, prior to determining a feature map of the medical image, the preprocessing comprising normalizing according to the CT value for each pixel point in the medical image.
8. The object detection method of claim 1, wherein determining a feature map of the medical image comprises:
processing the medical image by using at least one convolution layer to obtain a convolved medical image;
at least one downsampling and at least one upsampling of the convolved medical image with at least one downsampling layer and at least one upsampling layer to determine a feature map of the medical image, wherein the feature map is smaller in size than the medical image.
9. The object detection method of claim 1, wherein the feature map, the object center point prediction map, and the object size prediction map are generated by at least one convolutional network trained by:
determining a training dataset, wherein the training dataset comprises at least one training image, the at least one training image being marked with a true position and a true size of a true target object;
for each training image:
determining a training feature map of the training image;
determining a training target center point prediction graph according to the training feature graph;
determining a training target size prediction graph according to the training feature graph;
Determining a training predicted target center point and a training predicted target size according to the training target center point prediction graph and the training target size prediction graph;
determining a training position and a training size of a training target object in the training image based on the training prediction target center point and the training prediction target size;
parameters of the at least one convolution network are adjusted to minimize a loss between the training position and training size of the training target object in a training image and the real position and real size of the real target object.
10. The target detection method of claim 9, wherein the penalty comprises at least one of:
a difference between the training position and the true position; and
the difference between the training size and the real size.
11. An object detection device of a medical image, comprising:
a feature map determining unit configured to determine a feature map of the medical image;
a center point prediction unit configured to determine a target center point prediction graph from the feature graph, wherein the target center point prediction graph includes a plurality of first pixel points, a value of each first pixel point indicating a likelihood that the first pixel point is a prediction target center point;
a size prediction unit configured to determine a target size prediction graph from the feature graph, wherein the target size prediction graph includes a plurality of second pixel points, each second pixel point corresponding to one of the first pixel points, and a value of each second pixel point indicates a predicted size to which the first pixel point corresponding to the second pixel point is applied;
a target prediction unit configured to determine a predicted target center point and a predicted target size from the target center point prediction map and the target size prediction map;
a target determination unit configured to determine a position and a size of a target object in the medical image based on the predicted target center point and the predicted target size,
wherein the target prediction unit, when determining the predicted target center point and the predicted target size, is configured to:
determining a probability value of each first pixel point belonging to the predicted target center point according to the value of the first pixel point in the target center point predicted graph;
determining a first pixel point with a probability value larger than a preset probability threshold value as the predicted target center point; and
determining a predicted target size corresponding to the predicted target center point in the target size prediction graph according to the coordinates of the predicted target center point in the target center point prediction graph; and
wherein the target determination unit, when determining the position and size of the target object in the medical image, is configured to:
determining a mapping relationship between the feature map and the medical image;
determining a position of the target object and a candidate target size in the medical image based on the predicted target center point, the predicted target size, and the mapping relationship;
determining an offset for the candidate target size according to the feature map; and
a size of the target is determined in the medical image based on the offset and the candidate target size.
12. An object detection apparatus of a medical image, comprising:
one or more processors; and
one or more memories having stored therein computer readable code which, when executed by the one or more processors, performs the object detection method of any of claims 1-10.
13. A computer readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the object detection method of any of claims 1-10.
CN201911087819.5A 2019-11-08 2019-11-08 Target detection method, device, equipment and storage medium for medical image Active CN110838125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911087819.5A CN110838125B (en) 2019-11-08 2019-11-08 Target detection method, device, equipment and storage medium for medical image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911087819.5A CN110838125B (en) 2019-11-08 2019-11-08 Target detection method, device, equipment and storage medium for medical image

Publications (2)

Publication Number Publication Date
CN110838125A CN110838125A (en) 2020-02-25
CN110838125B true CN110838125B (en) 2024-03-19

Family

ID=69574734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911087819.5A Active CN110838125B (en) 2019-11-08 2019-11-08 Target detection method, device, equipment and storage medium for medical image

Country Status (1)

Country Link
CN (1) CN110838125B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429463A (en) * 2020-03-04 2020-07-17 北京三快在线科技有限公司 Instance splitting method, instance splitting device, electronic equipment and storage medium
CN111402228B (en) * 2020-03-13 2021-05-07 腾讯科技(深圳)有限公司 Image detection method, device and computer readable storage medium
CN111611947B (en) * 2020-05-25 2024-04-09 济南博观智能科技有限公司 License plate detection method, device, equipment and medium
CN111933274A (en) * 2020-07-15 2020-11-13 平安科技(深圳)有限公司 Disease classification diagnosis method and device, electronic equipment and storage medium
CN112017161A (en) * 2020-08-06 2020-12-01 杭州深睿博联科技有限公司 Pulmonary nodule detection method and device based on central point regression
WO2022105622A1 (en) * 2020-11-18 2022-05-27 北京有竹居网络技术有限公司 Image segmentation method and apparatus, readable medium, and electronic device
CN112200862B (en) * 2020-12-01 2021-04-13 北京达佳互联信息技术有限公司 Training method of target detection model, target detection method and device
CN113220748B (en) * 2021-05-21 2023-10-27 国网江苏省电力有限公司镇江供电分公司 Method and system for constructing power distribution network equipment load thermodynamic diagram and data analysis
CN113486759B (en) * 2021-06-30 2023-04-28 上海商汤临港智能科技有限公司 Dangerous action recognition method and device, electronic equipment and storage medium
CN114429678A (en) * 2022-01-28 2022-05-03 北京百度网讯科技有限公司 Model training method and device, electronic device and medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2704037A1 (en) * 2007-10-26 2009-04-30 Thales Method for detecting a target
WO2013135962A1 (en) * 2012-03-14 2013-09-19 Mirasys Oy A method, an apparatus and a computer program for predicting a position of an object in an image of a sequence of images
GB201415251D0 (en) * 2014-08-28 2014-10-15 Canon Kk Transformation of 3-d object for object segmentation in 3-d medical image
GB201415252D0 (en) * 2014-08-28 2014-10-15 Canon Kk Scale estimation for object segmentation in a medical image
WO2018079252A1 (en) * 2016-10-27 2018-05-03 株式会社デンソー Object detecting device
CN109446925A (en) * 2018-10-08 2019-03-08 中山大学 A kind of electric device maintenance algorithm based on convolutional neural networks
CN109903310A (en) * 2019-01-23 2019-06-18 平安科技(深圳)有限公司 Method for tracking target, device, computer installation and computer storage medium
CN110163057A (en) * 2018-10-29 2019-08-23 腾讯科技(深圳)有限公司 Object detection method, device, equipment and computer-readable medium
CN110287817A (en) * 2019-06-05 2019-09-27 北京字节跳动网络技术有限公司 Target identification and the training method of Model of Target Recognition, device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10229347B2 (en) * 2017-05-14 2019-03-12 International Business Machines Corporation Systems and methods for identifying a target object in an image

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2704037A1 (en) * 2007-10-26 2009-04-30 Thales Method for detecting a target
WO2013135962A1 (en) * 2012-03-14 2013-09-19 Mirasys Oy A method, an apparatus and a computer program for predicting a position of an object in an image of a sequence of images
GB201415251D0 (en) * 2014-08-28 2014-10-15 Canon Kk Transformation of 3-d object for object segmentation in 3-d medical image
GB201415252D0 (en) * 2014-08-28 2014-10-15 Canon Kk Scale estimation for object segmentation in a medical image
WO2018079252A1 (en) * 2016-10-27 2018-05-03 株式会社デンソー Object detecting device
CN109446925A (en) * 2018-10-08 2019-03-08 中山大学 A kind of electric device maintenance algorithm based on convolutional neural networks
CN110163057A (en) * 2018-10-29 2019-08-23 腾讯科技(深圳)有限公司 Object detection method, device, equipment and computer-readable medium
CN109903310A (en) * 2019-01-23 2019-06-18 平安科技(深圳)有限公司 Method for tracking target, device, computer installation and computer storage medium
CN110287817A (en) * 2019-06-05 2019-09-27 北京字节跳动网络技术有限公司 Target identification and the training method of Model of Target Recognition, device and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Tianyi Zhang; Zhong Yang; Xiang Zhang; Yangyang Shen. A vision system for multi-rotor aircraft to track moving object. 2016 IEEE Chinese Guidance, Navigation and Control Conference (CGNCC). 2016, full text. *
Xiong Chengyi; Dong Chaonan. Improved fractional-pixel motion estimation algorithm based on center point prediction. Journal of South-Central University for Nationalities (Natural Science Edition), (01), full text. *
Liu Xue. Research on moving target detection and tracking algorithms based on image sequences. CNKI, full text. *

Also Published As

Publication number Publication date
CN110838125A (en) 2020-02-25

Similar Documents

Publication Publication Date Title
CN110838125B (en) Target detection method, device, equipment and storage medium for medical image
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
CN111429460B (en) Image segmentation method, image segmentation model training method, device and storage medium
CN111931764B (en) Target detection method, target detection frame and related equipment
CN112990010B (en) Point cloud data processing method and device, computer equipment and storage medium
An et al. Medical image segmentation algorithm based on multilayer boundary perception-self attention deep learning model
CN111695673B (en) Method for training neural network predictor, image processing method and device
JP2008507330A (en) Toboggan-based cluster object characterization system and method
CN112614133B (en) Three-dimensional pulmonary nodule detection model training method and device without anchor point frame
CN112287954A (en) Image classification method, training method of image classification model and device thereof
CN112419326B (en) Image segmentation data processing method, device, equipment and storage medium
US20230326173A1 (en) Image processing method and apparatus, and computer-readable storage medium
CN112365523A (en) Target tracking method and device based on anchor-free twin network key point detection
CN110765882A (en) Video tag determination method, device, server and storage medium
CN110570394A (en) medical image segmentation method, device, equipment and storage medium
CN112256899B (en) Image reordering method, related device and computer readable storage medium
CN111091010A (en) Similarity determination method, similarity determination device, network training device, network searching device and storage medium
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN115861248A (en) Medical image segmentation method, medical model training method, medical image segmentation device and storage medium
Guo et al. Salient object detection from low contrast images based on local contrast enhancing and non-local feature learning
Wang et al. Salient object detection using biogeography-based optimization to combine features
CN113256622A (en) Target detection method and device based on three-dimensional image and electronic equipment
CN116128876B (en) Medical image classification method and system based on heterogeneous domain
WO2023207531A1 (en) Image processing method and related device
CN110570417B (en) Pulmonary nodule classification device and image processing equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40021991

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant