WO2022151755A1

WO2022151755A1 - Target detection method and apparatus, and electronic device, storage medium, computer program product and computer program

Info

Publication number: WO2022151755A1
Application number: PCT/CN2021/119982
Authority: WO
Inventors: 王娜; 宋涛; 刘星龙; 黄宁; 张少霆
Original assignee: 上海商汤智能科技有限公司
Priority date: 2021-01-15
Filing date: 2021-09-23
Publication date: 2022-07-21
Also published as: CN112785565B; CN112785565A

Abstract

A target detection method and apparatus, and an electronic device, a storage medium, a computer program product and a computer program. The method comprises: performing feature extraction on a first image to be detected, so as to obtain a first feature map for a plurality of scales of the first image (S11); and processing the first feature map for the plurality of scales of the first image by means of a trained target detection network, so as to obtain the location of a first object of a target category in the first image (S12).

Description

Target detection method and apparatus, electronic device, storage medium, computer program product and computer program

CROSS-REFERENCE TO RELATED APPLICATIONS

The present disclosure is based on Chinese patent application No. 202110057241.X, filed on January 15, 2021 and entitled "Target detection method and apparatus, electronic device and storage medium", and claims priority to that Chinese patent application, the entire content of which is incorporated herein by reference.

TECHNICAL FIELD

The present disclosure relates to, but is not limited to, the field of computer technology, and in particular, to a target detection method and apparatus, electronic equipment, storage medium, computer program product, and computer program.

BACKGROUND

Pulmonary nodules are a common lesion, and the characteristics of nodules often indicate the nature of lung disease. The detection of pulmonary nodules is of great significance to determine whether the lesion is lung cancer. The early detection, diagnosis and treatment of pulmonary nodules are beneficial to the early diagnosis and treatment of lung cancer and the key to reducing the mortality of lung cancer. Pulmonary nodules can be detected based on Computed Tomography (CT) images.

SUMMARY OF THE INVENTION

The embodiments of the present disclosure provide a target detection method and apparatus, electronic equipment, storage medium, computer program product and computer program, which not only improve the sensitivity of target detection, but also improve the accuracy of target detection.

An embodiment of the present disclosure provides a target detection method, including: performing feature extraction on a first image to be detected to obtain first feature maps of multiple scales of the first image; and processing the first feature maps of multiple scales of the first image through a trained target detection network to obtain the position of a first object of the target category existing in the first image; wherein the target detection network is trained in a recursive manner; the target detection network includes a classification sub-network, a regression sub-network and a segmentation sub-network; the classification sub-network is used to determine whether the first object exists in the first image, the regression sub-network is used to determine the bounding box of the first object existing in the first image, and the segmentation sub-network is used to determine the outline of the first object existing in the first image.

In the embodiments of the present disclosure, on the one hand, the target detection network is trained based on multi-task learning of classification, regression and segmentation, and the correlation between tasks is used to improve the ability to recognize objects of the target category; on the other hand, the target detection network is trained with a recursive, phased training strategy, which improves not only the sensitivity but also the accuracy of target detection.

In some embodiments, the method further includes: training the target detection network according to a first training set to obtain a target detection network in a first state, where the first training set includes a plurality of sample images and first annotation information of the sample images, and the first annotation information includes the real position of a second object in the sample image; processing the sample image through the target detection network in the first state to obtain the predicted position of the second object in the sample image; determining, according to the predicted position and the real position of the second object, the false positive region, the false negative region and the true positive region in the sample image; and training the target detection network in the first state according to a second training set to obtain the trained target detection network, where the second training set includes the plurality of sample images and second annotation information of the sample images, and the second annotation information includes the false positive region, the false negative region and the true positive region in the sample image.

In the embodiment of the present disclosure, the training process of the target detection network is divided into two stages. In the first stage, the focus is on sensitivity, so that the target detection network can find as many suspected first objects as possible; in the second stage, the focus is on accuracy, so that the target detection network can achieve relatively high accuracy while maintaining high sensitivity.
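The step of sorting detections into true positive, false positive and false negative regions can be sketched as follows. The disclosure does not state how a predicted position is compared with a real position, so the hit criterion used here (the centre of a predicted box falls inside a ground-truth box) is an assumption, as are the names `split_regions`, `center` and `inside` and the `(z1, y1, x1, z2, y2, x2)` box layout.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float, float, float]  # (z1, y1, x1, z2, y2, x2)


def center(box: Box) -> Tuple[float, ...]:
    """Geometric centre of an axis-aligned 3D box."""
    return tuple((box[i] + box[i + 3]) / 2.0 for i in range(3))


def inside(point: Sequence[float], box: Box) -> bool:
    """True if the point lies inside the box (borders included)."""
    return all(box[i] <= point[i] <= box[i + 3] for i in range(3))


def split_regions(predicted: List[Box], truth: List[Box]):
    """Split boxes into true positive, false positive and false negative regions.

    Assumed hit criterion: a predicted box counts as a hit when its centre
    falls inside a ground-truth box. The disclosure does not fix this rule.
    """
    true_pos, false_pos, found = [], [], set()
    for p in predicted:
        hits = [i for i, t in enumerate(truth) if inside(center(p), t)]
        if hits:
            true_pos.append(p)
            found.update(hits)
        else:
            false_pos.append(p)
    # Ground-truth boxes that no prediction hit are the missed objects.
    false_neg = [t for i, t in enumerate(truth) if i not in found]
    return true_pos, false_pos, false_neg
```

The three returned lists correspond to the regions that make up the second annotation information used in the second training stage.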

In some embodiments, the plurality of sample images include positive sample images and negative sample images, and the method further includes: cropping an annotated second image to obtain positive sample images and negative sample images of a preset size, where each positive sample image includes at least one second object and each negative sample image does not include the second object.

In this way, the problem that the second image cannot be processed directly, for reasons such as the large amount of data contained in the second image and the limited video memory of the graphics processing unit (GPU), can be alleviated.
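A minimal sketch of the cropping step, assuming cubic patches taken on a regular grid. The patch size of 96, the stride, the half-open voxel box layout and all function names are illustrative assumptions; the disclosure only says that the patches have a preset size.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int, int, int]  # (z1, y1, x1, z2, y2, x2), half-open voxel ranges
Origin = Tuple[int, int, int]


def crop_origins(shape: Tuple[int, int, int], patch: int, stride: int) -> List[Origin]:
    """Corner coordinates of every patch of side `patch` on a regular grid."""
    return [(z, y, x)
            for z in range(0, shape[0] - patch + 1, stride)
            for y in range(0, shape[1] - patch + 1, stride)
            for x in range(0, shape[2] - patch + 1, stride)]


def contains(origin: Origin, patch: int, box: Box) -> bool:
    """True if the box lies completely inside the patch."""
    return all(o <= box[i] and box[i + 3] <= o + patch for i, o in enumerate(origin))


def overlaps(origin: Origin, patch: int, box: Box) -> bool:
    """True if the box and the patch share at least one voxel."""
    return all(box[i] < o + patch and o < box[i + 3] for i, o in enumerate(origin))


def label_patches(shape, boxes: List[Box], patch: int = 96, stride: int = 96):
    """Return (positive_origins, negative_origins) for an annotated volume."""
    positives, negatives = [], []
    for origin in crop_origins(shape, patch, stride):
        if any(contains(origin, patch, b) for b in boxes):
            positives.append(origin)      # holds at least one whole object
        elif not any(overlaps(origin, patch, b) for b in boxes):
            negatives.append(origin)      # holds no part of any object
        # Patches that only clip part of an object are skipped in this sketch.
    return positives, negatives
```

Only the patch origins are computed here; slicing the actual volume at those origins keeps each training input small enough for limited GPU memory.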

In some embodiments, the real position of the second object includes a bounding box of the second object, and training the target detection network according to the first training set to obtain the target detection network in the first state includes: performing feature extraction on the sample image to obtain second feature maps of multiple scales of the sample image; determining a plurality of first reference frames in the sample image according to the second feature maps of multiple scales and a plurality of preset anchor frames; determining a preset number of training samples from the plurality of first reference frames according to the bounding box of the second object in the sample image, where the training samples include positive samples whose label information indicates that they belong to the target category and negative samples whose label information indicates that they do not belong to the target category; and training the classification sub-network according to the training samples.

In this way, positive and negative samples can be balanced, overfitting can be avoided, and the classification accuracy of the classification sub-network can be improved.

In some embodiments, determining a preset number of training samples from the plurality of first reference frames according to the bounding box of the second object in the sample image includes: dividing the bounding boxes in the sample image into multiple bounding box sets, where the size of the bounding boxes in each bounding box set falls within a preset size interval; for any bounding box set, removing from the plurality of first reference frames the first reference frames that have already been determined as training samples, to obtain a reference frame set corresponding to the bounding box set; for any bounding box in the bounding box set, determining the positive samples and negative samples corresponding to the bounding box according to the intersection over union (IoU) between the bounding box and each first reference frame in the corresponding reference frame set, where the number of positive samples is negatively correlated with the size interval of the bounding box set; and processing the bounding box sets in order of size interval from small to large to obtain the preset number of training samples.

In this way, both second objects with a larger size and second objects with a smaller size can be taken into account.
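The size-ordered sample selection described above can be sketched as follows, under several assumptions that the disclosure leaves open: the IoU values are precomputed and passed in as `iou_table`, positives must reach `pos_iou`, negatives must stay below `neg_iou`, and each box receives at most as many negatives as positives. All names and thresholds are hypothetical. Reference frames already used are removed before each box is processed, which also covers the per-set removal described in the text.

```python
from typing import List, Sequence, Tuple


def select_training_samples(
    box_sizes: Sequence[float],                 # size of each ground-truth bounding box
    iou_table: Sequence[Sequence[float]],       # iou_table[i][j]: IoU of box i and reference frame j
    size_bins: Sequence[Tuple[float, float]],   # (low, high) intervals, ordered small to large
    positives_per_bin: Sequence[int],           # positives per box; fewer for larger bins
    pos_iou: float = 0.5,                       # assumed threshold for positives
    neg_iou: float = 0.1,                       # assumed threshold for negatives
):
    """Return (positives, negatives) as lists of (box_index, frame_index) pairs."""
    used = set()
    positives: List[Tuple[int, int]] = []
    negatives: List[Tuple[int, int]] = []
    for (low, high), k in zip(size_bins, positives_per_bin):
        for i, size in enumerate(box_sizes):
            if not (low <= size < high):
                continue
            # Drop reference frames that are already training samples.
            free = [j for j in range(len(iou_table[i])) if j not in used]
            by_iou = sorted(free, key=lambda j: iou_table[i][j], reverse=True)
            chosen = [j for j in by_iou if iou_table[i][j] >= pos_iou][:k]
            rejected = [j for j in reversed(by_iou)
                        if iou_table[i][j] < neg_iou][:len(chosen)]
            positives.extend((i, j) for j in chosen)
            negatives.extend((i, j) for j in rejected)
            used.update(chosen)
            used.update(rejected)
    return positives, negatives
```

Because small boxes are handled first and are allowed more positives, they claim well-matching reference frames before larger boxes do, which is one way to read the negative correlation between the number of positive samples and the size interval.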

In some embodiments, training the classification sub-network according to the training samples includes: cropping the second feature map to obtain a third feature map corresponding to the training sample; inputting the third feature map into the classification sub-network to obtain a first probability that the training sample belongs to the target category; determining a first loss of the classification sub-network according to the first probability and the label information of the training sample; and adjusting the network parameters of the classification sub-network according to the first loss.

In this way, the classification of the second object can be made more accurate.

In some embodiments, the real position of the second object includes a bounding box of the second object, and training the target detection network according to the first training set to obtain the target detection network in the first state includes: performing feature extraction on the positive sample image to obtain fourth feature maps of multiple scales of the positive sample image; determining a plurality of second reference frames in the positive sample image according to the fourth feature maps of multiple scales and a plurality of preset anchor frames; and, for any bounding box of the second object in the sample image: determining the intersection over union (IoU) between the bounding box and each of the second reference frames, and determining the second reference frame with the largest IoU as the matching frame corresponding to the bounding box; inputting a fifth feature map corresponding to the matching frame into the regression sub-network to obtain a prediction frame of the matching frame; determining a second loss of the regression sub-network according to the difference between the bounding box and the prediction frame; and adjusting the network parameters of the regression sub-network according to the second loss.

In this way, the position of the second object can be made more accurate.

In some embodiments, determining the second loss of the regression sub-network according to the difference between the bounding box and the prediction frame includes: determining a first regression loss of the matching frame according to the coordinate offset and the intersection over union (IoU) between the bounding box and the prediction frame; determining a second regression loss of the matching frame according to the intersection, the union and the minimum enclosing region of the bounding box and the prediction frame; and determining the second loss of the regression sub-network according to the first regression loss and the second regression loss.

In this way, with the IoU between the prediction frame and the corresponding bounding box as a guide, a prediction frame with a smaller IoU is given a larger loss value, so that when the regression sub-network is trained using the matching frame corresponding to that prediction frame, the parameters of the regression sub-network are updated more strongly.
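A loss built from the intersection, the union and the smallest enclosing region of two boxes matches the form of the generalized IoU (GIoU) loss. The sketch below computes that quantity for axis-aligned 3D boxes; identifying the second regression loss with GIoU is an assumption, and the first regression loss is not sketched because the text does not give its exact form. The function names and box layout are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float, float, float]  # (z1, y1, x1, z2, y2, x2)


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def giou_loss(truth: Box, pred: Box) -> float:
    """1 - GIoU, computed from intersection, union and smallest enclosing box."""
    inter = 1.0
    for i in range(3):
        inter *= max(0.0, min(truth[i + 3], pred[i + 3]) - max(truth[i], pred[i]))
    union = box_volume(truth) + box_volume(pred) - inter
    # Smallest axis-aligned box that encloses both inputs.
    hull = (tuple(min(truth[i], pred[i]) for i in range(3))
            + tuple(max(truth[i + 3], pred[i + 3]) for i in range(3)))
    enclosing = box_volume(hull)
    giou = inter / union - (enclosing - union) / enclosing
    return 1.0 - giou
```

Unlike a plain IoU term, this quantity still changes when the two boxes do not overlap at all, because the enclosing region keeps growing as they move apart.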

In some embodiments, the real position of the second object includes the outline of the second object, and training the target detection network according to the first training set to obtain the target detection network in the first state includes: performing feature extraction on the positive sample image to obtain fourth feature maps of multiple scales of the positive sample image; inputting the fourth feature maps of multiple scales into the segmentation sub-network to obtain a second probability that each pixel of the positive sample image belongs to the target category; determining a third loss of the segmentation sub-network according to the number of pixels in the positive sample image, the outline of the second object in the positive sample image, and the second probability that each pixel belongs to the target category; and adjusting the network parameters of the segmentation sub-network according to the third loss.

In this way, the positioning of the second object can be made more accurate.
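One loss that depends on exactly the three quantities named above (pixel count, object outline, per-pixel probability) is the binary cross-entropy averaged over all pixels. The sketch below uses it as an assumed stand-in; the disclosure does not name the loss, and a Dice-style loss would fit the description equally well. The outline is represented as a flat 0/1 mask, which is also an assumption.

```python
import math
from typing import Sequence


def segmentation_loss(probabilities: Sequence[float], mask: Sequence[int],
                      eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over all pixels.

    probabilities: predicted probability that each pixel belongs to the target category
    mask:          1 for pixels inside the annotated outline, 0 elsewhere
    """
    n = len(probabilities)                      # number of pixels
    total = 0.0
    for p, y in zip(probabilities, mask):
        p = min(max(p, eps), 1.0 - eps)         # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / n
```

Dividing by the pixel count keeps the loss comparable between crops of different sizes.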

In some embodiments, training the target detection network in the first state according to the second training set to obtain the trained target detection network includes: cropping, according to the second annotation information, the second feature maps of multiple scales of the sample image to obtain feature maps corresponding to the false positive region, the false negative region and the true positive region; inputting these feature maps into the classification sub-network to obtain a third probability that each of the false positive region, the false negative region and the true positive region belongs to the target category; determining a fourth loss of the classification sub-network according to the third probability and the true categories of the false positive region, the false negative region and the true positive region; and adjusting the network parameters of the classification sub-network according to the fourth loss.

In some embodiments, training the target detection network in the first state according to the second training set to obtain the trained target detection network includes: cropping, according to the second annotation information, the second feature maps of multiple scales of the sample image to obtain a sixth feature map corresponding to the true positive region and the false negative region; determining the bounding boxes matching the true positive region and the false negative region; inputting the sixth feature map into the regression sub-network to obtain prediction frames of the true positive region and the false negative region; determining a fifth loss of the regression sub-network according to the difference between the prediction frames of the true positive region and the false negative region and the corresponding bounding boxes; and adjusting the network parameters of the regression sub-network according to the fifth loss.

In some embodiments, training the target detection network in the first state according to the second training set to obtain the trained target detection network includes: inputting the sixth feature map corresponding to the true positive region and the false negative region into the segmentation sub-network to obtain a fourth probability that each pixel in the true positive region and the false negative region belongs to the target category; determining a sixth loss of the segmentation sub-network according to the number of pixels in the true positive region and the false negative region, the outline of the second object in the true positive region and the false negative region, and the fourth probability that each pixel belongs to the target category; and adjusting the network parameters of the segmentation sub-network according to the sixth loss.

In some embodiments, the first image includes a 2D medical image and/or a 3D medical image, and the target category includes a nodule and/or a cyst.

An embodiment of the present disclosure provides a target detection device, comprising: an extraction part, configured to perform feature extraction on a first image to be detected to obtain first feature maps of multiple scales of the first image; and a first processing part, configured to process the first feature maps of multiple scales of the first image through the trained target detection network to obtain the position of the first object of the target category existing in the first image; wherein the target detection network is trained in a recursive manner; the target detection network includes a classification sub-network, a regression sub-network and a segmentation sub-network; the classification sub-network is used to determine whether the first object exists in the first image, the regression sub-network is used to determine the bounding box of the first object existing in the first image, and the segmentation sub-network is used to determine the outline of the first object existing in the first image.

An embodiment of the present disclosure provides an electronic device, including: a processor; a memory configured to store instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to execute the above method.

Embodiments of the present disclosure provide a computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the foregoing method is implemented.

Embodiments of the present disclosure provide a computer program product including computer-readable code; when the computer-readable code runs on a device, a processor in the device executes some or all of the steps for implementing the video detection method in any of the embodiments of the present disclosure.

An embodiment of the present disclosure provides a computer program configured to store computer-readable instructions, which, when executed, cause a computer to execute part or all of the steps of the video detection method in any of the embodiments of the present disclosure.

It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure. Other features of embodiments of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure, and together with the description, serve to explain the technical solutions of the present disclosure.

FIG. 1 is a schematic diagram of an implementation flowchart of a target detection method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of the composition and structure of a residual attention network according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of the composition and structure of a feature pyramid network provided by an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of the composition structure of a target detection architecture provided by an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a prediction frame of a lung nodule when the target detection network shown in FIG. 4 is the target detection network in the first state;

FIG. 6 is a schematic diagram of a prediction frame of a lung nodule when the target detection network shown in FIG. 4 is a trained target detection network;

FIG. 7 is a schematic diagram of the composition and structure of a target detection device according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of the composition and structure of an electronic device according to an embodiment of the present disclosure;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numbers in the figures denote elements that have the same or similar function. While various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean that A exists alone, that A and B exist at the same time, or that B exists alone. In addition, the term "at least one" herein refers to any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the set consisting of A, B, and C.

In addition, in order to better illustrate the embodiments of the present disclosure, numerous specific details are given in the following detailed description. It should be understood by those skilled in the art that the embodiments of the present disclosure may be practiced without certain specific details. In some embodiments, methods, means, components and circuits well known to those skilled in the art are not described in detail so as to highlight the gist of the embodiments of the present disclosure.

FIG. 1 is a schematic diagram of an implementation flowchart of a target detection method provided by an embodiment of the present disclosure. As shown in FIG. 1, the method may include:

Step S11, perform feature extraction on the first image to be detected, and obtain first feature maps of multiple scales of the first image.

Step S12, process the first feature maps of multiple scales of the first image through the trained target detection network to obtain the position of the first object of the target category in the first image.

The target detection network is trained in a recursive manner. The target detection network includes a classification sub-network, a regression sub-network and a segmentation sub-network; the classification sub-network is used to determine whether the first object exists in the first image, the regression sub-network is used to determine the bounding box of the first object existing in the first image, and the segmentation sub-network is used to determine the outline of the first object existing in the first image.

In some embodiments, the target detection method may be performed by an electronic device such as a terminal device or a server. The terminal device may be a user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method can be implemented by a processor calling computer-readable instructions stored in a memory. Alternatively, the method may be performed by a server.

In an embodiment of the present disclosure, the first object may represent an object of a target category. The target categories may include nodules (e.g., lung nodules, breast nodules, etc.), cysts, and the like. The first image may represent an image on which detection of the first object is to be performed. The first image may include 2D medical images (e.g., X-ray films) and/or 3D medical images (e.g., CT images and MRI images). The embodiments of the present disclosure do not limit the first image or the target category. With the target detection method provided by the embodiments of the present disclosure, whether a first object exists in the first image can be detected, and the position of the first object in the first image can be obtained. In some embodiments, when the target category is pulmonary nodules, the network parameters of the target detection network can be initialized using the public lung nodule data set LUNA, to reduce problems such as long network training time and vanishing gradients.

In step S11, it is considered that the size difference between different first objects may be large (e.g., the diameters of lung nodules are distributed between 3 millimeters (mm) and 30 mm). Target detection for a first object with a smaller size requires low-level feature information at high resolution (i.e., a feature map with a smaller scale), and target detection for a first object with a larger size requires high-level feature information under a large receptive field (i.e., a feature map with a larger scale). Therefore, in order to take into account first objects of different sizes and improve the accuracy of target detection, first feature maps of multiple scales may be extracted from the first image in this step. Here, the first feature map may be used to represent a feature map obtained by performing feature extraction on the first image. In one example, for a three-dimensional first image, the scales of the extracted first feature maps may include 48*48*48, 24*24*24, 12*12*12, 6*6*6, etc. For a two-dimensional first image, the scales of the extracted first feature maps may include 48*48, 24*24, 12*12, 6*6, and so on. In the following description of the embodiments of the present disclosure, a three-dimensional first image is used as an example, and the processing of a two-dimensional first image may refer to that of the three-dimensional first image.

In implementation, feature extraction may be performed on the first image through a feature extraction network to obtain first feature maps of multiple scales of the first image. The feature extraction network can be any network capable of multi-scale feature extraction. In one example, the feature extraction network can be trained on a large number of images in the visual database ImageNet. To achieve multi-scale feature extraction, the feature extraction network in the embodiment of the present disclosure may include a basic network and a feature pyramid network (FPN).

The basic network can be used to extract the basic feature map of the first image. In some embodiments, the basic network may include a residual network (ResNet), such as ResNet18. The convolution parameters of each layer in the backbone of the residual network can be set as follows: the convolution kernel size K is 3*3*3, the stride S is 1, and the padding P is 1, and each convolution layer is followed by a batch normalization (BN) layer and a rectified linear unit (ReLU). In some embodiments, the basic network may include a residual attention network formed by combining a residual network and an attention model. A residual network usually extracts features over the entire image, whereas in actual target detection the local features of the first object are more valuable than the features of regions far away from the first object. Introducing an attention model into the basic network therefore enables the basic network to focus on extracting and learning the feature information with more reference value (i.e., the local features of the first object). That is to say, using the residual attention network as the basic network makes the extracted basic feature map more representative of the local features of the first object, thereby improving the accuracy of target detection.
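The effect of the convolution parameters given above on feature map size can be checked with the standard output-size formula for a convolution without dilation. The sketch reads P as the padding, which is an interpretation of the translated text; the helper name and the stride-2 case are illustrative and do not come from the disclosure.

```python
def conv_output_size(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Output length along one axis of a convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1


# With K = 3, S = 1, P = 1 the spatial size is unchanged along every axis.
same = conv_output_size(48)

# A stride of 2 (hypothetical, not stated in the text) would halve it.
halved = conv_output_size(48, stride=2)
```

With these settings a backbone layer keeps the resolution of its input, so any change of scale has to come from separate downsampling steps.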

FIG. 2 is a schematic diagram of the composition and structure of a residual attention network provided by an embodiment of the present disclosure. As shown in FIG. 2, the residual attention network includes a residual network 10 and an attention model 20. The backbone feature map of the first image 31 can be obtained through the residual network, and the attention feature map of the first image can be obtained through the attention model (it should be noted that the scale of the attention feature map is the same as the scale of the backbone feature map). The basic feature map 32 of the first image can be obtained by combining the backbone feature map and the attention feature map. In some embodiments, basic feature map of the first image = (1 + attention feature map) * backbone feature map.

In some embodiments, as shown in FIG. 2, the attention model may include a global average pooling unit 21, a fully connected rectified linear unit 22 and a fully connected activation unit 23.
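A minimal pure-Python sketch of how these three units and the (1 + attention) * backbone combination could fit together. It assumes the attention model produces one weight per channel that is broadcast over all positions, and that the final activation is a sigmoid; neither detail is stated in the text, and the function names and weight layout are hypothetical.

```python
import math
from typing import List

FeatureMap = List[List[float]]  # one flat list of values per channel


def global_average_pool(feature_map: FeatureMap) -> List[float]:
    """Mean of each channel."""
    return [sum(channel) / len(channel) for channel in feature_map]


def fully_connected(vector, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def channel_attention(feature_map, w1, b1, w2, b2) -> List[float]:
    """Pooling -> FC + ReLU -> FC + sigmoid (the sigmoid is an assumption)."""
    pooled = global_average_pool(feature_map)
    hidden = [max(0.0, h) for h in fully_connected(pooled, w1, b1)]
    return [1.0 / (1.0 + math.exp(-z)) for z in fully_connected(hidden, w2, b2)]


def residual_attention(feature_map: FeatureMap, attention: List[float]) -> FeatureMap:
    """basic = (1 + attention) * backbone, with attention broadcast per channel."""
    return [[(1.0 + a) * v for v in channel]
            for channel, a in zip(feature_map, attention)]
```

The "1 +" term means that a channel with zero attention still passes through unchanged, so the attention branch can only emphasise features, not erase them.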

After the basic feature map is acquired, feature maps of multiple scales of the first image may be acquired through the FPN. The FPN includes downsampling processing and upsampling processing. The downsampling processing reduces the scale of the feature map and expands the receptive field, but loses feature information of first objects with a small size; the upsampling processing increases the scale of the feature map and retains feature information of first objects with a small size, but narrows the receptive field.

The following takes as an example obtaining, from the basic feature map through the FPN, first feature maps of 4 scales of the first image (48*48*48, 24*24*24, 12*12*12 and 6*6*6, unit: pixel). FIG. 3 is a schematic diagram of the composition and structure of an FPN provided by an embodiment of the present disclosure. As shown in FIG. 3, C1 may be used to represent the basic feature map of the first image acquired through the basic network. Since first feature maps of four scales are finally required, in the embodiment of the present disclosure C1 is downsampled four times in sequence to obtain C2, C3, C4, and C5, respectively. C5 is convolved with a 1*1*1 convolution kernel to obtain P5; P5 is upsampled, C4 is convolved with a 1*1*1 convolution kernel, and the upsampling result of P5 and the convolution result of C4 are added to obtain P4; P4 is upsampled, C3 is convolved with a 1*1*1 convolution kernel, and the upsampling result of P4 and the convolution result of C3 are added to obtain P3; P3 is upsampled, C2 is convolved with a 1*1*1 convolution kernel, and the upsampling result of P3 and the convolution result of C2 are added to obtain P2. P5, P4, P3 and P2 are each convolved with a 3*3*3 convolution kernel to obtain feature maps of 6*6*6, 12*12*12, 24*24*24 and 48*48*48, that is, the first feature maps of the four scales of the first image.
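The data flow of this FPN can be traced with a one-dimensional, single-channel toy version. The sketch assumes that every downsampling halves the length (so C1 would have length 96 to yield 48, 24, 12 and 6), replaces the 1*1*1 convolution by a scalar multiply and the 3*3*3 convolution by a 3-tap mean filter, and uses nearest-neighbour upsampling. All of these are simplifying assumptions chosen to keep the example short.

```python
from typing import List


def downsample(x: List[float]) -> List[float]:
    """Halve the length by averaging neighbouring pairs (stand-in for a strided convolution)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def upsample(x: List[float]) -> List[float]:
    """Double the length by repeating each value (nearest neighbour)."""
    return [v for v in x for _ in range(2)]


def lateral(x: List[float], weight: float = 1.0) -> List[float]:
    """On a single-channel map a 1*1*1 convolution reduces to a scalar multiply."""
    return [weight * v for v in x]


def smooth(x: List[float]) -> List[float]:
    """3-tap mean filter with zero padding, standing in for the 3*3*3 convolution."""
    padded = [0.0] + list(x) + [0.0]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0 for i in range(len(x))]


def feature_pyramid(c1: List[float]) -> List[List[float]]:
    """Return the smoothed maps [P5, P4, P3, P2] built from C1."""
    levels = [c1]
    for _ in range(4):                        # C2, C3, C4, C5
        levels.append(downsample(levels[-1]))
    c2, c3, c4, c5 = levels[1:]
    p5 = lateral(c5)
    p4 = [u + l for u, l in zip(upsample(p5), lateral(c4))]
    p3 = [u + l for u, l in zip(upsample(p4), lateral(c3))]
    p2 = [u + l for u, l in zip(upsample(p3), lateral(c2))]
    return [smooth(p) for p in (p5, p4, p3, p2)]
```

Each P level mixes coarse context carried down from the level above with the finer detail of the matching C level, which is what lets the smaller objects survive in the higher-resolution outputs.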

In the embodiment of the present disclosure, the basic feature map extracted by the basic network is converted into multi-scale feature maps through the FPN, so that first objects of various sizes can be detected; by simply changing the network connections, the performance of detecting small first objects can be effectively improved with essentially no increase in the amount of calculation.

In step S12, the first feature maps of multiple scales of the first image may be processed by the trained target detection network, so as to obtain the position of the first object existing in the first image.

The position of the first object may be represented by the bounding box of the first object and the outline of the first object. The target detection network includes a classification sub-network, a regression sub-network and a segmentation sub-network, wherein the classification sub-network can be used to determine whether the first object exists in the first image, the regression sub-network can be used to determine the bounding box of the first object existing in the first image, and the segmentation sub-network can be used to determine the outline of the first object existing in the first image. The target detection network is obtained through joint training on the multiple tasks of classification, regression and segmentation, and the correlation between the tasks can be used to improve the ability to recognize the first object. Moreover, in the embodiment of the present disclosure, the above target detection network including the classification sub-network, the regression sub-network and the segmentation sub-network is trained in a recursive manner, so that the accuracy of target detection can be improved on the basis of improved sensitivity.

In the embodiment of the present disclosure, the trained target detection network is obtained based on multi-task learning and recursive training. This takes into account the following problems of a target detection network: when high sensitivity is maintained, the accuracy is low (that is, a large number of objects are misclassified); when high accuracy is maintained, the sensitivity is low (that is, a large number of objects of the target category are not detected). For example, when the sensitivity reaches more than 95%, there are a large number of false positive sample images (about 32%); when the false positive sample images are controlled below 3%, the sensitivity is low (about 20% of the objects are not detected).

Therefore, in the embodiment of the present disclosure, the training process of the target detection network is divided into two stages. In the first stage, the focus is on sensitivity, so that the target detection network can obtain as many suspected first objects as possible; in the second stage, the focus is on accuracy, so that the target detection network can obtain relatively high accuracy on the basis of high sensitivity.

In some embodiments, the method further includes: training the target detection network according to the first training set to obtain the target detection network in the first state; and training the target detection network in the first state according to the second training set to obtain the trained target detection network.

That is to say, in the embodiment of the present disclosure, the training process of the target detection network is divided into two stages: in the first stage, the target detection network is trained according to the first training set to obtain the target detection network in the first state; in the second stage, the target detection network in the first state is trained according to the second training set to obtain the trained target detection network.

In the first stage, the first training set is used to train the target detection network. The first training set includes a plurality of sample images and first annotation information of the sample images, where the first annotation information includes the real position of the second object in the sample image. The plurality of sample images include positive sample images and negative sample images. Here, a positive sample image includes at least one second object, and a negative sample image does not include the second object. The second object may represent an object of the target category existing in a sample image; for the second object, reference may be made to the description of the first object, which will not be repeated here.

The acquisition process of the first training set will be described below.

In some implementations, the method further includes: cropping the marked second image to obtain a positive sample image and a negative sample image of a preset size.

The second image may be used to represent the annotated image. In one example, the second image may be an annotated medical image. The annotation information of the second image may be used to indicate the real position (including the bounding box and outline) of each second object in the second image. In some embodiments, the bounding box of the second object may be represented by a binarized cuboid. In some embodiments, the bounding box of the second object may be represented by a binarized sphere. It can be understood that the center point of the binarized sphere is the same as the center point of the second object, and the radius of the binarized sphere is a radius set as required. The contour of the second object may be represented by whether each pixel in the second image belongs to the target category. The preset size can be set as required, for example, the preset size can be 96*96*96 (unit: pixel*pixel*pixel).

In implementation, a positive sample image and a negative sample image of a preset size may be acquired from the second image according to the label information of the second image.

In some embodiments, the position (center point, bounding box, etc.) of each second object in the second image may be determined according to the label information of the second image. Then, according to the position of the second object (for example, centered on the second object), an image block of the preset size that includes the second object is cropped from the second image, and an image block of the preset size that does not include the second object is also cropped from the second image. The cropped image block including the second object may be used as a positive sample image, and the cropped image block not including the second object may be used as a negative sample image.
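
A minimal sketch of this cropping step is given below. It assumes the annotation is available as a boolean voxel mask and that the volume is at least as large as the preset size along each axis; the function names and the rejection-sampling strategy for negative blocks are hypothetical choices, not taken from the disclosure.

```python
import numpy as np

def crop_centered(volume, center, size):
    # Crop a size^3 block centered on `center`, shifting the window
    # where needed so that it stays inside the volume.
    starts = [int(min(max(c - size // 2, 0), dim - size))
              for c, dim in zip(center, volume.shape)]
    window = tuple(slice(s, s + size) for s in starts)
    return volume[window], window

def sample_negative(volume, mask, size, rng, max_tries=100):
    # Draw random size^3 blocks until one contains no annotated voxel.
    for _ in range(max_tries):
        starts = [int(rng.integers(0, dim - size + 1)) for dim in volume.shape]
        window = tuple(slice(s, s + size) for s in starts)
        if not mask[window].any():
            return volume[window], window
    return None  # no object-free block found within max_tries
```

Here `crop_centered` yields a positive sample image when `center` is the center point of a second object, and `sample_negative` yields a negative sample image.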

By cropping the second image to obtain image blocks including the second object and image blocks not including the second object, it is possible to alleviate the problem that the graphics processor (Graphics Processing Unit, GPU) cannot process the second image directly, owing to the large amount of data contained in the second image, the limited video memory of the GPU, and other reasons. By cropping image blocks of a preset size, the imbalance between the area where the second object is located and the area where the second object is not located can also be reduced; for example, the size of the lung nodule area in a lung CT image is much smaller than the size of the normal tissue area.

In some embodiments, data augmentation is performed on the cropped image blocks including the second object and the cropped image blocks not including the second object through operations such as rotation, translation, mirroring and scaling, so as to expand the data, increasing both the number of image blocks including the second object and the number of image blocks not including the second object. The image blocks including the second object obtained through data augmentation can also be used as positive sample images, and the image blocks not including the second object obtained through data augmentation can also be used as negative sample images. By performing data augmentation on the cropped image blocks including the second object and the image blocks not including the second object, the number of sample images can be effectively enlarged, and the generalization ability of the target detection network can be improved.
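
As an illustration, two of the operations named above (mirroring and rotation) can be sketched with NumPy as follows. Translation and scaling are omitted, rotation is restricted to multiples of 90 degrees for simplicity, and the function name is hypothetical. The same transform is applied to the image block and to its annotation mask so that the labels stay aligned.

```python
import numpy as np

def augment(image, mask, rng):
    # Random mirroring along one spatial axis (applied half of the time).
    axis = int(rng.integers(0, 3))
    if rng.random() < 0.5:
        image, mask = np.flip(image, axis), np.flip(mask, axis)
    # Random rotation by k * 90 degrees in the plane of the last two axes.
    k = int(rng.integers(0, 4))
    image = np.rot90(image, k, axes=(1, 2))
    mask = np.rot90(mask, k, axes=(1, 2))
    return image, mask
```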

In some embodiments, the same number of positive sample images and negative sample images are acquired. By acquiring the same number of image blocks including the second object and image blocks not including the second object, the positive and negative sample images can be effectively balanced, thereby reducing overfitting.

In some embodiments, a positive sample image and a negative sample image of a preset size may be obtained by first preprocessing the marked second image, and then cropping the preprocessed second image. In this way, the image quality of the obtained positive sample images and negative sample images can be improved, which is beneficial to the subsequent training of the target detection network. Preprocessing of the second image may include one or more of resampling, cropping, normalization, and the like.

Taking a lung CT image as the second image as an example, the preprocessing process of the second image will be described. Lung CT images are 3D images, and the slice thickness of CT images obtained by different CT instruments may differ (for example, the slice thickness of lung CT images may be 4 mm, 2.5 mm, 1.25 mm, 1 mm, 0.7 mm, etc.). By resampling the lung CT images to a resolution of 1*1*1, the thickness difference between the lung CT images can be effectively eliminated. After resampling, the area where the lung parenchyma is located can be cropped out. In this way, both the positive sample images and the negative sample images consist of tissue in the lung area, which can reduce the interference of other organs with the training of the target detection network. After cropping out the area where the lung parenchyma is located, the value of each pixel (also called voxel) in the cropped area can be normalized to a value range of 0-1 to obtain the preprocessed lung CT image. This can effectively reduce the amount of subsequent calculation.
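
The resampling and normalization steps can be sketched as below. This sketch makes several assumptions that the text does not state: the target resolution 1*1*1 is taken to be in millimetres, nearest-neighbour interpolation is used, and the intensity window bounds `lo` and `hi` are hypothetical parameters chosen by the caller. Lung parenchyma cropping is not shown.

```python
import numpy as np

def resample_nearest(volume, spacing, new_spacing=(1.0, 1.0, 1.0)):
    # Resample a volume with voxel spacing `spacing` to `new_spacing`
    # by nearest-neighbour lookup along each axis.
    new_shape = [max(1, int(round(n * s / t)))
                 for n, s, t in zip(volume.shape, spacing, new_spacing)]
    index = [np.minimum((np.arange(m) * t / s).astype(int), n - 1)
             for m, n, s, t in zip(new_shape, volume.shape, spacing, new_spacing)]
    return volume[np.ix_(*index)]

def normalize_01(volume, lo, hi):
    # Clip intensities to [lo, hi] and rescale linearly to [0, 1].
    return (np.clip(volume, lo, hi) - lo) / (hi - lo)
```

For example, a scan with 2.5 mm slice thickness and 1 mm in-plane spacing would be passed as `spacing=(2.5, 1.0, 1.0)`.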

It should be noted that, for the method of cropping positive sample images and negative sample images of the preset size from the preprocessed second image, reference may be made to the method of cropping positive sample images and negative sample images of the preset size directly from the second image.

So far, the acquisition of positive sample images and negative sample images is completed, that is, the acquisition of sample images in the first training set is completed.

It can be understood that, according to the label information of the second image, the position of each second object in the second image can be determined. Therefore, according to the annotation information of the second image, the annotation information of each positive sample image and the annotation information of each negative sample image can be determined, that is, the first annotation information of each sample image in the first training set is determined.

So far, the sample images in the first training set are obtained, and the first label information of each sample image is determined. That is, the acquisition of the first training set is completed. The following describes the process of using the first training set to train the target detection network to obtain the target detection network in the first state.

Training the target detection network according to the first training set to obtain the target detection network in the first state includes training the classification sub-network, the regression sub-network and the segmentation sub-network of the target detection network according to the first training set. In the embodiment of the present disclosure, the target detection network is trained based on multi-task learning of classification, regression and segmentation, and the ability to recognize objects of the target category is improved by utilizing the correlation between the tasks.

When training the classification sub-network, the sample images to be used include: positive sample images and negative sample images, and the label information to be used includes: the bounding box of the second object.

In some embodiments, the training of the classification sub-network of the target detection network according to the first training set may include steps S21 to S24.

In step S21, feature extraction is performed on the sample image to obtain second feature maps of multiple scales of the sample image.

The second feature map may represent a feature map extracted from the sample image. For the process of performing feature extraction on the sample image, reference may be made to the process of performing feature extraction on the first image. For example, the scales of the second feature maps may include 6*6*6, 12*12*12, 24*24*24, 48*48*48, and so on.

In step S22, a plurality of first reference frames in the sample image are determined according to the second feature maps of the plurality of scales and a plurality of preset anchor frames.

The preset anchor frame may be used to indicate the size of the first reference frame. The preset anchor boxes can be preset as needed. In some embodiments, the size of the lung nodule is 3 mm to 30 mm, so the area of the preset anchor frame can be set to 4, 8, 16, and 32 (unit: pixel*pixel), etc. There can be multiple preset anchor boxes with the same area. Assuming that the area of the preset anchor frame is 4, the shape of the preset anchor frame may include: 1*4, 2*2 and 4*1 (unit: pixel*pixel). Assuming that the area of the preset anchor frame is 8, the shapes of the preset anchor frame may include: 1*8, 2*4, 4*2 and 8*1. In the embodiment of the present disclosure, the area and shape of the preset anchor frame can be set in advance as required, and the embodiment of the present disclosure does not limit the area and shape of the preset anchor frame.
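
The shapes listed above for a given area are simply the integer factor pairs of that area. A short sketch (the function name is hypothetical) that enumerates them:

```python
def anchor_shapes(area):
    # All integer (height, width) pairs whose product equals `area`.
    return [(h, area // h) for h in range(1, area + 1) if area % h == 0]

print(anchor_shapes(4))  # [(1, 4), (2, 2), (4, 1)]
print(anchor_shapes(8))  # [(1, 8), (2, 4), (4, 2), (8, 1)]
```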

For the second feature map of one scale of the sample image, the center points of a plurality of first reference frames may be determined in the sample image. For example, assuming that the scale of the feature map of a certain scale of the sample image is 3*3*3, the sample image is divided into 9 areas on average, and the center point of each area is the center point of a first reference frame. For a center point of a first reference frame and a preset anchor frame, a first reference frame may be determined.

In step S23, a preset number of training samples are determined from the plurality of first reference frames according to the bounding frame of the second object in the sample image.

The training samples include positive samples and negative samples, where the label information of a positive sample indicates that it belongs to the target category, and the label information of a negative sample indicates that it does not belong to the target category.

According to the intersection over union (IoU) of a first reference frame and the bounding box of the second object, the difference between the first reference frame and the bounding box of the second object can be determined, so as to determine whether the label of the first reference frame is the target category or a non-target category. When the IoU of a first reference frame and a bounding box of a second object is relatively large, the difference between the two is small. In this case, the label of the first reference frame may be the target category, and the first reference frame can be used as a positive sample for the classification sub-network. When the IoU of a first reference frame and a bounding box is relatively small, the difference between the two is large. In this case, the label of the first reference frame may be a non-target category, and the first reference frame can be used as a negative sample for the classification sub-network.
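
For reference, the intersection over union of two axis-aligned 3D boxes can be computed as in the following sketch. The box encoding `(z0, y0, x0, z1, y1, x1)` and the function names are assumptions made for illustration.

```python
def box_volume(box):
    # Volume of an axis-aligned box (z0, y0, x0, z1, y1, x1); zero if empty.
    return (max(box[3] - box[0], 0)
            * max(box[4] - box[1], 0)
            * max(box[5] - box[2], 0))

def iou_3d(a, b):
    # Intersection box: largest lower corner and smallest upper corner.
    inter_box = (tuple(max(a[i], b[i]) for i in range(3))
                 + tuple(min(a[i], b[i]) for i in range(3, 6)))
    inter = box_volume(inter_box)
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give an IoU of 1, and disjoint boxes give 0.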

In some embodiments, step S23 may include: dividing the bounding boxes in the sample image into multiple bounding box sets, where the size of the bounding boxes in each bounding box set is within a preset size interval; for any bounding box set, removing the first reference frames that have been determined as training samples from the plurality of first reference frames, to obtain a reference frame set corresponding to the bounding box set; for any bounding box in the bounding box set, determining the positive samples and negative samples corresponding to the bounding box according to the intersection over union (IoU) between the bounding box and each first reference frame in the corresponding reference frame set, where the number of the positive samples is negatively correlated with the size interval of the bounding box set; and processing each bounding box set in sequence in ascending order of size interval, to obtain the preset number of training samples.

Since second objects differ greatly in size, the bounding boxes of the second objects also differ greatly in size. In order to take into account both second objects with a larger size and second objects with a smaller size, in this embodiment of the present disclosure, the bounding boxes in the sample image may be divided into multiple bounding box sets according to size, and each bounding box set is then processed separately.

In implementation, a size interval can be preset for each bounding box set. When the size of a bounding box is within a size range corresponding to a bounding box set, the bounding box can be divided into the bounding box set. In this way, the size of the bounding boxes in each bounding box set is within a preset size range for the bounding box set.

The size interval preset for a bounding box set may be set as required (for example, according to the size of the second object), and the embodiment of the present disclosure does not limit the size interval. A pulmonary nodule is taken as an example of the second object for description. The size of pulmonary nodules is between 3 mm and 30 mm. Among them, those with a size less than or equal to 6 mm can be called small nodules, those with a size greater than 6 mm and less than 12 mm can be called medium nodules, and those with a size greater than or equal to 12 mm can be called large nodules. Therefore, three bounding box sets can be set, with a size interval set for each bounding box set.

After the division of the bounding box set is completed, each bounding box set may be processed in sequence according to the order of size intervals from small to large.

The first bounding box set may represent any one of the divided bounding box sets. For the process of processing other bounding box sets, refer to the process of processing the first bounding box set. The process of processing the first bounding box set includes: removing the first reference frames that have been determined as training samples from the plurality of first reference frames, to obtain a reference frame set corresponding to the first bounding box set; and, for any bounding box in the first bounding box set, determining the positive samples and negative samples corresponding to the bounding box according to the intersection over union (IoU) between the bounding box and each first reference frame in the reference frame set corresponding to the first bounding box set.

The reference frame set includes a plurality of first reference frames, and the reference frame set limits the range from which positive samples and negative samples are selected. If the first bounding box set is the first bounding box set to be processed after sorting, no first reference frame has yet been determined as a training sample (positive or negative). In this case, for any sample image, all the first reference frames in the sample image may be used to form the reference frame set corresponding to the first bounding box set. If the first bounding box set is not the first bounding box set to be processed after sorting, some of the first reference frames may already have been determined as training samples. In this case, for any sample image, the first reference frames that have been determined as training samples can be excluded from the first reference frames of the sample image, and the remaining first reference frames can be used to form the reference frame set corresponding to the first bounding box set. In this way, the number of intersection over union computations can be reduced, which reduces the amount of calculation and the workload.

In the embodiment of the present disclosure, the number of positive samples corresponding to a bounding box is negatively correlated with the size interval of the bounding box set to which the bounding box belongs. That is to say, when the size interval of the bounding box set to which a bounding box belongs is large, the number of positive samples corresponding to the bounding box is small; when the size interval of the bounding box set to which a bounding box belongs is small, the number of positive samples corresponding to the bounding box is large. Taking lung nodules as the second object for illustration, the number of positive samples corresponding to a bounding box representing a small nodule can be 6, the number of positive samples corresponding to a bounding box representing a medium nodule can be 4, and the number of positive samples corresponding to a bounding box representing a large nodule can be 2. Since a second object with a smaller size is harder to learn and a second object with a larger size is easier to learn, determining more positive samples for second objects with a smaller size and fewer positive samples for second objects with a larger size can balance the learning difficulty of second objects of different sizes, thereby ensuring sufficient sensitivity for second objects of various sizes.

In some embodiments, for each bounding box in the first bounding box set, the first reference frames in the corresponding reference frame set are sorted in descending order of their intersection over union (IoU) with the bounding box, and the first to Nth first reference frames are determined as the positive samples corresponding to the bounding box, where N can be set as required; first reference frames whose IoU with the bounding box falls within a preset range (which can be set as required, for example, greater than 0.02 and less than 0.2) are determined as the negative samples corresponding to the bounding box. In addition, in order to reduce overfitting, the number of positive samples corresponding to a bounding box can be the same as or similar to the number of negative samples.
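
The size-aware selection of training samples can be sketched as follows, given a precomputed IoU matrix between bounding boxes and first reference frames. This is a sketch under assumptions: positives are taken as the N reference frames with the highest IoU (consistent with large-IoU frames being labeled as the target category), the per-group counts 6, 4 and 2 and the negative IoU range follow the examples in the text, and all names are hypothetical.

```python
import numpy as np

POSITIVES_PER_GROUP = {"small": 6, "medium": 4, "large": 2}

def size_group(size_mm):
    # Size intervals from the pulmonary nodule example.
    if size_mm <= 6:
        return "small"
    if size_mm < 12:
        return "medium"
    return "large"

def assign_samples(iou, sizes_mm, neg_range=(0.02, 0.2)):
    # iou: array of shape (num_boxes, num_reference_frames).
    available = np.ones(iou.shape[1], dtype=bool)
    result = {}
    for group in ("small", "medium", "large"):   # ascending size interval
        # Reference frame set: frames not yet used as training samples.
        candidates = np.flatnonzero(available)
        n_pos = POSITIVES_PER_GROUP[group]
        for b, size in enumerate(sizes_mm):
            if size_group(size) != group:
                continue
            ranked = candidates[np.argsort(-iou[b, candidates])]
            pos, rest = ranked[:n_pos], ranked[n_pos:]
            in_range = (iou[b, rest] > neg_range[0]) & (iou[b, rest] < neg_range[1])
            neg = rest[in_range][:n_pos]         # keep positives and negatives balanced
            result[b] = (pos.tolist(), neg.tolist())
            available[pos] = False
            available[neg] = False
    return result
```

The returned dictionary maps each bounding box index to its positive and negative reference frame indices.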

In step S24, the classification sub-network is trained according to the training samples.

In some embodiments, step S24 may include: cropping the second feature map to obtain a third feature map corresponding to the training sample; inputting the third feature map into the classification sub-network to obtain the first probability that the training sample belongs to the target category; determining the first loss of the classification sub-network according to the first probability that the training sample belongs to the target category and the label information of the training sample; and adjusting the network parameters of the classification sub-network according to the first loss.

In implementation, for each training sample: according to the position of the training sample in the sample image, the position of the third feature map corresponding to the training sample in the second feature map corresponding to the sample image can be determined, and according to the position of the third feature map in the second feature map, the second feature map is cropped to obtain the third feature map corresponding to the training sample. It can be understood that, since the second feature map has multiple scales, the cropped third feature map also has multiple scales.
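
One way to map a training sample from image coordinates to a feature map is to divide by the stride between the two. The sketch below assumes a cubic image and a cubic feature map, a box encoding `(z0, y0, x0, z1, y1, x1)` in image voxels, and outward rounding; these choices and the function name are hypothetical.

```python
import math
import numpy as np

def crop_feature(feature, box, image_size):
    # feature: (C, n, n, n); image_size: side length of the cubic image.
    n = feature.shape[1]
    stride = image_size / n
    # Round the lower corner down and the upper corner up, keep at least
    # one feature cell, and stay inside the feature map.
    lo = [min(max(int(math.floor(v / stride)), 0), n - 1) for v in box[:3]]
    hi = [min(max(int(math.ceil(v / stride)), l + 1), n)
          for v, l in zip(box[3:], lo)]
    return feature[:, lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
```

Applying the function to each of the multi-scale second feature maps gives third feature maps at the corresponding scales.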

The third feature map of the training sample is input into the classification sub-network of the target detection network, and the first probability that the training sample belongs to the target category is output. Then, through formula 1, the first loss of the classification sub-network can be determined according to the first probability and the label information of the training sample.
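
Formula 1 is not reproduced in this text. Judging from the symbol definitions that follow (y, y', γ and α), it appears to take the form of the standard focal loss; the following is a reconstruction under that assumption rather than the original formula:

```latex
L_{ft} = -\alpha \, y \, (1 - y')^{\gamma} \log y' \; - \; (1 - \alpha) \, (1 - y) \, {y'}^{\gamma} \log (1 - y')
```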

Here, L_ft represents the first loss, y represents the label information of the training sample, y=1 represents that the training sample belongs to the target category, and y=0 represents that the training sample does not belong to the target category. y' represents the first probability output by the classification sub-network. γ and α are hyperparameters. γ is mainly used to reduce the weight of the easy-to-classify training samples, so that the classification sub-network of the target detection network pays more attention to the difficult-to-classify training samples. In some embodiments, the value of γ may be 2. α is mainly used to balance the ratio of positive samples and negative samples in the training samples, effectively alleviating the serious imbalance between positive and negative samples in target detection. In some embodiments, the value of α may be 0.25.

When a training sample belongs to the target category and the first probability of the training sample is greater than the first threshold, the training sample can be considered an easy-to-classify training sample. When a training sample belongs to a non-target category and the first probability of the training sample is less than the second threshold, the training sample can also be considered an easy-to-classify training sample. The first threshold and the second threshold can be set as required. The first threshold may be set to a value close to 1, for example, 0.9 or 0.95. The second threshold may be set to a value close to 0, for example, 0.05 or 0.1. This embodiment of the present disclosure does not limit the settings of the first threshold and the second threshold. It can be seen from formula 1 that the L_ft obtained for easy-to-classify training samples is relatively small. That is to say, the first loss caused by the easy-to-classify training samples is relatively small, and its impact on the network parameters of the classification sub-network is relatively small. This is equivalent to reducing the weight of easy-to-classify training samples.

When a training sample belongs to the target category and the first probability of the training sample is smaller than the third threshold, the training sample can be considered a difficult-to-classify training sample. When a training sample belongs to a non-target category and the first probability of the training sample is greater than the fourth threshold, the training sample can also be considered a difficult-to-classify training sample. The third threshold and the fourth threshold can be set as required. The third and fourth thresholds may be set to values close to 0.5. For example, the third threshold may be set to 0.55 or 0.6, and the fourth threshold may be set to 0.4 or 0.45. This embodiment of the present disclosure does not limit the settings of the third threshold and the fourth threshold. According to formula 1, the L_ft obtained for difficult-to-classify training samples is relatively large. That is to say, the first loss brought by the difficult-to-classify training samples is relatively large, and its impact on the network parameters of the classification sub-network is relatively large. This is equivalent to increasing the weight of the difficult-to-classify training samples, making the classification sub-network pay more attention to them.

It should be noted that, before determining the classification loss, a smoothing operation can be performed on the label information of the training samples, for example, the value of y can be softened from 0 and 1 to 0.1 and 0.9, so as to enhance the generalization performance of the target detection network.
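
The effect of γ and α on easy and difficult training samples, and the label smoothing step, can be illustrated with the sketch below. It assumes the first loss takes the standard focal-loss form written so that soft labels are accepted; the function names and the smoothing parameter `eps` are hypothetical.

```python
import math

def smooth_label(y, eps=0.1):
    # Soften hard labels: 0 -> 0.1 and 1 -> 0.9 for eps = 0.1.
    return y * (1.0 - 2.0 * eps) + eps

def focal_loss(y, p, gamma=2.0, alpha=0.25):
    # y: (possibly smoothed) label; p: first probability output by the network.
    positive_term = -alpha * y * (1.0 - p) ** gamma * math.log(p)
    negative_term = -(1.0 - alpha) * (1.0 - y) * p ** gamma * math.log(1.0 - p)
    return positive_term + negative_term

# An easy positive (p = 0.95) contributes far less loss than a hard one (p = 0.55).
easy, hard = focal_loss(1, 0.95), focal_loss(1, 0.55)
```

In this sketch, `hard` is several hundred times larger than `easy`, which is the down-weighting of easy-to-classify samples described above.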

So far, the training of the classification sub-network of the target detection network according to the first training set is completed.

When training the regression sub-network, the sample images to be used include: positive sample images, and the label information to be used includes: the bounding box of the second object.

In some embodiments, according to the first training set, training the regression sub-network of the target detection network may include steps S31 to S36.

In step S31, feature extraction is performed on the positive sample image to obtain fourth feature maps of multiple scales of the positive sample image.

The fourth feature map may represent a feature map of the positive sample image. Step S31 may refer to step S21.

In step S32, a plurality of second reference frames in the positive sample image are determined according to the fourth feature maps of the plurality of scales and a plurality of preset anchor frames.

Step S32 may refer to step S22.

In step S33, for any bounding box of the second object in the sample image, the intersection over union (IoU) of the bounding box with each of the plurality of second reference frames is determined, and the second reference frame with the largest IoU is determined as the matching box corresponding to the bounding box.

In step S34, for any bounding box of the second object in the sample image, the fifth feature map corresponding to the matching box is input into the regression sub-network to obtain a prediction box of the matching box.

The fifth feature map may represent the feature map corresponding to the matching frame. For the manner of obtaining the fifth feature map corresponding to the matching frame, reference may be made to the manner of obtaining the third feature map corresponding to the training sample in step S24.

In step S35, for any bounding box of the second object in the sample image, the second loss of the regression sub-network is determined according to the difference between the bounding box and the predicted box of the corresponding matching box.

In some embodiments, step S35 may include: determining the first regression loss of the matching box according to the coordinate offset and the intersection over union (IoU) between the bounding box and the prediction box; determining the second regression loss of the matching box according to the intersection, the union and the minimum closed area between the bounding box and the prediction box; and determining the second loss of the regression sub-network according to the first regression loss and the second regression loss.

In some embodiments, the first regression loss can be determined by formula two:

Here, the quantity given by formula two represents the first regression loss; W_iou represents the weight of the prediction box, W_iou = e^(-iou) + 0.4; iou represents the intersection over union between the prediction box and the corresponding bounding box; and x represents the coordinate offset of the prediction box relative to the corresponding bounding box.

By using the intersection over union (IoU) of the prediction box and the corresponding bounding box as a guide, formula two gives a larger loss weight to a prediction box with a smaller IoU, so that when the regression sub-network is trained using the matching box corresponding to that prediction box, the parameters of the regression sub-network are updated to a greater extent.
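
The weighting can be illustrated as follows. Formula two itself is not reproduced in this text, so the sketch makes an explicit assumption: the per-coordinate term is taken to be a smooth L1 loss on the offsets x, multiplied by the stated weight W_iou = e^(-iou) + 0.4. The function names and the `beta` parameter are hypothetical.

```python
import math

def smooth_l1(x, beta=1.0):
    # Assumed base regression term: quadratic near zero, linear elsewhere.
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def first_regression_loss(offsets, iou):
    # W_iou decreases from 1.4 (iou = 0) to about 0.77 (iou = 1),
    # so poorly overlapping prediction boxes are weighted more heavily.
    w_iou = math.exp(-iou) + 0.4
    return w_iou * sum(smooth_l1(x) for x in offsets)
```

For the same offsets, a prediction box with an IoU of 0.2 therefore receives a larger loss than one with an IoU of 0.9.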

Considering that the positions of different prediction frames are quite different when the first regression loss is the same, a second regression loss is introduced in the embodiment of the present disclosure to make the positioning of the second object more accurate.

In some embodiments, the second regression loss can be determined by formula three:
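
Formula three is not reproduced in this text. Based on the symbols defined below (A, B, C, A∪B and A∩B), it appears to be the generalized IoU loss in its usual form; the following is a reconstruction under that assumption:

```latex
L_{GIoU} = 1 - \frac{|A \cap B|}{|A \cup B|} + \frac{|C \setminus (A \cup B)|}{|C|}
```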

Here, L_GIoU represents the second regression loss; A and B represent the prediction box and the corresponding bounding box, respectively; C represents the minimum closed area of A and B, that is, the smallest region enclosing both; A∪B represents the union of the prediction box and the corresponding bounding box; and A∩B represents the intersection of the prediction box and the corresponding bounding box.

By introducing the second regression loss as an aid, the overlapping area and the non-overlapping area between the prediction box and the corresponding bounding box are optimized, so as to more accurately locate the area where the second object is located.
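
A sketch of this loss for axis-aligned 3D boxes is given below, assuming the second regression loss follows the usual generalized IoU form and that boxes are encoded as `(z0, y0, x0, z1, y1, x1)` with positive volume; the function names are hypothetical.

```python
def box_volume(box):
    # Volume of an axis-aligned box (z0, y0, x0, z1, y1, x1); zero if empty.
    return (max(box[3] - box[0], 0)
            * max(box[4] - box[1], 0)
            * max(box[5] - box[2], 0))

def giou_loss(a, b):
    lower_max = tuple(max(a[i], b[i]) for i in range(3))
    upper_min = tuple(min(a[i], b[i]) for i in range(3, 6))
    lower_min = tuple(min(a[i], b[i]) for i in range(3))
    upper_max = tuple(max(a[i], b[i]) for i in range(3, 6))
    inter = box_volume(lower_max + upper_min)      # |A ∩ B|
    union = box_volume(a) + box_volume(b) - inter  # |A ∪ B|
    hull = box_volume(lower_min + upper_max)       # |C|, smallest enclosing box
    return 1.0 - inter / union + (hull - union) / hull
```

Unlike a loss based on IoU alone, this value keeps growing as two disjoint boxes move further apart, which is why it helps distinguish prediction boxes that share the same IoU.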

In some embodiments, the weighted summation of the first regression loss and the second regression loss may be performed to obtain the second loss of the regression sub-network.

In step S36, the network parameters of the regression sub-network are adjusted according to the second loss.

So far, the training of the regression sub-network of the target detection network according to the first training set is completed.

When training the segmentation sub-network, the sample images to be used include: positive sample images, and the label information to be used includes: the outline of the second object.

In some embodiments, according to the first training set, training the segmentation sub-network of the target detection network may include steps S41 to S44.

In step S41, feature extraction is performed on the positive sample image to obtain fourth feature maps of multiple scales of the positive sample image.

Step S41 may refer to step S31.

In step S42, the fourth feature maps of the multiple scales are input into the segmentation sub-network to obtain the second probability that each pixel of the positive sample image belongs to the target category.

In step S43, the third loss of the segmentation sub-network is determined according to the number of pixels in the positive sample image, the contour of the second object in the positive sample image, and the second probability that each pixel belongs to the target category.

In some embodiments, the third loss of the segmentation sub-network can be determined by Formula 4.

In Formula 4, L_dice represents the third loss, N is the number of pixels in the positive sample image, i denotes the i-th pixel in the positive sample image, 0<i≤N, p_i represents the second probability, output by the segmentation sub-network, that the i-th pixel in the positive sample image belongs to the target category, and g_i represents the true category of the i-th pixel in the positive sample image. The value of g_i is 0 or 1, where 0 indicates that the i-th pixel belongs to a non-target category and 1 indicates that the i-th pixel belongs to the target category. g_i can be determined according to the contour of each second object in the positive sample image.

Considering that the second object occupies only a small proportion of the second image, and that there is a certain degree of imbalance between positive and negative samples, the third loss is used in the embodiment of the present disclosure to optimize the segmentation task. This helps to balance positive and negative samples and thereby improves the ability to segment second objects of smaller size.
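A Dice-style loss over the N pixels can be sketched as follows. This assumes the common form L_dice = 1 − 2·Σ p_i·g_i / (Σ p_i + Σ g_i); the exact form of Formula 4 (for example, whether the denominator uses squared terms) is not confirmed by the text. The function name and the smoothing constant eps are illustrative assumptions.

```python
from typing import Sequence


def dice_loss(probs: Sequence[float], labels: Sequence[int],
              eps: float = 1e-6) -> float:
    """Dice loss over N pixels.

    probs[i]  : p_i, predicted probability that pixel i is the target category
    labels[i] : g_i, ground-truth category of pixel i (0 or 1)
    eps       : assumed smoothing term that avoids division by zero
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    overlap = sum(p * g for p, g in zip(probs, labels))
    total = sum(probs) + sum(labels)
    return 1.0 - (2.0 * overlap + eps) / (total + eps)
```

Because the loss is a ratio of overlap to total foreground mass, a small object contributes as strongly as a large one, whereas a per-pixel average would be dominated by the background pixels.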

In step S44, the network parameters of the segmentation sub-network are adjusted according to the third loss.

So far, the training of the segmentation sub-network of the target detection network according to the first training set is completed.

After completing the training of the classification sub-network, regression sub-network and segmentation sub-network of the target detection network according to the first training set, the first stage of training is also completed, and the target detection network in the first state is obtained. After that, the second stage is entered. In the second stage, the target detection network in the first state can be trained according to the second training set to obtain a trained target detection network. Here, the process of training the target detection network in the first state may be a fine-tuning process.

The second training set includes a plurality of sample images and second label information of the sample images, where the second label information includes false positive areas, false negative areas and true positive areas in the sample images.

The acquisition process of the second training set will be described below.

In some embodiments, the method further includes: processing the sample image through the target detection network in the first state to obtain a predicted position of the second object in the sample image; and determining the false positive area, the false negative area and the true positive area in the sample image according to the predicted position and the real position of the second object.

In implementation, a false positive (FP) area is an area that the first label information of the sample image marks as not being the second object, but that the output of the classification sub-network in the first state marks as the second object. A true positive (TP) area is an area that the first label information marks as the second object and that the output of the classification sub-network in the first state also marks as the second object. A false negative (FN) area is an area that the first label information marks as the second object, but that the output of the classification sub-network in the first state marks as not being the second object. A true negative (TN) area is an area that the first label information marks as not being the second object and that the output of the classification sub-network in the first state also marks as not being the second object. Because a false positive area is not actually a second object and reflects a classification error, it needs to be corrected; therefore, the negative sample images in the second training set can be determined according to the false positive areas. Because the true positive areas and the false negative areas are actually second objects, the positive sample images in the second training set can be determined according to the true positive areas and the false negative areas. In some embodiments, all false positive areas may be used as negative sample images in the second training set, while the false negative areas may be augmented threefold and a portion (e.g., 2/3) of the true positive areas may be selected, and these together serve as positive sample images in the second training set.
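The construction of the second training set can be sketched as follows. The text does not state how a predicted position is compared with a real position, so the IoU threshold of 0.5 is an assumption, as are the function names, the 2D box layout and the random selection of true positives. Threefold augmentation of false negatives is represented here simply by repeating each area three times; in practice each copy would presumably be a differently augmented version.

```python
import random
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def split_regions(preds: Sequence[Box], gts: Sequence[Box], thr: float = 0.5):
    """Split first-state predictions and labels into FP, FN and TP areas."""
    tp = [p for p in preds if any(iou(p, g) >= thr for g in gts)]
    fp = [p for p in preds if all(iou(p, g) < thr for g in gts)]
    fn = [g for g in gts if all(iou(p, g) < thr for p in preds)]
    return fp, fn, tp


def build_second_training_set(fp: Sequence[Box], fn: Sequence[Box],
                              tp: Sequence[Box], fn_copies: int = 3,
                              tp_fraction: float = 2 / 3, seed: int = 0):
    """Negatives: all FP areas. Positives: FN areas x3 plus a share of TP."""
    rng = random.Random(seed)
    n_tp = round(len(tp) * tp_fraction)
    positives: List[Box] = list(fn) * fn_copies + rng.sample(list(tp), n_tp)
    negatives: List[Box] = list(fp)
    return positives, negatives
```

Weighting the set toward false negatives and keeping every false positive concentrates the second stage on exactly the areas that the network in the first state got wrong.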

The following describes the process of training the target detection network in the first state according to the second training set.

Training the target detection network in the first state according to the second training set to obtain the trained target detection network includes: training the classification sub-network, the regression sub-network and the segmentation sub-network respectively according to the second training set. In the embodiment of the present disclosure, the training of the target detection network in the first state is performed based on multi-task learning of classification, regression and segmentation, and the ability to recognize objects of the target category is improved by exploiting the correlation between the tasks.

When training the classification sub-network, the sample images used include: false positive area, false negative area and true positive area, and the labeling information to be used includes: the bounding box of the second object.

In some embodiments, training the classification sub-network of the target detection network in the first state according to the second training set may include: cropping the second feature maps of multiple scales of the sample image according to the second label information to determine the fifth feature maps corresponding to the false positive area, the false negative area and the true positive area; inputting the fifth feature maps into the classification sub-network to obtain the third probability that each of the false positive area, the false negative area and the true positive area belongs to the target category; determining the fourth loss of the classification sub-network according to the third probabilities that the false positive area, the false negative area and the true positive area belong to the target category and the true categories of the false positive area, the false negative area and the true positive area; and adjusting the network parameters of the classification sub-network according to the fourth loss.

The above process may refer to steps S21 to S24.

During the training of the regression sub-network, the sample images used include true positive regions and false negative regions, and the annotation information to be used includes: the bounding box of the second object.

In some embodiments, training the regression sub-network of the target detection network in the first state according to the second training set may include: cropping the second feature maps of multiple scales of the sample image according to the second label information to obtain the sixth feature maps corresponding to the true positive area and the false negative area; determining the bounding boxes matching the true positive area and the false negative area; inputting the sixth feature maps into the regression sub-network to obtain the prediction boxes of the true positive area and the false negative area; determining the fifth loss of the regression sub-network according to the difference between the prediction boxes of the true positive area and the false negative area and the corresponding bounding boxes; and adjusting the network parameters of the regression sub-network according to the fifth loss.

The above process may refer to steps S31 to S36.

During the training of the segmentation sub-network, the sample images used include true positive regions and false negative regions, and the annotation information to be used includes: the outline of the second object.

In some embodiments, training the segmentation sub-network of the target detection network in the first state according to the second training set may include: inputting the sixth feature maps corresponding to the true positive area and the false negative area into the segmentation sub-network to obtain the fourth probability that each pixel in the true positive area and the false negative area belongs to the target category; determining the sixth loss of the segmentation sub-network according to the number of pixels in the true positive area and the false negative area, the contour of the second object in the true positive area and the false negative area, and the fourth probability that each pixel belongs to the target category; and adjusting the network parameters of the segmentation sub-network according to the sixth loss.

The above process may refer to steps S41 to S44.

In some embodiments, in the second-stage training process, the coefficient of the loss corresponding to the false positive area (including the fourth loss) may be determined according to the third probability of the false positive area, and the third probabilities of the false negative area and the true positive area may be used as the coefficients of the losses corresponding to the false negative area and the true positive area (including the fourth loss, the fifth loss and the sixth loss). In this way, convergence can be accelerated and training time can be saved.

In some embodiments, during the second-stage training process, an online hard example mining method may be used (for example, each iteration focuses on optimizing the 10 regions with the largest loss values) to train the target detection network in the first state into the trained target detection network. In this way, convergence can be accelerated and training time can be saved.
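The selection of the regions with the largest loss values can be sketched as follows. The default k = 10 follows the example in the text; the function names and the choice to average the selected losses are illustrative assumptions, since the text does not state how the selected losses are combined.

```python
from typing import List, Sequence


def select_hard_regions(losses: Sequence[float], k: int = 10) -> List[int]:
    """Return the indices of the k regions with the largest loss values."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:k]


def hard_mined_loss(losses: Sequence[float], k: int = 10) -> float:
    """Average the loss over the k hardest regions only (assumed reduction)."""
    hard = select_hard_regions(losses, k)
    if not hard:
        return 0.0
    return sum(losses[i] for i in hard) / len(hard)
```

In each iteration, only the selected regions would contribute to the gradient, so easy regions that are already handled correctly do not dilute the update.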

It should be noted that the recursive training process and the multi-task learning training process are closely integrated, not two separate processes. Each stage of the process of training an object detection network recursively is combined with multi-task learning.

FIG. 4 is a schematic structural diagram of the composition of a target detection architecture provided by an embodiment of the present disclosure. As shown in FIG. 4 , the target detection architecture includes a feature extraction network 40 and a target detection network 50 . The feature extraction network 40 includes a basic network and FPN, and the target detection network 50 includes a classification sub-network 51 , a regression sub-network 52 and a segmentation sub-network 53 .

The process of detecting lung nodules from a lung CT image with the target detection network shown in FIG. 4 may include: first, dividing the lung CT image into image blocks of a specified size, each image block being a first image; then, inputting each first image into the target detection network shown in FIG. 4 to obtain the bounding box and contour of the lung nodule in each first image; and finally, determining the bounding box and contour of the lung nodule in the lung CT image according to the bounding box and contour of the lung nodule in each first image.
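The block division and the mapping of per-block results back to the full image can be sketched as follows. The function names, the optional stride, and the choice to shift the last block back so that it ends at the image border are assumptions made for illustration; the text only specifies blocks of a specified size. Merging duplicate detections from overlapping blocks is not covered here.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple


def patch_origins(shape: Sequence[int], patch: Sequence[int],
                  stride: Optional[Sequence[int]] = None) -> List[Tuple[int, ...]]:
    """Origins of fixed-size blocks covering an image of the given shape.

    Works for any number of axes (e.g. 2D slices or 3D CT volumes).
    """
    stride = stride or patch
    per_axis = []
    for size, p, s in zip(shape, patch, stride):
        starts = list(range(0, max(size - p, 0) + 1, s))
        if starts[-1] + p < size:
            # Assumption: add a final block flush with the border.
            starts.append(size - p)
        per_axis.append(starts)
    return list(product(*per_axis))


def to_image_coords(box: Sequence[float],
                    origin: Sequence[int]) -> Tuple[float, ...]:
    """Shift a box (mins..., maxs...) from block to full-image coordinates."""
    n = len(origin)
    return tuple(c + origin[i % n] for i, c in enumerate(box))
```

For example, a 5×4 image divided into 2×2 blocks yields six block origins, and a box (1, 1, 2, 2) found in the block at origin (3, 2) maps to (4, 3, 5, 4) in the full image.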

For each first image, the first image is input into the feature extraction network shown in FIG. 4 for processing, and first feature maps of multiple scales of the first image are obtained. The first feature maps of multiple scales of the first image are respectively input into the classification sub-network, regression sub-network and segmentation sub-network of the trained target detection network to determine whether there are lung nodules in the first image and, for each lung nodule present in the first image, its bounding box and contour.

FIG. 5 is a schematic diagram of a prediction frame of a lung nodule when the target detection network shown in FIG. 4 is the target detection network in the first state. As shown in FIG. 5 , when the target detection network shown in FIG. 4 is the target detection network in the first state trained through the first stage, there are a large number of false positive lung nodules 61 and some false negative lung nodules 62.

FIG. 6 is a schematic diagram of a prediction frame of a lung nodule when the target detection network shown in FIG. 4 is a trained target detection network. As shown in FIG. 6, when the target detection network shown in FIG. 4 is a trained target detection network trained through the first and second stages, the number of false positive lung nodules is reduced.

The target detection method provided by the embodiment of the present application can be used to detect whether there is a first object in the first image, and can obtain the position of the first object in the first image. When the first image is a lung CT image and the first object is a lung nodule, the method can be used to detect whether there is a lung nodule in the lung CT image and to obtain the location of the lung nodule in the lung CT image. During implementation, the method can be used in any suitable scenario that needs to detect whether there is a lung nodule in a lung CT image. For example, in areas with a low level of medical care, the method can be used to screen for lung nodules in the lung CT images to be examined through a remote cloud platform or equipment deployed clinically in hospitals, which helps to raise the level of lung nodule detection in those areas. For another example, in a hospital with a high level of medical care, where there are many patients and clinicians carry a heavy image-reading workload, the automatic screening of lung nodules in lung CT images can be completed through a remote cloud platform or the hospital's clinically deployed equipment, providing an auxiliary means for doctors to make rapid and accurate diagnoses. As another example, lung nodules can be screened automatically in the lung CT images obtained at a physical examination center to improve the level of lung nodule detection.

It can be understood that the foregoing method embodiments provided in the present disclosure can be combined with each other to form a combined embodiment without violating the principle and logic. Those skilled in the art can understand that, in the above method of the specific embodiment, the specific execution order of each step should be determined by its function and possible internal logic.

In addition, the embodiments of the present disclosure also provide target detection devices, electronic devices, computer-readable storage media, computer programs, and computer program products, all of which can be used to implement any target detection method provided by the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section.

FIG. 7 is a schematic structural diagram of a target detection apparatus provided by an embodiment of the present disclosure. As shown in FIG. 7 , the apparatus 700 includes:

The extraction part 701 is configured to perform feature extraction on the first image to be detected to obtain first feature maps of multiple scales of the first image; the first processing part 702 is configured to process the first feature maps of multiple scales of the first image through a trained target detection network to obtain the position of the first object of the target category existing in the first image; wherein the target detection network is trained in a recursive manner; the target detection network includes a classification sub-network, a regression sub-network and a segmentation sub-network; the classification sub-network is used to determine whether the first object exists in the first image, the regression sub-network is used to determine the bounding box of the first object existing in the first image, and the segmentation sub-network is used to determine the contour of the first object existing in the first image.

In some embodiments, the apparatus further includes:

The first training part is configured to train the target detection network according to a first training set to obtain a target detection network in a first state, where the first training set includes a plurality of sample images and first label information of the sample images, and the first label information includes the real position of the second object in the sample image;

The second processing part is configured to process the sample image through the target detection network in the first state to obtain the predicted position of the second object in the sample image;

a determining part, configured to determine a false positive area, a false negative area and a true positive area in the sample image according to the predicted position and the real position of the second object;

The second training part is configured to train the target detection network in the first state according to a second training set to obtain a trained target detection network, where the second training set includes a plurality of sample images and second label information of the sample images, and the second label information includes the false positive area, the false negative area and the true positive area in the sample image.

In some embodiments, the plurality of sample images include positive sample images and negative sample images, and the apparatus further includes: a cropping part configured to crop the annotated second image to obtain positive sample images and negative sample images of a preset size, where the positive sample image includes at least one second object, and the negative sample image does not include the second object.

In some embodiments, the real position of the second object includes a bounding box of the second object, and the first training part is further configured to: perform feature extraction on the sample image to obtain second feature maps of multiple scales of the sample image; determine multiple first reference frames in the sample image according to the second feature maps of multiple scales and multiple preset anchor frames; determine a preset number of training samples from the plurality of first reference frames according to the bounding box of the second object in the sample image, where the training samples include positive samples whose label information belongs to the target category and negative samples whose label information does not belong to the target category; and train the classification sub-network according to the training samples.

In some embodiments, determining a preset number of training samples from the plurality of first reference frames according to the bounding box of the second object in the sample image includes: dividing the bounding boxes of the second objects into multiple bounding box sets, where the sizes of the bounding boxes in each bounding box set fall within a preset size interval; for any bounding box set, removing from the plurality of first reference frames the first reference frames that have already been determined as training samples, to obtain a reference frame set corresponding to the bounding box set; for any bounding box in the bounding box set, determining the positive samples and negative samples corresponding to the bounding box according to the intersection-over-union (IoU) between the bounding box and each first reference frame in the corresponding reference frame set, where the number of positive samples is negatively correlated with the size interval of the bounding box set; and processing each bounding box set in order of size interval from small to large to obtain the preset number of training samples.

In some embodiments, training the classification sub-network according to the training samples includes: cropping the second feature maps to obtain third feature maps corresponding to the training samples; inputting the third feature maps into the classification sub-network to obtain the first probability that each training sample belongs to the target category; determining the first loss of the classification sub-network according to the first probability that the training sample belongs to the target category and the label information of the training sample; and adjusting the network parameters of the classification sub-network according to the first loss.

In some embodiments, the real position of the second object includes a bounding box of the second object, and the first training part is further configured to: perform feature extraction on the positive sample image to obtain fourth feature maps of multiple scales of the positive sample image; determine multiple second reference frames in the positive sample image according to the fourth feature maps of multiple scales and multiple preset anchor frames; for any bounding box of the second object in the sample image, determine the intersection-over-union (IoU) between the bounding box and each of the plurality of second reference frames, and determine the second reference frame with the largest IoU as the matching frame corresponding to the bounding box; input the fifth feature map corresponding to the matching frame into the regression sub-network to obtain the prediction frame of the matching frame; determine the second loss of the regression sub-network according to the difference between the bounding box and the prediction frame; and adjust the network parameters of the regression sub-network according to the second loss.
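The matching step described above, in which each bounding box is paired with the second reference frame of largest IoU, can be sketched as follows. The 2D box layout and the function names are illustrative assumptions; ties are resolved in favour of the first reference frame encountered.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_frames(bboxes: Sequence[Box], frames: Sequence[Box]) -> List[int]:
    """For each bounding box, return the index of its matching frame,
    i.e. the reference frame with the largest IoU."""
    return [max(range(len(frames)), key=lambda i: iou(box, frames[i]))
            for box in bboxes]
```

The feature map cropped at each matching frame is then what the regression sub-network would receive to predict the refined box.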

In some embodiments, the first training part is further configured to: determine the first regression loss of the matching frame according to the coordinate offset and the intersection-over-union (IoU) between the bounding box and the prediction box; determine the second regression loss of the matching frame according to the intersection, the union and the smallest enclosing region (the minimum closed area) of the bounding box and the prediction box; and determine the second loss of the regression sub-network according to the first regression loss and the second regression loss.

In some embodiments, the real position of the second object includes the contour of the second object, and the first training part is further configured to: perform feature extraction on the positive sample image to obtain fourth feature maps of multiple scales of the positive sample image; input the fourth feature maps of multiple scales into the segmentation sub-network to obtain the second probability that each pixel of the positive sample image belongs to the target category; determine the third loss of the segmentation sub-network according to the number of pixels in the positive sample image, the contour of the second object in the positive sample image, and the second probability that each pixel belongs to the target category; and adjust the network parameters of the segmentation sub-network according to the third loss.

In some embodiments, the second training part is further configured to: crop the second feature maps of multiple scales of the sample image according to the second label information to determine the fifth feature maps corresponding to the false positive area, the false negative area and the true positive area; input the fifth feature maps into the classification sub-network to obtain the third probability that each of the false positive area, the false negative area and the true positive area belongs to the target category; determine the fourth loss of the classification sub-network according to the third probabilities that the false positive area, the false negative area and the true positive area belong to the target category and the true categories of the false positive area, the false negative area and the true positive area; and adjust the network parameters of the classification sub-network according to the fourth loss.

In some embodiments, the second training part is further configured to: crop the second feature maps of multiple scales of the sample image according to the second label information to obtain the sixth feature maps corresponding to the true positive area and the false negative area; determine the bounding boxes matching the true positive area and the false negative area; input the sixth feature maps into the regression sub-network to obtain the prediction boxes of the true positive area and the false negative area; determine the fifth loss of the regression sub-network according to the difference between the prediction boxes of the true positive area and the false negative area and the corresponding bounding boxes; and adjust the network parameters of the regression sub-network according to the fifth loss.

In some embodiments, the second training part is further configured to: input the sixth feature maps corresponding to the true positive area and the false negative area into the segmentation sub-network to obtain the fourth probability that each pixel in the true positive area and the false negative area belongs to the target category; determine the sixth loss of the segmentation sub-network according to the number of pixels in the true positive area and the false negative area, the contour of the second object in the true positive area and the false negative area, and the fourth probability that each pixel belongs to the target category; and adjust the network parameters of the segmentation sub-network according to the sixth loss.

In some embodiments, the functions or included parts of the apparatus provided in the embodiments of the present disclosure may be configured to execute the methods described in the above method embodiments, and the specific implementation may refer to the descriptions in the above method embodiments.

In the embodiments of the present disclosure and other embodiments, a "part" may be a part of a circuit, a part of a processor, a part of a program or software, and so on; it may of course also be a unit or a module, or may be non-modular.

Embodiments of the present disclosure further provide a computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the foregoing method is implemented. The computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.

An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory configured to store instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to execute the above method.

Embodiments of the present disclosure also provide a computer program, including computer-readable code. When the computer-readable code runs on a device, a processor in the device executes instructions for implementing the target detection method provided in any of the above embodiments.

Embodiments of the present disclosure further provide a computer program product for storing computer-readable instructions, which, when executed, cause a computer to execute the steps of the target detection method provided by any of the foregoing embodiments.

The electronic device may be provided as a terminal, server or other form of device.

FIG. 8 is a schematic structural diagram of an electronic device 800 according to an embodiment of the disclosure. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, personal digital assistant, or another terminal.

Referring to FIG. 8, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, phone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or some of the steps of the methods described above. Additionally, processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.

Memory 804 is configured to store various types of data to support operation at electronic device 800. Examples of such data include instructions for any application or method operating on electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random-Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.

Power supply assembly 806 provides power to various components of electronic device 800 . Power supply components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 800 .

Multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundaries of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe action. In some embodiments, multimedia component 808 includes a front-facing camera and/or a rear-facing camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front and rear cameras can be a fixed optical lens system or have focal length and optical zoom capability.

Audio component 810 is configured to output and/or input audio signals. For example, audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when electronic device 800 is in operating modes, such as calling mode, recording mode, and voice recognition mode. In some embodiments, the received audio signal may be stored in memory 804 or transmitted via communication component 816 . In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: home button, volume buttons, start button, and lock button.

Sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of electronic device 800. For example, the sensor assembly 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, such as the display and the keypad of the electronic device 800. The sensor assembly 814 can also detect a change in the position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. Sensor assembly 814 may also include a light sensor, such as a Complementary Metal-Oxide-Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

Communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as a wireless network (WiFi), second generation mobile communication technology (2G), third generation mobile communication technology (3G), or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above method.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as a memory 804 comprising computer program instructions executable by the processor 820 of the electronic device 800 to perform the above method.

FIG. 9 is a schematic structural diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 may be implemented as a server. Referring to FIG. 9, the electronic device 1900 includes a processing component 1922, which in some embodiments may include one or more processors, and a memory resource, represented by memory 1932, for storing instructions executable by the processing component 1922, such as application programs. An application program stored in memory 1932 may include one or more modules, each corresponding to a set of instructions. Additionally, the processing component 1922 is configured to execute the instructions to perform the above-described methods.

The electronic device 1900 may also include a power supply assembly 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical user interface based operating system introduced by Apple (Mac OS X™), a multi-user multi-process computer operating system (Unix™), a free and open source Unix-like operating system (Linux™), an open source Unix-like operating system (FreeBSD™), or the like.

In some embodiments, a non-volatile computer-readable storage medium is also provided, such as memory 1932 comprising computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described method.

Embodiments of the present disclosure may be one or more of a system, a method, a computer-readable storage medium, a computer program, or a computer program product. The computer program product may include a computer-readable storage medium loaded with computer-readable program instructions for enabling the processor to implement the target detection method provided by any of the above embodiments of the present disclosure.

A computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital video discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves on which instructions are stored, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in that computing/processing device.

The computer program instructions for performing the steps of the embodiments of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, custom electronic circuits, such as programmable logic circuits, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), are personalized by utilizing state information of the computer readable program instructions, and the electronic circuits may execute the computer-readable program instructions to implement embodiments of the present disclosure.

Embodiments of the present disclosure are described herein with reference to flowchart illustrations and/or structural diagrams of methods, apparatus (systems) and computer program products according to embodiments of the present disclosure. It will be understood that each block of the flowcharts and/or structural diagrams, and combinations of blocks in the flowcharts and/or structural diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer readable program instructions can also be stored in a computer readable storage medium; the instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer readable medium storing the instructions comprises an article of manufacture including instructions that implement various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

Computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, such that the instructions executing on the computer, other programmable data processing apparatus, or other equipment implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some implementations, the functions noted in the blocks may also occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, may be implemented using a dedicated hardware-based system that performs the specified functions or actions, or can be implemented using a combination of dedicated hardware and computer instructions.

The computer program product can be specifically implemented by hardware, software or a combination thereof. In some embodiments, the computer program product is embodied as a computer storage medium, and in other embodiments, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and the like.

Various embodiments of the present disclosure have been described above, and the foregoing descriptions are exemplary, not exhaustive, and not limiting of the disclosed embodiments. Numerous modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or improvement over the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Industrial Applicability

Embodiments of the present disclosure provide a target detection method and apparatus, electronic device, storage medium, computer program product, and computer program, wherein the method includes: performing feature extraction on a first image to be detected to obtain first feature maps of multiple scales of the first image; and processing the first feature maps of multiple scales of the first image through a trained target detection network to obtain the position of a first object of a target category in the first image. According to the embodiments of the present disclosure, the first object of the target category existing in the image to be detected can be detected, and the sensitivity and accuracy of target detection can be improved.

Claims

A target detection method, comprising:

performing feature extraction on the first image to be detected to obtain first feature maps of multiple scales of the first image;

Process the first feature maps of multiple scales of the first image through the trained target detection network to obtain the position of the first object of the target category existing in the first image;

Wherein, the target detection network is trained in a recursive manner; the target detection network includes a classification sub-network, a regression sub-network and a segmentation sub-network; the classification sub-network is used to determine whether the first object exists in the first image, the regression sub-network is used to determine the bounding box of the first object existing in the first image, and the segmentation sub-network is used to determine the outline of the first object existing in the first image.
The method of claim 1, further comprising:

According to a first training set, the target detection network is trained to obtain the target detection network in a first state, wherein the first training set includes a plurality of sample images and first annotation information of the sample images, and the first annotation information includes the real position of a second object in the sample image;

The sample image is processed by the target detection network in the first state to obtain the predicted position of the second object in the sample image;

According to the predicted position and the real position of the second object, determine the false positive area, the false negative area and the true positive area in the sample image;

According to a second training set, the target detection network in the first state is trained to obtain the trained target detection network, wherein the second training set includes a plurality of sample images and second annotation information of the sample images, and the second annotation information includes the false positive areas, false negative areas and true positive areas in the sample images.
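For illustration only, the step of determining false positive, false negative and true positive areas from predicted and real positions could be sketched as below. This is a minimal sketch under stated assumptions, not the claimed implementation: boxes are taken to be 2D `(x1, y1, x2, y2)` tuples, the matching criterion is intersection over union (IoU), and the threshold of 0.5, the `Box` alias and the function names `iou` and `split_regions` are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); assumed 2D layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def split_regions(predicted: List[Box], real: List[Box], thresh: float = 0.5):
    """Label predictions as true/false positives and missed real boxes as false negatives.

    The IoU threshold is an assumed value; the claims do not specify one.
    """
    true_pos, false_pos = [], []
    matched = set()
    for p in predicted:
        hits = [i for i, r in enumerate(real) if iou(p, r) >= thresh]
        if hits:
            true_pos.append(p)       # prediction overlaps a real object
            matched.update(hits)
        else:
            false_pos.append(p)      # prediction with no real counterpart
    # Real objects that no prediction covered are the false negatives.
    false_neg = [r for i, r in enumerate(real) if i not in matched]
    return true_pos, false_pos, false_neg
```

The three returned lists would then serve as the second annotation information for the second round of training.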
The method of claim 2, wherein the plurality of sample images include positive sample images and negative sample images, the method further comprising:

The marked second image is cropped to obtain a positive sample image and a negative sample image of a preset size, wherein the positive sample image includes at least one second object, and the negative sample image does not include the second object.
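As a hedged illustration of the cropping step, the sketch below tiles a 2D image into non-overlapping patches of a preset size and sorts them into positive and negative samples. The tiling strategy, the half-open `(x1, y1, x2, y2)` box convention, the choice to discard patches that only partly overlap an object, and the name `crop_patches` are assumptions made for this example; the claim itself only requires that positive patches contain at least one second object and negative patches contain none.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), half-open pixel ranges (assumed)


def crop_patches(image: Sequence[Sequence[float]], boxes: Sequence[Box], size: int):
    """Tile the image into size x size patches and split them by object content."""
    height, width = len(image), len(image[0])
    positives: List[list] = []
    negatives: List[list] = []
    for y in range(0, height - size + 1, size):
        for x in range(0, width - size + 1, size):
            patch = [list(row[x:x + size]) for row in image[y:y + size]]
            # A patch is positive if it fully contains at least one annotated object.
            contains = any(x <= x1 and x2 <= x + size and y <= y1 and y2 <= y + size
                           for x1, y1, x2, y2 in boxes)
            # A patch is negative only if it overlaps no annotated object at all.
            touches = any(x1 < x + size and x2 > x and y1 < y + size and y2 > y
                          for x1, y1, x2, y2 in boxes)
            if contains:
                positives.append(patch)
            elif not touches:
                negatives.append(patch)
    return positives, negatives
```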
The method according to claim 2, wherein the real position of the second object includes the bounding box of the second object, and training the target detection network according to the first training set to obtain the target detection network in the first state comprises:

performing feature extraction on the sample image to obtain second feature maps of multiple scales of the sample image;

determining a plurality of first reference frames in the sample image according to the second feature maps of the plurality of scales and a plurality of preset anchor frames;

According to the bounding box of the second object in the sample image, a preset number of training samples are determined from the plurality of first reference frames, wherein the training samples include positive samples whose annotation information indicates that they belong to the target category, and negative samples whose annotation information indicates that they do not belong to the target category;

The classification sub-network is trained according to the training samples.
The method according to claim 4, wherein determining a preset number of training samples from the plurality of first reference frames according to the bounding box of the second object in the sample image comprises:

dividing the bounding box in the sample image into multiple bounding box sets, and the size of the bounding box in each bounding box set is within a preset size interval;

For any set of bounding boxes, remove the first reference frame that has been determined as a training sample from the plurality of first reference frames, to obtain a set of reference frames corresponding to the set of bounding boxes;

For any bounding box in the bounding box set, determine positive samples and negative samples corresponding to the bounding box according to the intersection ratio between the bounding box and each first reference frame in the corresponding reference frame set, wherein the number of positive samples is negatively correlated with the size interval of the bounding box set;

Each bounding box set is sequentially processed according to the size interval from small to large to obtain the preset number of training samples.
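The sampling procedure above can be sketched as follows. This is only one possible reading under stated assumptions: the "intersection ratio" is interpreted as intersection over union (IoU), box size is taken as the longer side of a 2D box, and the thresholds `pos_thresh` and `neg_thresh`, the per-interval positive counts, and the name `select_samples` are hypothetical. The cap on the total preset number of samples is omitted for brevity.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); assumed 2D layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_samples(
    gt_boxes: Sequence[Box],
    anchors: Sequence[Box],
    intervals: Sequence[Tuple[float, float]],
    positives_per_interval: Sequence[int],
    pos_thresh: float = 0.3,   # assumed value
    neg_thresh: float = 0.05,  # assumed value
) -> List[Tuple[int, int]]:
    """Return (anchor_index, label) pairs; label 1 is positive, 0 is negative."""
    used = set()
    samples: List[Tuple[int, int]] = []
    # Size intervals are processed from small to large.
    for (lo, hi), k in sorted(zip(intervals, positives_per_interval)):
        bucket = [g for g in gt_boxes if lo <= max(g[2] - g[0], g[3] - g[1]) < hi]
        # Reference frames already chosen for an earlier interval are removed.
        pool = [i for i in range(len(anchors)) if i not in used]
        for g in bucket:
            free = [i for i in pool if i not in used]
            ranked = sorted(free, key=lambda i: iou(g, anchors[i]), reverse=True)
            # Up to k best-overlapping frames become positives for this box.
            pos = [i for i in ranked if iou(g, anchors[i]) >= pos_thresh][:k]
            # Frames that overlap no annotated box become negatives.
            neg = [i for i in free
                   if all(iou(q, anchors[i]) < neg_thresh for q in gt_boxes)][:k]
            samples += [(i, 1) for i in pos] + [(i, 0) for i in neg]
            used.update(pos, neg)
    return samples
```

Passing a larger positive count for the smaller interval (for example `positives_per_interval=(2, 1)`) gives small objects more positive samples, which is the negative correlation the claim describes.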
The method according to claim 4 or 5, wherein the training the classification sub-network according to the training samples comprises:

Cropping the second feature map to obtain a third feature map corresponding to the training sample;

Inputting the third feature map into the classification sub-network to obtain the first probability that the training sample belongs to the target category;

determining the first loss of the classification sub-network according to the first probability that the training sample belongs to the target category and the labeling information of the training sample;

According to the first loss, the network parameters of the classification sub-network are adjusted.
The method according to claim 3, wherein the real position of the second object includes the bounding box of the second object, and training the target detection network according to the first training set to obtain the target detection network in the first state comprises:

performing feature extraction on the positive sample image to obtain fourth feature maps of multiple scales of the positive sample image;

determining a plurality of second reference frames in the positive sample image according to the fourth feature maps of the plurality of scales and a plurality of preset anchor frames;

For any bounding box of the second object in the sample image:

determining the intersection ratio of the bounding box and the plurality of second reference frames, and determining the second reference frame with the largest intersection ratio as the matching frame corresponding to the bounding box;

Input the fifth feature map corresponding to the matching frame into the regression sub-network to obtain the prediction frame of the matching frame;

determining the second loss of the regression sub-network according to the difference between the bounding box and the prediction box;

According to the second loss, network parameters of the regression sub-network are adjusted.
The method according to claim 7, wherein the determining the second loss of the regression sub-network according to the difference between the bounding box and the prediction box comprises:

determining the first regression loss of the matching frame according to the coordinate offset and the intersection ratio between the bounding frame and the prediction frame;

determining the second regression loss of the matching box according to the intersection, union and minimum closed region between the bounding box and the prediction box;

A second loss of the regression sub-network is determined according to the first regression loss and the second regression loss.
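The two regression terms can be illustrated with the sketch below. It is a hedged example, not the claimed formula: the first term is assumed to be the mean absolute coordinate offset plus `1 - IoU`, and the second term is assumed to follow the generalized IoU form, which combines the intersection, the union and the smallest enclosing box. The weights `w1` and `w2`, the 2D box layout and all function names are hypothetical, and both boxes are assumed to have positive area.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); assumed 2D layout


def _areas(gt: Box, pred: Box) -> Tuple[float, float, float]:
    """Return (intersection, union, smallest enclosing box area)."""
    iw = max(0.0, min(gt[2], pred[2]) - max(gt[0], pred[0]))
    ih = max(0.0, min(gt[3], pred[3]) - max(gt[1], pred[1]))
    inter = iw * ih
    union = ((gt[2] - gt[0]) * (gt[3] - gt[1])
             + (pred[2] - pred[0]) * (pred[3] - pred[1]) - inter)
    enclose = ((max(gt[2], pred[2]) - min(gt[0], pred[0]))
               * (max(gt[3], pred[3]) - min(gt[1], pred[1])))
    return inter, union, enclose


def first_regression_loss(gt: Box, pred: Box) -> float:
    """Assumed form: mean absolute coordinate offset plus (1 - IoU)."""
    inter, union, _ = _areas(gt, pred)
    offset = sum(abs(g - p) for g, p in zip(gt, pred)) / 4.0
    return offset + (1.0 - inter / union)


def second_regression_loss(gt: Box, pred: Box) -> float:
    """Assumed form: 1 - GIoU, using intersection, union and enclosing area."""
    inter, union, enclose = _areas(gt, pred)
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


def regression_loss(gt: Box, pred: Box, w1: float = 1.0, w2: float = 1.0) -> float:
    """Weighted sum of the two terms; the weights are hypothetical."""
    return w1 * first_regression_loss(gt, pred) + w2 * second_regression_loss(gt, pred)
```

Unlike plain IoU, the enclosing-area term still varies when the two boxes do not overlap, so the loss keeps giving a useful signal for poorly placed predictions.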
The method according to claim 3, wherein the real position of the second object includes the outline of the second object, and training the target detection network according to the first training set to obtain the target detection network in the first state comprises:

performing feature extraction on the positive sample image to obtain fourth feature maps of multiple scales of the positive sample image;

Inputting the fourth feature maps of the multiple scales into the segmentation sub-network to obtain the second probability that each pixel of the positive sample image belongs to the target category;

Determine the third loss of the segmentation sub-network according to the number of pixels in the positive sample image, the contour of the second object in the positive sample image, and the second probability that each pixel belongs to the target category;

According to the third loss, the network parameters of the segmentation sub-network are adjusted.
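The claim does not name a specific loss function, so the following is only one plausible instance: a binary cross-entropy averaged over the number of pixels, computed from the per-pixel probability and a binary mask derived from the object outline. The mask representation, the clamping constant `eps` and the name `pixel_bce` are assumptions for this sketch; an overlap-based loss such as Dice would fit the claim wording equally well.

```python
import math
from typing import Sequence


def pixel_bce(probs: Sequence[Sequence[float]],
              mask: Sequence[Sequence[int]],
              eps: float = 1e-7) -> float:
    """Pixel-averaged binary cross-entropy between probabilities and a 0/1 mask."""
    total, count = 0.0, 0
    for prob_row, mask_row in zip(probs, mask):
        for p, m in zip(prob_row, mask_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(m * math.log(p) + (1 - m) * math.log(1.0 - p))
            count += 1
    # Dividing by the pixel count makes the loss independent of image size.
    return total / count
```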
The method according to claim 2, wherein training the target detection network in the first state according to the second training set to obtain the trained target detection network comprises:

According to the second annotation information, the second feature maps of multiple scales of the sample image are cropped, and the fifth feature maps corresponding to the false positive area, the false negative area and the true positive area are determined;

Inputting the fifth feature map into the classification sub-network to obtain the third probability that the false positive area, the false negative area and the true positive area belong to the target category;

Determine the fourth loss of the classification sub-network according to the third probability that the false positive area, the false negative area and the true positive area belong to the target category, and the true category of the false positive area, the false negative area and the true positive area;

According to the fourth loss, network parameters of the classification sub-network are adjusted.
The method according to claim 2, wherein training the target detection network in the first state according to the second training set to obtain the trained target detection network comprises:

According to the second annotation information, the second feature maps of multiple scales of the sample image are cropped to obtain sixth feature maps corresponding to the true positive area and the false negative area;

determining bounding boxes that match the true positive and false negative regions;

Input the sixth feature map into the regression sub-network to obtain the prediction frame of the true positive area and the false negative area;

determining the fifth loss of the regression sub-network according to the difference between the prediction boxes of the true positive regions and the false negative regions and the corresponding bounding boxes;

According to the fifth loss, network parameters of the regression sub-network are adjusted.
The method according to claim 2, wherein training the target detection network in the first state according to the second training set to obtain the trained target detection network comprises:

Inputting the sixth feature map corresponding to the true positive area and the false negative area into the segmentation sub-network to obtain the fourth probability that each pixel in the true positive area and the false negative area belongs to the target category;

According to the number of pixels in the true positive area and the false negative area, the outline of the second object in the true positive area and the false negative area, and the fourth probability that each pixel belongs to the target category, determine a sixth loss of the segmentation sub-network;

According to the sixth loss, the network parameters of the segmentation sub-network are adjusted.
The method according to any one of claims 1 to 12, wherein the first image comprises a 2D medical image and/or a 3D medical image, and the target category comprises nodules and/or cysts.
A target detection device, comprising:

an extraction part, used for feature extraction of the first image to be detected, to obtain first feature maps of multiple scales of the first image;

a first processing part, configured to process the first feature maps of multiple scales of the first image through the trained target detection network to obtain the position of the first object of the target category existing in the first image;

Wherein, the target detection network is trained in a recursive manner; the target detection network includes a classification sub-network, a regression sub-network and a segmentation sub-network; the classification sub-network is used to determine whether the first object exists in the first image, the regression sub-network is used to determine the bounding box of the first object existing in the first image, and the segmentation sub-network is used to determine the outline of the first object existing in the first image.
The apparatus according to claim 14, further comprising: a first training part configured to train the target detection network according to a first training set to obtain a target detection network in a first state, the first The training set includes a plurality of sample images and first annotation information of the sample images, the first annotation information includes the real position of the second object in the sample image; the second processing part is configured to pass the first state The target detection network processes the sample image to obtain the predicted position of the second object in the sample image; the determining part is configured to determine the predicted position and real position of the second object in the sample image. A false positive area, a false negative area and a true positive area; the second training part is configured to train the target detection network in the first state according to the second training set to obtain a trained target detection network, the second The training set includes a plurality of sample images and second annotation information of the sample images, where the second annotation information includes false positive areas, false negative areas and true positive areas in the sample images.
The apparatus according to claim 15, wherein the plurality of sample images include positive sample images and negative sample images, and the apparatus further comprises: a cropping part configured to crop the marked second image to obtain a preset A positive sample image and a negative sample image of the size, the positive sample image includes at least one second object, and the negative sample image does not include the second object.
The apparatus according to claim 15, wherein the real position of the second object includes a bounding box of the second object, and the first training part is further configured to: perform feature extraction on the sample image to obtain the multiple scale second feature maps of the sample image; according to the multiple scale second feature maps and multiple preset anchor frames, determine multiple first reference frames in the sample image; according to the The bounding box of the second object in the sample image, a preset number of training samples are determined from the plurality of first reference frames, and the training samples include positive samples whose annotation information belongs to the target category, and whose annotation information does not belong to the target category. Negative samples of the target category; according to the training samples, the classification sub-network is trained.
The apparatus according to claim 17, wherein the determining a preset number of training samples from the plurality of first reference frames according to the bounding box of the second object in the sample image comprises: The bounding box in the sample image is divided into multiple bounding box sets, and the size of the bounding box in each bounding box set is within a preset size range; for any bounding box set, it is removed from the multiple first reference frames It has been determined as the first reference frame of the training sample, and a reference frame set corresponding to the bounding box set is obtained; for any bounding box in the bounding box set, according to the bounding box and the corresponding reference frame set The intersection ratio between each first reference frame of , determines the positive samples and negative samples corresponding to the bounding box, and the number of the positive samples is negatively correlated with the size interval of the bounding box set; Each bounding box set is processed in order in order to obtain the preset number of training samples.
The apparatus according to claim 17 or 18, wherein the training the classification sub-network according to the training sample comprises: cropping the second feature map to obtain a third feature corresponding to the training sample Figure; input the third feature map into the classification sub-network to obtain the first probability that the training sample belongs to the target category; according to the first probability that the training sample belongs to the target category and the labeling information of the training sample, determining a first loss of the classification sub-network; and adjusting network parameters of the classification sub-network according to the first loss.
The apparatus according to claim 16, wherein the real position of the second object includes a bounding box of the second object, and the first training part is further configured to: perform feature extraction on the positive sample image to obtain fourth feature maps of multiple scales of the positive sample image; determining multiple second reference frames in the positive sample image according to the fourth feature maps of the multiple scales and a plurality of preset anchor frames; For any bounding box of the second object in the sample image: determine the intersection ratio of the bounding box and the plurality of second reference frames, and determine the second reference frame with the largest intersection ratio as the same as the second reference frame. The matching frame corresponding to the bounding box; input the fifth feature map corresponding to the matching frame into the regression sub-network to obtain the prediction frame of the matching frame; according to the difference between the bounding frame and the prediction frame, determine The second loss of the regression sub-network; according to the second loss, the network parameters of the regression sub-network are adjusted.
The apparatus according to claim 20, wherein the first training part is further configured to: determine the first training part of the matching frame according to the coordinate offset and the intersection ratio between the bounding box and the prediction frame a regression loss; according to the intersection, union and minimum closed area between the bounding box and the prediction box, determine the second regression loss of the matching box; according to the first regression loss and the second regression loss loss, which determines the second loss of the regression sub-network.
The apparatus according to claim 16, wherein the real position of the second object includes the contour of the second object, and the first training part is further configured to: perform feature extraction on the positive sample image to obtain the the fourth feature maps of multiple scales of the positive sample image; input the fourth feature maps of the multiple scales into the segmentation sub-network to obtain the second probability that each pixel of the positive sample image belongs to the target category; according to The number of pixels in the positive sample image, the contour of the second object in the positive sample image, and the second probability that each pixel belongs to the target category determines the third loss of the segmentation sub-network; according to the third loss , and adjust the network parameters of the segmentation sub-network.
The apparatus according to claim 15, wherein the second training part is further configured to: according to the second label information, crop the second feature maps of multiple scales of the sample image to determine false positive regions , the fifth feature map corresponding to the false negative region and the true positive region; input the fifth feature map into the classification sub-network to obtain the third probability that the false positive region, the false negative region and the true positive region belong to the target category; according to The third probability that the false positive area, the false negative area and the true positive area belong to the target category, and the true category of the false positive area, the false negative area and the true positive area, determine the fourth loss of the classification sub-network; Four losses, which adjust the network parameters of the classification sub-network.
The apparatus according to claim 15, wherein the second training part is further configured to: crop the second feature maps of multiple scales of the sample image according to the second label information to obtain sixth feature maps corresponding to a true positive region and a false negative region; determine the bounding boxes matching the true positive region and the false negative region; input the sixth feature maps into the regression sub-network to obtain prediction boxes of the true positive region and the false negative region; determine a fifth loss of the regression sub-network according to the difference between the prediction boxes of the true positive region and the false negative region and the corresponding bounding boxes; and adjust the network parameters of the regression sub-network according to the fifth loss.
The apparatus according to claim 15, wherein the second training part is further configured to: input the sixth feature maps corresponding to the true positive region and the false negative region into the segmentation sub-network to obtain a fourth probability that each pixel in the true positive region and the false negative region belongs to the target category; determine a sixth loss of the segmentation sub-network according to the number of pixels in the true positive region and the false negative region, the contour of the second object in the true positive region and the false negative region, and the fourth probability that each pixel belongs to the target category; and adjust the network parameters of the segmentation sub-network according to the sixth loss.
The apparatus of any one of claims 14 to 25, wherein the first image comprises a 2D medical image and/or a 3D medical image, and the target category comprises a nodule and/or a cyst.
An electronic device comprising:

a processor; and

a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any one of claims 1 to 13.
A computer-readable storage medium having computer program instructions stored thereon, the computer program instructions implementing the method of any one of claims 1 to 13 when executed by a processor.
A computer program comprising computer readable code, wherein, when the computer readable code is run on a device, a processor in the device executes instructions for implementing the method of any one of claims 1 to 13.
A computer program product configured to store computer readable instructions which, when executed, cause a computer to perform the method of any one of claims 1 to 13.