CN113517056B - Medical image target area identification method, neural network model and application - Google Patents

Medical image target area identification method, neural network model and application

Info

Publication number
CN113517056B
CN113517056B (application CN202110680955.6A)
Authority
CN
China
Prior art keywords
image
backbone network
network model
residual
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110680955.6A
Other languages
Chinese (zh)
Other versions
CN113517056A (en)
Inventor
Liang Zhen (梁振)
Shan Chunjie (单淳劼)
Zhao Weijia (赵维佳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Medical University
Original Assignee
Anhui Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Medical University filed Critical Anhui Medical University
Priority to CN202110680955.6A priority Critical patent/CN113517056B/en
Publication of CN113517056A publication Critical patent/CN113517056A/en
Application granted granted Critical
Publication of CN113517056B publication Critical patent/CN113517056B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 ICT specially adapted for the handling or processing of medical images
    • G16H30/20 ICT specially adapted for the handling or processing of medical images for handling medical images, e.g. DICOM, HL7 or PACS
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a method for identifying a target area in a medical image, comprising the following steps: read the information of the current Dicom file and, if the information contains a target keyword, process the Dicom file to obtain an image to be identified; input the image to be identified into a neural network model based on a residual structure and output it through different detection heads, respectively obtaining target images with different receptive fields. The neural network model based on the residual structure comprises a backbone network structure, an SPP layer and detection heads. The invention also discloses the neural network model based on the residual structure itself. The network model established by the invention has good nonlinear expression capability, reduces the parameter count of the network model, improves its computation speed, improves its accuracy and robustness, is sensitive to targets of different sizes, and identifies targets accurately.

Description

Medical image target area identification method, neural network model and application
Technical Field
The invention relates to the field of medicine, and in particular to a method for identifying a target area of a medical image, a neural network model and an application thereof.
Background
Currently, image processing, and MRI image recognition in particular, is usually performed by image segmentation, for example the method, apparatus, terminal device and storage medium for segmenting magnetic resonance images disclosed in patent application 201911243400.4, and the improved glioma segmentation method using cross-sequence MRI disclosed in patent application 202011164826.3.
However, conventional segmentation network models suffer from inaccurate segmentation of the target region, interference from surrounding targets and slow inference. They are also hard to train because of their large GPU memory consumption, long training iteration periods and numerous hyperparameters that are difficult to tune. In addition, as the depth of the network model increases, its sensitivity to smaller targets decreases.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a method for identifying a target area of a medical image, a neural network model and an application thereof that optimize the nonlinear expression capability of the network model, reduce its parameter count, improve its computation speed, improve its accuracy and robustness, remain sensitive to targets of different sizes, and identify targets accurately.
The invention mainly solves this technical problem by the following technical means: a method of identifying a target region of a medical image, comprising the following steps:
step one, reading information of a current Dicom file, and if the information comprises a target keyword, processing the Dicom file to obtain an image to be identified;
step two, inputting the image to be identified into a neural network model based on a residual structure and outputting it through different detection heads, respectively obtaining target images with different receptive fields;
the neural network model based on the residual structure comprises backbone network structures, an SPP layer and detection heads;
each backbone network structure comprises a residual structure, where the residual structure is built from basic units consisting of a 1×1 Conv layer followed by a 3×3 Conv layer; each residual structure contains at least one basic unit;
the backbone network structures are cascaded in the order in which they process the image; the output end of each backbone network structure is matched with the input end of a corresponding detection head; the SPP layer is arranged between the last backbone network structure to process the image and its corresponding detection head.
Preferably: in step one, the information includes MRI image weight information, and the target keyword is T2;
step one is specifically: use the open source tool PyDicom to read the sequence information of each Dicom file in turn and judge whether it contains the keyword T2; if the current Dicom file contains the keyword T2, it is a target file, and the PyDicom third-party library is used to read the matrix information in the Dicom file and save it as an image in JPG format; if the current Dicom file does not contain the keyword T2, the next Dicom file is examined.
Preferably: the method also comprises the following step:
preprocessing of the data, which uniformly resizes the Dicom data to 412×412 and normalizes the pixel values of the image.
Preferably: the backbone network structures comprise a shallow backbone network structure, a middle backbone network structure and a deep backbone network structure; the detection heads comprise a first detection head, a second detection head and a third detection head;
each backbone network structure comprises a residual structure, where the residual structure is built from basic units consisting of a 1×1 Conv layer followed by a 3×3 Conv layer, and each residual structure contains at least one basic unit. When the image to be identified is processed by the shallow backbone network structure and then input to the first detection head, a target image with a first receptive field is output; when it is processed by the shallow and middle backbone network structures in sequence and then input to the second detection head, a target image with a second receptive field is output; when it is processed by the shallow, middle and deep backbone network structures and the SPP layer in sequence and then input to the third detection head, a target image with a third receptive field is output.
Preferably: the shallow backbone network structure comprises, in image-processing order, a first residual structure, a second residual structure and a third residual structure; the first residual structure comprises one basic unit; the second residual structure comprises two sequentially cascaded basic units; the third residual structure comprises eight sequentially cascaded basic units; the middle backbone network structure comprises a fourth residual structure of eight sequentially cascaded basic units; the deep backbone network structure comprises a fifth residual structure of four sequentially cascaded basic units.
Preferably: in each residual structure, the number of channels of the 1×1 Conv layer is smaller than the number of channels of the 3×3 Conv layer.
Preferably: the shallow backbone network structure comprises, in the order in which the image is processed, two 3×3 Conv layers, the first residual structure, a 3×3 Conv layer, the second residual structure, a 3×3 Conv layer and the third residual structure; the middle backbone network structure comprises, in processing order, a 3×3 Conv layer and the fourth residual structure; the deep backbone network structure comprises, in processing order, a 3×3 Conv layer and the fifth residual structure.
Preferably: the SPP layer comprises a 5×5 Max Pooling layer, a 9×9 Max Pooling layer and a 13×13 Max Pooling layer arranged in parallel; after being processed by the deep backbone network structure, the image to be identified is input into the 5×5, 9×9 and 13×13 Max Pooling layers respectively for compression, and the results are then fused and output.
Preferably: a Convolution Set is also arranged between each backbone network structure and its matching detection head; the Convolution Set matching the last backbone network structure to process the image sits between the SPP layer and the corresponding detection head; the Convolution Set consists of alternately stacked 1×1 Conv layers and 3×3 Conv layers, with a 1×1 Conv layer at the end.
Preferably: each Conv layer comprises a Conv2D layer and a Batch Normalization layer.
Preferably: the neural network model based on the residual structure (hereinafter the network model) is also trained before processing the image to be identified, comprising the following steps: input the images of the training set into the network model and train it; training is complete when the parameters of the network model make the loss function converge.
Preferably: before the images in the training set are input to the network model, a data enhancement process is also performed, comprising at least one of random contrast adjustment, random-angle rotation and the Mosaic enhancement method.
Preferably: the optimizer used for training the network model is the Adam optimizer, and the loss function is the Focal Loss function;
expression of the Focal Loss function:

FL(p) = -α·(1-p)^γ·log(p),   if y = 1
FL(p) = -(1-α)·p^γ·log(1-p), if y = 0    (1)

wherein p in expression (1) represents a confidence level, α = 0.25 and γ = 2; y = 0 denotes a negative sample and y = 1 a positive sample.
The invention also discloses a neural network model based on a residual structure, comprising a shallow backbone network structure, a middle backbone network structure, a deep backbone network structure, an SPP layer, a first detection head, a second detection head and a third detection head;
each backbone network structure comprises a residual structure, where the residual structure is built from basic units consisting of a 1×1 Conv layer followed by a 3×3 Conv layer, and each residual structure contains at least one basic unit. When the image to be identified is processed by the shallow backbone network structure and then input to the first detection head, a target image with a first receptive field is output; when it is processed by the shallow and middle backbone network structures in sequence and then input to the second detection head, a target image with a second receptive field is output; when it is processed by the shallow, middle and deep backbone network structures and the SPP layer in sequence and then input to the third detection head, a target image with a third receptive field is output; the first receptive field, the second receptive field and the third receptive field are all different.
The invention also discloses an application of the neural network based on the residual structure in identifying regions of interest in non-medical images.
The invention has the following advantages. The invention greatly reduces the amount of information in the output items, which improves the computation speed of the network model; at the same time, to make the network model as robust as possible, several detection heads of different sizes exist in the network model, so it is sensitive to both large and small targets. The basic unit of the network model's structure is a residual structure formed by alternately combining Conv layers with 3×3 and 1×1 convolution kernels, which greatly improves the accuracy of the network model and prevents it from degrading once it has a certain depth.
Further, with the residual structure, the feature matrix of the image information undergoes two Conv layer operations, and the result is then superposed once with the original feature matrix. A conventional residual structure uses two 3×3 Conv layers as its basic unit, whereas the residual structure of the invention uses a Conv layer with a 1×1 kernel followed by a Conv layer with a 3×3 kernel as its basic unit. Each Conv layer comprises a 2-dimensional convolution (Conv2D) layer, i.e. the 2D Convolutional Layer and Batch Normalization Layer in FIG. 4, and is processed through the ReLU activation function of a ReLU layer; the 1×1 Conv layer reduces the dimensions and the 3×3 Conv layer then restores them. Building a deeper network model in this way greatly improves its performance, extracts deeper detailed features of the data, and makes the judgment of the target category and the positioning of the target more accurate. The network model of the invention reduces the parameter count by approximately 80% or more.
Further, in the invention the number of channels of the 1×1 Conv layer in each residual structure is smaller than that of the 3×3 Conv layer; performing the dimension-reduction operation with the 1×1 convolution and then the dimension-increase operation with the 3×3 convolution reduces the number of parameters.
Further, the SPP structure used in the invention, with 5×5, 9×9 and 13×13 Max Pooling layers, improves the final accuracy by 3.8%, and the experimental results are more robust than with a conventional SPP structure using 6×6, 3×3 and 2×2 Max Pooling layers, which improves the final result by only 2.1%.
Furthermore, before each detection head there is a Convolution Set (i.e. a convolution output end) formed by alternately stacking and combining 1×1 Conv layers and 3×3 Conv layers, which performs dimension reduction and improves the nonlinear fitting capability of the network model.
Further, the three detection heads output from the shallower layer 26, the middle layer 43 and the deeper layer 52 respectively, giving receptive fields of different sizes: the shallow network, with its smaller receptive field, is suited to predicting smaller targets, while the deep network, with its larger receptive field, is suited to predicting larger targets. This solves the pain point that traditional target detection algorithms are very insensitive to the detection of small targets. By establishing detection heads of different dimensions, the invention greatly improves the sensitivity of the network model to targets of different sizes.
Drawings
FIG. 1 is a comparison of four weighted MRI images in accordance with the present invention;
FIG. 2 is a schematic diagram of the basic unit of the residual structure in the present invention;
FIG. 3 is a schematic view of the SPP structure in the present invention;
FIG. 4 is a schematic diagram of a convolutional layer of the present invention;
FIG. 5 is a schematic diagram of the neural network model based on a residual structure in the present invention;
FIG. 6 is a schematic diagram of the residual structure in the dashed box of FIG. 5 carrying the ×2 mark;
FIG. 7 is a schematic diagram of the structure of a Convolution Set according to the present invention;
FIG. 8 is a schematic diagram of a target image according to the present invention;
FIG. 9 is a schematic diagram of an image to be identified in the present invention that is not processed by the network model of the present invention;
FIG. 10 is a schematic diagram of a target image of the image to be identified of FIG. 9 output via the network model of the present invention;
FIG. 11 is a schematic illustration of a stitched image according to the present invention;
fig. 12 is a schematic diagram of a target image output by the network model of the present invention when the human motion is recognized.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
This embodiment discloses a method for identifying an image target area. The image of this embodiment is described taking an MRI image as an example, and the identified target area taking a brain tumor in the MRI image as an example. The Dicom file comprises MRI image weight information, and the target keyword is T2.
the method comprises the following steps:
Step one: traverse all the Dicom files, use the open source tool PyDicom to read the sequence information of each Dicom file, and judge whether it contains the keyword T2; if so, the Dicom file is a target file, and the PyDicom third-party library is used to read the matrix information in the Dicom file and save it as an image in JPG format.
The pictures shot by a hospital's magnetic resonance apparatus are in Dicom format. The Dicom format (Digital Imaging and Communications in Medicine) is the international standard for medical images and related information (ISO 12052). It defines a medical image format whose quality meets clinical needs and that can be used for data exchange. Besides basic patient information such as name and age, a Dicom file contains the patient's MRI image information as well as basic information about the image, such as the weighting of the MRI image.
Since a Dicom file cannot be fed to the model directly as an image file, format conversion with a format conversion module is required.
As shown in FIG. 1, an MRI image consists of multiple slices, and the image of each slice is divided into four different weightings: T1, FLAIR, T1CE (i.e. T1c in FIG. 1) and T2.
The MRI image weight information of each slice is stored in its Dicom file. The brain tumor signal under the T2 weighting is a high signal, which makes the tumor position easy to distinguish; to reduce the system workload and save working time, only the T2-weighted Dicom image of each slice needs to be found and converted into a JPG image file, yielding the image to be identified. Concretely, the weight information in the Dicom file is read and judged to be the T2 weighting or not; if yes, the file is kept and converted, and if not, it is skipped.
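As an illustration, step one might be sketched in Python as follows. This is a minimal sketch, not the patent's own code: the assumption that the weighting keyword appears in the Dicom SeriesDescription field, and the 8-bit rescaling before saving as JPG, are both illustrative choices.

```python
import os

import numpy as np
import pydicom
from PIL import Image

def convert_t2_slices(dicom_dir: str, out_dir: str) -> None:
    """Save every T2-weighted Dicom slice in dicom_dir as a JPG in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(dicom_dir)):
        if not name.lower().endswith(".dcm"):
            continue
        ds = pydicom.dcmread(os.path.join(dicom_dir, name))
        # Keep only files whose sequence information contains the keyword T2.
        series = str(getattr(ds, "SeriesDescription", ""))
        if "T2" not in series.upper():
            continue  # not a target file; move on to the next Dicom file
        # Read the matrix information and rescale it to 8 bits for JPG storage.
        matrix = ds.pixel_array.astype(np.float32)
        matrix = (matrix - matrix.min()) / max(float(matrix.max() - matrix.min()), 1e-6)
        Image.fromarray((matrix * 255).astype(np.uint8)).save(
            os.path.join(out_dir, name.rsplit(".", 1)[0] + ".jpg"))
```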
Step two: input the image to be identified into the neural network model based on the residual structure for processing, and output it through different detection heads, respectively obtaining target images with different receptive fields.
The neural network model based on the residual structure of this embodiment includes a shallow backbone network structure, a middle backbone network structure, a deep backbone network structure, an SPP layer, a first Detection Head (Detection Head 1), a second Detection Head (Detection Head 2) and a third Detection Head (Detection Head 3).
As shown in FIG. 2, each backbone network structure includes a residual structure whose basic unit consists of a 1×1 Conv layer followed by a 3×3 Conv layer (Conv layer, i.e. convolutional layer); each residual structure contains at least one basic unit. The ×1 in FIG. 2 represents one basic unit.
For deep learning algorithms, networks with more layers usually achieve better results, because a deeper network can extract deeper details from the image. However, a convolutional neural network with a plain linear structure is prone to network degradation, that is, the weights of individual layers become 0 because they are never activated.
With the residual structure, the Feature Map, i.e. the feature matrix of the image information, undergoes two Conv layer operations, and the result is then superposed once with the original feature matrix. A conventional residual structure uses two 3×3 Conv layers as its basic unit, whereas the residual structure of the invention uses a Conv layer with a 1×1 kernel followed by a Conv layer with a 3×3 kernel as its basic unit. Each Conv layer comprises a 2-dimensional convolution (Conv2D) layer, i.e. the 2D Convolutional Layer and Batch Normalization Layer in FIG. 4, and is processed through the ReLU activation function of a ReLU layer; the 1×1 Conv layer reduces the dimensions and the 3×3 Conv layer then restores them. Building a deeper network model in this way greatly improves its performance, extracts deeper detailed features of the data, and makes the judgment of the target category and the positioning of the target more accurate. The network model of the invention reduces the parameter count by approximately 80% or more.
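For illustration, a minimal PyTorch sketch of the basic unit just described: a 1×1 Conv layer reduces the dimensions, a 3×3 Conv layer restores them, each followed by Batch Normalization and ReLU, and the result is superposed with the original feature matrix. The channel counts are free parameters, not values fixed by the text.

```python
import torch
import torch.nn as nn

class BasicUnit(nn.Module):
    """1x1 Conv (reduce) -> 3x3 Conv (restore) -> residual addition."""

    def __init__(self, channels: int, bottleneck: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.conv2 = nn.Conv2d(bottleneck, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))    # 1x1 Conv layer: reduce dimensions
        out = self.relu(self.bn2(self.conv2(out)))  # 3x3 Conv layer: restore dimensions
        return x + out  # superpose with the original feature matrix
```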
Further, in each residual structure, the number of channels of the 1×1 Conv layer is smaller than the number of channels of the 3×3 Conv layer.
Performing the dimension-reduction operation with the 1×1 convolution and then the dimension-increase operation with the 3×3 convolution reduces the number of parameters. If the convolution kernel of the upper layer is 3×3 with C1 channels and the current layer has C2 channels, the computation is 3×3×C1×C2. If a 1×1 convolution first reduces the dimension to C3 channels and a 3×3 convolution then raises it back to C2 channels, the computation is 1×1×C1×C3 + 3×3×C3×C2 = C3×(C1 + 9×C2).
Thus, if the number of channels of the previous layer is C1 = 192 and the number of channels of the current layer is C2 = 128, the direct computation is 3×3×192×128 = 221184; if a 1×1 convolution is first used to reduce the dimension to C3 = 96 channels before going back up to 128, the computation is 96×(192 + 9×128) = 129024. The computation is thus nearly halved, and because a Conv layer is added, the nonlinear fitting capability of the network model of the invention is further increased. Therefore the reduce-then-restore operation with the 1×1 convolution not only compresses the size of the network model of the invention and improves the inference speed, but also improves its performance.
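The arithmetic above can be checked directly:

```python
# Computation counts for a 192-channel input feeding a 128-channel layer,
# with and without the 1x1 bottleneck (C3 = 96).
C1, C2, C3 = 192, 128, 96
direct = 3 * 3 * C1 * C2                         # plain 3x3 convolution
bottleneck = 1 * 1 * C1 * C3 + 3 * 3 * C3 * C2   # 1x1 reduce, then 3x3 restore
print(direct, bottleneck)                        # 221184 129024
```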
In the invention, when the image to be identified is processed by the shallow backbone network structure and then input to the first detection head, a target image with a first receptive field is output; when it is processed by the shallow and middle backbone network structures in sequence and then input to the second detection head, a target image with a second receptive field is output; when it is processed by the shallow, middle and deep backbone network structures and the SPP layer in sequence and then input to the third detection head, a target image with a third receptive field is output.
As shown in FIG. 3, the SPP layer includes a 5×5 Max Pooling layer, a 9×9 Max Pooling layer and a 13×13 Max Pooling layer in parallel. After being processed by the deep backbone network structure, the image to be identified is input into the 5×5, 9×9 and 13×13 Max Pooling layers respectively for compression, and the results are then fused and output. Fusion here means that the feature matrices output by the three Max Pooling layers are spliced in the Channel dimension: if the feature matrix entering the SPP structure is 16×16×512, the final spliced size is 16×16×1536.
To further increase the nonlinear fitting capability of the network model and improve its robustness to data with severe spatial distortion, the invention adds an SPP layer, i.e. a Spatial Pyramid Pooling layer, at the end of the backbone network. Because the invention is oriented to MRI images, in which spatial distortion easily appears once the shape changes, the SPP structure is needed to optimize the network model algorithm.
The SPP structure used in the invention, with 5×5, 9×9 and 13×13 Max Pooling layers, improves the final accuracy by 3.8%, and the experimental results are more robust than with a conventional SPP structure using 6×6, 3×3 and 2×2 Max Pooling layers, which improves the final result by only 2.1%.
As shown in FIG. 5, the shallow backbone network structure of the invention comprises, in the order in which the image is processed, two 3×3 Conv layers, a first residual structure, a 3×3 Conv layer, a second residual structure, a 3×3 Conv layer and a third residual structure; the first residual structure comprises one basic unit; the second residual structure comprises two sequentially cascaded basic units; the third residual structure comprises eight sequentially cascaded basic units.
The middle backbone network structure comprises, in processing order, a 3×3 Conv layer and a fourth residual structure; the fourth residual structure comprises eight sequentially cascaded basic units.
The deep backbone network structure comprises, in processing order, a 3×3 Conv layer and a fifth residual structure; the fifth residual structure comprises four sequentially cascaded basic units.
When a residual structure comprises several sequentially cascaded basic units, two cascaded basic units are formed by stacking the 1×1 Conv layer plus 3×3 Conv layer basic unit through one cycle; four cascaded basic units through three cycles; and eight cascaded basic units through seven cycles. The numbers inside the dashed boxes in FIG. 5 indicate how many residual structures of the same structure are stacked. As the number of layers of the network model increases, the number of channels per layer increases and more identical residual structures are stacked, so that the network model can learn the detailed features of the image.
×1 in FIG. 5 indicates that the residual structure has only one basic unit, ×4 that it has four, and ×8 that it has eight; convolutions in FIG. 5 without a marked stride have stride 1; Residual in a dashed box in FIG. 5 indicates that the structure inside the box is a residual structure.
FIG. 6 shows the specific structure of the residual structure in the dashed box of FIG. 5 carrying the ×2 mark, i.e. a residual structure comprising two sequentially cascaded basic units; the other residual structures in FIG. 5 that contain several cascaded basic units follow the specific structure disclosed in FIG. 6.
Further, after spatial distortion is corrected by the SPP, the output ends of the neural network model based on the residual structure (network model for short) of the invention follow. The network model has three output ends of different sizes serving as detection heads, namely the first detection head, the second detection head and the third detection head.
Furthermore, before each detection head there is a Convolution Set (i.e. a convolution output end) formed by stacking and combining 1×1 Conv layers and 3×3 Conv layers, which performs dimension reduction and improves the nonlinear fitting capability of the network model. The Convolution Set matching the last backbone network structure to process the image (i.e. the deep backbone network structure) sits between the SPP layer and the corresponding detection head (the third detection head), i.e. a Convolution Set is arranged between the SPP layer and the third detection head.
The structure of the Convolution Set is shown in FIG. 7: 1×1 Conv layers and 3×3 Conv layers alternate twice and end with a 1×1 Conv layer.
As shown in FIG. 5, a 3×3 Conv layer and a 1×1 Conv layer follow the Convolution Set in order, and the detection head is the 1×1 Conv layer after the Convolution Set.
As shown in FIG. 5, the three detection heads output from the shallower layer 26, the middle layer 43 and the deeper layer 52 respectively, giving receptive fields of different sizes: the shallow network, with its smaller receptive field, is suited to predicting smaller targets, while the deep network, with its larger receptive field, is suited to predicting larger targets. This solves the pain point that traditional target detection algorithms are very insensitive to the detection of small targets. By establishing detection heads of different dimensions, the invention greatly improves the sensitivity of the network model to targets of different sizes.
By contrast, if the network model used only one detection head to output target position information, information about small targets (targets occupying a relatively small area of the image) could go undetected. With multiple detection heads, the shallower detection heads have smaller receptive fields and are therefore more sensitive to small targets, detecting them effectively, while the deeper detection heads have larger receptive fields and are more sensitive to large targets (targets occupying a larger area of the image), detecting those effectively.
As shown in FIG. 8, the output content of a detection head includes the position of the box and the category information of the target.
The location information of the box consists of the coordinates (x, y) of its center point and its width and height (w, h), all expressed relative to the size of the original image (i.e. the image to be identified). If the original image size is W×H, the following transformation yields the actual coordinate parameters of the box: X_real = x×W, Y_real = y×H, W_real = w×W, H_real = h×H, where (X_real, Y_real) are the actual center point coordinates of the box, W_real is its actual width, H_real is its actual height, and W and H denote the width and height of the original image.
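As a one-line helper, the transformation reads:

```python
def to_absolute(x: float, y: float, w: float, h: float, W: int, H: int):
    """Map relative box values to the original image: returns (X_real, Y_real, W_real, H_real)."""
    return x * W, y * H, w * W, h * H
```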
If the input image to be recognized is 412×412×3, the output matrix sizes of the three detection heads are 52×52×6, 26×26×6 and 13×13×6 respectively. The 6 in the third dimension decomposes into 4+1+1, i.e. the output content decomposes as: 4 numbers representing the box information, namely its coordinates (x, y) and its width w and height h; 1 number representing the category; and 1 number representing the confidence. After traversing the information of all the boxes, those with confidence higher than 0.5 are drawn on the image and their categories output at the same time.
The result is shown in FIG. 8: the tumor location is marked with a box, and Brain Tumor is labeled above the box.
Furthermore, the invention also performs data preprocessing before inputting the image to be identified into the neural network model of the residual structure.
Preprocessing of the data includes unifying the data size. Dicom data is 512×512; the invention uniformly converts it to 412×412, reducing the input size of the image and the parameter count of the network model.
Further, so that the network model of the invention converges better, the invention normalizes the pixel values of all images, i.e. divides all pixel values by 127.5 and then subtracts 1.
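Both preprocessing steps can be sketched together; the use of PIL and the RGB conversion (giving the 412×412×3 input mentioned above) are illustrative assumptions.

```python
import numpy as np
from PIL import Image

def preprocess(path: str) -> np.ndarray:
    """Resize to 412x412 and normalize pixel values to [-1, 1]."""
    img = Image.open(path).convert("RGB").resize((412, 412))
    return np.asarray(img, dtype=np.float32) / 127.5 - 1.0
```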
In summary, unlike traditional medical image processing algorithms, whose output item is a segmentation mask of the tumor region, the output items of the network model are the position and category of the tumor; greatly reducing the amount of information in the output items improves the computation speed of the model.
P and N in Table 1 represent Positive Samples and Negative Samples respectively, interpreted as follows: in 300 test pictures, doctors manually marked 401 tumors; the network model of the invention successfully found 396 brain tumors, produced 45 false detections, and failed to identify 5 tumors, so it detected 441 tumors in total.
TABLE 1
TABLE 2

Recognition result    Correct identifications    Missed identifications    False identifications
Quantity              396                        5                         45
As can be seen from Table 2: TP = 396, FN = 5, FP = 45, TN = 0. Thus, when the threshold is set to 0.4, the precision of the network model of the invention is Precision = TP/(TP+FP) = 89.80%, and its recall is Recall = TP/(TP+FN) = 98.75%.
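These figures follow directly from Table 2:

```python
TP, FN, FP = 396, 5, 45
precision = TP / (TP + FP)  # 396/441 = 89.80%
recall = TP / (TP + FN)     # 396/401 = 98.75%
print(f"precision={precision:.2%}, recall={recall:.2%}")
```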
To better highlight the superiority of the network model of the invention, the same data set was also trained with the classical two-stage algorithm Faster-RCNN and the single-stage algorithm Yolo and then tested with the same test data. The test results are shown in Table 3: the network model of the invention performs better than Faster-RCNN and Yolo in both precision and recall.
TABLE 3
The screening of brain tumors requires the recall to be as high as possible, i.e. as many brain tumors as possible detected; the recall of the network model reaches 98.75%, so it can find as many brain tumor patients as possible. Recall and precision are often a contradictory pair of values, but the network model of the invention still maintains nearly 90% precision at this high recall. The network model of the invention can therefore be applied well to the recognition of brain tumors.
The computation cost and user friendliness of the network model are judged by testing its detection speed.
The computing platform used was an NVIDIA RTX 2070; the average computing time of the network model of the invention on the test dataset was 20.81 ms, i.e. 48.05 fps. The inference speed of the network model was also compared with Faster-RCNN and Yolo, as shown in Table 4.
TABLE 4

Algorithm name         Faster-RCNN    Yolo     Network model of the invention
Inference time (ms)    164.16         27.83    20.81
As is known, an excessive model parameter count slows the model's inference to some extent, which is very unfriendly to doctors and patients. The network model of the invention not only maintains fast inference while ensuring high precision and recall; its inference speed also exceeds the currently mainstream single-stage and two-stage algorithms, completing the identification of one image in about 20 ms and effectively saving doctors' waiting time.
Further, the neural network model based on the residual structure is also trained before processing the image to be identified, comprising the following steps: input the images of the training set into the network model and train it; training is complete when the parameters of the network model make the loss function converge.
Further, the optimizer used for training the network model is the Adam optimizer, and the loss function used is the Focal Loss function.
When the network model is trained, forward propagation is performed first: the data is substituted into the network model and computed to obtain a result, and the Loss function computes the Loss value. After the Loss computation is complete, back propagation is performed to update and optimize the parameters of each convolution layer. The Adam optimizer provides a parameter optimization algorithm that optimizes faster than conventional optimization algorithms such as the stochastic gradient descent algorithm.
Focal Loss effectively relieves the imbalance of positive and negative samples in the training data. In practical datasets, a picture often contains only a few targets to be detected (positive samples), while most of it is irrelevant background (negative samples), so the positive samples are far fewer than the negative samples. With the Focal Loss function, a negative sample (y = 0) that is already classified with high confidence has a very small modulating factor, so the loss contributed by easy negative samples is greatly reduced and the network model of the invention concentrates on optimizing the loss of the positive samples.
Expression of the Focal Loss function:

FL(p) = -α·(1-p)^γ·log(p),   if y = 1
FL(p) = -(1-α)·p^γ·log(1-p), if y = 0    (1)

In expression (1), p represents a confidence level, α = 0.25 and γ = 2; y = 0 denotes a negative sample and y = 1 a positive sample. Positive samples here refer to correctly predicted samples, and negative samples to incorrectly predicted samples.
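For illustration, a minimal PyTorch sketch of a binary Focal Loss consistent with expression (1) and the description above; the function name and the convention that p is the predicted positive-class confidence are assumptions.

```python
import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary Focal Loss; p is the predicted confidence, y the 0/1 label."""
    p = p.clamp(1e-6, 1 - 1e-6)  # numerical safety for the logarithms
    pos = -alpha * (1 - p) ** gamma * torch.log(p)      # y = 1: positive samples
    neg = -(1 - alpha) * p ** gamma * torch.log(1 - p)  # y = 0: negative samples
    return torch.where(y == 1, pos, neg).mean()
```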
Further, the images in the training set also undergo data preprocessing before training.
Further, the images in the training set undergo data enhancement processing, which takes place after the data preprocessing.
The data enhancement processing used in the invention includes at least one of random contrast adjustment, random-angle rotation and the Mosaic enhancement method.
The Mosaic enhancement method randomly adjusts the contrast of several images and then splices them into a single image for training; the effect is shown in FIG. 11. Splicing several pictures into one picture can be understood as forcibly adding the learning of small targets to the network model of the invention. Existing single-stage algorithms on the market find small targets very hard to identify precisely because the images are not processed with a Mosaic enhancement method; the method further improves the accuracy of small-target recognition.
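A minimal sketch of such a Mosaic splice, assuming four source images; the remapping of box annotations into each quadrant, which a full implementation also needs, is omitted for brevity.

```python
import random

from PIL import Image, ImageEnhance

def mosaic(paths: list, size: int = 412) -> Image.Image:
    """Randomly adjust the contrast of four images and splice them into one 2x2 image."""
    assert len(paths) == 4
    canvas = Image.new("RGB", (size, size))
    half = size // 2
    for path, (ox, oy) in zip(paths, [(0, 0), (half, 0), (0, half), (half, half)]):
        img = Image.open(path).convert("RGB").resize((half, half))
        img = ImageEnhance.Contrast(img).enhance(random.uniform(0.5, 1.5))
        canvas.paste(img, (ox, oy))
    return canvas
```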
Of course, other prior-art optimizers and loss functions can also be used in the invention; a back-propagation algorithm and a gradient descent algorithm can likewise make the loss value of the loss function converge and thereby complete the training of the network model of the invention.
Of course, the image of the invention may also be another medical image or a non-medical image, and the identified target area is not limited to a tumor: for example, the identification of other cells or tissue in the image, the identification of tissue in an ex vivo tissue image, or, extending to non-medical images, the recognition and detection of faces, masks and fire hazards in the image.
Example 2
This embodiment discloses a neural network model based on a residual structure, comprising a shallow backbone network structure, a middle backbone network structure, a deep backbone network structure, an SPP layer, a first detection head, a second detection head and a third detection head;
each backbone network structure comprises a residual structure, where the residual structure is built from basic units consisting of a 1×1 Conv layer followed by a 3×3 Conv layer, and each residual structure contains at least one basic unit. When the image to be identified is processed by the shallow backbone network structure and then input to the first detection head, a target image with a first receptive field is output; when it is processed by the shallow and middle backbone network structures in sequence and then input to the second detection head, a target image with a second receptive field is output; when it is processed by the shallow, middle and deep backbone network structures and the SPP layer in sequence and then input to the third detection head, a target image with a third receptive field is output; the first receptive field, the second receptive field and the third receptive field are all different.
The neural network model based on the residual structure is deployed on a Jetson Nano through TensorRT. For more convenient operation, the invention connects the Jetson Nano to the magnetic resonance apparatus by USB; after data acquisition is complete, the apparatus transmits the data into the Jetson Nano through a USB data line. The Jetson Nano can be fitted with an external touch display screen through which an operator controls the system. The specific operation is as follows: browse to the data storage path; the system automatically converts all data into picture format and identifies all pictures; after identification, the picture display area shows the pictures marked with the target area. When the target is a tumor, the upper left corner of the image shows the patient's name, and the detailed information display area of the system shows the size and number of tumors.
TensorRT, as used in the invention, is a CUDA-based neural network model inference acceleration tool. TensorRT restructures the neural network model, merges all computations that can be merged, and makes them support GPU acceleration. This greatly accelerates the inference speed of the neural network model and greatly reduces the video memory the neural network occupies during computation, saving computing resources to a great extent. In actual use, real-time performance is required alongside high accuracy, so the TensorRT acceleration tool is used to speed up the inference of the network model on the Jetson Nano.
The Jetson Nano is a computer for embedded designers, researchers and DIY makers that delivers the power of modern AI in a compact, easy-to-use platform, with a small size, high performance and complete functions. It provides 472 GFLOPS of computing performance with a quad-core 64-bit ARM CPU and a 128-core integrated NVIDIA GPU, includes 4 GB of LPDDR4 memory in an efficient low-power package, and offers 5 W/10 W power modes with a 5 V DC input. The Jetson Nano provides a complete desktop Linux environment, supports NVIDIA CUDA Toolkit 10.0 together with libraries such as cuDNN 7.3 and TensorRT, and supports accelerated graphics operations. The development suite also includes popular open source machine learning frameworks for native installation, such as TensorFlow, PyTorch, Caffe and Keras, and frameworks for computer vision and robotics development such as OpenCV and ROS. The Jetson Nano thus provides a complete and efficient solution that helps artificial intelligence developers build complex AI applications in a short time.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A method for identifying a target area of a medical image, characterized in that it comprises the following steps:
step one, reading information of a current Dicom file, and if the information comprises a target keyword, processing the Dicom file to obtain an image to be identified;
step two, inputting the image to be identified into a neural network model based on a residual structure and outputting it through different detection heads, respectively obtaining target images with different receptive fields;
the neural network model based on the residual structure comprises backbone network structures, an SPP layer and detection heads;
each backbone network structure comprises a residual structure, where the residual structure is built from basic units consisting of a 1×1 Conv layer followed by a 3×3 Conv layer; the backbone network structures are cascaded in the order in which they process the image; the output end of each backbone network structure is matched with the input end of a corresponding detection head; the SPP layer is arranged between the last backbone network structure to process the image and its corresponding detection head;
the backbone network structures comprise a shallow backbone network structure, a middle backbone network structure and a deep backbone network structure; the detection heads comprise a first detection head, a second detection head and a third detection head; each residual structure contains at least one basic unit; when the image to be identified is processed by the shallow backbone network structure and then input to the first detection head, a target image with a first receptive field is output; when it is processed by the shallow and middle backbone network structures in sequence and then input to the second detection head, a target image with a second receptive field is output; when it is processed by the shallow, middle and deep backbone network structures and the SPP layer in sequence and then input to the third detection head, a target image with a third receptive field is output;
the shallow backbone network structure comprises, in image-processing order, a first residual structure, a second residual structure and a third residual structure; the first residual structure comprises one basic unit; the second residual structure comprises two sequentially cascaded basic units; the third residual structure comprises eight sequentially cascaded basic units; the middle backbone network structure comprises a fourth residual structure of eight sequentially cascaded basic units; and the deep backbone network structure comprises a fifth residual structure of four sequentially cascaded basic units.
2. The method of identifying a target area of a medical image according to claim 1, characterized in that: in step one, the information includes MRI image weight information, and the target keyword is T2;
all the Dicom files are traversed, the open source tool PyDicom is used to read the sequence information of each Dicom file and judge whether it contains the keyword T2; if so, the Dicom file is determined to be a target file, the matrix information in the Dicom file is read, and it is saved as an image in JPG format.
3. The method of identifying a target area of a medical image according to claim 1, characterized in that: the SPP layer comprises a 5×5 Max Pooling layer, a 9×9 Max Pooling layer and a 13×13 Max Pooling layer arranged in parallel; after being processed by the deep backbone network structure, the image to be identified is input into the 5×5, 9×9 and 13×13 Max Pooling layers respectively for compression, and the results are then fused and output.
4. The method of identifying a target area of a medical image according to claim 1, characterized in that: a Convolution Set is also arranged between each backbone network structure and its matching detection head; the Convolution Set matching the last backbone network structure to process the image sits between the SPP layer and the corresponding detection head; the Convolution Set consists of alternately stacked 1×1 Conv layers and 3×3 Conv layers, with a 1×1 Conv layer at the end.
5. The method of identifying a target area of a medical image according to claim 1, characterized in that: the training of the neural network model based on the residual structure comprises the following steps: inputting images of a training set into the neural network model based on the residual structure and training it; the training is complete when the parameters of the model make the loss function converge.
6. The method of identifying a target area of a medical image according to claim 5, characterized in that: the optimizer used in training the neural network model based on the residual structure is the Adam optimizer, and the loss function is the Focal Loss function;
expression of the Focal Loss function:

FL(p) = -α·(1-p)^γ·log(p),   if y = 1
FL(p) = -(1-α)·p^γ·log(1-p), if y = 0    (1)

wherein in expression (1), p represents a confidence level, α = 0.25 and γ = 2; y = 0 denotes a negative sample and y = 1 a positive sample.
CN202110680955.6A 2021-06-18 2021-06-18 Medical image target area identification method, neural network model and application Active CN113517056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110680955.6A CN113517056B (en) 2021-06-18 2021-06-18 Medical image target area identification method, neural network model and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110680955.6A CN113517056B (en) 2021-06-18 2021-06-18 Medical image target area identification method, neural network model and application

Publications (2)

Publication Number Publication Date
CN113517056A CN113517056A (en) 2021-10-19
CN113517056B true CN113517056B (en) 2023-09-19

Family

ID=78065969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110680955.6A Active CN113517056B (en) 2021-06-18 2021-06-18 Medical image target area identification method, neural network model and application

Country Status (1)

Country Link
CN (1) CN113517056B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017106645A1 (en) * 2015-12-18 2017-06-22 The Regents Of The University Of California Interpretation and quantification of emergency features on head computed tomography
WO2018015414A1 (en) * 2016-07-21 2018-01-25 Siemens Healthcare Gmbh Method and system for artificial intelligence based medical image segmentation
CN110969245A (en) * 2020-02-28 2020-04-07 北京深睿博联科技有限责任公司 Target detection model training method and device for medical image
CN111401202A (en) * 2020-03-11 2020-07-10 西南石油大学 Pedestrian mask wearing real-time detection method based on deep learning
CN112365438A (en) * 2020-09-03 2021-02-12 杭州电子科技大学 Automatic pelvis parameter measuring method based on target detection neural network
CN112085113A (en) * 2020-09-14 2020-12-15 四川大学华西医院 Severe tumor image recognition system and method
CN112308822A (en) * 2020-10-10 2021-02-02 杭州电子科技大学 Intervertebral disc CT image detection method based on deep convolutional neural network
CN112419332A (en) * 2020-11-16 2021-02-26 复旦大学 Skull stripping method and device for thick-layer MRI (magnetic resonance imaging) image
CN112560741A (en) * 2020-12-23 2021-03-26 中国石油大学(华东) Safety wearing detection method based on human body key points

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Improved inception-residual convolutional neural network for object recognition; Alom Md Zahangir; Neural Computing & Applications; full text *
Lightweight object detection network based on convolutional neural networks; Cheng Yequn; Laser & Optoelectronics Progress; Vol. 58, No. 16; full text *
Residual neural networks and their application in medical image processing; Zhou Tao, Huo Bingqiang, Lu Huiling, Ren Hailing; Acta Electronica Sinica, No. 07; full text *

Also Published As

Publication number Publication date
CN113517056A (en) 2021-10-19

Similar Documents

Publication Publication Date Title
US20220051405A1 (en) Image processing method and apparatus, server, medical image processing device and storage medium
CN111524106B (en) Skull fracture detection and model training method, device, equipment and storage medium
CN110060774B (en) Thyroid nodule identification method based on generative confrontation network
CN111160269A (en) Face key point detection method and device
CN112233117A (en) New coronary pneumonia CT detects discernment positioning system and computing equipment
CN106777953A (en) The analysis method and system of medical image data
WO2022001237A1 (en) Method and system for automatically recognizing image of primary tumor of nasopharyngeal carcinoma
Li et al. Automated measurement network for accurate segmentation and parameter modification in fetal head ultrasound images
US11704808B1 (en) Segmentation method for tumor regions in pathological images of clear cell renal cell carcinoma based on deep learning
CN111507965A (en) Novel coronavirus pneumonia focus detection method, system, device and storage medium
CN109886929B (en) MRI tumor voxel detection method based on convolutional neural network
CN112329871B (en) Pulmonary nodule detection method based on self-correction convolution and channel attention mechanism
WO2023151237A1 (en) Face pose estimation method and apparatus, electronic device, and storage medium
CN110570419A (en) Method and device for acquiring characteristic information and storage medium
CN109215035B (en) Brain MRI hippocampus three-dimensional segmentation method based on deep learning
CN110135304A (en) Human body method for recognizing position and attitude and device
CN112420170B (en) Method for improving image classification accuracy of computer aided diagnosis system
CN113517056B (en) Medical image target area identification method, neural network model and application
CN111553250A (en) Accurate facial paralysis degree evaluation method and device based on face characteristic points
CN111598144B (en) Training method and device for image recognition model
CN110570417B (en) Pulmonary nodule classification device and image processing equipment
CN113222989A (en) Image grading method and device, storage medium and electronic equipment
WO2022227193A1 (en) Liver region segmentation method and apparatus, and electronic device and storage medium
CN117437493B (en) Brain tumor MRI image classification method and system combining first-order and second-order features
CN113177953B (en) Liver region segmentation method, liver region segmentation device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Liang Zhen; Shan Chunjie; Zhao Weijia

Inventor before: Shan Chunjie; Zhao Weijia; Liang Zhen

GR01 Patent grant