CN117726798A - Training method of image target detection model - Google Patents


Info

Publication number
CN117726798A
Authority
CN
China
Prior art keywords
feature
fusion
layer
feature map
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311744270.9A
Other languages
Chinese (zh)
Inventor
乔建森
张文杰
郑维红
徐圆飞
李保磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hangxing Machinery Manufacturing Co Ltd
Original Assignee
Beijing Hangxing Machinery Manufacturing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Hangxing Machinery Manufacturing Co Ltd filed Critical Beijing Hangxing Machinery Manufacturing Co Ltd

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a training method of an image target detection model, belongs to the technical field of CT reconstruction image detection, and solves the problem of inaccurate identification of CT reconstruction images in the prior art. The training method comprises the following steps: screening a first preset number of CT reconstructed image samples; performing projection of a first preset view angle on each CT reconstructed image sample to obtain a plurality of two-dimensional projection images corresponding to the first preset view angle; storing each two-dimensional projection image, the corresponding two-dimensional bounding box and the class label as target training samples into a target training sample library until a second preset number of target training samples are obtained; dividing target training samples in a target training sample library into a training set and a testing set, and training an image target detection model until training iteration conditions are met, so as to obtain a trained image target detection model; the image target detection model is a deep neural network model based on an attention mechanism. Accurate identification of CT reconstructed images is achieved.

Description

Training method of image target detection model
Technical Field
The invention relates to the technical field of CT reconstruction image detection, in particular to a training method of an image target detection model.
Background
The CT security inspection device plays a vital role in security protection systems, especially at traffic nodes with dense pedestrian flows such as airports, railway stations and subway stations. It is an important security tool for ensuring the personal and property safety of passengers and the stability of society and the state, and is often used to identify targets such as contraband.
In the prior art, a deep neural network model is generally adopted to identify visible light images. When such a deep neural network model is directly transferred to the identification of CT reconstructed images, the identification effect is often unsatisfactory and the identification precision is poor. In addition, the training speed of the model is low.
Therefore, a new technical solution for training a model for target detection of CT reconstructed images is needed.
Disclosure of Invention
In view of the above analysis, the present embodiment of the present invention is directed to providing a training method for an image target detection model, so as to solve the problem in the prior art that CT reconstructed image recognition is inaccurate.
The embodiment of the invention provides a training method of an image target detection model, which comprises the following steps:
acquiring a large number of CT reconstructed images from a historical CT reconstructed image database of the CT security inspection equipment, and screening a first preset number of CT reconstructed image samples from the large number of CT reconstructed images; performing projection of a first preset view angle on each CT reconstructed image sample to obtain a plurality of two-dimensional projection images corresponding to the first preset view angle;
Determining a plurality of targets included in each two-dimensional projection image, setting a two-dimensional bounding box of each target according to a preset coordinate system, setting a class label of each target at the same time, and storing each two-dimensional projection image, the corresponding two-dimensional bounding box and the class label as target training samples into a target training sample library until a second preset number of target training samples are obtained;
dividing target training samples in a target training sample library into a training set and a testing set, and training an image target detection model until training iteration conditions are met, so as to obtain a trained image target detection model; the image target detection model is a deep neural network model based on an attention mechanism.
Based on a further improvement of the training method, the CT reconstructed image is projected at the first preset viewing angle through any one of the following projection methods:
orthogonal projection;
perspective projection;
oblique projection;
flight projection;
nonlinear projection.
Based on a further improvement of the training method, the first preset viewing angle includes three directions of XY, XZ and YZ.
Based on the further improvement of the training method, the image target detection model comprises a feature extraction layer, a feature fusion layer and a prediction layer;
The feature extraction layer is used for receiving each two-dimensional projection image, extracting feature images with different scales of each two-dimensional projection image based on an attention mechanism, and inputting the extracted feature images with different scales into the feature fusion layer;
the feature fusion layer is used for fusing the feature images with different scales and inputting the fused feature images to the prediction layer;
and the prediction layer is used for predicting each two-dimensional projection image according to the fused feature images to obtain a two-dimensional bounding box of the object to be identified and a corresponding category confidence coefficient.
Based on further improvement of the training method, the feature extraction layer comprises a first convolution module, a first feature extraction layer, a second feature extraction layer, a third feature extraction layer and a fourth feature extraction layer which are sequentially connected;
the first convolution module is used for receiving each two-dimensional projection image, carrying out convolution operation on each two-dimensional projection image once, and inputting the obtained initial feature image into the first feature extraction layer;
the first feature extraction layer is used for carrying out one-time convolution operation and residual convolution operation on the initial feature map to obtain a first feature map, and inputting the first feature map into the second feature extraction layer and the feature fusion layer;
The second feature extraction layer is used for carrying out one-time convolution operation and residual convolution operation on the first feature image to obtain a second feature image, and inputting the second feature image into the third feature extraction layer and the feature fusion layer;
the third feature extraction layer is used for carrying out one-time convolution operation and residual convolution operation on the second feature image to obtain a third feature image, and inputting the third feature image into the fourth feature extraction layer and the feature fusion layer;
and the fourth feature extraction layer is used for carrying out one-time convolution operation, residual convolution operation, space channel attention operation and fusion operation on the third feature map to obtain a fourth feature map, and inputting the fourth feature map into the feature fusion layer.
Based on further improvement of the training method, the first feature extraction layer, the second feature extraction layer and the third feature extraction layer all comprise a second convolution module and a first residual convolution module which are sequentially connected; the fourth feature extraction layer comprises a second convolution module, a first residual convolution module, a spatial channel attention module and a fusion module which are sequentially connected;
the second convolution module is used for carrying out one-time convolution operation on the input feature map and inputting the feature map after the convolution operation to the first residual convolution module;
The first residual convolution module is used for carrying out residual convolution operation on the input feature images and outputting the feature images after the convolution operation;
the spatial channel attention module is used for performing spatial channel attention operation on the input feature images and inputting the feature images after the spatial channel attention operation to the fusion module;
and the fusion module is used for fusing the feature images after the attention operation of the space channel to obtain a fourth feature image, and inputting the fourth feature image into the feature fusion layer.
Based on the further improvement of the training method, the spatial channel attention module comprises a channel attention module and a spatial attention module which are connected in sequence;
the channel attention module is used for obtaining a first attention feature map based on the input feature map, determining the weight of each channel according to the first attention feature map, and obtaining a second attention feature map based on the weight of each channel and the first attention feature map;
the space attention module receives the second attention feature map, determines the weight of each space according to the second attention feature map, obtains a third attention feature map based on the weight of each space and the second attention feature map, and inputs the third attention feature map to the fusion module as a feature map after the attention operation of the space channel.
Based on the further improvement of the training method, the characteristic fusion layer comprises a deep-to-shallow fusion layer and a shallow-to-deep fusion layer; the deep-to-shallow fusion layer comprises a first sampling fusion layer, a second sampling fusion layer and a third sampling fusion layer which are sequentially connected; the shallow-to-deep fusion layer comprises a first convolution fusion layer, a second convolution fusion layer and a third convolution fusion layer which are sequentially connected;
the first sampling fusion layer is used for receiving and fusing the fourth feature map and the third feature map to obtain a first fusion feature map; the second sampling fusion layer is used for receiving the first fusion feature map and the second fusion feature map and fusing the first fusion feature map and the second fusion feature map to obtain a second fusion feature map; the third sampling fusion layer is used for receiving the second fusion feature map and the first feature map and fusing the second fusion feature map and the first feature map to obtain a third fusion feature map;
the first convolution fusion layer is used for receiving the second fusion feature map and the third fusion feature map and fusing the second fusion feature map and the third fusion feature map to obtain a fourth fusion feature map; the second convolution fusion layer is used for receiving the fourth fusion feature map and the first fusion feature map and fusing the fourth fusion feature map and the first fusion feature map to obtain a fifth fusion feature map; the third convolution fusion layer is used for receiving the fifth fusion feature map and the fourth fusion feature map and fusing the fifth fusion feature map and the fourth fusion feature map to obtain a sixth fusion feature map;
And inputting the third fusion characteristic diagram, the fourth fusion characteristic diagram, the fifth fusion characteristic diagram and the sixth fusion characteristic diagram to the prediction layer.
Based on further improvement of the training method, the prediction layer comprises a first prediction layer, a second prediction layer, a third prediction layer and a fourth prediction layer, and is used for respectively receiving the third fusion feature map, the fourth fusion feature map, the fifth fusion feature map and the sixth fusion feature map, and obtaining a first prediction result, a second prediction result, a third prediction result and a fourth prediction result according to the respective fusion feature maps;
the prediction layer further comprises an average output layer, and the average output layer is used for receiving the first prediction result, the second prediction result, the third prediction result and the fourth prediction result and determining the prediction result of the image target detection model.
Based on a further improvement of the training method, the loss function of the two-dimensional bounding box predicted by the image target detection model is as follows:
wherein x_t, x_b, x_l and x_r respectively represent the two-dimensional coordinates of the predicted bounding box, their counterparts respectively represent the two-dimensional coordinates of the actual bounding box, I is the intersection area, and U is the union area;
the loss function of the class predicted by the image target detection model is as follows:
wherein L represents the loss function of the predicted class, N represents the total number of classes, i represents the i-th class, Y_i represents the label value of the i-th class, and P_i represents the predicted value of the i-th class.
Compared with the prior art, the invention has at least one of the following beneficial effects:
1. Two-dimensional projection images of the CT reconstructed image under different view angles are acquired, and a category label and a two-dimensional bounding box are set for each two-dimensional projection image, so that training samples of the image target detection model are quickly acquired and the training speed of the attention-mechanism-based image target detection model is improved;
2. The three directions XY, XZ and YZ are used as the first preset view angle for projection, so that training samples are quickly obtained and the recognition effect of the trained image target detection model on two-dimensional projection images is improved;
3. Four feature maps from shallow to deep are acquired through the feature extraction layer, deep-to-shallow fusion and shallow-to-deep fusion are combined in the feature fusion layer to obtain rich feature information, and finally the prediction layer judges whether the training iteration condition is met, so that training of the image target detection model is quickly completed.
In the invention, the technical schemes can be mutually combined to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
FIG. 1 is a flow chart of a training method of an image target detection model according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an image object detection model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of the first convolution module, the second convolution module, the third convolution module and the convolution module according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a first residual convolution module, a second residual convolution module, and a third residual convolution module according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a spatial channel attention module according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and together with the description serve to explain the principles of the invention, and are not intended to limit the scope of the invention.
In one embodiment of the present invention, a training method of an image target detection model is disclosed, as shown in fig. 1, where the training method includes:
Step S1: acquiring a large number of CT reconstructed images from a historical CT reconstructed image database of the CT security inspection equipment, and screening a first preset number of CT reconstructed image samples from the large number of CT reconstructed images; performing projection of a first preset view angle on each CT reconstructed image sample to obtain a plurality of two-dimensional projection images corresponding to the first preset view angle;
step S2, determining a plurality of targets included in each two-dimensional projection image, setting a two-dimensional bounding box of each target according to a preset coordinate system, setting a class label of each target at the same time, and storing each two-dimensional projection image, the corresponding two-dimensional bounding box and the class label as target training samples into a target training sample library until a second preset number of target training samples are obtained;
step S3: dividing target training samples in a target training sample library into a training set and a testing set, and training an image target detection model until training iteration conditions are met, so as to obtain a trained image target detection model; the image target detection model is a deep neural network model based on an attention mechanism.
Specifically, the working principle of CT is that a radiation source scans a layer of an object and a detector receives the radiation penetrating through the layer; the radiation is converted into visible light, which is converted into an electrical signal by photoelectric conversion, the electrical signal is converted into a digital signal by an analog-to-digital converter, and the digital signal is input into a computer for image processing. The selected layer is divided into a number of cuboids of equal volume, called voxels.
Specifically, the ray attenuation coefficient or absorption coefficient of each voxel is calculated from the information received by the detector and arranged into a digital matrix, which is stored on a magnetic disk or an optical disk; each value in the digital matrix is converted by a digital-to-analog converter into a small square, i.e. a pixel, with a gray scale ranging from black to white, and the pixels are arranged in a matrix to form the CT reconstructed image.
Specifically, the radiation absorption coefficient of each voxel includes a high-energy absorption coefficient and a low-energy absorption coefficient, and the atomic number and the electron density can be calculated from the high-energy absorption coefficient and the low-energy absorption coefficient.
Preferably, the CT reconstructed image is any one of the following:
a high-energy reconstructed image;
a low-energy reconstructed image;
an electron density map;
an atomic number map.
Specifically, in step S1, a CT reconstructed image is obtained by using a historical CT reconstructed image database of the CT security inspection apparatus, the first preset number is set in advance, and a sufficient number of CT reconstructed image samples are obtained, for example, the first preset number is 10000, and then 10000 CT reconstructed image samples need to be obtained.
Specifically, the first preset viewing angle needs to be set in advance; projection is carried out on the CT reconstructed image at the different viewing angles, so that two-dimensional projection images of the CT reconstructed image at the different first preset viewing angles, i.e. a plurality of two-dimensional projection images corresponding to the CT reconstructed image, are obtained.
Preferably, the first preset viewing angle includes three directions of XY, XZ, and YZ.
Specifically, the first preset visual angle is set to be in three directions of XY, XZ and YZ, and two-dimensional projection images of the CT reconstructed image in the three directions of XY, XZ and YZ are conveniently and quickly calculated after projection.
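As an illustrative aid only (not the patent's implementation; the function name, axis ordering and array shapes below are assumptions), a CT reconstructed volume could be projected orthogonally onto the XY, XZ and YZ planes as follows:

```python
import numpy as np

def orthogonal_projections(volume: np.ndarray) -> dict:
    """Project a 3D CT reconstructed volume (assumed axis order Z, Y, X) onto three planes.

    Summing the voxel values along one axis gives a parallel-ray (orthogonal)
    projection onto the remaining plane; a max-intensity projection
    (volume.max(axis=...)) would be an equally plausible choice.
    """
    return {
        "XY": volume.sum(axis=0),  # collapse Z -> two-dimensional projection in the XY plane
        "XZ": volume.sum(axis=1),  # collapse Y -> two-dimensional projection in the XZ plane
        "YZ": volume.sum(axis=2),  # collapse X -> two-dimensional projection in the YZ plane
    }

# usage: three two-dimensional projection images from one CT reconstructed image sample
ct_volume = np.random.rand(128, 256, 256).astype(np.float32)
projections = orthogonal_projections(ct_volume)
print({plane: image.shape for plane, image in projections.items()})
```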
Preferably, the projection of the first preset viewing angle is performed on the CT reconstructed image by any one of the following projection methods:
orthogonal projection;
perspective projection;
oblique projection;
flight projection;
nonlinear projection.
Specifically, the orthogonal projection is to project the object along a certain axis or plane, and the projection line is perpendicular to the object, so that the sizes of the target object in the CT reconstructed image on three coordinate axes can be reserved.
Specifically, the perspective projection is to project the CT reconstructed image according to the distance from the viewpoint to the object, and the projection line is not perpendicular to the surface of the target object, but shows perspective effects with different degrees of distance.
Specifically, the oblique projection is to project the CT reconstructed image along a certain axis, but the projection line is not perpendicular to the surface of the target object and is instead inclined at a certain angle, presenting a certain perspective effect.
Specifically, the flying projection is to project the CT reconstructed image along a certain trajectory.
Specifically, the nonlinear projection is to project the CT reconstructed image in a nonlinear manner.
Specifically, in step S1, for example, the first preset viewing angle includes S preset viewing angles, and through the projection of the first preset viewing angle, one CT reconstructed image may obtain S two-dimensional projection images.
It is worth noting that each two-dimensional projection image may include a different number of two-dimensional bounding boxes, i.e. the two-dimensional projections of the respective targets differ at different first preset viewing angles, and some targets may not be presented at all.
Specifically, after step S1, a plurality of two-dimensional projection images of the CT reconstructed image under the first preset viewing angle are obtained, and each two-dimensional projection image may include different numbers of objects to be identified, that is, each two-dimensional projection image may include different numbers of two-dimensional bounding boxes.
Specifically, in step S2, the two-dimensional projection images obtained in step S1 are processed in turn: the targets included in each two-dimensional projection image are determined, the two-dimensional bounding box and class label of each target are determined, and each two-dimensional projection image together with its corresponding two-dimensional bounding boxes and class labels is stored as one target training sample in the target training sample library. The second preset number is set in advance; when a sufficient number of target training samples have been obtained, training of the image target detection model is started.
Specifically, in step S3, the second preset number of target training samples are divided into two parts: one part is used as a training set for training the image target detection model, and the other part is used as a test set for testing the image target detection model. When the image target detection model meets the training iteration condition, a trained image target detection model is obtained, where the image target detection model is a deep neural network model based on an attention mechanism.
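Purely as a sketch of how the target training sample library of steps S2 and S3 could be organized (the field names, the 80/20 split ratio and the fixed random seed are assumptions, not taken from the invention):

```python
import random

def make_target_training_sample(projection_image, bounding_boxes, class_labels):
    """One target training sample: a two-dimensional projection image together with
    the two-dimensional bounding box and class label of every target it contains."""
    return {
        "image": projection_image,   # two-dimensional projection at one preset view angle
        "boxes": bounding_boxes,     # one two-dimensional bounding box per target
        "labels": class_labels,      # one class label per bounding box
    }

def split_sample_library(sample_library, train_ratio=0.8, seed=0):
    """Divide the target training sample library into a training set and a test set."""
    samples = list(sample_library)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]
```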
It should be noted that, the image object detection model provided by the embodiment of the invention is a neural network model based on an attention mechanism. The attention mechanism in deep learning can imitate human vision and cognitive systems, allows a neural network to concentrate on relevant parts when processing input data, and by introducing the attention mechanism, an image target detection model can automatically learn and selectively focus on important information in input, so that the performance and generalization capability of the image target detection model are improved, and the recognition of a two-dimensional projection image is improved.
Specifically, as shown in fig. 2, the image target detection model includes a feature extraction layer, a feature fusion layer and a prediction layer;
The feature extraction layer is used for receiving each two-dimensional projection image, extracting feature images with different scales of each two-dimensional projection image based on an attention mechanism, and inputting the extracted feature images with different scales into the feature fusion layer;
the feature fusion layer is used for fusing the feature images with different scales and inputting the fused feature images to the prediction layer;
and the prediction layer is used for predicting each two-dimensional projection image according to the fused feature images to obtain a two-dimensional bounding box of the object to be identified and a corresponding category confidence coefficient.
Specifically, two-dimensional projection images in a target training sample are input from a feature extraction layer, features of different scales of the two-dimensional projection images are extracted in the feature extraction layer based on an attention mechanism, feature diagrams of different scales from deep to shallow are obtained, and the feature diagrams are input to a feature fusion layer.
Specifically, in the feature fusion layer, the feature graphs with different dimensions are fused, and a plurality of fused feature graphs are obtained after fusion and are input to the prediction layer.
Specifically, in the prediction layer, the two-dimensional bounding box may be represented by the perpendicular distances from an origin to the four sides of the two-dimensional rectangular bounding box, and the category confidence corresponding to the two-dimensional bounding box includes the confidence of each preset category. For example, if the S-th two-dimensional projection image includes M two-dimensional bounding boxes, i.e. M objects to be identified, the category to which each two-dimensional bounding box belongs is represented by category confidences, and if there are P preset categories, the category confidence of each two-dimensional bounding box corresponds to the confidence of the P preset categories.
Specifically, as shown in fig. 2, the feature extraction layer includes a first convolution module, a first feature extraction layer, a second feature extraction layer, a third feature extraction layer, and a fourth feature extraction layer that are sequentially connected;
the first convolution module is used for receiving each two-dimensional projection image, carrying out convolution operation on each two-dimensional projection image once, and inputting the obtained initial feature image into the first feature extraction layer;
the first feature extraction layer is used for carrying out one-time convolution operation and residual convolution operation on the initial feature map to obtain a first feature map, and inputting the first feature map into the second feature extraction layer and the feature fusion layer;
the second feature extraction layer is used for carrying out one-time convolution operation and residual convolution operation on the first feature image to obtain a second feature image, and inputting the second feature image into the third feature extraction layer and the feature fusion layer;
the third feature extraction layer is used for carrying out one-time convolution operation and residual convolution operation on the second feature image to obtain a third feature image, and inputting the third feature image into the fourth feature extraction layer and the feature fusion layer;
and the fourth feature extraction layer is used for carrying out one-time convolution operation, residual convolution operation, space channel attention operation and fusion operation on the third feature map to obtain a fourth feature map, and inputting the fourth feature map into the feature fusion layer.
Specifically, the first convolution module receives the two-dimensional projection image and performs one convolution operation on it to obtain an initial feature map. Preferably, as shown in fig. 3, the first convolution module includes a convolution layer, a batch normalization layer and a SiLU activation function layer that are sequentially connected, and the two-dimensional projection image passes through the convolution layer, the batch normalization layer and the SiLU activation function layer in sequence to obtain the initial feature map. It can be understood that the convolution layer adopts two-dimensional convolution, and the size of the convolution kernel, the convolution step length, the number of convolution kernels and whether padding is used are set reasonably according to actual conditions; the batch normalization layer normalizes parameters so that the model can be deepened without overfitting; the SiLU activation function layer adds nonlinear characteristics to the deep network and improves the fitting capacity of the model.
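A minimal sketch of such a convolution module (a convolution layer, a batch normalization layer and a SiLU activation function layer connected in sequence); the channel counts, kernel size, stride and padding are placeholders, since the patent leaves them to be set according to actual conditions:

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Convolution layer -> batch normalization layer -> SiLU activation function layer."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# usage: a first convolution module turning a (here assumed single-channel)
# two-dimensional projection image into an initial feature map
projection = torch.randn(1, 1, 640, 640)
initial_feature_map = ConvModule(1, 32, stride=2)(projection)
print(initial_feature_map.shape)  # torch.Size([1, 32, 320, 320])
```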
Specifically, as shown in fig. 2, the first convolution module, the first feature extraction layer, the second feature extraction layer, the third feature extraction layer and the fourth feature extraction layer are sequentially connected from top to bottom, feature images of different scales of the two-dimensional projection image are extracted, and the obtained first feature image, the second feature image, the third feature image and the fourth feature image of different scales are input to the feature fusion layer.
Specifically, as shown in fig. 2, a convolution operation and a residual convolution operation are performed on the input feature map in the first feature extraction layer, the second feature extraction layer and the third feature extraction layer, respectively; and performing one-time convolution operation, residual convolution operation, spatial channel attention operation and fusion operation on the input feature map in a fourth feature extraction layer.
Preferably, as shown in fig. 2, the first feature extraction layer, the second feature extraction layer and the third feature extraction layer each include a second convolution module and a first residual convolution module that are sequentially connected; the fourth feature extraction layer comprises a second convolution module, a first residual convolution module, a spatial channel attention module and a fusion module which are sequentially connected;
the second convolution module is used for carrying out one-time convolution operation on the input feature map and inputting the feature map after the convolution operation to the first residual convolution module;
the first residual convolution module is used for carrying out residual convolution operation on the input feature images and outputting the feature images after the convolution operation;
the spatial channel attention module is used for performing spatial channel attention operation on the input feature images and inputting the feature images after the spatial channel attention operation to the fusion module;
And the fusion module is used for fusing the feature images after the attention operation of the space channel to obtain a fourth feature image, and inputting the fourth feature image into the feature fusion layer.
Specifically, the first feature extraction layer, the second feature extraction layer, the third feature extraction layer and the fourth feature extraction layer each include a second convolution module. As shown in fig. 3, the second convolution module is similar in structure to the first convolution module and includes a convolution layer, a batch normalization layer and a SiLU activation function layer that are sequentially connected; the input feature map passes through the convolution layer, the batch normalization layer and the SiLU activation function layer in sequence to obtain the feature map after the convolution operation. It can be understood that the convolution layer adopts two-dimensional convolution, and the size of the convolution kernel, the convolution step length, the number of convolution kernels and whether padding is used are set reasonably according to actual conditions.
It should be noted that the structures of the first convolution module and the second convolution module are similar, and the difference is the size of the convolution kernel, the convolution step length, the number of convolution kernels and whether the convolution kernels are filled or not.
Specifically, as shown in fig. 2, the first feature extraction layer, the second feature extraction layer, the third feature extraction layer and the fourth feature extraction layer each include a first residual convolution module. As shown in fig. 4, the first residual convolution module includes a convolution module, a matrix slicing layer, a plurality of hidden neck layers, a splicing layer and a convolution module that are sequentially connected from top to bottom, and the structure of the convolution module is shown in fig. 3. The matrix slicing layer performs matrix slicing on the input feature map to accelerate the training speed of the model; each hidden neck layer includes two convolution modules connected in sequence and is used for further feature extraction; the splicing layer performs matrix splicing on the input feature maps so that the features extracted by different layers can be better utilized.
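A non-authoritative sketch of a residual convolution module matching this description (a convolution module, a matrix slicing layer, hidden neck layers of two convolution modules each, a splicing layer and a final convolution module); the number of hidden neck layers and the channel arithmetic are assumptions, and ConvModule is the hypothetical Conv-BN-SiLU class sketched above, repeated here so the snippet is self-contained:

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Convolution layer -> batch normalization layer -> SiLU activation function layer."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )
    def forward(self, x):
        return self.block(x)

class HiddenNeck(nn.Module):
    """Two convolution modules connected in sequence for further feature extraction."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(ConvModule(channels, channels), ConvModule(channels, channels))
    def forward(self, x):
        return self.block(x)

class ResidualConvModule(nn.Module):
    """Convolution module -> matrix slicing layer -> hidden neck layers -> splicing layer -> convolution module."""
    def __init__(self, in_channels, out_channels, num_necks=2):
        super().__init__()
        # out_channels is assumed even so the matrix slicing yields two equal halves
        self.entry = ConvModule(in_channels, out_channels, kernel_size=1, padding=0)
        half = out_channels // 2
        self.necks = nn.ModuleList(HiddenNeck(half) for _ in range(num_necks))
        self.exit = ConvModule(half * (num_necks + 2), out_channels, kernel_size=1, padding=0)

    def forward(self, x):
        a, b = self.entry(x).chunk(2, dim=1)        # matrix slicing layer
        parts = [a, b]
        for neck in self.necks:
            parts.append(neck(parts[-1]))           # hidden neck layers
        return self.exit(torch.cat(parts, dim=1))   # splicing layer followed by the final convolution module
```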
Preferably, as shown in fig. 5, the spatial channel attention module includes a channel attention module and a spatial attention module connected in sequence;
the channel attention module is used for obtaining a first attention feature map based on the input feature map, determining the weight of each channel according to the first attention feature map, and obtaining a second attention feature map based on the weight of each channel and the first attention feature map;
the space attention module receives the second attention feature map, determines the weight of each space according to the second attention feature map, obtains a third attention feature map based on the weight of each space and the second attention feature map, and inputs the third attention feature map to the fusion module as a feature map after the attention operation of the space channel.
Specifically, the spatial channel attention module is used for extracting the salient features of the input feature map, promoting the refinement of the self-adaptive features, emphasizing the features with significance in the channel and spatial dimensions, and simultaneously inhibiting the redundant features.
Specifically, the channel attention module is configured to perform a maximum pooling operation and a mean pooling operation on an input first attention feature map, and obtain a second attention feature map by passing the two output feature maps through a full connection layer and an activation function layer.
Specifically, the spatial attention module is configured to perform a maximum pooling operation and a mean pooling operation on an input second attention feature map, splice two output feature maps, and obtain a third attention feature map through a convolution layer and an activation function layer.
It should be noted that the channel attention module is a pooling operation performed on the channel axis, and the spatial attention module is a pooling operation performed on the image feature axis.
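A hedged sketch consistent with the description above (channel attention: max pooling and mean pooling of the input feature map followed by a shared fully connected layer and an activation function; spatial attention: max pooling and mean pooling along the channel axis, splicing, a convolution layer and an activation function); the reduction ratio and the 7x7 kernel are assumptions:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.gate = nn.Sigmoid()

    def forward(self, x):  # x: first attention feature map
        b, c, _, _ = x.shape
        weights = self.gate(self.fc(x.amax(dim=(2, 3))) + self.fc(x.mean(dim=(2, 3))))
        return x * weights.view(b, c, 1, 1)  # second attention feature map

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.gate = nn.Sigmoid()

    def forward(self, x):  # x: second attention feature map
        pooled = torch.cat([x.amax(dim=1, keepdim=True), x.mean(dim=1, keepdim=True)], dim=1)
        return x * self.gate(self.conv(pooled))  # third attention feature map

class SpatialChannelAttention(nn.Module):
    """Channel attention module followed by a spatial attention module, connected in sequence."""
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```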
Preferably, as shown in fig. 2, the feature fusion layer includes a deep-to-shallow fusion layer and a shallow-to-deep fusion layer; the deep-to-shallow fusion layer comprises a first sampling fusion layer, a second sampling fusion layer and a third sampling fusion layer which are sequentially connected; the shallow-to-deep fusion layer comprises a first convolution fusion layer, a second convolution fusion layer and a third convolution fusion layer which are sequentially connected;
the first sampling fusion layer is used for receiving and fusing the fourth feature map and the third feature map to obtain a first fusion feature map; the second sampling fusion layer is used for receiving the first fusion feature map and the second fusion feature map and fusing the first fusion feature map and the second fusion feature map to obtain a second fusion feature map; the third sampling fusion layer is used for receiving the second fusion feature map and the first feature map and fusing the second fusion feature map and the first feature map to obtain a third fusion feature map;
The first convolution fusion layer is used for receiving the second fusion feature map and the third fusion feature map and fusing the second fusion feature map and the third fusion feature map to obtain a fourth fusion feature map; the second convolution fusion layer is used for receiving the fourth fusion feature map and the first fusion feature map and fusing the fourth fusion feature map and the first fusion feature map to obtain a fifth fusion feature map; the third convolution fusion layer is used for receiving the fifth fusion feature map and the fourth fusion feature map and fusing the fifth fusion feature map and the fourth fusion feature map to obtain a sixth fusion feature map;
and inputting the third fusion characteristic diagram, the fourth fusion characteristic diagram, the fifth fusion characteristic diagram and the sixth fusion characteristic diagram to the prediction layer.
Specifically, as shown in fig. 2, the feature fusion layer is configured to fuse the input first feature map, second feature map, third feature map and fourth feature map, combine the image features of the shallow layer with the semantic features of the deep layer, and obtain a more complete feature map, where the image features of the shallow layer are the first feature map and the second feature map, and the semantic features of the deep layer are the third feature map and the fourth feature map.
It is worth to describe that, in the embodiment of the invention, the first feature map of the shallowest layer features is added in the feature fusion layer, so that the features of the shallower layer can be extracted conveniently, the detection of the small target by the image target detection model is improved, and the detection precision of the small target in the two-dimensional projection image is improved.
Preferably, as shown in fig. 2, the first sampling fusion layer, the second sampling fusion layer and the third sampling fusion layer all comprise an up-sampling layer, a splicing layer and a second residual convolution module which are sequentially connected;
in the first sampling fusion layer, an up-sampling layer carries out up-sampling operation on the fourth feature map; the splicing layer receives the third characteristic diagram and the characteristic diagram after the up-sampling operation to carry out splicing operation, and the characteristic diagram after the splicing operation is input to a second residual convolution module; the second residual convolution module carries out residual convolution operation on the feature images after the splicing operation to obtain a first fusion feature image;
in the second sampling fusion layer, the up-sampling layer receives the first fusion feature map to perform up-sampling operation; the splicing layer receives the second feature map and the feature map subjected to the up-sampling operation to carry out splicing operation, and the feature map subjected to the splicing operation is input to a second residual convolution module; the second residual convolution module carries out residual convolution operation on the feature images after the splicing operation to obtain a second fusion feature image;
in the third sampling fusion layer, the up-sampling layer receives the second fusion feature map to perform up-sampling operation; the splicing layer receives the first feature image and the feature image subjected to the up-sampling operation to carry out splicing operation, and the feature image subjected to the splicing operation is input to a second residual convolution module; and the second residual convolution module carries out residual convolution operation on the feature images after the splicing operation to obtain a third fusion feature image.
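As an illustration only, a sampling fusion layer of this kind (up-sampling layer, splicing layer, residual convolution module) might be sketched as follows; for brevity the residual convolution module is stood in for by a single Conv-BN-SiLU block, and the choice of nearest-neighbour up-sampling is an assumption:

```python
import torch
import torch.nn as nn

class SamplingFusionLayer(nn.Module):
    """Up-sampling layer -> splicing layer -> residual convolution module (simplified stand-in)."""

    def __init__(self, deep_channels, shallow_channels, out_channels):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.residual_conv = nn.Sequential(  # stand-in for the residual convolution module
            nn.Conv2d(deep_channels + shallow_channels, out_channels, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )

    def forward(self, deep_map, shallow_map):
        up = self.upsample(deep_map)                                    # up-sampling operation
        return self.residual_conv(torch.cat([up, shallow_map], dim=1))  # splicing then convolution
```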
Preferably, as shown in fig. 2, the first convolution fusion layer, the second convolution fusion layer and the third convolution fusion layer all comprise a third convolution module, a splicing layer and a second residual convolution module which are sequentially connected;
in the first convolution fusion layer, a third convolution module is used for receiving a third fusion feature map to carry out convolution operation; the splicing layer receives the second fusion feature image and the feature image after the convolution operation to carry out splicing operation, and the feature image after the splicing operation is input to a third residual convolution module; the third residual convolution module carries out residual convolution operation on the feature images after the splicing operation to obtain a fourth fusion feature image;
in the second convolution fusion layer, the third convolution module is used for receiving the fourth fusion feature map to carry out convolution operation; the splicing layer receives the first fusion feature image and the feature image after the convolution operation to carry out splicing operation, and the feature image after the splicing operation is input to a third residual convolution module; the third residual convolution module carries out residual convolution operation on the feature images after the splicing operation to obtain a fifth fusion feature image;
in the third convolution fusion layer, the third convolution module is used for receiving the fifth fusion feature map to carry out convolution operation; the splicing layer receives the fourth characteristic diagram and the characteristic diagram after the convolution operation to carry out splicing operation, and the characteristic diagram after the splicing operation is input to a third residual convolution module; and the third residual convolution module performs residual convolution operation on the feature images after the splicing operation to obtain a sixth fusion feature image.
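Correspondingly, a convolution fusion layer (convolution module, splicing layer, residual convolution module) might look like the sketch below; the stride-2 down-sampling convolution and the simplified stand-in for the residual convolution module are assumptions:

```python
import torch
import torch.nn as nn

class ConvFusionLayer(nn.Module):
    """Convolution module (assumed stride 2) -> splicing layer -> residual convolution module (stand-in)."""

    def __init__(self, shallow_channels, deep_channels, out_channels):
        super().__init__()
        self.down = nn.Sequential(  # convolution module, assumed to halve the spatial resolution
            nn.Conv2d(shallow_channels, shallow_channels, 3, 2, 1, bias=False),
            nn.BatchNorm2d(shallow_channels),
            nn.SiLU(),
        )
        self.residual_conv = nn.Sequential(  # stand-in for the residual convolution module
            nn.Conv2d(shallow_channels + deep_channels, out_channels, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )

    def forward(self, shallow_map, deep_map):
        down = self.down(shallow_map)                                    # convolution operation
        return self.residual_conv(torch.cat([down, deep_map], dim=1))    # splicing then convolution
```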
It is worth to say that the second residual convolution module and the first residual convolution module are similar in structure, and the specific parameter setting is reasonably set according to actual conditions.
The third convolution module is similar in structure to the first convolution module and the second convolution module.
Specifically, the feature fusion layer inputs the third fusion feature map, the fourth fusion feature map, the fifth fusion feature map and the sixth fusion feature map to the prediction layer for detection and identification.
Preferably, as shown in fig. 2, the prediction layer includes a first prediction layer, a second prediction layer, a third prediction layer, and a fourth prediction layer, which are configured to receive the third fusion feature map, the fourth fusion feature map, the fifth fusion feature map, and the sixth fusion feature map, respectively, and obtain a first prediction result, a second prediction result, a third prediction result, and a fourth prediction result according to the respective fusion feature maps;
the prediction layer further comprises an average output layer, and the average output layer is used for receiving the first prediction result, the second prediction result, the third prediction result and the fourth prediction result and determining the prediction result of the image target detection model.
Specifically, the first prediction result, the second prediction result, the third prediction result and the fourth prediction result all comprise two-dimensional bounding boxes and corresponding category confidence coefficients, a final recognition result is determined through average output layer calculation, and the recognition result is the two-dimensional bounding boxes and the corresponding category confidence coefficients of all objects to be recognized in the two-dimensional projection image. For example, if one two-dimensional projection image includes 3 objects to be identified, respectively obtaining two-dimensional bounding boxes of the 3 objects to be identified and corresponding category confidence degrees according to a third fusion feature image, a fourth fusion feature image, a fifth fusion feature image and a sixth fusion feature image in a prediction layer, respectively including 4 two-dimensional bounding boxes and 4 category confidence degrees for each object to be identified, calculating an average value of the 4 two-dimensional bounding boxes as the two-dimensional bounding boxes of the object to be identified, and calculating an average value of the 4 category confidence degrees as the category confidence degrees of the object to be identified. And finally obtaining a prediction result of the image target detection model, namely a two-dimensional bounding box of the target to be identified after averaging and a corresponding category confidence.
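As a minimal sketch of the averaging described above (assuming each prediction layer outputs, for every object to be identified, four side coordinates and P category confidences; the tensor shapes are hypothetical):

```python
import torch

def average_predictions(boxes_per_layer, confidences_per_layer):
    """Average the predictions of the four prediction layers for the same objects.

    boxes_per_layer:       shape (4, num_targets, 4), one (x_t, x_b, x_l, x_r) box per layer and target
    confidences_per_layer: shape (4, num_targets, P), P preset category confidences per layer and target
    """
    return boxes_per_layer.mean(dim=0), confidences_per_layer.mean(dim=0)

# usage: 4 prediction layers, 3 objects to be identified, assumed P = 5 preset categories
boxes = torch.rand(4, 3, 4)
confidences = torch.rand(4, 3, 5)
avg_boxes, avg_confidences = average_predictions(boxes, confidences)
```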
Preferably, the loss function of the two-dimensional bounding box predicted by the image target detection model is as follows:
wherein x_t, x_b, x_l and x_r respectively represent the two-dimensional coordinates of the predicted bounding box, their counterparts respectively represent the two-dimensional coordinates of the actual bounding box, I is the intersection area, and U is the union area;
the loss function of the class predicted by the image target detection model is as follows:
wherein L represents the loss function of the predicted class, N represents the total number of classes, i represents the i-th class, Y_i represents the label value of the i-th class, and P_i represents the predicted value of the i-th class.
Specifically, the image target detection model is trained in the prediction layer using the loss function of the predicted two-dimensional bounding box and the loss function of the predicted class, obtaining the trained image target detection model.
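The formulas above are described only through their variables; the following is a plausible, non-authoritative sketch consistent with those variables, not the patent's exact formulas: an IoU-style bounding box loss built from the intersection area I and union area U of boxes given by (x_t, x_b, x_l, x_r), and a cross-entropy-style class loss over N classes with label values Y_i and predicted values P_i.

```python
import torch

def bounding_box_loss(pred, target, eps=1e-7):
    """Plausible IoU-style bounding box loss (a sketch, not the patent's exact formula).

    pred, target: (..., 4) tensors of (x_t, x_b, x_l, x_r) -- distances from a common
    reference origin (assumed to lie inside both boxes) to the top, bottom, left and
    right sides of the predicted / actual two-dimensional bounding box.
    """
    pred_area = (pred[..., 0] + pred[..., 1]) * (pred[..., 2] + pred[..., 3])
    target_area = (target[..., 0] + target[..., 1]) * (target[..., 2] + target[..., 3])
    inter_h = torch.minimum(pred[..., 0], target[..., 0]) + torch.minimum(pred[..., 1], target[..., 1])
    inter_w = torch.minimum(pred[..., 2], target[..., 2]) + torch.minimum(pred[..., 3], target[..., 3])
    intersection = inter_h.clamp(min=0) * inter_w.clamp(min=0)   # I: intersection area
    union = pred_area + target_area - intersection               # U: union area
    return 1.0 - intersection / (union + eps)                    # smaller as I/U approaches 1

def class_loss(labels, probs, eps=1e-7):
    """Plausible cross-entropy-style class loss over N classes.

    labels: (..., N) label values Y_i; probs: (..., N) predicted values P_i in (0, 1].
    """
    return -(labels * torch.log(probs + eps)).sum(dim=-1)
```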
It is worth noting that when the trained image target detection model is used, a two-dimensional projection image is directly input, and the plurality of targets included in the two-dimensional projection image, together with the two-dimensional bounding box and category confidence corresponding to each target, are obtained.
It is also worth noting that the image target detection model obtained through training with the training method provided by the embodiment of the invention can identify the CT reconstructed image, obtaining the two-dimensional bounding boxes of the targets included in the CT reconstructed image and the corresponding category confidences.
Compared with the prior art, the training method of the image target detection model provided by the embodiment of the invention acquires two-dimensional projection images of the CT reconstructed images under different viewing angles and sets category labels and two-dimensional bounding boxes for each two-dimensional projection image, so that training samples of the image target detection model are rapidly obtained and the training speed of the attention-mechanism-based image target detection model is improved. Meanwhile, the three directions XY, XZ and YZ are used as the first preset viewing angle for projection, so that training samples are rapidly obtained and the recognition effect of the trained image target detection model on two-dimensional projection images is improved. In addition, the feature extraction layer acquires four feature maps from shallow to deep from the two-dimensional projection image, the feature fusion layer combines deep-to-shallow fusion and shallow-to-deep fusion to obtain rich feature information, and finally the prediction layer judges whether the training iteration condition is met, so that training of the image target detection model is completed rapidly.
Those skilled in the art will appreciate that all or part of the flow of the methods of the embodiments described above may be accomplished by way of a computer program to instruct associated hardware, where the program may be stored on a computer readable storage medium. Wherein the computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory, etc.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims (10)

1. A training method for an image target detection model, the training method comprising:
acquiring a large number of CT reconstructed images from a historical CT reconstructed image database of the CT security inspection equipment, and screening a first preset number of CT reconstructed image samples from the large number of CT reconstructed images; performing projection of a first preset view angle on each CT reconstructed image sample to obtain a plurality of two-dimensional projection images corresponding to the first preset view angle;
determining a plurality of targets included in each two-dimensional projection image, setting a two-dimensional bounding box of each target according to a preset coordinate system, setting a class label of each target at the same time, and storing each two-dimensional projection image, the corresponding two-dimensional bounding box and the class label as target training samples into a target training sample library until a second preset number of target training samples are obtained;
dividing target training samples in a target training sample library into a training set and a testing set, and training an image target detection model until training iteration conditions are met, so as to obtain a trained image target detection model; the image target detection model is a deep neural network model based on an attention mechanism.
2. Training method according to claim 1, characterized in that the CT reconstructed image is projected at the first preset view angle by any one of the following projection methods:
orthogonal projection;
perspective projection;
oblique projection;
flight projection;
nonlinear projection.
3. The training method of claim 2, wherein the first preset viewing angle comprises three directions XY, XZ, and YZ.
4. The training method of claim 1, wherein the image object detection model comprises a feature extraction layer, a feature fusion layer, and a prediction layer;
the feature extraction layer is used for receiving each two-dimensional projection image, extracting feature images with different scales of each two-dimensional projection image based on an attention mechanism, and inputting the extracted feature images with different scales into the feature fusion layer;
the feature fusion layer is used for fusing the feature images with different scales and inputting the fused feature images to the prediction layer;
and the prediction layer is used for predicting each two-dimensional projection image according to the fused feature images to obtain a two-dimensional bounding box of the object to be identified and a corresponding category confidence coefficient.
5. The training method of claim 4, wherein the feature extraction layer comprises a first convolution module, a first feature extraction layer, a second feature extraction layer, a third feature extraction layer, and a fourth feature extraction layer connected in sequence;
the first convolution module is used for receiving each two-dimensional projection image, carrying out convolution operation on each two-dimensional projection image once, and inputting the obtained initial feature image into the first feature extraction layer;
the first feature extraction layer is used for carrying out one-time convolution operation and residual convolution operation on the initial feature map to obtain a first feature map, and inputting the first feature map into the second feature extraction layer and the feature fusion layer;
the second feature extraction layer is used for carrying out one-time convolution operation and residual convolution operation on the first feature image to obtain a second feature image, and inputting the second feature image into the third feature extraction layer and the feature fusion layer;
the third feature extraction layer is used for carrying out one-time convolution operation and residual convolution operation on the second feature image to obtain a third feature image, and inputting the third feature image into the fourth feature extraction layer and the feature fusion layer;
and the fourth feature extraction layer is used for carrying out one-time convolution operation, residual convolution operation, space channel attention operation and fusion operation on the third feature map to obtain a fourth feature map, and inputting the fourth feature map into the feature fusion layer.
6. The training method of claim 5, wherein the first feature extraction layer, the second feature extraction layer, and the third feature extraction layer each comprise a second convolution module and a first residual convolution module connected in sequence; the fourth feature extraction layer comprises a second convolution module, a first residual convolution module, a spatial channel attention module and a fusion module which are sequentially connected;
the second convolution module is used for performing one convolution operation on the input feature map and inputting the convolved feature map to the first residual convolution module;
the first residual convolution module is used for performing a residual convolution operation on the input feature map and outputting the resulting feature map;
the spatial channel attention module is used for performing a spatial channel attention operation on the input feature map and inputting the resulting feature map to the fusion module;
and the fusion module is used for fusing the feature map output by the spatial channel attention operation to obtain a fourth feature map, and inputting the fourth feature map into the feature fusion layer.
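A hedged sketch of the two building blocks named in claim 6 follows. The kernel sizes, stride, normalisation, and activation (Conv-BN-SiLU) are assumptions, since the claim fixes only the order of operations; in the fourth feature extraction layer these blocks would be followed by the spatial channel attention module and the fusion module.

```python
import torch.nn as nn

class ConvModule(nn.Module):
    """'Second convolution module': one convolution (assumed Conv-BN-SiLU, stride 2 for downsampling)."""
    def __init__(self, cin, cout, stride=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
            nn.BatchNorm2d(cout),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class ResidualConvModule(nn.Module):
    """'First residual convolution module': two convolutions with an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = ConvModule(channels, channels, stride=1)
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, 1, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.conv2(self.conv1(x)))  # residual addition before activation
```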
7. The training method of claim 6, wherein the spatial channel attention module comprises a channel attention module and a spatial attention module connected in sequence;
the channel attention module is used for obtaining a first attention feature map based on the input feature map, determining the weight of each channel according to the first attention feature map, and obtaining a second attention feature map based on the weight of each channel and the first attention feature map;
the spatial attention module receives the second attention feature map, determines the weight of each spatial position according to the second attention feature map, obtains a third attention feature map based on the weight of each spatial position and the second attention feature map, and inputs the third attention feature map to the fusion module as the feature map after the spatial channel attention operation.
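The channel-then-spatial ordering of claim 7 matches the widely used CBAM pattern, and the sketch below follows that pattern. The pooling statistics, reduction ratio, and 7x7 kernel are assumptions not stated in the claim.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                      # x: first attention feature map (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))     # per-channel average statistics
        mx = self.mlp(x.amax(dim=(2, 3)))      # per-channel max statistics
        w = torch.sigmoid(avg + mx)[..., None, None]   # weight of each channel
        return x * w                           # second attention feature map

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: second attention feature map
        stats = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        w = torch.sigmoid(self.conv(stats))    # weight of each spatial position
        return x * w                           # third attention feature map

class SpatialChannelAttention(nn.Module):
    """Channel attention followed by spatial attention, as ordered in claim 7."""
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```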
8. The training method of claim 4, wherein the feature fusion layer comprises a deep-to-shallow fusion layer and a shallow-to-deep fusion layer; the deep-to-shallow fusion layer comprises a first sampling fusion layer, a second sampling fusion layer and a third sampling fusion layer which are sequentially connected; the shallow-to-deep fusion layer comprises a first convolution fusion layer, a second convolution fusion layer and a third convolution fusion layer which are sequentially connected;
the first sampling fusion layer is used for receiving and fusing the fourth feature map and the third feature map to obtain a first fusion feature map; the second sampling fusion layer is used for receiving the first fusion feature map and the second feature map and fusing the first fusion feature map and the second feature map to obtain a second fusion feature map; the third sampling fusion layer is used for receiving the second fusion feature map and the first feature map and fusing the second fusion feature map and the first feature map to obtain a third fusion feature map;
the first convolution fusion layer is used for receiving the second fusion feature map and the third fusion feature map and fusing the second fusion feature map and the third fusion feature map to obtain a fourth fusion feature map; the second convolution fusion layer is used for receiving the fourth fusion feature map and the first fusion feature map and fusing the fourth fusion feature map and the first fusion feature map to obtain a fifth fusion feature map; the third convolution fusion layer is used for receiving the fifth fusion feature map and the fourth fusion feature map and fusing the fifth fusion feature map and the fourth fusion feature map to obtain a sixth fusion feature map;
and the third fusion feature map, the fourth fusion feature map, the fifth fusion feature map and the sixth fusion feature map are input to the prediction layer.
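The deep-to-shallow and shallow-to-deep paths of claim 8 resemble a PAN-style neck. The two fusion blocks sketched below show one plausible realisation: a sampling fusion layer up-samples the deeper map before fusing, and a convolution fusion layer down-samples the shallower map before fusing. The concatenate-then-convolve fusion, the resampling modes, and the channel widths are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_act(cin, cout, k=3, s=1):
    return nn.Sequential(nn.Conv2d(cin, cout, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(cout), nn.SiLU(inplace=True))

class SamplingFusion(nn.Module):
    """Up-samples the deeper map to the shallower map's size and fuses the pair."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.fuse = conv_bn_act(deep_ch + shallow_ch, out_ch)

    def forward(self, deep, shallow):
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="nearest")
        return self.fuse(torch.cat([deep, shallow], dim=1))

class ConvFusion(nn.Module):
    """Down-samples the shallower map with a strided convolution and fuses the pair."""
    def __init__(self, shallow_ch, deep_ch, out_ch):
        super().__init__()
        self.down = conv_bn_act(shallow_ch, shallow_ch, s=2)
        self.fuse = conv_bn_act(shallow_ch + deep_ch, out_ch)

    def forward(self, shallow, deep):
        shallow = self.down(shallow)
        shallow = F.interpolate(shallow, size=deep.shape[-2:], mode="nearest")  # guard against odd sizes
        return self.fuse(torch.cat([shallow, deep], dim=1))
```

Chaining three SamplingFusion blocks and then three ConvFusion blocks over the four backbone feature maps reproduces the fusion order stated in the claim.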
9. The training method of claim 8, wherein the prediction layer comprises a first prediction layer, a second prediction layer, a third prediction layer, and a fourth prediction layer, which are configured to receive the third fusion feature map, the fourth fusion feature map, the fifth fusion feature map, and the sixth fusion feature map, respectively, and to obtain a first prediction result, a second prediction result, a third prediction result, and a fourth prediction result according to the respective fusion feature maps;
the prediction layer further comprises an average output layer, and the average output layer is used for receiving the first prediction result, the second prediction result, the third prediction result, and the fourth prediction result and determining the prediction result of the image target detection model.
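One possible reading of claim 9 is sketched below: each prediction layer is a small convolutional head over one fusion feature map, and the average output layer merges the four scale-specific outputs. The head layout and the resize-then-average merging rule are assumptions, since the claim does not specify the merging procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionLayer(nn.Module):
    """One scale-specific prediction layer: per-location box coordinates plus class scores."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.head = nn.Conv2d(in_channels, 4 + num_classes, kernel_size=1)

    def forward(self, fused_map):          # (B, C, H, W) fusion feature map
        return self.head(fused_map)        # (B, 4 + num_classes, H, W) raw prediction result

class AverageOutputLayer(nn.Module):
    """Assumed merging rule: resize the four raw prediction maps to a common grid and average them."""
    def forward(self, predictions):        # list of (B, 4 + num_classes, H_i, W_i)
        target_size = predictions[0].shape[-2:]
        aligned = [F.interpolate(p, size=target_size, mode="nearest") for p in predictions]
        merged = torch.stack(aligned, dim=0).mean(dim=0)
        boxes = merged[:, :4]                        # two-dimensional bounding-box outputs
        class_confidence = merged[:, 4:].sigmoid()   # category confidence per class
        return boxes, class_confidence
```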
10. The training method of claim 1, wherein the loss function used by the image target detection model to predict the two-dimensional bounding box is given by the bounding-box loss equation of the original filing (presented there as an image), in which x_t, x_b, x_l, x_r respectively denote the two-dimensional coordinates of the predicted bounding box, the corresponding ground-truth symbols respectively denote the two-dimensional coordinates of the actual bounding box, I is the intersection area, and U is the union area;
the category loss function predicted by the image target detection model is given by the category loss equation of the original filing (likewise presented as an image), in which L denotes the loss function of the predicted category, N denotes the total number of categories, i indexes the i-th category, Y_i denotes the label value of the i-th category, and P_i denotes the predicted value of the i-th category.
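Because the two loss equations appear only as images in the source text, the sketch below shows one common realisation that is consistent with the stated symbols: an IoU-style box loss built from the intersection area I and union area U of the predicted and actual boxes, and a categorical cross-entropy over N classes. Both specific forms are assumptions.

```python
import torch

def bbox_iou_loss(pred, target, eps=1e-7):
    """pred/target: (..., 4) boxes given as (top, bottom, left, right) coordinates."""
    pt, pb, pl, pr = pred.unbind(-1)
    tt, tb, tl, tr = target.unbind(-1)
    inter_h = (torch.minimum(pb, tb) - torch.maximum(pt, tt)).clamp(min=0)
    inter_w = (torch.minimum(pr, tr) - torch.maximum(pl, tl)).clamp(min=0)
    I = inter_h * inter_w                                    # intersection area
    U = (pb - pt) * (pr - pl) + (tb - tt) * (tr - tl) - I    # union area
    return (1.0 - I / (U + eps)).mean()                      # IoU-style loss (assumed form)

def class_cross_entropy_loss(P, Y, eps=1e-7):
    """P: predicted class probabilities (B, N); Y: one-hot or soft labels (B, N)."""
    return -(Y * torch.log(P.clamp(min=eps))).sum(dim=-1).mean()
```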
CN202311744270.9A 2023-12-18 2023-12-18 Training method of image target detection model Pending CN117726798A (en)

Priority Applications (1)

CN202311744270.9A, Training method of image target detection model (priority date 2023-12-18, filing date 2023-12-18)


Publications (1)

Publication Number: CN117726798A; Publication Date: 2024-03-19

Family

ID=90203033

Family Applications (1)

CN202311744270.9A (CN117726798A, pending): Training method of image target detection model; priority date 2023-12-18; filing date 2023-12-18

Country Status (1)

Country: CN; publication: CN117726798A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination