WO2022120901A1

WO2022120901A1 - Image detection model training method based on feature pyramid, medium, and device

Info

Publication number: WO2022120901A1
Application number: PCT/CN2020/136553
Authority: WO
Inventors: 胡庆茂; 张伟烽
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2020-12-09
Filing date: 2020-12-15
Publication date: 2022-06-16
Also published as: CN114612374A

Abstract

An image detection model training method based on a feature pyramid, a storage medium, and a device. The training method comprises: inputting an obtained original detection image into a feature extraction network to obtain multiple hierarchical feature maps of different scales (S10); inputting the hierarchical feature maps into a triangular feature pyramid fusion network to obtain multiple fusion feature maps of different scales (S20); inputting the multiple fusion feature maps of different scales into a regression prediction network to obtain a predicted target value (S30); updating a loss function according to the predicted target value and an obtained true target value (S40); and updating, according to the updated loss function, a network parameter of an image detection model to be trained (S50). In the scheme, a fusion network having at least five different fusion paths is constructed, such that the feature maps of different scales are fully fused, more detail information and original information are reserved, the detection accuracy of the model is improved, and the performance and efficiency of detection networks in the field of security check are improved.

Description

Training method, medium and device for image detection model based on feature pyramid

technical field

The invention belongs to the technical field of image processing, and in particular, relates to a training method of an image detection model based on a feature pyramid, a computer-readable storage medium, and a computer device.

Background technique

X-ray security inspection technology is widely used in the safety control of public transportation places such as subways and airports. In order to adapt to the ever-increasing traffic throughput and severe security situation, security inspection must have both high real-time and accuracy. However, in real life, the current mainstream work methods are mainly screened by security staff who have undergone certain professional training. The quality and efficiency of security inspections are easily negatively affected by external factors such as work status, mood swings, and work intensity. In addition, the pre-training expenditure and high labor cost are also one of the inherent disadvantages of the enterprise that cannot be ignored.

The target detection algorithm based on deep learning effectively overcomes the shortcomings of the existing methods discussed above, and has shown great potential in the detection of dangerous goods in X-ray security images. As an auxiliary detection method, the use of algorithms to automatically detect dangerous goods can maintain the alertness of human operators to a certain extent, reduce the false detection rate and response time, and greatly reduce labor costs.

Due to its wide application prospects and market value, the automatic detection of dangerous goods in X-ray security inspection images based on deep learning has always been one of the research hotspots in academia and industry. Generally speaking, target detection algorithms based on deep learning are mainly divided into anchor-based and anchor-free networks according to whether a preset anchor mechanism is used. Among the common target detection algorithms, Faster R-CNN, Mask R-CNN, RetinaNet and other networks are anchor-based mechanisms, while FCOS, CenterNet and other networks are anchor-free mechanisms.

The object detection networks discussed above (Faster R-CNN, Mask R-CNN, RetinaNet, YOLOv3, etc.) have achieved impressive performance in the automatic detection of dangerous goods in public X-ray security image datasets. However, the above networks all use the most basic feature fusion module FPN, which to a certain extent plays the role of fusing features of different scales, which can improve the accuracy. However, the security inspection image is very complex in nature, not only contains a large number of dangerous goods of varying sizes and shapes, but also has a lot of background information interference and potential problems such as occlusion and overlap. Ordinary simple feature fusion structures cannot further integrate multi-scale The feature information and the inability to extract more detailed information for the network for subsequent classification and localization make the overall performance unsatisfactory.

SUMMARY OF THE INVENTION

(1) Technical problem to be solved by the present invention

How to incorporate more scale features in the training process to obtain more detailed information to improve the classification prediction accuracy of the model.

(2) Technical scheme adopted in the present invention

The present application discloses a training method for an image detection model based on a feature pyramid. The image detection model to be trained includes a feature extraction network, a triangular feature pyramid fusion network and a regression prediction network, wherein the triangular feature pyramid fusion network includes several fusion units. And the triangular feature pyramid fusion network has at least five different fusion paths, and the training method includes:

Input the acquired original detection image into the feature extraction network to obtain several hierarchical feature maps of different scales;

Inputting the hierarchical feature map into the triangular feature pyramid fusion network to obtain fusion feature maps of several different scales;

Input several fused feature maps of different scales into the regression prediction network to obtain the predicted target value;

Update the loss function according to the predicted target value and the obtained real target value;

The network parameters of the image detection model to be trained are updated according to the updated loss function.

Optionally, the triangular feature pyramid fusion network includes at least three fusion layers, and the number of fusion layers decreases as the scale of the fusion layers decreases.

Optionally, the triangular feature pyramid fusion network has:

The first fusion path is used for fusion to form feature maps of different scales;

The second fusion path is used to shorten the transmission distance from low-level features to high-level features;

The third fusion path is used to fuse the feature information of the same scale;

The fourth fusion path is used to fuse the data of the fusion units that are respectively located in the two adjacent fusion layers and are respectively located in the first fusion path and the second fusion path;

The fifth fusion path is used to fuse the feature information of the input unit and the output unit of the same fusion layer.

Optionally, the triangular feature pyramid fusion network includes five fusion layers, and the number of fusion units in the five fusion layers is five, four, three, two, and one, respectively.

Optionally, the image detection model to be trained further includes a symmetrical triangular feature pyramid fusion network, the symmetrical triangular feature pyramid fusion network includes several fusion units, and the symmetrical triangular feature pyramid fusion network has at least five different fusion paths. , and each fusion unit of the symmetrical triangular feature pyramid fusion network and each fusion unit of the triangular feature pyramid fusion network are symmetrically distributed, wherein the training method further includes:

Inputting the hierarchical feature map into the symmetrical triangular feature pyramid fusion network to obtain symmetrical fusion feature maps of several different scales;

adding the fusion feature map and the symmetrical fusion feature map of the same scale to obtain a global feature map;

Inputting the global feature maps of different scales into the regression prediction network to obtain a global prediction target value;

Update the loss function according to the global predicted target value and the obtained real target value;

Optionally, the symmetric triangular feature pyramid fusion network includes at least three fusion layers, and the number of fusion layers decreases as the scale of the fusion layers increases.

Optionally, the symmetrical triangular feature pyramid fusion network has:

The sixth fusion path is used to fuse to form feature maps of different scales;

The seventh fusion path is used to shorten the transmission distance from low-level features to high-level features;

The eighth fusion path is used to fuse the feature information of the same scale;

a ninth fusion path, used to fuse fusion units that are respectively located in two adjacent fusion layers and are respectively located in the first fusion path and the second fusion path;

The tenth fusion path is used to fuse the feature information of the input unit and the output unit of the same fusion layer.

Optionally, the symmetrical triangular feature pyramid fusion network includes five fusion layers, and the number of fusion units in the five fusion layers is five, four, three, two, and one, respectively.

The present invention also discloses a computer-readable storage medium, which stores a training program of the image detection model based on the feature pyramid, and when the training program of the image detection model based on the feature pyramid is executed by the processor The above-mentioned training method of the image detection model based on the feature pyramid is realized.

The present invention also discloses a computer device, which includes a computer-readable storage medium, a processor, and a training program for a feature pyramid-based image detection model stored in the computer-readable storage medium. When the training program of the image detection model of the pyramid is executed by the processor, the above-mentioned training method of the image detection model based on the feature pyramid is realized.

(3) Beneficial effects

The invention discloses a training method for an image detection model based on a feature pyramid, which has the following technical effects compared with the traditional training method:

The application constructs a fusion network with at least five different fusion paths, so that the feature maps of different scales are fully fused, more detailed information and original information are retained, the detection accuracy of the model is improved, and the detection network in the field of security inspection is improved. performance and efficiency.

Description of drawings

1 is a flowchart of a training method for an image detection model based on a feature pyramid according to Embodiment 1 of the present invention;

2 is a frame diagram of an image detection model based on a feature pyramid according to Embodiment 1 of the present invention;

3 is a schematic structural diagram of a triangular feature pyramid fusion network according to Embodiment 1 of the present invention;

4 is a schematic structural diagram of a symmetrical triangular feature pyramid fusion network according to Embodiment 2 of the present invention;

5 is a flowchart of a method for training an image detection model based on a feature pyramid according to Embodiment 2 of the present invention;

FIG. 6 is a schematic block diagram of a computer device according to an embodiment of the present invention.

Detailed ways

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.

Before describing each embodiment of this application in detail, first briefly describe the inventive concept of this application: the existing target detection network is based on the simplest feature fusion module FPN (Feature Pyramid Network), which can only achieve simple feature fusion, and faces security inspection. Scenes and images are complex in nature, and a simple feature fusion module cannot fuse more detailed feature information. In this application, by constructing a fusion network with at least five different fusion paths, the feature maps of different scales are fully fused, and more features are preserved. Detailed information and original information to improve the detection accuracy of the model.

Specifically, as shown in FIGS. 1 and 2 , the image detection model to be trained in the first embodiment includes a feature extraction network, a triangular feature pyramid fusion network and a regression prediction network, wherein the triangular feature pyramid fusion network includes several fusion units, And the triangular feature pyramid fusion network has at least five different fusion paths, and the training method of the image detection model based on the feature pyramid includes the following steps:

Step S10: Input the acquired original detection image into the feature extraction network to obtain several hierarchical feature maps of different scales;

Step S20: inputting the hierarchical feature map into the triangular feature pyramid fusion network to obtain fusion feature maps of several different scales;

Step S30: inputting several fusion feature maps of different scales into the regression prediction network to obtain the prediction target value;

Step S40: Update the loss function according to the predicted target value and the obtained real target value;

Step S50: Update the network parameters of the image detection model to be trained according to the updated loss function.

Exemplarily, in step S10, the feature extraction network adopts C3-C5 layers of ResNet, and the acquired original detection images are input into the feature extraction network to obtain three hierarchical feature maps C3, C4 and C5 with successively increasing scales. In order to obtain a hierarchical feature map with more scales, it can be obtained by downsampling, for example, downsampling C5 to obtain a higher-scale C6, downsampling C6 to obtain a higher-scale C7, and so on. The first embodiment takes the hierarchical feature map of five scales as an example.

Further, the triangular feature pyramid fusion network includes at least three fusion layers, and the number of fusion layers decreases as the scale of the fusion layers decreases. As a preferred embodiment, the triangular feature pyramid fusion network includes five fusion layers, which are the first fusion layer R1, the second fusion layer R2, the third fusion layer R3, the fourth fusion layer R4, and the fifth fusion layer R5, respectively. The number of fusion units in the five fusion layers are five, four, three, two and one, respectively. As shown in Figure 2, each blank circle represents a fusion unit. The triangular feature pyramid fusion network of the first embodiment includes 15 fusion units. The scales of the five-layer fusion layers decrease sequentially from top to bottom. The number of fusion units is Decrease from top to bottom. The fusion unit in each fusion layer has the same scale as the hierarchical feature map of the corresponding layer. The last fusion unit in each fusion layer is called the output unit (P3-P7). The direction of the arrow represents the transmission direction of the data, that is, the fusion path.

As a preferred embodiment, the triangular feature pyramid fusion network has: a first fusion path 11 , a second fusion path 12 , a third fusion path 13 , a fourth fusion path 14 and a fifth fusion path 15 . Among them, the first fusion path 11 is from top to bottom, from a large-scale fusion unit to a small-scale fusion unit, and the first fusion path 11 is used for fusion to form feature maps of different scales. The second fusion path 12 is from bottom to top, from a small-scale fusion unit to a large-scale fusion unit, and the second fusion path 12 is used to shorten the transmission distance from low-level features to high-level features. The third fusion path 13 connects the fusion units of the same layer horizontally, and is used to fuse the feature information of the same scale. The fourth fusion path 14 diagonally connects two adjacent fusion units, and is used for fusion of fusion units respectively located in two adjacent fusion layers and respectively located in the first fusion path and the second fusion path. The fifth fusion path 15 is used to fuse the feature information of the input unit and the output unit of the same fusion layer to retain more original information. It should be noted that when merging features of different scales, the resolution of each feature needs to be adjusted to the same. Taking the input unit P5 as an example, since the resolution of a feature with a higher level is lower, it needs to be enlarged. The higher the resolution of the features of the level, the compression processing is required. For example, the feature information transmitted from the fusion unit P4 of the fourth fusion layer R4 to the P5 needs to be compressed by 0.5 times, and transmitted from the fusion unit of the second fusion layer R2 to P5. The feature information of P5 needs to be enlarged by 2 times.

Exemplarily, in step S20, the hierarchical feature maps C3-C7 of five scales are respectively input into the corresponding triangular feature pyramid fusion network to obtain five fusion feature maps _P3 , _P4 , P ₅ , P ₆ , P ₇ .

Further, in step S30, input the fused feature maps P ₃ , P ₄ , P ₅ , P ₆ , and P ₇ of five different scales into the regression prediction network to obtain the predicted target value, where the target predicted value includes categories and Location. Exemplarily, the regression prediction network adopts a first-order fully convolutional one-stage object detection network (Fully Convolutional One-Stage Object Detection, FCOS for short). Dangerous Goods. For example, the input feature units of five heads from bottom to top are P ₃ , P ₄ , P ₅ , P ₆ , and P ₇ respectively, and the range of dangerous goods detected is [0, 64], [64, 128], [ 128, 256], [256, 512], [512, +∞]. Samples that exceed this range or background samples will be considered negative samples. The method of pixel-by-pixel prediction is used here, that is, each pixel is regarded as a key point and a positive sample for regression prediction is calculated. If a pixel falls into multiple ground-truth regions in the same layer, the smallest region is used as the regression target. Repeat until the entire image is detected.

Further, in steps S40 and S50, the loss function is updated according to the predicted target value and the obtained real target value, and the network parameters of the image detection model to be trained are updated according to the updated loss function. The updating process of the loss function and the updating process of the network parameters are both in the prior art and are well known to those skilled in the art, and will not be repeated here.

The training method for an image detection model based on a feature pyramid provided in the first embodiment, by constructing a fusion network with at least five different fusion paths, enables the feature maps of different scales to be fully fused, and retains more detailed information and original information, improve the detection accuracy of the model, and improve the performance and efficiency of the detection network in the field of security inspection.

The training method for an image detection model based on a feature pyramid disclosed in the second embodiment adds a symmetrical triangular feature pyramid fusion network on the basis of the first embodiment, and the symmetrical triangular feature pyramid fusion network includes several fusion units. The pyramid fusion network has at least five different fusion paths, and each fusion unit of the symmetrical triangular feature pyramid fusion network and each fusion unit of the triangular feature pyramid fusion network are symmetrically distributed.

The symmetric triangular feature pyramid fusion network includes at least three fusion layers, and the number of fusion layers decreases as the scale of the fusion layers increases. The symmetrical triangular feature pyramid fusion network includes five fusion layers, which are respectively the sixth fusion layer R6, the seventh fusion layer R7, the eighth fusion layer R8, the ninth fusion layer R9, the tenth fusion layer R10 and the fifth fusion layer. The number of fusion units is five, four, three, two and one, respectively. As shown in the figure, the symmetrical triangular feature pyramid fusion network of the second embodiment includes 15 fusion units, the scale of the five-layer fusion layer decreases sequentially from top to bottom, and the number of fusion units increases sequentially from top to bottom. The fusion unit in each fusion layer has the same scale as the hierarchical feature map of the corresponding layer. The last fusion unit in each fusion layer is called the output unit (N3-N7). The direction of the arrow represents the transmission direction of the data, that is, the fusion path.

As a preferred embodiment, the triangular feature pyramid fusion network has: a sixth fusion path 16 , a seventh fusion path 17 , an eighth fusion path 18 , a ninth fusion path 19 and a tenth fusion path 20 . Among them, the sixth fusion path 16 is from top to bottom, from a large-scale fusion unit to a small-scale fusion unit, and the sixth fusion path 16 is used for fusion to form feature maps of different scales. The seventh fusion path 17 is from bottom to top, from a large-scale fusion unit to a small-scale fusion unit, and the seventh fusion path 17 is used to shorten the transmission distance from high-level features to low-level features. The eighth fusion path 18 connects the fusion units of the same layer horizontally, and is used to fuse the feature information of the same scale. The ninth fusion path 19 diagonally connects two adjacent fusion units, and is used for fusion of fusion units respectively located in two adjacent fusion layers and respectively located in the seventh fusion path 17 and the eighth fusion path 18 . The tenth fusion path 20 is used to fuse the feature information of the input unit and the output unit of the same fusion layer, so as to retain more original information. It should be noted that when fusing features of different scales, the resolution of each feature needs to be adjusted to the same.

Further, as shown in FIG. 5 , the training method of the second embodiment further includes:

Step S20': input the hierarchical feature map into the symmetrical triangular feature pyramid fusion network to obtain symmetrical fusion feature maps of several different scales;

Step S30': adding the fusion feature map and the symmetrical fusion feature map of the same scale to obtain a global feature map;

Step S40': input the global feature maps of different scales into the regression prediction network to obtain a global prediction target value;

Step S50': update the loss function according to the global predicted target value and the obtained real target value;

Step S60': Update the network parameters of the image detection model to be trained according to the updated loss function.

Specifically, in step S20', the hierarchical feature maps C3-C7 of five scales are respectively input to the corresponding input to the above-mentioned symmetrical triangular feature pyramid fusion network, to obtain five different scales of fusion feature maps N ₃ , N ₄ , N ₅ , N ₆ , N ₇ .

In step S30', the fused feature map and the symmetrical fused feature map of the same scale are added to obtain a global feature map, that is, P ₃ +N ₃ =M ₃ , P ₄ +N ₄ =M ₄ , P ₅ +N ₅ =M ₅ , P ₆ +N ₆ =M ₆ , P ₇ +N ₇ =M ₇ , and the global feature maps are M ₃ , M ₄ , M ₅ , M ₆ , and M ₇ respectively. In step S40 ′, input the fused feature maps M ₃ , M ₄ , M ₅ , M ₆ , and M ₇ of five different scales into the regression prediction network to obtain the predicted target value, where the target predicted value includes category and location. Exemplarily, the regression prediction network adopts a first-order fully convolutional one-stage object detection network (Fully Convolutional One-Stage Object Detection, FCOS for short). Dangerous Goods. For example, the input feature units of the five heads from bottom to top are M ₃ , M ₄ , M ₅ , M ₆ , and M ₇ , and the range of dangerous goods detected is [0, 64], [64, 128], [ 128, 256], [256, 512], [512, +∞]. Samples that exceed this range or background samples will be considered negative samples. The method of pixel-by-pixel prediction is used here, that is, each pixel is regarded as a key point and a positive sample for regression prediction is calculated. If a pixel falls into multiple ground-truth regions in the same layer, the smallest region is used as the regression target. Repeat until the entire image is detected.

In step S50' and step S60', the loss function is updated according to the global predicted target value and the obtained real target value, and the network parameters of the image detection model to be trained are updated according to the updated loss function. The updating process of the loss function and the updating process of the network parameters are both in the prior art and are well known to those skilled in the art, and will not be repeated here.

The training method for an image detection model based on a feature pyramid provided in the second embodiment, on the basis of the first embodiment, constructs another symmetrical triangular feature pyramid fusion network with at least five different fusion paths, which is mutually compatible with the triangular feature pyramid fusion network. When used together, a global feature map can be obtained, and the symmetric structure can effectively supplement the global feature information, retain more detailed information and original information, improve the detection accuracy of the model, and improve the performance and efficiency of the detection network in the field of security inspection.

Further, this embodiment discloses a computer-readable storage medium that stores a training program for an image detection model based on a feature pyramid, and the training program for the image detection model based on a feature pyramid is processed. The above-mentioned training method of the image detection model based on the feature pyramid is realized when the processor is executed.

Further, the present application also discloses a computer device. At the hardware level, as shown in FIG. 6 , the computer device includes a processor 20 , an internal bus 30 , a network interface 40 , and a computer-readable storage medium 50 . The processor 20 reads the corresponding computer program from the computer-readable storage medium and then executes it, forming a request processing device on a logical level. Of course, in addition to software implementations, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware, etc., that is to say, the execution subjects of the following processing procedures are not limited to each Logic unit, which can also be hardware or logic device. The computer-readable storage medium 50 stores the training program of the image detection model based on the feature pyramid, and the training program of the image detection model based on the feature pyramid realizes the above-mentioned image detection model based on the feature pyramid when the training program is executed by the processor. training method.

Computer-readable storage media includes both persistent and non-permanent, removable and non-removable media, and storage of information can be implemented by any method or technology. Information may be computer readable instructions, data structures, modules of programs, or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage , magnetic cassettes, disk storage, quantum memory, graphene-based storage media or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by computing devices.

In order to verify the effect of the image detection model obtained by the training method in Example 2, we selected 3130 gun images and 1953 knife images in the SIXray dataset as the evaluation datasets for our experiments. The proposed method is experimental in Python 3.6 using the Pytorch backend. We scale the image to 1333×800 as input and train the model on NVIDIA TITAN RTX with 24GB RAM. During the training phase, we employ a stochastic gradient optimizer with a learning rate of 0.0001 and a weight decay of 0.001. All datasets are randomly divided into training set (60%), validation set (20%) and test set (20%) so that each split has a similar distribution.

Under the premise of the same training parameters, different methods are used to detect firearms and knives on SIXray data. The AP value of each category and the overall mAP results are shown in Table 1. The model (Ours) obtained by this training method is the best result among the listed methods in terms of AP value and overall performance mAP result of each category, which verifies that the model obtained by this training method has the best performance of dangerous goods in X-ray security inspection images. Advantages of automatic detection.

Table 1. Accuracy comparison of different methods on the SIXray dataset

The specific embodiments of the present invention have been described in detail above. Although some embodiments have been shown and described, those skilled in the art should understand that the principles and spirit of the present invention, which are defined in the scope of the claims and their equivalents, are not departed from. Under the circumstances, these embodiments can be modified and perfected, and these modifications and improvements should also fall within the protection scope of the present invention.

Claims

A training method for an image detection model based on a feature pyramid, wherein the image detection model to be trained includes a feature extraction network, a triangular feature pyramid fusion network and a regression prediction network, wherein the triangular feature pyramid fusion network includes several fusion units, and all The triangular feature pyramid fusion network has at least five different fusion paths, and the training method includes:

Input the acquired original detection image into the feature extraction network to obtain several hierarchical feature maps of different scales;

Inputting the hierarchical feature map into the triangular feature pyramid fusion network to obtain fusion feature maps of several different scales;

Input several fused feature maps of different scales into the regression prediction network to obtain the predicted target value;

Update the loss function according to the predicted target value and the obtained real target value;

The network parameters of the image detection model to be trained are updated according to the updated loss function.
The method for training an image detection model based on a feature pyramid according to claim 1, wherein the triangular feature pyramid fusion network comprises at least three fusion layers, and the number of fusion layers decreases as the scale of the fusion layers decreases.
The training method of the image detection model based on feature pyramid according to claim 2, wherein, the triangular feature pyramid fusion network has:

The first fusion path is used for fusion to form feature maps of different scales;

The second fusion path is used to shorten the transmission distance from low-level features to high-level features;

The third fusion path is used to fuse the feature information of the same scale;

The fourth fusion path is used to fuse the data of the fusion units that are respectively located in the two adjacent fusion layers and are respectively located in the first fusion path and the second fusion path;

The fifth fusion path is used to fuse the feature information of the input unit and the output unit of the same fusion layer.
The method for training an image detection model based on a feature pyramid according to claim 2, wherein the triangular feature pyramid fusion network comprises five layers of fusion layers, and the number of fusion units of the five layers of fusion layers is five, four, and three, respectively. one, two and one.
The method for training an image detection model based on a feature pyramid according to claim 1, wherein the image detection model to be trained further comprises a symmetrical triangular feature pyramid fusion network, and the symmetrical triangular feature pyramid fusion network comprises several fusion units, The symmetrical triangular feature pyramid fusion network has at least five different fusion paths, and each fusion unit of the symmetrical triangular feature pyramid fusion network and each fusion unit of the triangular feature pyramid fusion network are symmetrically distributed, wherein the Training methods also include:

Inputting the hierarchical feature map into the symmetrical triangular feature pyramid fusion network to obtain symmetrical fusion feature maps of several different scales;

adding the fusion feature map and the symmetrical fusion feature map of the same scale to obtain a global feature map;

Inputting the global feature maps of different scales into the regression prediction network to obtain a global prediction target value;

Update the loss function according to the global predicted target value and the obtained real target value;

The network parameters of the image detection model to be trained are updated according to the updated loss function.
The method for training an image detection model based on a feature pyramid according to claim 5, wherein the symmetric triangular feature pyramid fusion network comprises at least three fusion layers, and the number of fusion layers decreases as the scale of the fusion layers increases .
The training method of the image detection model based on feature pyramid according to claim 6, wherein, the symmetrical triangular feature pyramid fusion network has:

The sixth fusion path is used to fuse to form feature maps of different scales;

The seventh fusion path is used to shorten the transmission distance from low-level features to high-level features;

The eighth fusion path is used to fuse the feature information of the same scale;

a ninth fusion path, used to fuse fusion units that are respectively located in two adjacent fusion layers and are respectively located in the first fusion path and the second fusion path;

The tenth fusion path is used to fuse the feature information of the input unit and the output unit of the same fusion layer.
The method for training an image detection model based on a feature pyramid according to claim 6, wherein the symmetrical triangular feature pyramid fusion network comprises five layers of fusion layers, and the number of fusion units in the five layers of fusion layers is five, four, Three, two and one.
A computer-readable storage medium, wherein the computer-readable storage medium stores a training program for an image detection model based on a feature pyramid, and when the training program for the image detection model based on a feature pyramid is executed by a processor, the claims are realized The training method of the image detection model based on the feature pyramid described in 1.
The computer-readable storage medium according to claim 9, wherein the image detection model to be trained comprises a feature extraction network, a triangular feature pyramid fusion network and a regression prediction network, wherein the triangular feature pyramid fusion network comprises several fusion units, and the The triangular feature pyramid fusion network has at least five different fusion paths, and the training method includes:

Input the acquired original detection image into the feature extraction network to obtain several hierarchical feature maps of different scales;

Inputting the hierarchical feature map into the triangular feature pyramid fusion network to obtain fusion feature maps of several different scales;

Input several fused feature maps of different scales into the regression prediction network to obtain the predicted target value;

Update the loss function according to the predicted target value and the obtained real target value;

The network parameters of the image detection model to be trained are updated according to the updated loss function.
The computer-readable storage medium of claim 10, wherein the triangular feature pyramid fusion network comprises at least three fusion layers, and the number of fusion layers decreases as the scale of the fusion layers decreases.
The computer-readable storage medium of claim 11, wherein the triangular feature pyramid fusion network has:

The first fusion path is used for fusion to form feature maps of different scales;

The second fusion path is used to shorten the transmission distance from low-level features to high-level features;

The third fusion path is used to fuse the feature information of the same scale;

The fourth fusion path is used to fuse the data of the fusion units that are respectively located in the two adjacent fusion layers and are respectively located in the first fusion path and the second fusion path;

The fifth fusion path is used to fuse the feature information of the input unit and the output unit of the same fusion layer.
The computer-readable storage medium according to claim 11, wherein the triangular feature pyramid fusion network comprises five layers of fusion layers, and the number of fusion units of the five layers of fusion layers is five, four, three, two and One.
The computer-readable storage medium according to claim 9, wherein the image detection model to be trained further comprises a symmetric triangular feature pyramid fusion network, the symmetric triangular feature pyramid fusion network comprises several fusion units, and the symmetric triangular feature The pyramid fusion network has at least five different fusion paths, and each fusion unit of the symmetrical triangular feature pyramid fusion network and each fusion unit of the triangular feature pyramid fusion network are symmetrically distributed, wherein the training method further includes:

Inputting the hierarchical feature map into the symmetrical triangular feature pyramid fusion network to obtain symmetrical fusion feature maps of several different scales;

adding the fusion feature map and the symmetrical fusion feature map of the same scale to obtain a global feature map;

Inputting the global feature maps of different scales into the regression prediction network to obtain a global prediction target value;

Update the loss function according to the global predicted target value and the obtained real target value;

The network parameters of the image detection model to be trained are updated according to the updated loss function.
The computer-readable storage medium of claim 14, wherein the symmetric triangular feature pyramid fusion network comprises at least three fusion layers, and the number of fusion layers decreases as the scale of the fusion layers increases.
The computer-readable storage medium of claim 15, wherein the symmetric triangular feature pyramid fusion network has:

The sixth fusion path is used to fuse to form feature maps of different scales;

The seventh fusion path is used to shorten the transmission distance from low-level features to high-level features;

The eighth fusion path is used to fuse the feature information of the same scale;

a ninth fusion path, used to fuse fusion units that are respectively located in two adjacent fusion layers and are respectively located in the first fusion path and the second fusion path;

The tenth fusion path is used to fuse the feature information of the input unit and the output unit of the same fusion layer.
The computer-readable storage medium according to claim 15, wherein the symmetric triangular feature pyramid fusion network comprises five layers of fusion layers, and the number of fusion units of the five layers of fusion layers is five, four, three, and two, respectively. and a.
A computer device, wherein the computer device comprises a computer-readable storage medium, a processor, and a training program for a feature pyramid-based image detection model stored in the computer-readable storage medium, the feature pyramid-based image When the training program of the detection model is executed by the processor, the training method of the image detection model based on the feature pyramid according to claim 1 is realized.