CN113160062B - Infrared image target detection method, device, equipment and storage medium - Google Patents

Infrared image target detection method, device, equipment and storage medium

Info

Publication number
CN113160062B
Authority
CN
China
Prior art keywords
network
infrared image
target detection
convolution
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110572928.7A
Other languages
Chinese (zh)
Other versions
CN113160062A (en)
Inventor
徐召飞
金荣璐
刘晴
王云奇
王水根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iray Technology Co Ltd
Original Assignee
Iray Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iray Technology Co Ltd filed Critical Iray Technology Co Ltd
Priority to CN202110572928.7A priority Critical patent/CN113160062B/en
Publication of CN113160062A publication Critical patent/CN113160062A/en
Application granted granted Critical
Publication of CN113160062B publication Critical patent/CN113160062B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4038 Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4046 Scaling the whole image or part thereof using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an infrared image target detection method, device, equipment and storage medium, comprising the following steps: establishing an infrared image data set and performing data enhancement preprocessing to generate a training sample set; constructing an infrared image target detection network according to the operators supported by the embedded platform and the computing power it provides, in which the backbone network is built from 1 focus image rearrangement module, a plurality of Ghost-depthWise convolution modules with residual structures and 1 SPP convolution module, and the multi-scale target detection layer is built from an FPN network and a PAN network; training the network with the training sample set; performing channel pruning and retraining on the trained network; and converting the retrained network according to the requirements of the target platform. In this way, complex operator calculation is avoided, redundant feature information is discarded, the characterization content of the features is enriched, the problem of module conversion is avoided, the operation speed is high, and the detection precision is high.

Description

Infrared image target detection method, device, equipment and storage medium
Technical Field
The present invention relates to the field of infrared image processing, and in particular, to a method, apparatus, device, and storage medium for detecting an infrared image target.
Background
In recent years, infrared imaging technology has been widely applied in fields such as automatic driving, intelligent security, remote sensing and industrial monitoring. Given the requirement for all-weather, uninterrupted detection, dual-light (visible and infrared) equipment is increasingly deployed, and the demand for visual image processing technology is growing accordingly. When weather conditions are good, ordinary visible-light equipment can meet most requirements; however, its imaging quality is poor in rain, fog or at night, which hinders back-end image processing. Long-wave infrared, relying on the passive imaging principle, can compensate precisely for the situations in which visible light cannot deliver high-quality images in severe weather. In real-time monitoring applications using dual-light equipment, target detection and classification is one of the important machine vision business requirements.
With the joint development of intelligent science and technology and society, artificial intelligence has made great progress in the field of target detection and recognition and has achieved excellent recognition results in various scenes. Typical single-stage target detection networks include DetectorNet, OverFeat, the YOLO series and SSD (Single Shot Detector), and are characterized by high speed but lower precision; typical two-stage target detection networks include the RCNN series, SPPNet (Spatial Pyramid Pooling), R-FCN (Region-based Fully Convolutional Network) and MRCNN, and are characterized by high detection precision but lower running speed. On the one hand, the backbone of such networks is mostly VGG, ResNet, GoogLeNet, Darknet or AlexNet, and the networks are mostly applied to RGB color images, whose color, texture and other features are easily extracted during training to obtain accurate identification and positioning; infrared images, by contrast, are recognized mainly by extracting their contour features, so a general deep target detection network cannot be directly and effectively migrated to infrared images. On the other hand, current target detection network models remain difficult to deploy on low-cost, low-computing-power embedded platforms: model inference consumes a large amount of computing and storage resources, so other services cannot run on the same embedded platform at the same time, and if the designed network model contains many custom or newer operators that the target deployment platform cannot parse and compute, the whole model design has to start from scratch. Although target positioning can be completed with traditional moving-target detection, the detected target type (person, vehicle, non-motor vehicle, etc.) cannot be determined; the traditional approach suits the case where the monitoring equipment is fixed, i.e. the background is basically unchanged, and if the equipment is vehicle-mounted or on a rotating pan-tilt so that the scene changes at any time, the traditional method produces a large number of false detections.
Disclosure of Invention
In view of the above, the present invention aims to provide a method, a device, equipment and a storage medium for detecting infrared image targets, which can solve the problems of complex network models, excessive parameters, slow operation and poor detection of small infrared targets in current deep-learning-based human-vehicle target detection. The specific scheme is as follows:
an infrared image target detection method, comprising:
establishing an infrared image data set, and performing data enhancement preprocessing to generate a training sample set;
constructing an infrared image target detection network according to operators supported by the embedded platform and the provided computing power; the target feature extraction backbone network of the infrared image target detection network is constructed by adopting 1 focus image rearrangement module, a plurality of Ghost-depthWise convolution modules with residual structures and 1 SPP convolution module; the multi-scale target detection layer of the infrared image target detection network is constructed by adopting a pyramid structure formed by combining an FPN network and a PAN network;
training the constructed infrared image target detection network by using the training sample set;
performing channel pruning and retraining on the infrared image target detection network after training is completed;
and converting the infrared image target detection network after the retraining is completed by using an embedded platform AI module conversion tool chain according to the target platform requirement.
Preferably, in the method for detecting an infrared image target provided by the embodiment of the present invention, after the establishing an infrared image dataset, the method further includes:
labeling the infrared image dataset with a rectangular frame, and recording the coordinate position of the rectangular frame;
and marking two targets with the overlapping rate of more than 90% as one target in the marking stage, and deleting marking information with the target pixel area occupying less than three parts per million after all targets are marked.
Preferably, in the method for detecting an infrared image target provided by the embodiment of the present invention, the performing data enhancement preprocessing includes:
randomly selecting four images from the infrared image data set each time, and randomly selecting one image from the four images to serve as a master image;
equally scaling the remaining three images to the size of the master image;
randomly selecting and cutting out targets in the other three images and randomly scaling them, wherein the size of a scaled target cannot exceed 1/16 of the total pixel area of the image;
randomly pasting the randomly scaled targets cut out of the three images into the master image; if the position at which a target is randomly pasted covers more than 60% of an existing target in the image or exceeds the image boundary, discarding the currently cut-out target;
and correspondingly adjusting the coordinate positions of the rectangular frames marking the pasted targets.
Preferably, in the method for detecting an infrared image object provided by the embodiment of the present invention, the focus image rearrangement module uses a focus structure to slice an image to expand the number of channels of the image;
the Ghost-depthWise convolution module generates a first feature map through depth separable convolution, generates a second feature map through grouping 1x1 convolution operation in combination with the first feature map, and combines information in the first feature map and the second feature map to obtain a feature map with all feature information;
the SPP convolution module divides the feature map into different space regions on different scales, calculates feature vectors on each region, and combines all the calculated feature vectors to convert the feature map with any resolution into feature vectors with the same dimension as the full connection layer;
The focus image rearrangement module is connected with the first Ghost-depthWise convolution module by adopting a convolution layer with the step length of 1; the adjacent two Ghost-depthWise convolution modules are connected by adopting a convolution layer with the step length of 2; the last Ghost-depthWise convolution module is connected with the SPP convolution module by adopting a convolution layer with the step length of 2; the number of channels of the Ghost-depthWise convolution module is in an increasing mode along the direction from the image input to the image output.
Preferably, in the method for detecting an infrared image target provided by the embodiment of the present invention, after target detection is performed, a target detection result output by the infrared image target detection network is screened using a non-maximum suppression constraint.
Preferably, in the method for detecting an infrared image target provided by the embodiment of the present invention, when the infrared image target detection network is trained, regression positioning prediction uses a softmax loss function and classification prediction uses a cross entropy loss function.
Preferably, in the method for detecting an infrared image target provided by the embodiment of the present invention, the performing channel pruning and retraining on the trained infrared image target detection network includes:
counting, in the target feature extraction backbone network of the trained infrared image target detection network, the sum of the kernel weight values in the channel direction of each convolution layer;
cutting the corresponding convolution layers according to the sum of the weight values and the cutting proportion of each convolution layer, wherein the convolution layer between the focus image rearrangement module and the first Ghost-depthWise convolution module and the convolution layers connected with the FPN network are not cut;
and using the remaining (uncut) weights of the cut convolution layers as initialization parameters of the infrared image target detection network, keeping the weights of the uncut convolution layers fixed during training, and retraining the infrared image target detection network.
The embodiment of the invention also provides an infrared image target detection device, which comprises:
the training set generating unit is used for establishing an infrared image data set, performing data enhancement preprocessing, and generating a training sample set;
the network construction unit is used for constructing an infrared image target detection network according to operators supported by the embedded platform and the provided computing power; the target feature extraction backbone network of the infrared image target detection network is constructed by adopting 1 focus image rearrangement module, a plurality of Ghost-depthWise convolution modules with residual structures and 1 SPP convolution module; the multi-scale target detection layer of the infrared image target detection network is constructed by adopting a pyramid structure formed by combining an FPN network and a PAN network;
the network training unit is used for training the constructed infrared image target detection network by using the training sample set, and is further used for performing channel pruning and retraining on the infrared image target detection network after training is completed;
and the network conversion unit is used for converting the infrared image target detection network after the retraining is completed by using an embedded platform AI module conversion tool chain according to the target platform requirement.
The embodiment of the invention also provides infrared image target detection equipment, which comprises a processor and a memory, wherein the infrared image target detection method provided by the embodiment of the invention is realized when the processor executes the computer program stored in the memory.
The embodiment of the invention also provides a computer readable storage medium for storing a computer program, wherein the computer program realizes the infrared image target detection method provided by the embodiment of the invention when being executed by a processor.
According to the technical scheme, the infrared image target detection method provided by the invention comprises the following steps: establishing an infrared image data set, and performing data enhancement preprocessing to generate a training sample set; constructing an infrared image target detection network according to the operators supported by the embedded platform and the computing power it provides, in which the target feature extraction backbone network is constructed from 1 focus image rearrangement module, a plurality of Ghost-depthWise convolution modules with residual structures and 1 SPP convolution module, and the multi-scale target detection layer is constructed as a pyramid structure combining an FPN network and a PAN network; training the constructed infrared image target detection network with the training sample set; performing channel pruning and retraining on the trained infrared image target detection network; and converting the retrained infrared image target detection network with an embedded platform AI module conversion tool chain according to the target platform requirement.
The backbone network of the infrared image target detection network constructed by the method provided by the invention consists of a focus image rearrangement module, Ghost-depthWise convolution modules and an SPP convolution module, the multi-scale target detection layer consists of an FPN network and a PAN network, and the network is further compressed by channel pruning. Complex operator calculation is therefore avoided, redundant feature information is discarded, the characterization content of the features is greatly enriched, and the detection performance is improved. Because the network is designed for the corresponding embedded platform and all operators it uses are of types supported by that platform, there is no module conversion problem; the whole network model is small, fast and highly general, meets the requirements of embedded platform services, is easy to port to a low-end AI chip, offers high detection precision and operation speed, can be supported by the mainstream application platforms currently on the market, and therefore has application potential. In addition, the invention also provides a corresponding device, equipment and computer readable storage medium for the infrared image target detection method, making the method more practical, and the device, equipment and computer readable storage medium have corresponding advantages.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only embodiments of the present invention, and other drawings may be obtained according to the provided drawings without inventive effort for those skilled in the art.
FIG. 1 is a flowchart of an infrared image target detection method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an infrared image object detection network according to an embodiment of the present invention;
fig. 3 is an operation schematic diagram of a focus image rearrangement module according to an embodiment of the present invention;
FIG. 4 is a convolutional layer structure of a feature extraction module according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a residual structure provided in an embodiment of the present invention;
fig. 6 is a schematic diagram of a convolution process of a Ghost-depthWise convolution module provided in an embodiment of the present invention;
FIG. 7 is a schematic diagram of an SPP spatial pyramid pooling structure according to an embodiment of the present invention;
fig. 8 is a schematic diagram of FPN and PAN structures provided in an embodiment of the present invention;
FIG. 9 is a schematic view of pruning a channel according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of an infrared image target detection device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides an infrared image target detection method, as shown in fig. 1, comprising the following steps:
s101, establishing an infrared image data set, and performing data enhancement preprocessing to generate a training sample set.
Specifically, infrared imaging cameras with different focal lengths can be used to photograph road intersections during different time periods and under different weather. For example, 9 infrared cameras with different focal lengths (2 mm, 3.2 mm, 4.3 mm, 7 mm, 13 mm, 19 mm, 25 mm and 35 mm) mounted at different positions are used for data acquisition; the cameras are installed at different heights above the ground (from 2 m to 25 m), so that feature information of vehicles under different viewing angles can be acquired. After screening out unqualified images (no target, unclear, etc.), about thirty thousand infrared human-vehicle images are finally obtained as the detection data set. Targets here may include people, vehicles, non-motor vehicles, and the like.
S102, constructing an infrared image target detection network according to the operators supported by the embedded platform and the computing power it provides; the target feature extraction backbone network of the infrared image target detection network is constructed by adopting 1 focus image rearrangement module, a plurality of Ghost-depthWise convolution modules with residual structures and 1 SPP (Spatial Pyramid Pooling layer) convolution module; the multi-scale object detection layer of the infrared image object detection network is constructed by adopting a pyramid structure formed by combining an FPN (Feature Pyramid Network) network and a PAN (Path Aggregation Network) network.
It can be understood that the method combines operators supported by the embedded platform and the provided computing power as the basis, designs a neural network model capable of meeting the detection precision requirement, is easy to deploy in the carried front-end computing equipment, and the whole network design thought does not depend on equipment of a certain fixed model, and can refer to the invention in different embedded equipment deployment processes aiming at target detection business.
Because the infrared image has the characteristics of low contrast, less details, only single-channel brightness information (the higher the temperature is, the larger the gray value is, and vice versa) and the like, the infrared image lacks more information compared with the common RGB image, and the difficulty of target detection on the infrared image is increased, so the network design of the invention mainly focuses on the extraction of the infrared target contour, avoids using complex operator calculation, and discards redundant characteristic information. Under the premise, as shown in fig. 2, the backbone network of the invention adopts a Ghost-depthwise group convolution module to acquire the simplified target feature diagrams on different scales by combining with SPP; and then performing multi-scale fusion on the high-level semantic information and the bottom-level position information by using the FPN and the PAN structure, and finally respectively sending the fused characteristic images with different scales into a convolution layer serving as a detection layer for processing to obtain a real target detection result.
S103, training the constructed infrared image target detection network by using a training sample set.
S104, channel pruning and retraining are carried out on the infrared image target detection network after training is completed.
It should be noted that, although the design of the infrared image target detection network has been preliminarily completed by combining the characteristics of infrared images with the computing power supported by the embedded device, deployment on platforms with higher real-time requirements is still difficult, so the invention further compresses the network model by a channel pruning method.
S105, converting the infrared image target detection network after retraining by using an embedded platform AI module conversion tool chain according to the target platform requirement. Finally, deployment and infrared target detection verification can be performed with the generated target platform model.
In the above-mentioned infrared image target detection method provided by the embodiment of the invention, the backbone network of the built infrared image target detection network is composed of a focus image rearrangement module, a Ghost-depthWise convolution module and an SPP convolution module, the multi-scale target detection layer is composed of an FPN network and a PAN network, and the network is further compressed by using a channel pruning method, so that complex operator calculation can be avoided, redundant characteristic information is abandoned, characterization content of the characteristics is greatly enriched, detection performance is improved, the network is designed according to a corresponding embedded platform, operators used by the network are all of the type supported by the platform, the problem of module conversion does not exist, the whole network model is small, the calculation speed is high, the universality is high, the requirements of embedded platform services are met, the network is easy to transplant to a low-end AI chip, the main stream application platform in the market at present can be supported, and the application potential is provided.
In practical application, the method for detecting the infrared image target provided by the embodiment of the invention can be used for all-weather detection even in equipment with a single infrared camera.
In a specific implementation, in the above method for detecting an infrared image target provided by the embodiment of the present invention, after establishing the infrared image dataset in step S101, the method may further include: labeling the infrared image dataset with rectangular frames and recording the coordinate positions of the rectangular frames. Specifically, with the upper-left corner of the image as the coordinate origin [0,0], the position of a rectangular frame is recorded in the form [x1, y1, x2, y2], where x1 and y1 are the abscissa and ordinate of the upper-left corner of the rectangular frame and x2 and y2 are the abscissa and ordinate of the lower-right corner, and all label information is stored as xml files. In order to reduce the false detection rate of targets, two targets with an overlap rate of more than 90% are marked as one target in the dataset labeling stage, and after all targets are marked, label information whose target pixel area accounts for less than three parts per million of the image is deleted.
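Purely for illustration (not part of the original disclosure), the two labeling rules above could be sketched as follows in Python; the overlap measure (intersection over the smaller box) and all function names are assumptions:

```python
# Illustrative sketch of the labeling rules described above; the overlap
# measure (intersection over the smaller box) and all names are assumptions.
def box_area(b):
    x1, y1, x2, y2 = b
    return max(0, x2 - x1) * max(0, y2 - y1)

def overlap_ratio(a, b):
    # Intersection area divided by the smaller box area (assumed definition).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    return inter / max(min(box_area(a), box_area(b)), 1)

def clean_labels(boxes, img_w, img_h):
    """boxes: list of [x1, y1, x2, y2] with the image upper-left corner as origin."""
    merged = []
    for b in boxes:
        # Two targets overlapping by more than 90% are kept as one target.
        if any(overlap_ratio(b, m) > 0.9 for m in merged):
            continue
        merged.append(b)
    # Delete labels whose pixel area is below three parts per million of the image.
    min_area = 3e-6 * img_w * img_h
    return [b for b in merged if box_area(b) >= min_area]
```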
Because the accuracy of the model may decrease during network model clipping, and because richer samples increase the generalization and prediction capability of a deep-learning network model, a combination of multiple data enhancement methods is used to improve model accuracy. Applying data enhancement to the images during training improves the generalization capability of the network model; data enhancement for target detection differs from general data enhancement in that the transformation of the candidate frames must also be considered. Conventional data enhancement includes random clipping, warping, amplification, mirroring, deformation, and the like. In order to improve detection accuracy, the invention uses random scaling and stitching as a supplementary data enhancement during training.
The dataset contains a large number of small objects and the distribution of these small objects is not uniform, and detection of small objects is more difficult because the infrared data contains less information than general color data. In the training data set, the target distribution characteristics of different sizes can influence the training precision of the detection network, so that the training data is required to be subjected to data enhancement, the data set is enriched, the robustness of the detection network is enhanced, and the detection capability of the detection network on small targets is enhanced.
In a specific implementation, in the method for detecting an infrared image target provided by the embodiment of the present invention, the data enhancement preprocessing of step S101 may include: randomly selecting four images from the infrared image data set each time, and randomly selecting one of the four images as the master image; equally scaling the remaining three images to the size of the master image; randomly selecting and cutting out targets in the other three images (for example, 1, 2 or 3 targets may be randomly selected in one image) and randomly scaling the cut-out targets, where the scaled size cannot exceed 1/16 of the total pixel area of the image; randomly pasting the randomly scaled targets cut out of the three images into the master image, and if the position at which a target is randomly pasted covers more than 60% of an existing target in the image or exceeds the image boundary, discarding the currently cut-out target; and correspondingly adjusting the coordinate positions of the rectangular frames marking the pasted targets.
The method can enrich the data set, and the number of targets is increased in the spliced images; the integrity of target features in the image is maintained as much as possible by utilizing random scaling, and the feature information loss caused by random clipping is relieved. In addition, the data distribution is more randomized by the mode, so that the model can learn general characteristics of all data conveniently, and the generalization capability of the model is improved.
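As a rough sketch of the random scale-and-paste augmentation described above (all function names, the scaling range and the use of OpenCV are assumptions, not part of the original disclosure):

```python
import random
import cv2

def covered_fraction(existing, new):
    """Fraction of the `existing` box area covered by the `new` box (assumed measure)."""
    ix1, iy1 = max(existing[0], new[0]), max(existing[1], new[1])
    ix2, iy2 = min(existing[2], new[2]), min(existing[3], new[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = max(1, (existing[2] - existing[0]) * (existing[3] - existing[1]))
    return inter / area

def scale_paste_augment(images_with_boxes):
    """images_with_boxes: list of 4 (image, boxes) pairs drawn from the dataset;
    boxes are [x1, y1, x2, y2] in each image's own coordinates."""
    random.shuffle(images_with_boxes)                # pick one image at random as master
    master, master_boxes = images_with_boxes[0]
    H, W = master.shape[:2]
    out, out_boxes = master.copy(), list(master_boxes)
    for img, boxes in images_with_boxes[1:]:
        h0, w0 = img.shape[:2]
        img = cv2.resize(img, (W, H))                # scale the donor image to the master size
        sx, sy = W / w0, H / h0
        scaled = [[int(x1 * sx), int(y1 * sy), int(x2 * sx), int(y2 * sy)]
                  for x1, y1, x2, y2 in boxes]
        for x1, y1, x2, y2 in random.sample(scaled, k=min(len(scaled), random.randint(1, 3))):
            patch = img[y1:y2, x1:x2]
            if patch.size == 0:
                continue
            s = random.uniform(0.5, 1.5)             # random scaling of the cut-out target
            pw, ph = max(1, int(patch.shape[1] * s)), max(1, int(patch.shape[0] * s))
            if pw * ph > W * H / 16:                 # pasted target must stay below 1/16 of the image area
                continue
            patch = cv2.resize(patch, (pw, ph))
            px, py = random.randint(0, W - 1), random.randint(0, H - 1)
            if px + pw > W or py + ph > H:           # discard if it would cross the image boundary
                continue
            new_box = [px, py, px + pw, py + ph]
            if any(covered_fraction(b, new_box) > 0.6 for b in out_boxes):
                continue                             # discard if it covers >60% of an existing target
            out[py:py + ph, px:px + pw] = patch
            out_boxes.append(new_box)                # adjust labels for the pasted target
    return out, out_boxes
```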
Before performing the data enhancement preprocessing in step S101, two further steps may be included: adjusting the image size and normalization.
Since the network needs to output images of a fixed size, all the images need to be stretched first, and the image size is adjusted to the required size. In order to keep the image from deformation, the image needs to be compressed in the same proportion, the spare positions are filled with black, and the specific calculation modes are as shown in the formulas (1) - (3):
let the model input size be (model_in_w, model_in_h) and the real image resolution be (img_in_w, img_in_h); the scaling ratio is then:
Ratio = min(model_in_w / img_in_w, model_in_h / img_in_h); (1)
Pad_w = model_in_w - img_in_w * Ratio; (2)
Pad_h = model_in_h - img_in_h * Ratio; (3)
in order to avoid numerical problems and speed up network convergence, normalization processing needs to be performed on the input data: firstly, calculating the mean value and standard deviation of all pixel points in an image, then subtracting the mean value from each pixel value of an input image, and dividing the mean value by the standard deviation to obtain a normalized result, wherein the calculation mode is as shown in the formula (4):
x_norm = (x_img - u) / stddev; (4)
where u represents the image mean, x_img represents the image matrix, and stddev represents the standard deviation of the image.
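A minimal Python sketch of the resizing and normalization steps, assuming a single-channel infrared image, centered padding and the use of OpenCV for resizing (these details are not fixed by the text above):

```python
import cv2
import numpy as np

def letterbox_and_normalize(img, model_in_w, model_in_h):
    """Sketch of the resizing and normalization described by formulas (1)-(4)."""
    img_in_h, img_in_w = img.shape[:2]
    ratio = min(model_in_w / img_in_w, model_in_h / img_in_h)    # formula (1)
    new_w, new_h = int(img_in_w * ratio), int(img_in_h * ratio)
    pad_w = model_in_w - new_w                                   # formula (2)
    pad_h = model_in_h - new_h                                   # formula (3)
    resized = cv2.resize(img, (new_w, new_h))
    # Fill the spare positions with black (zero padding); centering is an assumption.
    canvas = np.zeros((model_in_h, model_in_w), dtype=img.dtype)
    canvas[pad_h // 2:pad_h // 2 + new_h, pad_w // 2:pad_w // 2 + new_w] = resized
    # Formula (4): subtract the image mean and divide by the standard deviation.
    x = canvas.astype(np.float32)
    return (x - x.mean()) / (x.std() + 1e-6)
```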
In a specific implementation, in the method for detecting an infrared image target provided in the embodiment of the present invention, before executing step S102, the method may further include: the embedded equipment to be deployed is researched, and hardware resources used by an AI model reasoning module and a supported open source model framework are checked.
Specifically, since the operators supported by each embedded AI platform and the platform computing power differ, the supported model frameworks and typical model inference performance need to be checked at the initial stage of network model design. If the platform to be deployed only supports the Caffe model framework, Caffe is preferably used for model design during model training, otherwise an additional conversion step from other models to a Caffe model has to be added; if the platform to be deployed supports mainstream model frameworks such as Caffe, TensorFlow, PyTorch and ONNX, the whole model structure design can be carried out with these frameworks. After the training frameworks supported by the platform are known, the supported operator types are checked; if an open-source framework supported by the platform is selected but the model design uses operators that are not supported, the whole model design is still invalid and the design scheme needs to be changed again. In addition, the platform's basic model performance tables, such as the inference time and memory usage of MobileNet, VGG, ResNet, SSD and YOLO series models, need to be checked, so as to preliminarily estimate whether the designed model can be deployed on the corresponding platform under the service requirements.
In this embodiment, the whole network is designed to be deployed on embedded devices, so the backbone network must have enough expressive capability to extract target features while keeping a fast inference speed; the target feature extraction backbone network thus serves to accelerate feature extraction and reduce the feature input scale. The multi-scale target detection layer serves to classify and position targets of different sizes, so target detection needs to be completed at several scales, and the final detection result output layer needs to fuse the detection results at these scales and use a certain strategy to complete optimal target positioning and classification. Accordingly, the present invention divides the entire network design into three modules: a feature extraction module (i.e. the target feature extraction backbone network), a feature detection module (i.e. the multi-scale target detection layer) and a detection result output module.
In the specific implementation, when the feature extraction module is designed, firstly, a focus image rearrangement module is designed, then the focus image rearrangement module is connected into a Ghost convolution module with a residual structure, and finally, the SPP module is connected.
In the focus image rearrangement module, the resolution of the input image is reduced by using a focus structure, and the downsampled image content is spliced along the Z axis of the 3-dimensional input layer, that is, the number of image channels is expanded. As shown in fig. 3, the image is sliced: a value is taken every other pixel, similar to adjacent downsampling, yielding four images whose length and width are half those of the original, so that the W and H information is concentrated into the channel space and the input channels are expanded 4 times. Finally, a convolution operation is performed on the obtained new image, giving a 2x downsampled feature map with no information loss.
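For illustration, the slicing operation can be sketched in a few lines of Python (PyTorch tensors assumed); the example dimensions in the trailing comment are assumptions:

```python
import torch

def focus_rearrange(x: torch.Tensor) -> torch.Tensor:
    """Focus slicing sketch: take every other pixel to build four half-resolution
    copies and stack them along the channel axis (4x channel expansion).
    x: (N, C, H, W) with even H and W."""
    return torch.cat([x[..., 0::2, 0::2],   # top-left pixels
                      x[..., 1::2, 0::2],   # bottom-left pixels
                      x[..., 0::2, 1::2],   # top-right pixels
                      x[..., 1::2, 1::2]],  # bottom-right pixels
                     dim=1)

# Example (assumed sizes): a 1x640x640 infrared image becomes a 4x320x320 tensor,
# which is then passed to a convolution layer as described above.
```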
The Focus module is connected to the Ghost convolution module with the residual structure, so that the idea of the residual structure, connecting the input and the output, can be used to build a deeper network across more layers. The feature map passed in from the input layer is divided into two parts: Part2 is sent to the residual block structure for processing and is then channel-spliced with Part1 to obtain the final output. A schematic diagram of the convolutional layer structure of the feature extraction module is shown in fig. 4, and the residual structure is shown in fig. 5.
The Ghost-depthWise convolution module is mainly applied in the residual structure entered by Part2. Its core idea is to divide the traditional convolution operation into two steps, as shown in fig. 6: a first feature map (a smaller number of feature maps) is generated by depth-separable convolution at a small computational cost; then, on this basis, a second feature map (new, similar feature maps) is generated from the first feature map by a grouped 1x1 convolution operation with a small amount of calculation; finally, the information in the first and second feature maps is combined to obtain a feature map containing all the feature information.
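A minimal PyTorch sketch of a block in this spirit is given below; the exact channel counts, group number and the use of BatchNorm/ReLU are not specified in the text and are assumptions:

```python
import torch
import torch.nn as nn

class GhostDepthwiseBlock(nn.Module):
    """Sketch of the two-step Ghost-depthWise idea: a cheap depthwise convolution
    produces the first (primary) feature maps, a grouped 1x1 convolution derives the
    second (ghost) feature maps from them, and the two are concatenated."""
    def __init__(self, channels: int, ghost_channels: int, groups: int = 4):
        super().__init__()
        # Step 1: depthwise 3x3 convolution, one filter per input channel.
        self.primary = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # Step 2: grouped 1x1 convolution generating the ghost feature maps
        # (channels and ghost_channels are assumed divisible by `groups`).
        self.ghost = nn.Sequential(
            nn.Conv2d(channels, ghost_channels, 1, groups=groups, bias=False),
            nn.BatchNorm2d(ghost_channels), nn.ReLU(inplace=True))

    def forward(self, x):
        primary = self.primary(x)
        ghost = self.ghost(primary)
        # Combine the information of both feature maps along the channel axis.
        return torch.cat([primary, ghost], dim=1)
```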
The amount of computation saved using the Ghost-depthwise convolution module will be described in the following with specific examples:
Assuming that the input feature map is (c, h, w) = (64, 100, 100) and the output feature map is (c, h, w) = (200, 100, 100), the amount of calculation using the ordinary convolution method with a 3x3 kernel is:
ConvFlops=100*100*3*3*64*200=1152000000
the calculation amount of the same input and output by using the Ghost-depthWise module is as follows:
the first step uses the computation of the depthWise convolution with kernel 3x3, with both input and output (c, h, w) = (64,100,100);
Ghost-ConvFlops_Step1=100*100*3*3*64=5760000
the second step combines a grouped 1x1 point convolution with 3x3 depthwise convolutions, and its output is concatenated with the output of the first step in the channel direction; the total number of output channels is 200 and the first step outputs 64 channels, so 136 channels remain to be produced: two groups of 3x3 depthwise convolution modules can output 64 channels each, and one group of 1x1 point convolution modules can output the remaining 8 channels;
Ghost-ConvFlops_Step2_1=100*100*3*3*64=5760000
Ghost-ConvFlops_Step2_2=100*100*3*3*64=5760000
Ghost-ConvFlops_Step2_3=100*100*1*1*64*8=5120000
the total calculated amount is as follows:
Ghost-ConvFlops=Ghost-ConvFlops_Step1+Ghost-ConvFlops_Step2_(1,2,3)=22400000;
the ratio of calculation amounts is Ratio = ConvFlops / Ghost-ConvFlops ≈ 50:1.
Therefore, the Ghost-depthWise convolution module provided by the invention can greatly reduce the parameter quantity of the network model and improve the forward reasoning speed of the model.
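The saving can be checked with a few lines of arithmetic reproducing the numbers of the example above (a sketch, not part of the original disclosure):

```python
# Reproducing the multiply-accumulate counts from the example above.
h, w, c_in, c_out, k = 100, 100, 64, 200, 3

conv_flops = h * w * k * k * c_in * c_out    # ordinary 3x3 convolution
step1 = h * w * k * k * c_in                 # depthwise 3x3, 64 -> 64 channels
step2_dw = 2 * h * w * k * k * 64            # two groups of 3x3 depthwise, 64 + 64 channels
step2_pw = h * w * 1 * 1 * 64 * 8            # one group of 1x1 point convolution, 8 channels
ghost_flops = step1 + step2_dw + step2_pw

print(conv_flops, ghost_flops, round(conv_flops / ghost_flops))
# 1152000000 22400000 51  (roughly the 50:1 ratio stated above)
```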
In the SPP convolution module, as shown in fig. 7, the feature map is divided into different spatial regions on different scales, feature vectors are calculated on each region, and all the calculated feature vectors are combined to convert the feature map with any resolution into the designed feature vector with the same dimension as the full connection layer, so that the problem of repeated feature extraction of the image by the convolution neural network can be solved, the speed of generating candidate frames is greatly improved, and the calculation cost is saved.
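An illustrative PyTorch sketch of such a pooling layer, with assumed grid sizes of 1x1, 2x2 and 4x4 (the actual scales are not specified in the text):

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    """Sketch of SPP as described above: pool the feature map over grids of
    several sizes and concatenate the results into a fixed-length vector,
    independent of the input resolution."""
    def __init__(self, grid_sizes=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(g) for g in grid_sizes)

    def forward(self, x):                      # x: (N, C, H, W), any H and W
        n = x.size(0)
        feats = [pool(x).view(n, -1) for pool in self.pools]
        return torch.cat(feats, dim=1)         # (N, C * (1 + 4 + 16)) fixed-length vector
```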
In addition, as shown in fig. 2, the focus image rearrangement module and the first Ghost-depthWise convolution module may be connected by adopting a convolution layer with a step length of 1; the adjacent two Ghost-depthWise convolution modules can be connected by adopting a convolution layer with the step length of 2, so that the operation of downsampling is completed, and the overall calculated amount is gradually reduced; the last Ghost-depthWise convolution module and the SPP convolution module can be connected by adopting a convolution layer with the step length of 2. The number of channels of the Ghost-depthWise convolution module is in an incremental mode along the direction from the image input to the image output.
In the implementation, when the feature detection module is designed, the FPN network and the PAN network shown in fig. 8 are combined; how many pyramid layers are used for fusion can be adjusted according to the computing power of the platform and the required detection precision (i.e. the size range of target detection). In a feature pyramid of the FPN structure, the feature maps have different sizes at different scales; different features can be fused, but because of these size differences the features at the bottom of the feature pyramid cannot be fused directly with the features at the top. Connecting a PAN structure after the FPN structure provides path aggregation, so that fine bottom-level features are easily passed to the upper network and fused at the same scale. In combination, the FPN layer conveys strong semantic features from top to bottom and the PAN layer conveys strong localization features from bottom to top, which greatly enriches the characterization content of the features and improves detection performance.
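A compact PyTorch sketch of this top-down plus bottom-up fusion for three feature levels is shown below; channel counts, layer depths and the 1x1 fusion convolutions are assumptions, and stride-2 relations between adjacent levels are assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FpnPanNeck(nn.Module):
    """Minimal sketch of the FPN + PAN combination described above, for three
    backbone levels that all carry the same channel count `c`."""
    def __init__(self, c: int):
        super().__init__()
        self.fuse_td = nn.ModuleList(nn.Conv2d(2 * c, c, 1) for _ in range(2))  # top-down (FPN)
        self.fuse_bu = nn.ModuleList(nn.Conv2d(2 * c, c, 1) for _ in range(2))  # bottom-up (PAN)
        self.down = nn.ModuleList(nn.Conv2d(c, c, 3, stride=2, padding=1) for _ in range(2))

    def forward(self, c3, c4, c5):              # c3: largest map, c5: smallest map
        # FPN: pass strong semantic features from top to bottom by upsampling and fusing.
        p4 = self.fuse_td[0](torch.cat([c4, F.interpolate(c5, size=c4.shape[-2:])], 1))
        p3 = self.fuse_td[1](torch.cat([c3, F.interpolate(p4, size=c3.shape[-2:])], 1))
        # PAN: pass strong localization features from bottom to top by downsampling and fusing.
        n4 = self.fuse_bu[0](torch.cat([p4, self.down[0](p3)], 1))
        n5 = self.fuse_bu[1](torch.cat([c5, self.down[1](n4)], 1))
        return p3, n4, n5                       # multi-scale maps sent to the detection heads
```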
Finally, the feature-fused detection results are passed through a convolution layer and a dimensional transposition operation, and classification and position prediction results are output for each anchor frame. When there are N categories, the classification prediction result for each anchor frame outputs N+1 values, where the N values represent the confidence of the anchor frame for each category and the extra value represents the confidence that the content is background. The position prediction result for each anchor frame is represented by 4 values, [priors_center_x, priors_center_y, priors_h, priors_w], which are the center point coordinates and the length and width of the anchor frame; each prediction frame corrects the anchor frame with 4 compensation parameters, denoted [center_x_offset, center_y_offset, h_offset, w_offset], which are the offsets of the anchor frame center point coordinates and of the anchor frame length and width, respectively. The [center_x, center_y, h, w] marker box on the original image can be obtained through the following conversion:
center_x=center_x_offset×center_variance*priors_center_x+priors_center_x;
center_y=center_y_offset×center_variance*priors_center_y+priors_center_y;
h=exp(h_offset×size_variance*priors_h)×priors_h;
w=exp(w_offset×size_variance*priors_w)×priors_w;
where priors_center_x, priors_center_y, priors_h and priors_w represent the center point coordinates, length and width of the anchor frame, center_variance and size_variance are constant scale-transformation values, generally set to 0.1 and 0.2, and exp() is the exponential function with base e.
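For illustration, the conversion can be written directly from the formulas above (transcribed as printed; the default variance values follow the text):

```python
import math

def decode_prediction(offsets, prior, center_variance=0.1, size_variance=0.2):
    """Apply the conversion formulas above to one anchor frame.
    offsets: (center_x_offset, center_y_offset, h_offset, w_offset)
    prior:   (priors_center_x, priors_center_y, priors_h, priors_w)"""
    cx_off, cy_off, h_off, w_off = offsets
    p_cx, p_cy, p_h, p_w = prior
    center_x = cx_off * center_variance * p_cx + p_cx
    center_y = cy_off * center_variance * p_cy + p_cy
    h = math.exp(h_off * size_variance * p_h) * p_h
    w = math.exp(w_off * size_variance * p_w) * p_w
    return center_x, center_y, h, w
```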
In the method for detecting the infrared image target provided by the embodiment of the invention, after target detection is performed, Non-Maximum Suppression (NMS) can be used to screen the target detection results output by the infrared image target detection network, suppressing low-confidence anchor frame regression predictions and providing the final infrared target detection result. Before NMS is carried out, all low-confidence anchor frame regression predictions are screened out with a probability threshold, which is set between 0.2 and 0.5; the regression positioning predictions relative to the anchor frames are then converted into marker frames expressed relative to the original image, after which the suppression process is performed. In the NMS algorithm, the intersection-over-union (IoU) threshold is set between 0.2 and 0.6. The NMS algorithm flow may include: first, building a set H for storing the candidate frames to be processed, initialized to contain all N frames, and a set M for storing the optimal frames, initialized as an empty set; second, sorting the frames in set H, selecting the frame m with the highest score and moving it from set H to set M; third, traversing the frames in set H, calculating the intersection-over-union (IoU) of each with frame m, and removing from set H any frame whose IoU is above the threshold, since it is considered to overlap with m; fourth, returning to the second step and iterating until set H is empty, at which point the frames in set M are the desired results.
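A Python sketch of this NMS flow, with the probability and IoU thresholds chosen inside the ranges stated above (the specific values are assumptions):

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-9)

def nms(boxes, scores, score_thresh=0.3, iou_thresh=0.45):
    """Sketch of the NMS flow described above."""
    # Pre-filter low-confidence predictions with the probability threshold.
    H = [i for i, s in enumerate(scores) if s >= score_thresh]
    M = []
    while H:                                   # step 2: pick the highest-scoring frame m
        m = max(H, key=lambda i: scores[i])
        H.remove(m)
        M.append(m)
        # Step 3: drop every remaining frame that overlaps m above the IoU threshold.
        H = [i for i in H if iou(boxes[i], boxes[m]) <= iou_thresh]
    return M                                   # indices of the kept frames
```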
In a specific implementation, in the method for detecting an infrared image target provided by the embodiment of the present invention, step S103 trains the built lightweight infrared human target detection network model on the prepared training data set until the loss function converges, thereby completing the network training. When training the infrared image target detection network, regression localization prediction may use a softmax loss function and classification prediction may use a cross entropy loss function.
In a specific implementation, in the method for detecting an infrared image target provided by the embodiment of the present invention, step S104 performs channel pruning and retraining on the trained infrared image target detection network, and may specifically include: counting, in the target feature extraction backbone network of the trained infrared image target detection network, the sum of the kernel weight values in the channel direction of each convolution layer; cutting the corresponding convolution layers according to the sum of the weight values and the cutting proportion set for each convolution layer, where the convolution layer between the focus image rearrangement module and the first Ghost-depthWise convolution module and the convolution layers connected with the FPN network are not cut; and using the remaining (uncut) weights of the cut convolution layers as initialization parameters of the infrared image target detection network, keeping the weights of the uncut convolution layers fixed during training, and retraining the infrared image target detection network.
It should be appreciated that, although using the Ghost-depthWise module can greatly reduce the parameters of the network model, the conventional convolution layers still contain a large number of feature maps that are redundant for infrared image detection, and for embedded platforms with high real-time requirements the model can be compressed further. For the trained network, the sum of weight values of each convolution layer in the infrared human-vehicle feature extraction backbone network is counted, and the channels with smaller weight sums are deleted according to the channel scaling factor set for each layer. Because of hardware optimization, the number of deleted channels must be a multiple of 2. The cutting proportion of each convolution layer should increase gradually from the network input towards the detection result output; for example, if the first cut convolution layer uses a proportion of 1/32, the second can be set to 1/16, the third to 1/8, and so on. The cutting proportions of all convolution layers can also be set to the same value, but it has been verified that the proportion of convolution layers close to the input should not be too large. Because the convolution layer between the focus image rearrangement module and the first Ghost-depthWise convolution module (i.e. the first convolution layer) carries local texture feature information, it is not cut; to keep the structure of the target detection layer unchanged, the convolution layers connected with the FPN are not cut either, and only the intermediate layers are cut. In the channel cutting process only the number of channels changes and the main structure of the convolution layers is maintained, so this process hardly affects the generalization performance of the network.
Assuming that the output channel count of a certain convolution layer is 128 and ordinary convolution with a 3x3 kernel is used in the intermediate calculation, the total number of kernels is 128 and the size of each kernel is 3x3x64. The specific clipping process may include: first, counting the sums of the 128 kernel weight values and sorting them from small to large; if the clipping proportion of the layer is 1/16, finding and deleting the kernels corresponding to the first 8 weight sums, while keeping the other kernel weights unchanged. In this way the output channel count of the layer becomes 120, and the next layer that takes this layer as input must delete the corresponding channels and, at the same time, delete the kernel weights corresponding to those channels in the channel direction. As shown in fig. 9, in the overall channel pruning diagram the dark color marks the kernels and channels to be deleted and the light color marks the weight parameters to be retained.
And then, the cut convolution layer uses the weight of the uncut channel as an initialization parameter of the network, the weight of the uncut convolution layer is not updated in the training process, the network is retrained until convergence, and the detection effect of the final network model is close to the accuracy of the uncut network model.
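For one convolution layer, the channel selection rule can be sketched as follows (whether the weight sum is taken in absolute value is an assumption, as are the function names):

```python
import torch

def prune_channels(conv_weight: torch.Tensor, ratio: float):
    """Sketch of the channel selection rule above for one convolution layer.
    conv_weight: (out_channels, in_channels, k, k); `ratio` is the clipping
    proportion set for this layer (e.g. 1/16). Returns the indices to keep."""
    out_channels = conv_weight.shape[0]
    # Sum of kernel weight values per output channel (absolute sum assumed).
    importance = conv_weight.abs().sum(dim=(1, 2, 3))
    n_prune = int(out_channels * ratio)
    n_prune -= n_prune % 2          # hardware optimization: delete a multiple of 2 channels
    keep = torch.argsort(importance, descending=True)[:out_channels - n_prune]
    return torch.sort(keep).values  # keep the original channel order in the pruned layer

# Example from the text: 128 output channels with a 1/16 proportion keeps 120 channels;
# the next layer must then drop the same 8 channels in its input (channel) direction.
```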
Based on the same inventive concept, the embodiment of the invention also provides an infrared image target detection device, and because the principle of solving the problem of the device is similar to that of the infrared image target detection method, the implementation of the device can refer to the implementation of the infrared image target detection method, and the repetition is omitted.
In implementation, as shown in fig. 10, the infrared image target detection device provided in the embodiment of the present invention specifically includes:
a training set generating unit 11, configured to create an infrared image data set, perform data enhancement preprocessing, and generate a training sample set;
a network construction unit 12, configured to construct an infrared image target detection network according to an operator supported by the embedded platform and the provided computing power; the target feature extraction backbone network of the infrared image target detection network is constructed by adopting 1 focus image rearrangement module, a plurality of Ghost-depthWise convolution modules with residual structures and 1 SPP convolution module; the multi-scale target detection layer of the infrared image target detection network is constructed by adopting a pyramid structure formed by combining an FPN network and a PAN network;
a network training unit 13, configured to train the constructed infrared image target detection network using the training sample set; the method is also used for carrying out channel pruning and retraining on the infrared image target detection network after training is completed;
the network conversion unit 14 is configured to convert the retrained infrared image target detection network according to the target platform requirement by using the embedded platform AI module conversion tool chain.
In the infrared image target detection device provided by the embodiment of the invention, complex operator calculation can be avoided through interaction of the four units, redundant characteristic information is abandoned, characterization content of the characteristics is greatly enriched, detection performance is improved, the problem of module conversion is solved, the whole network model is small, the calculation speed is high, the universality is high, the requirement of embedded platform business is met, the embedded platform is easy to transplant to a low-end AI chip, the detection precision and the operation speed are high, a main stream application platform in the current market can be supported, and the application potential is provided.
For more specific working procedures of the above units, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
Correspondingly, the embodiment of the invention also discloses infrared image target detection equipment, which comprises a processor and a memory; the method for detecting infrared image targets disclosed in the foregoing embodiments is implemented when the processor executes the computer program stored in the memory.
For more specific procedures of the above method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
Further, the invention also discloses a computer readable storage medium for storing a computer program; the computer program, when executed by a processor, implements the infrared image target detection method disclosed previously.
For more specific procedures of the above method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. The apparatus, device, and storage medium disclosed in the embodiments are relatively simple to describe, and the relevant parts refer to the description of the method section because they correspond to the methods disclosed in the embodiments.
Those of skill in the art will further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative units and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may be disposed in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The infrared image target detection method provided by the embodiment of the invention comprises the following steps: establishing an infrared image data set and performing data enhancement preprocessing to generate a training sample set; constructing an infrared image target detection network according to the operators supported by the embedded platform and the computing power it provides, wherein the target feature extraction backbone network of the infrared image target detection network is constructed from 1 focus image rearrangement module, a plurality of Ghost-depthWise convolution modules with residual structures, and 1 SPP convolution module, and the multi-scale target detection layer is constructed as a pyramid structure formed by combining an FPN network and a PAN network; training the constructed infrared image target detection network with the training sample set; performing channel pruning and retraining on the trained network; and converting the retrained network with the embedded platform AI module conversion tool chain according to the target platform requirements. Because the backbone network is composed of the focus image rearrangement module, the Ghost-depthWise convolution modules, and the SPP convolution module, the multi-scale target detection layer is composed of the FPN and PAN networks, and the network is further compressed by channel pruning, complex operator calculation is avoided, redundant feature information is discarded, the representational content of the features is greatly enriched, and detection performance is improved. The network is designed for the corresponding embedded platform and uses only operators of the types supported by that platform, so no module conversion problem arises; the overall network model is small, fast, and highly general, meets the service requirements of embedded platforms, is easy to port to low-end AI chips, supports the mainstream application platforms currently on the market, and has broad application potential. In addition, the invention also provides a corresponding apparatus, device, and computer readable storage medium for the infrared image target detection method, which makes the method more practical; the apparatus, device, and computer readable storage medium have the corresponding advantages.
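As an illustration of the focus image rearrangement step mentioned above, the following sketch shows one common way such a slicing layer can expand the channel count: the input is sampled into four interleaved half-resolution slices that are stacked along the channel axis and then passed through a stride-1 convolution (the connection to the first backbone block described later). The exact slicing pattern and channel numbers are assumptions; the patent only states that the focus structure slices the image to expand its channel count.

```python
# Illustrative focus/slice rearrangement: a (B, C, H, W) tensor becomes
# (B, 4C, H/2, W/2) by interleaved sub-sampling. The slicing pattern is an
# assumption; the text only states that slicing expands the channel count.
import torch
import torch.nn as nn

class FocusRearrange(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Stride-1 convolution connecting the focus module to the first backbone block.
        self.conv = nn.Conv2d(4 * in_ch, out_ch, 3, stride=1, padding=1, bias=False)

    def forward(self, x):
        # Four interleaved slices, each half resolution, stacked on the channel axis.
        sliced = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
            dim=1,
        )
        return self.conv(sliced)

# Example: a 640x640 single-channel infrared frame becomes a 16-channel 320x320 map.
frame = torch.randn(1, 1, 640, 640)
print(FocusRearrange(in_ch=1, out_ch=16)(frame).shape)  # torch.Size([1, 16, 320, 320])
```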
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The method, device, equipment, and storage medium for infrared image target detection provided by the invention have been described in detail above. Specific examples have been used herein to illustrate the principles and implementations of the invention, and the description of the embodiments is provided only to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementations and the scope of application in accordance with the ideas of the invention; in view of the above, the contents of this description should not be construed as limiting the invention.

Claims (9)

1. An infrared image target detection method, comprising:
establishing an infrared image data set, and performing data enhancement preprocessing to generate a training sample set;
constructing an infrared image target detection network according to the operators supported by the embedded platform and the computing power it provides; the target feature extraction backbone network of the infrared image target detection network is constructed from 1 focus image rearrangement module, a plurality of Ghost-depthWise convolution modules with residual structures, and 1 SPP convolution module; the multi-scale target detection layer of the infrared image target detection network is constructed as a pyramid structure formed by combining an FPN network and a PAN network; the focus image rearrangement module uses a focus structure to slice the image so as to expand the number of channels of the image; the Ghost-depthWise convolution module generates a first feature map through a depthwise separable convolution, generates a second feature map through a grouped 1x1 convolution operation on the first feature map, and combines the information in the first feature map and the second feature map to obtain a feature map containing all the feature information; the SPP convolution module divides the feature map into different spatial regions at different scales, calculates a feature vector on each region, and combines all the calculated feature vectors, thereby converting a feature map of any resolution into a feature vector with the same dimension as the fully connected layer; the focus image rearrangement module is connected to the first Ghost-depthWise convolution module by a convolution layer with a stride of 1; every two adjacent Ghost-depthWise convolution modules are connected by a convolution layer with a stride of 2; the last Ghost-depthWise convolution module is connected to the SPP convolution module by a convolution layer with a stride of 2; and the number of channels of the Ghost-depthWise convolution modules increases along the direction from image input to image output;
training the constructed infrared image target detection network by using the training sample set;
performing channel pruning and retraining on the infrared image target detection network after training is completed;
and converting the infrared image target detection network after the retraining is completed by using an embedded platform AI module conversion tool chain according to the target platform requirement.
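Claim 1 above describes the SPP convolution module as pooling the feature map over spatial regions at several scales and concatenating the per-region feature vectors into a fixed-length vector. The sketch below reproduces that behaviour with adaptive max pooling; the pyramid levels (1, 2, 4) are illustrative assumptions rather than values given in the patent.

```python
# Spatial pyramid pooling sketch: pools a (B, C, H, W) map over 1x1, 2x2 and
# 4x4 grids and concatenates the flattened results into one fixed-length vector.
# The pyramid levels are assumptions for illustration only.
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(s) for s in levels])

    def forward(self, x):
        b = x.size(0)
        # Each pooled grid is flattened; any input resolution yields the same output size.
        return torch.cat([p(x).reshape(b, -1) for p in self.pools], dim=1)

feat = torch.randn(2, 256, 13, 17)   # arbitrary spatial resolution
print(SPP()(feat).shape)             # torch.Size([2, 5376]) = 256 * (1 + 4 + 16)
```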
2. The method of claim 1, further comprising, after said establishing an infrared image dataset:
labeling the infrared image dataset with a rectangular frame, and recording the coordinate position of the rectangular frame;
and, in the labeling stage, labeling two targets whose overlap rate exceeds 90% as one target, and, after all targets are labeled, deleting the label information of targets whose pixel area accounts for less than three parts per million.
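The labeling rules of claim 2 amount to a small clean-up pass over the annotations: boxes that overlap by more than 90% are merged into one, and boxes whose pixel area falls below three parts per million are dropped. In the hypothetical sketch below the overlap rate is computed as intersection over the smaller box and the area threshold is taken relative to the image area; both interpretations are assumptions, since the claim does not define them.

```python
# Hypothetical annotation clean-up pass. Boxes are (x1, y1, x2, y2) tuples.
def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def overlap_rate(a, b):
    # Intersection over the smaller box (an assumed reading of "overlap rate").
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    smaller = min(box_area(a), box_area(b))
    return inter / smaller if smaller > 0 else 0.0

def clean_labels(boxes, img_w, img_h, merge_thr=0.9, area_ppm=3e-6):
    merged = []
    for b in boxes:
        for i, m in enumerate(merged):
            if overlap_rate(b, m) > merge_thr:
                # Merge heavily overlapping boxes into one enclosing box.
                merged[i] = (min(b[0], m[0]), min(b[1], m[1]),
                             max(b[2], m[2]), max(b[3], m[3]))
                break
        else:
            merged.append(b)
    # Drop boxes whose pixel area is below the parts-per-million threshold
    # (taken relative to the image area, which is an assumption).
    min_area = area_ppm * img_w * img_h
    return [b for b in merged if box_area(b) >= min_area]

# The two near-duplicate boxes merge; the sub-pixel box is dropped.
print(clean_labels([(10, 10, 50, 50), (11, 11, 50, 49), (0, 0, 0.9, 0.9)], 640, 512))
```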
3. The method for detecting an infrared image target according to claim 2, wherein the performing data enhancement preprocessing includes:
randomly selecting four images from the infrared image data set each time, and randomly selecting one image from the four images to serve as a master image;
proportionally scaling the remaining three images to the size of the master image;
randomly selecting targets from the other three images and randomly scaling them, wherein the size of a scaled target cannot exceed 1/16 of the total pixel area of the image;
randomly scaling the three cut-out image targets and then randomly pasting them into the master image; if a randomly pasted target would cover more than 60% of an existing target in the image or would extend beyond the image boundary, discarding the currently cut-out target;
and correspondingly adjusting the coordinate positions of the rectangular frames labeled for the pasted targets.
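Claim 3 describes a mosaic-style copy-and-paste augmentation. The following sketch is one possible NumPy/OpenCV reading of it: taking a single target per donor image, the 0.5–1.5 random scale range, and the use of cv2.resize are assumptions made only to keep the example short, while the 1/16-area and 60%-coverage checks follow the claim.

```python
# Hypothetical sketch of the copy-paste augmentation in claim 3.
# Images are NumPy arrays; labels are lists of (x1, y1, x2, y2) boxes.
import random
import numpy as np
import cv2  # only used for resizing; any resize routine would do

def _covered(existing, pasted):
    # Fraction of an existing target's area covered by the pasted box.
    ix1, iy1 = max(existing[0], pasted[0]), max(existing[1], pasted[1])
    ix2, iy2 = min(existing[2], pasted[2]), min(existing[3], pasted[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = max(1, (existing[2] - existing[0]) * (existing[3] - existing[1]))
    return inter / area

def paste_targets(master, master_boxes, donors, max_area_frac=1 / 16, cover_thr=0.6):
    h, w = master.shape[:2]
    out = master.copy()
    boxes = list(master_boxes)
    for img, img_boxes in donors:
        img = cv2.resize(img, (w, h))                 # scale donor to the master size
        if not img_boxes:
            continue
        x1, y1, x2, y2 = map(int, random.choice(img_boxes))
        patch = img[y1:y2, x1:x2]
        if patch.size == 0:
            continue
        s = random.uniform(0.5, 1.5)                  # random re-scale of the cut-out target
        ph, pw = max(1, int(patch.shape[0] * s)), max(1, int(patch.shape[1] * s))
        if ph * pw > max_area_frac * h * w:           # target must stay under 1/16 of the image
            continue
        patch = cv2.resize(patch, (pw, ph))
        px, py = random.randint(0, w), random.randint(0, h)
        if px + pw > w or py + ph > h:                # pasted box leaves the image: discard
            continue
        new_box = (px, py, px + pw, py + ph)
        if any(_covered(b, new_box) > cover_thr for b in boxes):
            continue                                  # would hide >60% of an existing target
        out[py:py + ph, px:px + pw] = patch
        boxes.append(new_box)                         # adjust labels for the pasted target
    return out, boxes
```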
4. The method according to claim 3, wherein, after target detection is performed, the target detection results output by the infrared image target detection network are screened using a non-maximum suppression constraint.
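The screening step of claim 4 is standard greedy non-maximum suppression. A plain NumPy sketch follows; the 0.5 IoU threshold is an illustrative assumption, since the claim does not fix a value.

```python
# Greedy non-maximum suppression over (x1, y1, x2, y2, score) detections.
import numpy as np

def nms(dets, iou_thr=0.5):
    if len(dets) == 0:
        return []
    dets = np.asarray(dets, dtype=np.float32)
    x1, y1, x2, y2, scores = dets.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]                    # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the best remaining box against the rest.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-9)
        order = order[1:][iou <= iou_thr]             # drop boxes that overlap too much
    return keep

print(nms([(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8), (20, 20, 30, 30, 0.7)]))  # [0, 2]
```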
5. The method of claim 4, wherein, when training the infrared image target detection network, a softmax loss function is selected for regression localization prediction and a cross-entropy loss function is selected for classification prediction.
6. The method for detecting an infrared image target according to claim 5, wherein the performing channel pruning and retraining on the trained infrared image target detection network comprises:
counting, for each convolution layer in the target feature extraction backbone network of the trained infrared image target detection network, the sum of its kernel weight values in the channel direction;
pruning each convolution layer according to the sum of its weight values and its pruning ratio, wherein the convolution layer between the focus image rearrangement module and the first Ghost-depthWise convolution module and the convolution layers connected to the FPN network are not pruned;
and using the retained weights of the pruned convolution layers as initialization parameters of the infrared image target detection network, keeping the weights of the un-pruned convolution layers fixed during the training process, and retraining the infrared image target detection network.
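The pruning procedure of claim 6 can be illustrated by a short channel-selection pass that ranks each output channel of a backbone convolution by its summed kernel weights and keeps the strongest fraction per layer. In the sketch below the absolute value is used when summing, the protected layers are identified by name, and freezing is suggested via requires_grad_; all three choices are assumptions rather than details from the patent.

```python
# Hypothetical channel-selection pass for the pruning step. Protected layer
# names and per-layer ratios are assumptions supplied by the caller.
import torch
import torch.nn as nn

def select_channels(backbone: nn.Module, prune_ratio: dict, protected: set):
    keep = {}
    for name, module in backbone.named_modules():
        if not isinstance(module, nn.Conv2d) or name in protected:
            continue
        ratio = prune_ratio.get(name, 0.0)
        # Sum of kernel weight values per output channel (absolute value is an assumption).
        channel_score = module.weight.detach().abs().sum(dim=(1, 2, 3))
        n_keep = max(1, int(module.out_channels * (1.0 - ratio)))
        keep[name] = torch.topk(channel_score, n_keep).indices.sort().values
    return keep  # mapping: layer name -> indices of channels to retain

# The retained weights would then initialise the slimmed network, e.g.
#   new_conv.weight.data.copy_(old_conv.weight.data[keep[name]])
# while layers that were not pruned are frozen with requires_grad_(False) during retraining.
```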
7. An infrared image target detection apparatus, comprising:
the training set generating unit is used for establishing an infrared image data set, performing data enhancement preprocessing, and generating a training sample set;
the network construction unit is used for constructing an infrared image target detection network according to the operators supported by the embedded platform and the computing power it provides; the target feature extraction backbone network of the infrared image target detection network is constructed from 1 focus image rearrangement module, a plurality of Ghost-depthWise convolution modules with residual structures, and 1 SPP convolution module; the multi-scale target detection layer of the infrared image target detection network is constructed as a pyramid structure formed by combining an FPN network and a PAN network; the focus image rearrangement module uses a focus structure to slice the image so as to expand the number of channels of the image; the Ghost-depthWise convolution module generates a first feature map through a depthwise separable convolution, generates a second feature map through a grouped 1x1 convolution operation on the first feature map, and combines the information in the first feature map and the second feature map to obtain a feature map containing all the feature information; the SPP convolution module divides the feature map into different spatial regions at different scales, calculates a feature vector on each region, and combines all the calculated feature vectors, thereby converting a feature map of any resolution into a feature vector with the same dimension as the fully connected layer; the focus image rearrangement module is connected to the first Ghost-depthWise convolution module by a convolution layer with a stride of 1; every two adjacent Ghost-depthWise convolution modules are connected by a convolution layer with a stride of 2; the last Ghost-depthWise convolution module is connected to the SPP convolution module by a convolution layer with a stride of 2; and the number of channels of the Ghost-depthWise convolution modules increases along the direction from image input to image output;
the network training unit is used for training the constructed infrared image target detection network using the training sample set, and is further used for performing channel pruning and retraining on the infrared image target detection network after training is completed;
and the network conversion unit is used for converting the infrared image target detection network after the retraining is completed by using an embedded platform AI module conversion tool chain according to the target platform requirement.
8. An infrared image target detection apparatus comprising a processor and a memory, wherein the processor implements the infrared image target detection method according to any one of claims 1 to 6 when executing a computer program stored in the memory.
9. A computer readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the infrared image target detection method according to any one of claims 1 to 6.
CN202110572928.7A 2021-05-25 2021-05-25 Infrared image target detection method, device, equipment and storage medium Active CN113160062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110572928.7A CN113160062B (en) 2021-05-25 2021-05-25 Infrared image target detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113160062A CN113160062A (en) 2021-07-23
CN113160062B true CN113160062B (en) 2023-06-06

Family

ID=76877827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110572928.7A Active CN113160062B (en) 2021-05-25 2021-05-25 Infrared image target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113160062B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658044B (en) * 2021-08-03 2024-02-27 长沙理工大学 Method, system, device and storage medium for improving image resolution
CN113568435B (en) * 2021-09-24 2021-12-24 深圳火眼智能有限公司 Unmanned aerial vehicle autonomous flight situation perception trend based analysis method and system
CN114418064B (en) * 2021-12-27 2023-04-18 西安天和防务技术股份有限公司 Target detection method, terminal equipment and storage medium
CN115290696A (en) * 2022-08-03 2022-11-04 重庆大学 Infrared thermal imaging defect detection method and device for transformer substation insulator
CN115965582B (en) * 2022-11-22 2024-03-08 哈尔滨岛田大鹏工业股份有限公司 Ultrahigh-resolution-based method for detecting surface defects of cylinder body and cylinder cover of engine

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532859A (en) * 2019-07-18 2019-12-03 西安电子科技大学 Remote Sensing Target detection method based on depth evolution beta pruning convolution net
CN111444760A (en) * 2020-02-19 2020-07-24 天津大学 Traffic sign detection and identification method based on pruning and knowledge distillation
CN111461291A (en) * 2020-03-13 2020-07-28 西安科技大学 Long-distance pipeline inspection method based on YOLOv3 pruning network and deep learning defogging model
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN112668663A (en) * 2021-01-05 2021-04-16 南京航空航天大学 Aerial photography car detection method based on YOLOv4

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961009B (en) * 2019-02-15 2023-10-31 平安科技(深圳)有限公司 Pedestrian detection method, system, device and storage medium based on deep learning
CN111340185A (en) * 2020-02-16 2020-06-26 苏州浪潮智能科技有限公司 Convolutional neural network acceleration method, system, terminal and storage medium
CN111652366A (en) * 2020-05-09 2020-09-11 哈尔滨工业大学 Combined neural network model compression method based on channel pruning and quantitative training
CN112199993B (en) * 2020-09-01 2022-08-09 广西大学 Method for identifying transformer substation insulator infrared image detection model in any direction based on artificial intelligence
CN112101434B (en) * 2020-09-04 2022-09-09 河南大学 Infrared image weak and small target detection method based on improved YOLO v3
CN112163628A (en) * 2020-10-10 2021-01-01 北京航空航天大学 Method for improving target real-time identification network structure suitable for embedded equipment
CN112380952B (en) * 2020-11-10 2022-10-11 广西大学 Power equipment infrared image real-time detection and identification method based on artificial intelligence
CN112668432A (en) * 2020-12-22 2021-04-16 上海幻维数码创意科技股份有限公司 Human body detection tracking method in ground interactive projection system based on YoloV5 and Deepsort
CN112580558A (en) * 2020-12-25 2021-03-30 烟台艾睿光电科技有限公司 Infrared image target detection model construction method, detection method, device and system
CN112686180A (en) * 2020-12-29 2021-04-20 中通服公众信息产业股份有限公司 Method for calculating number of personnel in closed space
CN112508033B (en) * 2021-02-03 2021-06-08 新东方教育科技集团有限公司 Detection method, storage medium, and electronic apparatus
CN112668672A (en) * 2021-03-16 2021-04-16 深圳市安软科技股份有限公司 TensorRT-based target detection model acceleration method and device
CN112699859B (en) * 2021-03-24 2021-07-16 华南理工大学 Target detection method, device, storage medium and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Maritime target detection and recognition method based on S4-YOLO; Zhao Wenqiang; Sun Wei; Optics & Optoelectronic Technology (Issue 04); full text *

Also Published As

Publication number Publication date
CN113160062A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN108573276B (en) Change detection method based on high-resolution remote sensing image
CN109902677B (en) Vehicle detection method based on deep learning
CN111797716B (en) Single target tracking method based on Siamese network
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN107609525B (en) Remote sensing image target detection method for constructing convolutional neural network based on pruning strategy
CN110163187B (en) F-RCNN-based remote traffic sign detection and identification method
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN108647585A (en) A kind of traffic mark symbol detection method based on multiple dimensioned cycle attention network
CN110163213B (en) Remote sensing image segmentation method based on disparity map and multi-scale depth network model
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
CN113807464B (en) Unmanned aerial vehicle aerial image target detection method based on improved YOLO V5
CN113408423A (en) Aquatic product target real-time detection method suitable for TX2 embedded platform
CN112950780B (en) Intelligent network map generation method and system based on remote sensing image
CN111126278A (en) Target detection model optimization and acceleration method for few-category scene
EP4174792A1 (en) Method for scene understanding and semantic analysis of objects
CN110458128A (en) A kind of posture feature acquisition methods, device, equipment and storage medium
CN110084284A (en) Target detection and secondary classification algorithm and device based on region convolutional neural networks
CN114140665A (en) Dense small target detection method based on improved YOLOv5
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN113223027A (en) Immature persimmon segmentation method and system based on PolarMask
CN114898359B (en) Litchi plant diseases and insect pests detection method based on improvement EFFICIENTDET
CN115346071A (en) Image classification method and system for high-confidence local feature and global feature learning
CN115861799A (en) Light-weight air-to-ground target detection method based on attention gradient
CN115410087A (en) Transmission line foreign matter detection method based on improved YOLOv4

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant