CN117036890A - Training of pedestrian detection model, pedestrian detection method, device, equipment and medium

Training of pedestrian detection model, pedestrian detection method, device, equipment and medium

Info

Publication number
CN117036890A
CN117036890A
Authority
CN
China
Prior art keywords
visible light
mode
thermal infrared
image
feature
Prior art date
Legal status
Pending
Application number
CN202311062534.2A
Other languages
Chinese (zh)
Inventor
李明月
龚向锋
崔文朋
李长柏
田志仲
聂玉虎
王春冬
霍磊
张桂庆
孟颖出
于秀丽
李春晖
Current Assignee
Beijing Smartchip Microelectronics Technology Co Ltd
Original Assignee
Beijing Smartchip Microelectronics Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Smartchip Microelectronics Technology Co Ltd filed Critical Beijing Smartchip Microelectronics Technology Co Ltd
Priority to CN202311062534.2A
Publication of CN117036890A

Classifications

    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/143: Sensing or illuminating at different wavelengths
    • G06V 10/776: Validation; Performance evaluation
    • G06V 10/811: Fusion of classification results where the classifiers operate on different input data, e.g. multi-modal recognition
    • G06V 20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a training method for a pedestrian detection model, a pedestrian detection method, a device, equipment and a medium. A visible light sample image and a thermal infrared sample image for a target pedestrian scene are obtained. The features of the visible light sample image and the features of the thermal infrared sample image are added element by element to obtain multi-mode fusion image features. Feature reconstruction is performed based on the multi-mode fusion image features to obtain visible light reconstruction features and thermal infrared reconstruction features. A first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature, a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature are determined. The initial pedestrian detection model is trained according to the similarity losses and the interaction losses until the model training stopping condition is met, thereby obtaining the target pedestrian detection model.

Description

Training of pedestrian detection model, pedestrian detection method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a medium for training a pedestrian detection model and detecting pedestrians.
Background
A pedestrian detection model can identify and track pedestrians appearing in images or videos, improving efficiency and accuracy in applications such as traffic safety, surveillance, and search and rescue.
In the related art, a multi-modal feature fusion method (such as concatenating features of different modalities) may be used to detect pedestrians. However, mutual interference between the multi-modal features may reduce the accuracy of pedestrian detection.
Disclosure of Invention
The embodiments of the present specification aim to solve, at least to some extent, one of the technical problems in the related art. To this end, the embodiments of the present specification propose a training method for a pedestrian detection model, a pedestrian detection method, a device, equipment, and a medium.
The embodiment of the specification provides a training method of a pedestrian detection model, which comprises the following steps:
obtaining a visible light sample image and a thermal infrared sample image aiming at a target pedestrian scene;
performing feature reconstruction based on multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features; the multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image;
determining a first single-mode similarity loss between a visible light extraction feature of the visible light sample image and the visible light reconstruction feature, a second single-mode similarity loss between a thermal infrared extraction feature of the thermal infrared sample image and the thermal infrared reconstruction feature, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature;
and performing single-mode supervision training and cross-mode supervision training on the initial pedestrian detection model simultaneously according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until the model training stopping condition is met, so as to obtain a target pedestrian detection model.
In one embodiment, the initial pedestrian detection model comprises a first encoding network and a second encoding network which are connected in parallel, the first encoding network and the second encoding network are commonly connected to a fusion component, and the fusion component is connected with a first decoding network and a second decoding network which are connected in parallel; performing feature reconstruction based on the multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features includes:
inputting the visible light sample image into the first coding network for feature extraction to obtain visible light mode features;
inputting the thermal infrared sample image into the second coding network for feature extraction to obtain thermal infrared mode features;
performing element-by-element addition operation on the visible light mode characteristics and the thermal infrared mode characteristics through the fusion component to obtain the multi-mode fusion image characteristics;
and inputting the multi-mode fusion image features into the first decoding network and the second decoding network to respectively perform feature reconstruction, and correspondingly obtaining the visible light reconstruction features and the thermal infrared reconstruction features.
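Illustratively, the parallel encoder, fusion and decoder layout described above might be sketched in PyTorch as follows; the class name, channel counts and layer choices are simplified assumptions for illustration rather than the YOLOv5-based networks detailed later in this disclosure:

```python
import torch
import torch.nn as nn

class DualStreamReconstructionModel(nn.Module):
    """Sketch: parallel encoders, element-wise-add fusion, parallel decoders."""

    def __init__(self):
        super().__init__()
        # Shallow stems yield the "extraction features" used as reconstruction targets.
        self.rgb_stem = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.SiLU())
        self.th_stem = nn.Sequential(nn.Conv2d(1, 32, 3, 2, 1), nn.SiLU())
        # Deeper encoder stages yield the mode features that are fused.
        self.rgb_deep = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.SiLU())
        self.th_deep = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.SiLU())
        # Parallel decoders reconstruct each modality's extraction feature
        # from the shared fused feature.
        self.rgb_decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1))
        self.th_decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1))

    def forward(self, rgb: torch.Tensor, th: torch.Tensor):
        rgb_feat = self.rgb_stem(rgb)   # visible light extraction feature
        th_feat = self.th_stem(th)      # thermal infrared extraction feature
        # Fusion component: element-by-element addition, no learnable parameters.
        fused = self.rgb_deep(rgb_feat) + self.th_deep(th_feat)
        rgb_recon = self.rgb_decoder(fused)  # visible light reconstruction feature
        th_recon = self.th_decoder(fused)    # thermal infrared reconstruction feature
        return rgb_feat, th_feat, rgb_recon, th_recon
```

For a 640×640 input pair, this sketch yields extraction and reconstruction features of matching shape (32 channels at 320×320), which is what the similarity and interaction losses below require.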
In one embodiment, the multi-mode fusion image features are determined in the following manner:
acquiring visible light mode characteristics of the visible light sample image and thermal infrared mode characteristics of the thermal infrared sample image;
and performing element-by-element addition operation on the thermal infrared mode characteristics and the visible light mode characteristics to obtain the multi-mode fusion image characteristics.
In one embodiment, the acquiring a visible light sample image and a thermal infrared sample image for a target pedestrian scene includes:
acquiring an initial visible light image and an initial thermal infrared image which are obtained by shooting the target pedestrian scene;
preprocessing the initial visible light image and the initial thermal infrared image according to the input data size of the initial pedestrian detection model to obtain the visible light sample image and the thermal infrared sample image; and taking the normalized visible light sample image and the normalized thermal infrared sample image as input data of the initial pedestrian detection model.
In one embodiment, the visible light sample image and the thermal infrared sample image are normalized in the following manner:
normalizing the visible light sample image according to a first channel mean value and a first channel standard deviation corresponding to the visible light sample image to obtain input data corresponding to the visible light sample image;
and carrying out normalization processing on the thermal infrared sample image according to a second channel mean value and a second channel standard deviation corresponding to the thermal infrared sample image to obtain input data corresponding to the thermal infrared sample image.
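Illustratively, per-modality normalization with separate channel statistics might be implemented as in the following sketch; the mean and standard deviation values shown are placeholders, not values specified by this disclosure:

```python
import torch

def normalize(image: torch.Tensor, mean: list, std: list) -> torch.Tensor:
    """Normalize a (C, H, W) image with per-channel mean and standard deviation."""
    mean_t = torch.tensor(mean).view(-1, 1, 1)
    std_t = torch.tensor(std).view(-1, 1, 1)
    return (image - mean_t) / std_t

rgb_sample = torch.rand(3, 640, 640)  # visible light sample image (3 channels)
th_sample = torch.rand(1, 640, 640)   # thermal infrared sample image (1 channel)

# Placeholder channel statistics; the actual first/second channel mean and
# standard deviation would be computed from the respective training images.
rgb_input = normalize(rgb_sample, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
th_input = normalize(th_sample, mean=[0.45], std=[0.22])
```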
In one embodiment, model loss data is calculated using the following loss function:
L = L1 + L2 + L3 + L4
L1 = MSE(RGB_i, RGB'_i)
L2 = MSE(TH_i, TH'_i)
L3 = MSE(RGB_i, TH'_i)
L4 = MSE(TH_i, RGB'_i)
where L is the model loss data, L1 is the first single-mode similarity loss, L2 is the second single-mode similarity loss, L3 is the first multi-mode interaction loss, L4 is the second multi-mode interaction loss, RGB_i is the visible light extraction feature, RGB'_i is the visible light reconstruction feature, TH_i is the thermal infrared extraction feature, TH'_i is the thermal infrared reconstruction feature, i denotes the feature index, and MSE denotes the mean squared error loss.
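Illustratively, a direct PyTorch transcription of this loss might be the following minimal sketch; the function name and argument layout are assumptions for illustration:

```python
import torch.nn.functional as F

def model_loss(rgb_feat, th_feat, rgb_recon, th_recon):
    """L = L1 + L2 + L3 + L4 with the four MSE terms given above.

    Assumes all four tensors share one shape (e.g. 32x320x320 per sample).
    """
    l1 = F.mse_loss(rgb_recon, rgb_feat)  # first single-mode similarity loss
    l2 = F.mse_loss(th_recon, th_feat)    # second single-mode similarity loss
    l3 = F.mse_loss(th_recon, rgb_feat)   # first multi-mode interaction loss
    l4 = F.mse_loss(rgb_recon, th_feat)   # second multi-mode interaction loss
    return l1 + l2 + l3 + l4
```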
The embodiment of the present specification provides a pedestrian detection method including:
obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image captured for any pedestrian scene;
and inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model obtained by the method in any of the above embodiments to perform pedestrian detection, thereby obtaining a pedestrian detection result.
The embodiment of the present specification provides a pedestrian detection method including:
obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image captured for any pedestrian scene;
inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model for pedestrian detection to obtain a pedestrian detection result; the loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss; the single-mode loss and the cross-mode loss are used together to supervise the training of the target pedestrian detection model's capability to extract multi-mode fusion image features between the visible light to-be-detected image and the thermal infrared to-be-detected image;
The single-mode loss comprises a first single-mode similarity loss between a visible light extraction feature and a visible light reconstruction feature of a visible light sample image, and a second single-mode similarity loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of a thermal infrared sample image;
the cross-mode loss comprises a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
In one embodiment, the target pedestrian detection model comprises a first encoding network and a second encoding network connected in parallel, wherein the first encoding network and the second encoding network are commonly connected to a fusion component; inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model for pedestrian detection to obtain a pedestrian detection result comprises the following steps:
inputting the visible light to-be-detected image into the first coding network for feature extraction to obtain visible light to-be-detected features;
inputting the thermal infrared image to be detected into the second coding network for feature extraction to obtain thermal infrared features to be detected;
performing element-by-element addition operation on the visible light feature to be detected and the thermal infrared feature to be detected through the fusion component to obtain a fusion image feature to be detected;
and performing target detection based on the fusion image features to be detected to obtain the pedestrian detection result.
In one embodiment, performing target detection based on the fusion image features to be detected to obtain the pedestrian detection result comprises:
and performing convolution, pooling and activation operations on the fusion image features to be detected to perform target detection and obtain the pedestrian detection result.
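Illustratively, such a head performing convolution, pooling and activation over the fused features might look like the following sketch (64 input channels, matching the earlier architecture sketch; the output layout is an assumption, and the actual model would use the YOLOv5 detection head rather than this toy stack):

```python
import torch.nn as nn

# Hypothetical detection head over the fused to-be-detected features.
detection_head = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # convolution
    nn.MaxPool2d(2),                               # pooling
    nn.SiLU(),                                     # activation
    nn.Conv2d(128, 6, kernel_size=1),  # 6 = 4 box coords + objectness + 1 class (assumed)
)
```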
The embodiment of the specification provides a training device for a pedestrian detection model, the device comprising:
the sample image acquisition module is used for acquiring a visible light sample image and a thermal infrared sample image aiming at a target pedestrian scene;
the image feature reconstruction module is used for carrying out feature reconstruction based on the multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features; the multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image;
The loss data determining module is used for determining a first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature;
and the detection model determining module is used for simultaneously performing single-mode supervision training and cross-mode supervision training on the initial pedestrian detection model according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until the model training stopping condition is met, so as to obtain a target pedestrian detection model.
The present specification embodiment provides a pedestrian detection apparatus including:
the to-be-detected image acquisition module is used for acquiring a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene;
the detection result determining module is configured to input the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model obtained in any one of the above embodiments to perform pedestrian detection, so as to obtain a pedestrian detection result.
The present specification embodiment provides a pedestrian detection apparatus including:
the to-be-detected image acquisition module is used for acquiring a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene;
the detection result determining module is used for inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model to detect pedestrians and obtain a pedestrian detection result; the loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss; the single-mode loss and the cross-mode loss are used together to supervise the training of the target pedestrian detection model's capability to extract multi-mode fusion image features between the visible light to-be-detected image and the thermal infrared to-be-detected image;
the single-mode loss comprises a first single-mode similarity loss between a visible light extraction feature and a visible light reconstruction feature of a visible light sample image, and a second single-mode similarity loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of a thermal infrared sample image;
the cross-mode loss comprises a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
The present description provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method according to any of the above embodiments.
The present description provides a computer program product comprising instructions which, when executed by a processor of a computer device, enable the computer device to perform the steps of the method of any one of the embodiments described above.
The present description provides a chip comprising a storage unit storing a computer program and a processing unit implementing the steps of the method according to any one of the embodiments above when the processing unit executes the computer program.
In the above-described embodiments, first, a visible light sample image and a thermal infrared sample image for a target pedestrian scene are acquired; then, an element-by-element addition operation is performed on the features of the visible light sample image and the features of the thermal infrared sample image to obtain multi-mode fusion image features, and feature reconstruction is performed based on the multi-mode fusion image features to obtain visible light reconstruction features and thermal infrared reconstruction features; finally, single-mode supervision training and cross-mode supervision training are performed simultaneously on the initial pedestrian detection model according to the first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, the second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, the first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and the second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature, to obtain a target pedestrian detection model. On the one hand, preliminary feature fusion is achieved through the element-by-element addition operation to obtain the multi-mode fusion image features, reducing the mutual interference between the two mode features of the visible light image and the thermal infrared image; on the other hand, the model parameters are optimized and updated through the four losses, namely the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss, so that the model gains the capability of obtaining more accurate multi-mode fusion image features, the quality of the multi-mode fusion image features is improved, and the accuracy of the pedestrian detection result is further improved.
Drawings
FIG. 1a is a schematic diagram of a training initial pedestrian detection model based on the YOLOv5 framework provided in an embodiment of the present disclosure;
FIG. 1b is a schematic diagram of a target pedestrian detection model based on the YOLOv5 framework provided in an embodiment of the present disclosure;
fig. 1c is a schematic flow chart of a training method of a pedestrian detection model according to an embodiment of the present disclosure;
FIG. 2a is a schematic flow chart of obtaining a visible light reconstruction feature and a thermal infrared reconstruction feature according to an embodiment of the present disclosure;
fig. 2b is a schematic diagram of the structure of the first coding network according to the embodiment of the present disclosure;
FIG. 2c is a schematic diagram of a CBS module provided by embodiments of the present disclosure;
FIG. 2d is a schematic diagram of a CSP1_X module provided in an embodiment of the present disclosure;
FIG. 2e is a schematic diagram of a Res unit module provided in an embodiment of the present disclosure;
fig. 2f is a schematic diagram of a csp2_x module provided in an embodiment of the present disclosure;
FIG. 2g is a schematic diagram of an SPPF module provided in an embodiment of the present disclosure;
FIG. 2h is a schematic diagram of determining multi-modality fusion image features provided by embodiments of the present disclosure;
FIG. 2i is a schematic diagram of a fusion component provided in an embodiment of the present disclosure;
fig. 2j is a schematic diagram of a first decoding network according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of obtaining multi-modal fusion image features according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of obtaining a visible light sample image and a thermal infrared sample image according to an embodiment of the present disclosure;
fig. 5 is a schematic flow chart of a pedestrian detection method according to an embodiment of the present disclosure;
fig. 6 is a flowchart of a pedestrian detection method according to an embodiment of the present disclosure;
fig. 7 is a schematic flow chart of obtaining a pedestrian detection result according to an embodiment of the present disclosure;
fig. 8 is a flowchart of a training method of a pedestrian detection model according to an embodiment of the present disclosure;
fig. 9 is a schematic diagram of a training device of a pedestrian detection model according to an embodiment of the present disclosure;
fig. 10 is a schematic view of a pedestrian detection apparatus provided in an embodiment of the present specification;
fig. 11 is a schematic diagram of a pedestrian detection apparatus provided in an embodiment of the present specification;
fig. 12 is an internal configuration diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
In the task of pedestrian detection using convolutional neural networks, the depth of the convolutional neural network has an important impact on model performance. In the process of pedestrian detection using multi-mode data, after the number of layers of the convolutional neural network is increased, the network can extract more complex feature patterns, so that better results can be obtained when the model extracts multi-mode features for pedestrian detection.
However, as the convolutional neural network extracts multi-mode features, those features can interfere with each other, causing detection accuracy to saturate or even decline. A residual network can alleviate this saturation and decline by using shortcut connections, but because the residual network performs identity mapping, redundant features may be passed on to later convolution layers, and this redundant information can interfere with and degrade pedestrian detection.
In the related art, feature fusion methods may use the fused features directly for pedestrian detection. However, the quality of the fused features in the related art may be low: because the feature fusion methods in the related art do not consider feature selectivity, the complementary features between the multi-mode data cannot be fully exploited, which affects the accuracy of pedestrian detection. Therefore, the quality of the fused features needs to be improved when multi-mode features are used for pedestrian detection.
Based on this, the present embodiment provides a training method for a pedestrian detection model. First, a visible light sample image and a thermal infrared sample image for a target pedestrian scene are obtained. Then, an element-by-element addition operation is performed on the features of the visible light sample image and the features of the thermal infrared sample image to obtain multi-mode fusion image features, and feature reconstruction is performed based on the multi-mode fusion image features to obtain visible light reconstruction features and thermal infrared reconstruction features. Finally, according to the first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, the second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, the first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and the second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature, single-mode supervision training and cross-mode supervision training are performed simultaneously on the initial pedestrian detection model until the model training stopping condition is met, to obtain the target pedestrian detection model.
In the embodiments of the specification, on the one hand, preliminary feature fusion is achieved through the element-by-element addition operation to obtain the multi-mode fusion image features, reducing the mutual interference between the two mode features of the visible light image and the thermal infrared image; on the other hand, the model parameters are optimized and updated through the four losses, namely the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss, so that the model gains the capability of obtaining more accurate multi-mode fusion image features, the quality of the multi-mode fusion image features is improved, and the accuracy of the pedestrian detection result is further improved.
The embodiments of the present specification provide a scenario example of the pedestrian detection model training method. An initial visible light image (RGB) and an initial Thermal infrared image (Thermal) of the application scene are acquired by a camera, and the camera uploads them to a server. The server performs image preprocessing on the initial visible light image and the initial Thermal infrared image to obtain an RGB image and a Thermal image. The server side may construct a training sample set, each training sample in the set comprising a visible light image and a thermal infrared image.
In the model training phase, the RGB image and the Thermal image are input into the initial pedestrian detection model. Referring to fig. 1a, the initial pedestrian detection model includes an encoder 106, an encoder 108, a Fusion module 110, a decoder 112, and a decoder 114. The RGB image 102 is input to the encoder 106 for feature extraction, yielding the visible light mode features of the RGB image 102; the visible light extraction features of the RGB image 102 are obtained as it passes through the convolution module of the encoder 106. The Thermal image 104 is input to the encoder 108 for feature extraction, yielding the thermal infrared mode features of the Thermal image 104; the thermal infrared extraction features are obtained as it passes through the convolution module of the encoder 108. The visible light mode features and the thermal infrared mode features are fused by pixel-by-pixel addition through the Fusion module to obtain the multi-mode fusion image features. The multi-mode fusion image features are input to the decoder 112 and the decoder 114 respectively; the decoder 112 reconstructs the RGB image data to obtain the visible light reconstruction features, and the decoder 114 reconstructs the Thermal image data to obtain the thermal infrared reconstruction features. Comparing the similarity of the visible light extraction features and the visible light reconstruction features yields the first single-mode similarity loss Loss1. Comparing the similarity of the thermal infrared extraction features and the thermal infrared reconstruction features yields the second single-mode similarity loss Loss2. Interactive supervision between the visible light extraction features and the thermal infrared reconstruction features yields the first multi-mode interaction loss Loss3. Interactive supervision between the thermal infrared extraction features and the visible light reconstruction features yields the second multi-mode interaction loss Loss4. The single-mode similarity losses and the multi-mode interaction losses are added to obtain the loss data, and the parameters of the initial pedestrian detection model are updated through the loss data until the model training stopping condition is met, giving the target pedestrian detection model. Illustratively, the parameters of the encoder 106, encoder 108, decoder 112 and decoder 114 are updated based on Loss1, Loss2, Loss3 and Loss4. It should be noted that the Fusion module 110 only performs preliminary fusion (i.e., pixel-by-pixel addition) of the visible light mode features and the thermal infrared mode features, so the Fusion module 110 requires no parameter updates.
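Illustratively, one training step combining the forward pass, the four losses and the parameter update might be sketched as follows, reusing the hypothetical DualStreamReconstructionModel and model_loss sketches given earlier:

```python
import torch

model = DualStreamReconstructionModel()  # encoders + decoders from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(rgb_batch: torch.Tensor, th_batch: torch.Tensor) -> float:
    rgb_feat, th_feat, rgb_recon, th_recon = model(rgb_batch, th_batch)
    loss = model_loss(rgb_feat, th_feat, rgb_recon, th_recon)  # Loss1+Loss2+Loss3+Loss4
    optimizer.zero_grad()
    loss.backward()  # the parameter-free Fusion add simply passes gradients through
    optimizer.step()
    return loss.item()
```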
In the model reasoning stage, the visible light image to be detected and the thermal infrared image to be detected are input into the target pedestrian detection model. Referring to fig. 1b, the target pedestrian detection model includes a parameter-updated encoder 128, a parameter-updated encoder 130, the Fusion module 110, and a CONV module (standard convolution module) 132. The visible light image 124 to be detected serves as the input of the encoder 128, which extracts the visible light feature to be detected; the thermal infrared image 126 to be detected serves as the input of the encoder 130, which extracts the thermal infrared feature to be detected. The visible light feature to be detected and the thermal infrared feature to be detected are fused through the Fusion module 110 to obtain the fusion image feature to be detected. The CONV module 132 performs convolution processing on the fusion image feature to be detected for target detection, giving the pedestrian detection result 134.
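Illustratively, the inference path might be sketched as follows, reusing the earlier hypothetical sketches (box decoding and non-maximum suppression are omitted):

```python
import torch

@torch.no_grad()
def detect(model, head, rgb_img: torch.Tensor, th_img: torch.Tensor) -> torch.Tensor:
    """Encode both modalities, fuse by element-wise addition, run the head."""
    rgb_feat = model.rgb_stem(rgb_img)
    th_feat = model.th_stem(th_img)
    fused = model.rgb_deep(rgb_feat) + model.th_deep(th_feat)  # Fusion module
    return head(fused)  # raw detection map

# e.g. raw = detect(model, detection_head,
#                   torch.rand(1, 3, 640, 640), torch.rand(1, 1, 640, 640))
```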
The embodiment of the present disclosure provides a training method for a pedestrian detection model, referring to fig. 1c, the method may include the following steps:
S110, a visible light sample image and a thermal infrared sample image for a target pedestrian scene are acquired.
The visible light sample image captures reflected light and may provide texture detail with high spatial resolution and clarity in a manner consistent with the human visual system. For example, the visible light sample image may be an RGB image with three channels containing red, green and blue visible light color information. The thermal infrared sample image distinguishes an object from its background through differences in thermal radiation, and works well in all weather and all day/night conditions. The thermal infrared sample image has only one channel, contains the intensity information of near infrared light, and may be a gray scale image. The visible light sample image and the thermal infrared sample image differ in wavelength range and imaging principle, and different sharpness and illumination conditions may produce greatly different effects on the two types of images. Infrared rays can detect the heat energy emitted by the human body, and the heat energy of a pedestrian shows distinct characteristics in the thermal infrared sample image. This allows the thermal infrared sample image to determine more accurately whether a target is a pedestrian in some cases, particularly at night or in low light environments. Therefore, the thermal infrared sample image plays a very important role in pedestrian detection and can improve the accuracy and reliability of pedestrian detection. The visible light sample image and the thermal infrared sample image may be sample images of the target pedestrian scene captured at the same time in the same scene, and may be used as a pair of samples for training the initial pedestrian detection model to obtain the target pedestrian detection model.
Specifically, the server locally stores a training sample set, from which the visible light sample image and the thermal infrared sample image for the target pedestrian scene are directly acquired. In other embodiments, an image capturing device may continuously capture the target pedestrian scene to obtain a visible light image and a thermal infrared image and send them to the server, and the server performs cropping or data enhancement processing on the visible light image and the thermal infrared image to obtain the visible light sample image and the thermal infrared sample image for the target pedestrian scene.
Illustratively, the visible light sample image may be a 640×640×3 image and the thermal infrared sample image may be a 640×640×1 image.
The image capturing device may be at least one of a video camera, a camera, an infrared camera, a fisheye camera, and the like.
S120, feature reconstruction is performed based on the multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features.
The multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image. Feature reconstruction generally refers to the use of algorithms or models to extract important information representing different features from the raw data and then use that information to reconstruct the data.
In some cases, in a pedestrian detection task, detection using only a visible light sample image may be affected by various factors such as cloudy days, rainy days and haze. In night and low light environments, a conventional visible light image acquisition device cannot acquire clear images, making pedestrian detection through the visible light sample image difficult, whereas an infrared image acquisition device can illuminate a target with infrared rays and thus acquire bright images at night and in low light, improving the accuracy of pedestrian detection. The infrared sample image can penetrate haze, smoke, rain, snow and other weather conditions and is not influenced by ambient light. Therefore, in extreme weather, pedestrians can still be accurately detected through the infrared sample image. The infrared image and the visible light image are complementary, and multi-mode fusion image features with strong robustness and a large amount of information can be obtained by performing feature fusion based on the features of the visible light sample image and the features of the thermal infrared sample image. Adopting a multi-mode method of fusing the visible light image and the thermal infrared image therefore improves pedestrian detection.
Specifically, the features of the visible light sample image and the features of the thermal infrared sample image are subjected to element-by-element addition operation, so that the multi-mode fusion image features can be obtained. And carrying out feature reconstruction by using the multi-mode fusion image features through a decoder corresponding to the visible light mode to obtain visible light reconstruction features. And carrying out feature reconstruction by using the multi-mode fusion image features through a decoder corresponding to the thermal infrared mode to obtain thermal infrared reconstruction features.
Illustratively, the visible light reconstruction feature may be a 320×320×32 feature map and the thermal infrared reconstruction feature may be a 320×320×32 feature map.
S130, determining a first single-mode similarity loss between a visible light extraction feature and a visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
The visible light extraction feature may be a feature of the visible light sample image extracted by a convolution operation. The thermal infrared extraction feature may be a feature of the thermal infrared sample image extracted by a convolution operation. The single-mode similarity loss is a loss function used to train models; it measures similarity between different samples within the same class and can facilitate the aggregation of similar samples in feature space. The multi-mode interaction loss is a loss function for training a multi-mode deep learning model, realizing cross-mode interaction and joint modeling by combining information from data of different modes (such as different types of images). The different modes in this embodiment may be a visible light image mode and a thermal infrared image mode.
In some cases, in a complex and variable environment, the quality of both the visible light sample image and the thermal infrared sample image varies. For example, in an environment with sufficient illumination, the visible light sample image has better practicality, while the thermal infrared sample image may not greatly improve the quality of the fused features. Considering the importance of mutual supervision between the two extracted features and the existence of complementary features in the multi-mode data, optimizing the interactive supervision loss function enables the model to extract multi-mode fusion image features for pedestrian detection more effectively. The single-mode similarity loss can optimize the model's extraction of the visible light features and the thermal infrared features.
Specifically, by selecting an appropriate single-mode similarity loss function, a first single-mode similarity loss is determined based on the visible light extraction features and the visible light reconstruction features of the visible light sample image. A second unimodal similarity loss is determined based on the thermal infrared extraction features and the thermal infrared reconstruction features of the thermal infrared sample image. By selecting an appropriate multi-modal interaction loss function, a first multi-modal interaction loss is determined based on the thermal infrared reconstruction features and the visible light extraction features, and a second multi-modal interaction loss is determined based on the visible light reconstruction features and the thermal infrared extraction features.
In some embodiments, the feature extraction is performed on the visible light sample image by a convolution module corresponding to the first coding network, so as to obtain a visible light extraction feature of the visible light sample image. And performing feature extraction on the thermal infrared sample image through a convolution module corresponding to the second coding network to obtain thermal infrared extraction features of the thermal infrared sample image. The size of the visible light extraction features may be 320×320×32 and the size of the thermal infrared extraction features may be 320×320×32.
S140, single-mode supervision training and cross-mode supervision training are performed simultaneously on the initial pedestrian detection model according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until the model training stopping condition is met, to obtain a target pedestrian detection model.
Single-mode supervised training is a machine learning method that uses labeled data from a single data source for model training. The method is generally applied to only one type of input data (e.g., one type of image), and various patterns and features can be identified and classified by a supervised learning algorithm. Cross-mode supervised training refers to using label information from one modality (e.g., one type of image) to assist learning in another modality (e.g., another type of image). In this way, model performance may be improved in the absence of sufficient labeled data, and the model can learn more comprehensive features from multiple modalities.
Under some conditions, the initial pedestrian detection model is trained through multi-modal interaction loss, so that multi-modal fusion image features with higher quality can be obtained, the multi-modal fusion image features with improved quality are used for pedestrian detection, and the accuracy of pedestrian detection can be improved. The quality of the visible light extraction features and the thermal infrared extraction features can be improved by single mode similarity loss. Further, the quality of the multi-mode fusion image features can be improved by improving the quality of the visible light extraction features and the thermal infrared extraction features.
Specifically, the parameters of the initial pedestrian detection model are updated based on the first single-mode similarity loss and the second single-mode similarity loss to realize single-mode supervision training, and based on the first multi-mode interaction loss and the second multi-mode interaction loss to realize cross-mode supervision training. Training then continues on the updated initial pedestrian detection model in the same manner, and when the model training stopping condition is reached, the target pedestrian detection model is obtained. The model training stopping condition may be that the model loss data tends to converge, or that the number of training rounds reaches a preset number.
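Illustratively, the two stopping conditions (loss convergence or a preset number of rounds) might be checked as in the following sketch, where loader is an assumed iterable yielding (visible light, thermal infrared) batch pairs and train_step is the earlier sketch:

```python
def train(loader, max_epochs: int = 100, eps: float = 1e-4) -> None:
    """Run train_step until the loss converges or a preset round count is hit."""
    prev_loss = float("inf")
    for epoch in range(max_epochs):  # preset number of training rounds
        epoch_loss = sum(train_step(rgb, th) for rgb, th in loader) / len(loader)
        if abs(prev_loss - epoch_loss) < eps:  # loss data tends to converge
            break
        prev_loss = epoch_loss
```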
In the above embodiment, first, a visible light sample image and a thermal infrared sample image for a target pedestrian scene are acquired; then, an element-by-element addition operation is performed on the features of the visible light sample image and the features of the thermal infrared sample image to obtain multi-mode fusion image features, and feature reconstruction is performed based on the multi-mode fusion image features to obtain visible light reconstruction features and thermal infrared reconstruction features; finally, single-mode supervision training and cross-mode supervision training are performed simultaneously on the initial pedestrian detection model according to the first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, the second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, the first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and the second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature, to obtain a target pedestrian detection model. On the one hand, preliminary feature fusion is achieved through the element-by-element addition operation to obtain the multi-mode fusion image features, reducing the mutual interference between the two mode features of the visible light image and the thermal infrared image; on the other hand, the model parameters are optimized and updated through the four losses, namely the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss, so that the model gains the capability of obtaining more accurate multi-mode fusion image features, the quality of the multi-mode fusion image features is improved, and the accuracy of the pedestrian detection result is further improved.
In some embodiments, referring to fig. 2a, the initial pedestrian detection model includes a first encoding network and a second encoding network in parallel, the first encoding network and the second encoding network being commonly connected to a fusion component, the fusion component being connected with a first decoding network and a second decoding network in parallel. Performing feature reconstruction based on multi-mode fusion image features between a visible light sample image and a thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features, wherein the method can comprise the following steps:
S210, inputting the visible light sample image into a first coding network for feature extraction to obtain visible light mode features.
S220, inputting the thermal infrared sample image into a second coding network for feature extraction to obtain thermal infrared mode features.
Wherein the encoding network is a deep learning neural network for converting high-dimensional input data (such as images, audio or text) into a low-dimensional representation for more efficient analysis and processing.
In some cases, the first encoding network and the second encoding network in parallel may be a two-way feature extraction network, which is a neural network structure in which there are two input paths, each path containing the same or different feature extraction layers, and the outputs of the two paths are eventually added pixel-by-pixel to form the final network output. The structure can effectively preliminarily integrate different types of features, reduce interference of features among different modes and improve performance and robustness of the model.
Specifically, the visible light sample image is input into a first coding network to perform feature extraction, and features of a visible light sample image mode can be extracted through convolution operation in the first coding network to obtain visible light mode features. And inputting the thermal infrared sample image into a second coding network for feature extraction, and extracting the features of the thermal infrared sample image mode through convolution operation in the second coding network to obtain thermal infrared mode features.
The initial pedestrian detection model may be, for example, a cross-modal supervision (CMS) model for visible light sample images and thermal infrared sample images built on the YOLOv5 framework. Referring to fig. 2b, fig. 2b shows the first coding network structure: the first coding network 210 is composed of CBS modules (Conv-BN-SiLU modules), CSP1_X modules (Cross Stage Partial network modules), CSP2_X modules, and an SPPF module (fast spatial pyramid pooling module). Referring to fig. 2c, the CBS module 220 is composed of a Conv module (standard convolution module), a BN module (Batch Normalization module), and a SiLU module (Sigmoid-Weighted Linear Unit); the SiLU activation function is a variant of the Swish activation function, given by SiLU(x) = x·sigmoid(x). Referring to fig. 2d, the CSP1_X module 230 is composed of CBS modules, Residual Units, a Concat module (concatenation module), a BN module, and a SiLU module. Referring to fig. 2e, the Res unit module 240 is composed of CBS modules and an Add module (element-wise addition). Referring to fig. 2f, the CSP2_X module 250 is composed of CBS modules, a Concat module, a BN module, and a SiLU module. Referring to fig. 2g, the SPPF module 260 is composed of CBS modules, MaxPool modules, and a Concat module. The structural composition of the second coding network is identical to that of the first coding network.
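Illustratively, a CBS block and its SiLU activation might be sketched as follows (a minimal PyTorch sketch, not the exact YOLOv5 implementation):

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, with SiLU(x) = x * sigmoid(x)."""

    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()  # identical to x * torch.sigmoid(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))
```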
For example, a visible light sample image with a size of 640×640×3 pixels may be input into the first coding network for feature extraction to obtain visible light mode features with a size of 16×16×512. A thermal infrared sample image with a size of 640×640×1 pixels may be input into the second coding network for feature extraction to obtain thermal infrared mode features with a size of 16×16×512.
S230, carrying out element-by-element addition operation on the visible light mode characteristics and the thermal infrared mode characteristics through the fusion component to obtain multi-mode fusion image characteristics.
In some cases, preliminarily fusing the visible light mode features and the thermal infrared mode features can improve the prediction performance of the pedestrian detection model; fusing features from different sources can also reduce the possibility of overfitting and improve the generalization capability of the pedestrian detection model. In this embodiment, the multi-mode fusion image features are simple, preliminary multi-mode addition fusion features, that is, the multi-mode fusion image features are obtained by adding the visible light mode features and the thermal infrared mode features element by element.
Specifically, the fusion component adds each element of the visible light mode feature to the element at the same position of the thermal infrared mode feature, thereby fusing the visible light mode feature and the thermal infrared mode feature to obtain a multi-mode fusion image feature U.
For example, referring to FIG. 2h, the visible light mode feature 270 may include four elements A_1, A_2, A_3 and A_4, whose element values are 35, 67, 24 and 48, respectively. The thermal infrared mode feature 280 may include four elements B_1, B_2, B_3 and B_4, whose element values are 15, 29, 7 and 36, respectively. Adding the element value 35 of A_1 to the element value 15 of B_1 gives the element value 50 of element C_1 of the multi-mode fusion image feature 290. Adding the element value 67 of A_2 to the element value 29 of B_2 gives the element value 96 of element C_2. Adding the element value 24 of A_3 to the element value 7 of B_3 gives the element value 31 of element C_3. Adding the element value 48 of A_4 to the element value 36 of B_4 gives the element value 84 of element C_4.
In some embodiments, referring to fig. 2i, the Fusion component may be a Fusion module. The visible light mode feature 202 and the thermal infrared mode feature 204 are input to the Fusion module 206, and the Fusion module 206 can perform pixel-by-pixel addition operation on the visible light mode feature and the thermal infrared mode feature to obtain a multi-mode Fusion image feature 208.
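A minimal sketch of such a Fusion component, assuming plain element-wise tensor addition, is shown below; the toy check at the end reproduces the element values of FIG. 2h.

```python
import torch
import torch.nn as nn

class Fusion(nn.Module):
    """Fusion component: element-wise (pixel-by-pixel) addition of the
    visible light mode feature and the thermal infrared mode feature."""
    def forward(self, rgb_feat, th_feat):
        assert rgb_feat.shape == th_feat.shape, "mode features must align"
        return rgb_feat + th_feat

# Toy check reproducing the element values of FIG. 2h:
rgb = torch.tensor([35.0, 67.0, 24.0, 48.0])
th = torch.tensor([15.0, 29.0, 7.0, 36.0])
print(Fusion()(rgb, th))  # tensor([50., 96., 31., 84.])
```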
S240, inputting the multi-mode fusion image features into a first decoding network and a second decoding network to respectively reconstruct the features, and correspondingly obtaining visible light reconstruction features and thermal infrared reconstruction features.
In some cases, the feature reconstruction of the multi-mode fusion image features is realized through the first decoding network and the second decoding network, so that the difference between the visible light mode features and the thermal infrared mode features can be accurately identified, and the quality of feature extraction is improved.
Specifically, the multi-mode fusion image characteristics obtained through pixel-by-pixel addition operation are input into a first decoding network, and up-sampling, convolution and other operations are carried out through the first decoding network, so that visible light reconstruction characteristics are obtained. And inputting the multi-mode fusion image characteristics obtained by the pixel-by-pixel addition operation into a second decoding network, and performing operations such as up-sampling, convolution and the like through the second decoding network to obtain thermal infrared reconstruction characteristics.
For example, referring to fig. 2j, the first decoding network 212 may be composed of a CBS module and an upsampling module. The composition of the second decoding network structure is the same as the composition of the first decoding network structure. And inputting the multi-mode fusion image characteristics obtained by the pixel-by-pixel addition operation into a first decoding network to obtain visible light reconstruction characteristics. And inputting the multi-mode fusion image characteristics obtained by the pixel-by-pixel addition operation into a second decoding network to obtain thermal infrared reconstruction characteristics.
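Under the same assumptions, a decoding network of this kind (CBS blocks interleaved with upsampling) might be sketched as follows, reusing the CBS class from the earlier sketch; the channel widths are illustrative, not the patent's exact configuration.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Decoding network sketch: CBS blocks interleaved with 2x nearest
    upsampling, expanding the 16x16x512 fused feature spatially."""
    def __init__(self, in_ch=512, out_ch=64):
        super().__init__()
        self.layers = nn.Sequential(
            CBS(in_ch, 256),  # CBS class from the earlier sketch
            nn.Upsample(scale_factor=2, mode="nearest"),
            CBS(256, 128),
            nn.Upsample(scale_factor=2, mode="nearest"),
            CBS(128, out_ch),
        )

    def forward(self, fused):
        return self.layers(fused)
```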
In the above embodiment, the single-mode similarity loss and the multi-mode interaction loss of the initial pedestrian detection model can be determined by determining the visible light mode feature, the thermal infrared mode feature, the visible light reconstruction feature and the thermal infrared reconstruction feature, so that the parameters extracted from the features of the initial pedestrian detection model are optimized, and the quality of feature extraction is improved.
In some embodiments, referring to fig. 3, the determining method of the multi-mode fusion image features may include the following steps:
S310, obtaining visible light mode characteristics of a visible light sample image and thermal infrared mode characteristics of a thermal infrared sample image.
And S320, performing element-by-element addition operation on the thermal infrared mode characteristics and the visible light mode characteristics to obtain multi-mode fusion image characteristics.
Specifically, the visible light mode characteristics are obtained by extracting the characteristics of the visible light sample image. And obtaining thermal infrared mode characteristics by extracting the characteristics of the thermal infrared sample image. Then, the element at the visible light mode characteristic position and the element at the same position of the thermal infrared mode characteristic can be added, so that the element-by-element addition operation between the visible light mode characteristic and the thermal infrared mode characteristic can be realized, and the multi-mode fusion image characteristic can be obtained.
In the above embodiment, the visible light mode feature of the visible light sample image and the thermal infrared mode feature of the thermal infrared sample image are obtained, and the element-by-element addition operation is performed on the thermal infrared mode feature and the visible light mode feature to obtain the multi-mode fusion image feature. Fusing the visible light mode feature and the thermal infrared mode feature can improve the prediction performance of the pedestrian detection model, and fusing features from different sources can reduce the possibility of overfitting and improve the generalization capability of the pedestrian detection model.
In some embodiments, referring to fig. 4, the training method of the pedestrian detection model may further include the following steps:
S410, acquiring an initial visible light image and an initial thermal infrared image which are obtained by shooting a target pedestrian scene.
S420, preprocessing an initial visible light image and an initial thermal infrared image according to the input data size of the initial pedestrian detection model to obtain a visible light sample image and a thermal infrared sample image.
And taking the normalized visible light sample image and the normalized thermal infrared sample image as input data of an initial pedestrian detection model.
In some cases, the pixel data of the initial visible light image and the initial thermal infrared image are normalized to fall within a specified range, typically between 0 and 1. The dimension difference among different variables can be eliminated through normalization processing, so that comparison and analysis can be better performed, and the consumption of calculation resources can be saved.
Specifically, an image acquisition device is used for shooting a target pedestrian scene in a real application scene, so that an initial visible light image and an initial thermal infrared image can be obtained. The initial visible light image and the initial thermal infrared image come in pairs, and each target pedestrian scene may have at least one such pair. According to the input data size of the initial pedestrian detection model, the initial visible light image and the initial thermal infrared image are subjected to image preprocessing (such as clipping or equal-proportion scaling), so that the visible light sample image and the thermal infrared sample image for the target pedestrian scene can be obtained.
Illustratively, the input data size of the initial pedestrian detection model may be 640×640. The initial visible light image and the initial thermal infrared image can be preprocessed by equal-proportion scaling or clipping, converting their image sizes to 640×640.
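A hedged illustration of this preprocessing step with OpenCV is shown below; a plain resize is used for brevity, though equal-proportion scaling plus padding or clipping would equally satisfy the description.

```python
import cv2

def preprocess(image, size=(640, 640)):
    """Resize an initial image to the model's input data size.
    A production pipeline might instead letterbox (scale and pad)
    to preserve the aspect ratio."""
    return cv2.resize(image, size, interpolation=cv2.INTER_LINEAR)
```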
In the above embodiment, an initial visible light image and an initial thermal infrared image obtained by shooting a target pedestrian scene are obtained, and the initial visible light image and the initial thermal infrared image are preprocessed according to the input data size of the initial pedestrian detection model, so as to obtain a visible light sample image and a thermal infrared sample image. By preprocessing the initial visible light image and the initial thermal infrared image, input data can be provided for subsequent feature extraction.
In some embodiments, the visible light sample image and the thermal infrared sample image are normalized in the following manner:
and carrying out normalization processing on the visible light sample image according to the first channel mean value and the first channel standard deviation corresponding to the visible light sample image to obtain input data corresponding to the visible light sample image.
And carrying out normalization processing on the thermal infrared sample image according to the second channel mean value and the second channel standard deviation corresponding to the thermal infrared sample image to obtain input data corresponding to the thermal infrared sample image.
The channel mean may be the sum of the pixel values of all pixels on any channel divided by the number of pixels. The channel standard deviation measures the degree of dispersion of the pixels in that channel, i.e. the deviation of each pixel value from the channel mean, and is the square root of the variance. The larger the standard deviation, the more dispersed the pixel values, and vice versa.
Specifically, the visible light sample image has an R channel, a G channel and a B channel. For any one of the three RGB channels, the first channel mean of that channel is subtracted from the pixel data on that channel, and the result is divided by the first channel standard deviation of that channel, so that input data corresponding to the visible light sample image on that channel can be obtained. For the thermal infrared sample image, the second channel mean is subtracted from its pixel data and the result is divided by the second channel standard deviation, so that input data corresponding to the thermal infrared sample image can be obtained.
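A minimal sketch of this per-channel normalization follows; the channel statistics and the random test images are hypothetical placeholders, since the real values would be computed from the training set.

```python
import numpy as np

def normalize(image, channel_mean, channel_std):
    """Subtract the per-channel mean, then divide by the per-channel
    standard deviation. `image` is HxWxC with values in [0, 255]."""
    image = image.astype(np.float32) / 255.0  # scale to [0, 1] first
    return (image - channel_mean) / channel_std

# Hypothetical channel statistics (ImageNet-style numbers, for
# illustration only) applied to random stand-in images:
rgb_image = np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8)
th_image = np.random.randint(0, 256, (640, 640, 1), dtype=np.uint8)
rgb_input = normalize(rgb_image, np.array([0.485, 0.456, 0.406]),
                      np.array([0.229, 0.224, 0.225]))
th_input = normalize(th_image, np.array([0.45]), np.array([0.22]))
```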
In the embodiment, through normalization processing, the dimensional difference between different variables can be eliminated, so that comparison and analysis can be better performed, and the consumption of calculation resources can be saved.
In some embodiments, model loss data is calculated using the following loss function:
L = L1 + L2 + L3 + L4
L1 = MSE(RGB_i, RGB′_i)
L2 = MSE(TH_i, TH′_i)
L3 = MSE(RGB_i, TH′_i)
L4 = MSE(TH_i, RGB′_i)
wherein L is the model loss data, L1 is the first single-mode similarity loss, L2 is the second single-mode similarity loss, L3 is the first multi-mode interaction loss, L4 is the second multi-mode interaction loss, RGB_i is the visible light extraction feature, RGB′_i is the visible light reconstruction feature, TH_i is the thermal infrared extraction feature, TH′_i is the thermal infrared reconstruction feature, i denotes the feature index, and MSE denotes the mean square error loss.
The mean square error loss is the mean value of the sum of squares of the errors of the corresponding points of the predicted data and the original data.
Specifically, similarity comparison is performed on the visible light reconstruction feature and the visible light extraction feature, the parameters of the initial pedestrian detection model are iterated by using a mean square error loss function MSE, loss between the visible light reconstruction feature and the visible light extraction feature is minimized, and more effective fusion features can be obtained. The first single mode similarity loss equation between the visible light reconstruction feature and the visible light extraction feature is as follows:
L1 = MSE(RGB_i, RGB′_i)
and (3) comparing the similarity of the thermal infrared reconstruction feature and the thermal infrared extraction feature, iterating parameters of the initial pedestrian detection model by using a mean square error loss function MSE, and minimizing the loss between the thermal infrared reconstruction feature and the thermal infrared extraction feature, so that more effective fusion features can be obtained. The second single-mode similarity loss equation between the thermal infrared reconstruction feature and the thermal infrared extraction feature is as follows:
L2 = MSE(TH_i, TH′_i)
And carrying out interactive supervision on the visible light extraction characteristics and the thermal infrared reconstruction characteristics, and carrying out loss calculation on the two mode data characteristics. By using the mean square error loss function MSE, the parameters of the initial pedestrian detection model are iterated, the loss between the visible light extraction feature and the thermal infrared reconstruction feature is minimized, and more effective fusion features can be obtained. The first multi-modal interaction loss equation between the visible light extraction feature and the thermal infrared reconstruction feature is as follows:
L3 = MSE(RGB_i, TH′_i)
and carrying out interactive supervision on the thermal infrared extraction characteristics and the visible light reconstruction characteristics, and carrying out loss calculation on the two mode data characteristics. By using the mean square error loss function MSE, the parameters of the initial pedestrian detection model are iterated, the loss between the thermal infrared extraction feature and the visible light reconstruction feature is minimized, and more effective fusion features can be obtained. The second multi-modal interaction loss equation between the thermal infrared extraction feature and the visible light reconstruction feature is as follows:
L4 = MSE(TH_i, RGB′_i)
and adding the single-mode similarity loss and the multi-mode interaction loss to obtain model loss data, and iterating and optimizing the loss to obtain a target pedestrian detection model. The model loss data is formulated as follows:
L = L1 + L2 + L3 + L4
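Assuming the four features are PyTorch tensors of identical shape, the combined loss above can be sketched directly from the formulas; the function name model_loss and the argument names are illustrative.

```python
import torch.nn.functional as F

def model_loss(rgb_feat, rgb_recon, th_feat, th_recon):
    """Combined loss L = L1 + L2 + L3 + L4 from the formulas above."""
    l1 = F.mse_loss(rgb_recon, rgb_feat)  # first single-mode similarity loss
    l2 = F.mse_loss(th_recon, th_feat)    # second single-mode similarity loss
    l3 = F.mse_loss(th_recon, rgb_feat)   # first multi-mode interaction loss
    l4 = F.mse_loss(rgb_recon, th_feat)   # second multi-mode interaction loss
    return l1 + l2 + l3 + l4
```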
in the above embodiment, the visible light extraction feature and the thermal infrared extraction feature may be optimized by determining a first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, and a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image. By determining the first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature and the second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature, the quality of the fusion feature can be improved, and therefore the accuracy of pedestrian detection results is improved.
The embodiment of the present disclosure provides a pedestrian detection method, referring to fig. 5, the method may include the following steps:
S510, obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene.
S520, inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model obtained in any one of the above embodiments to detect pedestrians, thereby obtaining a pedestrian detection result.
Specifically, any pedestrian scene is shot through the image acquisition equipment, so that an initial visible light to-be-detected image and an initial thermal infrared to-be-detected image can be obtained. Inputting the initial visible light to-be-detected image and the initial thermal infrared to-be-detected image into a target pedestrian detection model, and realizing pedestrian detection through the target pedestrian detection model to obtain a pedestrian detection result.
In the above embodiment, the visible light to-be-detected image and the thermal infrared to-be-detected image obtained by shooting any pedestrian scene are obtained and input into the target pedestrian detection model obtained in any one of the above embodiments to perform pedestrian detection, so as to obtain a pedestrian detection result. The pedestrian detection can be applied to the aspects of vehicle auxiliary driving systems, intelligent video monitoring, robots, aerial images, man-machine interaction systems, motion analysis and the like.
The embodiment of the present disclosure provides a pedestrian detection method, referring to fig. 6, the method may include the following steps:
S610, obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene.
S620, inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model to detect pedestrians, and obtaining a pedestrian detection result.
The loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss. The single-mode loss and the cross-mode loss are used to supervise training of the target pedestrian detection model's capability to extract multi-mode fusion image features between the visible light to-be-detected image and the thermal infrared to-be-detected image. The single-mode loss includes a first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image and a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image. The cross-modal loss includes a first multi-modal interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-modal interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
Specifically, any pedestrian scene is shot through the image acquisition equipment, so that an initial visible light to-be-detected image and an initial thermal infrared to-be-detected image can be obtained. And carrying out image preprocessing on the initial visible light to-be-detected image and the initial thermal infrared to-be-detected image to obtain the visible light to-be-detected image and the thermal infrared to-be-detected image which have the same size as the input data of the target pedestrian detection model. And inputting the visible light to-be-detected image and the thermal infrared to-be-detected image subjected to image pretreatment into a target pedestrian detection model, and realizing pedestrian detection through the target pedestrian detection model to obtain a pedestrian detection result.
In the above embodiment, the visible light to-be-detected image and the thermal infrared to-be-detected image obtained by shooting any pedestrian scene are obtained and input into the target pedestrian detection model to perform pedestrian detection, so as to obtain a pedestrian detection result. The pedestrian detection can be applied to the aspects of vehicle auxiliary driving systems, intelligent video monitoring, robots, aerial images, man-machine interaction systems, motion analysis and the like.
In some embodiments, referring to fig. 7, the target pedestrian detection model includes a first encoding network and a second encoding network in parallel, the first encoding network and the second encoding network being commonly connected to the fusion component. Inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model to detect pedestrians and obtain a pedestrian detection result may comprise the following steps:
And S710, inputting the visible light to-be-detected image into a first coding network for feature extraction to obtain visible light to-be-detected features.
S720, inputting the thermal infrared to-be-detected image into a second coding network for feature extraction to obtain the thermal infrared to-be-detected feature.
And S730, performing element-by-element addition operation on the visible light to-be-detected feature and the thermal infrared to-be-detected feature through the fusion component to obtain to-be-detected fusion image features.
And S740, performing target detection based on the fusion image characteristics to be detected to obtain a pedestrian detection result.
Specifically, the visible light to-be-detected image is input into a first coding network included in the target pedestrian detection model to perform feature extraction to obtain the visible light to-be-detected feature, and the thermal infrared to-be-detected image is input into a second coding network included in the target pedestrian detection model to perform feature extraction to obtain the thermal infrared to-be-detected feature. And carrying out element-by-element addition operation on the visible light feature to be detected and the thermal infrared feature to be detected through a fusion component included in the target pedestrian detection model, so as to obtain the feature of the fusion image to be detected. And carrying out target detection on the fusion image characteristics to be detected to obtain a pedestrian detection result.
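As a hedged illustration of this inference flow (steps S710 to S740), the sketch below assumes the trained model exposes encoder_rgb, encoder_th, fusion and head submodules; these attribute names are hypothetical, not taken from the patent.

```python
import torch

def detect_pedestrians(model, rgb_img, th_img):
    """Inference sketch: extract features from each modality, fuse them
    element-wise, then run target detection on the fused feature."""
    with torch.no_grad():
        rgb_feat = model.encoder_rgb(rgb_img)    # visible light feature
        th_feat = model.encoder_th(th_img)       # thermal infrared feature
        fused = model.fusion(rgb_feat, th_feat)  # element-wise addition
        return model.head(fused)                 # pedestrian detection result
```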
In the above embodiment, the visible light to-be-detected image is input into the first coding network to perform feature extraction to obtain the visible light to-be-detected feature, the thermal infrared to-be-detected image is input into the second coding network to perform feature extraction to obtain the thermal infrared to-be-detected feature, the element-by-element addition operation is performed on the visible light to-be-detected feature and the thermal infrared to-be-detected feature through the fusion component to obtain the to-be-detected fusion image feature, and the target detection is performed based on the to-be-detected fusion image feature to obtain the pedestrian detection result. The pedestrian detection can be applied to the aspects of vehicle auxiliary driving systems, intelligent video monitoring, robots, aerial images, man-machine interaction systems, motion analysis and the like.
In some embodiments, pedestrian detection is performed based on the feature of the fusion image to be detected, and a pedestrian detection result is obtained, which may include: and carrying out convolution, pooling and activation processing operations according to the characteristics of the fusion image to be detected so as to carry out target detection and obtain a pedestrian detection result.
Specifically, the fusion image features to be detected are input into a convolution layer for convolution processing to obtain a convolution processing result. The convolution processing result is input into a pooling layer for pooling processing to obtain a pooling processing result. The pooling processing result is then activated through an activation layer to perform target detection, so as to obtain the pedestrian detection result. The activation processing may be implemented by a sigmoid function, for example.
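A minimal sketch of such a convolution / pooling / activation step is shown below; it only mirrors the operations named above and is not the model's actual YOLOv5 detection head.

```python
import torch.nn as nn

class DetectHead(nn.Module):
    """Toy detection step: convolution, max pooling, then a sigmoid
    activation producing per-cell pedestrian confidence scores."""
    def __init__(self, in_ch=512):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=2)
        self.act = nn.Sigmoid()

    def forward(self, fused):
        return self.act(self.pool(self.conv(fused)))
```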
In the above embodiment, convolution, pooling and activation processing are performed according to the feature of the fusion image to be detected, so as to perform target detection, and obtain a pedestrian detection result. The pedestrian detection can be applied to the aspects of vehicle auxiliary driving systems, intelligent video monitoring, robots, aerial images, man-machine interaction systems, motion analysis and the like.
The embodiment of the specification also provides a training method of the pedestrian detection model, wherein the initial pedestrian detection model comprises a first coding network and a second coding network which are connected in parallel, the first coding network and the second coding network are connected to a fusion component together, and the fusion component is connected with a first decoding network and a second decoding network which are connected in parallel. For example, referring to fig. 8, the training method of the pedestrian detection model may include the steps of:
S802, acquiring an initial visible light image and an initial thermal infrared image which are obtained by shooting a target pedestrian scene.
S804, preprocessing the initial visible light image and the initial thermal infrared image according to the input data size of the initial pedestrian detection model to obtain a visible light sample image and a thermal infrared sample image.
S806, inputting the visible light sample image into a first coding network for feature extraction to obtain visible light mode features.
S808, inputting the thermal infrared sample image into a second coding network for feature extraction to obtain thermal infrared mode features.
And S810, performing element-by-element addition operation on the visible light mode characteristics and the thermal infrared mode characteristics through a fusion component to obtain multi-mode fusion image characteristics.
And S812, inputting the multi-mode fusion image features into a first decoding network and a second decoding network to respectively reconstruct the features, and correspondingly obtaining visible light reconstruction features and thermal infrared reconstruction features.
S814, determining a first single-mode similarity loss between a visible light extraction feature and a visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
Specifically, model loss data is calculated using the following loss function:
L = L1 + L2 + L3 + L4
L1 = MSE(RGB_i, RGB′_i)
L2 = MSE(TH_i, TH′_i)
L3 = MSE(RGB_i, TH′_i)
L4 = MSE(TH_i, RGB′_i)
wherein L is the model loss data, L1 is the first single-mode similarity loss, L2 is the second single-mode similarity loss, L3 is the first multi-mode interaction loss, L4 is the second multi-mode interaction loss, RGB_i is the visible light mode feature, RGB′_i is the visible light reconstruction feature, TH_i is the thermal infrared mode feature, TH′_i is the thermal infrared reconstruction feature, i denotes the feature index, and MSE denotes the mean square error loss.
And S816, performing single-mode supervision training and cross-mode supervision training on the initial pedestrian detection model simultaneously according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until the model stopping training condition is met, and obtaining a target pedestrian detection model.
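Putting the pieces together, a minimal training-loop sketch for steps S802 to S816 might look as follows; `model` is assumed to return the four features consumed by the model_loss sketch above, and a fixed epoch budget stands in for the model stop-training condition.

```python
import torch

def train(model, loader, epochs=100, lr=1e-3):
    """Single-mode and cross-mode supervised training sketch: forward the
    preprocessed, normalized sample pairs, fuse and reconstruct inside the
    model, then optimize the combined loss L = L1 + L2 + L3 + L4."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for rgb, th in loader:  # visible light / thermal infrared pairs
            rgb_feat, th_feat, rgb_recon, th_recon = model(rgb, th)
            loss = model_loss(rgb_feat, rgb_recon, th_feat, th_recon)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```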
Referring to fig. 9, a training apparatus 900 for a pedestrian detection model is provided in an embodiment of the present disclosure, the training apparatus 900 for a pedestrian detection model includes: a sample image acquisition module 910, an image feature reconstruction module 920, a loss data determination module 930, and a detection model determination module 940.
A sample image acquisition module 910 for acquiring a visible light sample image and a thermal infrared sample image for a target pedestrian scene;
The image feature reconstruction module 920 is configured to perform feature reconstruction based on the multi-mode fusion image feature between the visible light sample image and the thermal infrared sample image, so as to obtain a visible light reconstruction feature and a thermal infrared reconstruction feature; the multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image;
a loss data determining module 930, configured to determine a first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature;
and the detection model determining module 940 is configured to perform single-mode supervised training and cross-mode supervised training on the initial pedestrian detection model simultaneously according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until a model stopping training condition is met, so as to obtain a target pedestrian detection model.
Referring to fig. 10, the embodiment of the present disclosure provides a pedestrian detection device 1000, and the pedestrian detection device 1000 includes: the device comprises a to-be-detected image acquisition module 1010 and a detection result determination module 1020.
The to-be-detected image obtaining module 1010 is configured to obtain a visible light to-be-detected image and a thermal infrared to-be-detected image that are obtained by shooting for any pedestrian scene;
the detection result determining module 1020 is configured to input the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model obtained in any one of the above embodiments to perform pedestrian detection, thereby obtaining a pedestrian detection result.
Referring to fig. 11, the embodiment of the present disclosure provides a pedestrian detection apparatus 1100, the pedestrian detection apparatus 1100 includes: the device comprises an image acquisition module 1110 to be detected, a detection result determining module 1120, a similarity loss determining module 1130 and an interaction loss determining module 1140.
The to-be-detected image obtaining module 1110 is configured to obtain a visible light to-be-detected image and a thermal infrared to-be-detected image that are obtained by shooting for any pedestrian scene;
the detection result determining module 1120 is configured to input the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model to perform pedestrian detection, so as to obtain a pedestrian detection result; the loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss; the single-mode loss and the cross-mode loss are used together to supervise training of the target pedestrian detection model's capability to extract multi-mode fusion image characteristics between the visible light to-be-detected image and the thermal infrared to-be-detected image;
The single-mode loss comprises a first single-mode similar loss between a visible light extraction feature and a visible light reconstruction feature of the visible light sample image and a second single-mode similar loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of the thermal infrared sample image;
the cross-modal loss includes a first multi-modal interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, a second multi-modal interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
For specific description of the training device of the pedestrian detection model and the pedestrian detection device, reference may be made to the description of the training method of the pedestrian detection model and the pedestrian detection method hereinabove, and the description thereof will not be repeated here.
In some embodiments, a computer device is provided, which may be a terminal, and an internal structure diagram thereof may be as shown in fig. 12. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a training method for a pedestrian detection model. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 12 is merely a block diagram of a portion of the structure associated with the aspects disclosed herein and is not limiting of the computer device to which the aspects disclosed herein apply, and in particular, the computer device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The present description provides a chip comprising a memory unit storing a computer program and a processing unit implementing the steps of the method of any of the above embodiments when the computer program is executed by the processing unit.
In some embodiments, a computer device is provided, comprising a memory in which a computer program is stored, and a processor which, when executing the computer program, carries out the method steps of the above embodiments.
The present description embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method of any of the above embodiments.
An embodiment of the present specification provides a computer program product comprising instructions which, when executed by a processor of a computer device, enable the computer device to perform the steps of the method of any one of the embodiments described above.
It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein, for example, may be considered as an ordered listing of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

Claims (20)

1. A method of training a pedestrian detection model, the method comprising:
obtaining a visible light sample image and a thermal infrared sample image aiming at a target pedestrian scene;
performing feature reconstruction based on multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features; the multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image;
determining a first single-mode similarity loss between a visible light extraction feature and the visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between a thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature;
and according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss, performing single-mode supervision training and cross-mode supervision training on the initial pedestrian detection model at the same time until the model stopping training condition is met, and obtaining a target pedestrian detection model.
2. The method of claim 1, wherein the initial pedestrian detection model comprises first and second encoding networks connected in parallel, the first and second encoding networks being commonly connected to a fusion component having connected thereto first and second decoding networks connected in parallel; the feature reconstruction is performed based on the multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features, including:
inputting the visible light sample image into the first coding network for feature extraction to obtain visible light mode features;
inputting the thermal infrared sample image into the second coding network for feature extraction to obtain thermal infrared mode features;
performing element-by-element addition operation on the visible light mode characteristics and the thermal infrared mode characteristics through the fusion component to obtain the multi-mode fusion image characteristics;
and inputting the multi-mode fusion image features into the first decoding network and the second decoding network to respectively perform feature reconstruction, and correspondingly obtaining the visible light reconstruction features and the thermal infrared reconstruction features.
3. The method of claim 1, wherein the determining the multi-modality fusion image feature comprises:
acquiring visible light mode characteristics of the visible light sample image and thermal infrared mode characteristics of the thermal infrared sample image;
and performing element-by-element addition operation on the thermal infrared mode characteristics and the visible light mode characteristics to obtain the multi-mode fusion image characteristics.
4. The method according to claim 1, wherein the method further comprises:
acquiring an initial visible light image and an initial thermal infrared image which are obtained by shooting the target pedestrian scene;
preprocessing the initial visible light image and the initial thermal infrared image according to the input data size of the initial pedestrian detection model to obtain the visible light sample image and the thermal infrared sample image; and taking the normalized visible light sample image and the normalized thermal infrared sample image as input data of the initial pedestrian detection model.
5. The method of claim 4, wherein the visible light sample image and the thermal infrared sample image are normalized by:
Normalizing the visible light sample image according to a first channel mean value and a first channel standard deviation corresponding to the visible light sample image to obtain input data corresponding to the visible light sample image;
and carrying out normalization processing on the thermal infrared sample image according to a second channel mean value and a second channel standard deviation corresponding to the thermal infrared sample image to obtain input data corresponding to the thermal infrared sample image.
6. The method according to any one of claims 1 to 5, wherein model loss data is calculated using the following loss function:
L = L1 + L2 + L3 + L4
L1 = MSE(RGB_i, RGB′_i)
L2 = MSE(TH_i, TH′_i)
L3 = MSE(RGB_i, TH′_i)
L4 = MSE(TH_i, RGB′_i)
wherein L is the model loss data, L1 is the first single-mode similarity loss, L2 is the second single-mode similarity loss, L3 is the first multi-mode interaction loss, L4 is the second multi-mode interaction loss, RGB_i is the visible light extraction feature, RGB′_i is the visible light reconstruction feature, TH_i is the thermal infrared extraction feature, TH′_i is the thermal infrared reconstruction feature, i denotes the feature index, and MSE denotes the mean square error loss.
7. A pedestrian detection method, the method comprising:
obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot aiming at any pedestrian scene;
Inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model obtained by the method of any one of claims 1 to 6 for pedestrian detection, so as to obtain a pedestrian detection result.
8. A pedestrian detection method, the method comprising:
obtaining a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot aiming at any pedestrian scene;
inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model for pedestrian detection to obtain a pedestrian detection result; the loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss; the single-mode loss and the cross-mode loss are used together to supervise training of the target pedestrian detection model's capability to extract multi-mode fusion image characteristics between the visible light to-be-detected image and the thermal infrared to-be-detected image;
the single-mode loss comprises a first single-mode similar loss between a visible light extraction feature and a visible light reconstruction feature of the visible light sample image and a second single-mode similar loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of the thermal infrared sample image;
The cross-modal loss includes a first multi-modal interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, a second multi-modal interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
9. The method of claim 8, wherein the target pedestrian detection model comprises first and second encoding networks in parallel, the first and second encoding networks being commonly connected to a fusion component; and wherein inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model for pedestrian detection to obtain a pedestrian detection result comprises the following steps:
inputting the visible light to-be-detected image into the first coding network for feature extraction to obtain visible light to-be-detected features;
inputting the thermal infrared image to be detected into the second coding network for feature extraction to obtain thermal infrared features to be detected;
performing element-by-element addition operation on the visible light feature to be detected and the thermal infrared feature to be detected through the fusion component to obtain a fusion image feature to be detected;
and performing target detection based on the fusion image features to be detected to obtain the pedestrian detection result.
10. The method according to claim 9, wherein the step of performing pedestrian detection based on the feature of the fusion image to be detected to obtain the pedestrian detection result includes:
and carrying out convolution, pooling and activation processing operations according to the characteristics of the fusion image to be detected so as to carry out target detection and obtain the pedestrian detection result.
11. A training device for a pedestrian detection model, the device comprising:
the sample image acquisition module is used for acquiring a visible light sample image and a thermal infrared sample image aiming at a target pedestrian scene;
the image feature reconstruction module is used for carrying out feature reconstruction based on the multi-mode fusion image features between the visible light sample image and the thermal infrared sample image to obtain visible light reconstruction features and thermal infrared reconstruction features; the multi-mode fusion image features are obtained by performing element-by-element addition operation based on the features of the visible light sample image and the features of the thermal infrared sample image;
the loss data determining module is used for determining a first single-mode similarity loss between the visible light extraction feature and the visible light reconstruction feature of the visible light sample image, a second single-mode similarity loss between the thermal infrared extraction feature and the thermal infrared reconstruction feature of the thermal infrared sample image, a first multi-mode interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, and a second multi-mode interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature;
And the detection model determining module is used for simultaneously carrying out single-mode supervision training and cross-mode supervision training on the initial pedestrian detection model according to the first single-mode similarity loss, the second single-mode similarity loss, the first multi-mode interaction loss and the second multi-mode interaction loss until the model stopping training condition is met, so as to obtain a target pedestrian detection model.
12. The apparatus of claim 11, wherein the initial pedestrian detection model comprises first and second encoding networks connected in parallel, the first and second encoding networks being commonly connected to a fusion component having connected thereto first and second decoding networks connected in parallel; the image feature reconstruction module is further configured to input the visible light sample image into the first coding network for feature extraction, so as to obtain visible light mode features; inputting the thermal infrared sample image into the second coding network for feature extraction to obtain thermal infrared mode features; performing element-by-element addition operation on the visible light mode characteristics and the thermal infrared mode characteristics through the fusion component to obtain the multi-mode fusion image characteristics; and inputting the multi-mode fusion image features into the first decoding network and the second decoding network to respectively perform feature reconstruction, and correspondingly obtaining the visible light reconstruction features and the thermal infrared reconstruction features.
13. The apparatus of claim 11, further comprising a fusion feature determination module for acquiring a visible light mode feature of the visible light sample image and a thermal infrared mode feature of the thermal infrared sample image; and performing element-by-element addition operation on the thermal infrared mode characteristics and the visible light mode characteristics to obtain the multi-mode fusion image characteristics.
14. The apparatus according to any one of claims 11 to 13, wherein model loss data is calculated using the following loss function:
L = L1 + L2 + L3 + L4
L1 = MSE(RGB_i, RGB′_i)
L2 = MSE(TH_i, TH′_i)
L3 = MSE(RGB_i, TH′_i)
L4 = MSE(TH_i, RGB′_i)
wherein L is the model loss data, L1 is the first single-mode similarity loss, L2 is the second single-mode similarity loss, L3 is the first multi-mode interaction loss, L4 is the second multi-mode interaction loss, RGB_i is the visible light extraction feature, RGB′_i is the visible light reconstruction feature, TH_i is the thermal infrared extraction feature, TH′_i is the thermal infrared reconstruction feature, i denotes the feature index, and MSE denotes the mean square error loss.
15. A pedestrian detection apparatus, characterized in that the apparatus comprises:
the to-be-detected image acquisition module is used for acquiring a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene;
The detection result determining module is configured to input the visible light to-be-detected image and the thermal infrared to-be-detected image into the target pedestrian detection model obtained by the method according to any one of claims 1 to 6 to perform pedestrian detection, so as to obtain a pedestrian detection result.
16. A pedestrian detection apparatus, characterized in that the apparatus comprises:
the to-be-detected image acquisition module is used for acquiring a visible light to-be-detected image and a thermal infrared to-be-detected image which are shot for any pedestrian scene;
the detection result determining module is used for inputting the visible light to-be-detected image and the thermal infrared to-be-detected image into a target pedestrian detection model to detect pedestrians and obtain pedestrian detection results; the loss data of the target pedestrian detection model in the training process comprises single-mode loss and cross-mode loss; the single-mode loss and the cross-mode loss are used together to supervise training of the target pedestrian detection model's capability to extract multi-mode fusion image characteristics between the visible light to-be-detected image and the thermal infrared to-be-detected image;
the single-mode loss comprises a first single-mode similar loss between a visible light extraction feature and a visible light reconstruction feature of the visible light sample image and a second single-mode similar loss between a thermal infrared extraction feature and a thermal infrared reconstruction feature of the thermal infrared sample image;
The cross-modal loss includes a first multi-modal interaction loss between the thermal infrared reconstruction feature and the visible light extraction feature, a second multi-modal interaction loss between the visible light reconstruction feature and the thermal infrared extraction feature.
17. The apparatus of claim 16, wherein the target pedestrian detection model comprises first and second encoding networks in parallel, the first and second encoding networks being commonly connected to a fusion component;
the detection result determining module is further configured to input the visible light to-be-detected image into the first coding network for feature extraction, so as to obtain visible light to-be-detected features; inputting the thermal infrared image to be detected into the second coding network for feature extraction to obtain thermal infrared features to be detected; performing element-by-element addition operation on the visible light feature to be detected and the thermal infrared feature to be detected through the fusion component to obtain a fusion image feature to be detected; and performing target detection based on the fusion image features to be detected to obtain the pedestrian detection result.
18. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when the computer program is executed.
19. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 10.
20. A chip comprising a memory unit and a processing unit, the memory unit storing a computer program, characterized in that the processing unit implements the steps of the method of any of claims 1 to 10 when the computer program is executed.
CN202311062534.2A 2023-08-22 2023-08-22 Training of pedestrian detection model, pedestrian detection method, device, equipment and medium Pending CN117036890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311062534.2A CN117036890A (en) 2023-08-22 2023-08-22 Training of pedestrian detection model, pedestrian detection method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311062534.2A CN117036890A (en) 2023-08-22 2023-08-22 Training of pedestrian detection model, pedestrian detection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117036890A true CN117036890A (en) 2023-11-10

Family

ID=88642937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311062534.2A Pending CN117036890A (en) 2023-08-22 2023-08-22 Training of pedestrian detection model, pedestrian detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117036890A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022104618A1 (en) * 2020-11-19 2022-05-27 Intel Corporation Bidirectional compact deep fusion networks for multimodality visual analysis applications
WO2022127112A1 (en) * 2020-12-14 2022-06-23 奥比中光科技集团股份有限公司 Cross-modal face recognition method, apparatus and device, and storage medium
CN114612937A (en) * 2022-03-15 2022-06-10 西安电子科技大学 Single-mode enhancement-based infrared and visible light fusion pedestrian detection method
CN115457456A (en) * 2022-08-22 2022-12-09 武汉理工大学 Multispectral pedestrian detection method and system based on intelligent vehicle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MINGYUE LI et al.: "Multimodal Interactive Supervised Pedestrian Detection Based on YOLOv5", 2023 3rd International Symposium on Computer Technology and Information Science, pages 61-64 *

Similar Documents

Publication Publication Date Title
Dasgupta et al. Spatio-contextual deep network-based multimodal pedestrian detection for autonomous driving
Luo et al. Thermal infrared image colorization for nighttime driving scenes with top-down guided attention
US11106903B1 (en) Object detection in image data
CN113673425B (en) Multi-view target detection method and system based on Transformer
KR20210031427A (en) Methods, devices, computer devices and media for recognizing traffic images
CN109657581A (en) Urban track traffic gate passing control method based on binocular camera behavioral value
Li et al. IVFuseNet: Fusion of infrared and visible light images for depth prediction
He et al. A feature fusion method to improve the driving obstacle detection under foggy weather
Singh Surround-view vision-based 3d detection for autonomous driving: A survey
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN110490171B (en) Dangerous posture recognition method and device, computer equipment and storage medium
Zhao et al. FSDF: A high-performance fire detection framework
Zhou et al. A pedestrian extraction algorithm based on single infrared image
Wei et al. Infrared pedestrian detection using improved UNet and YOLO through sharing visible light domain information
CN117789153B (en) Automobile oil tank outer cover positioning system and method based on computer vision
CN117475355A (en) Security early warning method and device based on monitoring video, equipment and storage medium
Yang et al. A review on infrared and visible image fusion algorithms based on neural networks
Zhang et al. A quality index metric and method for online self-assessment of autonomous vehicles sensory perception
CN116681687B (en) Wire detection method and device based on computer vision and computer equipment
CN113489958A (en) Dynamic gesture recognition method and system based on video coding data multi-feature fusion
CN117115630A (en) Multispectral vehicle re-identification method under strong light based on cyclical consistency
Maheswari et al. Thermal infrared image semantic segmentation for night-time driving scenes based on deep learning
CN116051872A (en) Feature point matching method of cross-spectrum image
CN117036890A (en) Training of pedestrian detection model, pedestrian detection method, device, equipment and medium
Chen et al. Transformer fusion-based scale-aware attention network for multispectral victim detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination