CN112183236A - Unmanned aerial vehicle aerial photography video content identification method, device and system - Google Patents


Info

Publication number
CN112183236A
Authority
CN
China
Prior art keywords
processing
network model
aerial vehicle
unmanned aerial
sample image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010946775.3A
Other languages
Chinese (zh)
Inventor
吴晓琳
杜永红
张凯
夏林元
杨嘉贺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Ju Zhuo Technology Co ltd
Original Assignee
Foshan Ju Zhuo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Ju Zhuo Technology Co ltd filed Critical Foshan Ju Zhuo Technology Co ltd
Priority to CN202010946775.3A priority Critical patent/CN112183236A/en
Publication of CN112183236A publication Critical patent/CN112183236A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based method for recognizing the content of aerial video shot by an unmanned aerial vehicle, comprising the following steps: acquiring, in real time, video images shot by the unmanned aerial vehicle; performing frame extraction on the video images to obtain sample images, and preprocessing the sample images; labeling the objects to be recognized in the sample images to generate an object class database; expanding the object class database with data augmentation techniques; training a deep neural network model on the expanded object class database, the model being an SSD network model that includes a multi-branch convolution structure and a multi-scale feature map fusion structure; and recognizing the video images with the trained deep neural network model to output the position and size information of each object class. The invention further discloses a deep-learning-based device and system for recognizing the content of aerial video shot by an unmanned aerial vehicle. The method and device effectively address the problem that, in aerial video object detection, class recognition and position localization interfere with each other.

Description

Unmanned aerial vehicle aerial photography video content identification method, device and system
Technical Field
The invention relates to the technical field of image recognition, and in particular to a deep-learning-based method, device and system for recognizing the content of aerial video shot by an unmanned aerial vehicle.
Background
In recent years, with the continuous development of computer, multimedia and network technology, video-capturing devices such as mobile phones, cameras and surveillance monitors have become widespread, and video resources are increasingly abundant. Quickly and accurately extracting the information in video is therefore ever more important, and deep-learning-based methods, with their great potential, are being applied and developed throughout computer vision as both a present and a future trend. In particular, accurately detecting and localizing scene text and recognizing objects in video in real time has important applications in scenarios such as public-security surveillance, security protection, unmanned aerial vehicle flight and autonomous driving.
In recent years, researchers in China and abroad have proposed many deep neural network models for extracting visual features; some have begun to study the application of deep neural networks in the video domain, proposing networks for video action recognition and feature extraction, and deep neural networks have also been introduced into video content retrieval to extract structured information from video.
To date, deep convolutional neural networks have become the standard approach to object detection; current high-performance detectors and the latest research are likewise built on them. To improve detection speed, Liu et al. proposed the SSD network, which performs object classification and bounding-box regression on feature maps of different sizes, so that targets of different scales are detected on correspondingly sized maps, and which dispenses with the RPN, greatly accelerating the network. Redmon et al. proposed the YOLO network, which divides the input image into a 7 × 7 grid of regions and performs classification and box regression on each region directly through the neural network, omitting the per-scale classification and regression of feature maps and further increasing detection speed. Compared with Fast-RCNN, SSD and YOLO suffer a slight loss of detection accuracy. On the basis of these three models, many other object detection networks have since been proposed.
Disclosure of Invention
The technical problem the invention sets out to solve is to provide a deep-learning-based method, device and system for recognizing the content of aerial video shot by an unmanned aerial vehicle, effectively addressing the problem that class recognition and position localization interfere with each other in aerial video object detection.
To solve this technical problem, the invention provides a deep-learning-based method for recognizing the content of aerial video shot by an unmanned aerial vehicle, comprising the following steps: acquiring, in real time, video images shot by the unmanned aerial vehicle; performing frame extraction on the video images to obtain sample images, and preprocessing the sample images; labeling the objects to be recognized in the sample images to generate an object class database; expanding the object class database with data augmentation techniques; training a deep neural network model on the expanded object class database, the model being an SSD network model that includes a multi-branch convolution structure and a multi-scale feature map fusion structure; and recognizing the video images with the trained deep neural network model to output the position and size information of each object class.
As an improvement of the above scheme, preprocessing the sample image comprises: correcting the sample image with a distortion correction algorithm to form a regular, planar sample image; and compressing the corrected sample image so that it reaches a target size suitable for object recognition.
As an improvement of the above scheme, the object to be recognized in the sample image is labeled by manual annotation and/or by an image object detection algorithm.
As an improvement of the above scheme, expanding the object class database with data augmentation techniques comprises: applying augmentation to the sample images in the object class database with random, independently superposed probabilities, the augmentation including rotation, padded cropping and grayscale conversion.
As an improvement of the above scheme, training the deep neural network model with the expanded object class database comprises: inputting several sample images from the object class database into the deep neural network model; convolving the sample images through the multi-branch convolution layers respectively; normalizing the convolved sample images respectively to generate single-scale feature maps; performing feature fusion on all feature maps; and convolving the concatenated feature map through a convolution layer to generate the branch convolution feature map.
As an improvement of the above scheme, performing feature fusion on all feature maps comprises: unifying the sizes of all feature maps; performing class recognition and position localization on each size-unified feature map; and fusing, by weighting, the results of all the feature maps thus recognized and localized.
Correspondingly, the invention also provides a deep-learning-based device for recognizing the content of aerial video shot by an unmanned aerial vehicle, comprising: an acquisition module for acquiring, in real time, video images shot by the unmanned aerial vehicle; a preprocessing module for performing frame extraction on the video images to obtain sample images and preprocessing the sample images; a labeling module for labeling the objects to be recognized in the sample images to generate an object class database; an expansion module for expanding the object class database with data augmentation techniques; a training module for training a deep neural network model on the expanded object class database, the model being an SSD network model that includes a multi-branch convolution structure and a multi-scale feature map fusion structure; and a recognition module for recognizing the video images with the trained deep neural network model to output the position and size information of each object class.
As an improvement of the above scheme, the training module comprises: an input unit for inputting several sample images from the object class database into the deep neural network model; a first convolution unit for convolving the sample images through the multi-branch convolution layers respectively; a normalization unit for normalizing the convolved sample images respectively to generate single-scale feature maps; a fusion unit for performing feature fusion on all feature maps; and a second convolution unit for convolving the concatenated feature map through a convolution layer to generate the branch convolution feature map.
As an improvement of the above scheme, the fusion unit comprises: a size-adjustment subunit for unifying the sizes of all feature maps; a recognition-and-localization subunit for performing class recognition and position localization on each size-unified feature map; and a feature-fusion subunit for fusing, by weighting, the results of all the feature maps thus recognized and localized.
Correspondingly, the invention also provides a deep-learning-based system for recognizing the content of aerial video shot by an unmanned aerial vehicle, comprising an unmanned aerial vehicle platform and the above recognition device, wherein the platform carries a visible-light camera and a thermal infrared camera and performs multi-source image acquisition.
The implementation of the invention has the following beneficial effects:
The invention improves the existing SSD network model: a multi-branch convolution structure is added to improve the network's detection of small targets, a multi-scale feature map fusion structure fuses feature maps of different scales, and the deep neural network model is trained on the expanded object class database, solving the problem that class recognition and position localization interfere with each other in aerial video object detection.
Drawings
FIG. 1 is a flowchart of an embodiment of an unmanned aerial vehicle aerial video content identification method based on deep learning according to the present invention;
FIG. 2 is a flowchart of an embodiment of training a deep neural network model using an augmented object class database according to the present invention;
FIG. 3 is a schematic diagram of the structure of the multi-branch convolution of the SSD network model of the present invention;
FIG. 4 is a schematic diagram of a multi-scale feature map fusion structure of an SSD network model in the present invention;
FIG. 5 is a schematic diagram of an SSD network model in the present invention;
FIG. 6 is a schematic structural diagram of the unmanned aerial vehicle aerial photography video content identification system based on deep learning according to the present invention;
fig. 7 is a schematic structural diagram of the unmanned aerial vehicle aerial photography video content recognition device based on deep learning.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 shows a flowchart of an embodiment of the method for identifying content of an aerial video shot by an unmanned aerial vehicle based on deep learning, which includes:
and S101, acquiring a video image shot by the unmanned aerial vehicle in real time.
In the invention, the unmanned aerial vehicle shoots top-down at low altitude and transmits the captured video images in real time to the host computer (that is, the aerial video content recognition device), so that the host computer acquires the video images shot by the unmanned aerial vehicle in real time.
S102, performing frame extraction on the video images to obtain sample images, and preprocessing the sample images.
The frame-extraction strategy may extract the first and last frames of each specified time interval, or extract random frames from a video segment; it is not specifically limited here and may be chosen as needed.
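As a minimal sketch of the two strategies just described (function names and parameters are hypothetical, for illustration only):

```python
import random

def interval_endpoint_frames(total_frames, fps, interval_s):
    """Indices of the first and last frame of each interval_s-second window."""
    per = max(1, int(fps * interval_s))
    idx = []
    for start in range(0, total_frames, per):
        end = min(start + per - 1, total_frames - 1)
        idx.append(start)
        if end != start:
            idx.append(end)
    return idx

def random_frames(total_frames, n, seed=None):
    """n distinct random frame indices from a video segment."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_frames), n))
```

For a 10-second clip at 10 fps with a 2-second interval, `interval_endpoint_frames(100, 10, 2)` yields the endpoints of each window: `[0, 19, 20, 39, 40, 59, 60, 79, 80, 99]`.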
Specifically, the step of preprocessing the sample image includes:
(1) Correcting the sample image with a distortion correction algorithm to form a regular, planar sample image.
During correction, the fisheye-lens image from the unmanned aerial vehicle can be processed with a distortion correction algorithm to obtain a regular, planar sample image.
(2) Compressing the corrected sample image so that it reaches a target size suitable for object recognition.
After correction, the sample image is compressed and resized to one of several target sizes at which object recognition can be performed. For example, the target size may be set to 300 × 300 pixels.
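The compression step can be sketched as a nearest-neighbour resize; the preceding distortion correction would in practice use a calibrated undistortion routine (e.g. OpenCV's), which is omitted here. The function name and sizes are illustrative assumptions:

```python
import numpy as np

def compress_to_target(img, out_h=300, out_w=300):
    """Nearest-neighbour resize of an H x W x C frame to the target size."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return img[rows][:, cols]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # a hypothetical corrected frame
resized = compress_to_target(frame)
```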
And S103, marking the object to be identified in the sample image to generate an object type database.
The object to be recognized in the sample image is labeled by manual annotation and/or by an image object detection algorithm.
It should be noted that, when no relevant prior data are available, the various objects in the sample images may be labeled manually, labeling only the classes for which content recognition is needed; where relevant base data exist, other image object detection algorithms can be used for automatic machine annotation. Once labeling is complete, the object class database is obtained and serves as the target space for video content recognition.
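The patent does not fix a storage schema for the object class database; the following is one hypothetical form of an annotation entry, with the set of labeled classes acting as the recognition target space:

```python
def make_record(image_id, category, x, y, w, h):
    """One annotation entry: a labeled bounding box in pixel coordinates (hypothetical schema)."""
    return {"image": image_id, "category": category, "bbox": [x, y, w, h]}

database = [
    make_record("frame_000123.jpg", "vehicle", 40, 60, 32, 18),
    make_record("frame_000123.jpg", "person", 90, 44, 8, 20),
    make_record("frame_000124.jpg", "vehicle", 41, 61, 32, 18),
]
# the set of labeled categories forms the target space for content recognition
categories = sorted({r["category"] for r in database})
```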
And S104, expanding the object type database by using a data enhancement technology.
Specifically, expanding the object class database with data augmentation techniques comprises: applying augmentation to the sample images in the object class database with random, independently superposed probabilities, the augmentation including rotation, padded cropping and grayscale conversion.
The object class database from step S103 is expanded with data augmentation to increase its content diversity; specifically, a sample image in the database is transformed into a new sample image through augmentation operations including rotation, padded cropping and grayscale conversion.
It should be noted that augmentation is used only during training, not in testing or practical application. Each augmentation operation is applied to the original sample image with a certain probability, ensuring randomness of the result, which then serves as input data for the iterative training of the model.
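The random-probability superposition can be sketched as below; each operation fires independently with probability p. The concrete operations here (a 90-degree turn, an 8-pixel pad-and-crop, a channel-mean grayscale) are simplified stand-ins for the rotation, padded cropping and grayscale processing named in the text:

```python
import random
import numpy as np

def augment(img, p=0.5, rng=None):
    """Apply each enhancement independently with probability p (illustrative stand-ins)."""
    rng = rng or random.Random()
    out = img
    if rng.random() < p:        # rotation: a 90-degree turn stands in for arbitrary angles
        out = np.rot90(out)
    if rng.random() < p:        # padded crop: pad the borders, then crop back to size
        h, w = out.shape[:2]
        padded = np.pad(out, ((8, 8), (8, 8), (0, 0)), mode="edge")
        y, x = rng.randrange(16), rng.randrange(16)
        out = padded[y:y + h, x:x + w]
    if rng.random() < p:        # grayscale: channel mean, replicated back to 3 channels
        gray = out.mean(axis=2, keepdims=True)
        out = np.repeat(gray, 3, axis=2).astype(out.dtype)
    return out

demo = np.zeros((64, 64, 3), dtype=np.uint8)
aug = augment(demo, p=1.0, rng=random.Random(0))
```

With p = 0 the image passes through unchanged, matching the note that augmentation is skipped outside training.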
And S105, training a deep neural network model by using the extended object class database.
The deep neural network model is an SSD network model, and the SSD network model comprises a multi-branch convolution structure and a multi-scale feature map fusion structure.
Compared with the prior art, the invention improves the conventional SSD network model: a multi-branch convolution structure is added to improve the network's detection of small targets, a multi-scale feature map fusion structure fuses feature maps of different scales, and the object class database expanded in step S104 is used to train the deep neural network model, solving the problem that class recognition and position localization interfere with each other in aerial video object detection.
And S106, recognizing the video image by using the trained deep neural network model so as to output the position information and the size information of each object class.
The trained deep neural network model recognizes the content in the video images transmitted back by the unmanned aerial vehicle, localizes the position and size of each object class, and finally outputs the object content classes and position information in the corresponding windows.
Thus, by improving the SSD network model, the invention effectively raises the model's recognition speed and efficiency when processing video images, resolves the conflict between class recognition and position localization in aerial video content detection, and improves recognition accuracy.
Referring to fig. 2, fig. 2 is a flowchart illustrating an embodiment of training a deep neural network model using an augmented object class database according to the present invention, which includes:
s201, inputting a plurality of sample images in the object class database into the deep neural network model.
S202, convolving the several sample images through the multi-branch convolution layers respectively.
S203, normalizing the convolved sample images respectively to generate single-scale feature maps.
And S204, performing feature fusion processing on all feature maps.
Specifically, the step of performing feature fusion processing on all feature maps includes:
(1) all the feature maps are subjected to size unification processing;
(2) respectively carrying out category identification and position positioning processing on each feature map with uniform size;
(3) and performing feature fusion processing on all the feature maps subjected to the identification and positioning processing according to a weighting mode.
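Step (3) — weighted fusion of the per-map recognition and localization results — can be sketched as follows. The anchor counts and class counts are illustrative; per the later description, the weights are learned by the network and initialized to 1/k:

```python
import numpy as np

def fuse_results(cls_list, box_list, weights=None):
    """Weighted fusion of per-map class scores and box regressions over k size-unified maps."""
    k = len(cls_list)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, float)
    w = w / w.sum()                        # keep the fusion a convex combination
    cls = sum(wi * c for wi, c in zip(w, cls_list))
    box = sum(wi * b for wi, b in zip(w, box_list))
    return cls, box

scores = [np.full((5, 4), s) for s in (0.2, 0.5, 0.8)]  # 3 maps, 5 anchors, 4 classes
boxes = [np.zeros((5, 4)) for _ in range(3)]
fused_cls, fused_box = fuse_results(scores, boxes)
```

With the default 1/3 weights, the three score maps average to 0.5 everywhere in this toy case.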
S205, convolving the concatenated feature map through the convolution layer to generate the branch convolution feature map.
Thus, the invention improves the SSD network model by adding a branch-then-concatenate convolution structure, and performs multi-scale fusion on the feature maps that this structure generates for the same picture.
The following describes the training process of the deep neural network model in further detail with reference to fig. 3 to 5:
As shown in fig. 3, in the multi-branch convolution structure of the improved SSD network model, the sample image passes through parallel branches of n × n and m × m convolution kernels, is normalized and fused, and the branch convolution feature map is then obtained by a 1 × 1 convolution. The multi-branch convolution operation may be repeated over several convolution-pooling stages. Preferably, n = 1 and m = 3.
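A single-channel toy sketch of this branch structure, assuming two branches (n = 1, m = 3), simple mean/std normalization in place of batch normalization, and a scalar in place of the final 1 × 1 convolution:

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive 'same'-padded 2-D convolution on a single-channel map."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * kernel).sum()
    return out

def multi_branch_block(x, n=1, m=3):
    """n x n and m x m branches -> per-branch normalization -> fusion -> 1 x 1 conv."""
    def norm(t):                       # stands in for batch normalization
        return (t - t.mean()) / (t.std() + 1e-6)
    b_n = norm(conv2d_same(x, np.ones((n, n))))            # 1 x 1 branch
    b_m = norm(conv2d_same(x, np.ones((m, m)) / (m * m)))  # 3 x 3 averaging branch
    fused = b_n + b_m                  # element-wise fusion of the branches
    return 1.0 * fused                 # the final 1 x 1 conv reduces to a scalar weight here

branch_map = multi_branch_block(np.random.rand(8, 8))
```

The 'same' padding keeps the branch convolution feature map at the input's spatial size, as the fusion structure downstream requires.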
As shown in fig. 4, in the multi-scale feature map fusion structure of the improved SSD network model, the number of fused feature maps is set to k = 3, and the input is three consecutive feature maps: layer m-1 undergoes one convolution, the layer-m feature map is left unchanged, and layer m+1 undergoes one deconvolution, so that the three feature maps are brought to a common size. Class recognition and position localization are then performed on each of the three maps, and their results are fused by weighting; the fusion weights are learned by the network and initialized to 1/3.
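The size-unification step can be sketched with simple stand-ins: stride-2 subsampling in place of the convolution on layer m-1, and nearest-neighbour 2× upsampling in place of the deconvolution on layer m+1 (learned kernels are omitted in this toy version):

```python
import numpy as np

def downsample2(x):
    """Stride-2 subsampling; stands in for the convolution applied to layer m-1."""
    return x[::2, ::2]

def upsample2(x):
    """Nearest-neighbour 2x upsampling; stands in for the deconvolution on layer m+1."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def unify_sizes(f_prev, f_mid, f_next):
    """Bring three consecutive maps (k = 3) to the size of the middle one."""
    return downsample2(f_prev), f_mid, upsample2(f_next)

maps = unify_sizes(np.ones((16, 16)), np.ones((8, 8)), np.ones((4, 4)))
```

After unification all three maps share the middle map's 8 × 8 size, so per-map recognition results can be weighted and summed element-wise.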
Thus, unlike the traditional SSD, which performs class recognition and position localization from a single feature map, the invention takes k consecutive feature maps as input (the features of deep maps have stronger representational power, while those of shallow maps aid position localization), converts the differently scaled maps to the same size, and performs class recognition and position localization on each of the k maps.
As shown in fig. 5, the improved SSD network model follows the basic structure of a one-stage detection network; its backbone is VGG-16, and from the last convolution layer the multi-branch convolution structure is applied repeatedly through several convolution-pooling stages to obtain several feature maps of different scales. A window of size t slides over the feature-map sequence; the t feature maps inside the window are fed to the feature-fusion algorithm, fusing maps of different scales, and the object position regression and class recognition results of the corresponding window are output in turn, with the final recognition result obtained after non-maximum suppression. For example, with the number of convolution-pooling stages set to five, a total of 6 feature maps of different scales are obtained, and the sliding window size may be set to t = 3.
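The final non-maximum suppression step is standard; a minimal greedy sketch (boxes as (x1, y1, x2, y2) corner coordinates, threshold and sample detections chosen for illustration):

```python
def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the boxes kept."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

detections = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
kept = nms(detections, [0.9, 0.8, 0.7])
```

Here the second box heavily overlaps the higher-scoring first box and is suppressed, while the distant third box survives.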
By adding a branch-then-concatenate convolution structure to the SSD network model and fusing the multi-scale feature maps it generates for the same picture, the invention effectively resolves the conflict between class recognition and position localization in aerial video content detection and improves recognition accuracy.
Referring to fig. 6, fig. 6 shows a specific structure of the unmanned aerial vehicle aerial video content identification system based on deep learning, which includes an unmanned aerial vehicle platform 1 and an unmanned aerial vehicle aerial video content identification device 2.
The unmanned aerial vehicle platform 1 carries a visible-light camera and a thermal infrared camera and performs multi-source image acquisition. In the invention, the unmanned aerial vehicle shoots top-down at low altitude and transmits the captured video images in real time to the aerial video content recognition device, which thus acquires the video images shot by the unmanned aerial vehicle in real time.
Specifically, the unmanned aerial vehicle platform 1 comprises a power supply, a computer mainboard, a ground control client, a visible-light camera, a thermal infrared camera, a camera mount, an image acquisition card, a 4G module and a base station. The platform is equipped with a flight controller, a power system, GPS, a battery and the like, and supports module expansion. The computer mainboard, visible-light camera and thermal infrared camera are all fixed on the platform; the image acquisition card ensures that the mainboard can capture the thermal infrared camera's image data; the mainboard is installed with the acquisition card's driver and, using the SDK development kit matched to the card, is programmed to acquire the visible-light and thermal infrared camera data synchronously; the 4G module is mounted on the mainboard and connects to the base station by automatic dialing; the ground monitoring client also connects to the base station, linking the on-board mainboard with the ground monitoring client.
As shown in fig. 6, the device 2 for identifying content of aerial video captured by unmanned aerial vehicle based on deep learning includes an acquisition module 21, a preprocessing module 22, a labeling module 23, an expansion module 24, a training module 25, and an identification module 26, specifically:
and the acquisition module 21 is used for acquiring the video image shot by the unmanned aerial vehicle in real time.
The preprocessing module 22 performs frame extraction on the video images to obtain sample images and preprocesses them. The frame-extraction strategy may extract the first and last frames of each specified time interval, or extract random frames from a video segment; it is not specifically limited here and may be chosen as needed. After frame extraction, the preprocessing module 22 corrects the sample images with a distortion correction algorithm to form regular, planar sample images, then compresses and resizes them to the target sizes at which object recognition can be performed. For example, the target size may be set to 300 × 300 pixels.
The labeling module 23 labels the objects to be recognized in the sample images to generate the object class database. The labeling module 23 may label the objects by manual annotation and/or with an image object detection algorithm: when no relevant prior data are available, the various objects in the sample images may be labeled manually, labeling only the classes for which content recognition is needed; where relevant base data exist, other image object detection algorithms can be used for automatic machine annotation. Once labeling is complete, the object class database is obtained and serves as the target space for video content recognition.
The expansion module 24 expands the object class database with data augmentation techniques to increase its content diversity; specifically, a sample image in the database is transformed into a new sample image through augmentation operations including rotation, padded cropping and grayscale conversion. Augmentation is used only during training, not in testing or practical application; each operation is applied to the original sample image with a certain probability, ensuring randomness of the result, which then serves as input data for the iterative training of the model.
The training module 25 is configured to train the deep neural network model using the expanded object class database. The deep neural network model is an SSD network model that includes a multi-branch convolution structure and a multi-scale feature-map fusion structure.
The recognition module 26 is configured to recognize the video image using the trained deep neural network model, outputting the position and size information of each object class, and finally outputting the object content class and position information in the corresponding window.
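The patent does not define the detector's raw output layout; the sketch below assumes a common (class_id, score, cx, cy, w, h) row format and shows how such rows could be turned into the per-window report of object class, position, and size that the recognition module outputs. The threshold value is illustrative.

```python
def format_detections(raw, class_names, score_thresh=0.5):
    """Convert raw detector rows (class_id, score, cx, cy, w, h)
    into records of object class, position, and size."""
    report = []
    for class_id, score, cx, cy, w, h in raw:
        if score < score_thresh:
            continue  # drop low-confidence detections
        report.append({
            "class": class_names[int(class_id)],
            "position": (float(cx), float(cy)),   # window-centre coordinates
            "size": (float(w), float(h)),
        })
    return report
```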
The invention thus improves the existing SSD network model: a multi-branch convolution structure is added to improve the network's detection of small targets, and a multi-scale feature-map fusion structure fuses feature maps of different scales. Training the deep neural network model with the expanded object class database addresses the problem that, in aerial-video object detection, category identification and position localization interfere with each other.
As shown in fig. 7, the training module 25 includes:
an input unit 251, configured to input the plurality of sample images in the object class database into the deep neural network model;
a first convolution unit 252 configured to perform convolution processing on the plurality of sample images by using the multi-branch convolution layer, respectively;
a normalization unit 253, configured to normalize each convolved sample image to generate a feature map at one scale;
a fusion unit 254, configured to perform feature fusion processing on all feature maps;
and a second convolution unit 255, configured to convolve the concatenated (spliced) feature map through a convolution layer to generate a branch-convolution feature map.
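The branch-convolve / normalize / splice / final-convolution sequence performed by units 252–255 can be sketched numerically. This is a single-channel NumPy stand-in, not the actual SSD layers: the kernels, the normalization formula, and the use of a weighted channel sum as the final 1×1-style convolution are assumptions for illustration.

```python
import numpy as np

def conv2d(x, k):
    """Minimal valid-mode single-channel 2-D convolution (illustration only)."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def multi_branch_block(x, kernels, fuse_weights=None):
    """One multi-branch convolution step: convolve the input on each
    branch, normalize each branch output, splice (stack) the branches,
    then merge them into a branch-convolution feature map."""
    branches = []
    for k in kernels:
        f = conv2d(x, k)
        f = (f - f.mean()) / (f.std() + 1e-8)   # per-branch normalization
        branches.append(f)
    stacked = np.stack(branches)                 # splice: branch-wise stack
    if fuse_weights is None:
        fuse_weights = np.ones(len(branches)) / len(branches)
    # a 1x1 convolution across spliced branches is a weighted channel sum
    return np.tensordot(fuse_weights, stacked, axes=1)
```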
Further, the fusion unit 254 includes:
the size-adjustment subunit is used for unifying the sizes of all the feature maps;
the recognition-and-localization subunit is used for performing category recognition and position localization on each size-unified feature map;
and the feature-fusion subunit is used for fusing all the recognized and localized feature maps in a weighted manner.
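The size-unification and weighted fusion performed by the fusion unit 254 can be sketched as follows. The patent does not give the fusion weights or the common size; both are illustrative parameters here, and nearest-neighbour resampling stands in for whatever resizing the implementation actually uses.

```python
import numpy as np

def resize_map(fm, size):
    """Nearest-neighbour resampling of one 2-D feature map."""
    rows = np.arange(size[0]) * fm.shape[0] // size[0]
    cols = np.arange(size[1]) * fm.shape[1] // size[1]
    return fm[rows][:, cols]

def fuse_feature_maps(feature_maps, weights, size=(38, 38)):
    """Unify feature-map sizes, then combine them with a weighted sum."""
    unified = [resize_map(f, size) for f in feature_maps]
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize the fusion weights
    return sum(wi * fi for wi, fi in zip(w, unified))
```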
The method improves the SSD network model by adding branch-then-concatenate convolution kernels, and fuses the multi-scale feature maps that these kernels generate for images of the same class. This effectively resolves the conflict between category identification and position localization in aerial-video content detection and improves the accuracy of content identification.
The foregoing is a preferred embodiment of the present invention, and it should be noted that it would be apparent to those skilled in the art that various modifications and enhancements can be made without departing from the principles of the invention, and such modifications and enhancements are also considered to be within the scope of the invention.

Claims (10)

1. An unmanned aerial vehicle aerial video content identification method based on deep learning, characterized by comprising the following steps:
acquiring a video image shot by an unmanned aerial vehicle in real time;
performing frame extraction processing on the video image to extract a sample image, and performing preprocessing on the sample image;
marking the object to be identified in the sample image to generate an object type database;
augmenting the object class database with data enhancement techniques;
training a deep neural network model by using the extended object class database, wherein the deep neural network model is an SSD network model, and the SSD network model comprises a multi-branch convolution structure and a multi-scale feature map fusion structure;
and identifying the video image by using the trained deep neural network model so as to output the position information and the size information of each object class.
2. The method for identifying content of aerial video shot by unmanned aerial vehicle based on deep learning of claim 1, wherein the step of preprocessing the sample image comprises:
correcting the sample image by using a distortion correction algorithm to form a sample image of a regular plane;
and compressing the corrected sample image so that it reaches a target size at which target recognition can be performed.
3. The method for identifying content of aerial video shot by unmanned aerial vehicle based on deep learning of claim 1, wherein the method for labeling the object to be identified in the sample image comprises a manual labeling method and/or an image target detection algorithm.
4. The method for identifying content of aerial video shot by unmanned aerial vehicle based on deep learning of claim 1, wherein the step of augmenting the object class database with data enhancement technology comprises:
and carrying out data enhancement processing on the sample image in the object class database in a random probability superposition mode, wherein the data enhancement processing comprises rotation processing, filling type cutting processing and gray data processing.
5. The method for unmanned aerial vehicle aerial video content recognition based on deep learning of claim 1, wherein the step of training the deep neural network model using the augmented object class database comprises:
inputting a plurality of sample images in the object class database into the deep neural network model;
carrying out convolution processing on a plurality of sample images through a multi-branch convolution layer respectively;
respectively normalizing the convolved sample images to generate a feature map at one scale;
performing feature fusion processing on all feature graphs;
and performing convolution processing on the spliced feature map through a convolution layer to generate a branch convolution feature map.
6. The method for identifying the content of the aerial video shot by the unmanned aerial vehicle based on the deep learning as claimed in claim 5, wherein the step of performing the feature fusion processing on all the feature maps comprises:
all the feature maps are subjected to size unification processing;
respectively carrying out category identification and position positioning processing on each feature map with uniform size;
and performing feature fusion processing on all the feature maps subjected to the identification and positioning processing according to a weighting mode.
7. An unmanned aerial vehicle aerial video content recognition device based on deep learning, characterized by comprising:
the acquisition module is used for acquiring a video image shot by the unmanned aerial vehicle in real time;
the preprocessing module is used for performing frame extraction processing on the video image to extract a sample image and preprocessing the sample image;
the marking module is used for marking the object to be identified in the sample image to generate an object type database;
an expansion module for expanding the object class database using data enhancement techniques;
the training module is used for training a deep neural network model by utilizing the extended object class database, wherein the deep neural network model is an SSD network model, and the SSD network model comprises a multi-branch convolution structure and a multi-scale feature map fusion structure;
and the identification module is used for identifying the video image by using the trained deep neural network model so as to output the position information and the size information of each object type.
8. The apparatus for unmanned aerial vehicle aerial video content recognition based on deep learning of claim 7, wherein the training module comprises:
an input unit, configured to input a plurality of sample images in the object class database into the deep neural network model;
the first convolution unit is used for respectively carrying out convolution processing on the plurality of sample images through the multi-branch convolution layer;
the normalization unit is used for respectively normalizing the convolved sample images to generate a feature map at one scale;
the fusion unit is used for carrying out feature fusion processing on all the feature maps;
and the second convolution unit is used for performing convolution processing on the spliced feature map through the convolution layer to generate a branch convolution feature map.
9. The apparatus for identifying content of aerial video shot by unmanned aerial vehicle based on deep learning of claim 8, wherein the fusion unit comprises:
the size adjusting subunit is used for carrying out size unified processing on all the feature maps;
the recognition positioning subunit is used for respectively performing category recognition and position positioning processing on each feature map with the unified size;
and the characteristic fusion subunit is used for performing characteristic fusion processing on all the characteristic graphs after the identification and positioning processing according to a weighting mode.
10. An unmanned aerial vehicle aerial video content identification system based on deep learning is characterized by comprising an unmanned aerial vehicle platform and the unmanned aerial vehicle aerial video content identification device of any one of claims 7 to 9, wherein the unmanned aerial vehicle platform is a platform carrying a visible light camera and a thermal infrared camera and used for multi-source image acquisition.
CN202010946775.3A 2020-09-10 2020-09-10 Unmanned aerial vehicle aerial photography video content identification method, device and system Pending CN112183236A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010946775.3A CN112183236A (en) 2020-09-10 2020-09-10 Unmanned aerial vehicle aerial photography video content identification method, device and system


Publications (1)

Publication Number Publication Date
CN112183236A true CN112183236A (en) 2021-01-05

Family

ID=73921703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010946775.3A Pending CN112183236A (en) 2020-09-10 2020-09-10 Unmanned aerial vehicle aerial photography video content identification method, device and system

Country Status (1)

Country Link
CN (1) CN112183236A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022252892A1 * 2021-06-02 2022-12-08 Ping An Technology (Shenzhen) Co., Ltd. System and method for image-based crop identification
CN113723181A * 2021-07-20 2021-11-30 深圳大学 Unmanned aerial vehicle aerial photography target detection method and device
CN113723181B * 2021-07-20 2023-10-20 深圳大学 Unmanned aerial vehicle aerial photographing target detection method and device
CN114494984A * 2022-04-18 2022-05-13 四川腾盾科技有限公司 Random static target three-dimensional reconstruction and positioning method based on unmanned aerial vehicle aerial photography data
CN115296738A * 2022-07-28 2022-11-04 吉林大学 Unmanned aerial vehicle visible light camera communication method and system based on deep learning
CN115296738B * 2022-07-28 2024-04-16 吉林大学 Deep learning-based unmanned aerial vehicle visible light camera communication method and system
CN115239603A * 2022-09-23 2022-10-25 成都视海芯图微电子有限公司 Unmanned aerial vehicle aerial image dim light enhancing method based on multi-branch neural network
CN116958713A * 2023-09-20 2023-10-27 中航西安飞机工业集团股份有限公司 Quick recognition and statistics method and system for surface fastener of aviation part
CN116958713B * 2023-09-20 2023-12-15 中航西安飞机工业集团股份有限公司 Quick recognition and statistics method and system for surface fastener of aviation part


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination