CN116935310A - Real-time video monitoring bird density estimation method and system based on deep learning - Google Patents

Real-time video monitoring bird density estimation method and system based on deep learning Download PDF

Info

Publication number
CN116935310A
CN116935310A (application CN202310857022.9A)
Authority
CN
China
Prior art keywords
panoramic
bird
data set
density
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310857022.9A
Other languages
Chinese (zh)
Inventor
Lei Jialin (雷佳琳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bainiao Data Technology Beijing Co ltd
Original Assignee
Bainiao Data Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bainiao Data Technology Beijing Co ltd filed Critical Bainiao Data Technology Beijing Co ltd
Priority to CN202310857022.9A priority Critical patent/CN116935310A/en
Publication of CN116935310A publication Critical patent/CN116935310A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/09 - Supervised learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Abstract

The application relates to the technical field of image recognition, and particularly to a method and system for estimating bird density from real-time video surveillance based on deep learning. The method comprises the following steps: acquiring a scene video, and extracting and stitching frames from it to obtain a site panoramic photo; filtering the collected site panoramic photos to construct a data set for training a neural network; preprocessing the data set and constructing a mapping relation between the labels and the data set; constructing and training a panoramic bird number statistical network model, inputting the panoramic bird density map to be counted into the model, and calculating the total number of birds. The application can effectively improve the neural network's extraction of image features, alleviates the degradation and training-convergence difficulties that come with deep networks, and uses an adaptive Gaussian blur algorithm to generate density map labels, so the spatial distribution of birds in a given image can be displayed alongside the total count.

Description

Real-time video monitoring bird density estimation method and system based on deep learning
Technical Field
The application belongs to the technical field of image recognition, and particularly relates to a real-time video monitoring bird density estimation method and system based on deep learning.
Background
Sound wildlife and habitat protection policies and practices rely on timely and reliable species monitoring data, and collecting such data is one of the important responsibilities of reserve staff. Traditional ground surveys are an important tool for species monitoring; however, relying on personnel surveys is time-consuming, labor-intensive and difficult, and can bias the resulting data.
In recent years, emerging technologies such as aerial imagery, bio-telemetry, infrared cameras, real-time video monitoring and passive acoustic recording have greatly reduced monitoring costs while expanding coverage and improving efficiency and accuracy, providing new opportunities for conservation research at local, regional and global scales; ecology and conservation are now entering the era of big data. Real-time video monitoring devices have many advantages in assisting species surveys, such as remote and non-invasive observation, real-time monitoring, and convenient cloud storage and playback. High-definition monitoring devices installed in protected areas offer convenience for discovering new species, monitoring the behavior of particular species, investigating animal activity hotspots and habitat-use patterns, preventing illegal trade, and improving public awareness. In the last decade, to reduce observation costs, more and more reserve administrators around the world have installed high-definition video surveillance equipment to assist the daily management of protected areas.
While the growth in data volume allows unprecedented insight into conservation biology and ecology, it also presents the dual challenge of storage and analysis: the amount of raw monitoring data (video, images) keeps growing, but the capacity to process, analyze and interpret that data in support of conservation has not kept pace, which can lead to large amounts of data going unused.
Some studies have demonstrated the feasibility of using deep learning to estimate wild animal numbers from static images (e.g., aerial photographs and pictures from citizen-science smartphone applications), but few have explored the accuracy and efficiency of deep learning algorithms for counting animals in real-time surveillance video; most density estimation studies have focused on crowd density estimation and have not been optimized for analyzing bird counts captured by real-time surveillance.
Disclosure of Invention
The embodiment of the application aims to provide a deep-learning-based real-time video surveillance bird density estimation method, to solve the problem in the prior art that bird density statistics are difficult to apply in real-time monitoring scenarios.
The embodiment of the application is realized as follows: the deep-learning-based real-time video surveillance bird density estimation method comprises the following steps:
acquiring a scene video, extracting and splicing the scene video to obtain a site panoramic photo;
data filtering is carried out on the acquired site panoramic photos, and a data set for training a neural network is constructed;
preprocessing a data set, and constructing a mapping relation between the label and the data set;
constructing a panoramic bird number statistical network model, training, inputting a panoramic bird density map to be counted into the panoramic bird number statistical network model, and calculating the total number of birds.
Preferably, the step of acquiring a scene video and extracting and stitching to obtain a site panoramic photo based on the scene video specifically comprises: acquiring a scene video, obtaining key frames in the scene video, naming the key frames numerically, extracting a fixed sequence of key frames from them, and stitching the extracted key frames with the pictures captured from the scene video in matrix order to obtain the site panoramic photo.
Preferably, the step of data filtering the collected site panoramic photos to construct a data set for neural network training specifically comprises: deleting photos whose resolution is lower than a preset value from the site panoramic photos, removing photos with motion blur from the site panoramic photos, and eliminating photos whose target count is lower than a preset value from the site panoramic photos.
Preferably, the image cropping module scales and crops the site panoramic photos in the data set to obtain N minimum processing image units, wherein the number N of minimum processing image units is determined by the following formula:

N = ⌊x / x₀⌋ × ⌊y / y₀⌋

wherein x and y are the true pixel dimensions of the input site panoramic photo, x₀ is the first pixel count, y₀ is the second pixel count, and λ is the remainder against the fixed pixel value, used to adjust the y-direction scaling.
Preferably, the depth density estimation network generates density map labels through a Gaussian blur algorithm, specifically comprising: generating a two-dimensional matrix with the same resolution as the original image, transforming the coordinates through the input/output resolution ratio, and setting each transformed label coordinate xᵢ to 1 through a delta function, as follows:

H(x) = Σᵢ δ(x - xᵢ), i = 1 … P

wherein P is the number of annotated targets; the density map is obtained by convolving a two-dimensional Gaussian kernel with the delta function, and perspective distortion is handled with a K-nearest-neighbor algorithm.
Preferably, the working process of the depth density estimation network comprises two parts, front-end feature extraction and back-end density map generation; the depth density estimation network adopts a deep residual network to extract front-end features, extending the deep neural network through residual blocks and shortcut connections; a ResNet34 network is adopted as the front-end feature extraction part and is connected with the back-end density map part, the fully-connected layer of ResNet being removed; dilated convolution is adopted in the back-end density map part.
Preferably, when training the panoramic bird number statistical network model, data set labels are constructed and resized; the resizing introduces errors between labels, calculated as:

Error = (1/M) Σⱼ | Σ₍ₘ,ₙ₎ Fⱼ(m, n) - num_GT |

wherein M is the number of samples in the data set, (m, n) are the image pixel coordinates, Fⱼ is the resized density label of the j-th sample, and num_GT is the number of birds read from the label file.
Preferably, after the panoramic bird number statistical network model is trained, the results are evaluated through the mean absolute error and the root mean square error, the mean absolute error being calculated as:

MAE = (1/N) Σᵢ | zᵢ - ẑᵢ |

wherein N is the number of images to be tested, zᵢ is the number of annotated birds in the i-th image, and ẑᵢ is the estimated count;
the accuracy of the panoramic bird number statistical network model is defined as:
it is another object of an embodiment of the present application to provide a real-time video surveillance bird density estimation system based on deep learning, the system comprising:
the picture splicing processing module is used for acquiring scene videos, extracting and splicing the scene videos to obtain site panoramic pictures;
the data set construction module is used for carrying out data filtering on the acquired site panoramic photos to construct a data set for training the neural network;
the data mapping module is used for preprocessing the data set and constructing a mapping relation between the tag and the data set;
the bird density estimation module is used for constructing a panoramic bird number statistical network model, training, inputting a panoramic bird density map to be counted into the panoramic bird number statistical network model, and calculating the total number of birds.
The deep-learning-based real-time video surveillance bird density estimation method provided by the embodiment of the application can realize automatic snapshot, cropping-and-stitching, and count statistics, and uses a deep residual network to extract front-end features: the residual network extends the deep neural network to 152 layers through shortcut connections, which can effectively improve the neural network's extraction of image features and alleviates the degradation and training-convergence difficulties that come with network depth; an adaptive Gaussian blur algorithm generates the density map labels, so the spatial distribution of birds in a given image can be displayed alongside the total count.
Drawings
FIG. 1 is a flow chart of a method for estimating bird density based on real-time video surveillance of deep learning according to an embodiment of the present application;
fig. 2 is a schematic diagram of a real-time video monitoring bird density estimation system based on deep learning according to an embodiment of the present application;
FIG. 3 is an interface diagram of annotated birds provided by an embodiment of the present application;
fig. 4 is a schematic diagram of a panoramic bird number statistical network model according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It will be understood that the terms "first," "second," and the like, as used herein, may be used to describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another. For example, a first element may be referred to as a second element, and similarly, a second element may be referred to as a first element, without departing from the scope of this disclosure.
The application relates to ecological protection, bird survey statistics, artificial intelligence algorithms and the like, and provides a deep-learning-based real-time video surveillance bird density estimation method that can solve the following technical problems:
1) Real-time problem: existing bird density estimation methods often need to process a large amount of data offline to obtain an estimate, making them difficult to use in real-time monitoring scenarios; the application adopts an automatic acquisition, stitching and recognition-statistics framework that can estimate bird density from real-time video surveillance, meeting the bird monitoring needs of real-time scenarios.
2) Accuracy problem: existing density estimation methods are mainly designed for human crowds, and transferring them directly to bird flock density estimation causes a large drop in accuracy; the application adjusts the algorithm's network structure and proposes a Depth Density Estimation (DDE) neural network model for bird flock data, achieving better accuracy than crowd density algorithms.
3) Data volume problem: existing bird density estimation methods usually need a large amount of annotated data for training; the application adopts deep learning and can make full use of a small amount of annotated data through methods such as transfer learning, reducing the required data volume and lightening the annotation burden.
In the technical scheme disclosed by the application, a monitoring camera must be installed in the acquisition area to provide a real-time picture feed; the camera must support ONVIF (Open Network Video Interface Forum protocol) and be accessible and viewable through its IP address; each camera is fixedly installed at a height of 1.70 meters above the ground, with a maximum zoom of 60×, able to capture clear bird images; the camera can rotate 360° horizontally and from -45° to 90° vertically, and can operate at temperatures between -40°C and 40°C or in humidity up to 93%.
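As an illustration only, a camera meeting these requirements can typically be read over its RTSP stream with OpenCV; the URL, credentials and stream path below are hypothetical and depend on the actual installation:

```python
import cv2

# Hypothetical RTSP address of an ONVIF-compliant camera; user, password,
# IP and stream path must be replaced with the real installation's values.
RTSP_URL = "rtsp://user:password@192.168.1.64:554/stream1"

cap = cv2.VideoCapture(RTSP_URL)
if not cap.isOpened():
    raise RuntimeError("cannot open the camera stream")

ok, frame = cap.read()  # one BGR frame from the live feed
if ok:
    print("frame size:", frame.shape)  # (height, width, 3)
cap.release()
```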
As shown in fig. 1, a flowchart of a method for estimating bird density based on real-time video monitoring of deep learning according to an embodiment of the present application is provided, where the method includes:
s100, acquiring scene videos, and extracting and splicing the scene videos to obtain site panoramic photos.
In this step, data collection is performed in PyCharm (version 2021.3.2), an integrated development environment for Python, which is used to invoke the camera's randomly set patrol tracks to obtain video containing information about the entire scene; key frames are obtained through OpenCV and named numerically; each panoramic video is four seconds long at thirty frames per second, 120 frames in total; after testing, key image frames at a fixed sequence of indices (19, 26, 33, 41, 49, 56, 63, 71, 79, 86, 92, 99, 106, 110, 116) are extracted from the video; finally, the key frames and the pictures captured from the video stream are stitched in matrix order to obtain a panoramic photo of the whole site, i.e., the site panoramic photo.
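A minimal sketch of this extraction step, assuming OpenCV and a local copy of the four-second patrol video (the filename is hypothetical, and simple horizontal concatenation stands in for the matrix-order stitching; a real pipeline would register and blend adjacent frames):

```python
import cv2

# Fixed key-frame indices determined by testing (from the text above).
KEY_FRAMES = [19, 26, 33, 41, 49, 56, 63, 71, 79, 86, 92, 99, 106, 110, 116]

def extract_key_frames(video_path):
    """Grab the fixed-sequence key frames from the 120-frame patrol video."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for idx in KEY_FRAMES:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # jump to frame idx
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

frames = extract_key_frames("patrol.mp4")  # hypothetical filename
panorama = cv2.hconcat(frames)             # naive left-to-right stitch
cv2.imwrite("site_panorama.jpg", panorama)
```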
And S200, data filtering is carried out on the acquired site panoramic photos, and a data set for training the neural network is constructed.
In this step, data filtering is performed; the large amount of data a camera produces must be filtered before it can form a neural network training set; to improve data set quality, samples with high image quality are selected from all the data to form the final data set, mainly based on the following three criteria (a Python filtering sketch follows the list):
A. Image resolution: because low-resolution images have indistinct features, incomplete stitching and partial data loss, images with resolution lower than 4K are removed from the data set;
B. Image sharpness: because images acquired while the camera rotates are affected by motion blur, blurred photo samples must be removed from the data set;
C. Data statistics characteristics: to ensure a reasonable data set distribution and effective neural network training, the numbers of samples at different count scales must be allocated reasonably; pictures with fewer than 10 targets were removed, and images with larger counts (ranging from 50 to 20,000) were screened to keep the distribution reasonable.
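A sketch of the three criteria in Python; the 4K width threshold, the variance-of-Laplacian blur test and its cutoff are assumptions, since the text does not specify how motion blur is measured:

```python
import cv2

MIN_WIDTH = 3840        # assumed "4K" width threshold, in pixels
BLUR_THRESHOLD = 100.0  # assumed variance-of-Laplacian sharpness cutoff
MIN_TARGETS = 10        # minimum annotated targets per photo (criterion C)

def keep_sample(image_path, num_targets):
    """Apply the three filtering criteria to one site panoramic photo."""
    img = cv2.imread(image_path)
    if img is None or img.shape[1] < MIN_WIDTH:  # A. resolution
        return False
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    if sharpness < BLUR_THRESHOLD:               # B. sharpness
        return False
    return num_targets >= MIN_TARGETS            # C. count statistics
```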
S300, preprocessing the data set, and constructing a mapping relation between the label and the data set.
In this step, data annotation is performed; the embodiment of the application provides an annotation tool for bird population counting, as shown in fig. 3: a point is placed at the center of each target and its coordinates are stored, each point representing one bird; a panoramic image is usually very long while desktop display size is limited, so annotating the panorama directly on a desktop is difficult; therefore, all samples in the data set are preprocessed to improve annotation efficiency and reduce annotation cost; the preprocessing is divided into three steps:
A. Determining the minimum unit of image cropping: the input size of the density estimation module designed in the application is fixed at 1024×768, which ensures the computational efficiency of the algorithm while retaining as much valuable feature information of the image as possible.
B. Unifying size: the original images are irregular in size, with lengths ranging from 4k to 30k pixels and widths from 1k to 1.2k pixels, and cannot be cut directly into a whole number of 1024×768 images; the original image is therefore resized, rounding its dimensions to multiples of 1024×768;
C. Cropping: through the three-step cropping preprocessing, the original data set is cut into many 1024×768 small images, improving annotation efficiency; the annotated label information can be mapped back to the names and image sizes in the original data through the correspondence between files (a sketch of this mapping follows).
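A sketch of mapping per-tile point annotations back to panoramic coordinates; the tile-naming convention used here is hypothetical and stands in for the file correspondence mentioned above:

```python
import re

TILE_W, TILE_H = 1024, 768

def tile_points_to_panorama(tile_name, points):
    """Map (x, y) point labels in one 1024x768 tile back into the
    coordinate frame of the original site panoramic photo, assuming
    tiles are named like '<site>_r<row>_c<col>.jpg'."""
    row, col = map(int, re.search(r"_r(\d+)_c(\d+)", tile_name).groups())
    return [(x + col * TILE_W, y + row * TILE_H) for (x, y) in points]

# Example: a bird annotated at (100, 50) in the tile at row 0, column 2
print(tile_points_to_panorama("siteA_r0_c2.jpg", [(100, 50)]))  # [(2148, 50)]
```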
S400, constructing a panoramic bird number statistical network model, training, inputting a panoramic bird density map to be counted into the panoramic bird number statistical network model, and calculating the total number of birds.
In this step, as shown in fig. 4, the panoramic bird number statistical network model consists of three parts: an image cropping module, a depth density estimation network and a concat operation module. The input image is automatically cut by the image cropping module into n 1024×768 sub-images to be processed; the depth density estimation network (DDE) then estimates the bird density of each cropped image, and the Concat Operation Module (COM) stitches the generated density maps into a panoramic bird density map of the whole site and calculates the total number of birds (a stitching-and-summing sketch follows).
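A minimal sketch of the Concat Operation Module's stitching-and-summing step, assuming the per-tile density maps arrive in row-major order; since a density map integrates to a count, the total number of birds is the sum over the stitched map:

```python
import numpy as np

def assemble_and_count(density_tiles, rows, cols):
    """Stitch per-crop density maps back into one panoramic density map
    and integrate it to obtain the total bird count."""
    grid = [np.hstack(density_tiles[r * cols:(r + 1) * cols])
            for r in range(rows)]
    panorama_density = np.vstack(grid)
    total_birds = float(panorama_density.sum())  # density integrates to count
    return panorama_density, total_birds
```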
The steps specifically comprise:
A. image cropping;
Because of the uncertainty of the rotation parameters of the image acquisition equipment, the size of the acquired image data is not fixed; the image is scaled so that its pixel length is an integer multiple of 1024 and its pixel width is an integer multiple of 768, and is then cut into pieces of identical size, 1024×768. The embodiment of the application provides an automatic cropping method to obtain the N minimum processing image units, with the formula as follows:

N = ⌊x / x₀⌋ × ⌊y / y₀⌋

wherein N is the number of minimum processing units after cropping, x and y are the true pixel dimensions of the input site panoramic photo, x₀ is the first pixel count, 1024, and y₀ is the second pixel count, 768; the actual pixel value of the input image is divided by the fixed pixel count 1024 and the integer part b is recorded; the remainder λ of this division against the fixed pixel value is used to adjust the y-direction scaling before cropping.
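A sketch of the automatic cropping under these definitions; the rounding policy applied when a dimension is not an exact multiple of the tile size is an assumption:

```python
import cv2

TILE_W, TILE_H = 1024, 768  # x0 and y0, the fixed minimum-unit size

def crop_to_units(panorama):
    """Scale the panorama so both sides are exact multiples of 1024x768,
    then cut it into the N minimum processing image units."""
    h, w = panorama.shape[:2]
    new_w = max(TILE_W, round(w / TILE_W) * TILE_W)  # assumed rounding policy
    new_h = max(TILE_H, round(h / TILE_H) * TILE_H)
    resized = cv2.resize(panorama, (new_w, new_h))
    tiles = [resized[y:y + TILE_H, x:x + TILE_W]
             for y in range(0, new_h, TILE_H)
             for x in range(0, new_w, TILE_W)]
    return tiles  # len(tiles) == N == (new_w / 1024) * (new_h / 768)
```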
B. Constructing a model;
In order to determine the number of birds in a given image and generate a density map that can display the spatial distribution of birds relative to the total count, the application generates density map labels through an adaptive Gaussian blur algorithm; the process mainly comprises two parts, image annotation display and image conversion representation. Specifically, a two-dimensional matrix with the same resolution as the original image is generated, the coordinates are transformed through the input/output resolution ratio, and each transformed label coordinate xᵢ is set to 1 through a delta function, as follows:

H(x) = Σᵢ δ(x - xᵢ), i = 1 … P

wherein P is the number of annotated birds.
The density map is obtained by convolving a two-dimensional Gaussian kernel with the unit impulse function; considering that the xᵢ of different samples are not completely independent and that a perspective distortion relation exists between them, perspective distortion must be accounted for in processing; a K-nearest-neighbor algorithm is used to compute the average distance between the current sample point xᵢ and its surrounding sample points:

F(x) = Σᵢ δ(x - xᵢ) * G_σᵢ(x), with σᵢ = β · d̄ᵢ

wherein d̄ᵢ is the average neighbor distance for each sample xᵢ and β is a scaling parameter, set to β = 5 by experiment. This yields the depth density estimation network (DDE model), which comprises two parts: front-end feature extraction and back-end density map generation. Because the background of the image data is complex and contains information about many kinds of objects, the embodiment of the application adopts a deep residual network for front-end feature extraction; the residual network extends the deep neural network to as many as 152 layers through shortcut connections, effectively improving the network's extraction of image features and alleviating the degradation and training-convergence difficulties that come with depth. ResNet is modified from the VGG network by adding residual units through a shortcut mechanism, solving the problem that deep networks are difficult to train; ResNet has five structures of different depths, with ResNet152 deepening the structure of ResNet34 for stronger feature extraction capability. Weighing overall network accuracy against algorithm execution efficiency, the embodiment selects the computationally fast ResNet34 as the front-end feature extraction part, links it to the back-end density map generation part, and removes ResNet's fully-connected layer; ResNet34 is pre-trained on ImageNet, a data set of tens of millions of images, and its weights contain rich target feature information; using this structure as the feature extraction network effectively speeds up the convergence of network training (a sketch of the adaptive density map generation follows).
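A sketch of the geometry-adaptive Gaussian blurring described above (one delta impulse per annotated bird, blurred with σᵢ = β·d̄ᵢ computed from the nearest neighbors), using SciPy; the choice of k = 4 neighbors is an assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree

BETA = 5.0  # scaling parameter beta from the text

def adaptive_density_map(points, shape, k=4):
    """Build a density map label: one impulse per (x, y) bird annotation,
    each blurred with sigma_i = BETA * mean distance to its k neighbors.
    Written per point for clarity, not speed."""
    density = np.zeros(shape, dtype=np.float32)
    n = len(points)
    if n == 0:
        return density
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=min(k + 1, n))  # column 0 is the point itself
    for (x, y), d in zip(points, dists):
        impulse = np.zeros(shape, dtype=np.float32)
        impulse[int(y), int(x)] = 1.0               # delta(x - x_i)
        d_bar = d[1:].mean() if n > 1 else 1.0      # mean neighbor distance
        density += gaussian_filter(impulse, sigma=BETA * d_bar)
    return density  # density.sum() ~ number of annotated birds
```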
Dilated convolution increases the receptive field of the convolution operation by inserting holes into the convolution kernel, so that each convolution output contains a wider range of information without requiring a pooling operation; in an ordinary convolution, a 3×3 kernel has a 3×3 receptive field, while with dilation the same 3×3 kernel can cover a 7×7 receptive field; because the birds in the data set are small and individuals may occlude one another, richer and more complete local feature information is needed to generate the feature map; hole (dilated) convolution is therefore chosen as the main component of the back-end density map generation section, and no pooling layer is used, so as to preserve as much image feature information as possible.
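A minimal sketch of a dilated-convolution back end in PyTorch; the channel widths and number of layers are illustrative, not taken from the text:

```python
import torch.nn as nn

# Dilated 3x3 convolutions enlarge the receptive field without pooling;
# padding equals the dilation rate, so spatial size is preserved.
backend = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 128, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 64, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, kernel_size=1),  # single-channel density map output
)
```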
Labeling is a key step in neural network training, and the reliability of the data labels directly determines the accuracy of the supervised learning result. When generating the data set labels, the labels of a crowd density estimation data set were used as reference. Because ResNet34 applies multiple down-sampling operations while developing the density map, the density map is 1/1024 the size of the original image (a factor of 32 per side); the label data must therefore be resized, and experiments show that directly resizing the label data introduces errors between labels, with the error calculated as:

Error = (1/M) Σⱼ | Σ₍ₘ,ₙ₎ Fⱼ(m, n) - num_GT |

wherein M is the number of samples in the data set, (m, n) are the image pixel coordinates, Fⱼ is the resized density label of the j-th sample, and num_GT is the number of birds read from the label file. According to the experimental data, when the number of down-sampling operations exceeds four, the label error greatly affects the algorithm's accuracy; with fewer than four down-samplings the error is smaller, but the amount of data the algorithm must process grows with the output feature map and convergence suffers. Therefore, two down-sampling operations are removed from the DDE model, giving an output feature map of 128×96, which ensures both the convergence speed and the accuracy of the algorithm.
C. Training a model;
80% of the images are randomly assigned to model training and the remaining 20% to model testing and validation; the DDE model is developed using PyTorch 1.8.0 and trained on a GeForce RTX 3090 with 24 GB of memory, with the initial model settings as follows: learning rate = 10×10⁻⁵, batch size = 1, momentum = 0.95, weight decay = 5×10⁻⁴, and an iteration cycle of 200 epochs.
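A training-loop sketch matching the stated settings; `model` and `train_loader` are assumed to be the DDE network and a DataLoader of (image, density label) pairs, and the pixel-wise MSE loss is an assumption, being the usual choice for density map regression:

```python
import torch
from torch import nn, optim

def train_dde(model, train_loader, epochs=200):
    """Train the DDE model with the hyperparameters given above:
    lr = 10x10^-5, batch size 1, momentum 0.95, weight decay 5x10^-4."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    criterion = nn.MSELoss(reduction="sum")  # assumed pixel-wise loss
    optimizer = optim.SGD(model.parameters(), lr=10e-5,
                          momentum=0.95, weight_decay=5e-4)
    for epoch in range(epochs):
        for image, target in train_loader:   # batch size = 1
            optimizer.zero_grad()
            loss = criterion(model(image.to(device)), target.to(device))
            loss.backward()
            optimizer.step()
    return model
```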
D. Evaluating results;
To verify the validity of the recognition results, testing is performed with the mean absolute error (MAE) and the root mean square error (RMSE):

MAE = (1/N) Σᵢ | zᵢ - ẑᵢ |

RMSE = √( (1/N) Σᵢ ( zᵢ - ẑᵢ )² )

wherein N is the number of images to be tested, zᵢ is the number of annotated birds in the i-th image, and ẑᵢ is the estimated count;
the accuracy of the panoramic bird number statistical network model is defined as:

Accuracy = 1 - (1/N) Σᵢ | zᵢ - ẑᵢ | / zᵢ
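A sketch of the evaluation over the N test images; the accuracy follows the per-image relative-error form given above:

```python
import numpy as np

def evaluate(pred_counts, true_counts):
    """Compute MAE, RMSE and accuracy from predicted and annotated
    per-image bird counts."""
    z_hat = np.asarray(pred_counts, dtype=float)
    z = np.asarray(true_counts, dtype=float)
    mae = np.abs(z_hat - z).mean()
    rmse = np.sqrt(((z_hat - z) ** 2).mean())
    accuracy = 1.0 - (np.abs(z_hat - z) / z).mean()
    return mae, rmse, accuracy
```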
The lower the MAE and the error ratio, the higher the algorithm's accuracy on the test set; the lower the RMSE, the more stable and adaptable the algorithm. In addition, the application computes the model's MAE, RMSE and accuracy on test data sets of different sample sizes to illustrate the effectiveness of the technical scheme; no neural network algorithm specifically for bird density estimation exists in the prior art.
The embodiment of the application provides a deep-learning-based real-time video surveillance bird density estimation method featuring automatic image cropping: because of the uncertainty of the image acquisition equipment's rotation parameters, the size of the acquired image data is not fixed, so the input image is cut into pieces of identical size, 1024×768, ensuring the stability of the subsequent algorithm; unlike traditional image processing, which generally requires cropping images manually, this scheme omits that step through an automatic cropping method, improving efficiency and accuracy;
Deep residual network (ResNet): to handle the complex backgrounds of the image data, the embodiment of the application uses a deep residual network for front-end feature extraction; the residual network extends the deep neural network to 152 layers through shortcut connections, and compared with traditional convolutional neural networks it has a deeper structure and stronger nonlinear modeling capability, so more accurate features can be extracted; whereas the prior art requires manually selected feature extractors, the deep learning algorithm here learns features automatically, improving efficiency and accuracy;
K-nearest-neighbor algorithm: in the density estimation network, the application uses a K-nearest-neighbor algorithm to compute the average distance between the current sample point and its surrounding sample points and the scaling parameter, handling problems such as sample correlation and perspective distortion and improving the accuracy of the density estimation; the prior art considers only the density of the current sample point and ignores the influence of surrounding sample points;
Depth density estimation network (DDE): used for bird density estimation, the DDE model comprises front-end feature extraction and back-end density map generation; compared with the prior art, the DDE model handles perspective distortion and sample correlation better, and can adaptively generate density map labels without manual annotation, improving efficiency and accuracy.
As shown in fig. 2, a real-time video monitoring bird density estimation system based on deep learning according to an embodiment of the present application includes:
the picture splicing processing module 100 is used for acquiring scene videos, extracting and splicing to obtain site panoramic photos based on the scene videos;
the data set construction module 200 is used for carrying out data filtering on the acquired site panoramic photos to construct a data set for training the neural network;
the data mapping module 300 is used for preprocessing the data set and constructing a mapping relation between the tag and the data set;
the bird density estimation module 400 is configured to construct and train a panoramic bird number statistical network model, input a panoramic bird density map to be counted into the panoramic bird number statistical network model, and calculate the total number of birds.
It should be understood that, although the steps in the flowcharts of the embodiments of the present application are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in various embodiments may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, nor do the order in which the sub-steps or stages are performed necessarily performed in sequence, but may be performed alternately or alternately with at least a portion of the sub-steps or stages of other steps or other steps.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium, and where the program, when executed, may include processes in the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the application.

Claims (10)

1. A real-time video surveillance bird density estimation method based on deep learning, the method comprising:
acquiring a scene video, extracting and splicing the scene video to obtain a site panoramic photo;
data filtering is carried out on the acquired site panoramic photos, and a data set for training a neural network is constructed;
preprocessing a data set, and constructing a mapping relation between the label and the data set;
constructing a panoramic bird number statistical network model, training, inputting a panoramic bird density map to be counted into the panoramic bird number statistical network model, and calculating the total number of birds.
2. The real-time video surveillance bird density estimation method based on deep learning according to claim 1, wherein the step of acquiring a scene video and extracting and stitching to obtain a site panoramic photo based on the scene video specifically comprises: acquiring a scene video, obtaining key frames in the scene video, naming the key frames numerically, extracting a fixed sequence of key frames from them, and stitching the extracted key frames with the pictures captured from the scene video in matrix order to obtain the site panoramic photo.
3. The real-time video surveillance bird density estimation method based on deep learning according to claim 1, wherein the step of data filtering the collected site panoramic photos to construct a data set for neural network training specifically comprises: deleting photos whose resolution is lower than a preset value from the site panoramic photos, removing photos with motion blur from the site panoramic photos, and eliminating photos whose target count is lower than a preset value from the site panoramic photos.
4. The real-time video surveillance bird density estimation method based on deep learning according to claim 1, wherein the panoramic bird number statistical network model comprises a three-part structure: an image cropping module, a depth density estimation network and a concat operation module.
5. The real-time video surveillance bird density estimation method based on deep learning according to claim 4, wherein the image cropping module scales and crops the site panoramic photos in the data set to obtain N minimum processing image units, wherein the number N of minimum processing image units is determined by the following formula:

N = ⌊x / x₀⌋ × ⌊y / y₀⌋

wherein x and y are the true pixel dimensions of the input site panoramic photo, x₀ is the first pixel count, y₀ is the second pixel count, and λ is the remainder against the fixed pixel value, used to adjust the y-direction scaling.
6. The real-time video surveillance bird density estimation method based on deep learning according to claim 4, wherein the depth density estimation network generates density map labels through a Gaussian blur algorithm, specifically comprising: generating a two-dimensional matrix with the same resolution as the original image, transforming the coordinates through the input/output resolution ratio, and setting each transformed label coordinate xᵢ to 1 through a delta function, as follows:

H(x) = Σᵢ δ(x - xᵢ), i = 1 … P

wherein P is the number of annotated targets; the density map is obtained by convolving a two-dimensional Gaussian kernel with the delta function, and perspective distortion is handled with a K-nearest-neighbor algorithm.
7. The real-time video surveillance bird density estimation method based on deep learning according to claim 6, wherein the working process of the depth density estimation network comprises front-end feature extraction and back-end density map generation; the depth density estimation network adopts a deep residual network to extract front-end features, extending the deep neural network through residual blocks and shortcut connections; a ResNet34 network is adopted as the front-end feature extraction part and is connected with the back-end density map part, the fully-connected layer of ResNet being removed; dilated convolution is adopted in the back-end density map part.
8. The real-time video surveillance bird density estimation method based on deep learning according to claim 1, wherein, when training the panoramic bird number statistical network model, data set labels are constructed and resized, and the resizing introduces errors between labels, the error being calculated as:

Error = (1/M) Σⱼ | Σ₍ₘ,ₙ₎ Fⱼ(m, n) - num_GT |

wherein M is the number of samples in the data set, (m, n) are the image pixel coordinates, Fⱼ is the resized density label of the j-th sample, and num_GT is the number of birds read from the label file.
9. The real-time video surveillance bird density estimation method based on deep learning according to claim 1, wherein, after the panoramic bird number statistical network model is trained, the results are evaluated through the mean absolute error and the root mean square error, the mean absolute error being calculated as:

MAE = (1/N) Σᵢ | zᵢ - ẑᵢ |

wherein N is the number of images to be tested, zᵢ is the number of annotated birds in the i-th image, and ẑᵢ is the estimated count;
the accuracy of the panoramic bird number statistical network model is defined as:

Accuracy = 1 - (1/N) Σᵢ | zᵢ - ẑᵢ | / zᵢ
10. A deep-learning-based real-time video surveillance bird density estimation system, the system comprising:
the picture splicing processing module is used for acquiring scene videos, extracting and splicing the scene videos to obtain site panoramic pictures;
the data set construction module is used for carrying out data filtering on the acquired site panoramic photos to construct a data set for training the neural network;
the data mapping module is used for preprocessing the data set and constructing a mapping relation between the tag and the data set;
the bird density estimation module is used for constructing a panoramic bird number statistical network model, training, inputting a panoramic bird density map to be counted into the panoramic bird number statistical network model, and calculating the total number of birds.
CN202310857022.9A 2023-07-13 2023-07-13 Real-time video monitoring bird density estimation method and system based on deep learning Pending CN116935310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310857022.9A CN116935310A (en) 2023-07-13 2023-07-13 Real-time video monitoring bird density estimation method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310857022.9A CN116935310A (en) 2023-07-13 2023-07-13 Real-time video monitoring bird density estimation method and system based on deep learning

Publications (1)

Publication Number Publication Date
CN116935310A true CN116935310A (en) 2023-10-24

Family

ID=88393453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310857022.9A Pending CN116935310A (en) 2023-07-13 2023-07-13 Real-time video monitoring bird density estimation method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN116935310A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018187632A1 (en) * 2017-04-05 2018-10-11 Carnegie Mellon University Deep learning methods for estimating density and/or flow of objects, and related methods and software
CN110210577A (en) * 2019-06-17 2019-09-06 重庆英卡电子有限公司 A kind of deep learning and recognition methods for intensive flock of birds
CN110765833A (en) * 2019-08-19 2020-02-07 中云智慧(北京)科技有限公司 Crowd density estimation method based on deep learning
CN111831971A (en) * 2020-07-13 2020-10-27 华东师范大学 Bird density estimation method
CN113225400A (en) * 2021-05-08 2021-08-06 南京林业大学 Bird population density monitoring system and method based on singing of singing birds
US20230015773A1 (en) * 2021-06-30 2023-01-19 Dalian Maritime University Crowd motion simulation method based on real crowd motion videos
CN114419444A (en) * 2022-01-24 2022-04-29 中国民航大学 Lightweight high-resolution bird group identification method based on deep learning network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PATRIK PUCHERT et al.: "Data-driven deep density estimation", arXiv:2107.11085

Similar Documents

Publication Publication Date Title
CN111539370B (en) Image pedestrian re-identification method and system based on multi-attention joint learning
Albattah et al. A novel deep learning method for detection and classification of plant diseases
US10592780B2 (en) Neural network training system
CN109190508B (en) Multi-camera data fusion method based on space coordinate system
CN112446342B (en) Key frame recognition model training method, recognition method and device
CN109492665A (en) Detection method, device and the electronic equipment of growth period duration of rice
CN112967341A (en) Indoor visual positioning method, system, equipment and storage medium based on live-action image
CN110942456B (en) Tamper image detection method, device, equipment and storage medium
CN112818904A (en) Crowd density estimation method and device based on attention mechanism
JP6435049B2 (en) Image retrieval apparatus and method, photographing time estimation apparatus and method, repetitive structure extraction apparatus and method, and program
Song et al. Remote sensing image spatiotemporal fusion via a generative adversarial network with one prior image pair
CN113516046A (en) Method, device, equipment and storage medium for monitoring biological diversity in area
CN117292324A (en) Crowd density estimation method and system
CN116612272A (en) Intelligent digital detection system for image processing and detection method thereof
CN116935310A (en) Real-time video monitoring bird density estimation method and system based on deep learning
CN115965905A (en) Crowd counting method and system based on multi-scale fusion convolutional network
Wang et al. Automatically detecting the wild giant panda using deep learning with context and species distribution model
CN116189076A (en) Observation and identification system and method for bird observation station
Ji et al. End to end multi-scale convolutional neural network for crowd counting
Arif et al. Video representation by dense trajectories motion map applied to human activity recognition
CN115240168A (en) Perception result obtaining method and device, computer equipment and storage medium
CN114170548A (en) Oil field on-site micro-target detection method and system based on deep learning
CN117670892A (en) Aquatic bird density estimation method and device, computer equipment and storage medium
NO20210472A1 (en) Bird detection and species determination
Subramaniam et al. Real Time Monitoring of Forest Fires and Wildfire Spread Prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination