CN112507832A - Canine detection method and device in monitoring scene, electronic equipment and storage medium

Canine detection method and device in monitoring scene, electronic equipment and storage medium

Info

Publication number
CN112507832A
Authority
CN
China
Prior art keywords
detection
data set
dog
image frame
convolution
Legal status
Pending
Application number
CN202011379290.7A
Other languages
Chinese (zh)
Inventor
龚震霆
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011379290.7A
Publication of CN112507832A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Abstract

The application discloses a canine detection method and device in a monitoring scene, an electronic device and a storage medium, relating to the field of artificial intelligence, in particular to computer vision and deep learning technology, and applicable to smart city scenarios. The specific implementation scheme is as follows: acquiring a surveillance video sample; acquiring a canine detection data set according to the surveillance video sample and a target detection data set; training a target detection base model according to the detection data set to obtain model parameters, and generating a canine detection model for the monitoring scene according to the model parameters; acquiring a surveillance video, and extracting image frames to be detected from the surveillance video; and performing canine detection on the image frames to be detected according to the canine detection model. The application enables canine detection in smart city monitoring scenes, improves detection efficiency and detection coverage, makes canine management simpler, and saves substantial labor costs through efficiency and intelligence.

Description

Canine detection method and device in monitoring scene, electronic equipment and storage medium
Technical Field
The application relates to the field of artificial intelligence, in particular to computer vision and deep learning technology, and specifically to a canine detection method and device in a monitoring scene, an electronic device and a storage medium, which can be used in smart city scenarios.
Background
Canine management is a common problem in urban governance and an important part of safe urban development. At present, new dog-raising regulations are being issued in many localities to specifically address uncivilized dog-raising behavior, and the relevant local departments have begun to focus on canine management. Existing work of this kind mostly relies on manual monitoring inspections and reporting by the public. However, this approach suffers from problems such as low efficiency, low coverage and high labor cost.
Disclosure of Invention
The application provides a canine detection method and device in a monitoring scene, an electronic device and a storage medium.
According to a first aspect, a canine detection method in a monitoring scene is provided, comprising the following steps:
acquiring a surveillance video sample;
acquiring a canine detection data set according to the surveillance video sample and a target detection data set;
training a target detection base model according to the detection data set to obtain model parameters, and generating a canine detection model for the monitoring scene according to the model parameters;
acquiring a surveillance video, and extracting image frames to be detected from the surveillance video; and
performing canine detection on the image frames to be detected according to the canine detection model.
According to a second aspect, a canine detection device in a monitoring scene is provided, comprising:
a first acquisition module, configured to acquire a surveillance video sample;
a second acquisition module, configured to acquire a canine detection data set according to the surveillance video sample and a target detection data set;
a model training module, configured to train a target detection base model according to the detection data set to obtain model parameters, and to generate a canine detection model for the monitoring scene according to the model parameters;
a third acquisition module, configured to acquire a surveillance video and to extract image frames to be detected from the surveillance video; and
a detection module, configured to perform canine detection on the image frames to be detected according to the canine detection model.
According to a third aspect, an electronic device is provided, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the canine detection method in the monitoring scene of the first aspect.
According to a fourth aspect, a non-transitory computer-readable storage medium is provided, having stored thereon computer instructions for causing a computer to execute the canine detection method in the monitoring scene of the first aspect.
The embodiment provided by the application at least has the following beneficial technical effects:
Canine detection in smart city monitoring scenes can be realized. The canine detection data set is obtained from surveillance video in the monitoring scene together with the target detection data set, so that the data set used for training the model is closer to the real application scene, the model trained on this detection data set is better suited to the smart city monitoring scene, and the detection results of the model are more accurate. In addition, because the application performs canine detection in the smart city monitoring scene through the canine detection model, detection efficiency and detection coverage are improved, canine management becomes simpler, and substantial labor costs are saved through efficiency and intelligence.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a flowchart of a dog detection method in a monitoring scenario according to an embodiment of the present application;
FIG. 2 is a flow chart of obtaining a canine detection data set according to an embodiment of the present application;
FIG. 3 is an exemplary diagram of one of the residual blocks of Resnet in the prior art;
FIG. 4 is an exemplary diagram of one of the residual blocks according to an embodiment of the present application;
FIG. 5 is a diagram of an example of a convolution operation of the A channel in one of the residual blocks of Resnet in the prior art;
FIG. 6 is a diagram illustrating an example of a convolution operation for channel A in a residual block according to an embodiment of the present application;
fig. 7 is a block diagram of a dog detection device in a monitoring scenario according to an embodiment of the present application;
fig. 8 is a block diagram of a dog detection device in another monitoring scenario according to an embodiment of the present application;
fig. 9 is a block diagram of an electronic device for implementing the canine detection method in a monitoring scenario according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Canine management is a common problem in urban governance and an important part of safe urban development. At present, new dog-raising regulations are being issued in many localities to specifically address uncivilized dog-raising behavior, and the relevant local departments have begun to focus on canine management. Existing work of this kind mostly relies on manual monitoring inspections and reporting by the public. However, this approach suffers from problems such as low efficiency, low coverage and high labor cost.
In order to solve the above problems, the application provides a canine detection method and device in a monitoring scene, an electronic device and a storage medium, relating to the field of artificial intelligence, in particular to computer vision and deep learning technology, and applicable to smart city scenarios. In some embodiments, visual monitoring equipment can be deployed at different monitoring locations in a city, such as streets and roads, and combined with artificial intelligence technology, particularly computer vision and deep learning, to form a smart city scenario, which can provide useful support for fields such as portrait detection, human body detection, unmanned driving and canine detection.
Fig. 1 is a flowchart of a canine detection method in a monitoring scenario according to an embodiment of the present application. It should be noted that the canine detection method in a monitoring scene of the embodiment of the application relates to the field of artificial intelligence, in particular to computer vision and deep learning technology, and can be used in smart city scenarios. As shown in fig. 1, the canine detection method in a monitoring scenario may include the following steps.
In step 101, a surveillance video sample is obtained.
Optionally, surveillance videos of different monitoring locations such as city streets and roads are collected, and videos in which dogs appear are screened out from the surveillance videos to serve as surveillance video samples.
In step 102, a detection data set for the dog is obtained based on the surveillance video sample and the target detection data set.
It should be noted that, in some embodiments, the target detection data set may be understood as a public Common Objects in Context (COCO) data set. In this step, the surveillance video sample and the target detection data set may be used to obtain a canine detection data set.
Optionally, an image frame data set containing dogs and a monitoring background image frame data set not containing dogs are obtained from the surveillance video sample; images containing dogs are screened out from the target detection data set; the screened dog-containing images are fused into each image frame in the monitoring background image frame data set; and the fused data set is merged with the image frame data set containing dogs to obtain the canine detection data set.
In step 103, training the target detection basic model according to the detection data set to obtain model parameters, and generating a dog detection model under a monitoring scene according to the model parameters.
In some embodiments, the target detection base model may be trained on a target detection data set. As an example of one possible implementation, an initial classification pre-training model may be trained according to the target detection data set to obtain model parameters, and the target detection base model may be formed according to the model parameters. For example, a general-purpose classification pre-trained model is used as the backbone of the canine detection model; the classification pre-training model is trained on a public target detection data set, the model parameters are continuously corrected through back propagation during training until the training end condition is reached, the model parameters at that point are obtained, and the target detection base model is formed according to those model parameters.
In some embodiments, once the canine detection data set has been obtained, the target detection base model can be trained on this detection data set; the model parameters are continuously corrected through back propagation during training until the training end condition is reached, the model parameters at that point are obtained, and the canine detection model for the monitoring scene is formed according to those model parameters.
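As a minimal illustrative sketch of the fine-tuning step above (not the patent's reference code; the model, data set and loss objects are hypothetical placeholders, and a fixed epoch budget stands in for the training end condition), the back-propagation loop could look like the following in PyTorch:

```python
import torch
from torch.utils.data import DataLoader

def finetune_detector(model, detection_dataset, loss_fn, epochs=12, lr=1e-3):
    # Minimal sketch: iterate over the canine detection data set and correct the
    # model parameters by back propagation until the (assumed) end condition,
    # here simply a fixed number of epochs, is reached.
    loader = DataLoader(detection_dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            predictions = model(images)           # forward pass
            loss = loss_fn(predictions, targets)  # e.g. class + objectness + bbox loss
            optimizer.zero_grad()
            loss.backward()                       # back propagation corrects the parameters
            optimizer.step()
    return model.state_dict()                     # the trained model parameters
```

The same loop applies both to pre-training the base model on the target detection data set and to fine-tuning it on the canine detection data set, with only the data set and stopping condition swapped.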
In step 104, a surveillance video is acquired, and the image frames to be detected are extracted from the surveillance video.
In some embodiments, when performing canine detection in a smart city scene, a surveillance video of the smart city scene may be obtained, an image frame parsing interval is set according to the video length, and the surveillance video is parsed at that interval to obtain a plurality of image frames; the image frames parsed from the surveillance video at the parsing interval are used as the image frames to be detected.
In other embodiments, after the surveillance video is obtained, key frames may be extracted from the surveillance video by using an inter-frame difference method, and the extracted key frames are used as the image frames to be detected.
It should be noted that the two obtaining manners of the image frame to be detected are only examples provided for facilitating understanding of those skilled in the art, and cannot be taken as specific limitations of the present application, and it can be understood that obtaining the image frame to be detected from the monitoring video can also be implemented by using other technical means, which is not described in detail herein.
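As an illustrative sketch of the two extraction strategies above (an assumed OpenCV pipeline, not the patent's reference implementation; the parsing interval and difference threshold are hypothetical values):

```python
import cv2

def extract_frames(video_path, parse_interval=25, diff_threshold=15.0):
    # Sketch of the two strategies: fixed-interval parsing and a simple
    # inter-frame difference test for key frames.
    cap = cv2.VideoCapture(video_path)
    frames, prev_gray, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        keep = index % parse_interval == 0                   # strategy 1: parsing interval
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            mean_diff = cv2.absdiff(gray, prev_gray).mean()  # strategy 2: inter-frame difference
            keep = keep or mean_diff > diff_threshold
        prev_gray = gray
        if keep:
            frames.append(frame)
        index += 1
    cap.release()
    return frames  # the image frames to be detected
```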
In step 105, the image frames to be detected are dog-detected according to the dog-detection model.
Optionally, the image frames to be detected are input into the canine detection model. The canine detection model performs feature extraction on each image frame, recombines the extracted features of different layers, performs target detection based on the recombined features, and outputs prediction boxes. The confidence of a prediction box output by the canine detection model is compared with a preset confidence threshold; if it is greater than the confidence threshold, it is determined that a canine appears in the surveillance video, and an early-warning operation is performed. Optionally, if it is not greater than the confidence threshold, it is determined that no canine appears in the surveillance video, and no early-warning operation is performed.
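A minimal sketch of this inference step follows; it assumes a hypothetical `canine_model` callable that returns (x1, y1, x2, y2, confidence) boxes for one frame, which is not specified by the patent:

```python
def detect_canines(frames, canine_model, conf_threshold=0.5):
    # Compare each prediction box's confidence with the preset threshold and
    # collect the frames that should trigger an early-warning operation.
    alerts = []
    for index, frame in enumerate(frames):
        boxes = canine_model(frame)
        dog_boxes = [box for box in boxes if box[4] > conf_threshold]
        if dog_boxes:
            alerts.append((index, dog_boxes))  # a canine appears: trigger early warning
    return alerts
```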
Optionally, assume that the canine detection method in a monitoring scenario of the embodiment of the present application is applied to a smart city scenario that has a smart city management platform. The smart city management platform may include an electronic device for implementing the canine detection method of the embodiment of the present application, and may further include a public security management terminal, and/or a street management terminal, and/or a user terminal of a dog owner, and the like. The dog owner's user terminal, the public security management terminal and the street management terminal are registered on the smart city management platform in advance, so that the smart city management platform can communicate and interact with them through the registration information. In some embodiments of the present application, the early-warning operation may be implemented as follows: when a dog appears in the surveillance video, the electronic device can send an early warning to the public security management terminal and/or the street management terminal to remind public security or street management personnel that a dog has appeared in the scene covered by the surveillance video (that is, the street or road corresponding to the surveillance video), so that they pay attention to and manage the dog, or so that they notify the corresponding dog owner via the user terminal, prompting the owner to attend to the dog and to raise it in a civilized manner. It is to be understood that the early-warning operation may be implemented in other ways; this implementation is only an example given for the convenience of understanding of those skilled in the art, and cannot be taken as a specific limitation of the present application.
According to the canine detection method in a monitoring scene of the embodiment of the application, a surveillance video sample is obtained, a canine detection data set is obtained according to the surveillance video sample and a target detection data set, a target detection base model is trained according to the detection data set to obtain model parameters, and a canine detection model for the monitoring scene is formed according to the model parameters. Because the canine detection data set is obtained from surveillance video in the monitoring scene together with the target detection data set, the data set used for training the model is closer to the real application scene, the model trained on this detection data set is better suited to the smart city monitoring scene, and the detection results of the model are more accurate. In addition, because the application performs canine detection in the smart city monitoring scene through the canine detection model, detection efficiency and detection coverage are improved, canine management becomes simpler, and substantial labor costs are saved through efficiency and intelligence.
It should be noted that, in order to bring the data set used for training the model even closer to the real application scenario and make the canine detection model better suited to the smart city monitoring scenario, in some embodiments of the present application, as shown in fig. 2, the specific implementation of obtaining the canine detection data set according to the surveillance video sample and the target detection data set may include the following steps.
In step 201, a first image frame data set and a second image frame data set are obtained from a surveillance video sample.
In the embodiment of the present application, the first image frame data set may be understood as an image frame data set containing dogs; the second image frame data set may be understood as a monitoring background image frame data set that does not contain dogs.
Alternatively, since the frames of each surveillance video are mostly similar, with a large amount of redundancy, a small fixed number of image frames may be randomly drawn from each surveillance video sample to form an image frame data set D1_frame. The image frame data set D1_frame is inspected and screened, and the image frames containing dogs are retained, resulting in a first image frame data set D_wDog, in which each image frame contains a dog and a monitoring background. At least a part of the remaining image frames is screened out from D1_frame, and the data set formed by these image frames is used as a second image frame data set D_woDog, in which each image frame is a monitoring background image frame that does not contain dogs.
It should be noted that, in some embodiments, the image frame data set D1_frame may be inspected and screened manually, or it may be inspected and screened using artificial intelligence techniques, which is not limited here.
In step 202, labeling is performed on each dog in each image frame in the first image frame data set, and a third image frame data set with labels is obtained.
Optionally, each dog in each image frame in the first image frame data set D_wDog is carefully labeled manually, and the label type is a rectangular box, that is, each dog in each image frame is labeled with a rectangular box; the labeled data set is used as the third image frame data set D3_Det with labels. Optionally, active learning in deep learning can further be adopted to label each dog in each image frame in the first image frame data set intelligently. Alternatively, self-supervised learning, semi-supervised learning or weakly supervised learning in deep learning may be used to label each dog in each image frame in the first image frame data set automatically, which is not limited here.
In step 203, an image dataset with annotations is acquired based on the target detection dataset.
Alternatively, images P1 containing dogs may be selected from the public target detection data set; since the images in the target detection data set are already labeled, the selected dog-containing images P1 are also labeled images.
In step 204, a composite data set is generated from the second image frame data set and the image data set, and a canine detection data set is generated from the composite data set and the third image frame data set.
In some embodiments, the canine target of each image in the image data set selected from the target detection data set may be fused to a random position of each image frame in the second image frame data set based on an image fusion technique to obtain a composite data set, and the composite data set may be merged with the third image frame data set to obtain the canine detection data set.
That is, using an image fusion technique, the canine target in each image P1 in the image data set screened out from the target detection data set is fused to a random position in each image frame in the second image frame data set D_woDog; the annotations of the composite data set are generated from the original annotation information of the canine target in the image P1 and the random position; and the composite data set is merged with the third image frame data set D3_Det, resulting in the canine detection data set D4.
Therefore, a canine detection data set can be obtained through steps 201 to 204, bringing the detection data set used for training the model even closer to the real application scene, so that the canine detection model is better suited to the smart city monitoring scene.
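The following sketch illustrates one possible reading of the fusion step as a simple copy-paste (the patent does not fix the exact fusion technique; the function name, the resize fallback and the random placement policy are assumptions for illustration):

```python
import random
import cv2

def fuse_dog_into_background(background, dog_image, dog_box):
    # Crop the annotated canine target from a COCO-style image and paste it at a
    # random position in a monitoring background frame, returning the new annotation.
    x1, y1, x2, y2 = dog_box                       # original annotation in dog_image
    patch = dog_image[y1:y2, x1:x2]
    bg_h, bg_w = background.shape[:2]
    h, w = patch.shape[:2]
    if h >= bg_h or w >= bg_w:                     # shrink the patch if it does not fit
        scale = 0.5 * min(bg_h / h, bg_w / w)
        patch = cv2.resize(patch, (max(1, int(w * scale)), max(1, int(h * scale))))
        h, w = patch.shape[:2]
    px = random.randint(0, bg_w - w)               # random paste position
    py = random.randint(0, bg_h - h)
    fused = background.copy()
    fused[py:py + h, px:px + w] = patch
    return fused, (px, py, px + w, py + h)         # composite image + new bbox label
```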
It should be noted that current mainstream target detection models are divided into one-stage and two-stage types. One-stage detection models are faster, but their detection accuracy is slightly lower than that of two-stage models. Two-stage detection models have a clear advantage in detection accuracy, but are slow and cannot be used in scenes and on devices with strict latency constraints. The monitoring scene places high latency requirements on the target detection model, so the canine detection model designed in this application is of the one-stage type.
In some embodiments, the canine detection model comprises: a Backbone module, a Detection Neck module, a Detection Head block and a Detection Loss block. The Backbone module is responsible for feature extraction; the Detection Neck module is responsible for feature recombination, recombining the features of different layers; the Detection Head block is responsible for target detection; and the Detection Loss block is used to calculate the degree of difference between the obtained prediction results and the corresponding ground-truth annotation information.
It is understood that a Resnet network is commonly used in the art as the Backbone. However, in some embodiments, the Backbone module of the present application is a network constructed based on multiple target detection parameter Tricks; that is, the present application uses multiple target detection Tricks to construct the Backbone of the canine detection model, so as to improve the basic performance of the Backbone.
In some embodiments, the Backbone module includes a C1 stage, a C2 stage, a C3 stage, a C4 stage and a C5 stage, each stage consisting of a different number of residual blocks. Each residual block comprises an A channel and a B channel. The A channel in a residual block comprises three convolution operations: the kernel size of the first convolution in the A channel is 1x1 with a stride of 1, the kernel size of the second convolution is 3x3 with a stride of 2, and the kernel size of the third convolution is 1x1 with a stride of 1. The B channel comprises an average pooling layer and a convolutional layer, where the stride of the average pooling layer is 2, and the kernel size of the convolutional layer is 1x1 with a stride of 1.
For example, the Backbone module in this application is composed of five stages, C1, C2, C3, C4 and C5, each stage being composed of a different number of residual blocks (Bottleneck blocks). The running-time requirement on the detection model in a video monitoring scene is high, and only the single target class of dogs needs to be detected, so a 34-layer Backbone is designed in this application.
As shown in fig. 3, in one of the residual Bottleneck blocks of Resnet in the prior art, because a convolution with kernel size 1x1 and stride 2 is used, three quarters of the input feature map is ignored by the convolution operation in Path A; likewise, in Path B, the convolution operation ignores three quarters of the input feature map. As shown in fig. 4, in one of the residual Bottleneck blocks in the embodiments of the present application, the strides of the first two convolutions in Path A are structurally improved: a convolution with kernel size 1x1 and stride 1 is performed first, then a convolution with kernel size 3x3 and stride 2, and then a convolution with kernel size 1x1 and stride 1. In Path B, an average pooling layer AvgPool (2x2, stride 2) is added before the 1x1 convolution, and the stride of the 1x1 convolution is changed to 1. In this way, neither Path A nor Path B ignores any information, which improves the performance of the network with little impact on computational cost.
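A PyTorch sketch of this down-sampling residual block follows; the channel counts, batch normalization and ReLU placement are assumptions not fixed by the description above:

```python
import torch.nn as nn

class DownsampleBottleneck(nn.Module):
    # Path A: 1x1 stride 1 -> 3x3 stride 2 -> 1x1 stride 1.
    # Path B: AvgPool(2x2, stride 2) followed by a stride-1 1x1 convolution,
    # so neither path discards three quarters of the input feature map.
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.path_a = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, stride=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.path_b = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_channels, out_channels, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.path_a(x) + self.path_b(x))
```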
In some embodiments, the scale dimension of the A channel in the residual block of the present application is 4, and the convolution operation of the A channel is as follows: a convolution operation is performed on the feature map of the previous layer using the first convolution to obtain a first feature map, and the first feature map is cut into x1, x2, x3 and x4 along that dimension; x1 is left unprocessed and used as y1; x2 undergoes the convolution operation of the second convolution to obtain K2, used as y2; x3 and K2 are feature-fused and then undergo the convolution operation of the second convolution to obtain K3, used as y3; x4 and K3 are feature-fused and then undergo the convolution operation of the second convolution to obtain K4, used as y4; y1, y2, y3 and y4 are spliced to obtain a second feature map; a convolution operation is performed on the second feature map using the third convolution to obtain a third feature map; and the input feature map and the third feature map are fused to obtain a fourth feature map.
For example, fig. 5 is an example diagram of the convolution operation of the A channel in a residual Bottleneck block in the prior art, and fig. 6 is an example diagram of the convolution operation of the A channel in a residual block in an embodiment of the present application. The residual block in the embodiment of the application uses a multi-scale processing method. As shown in fig. 6, when the scale dimension is 4, the present application first performs a convolution operation with a 1x1 convolution kernel on the feature map feature_maps_in of the previous layer to obtain a first feature map feature_maps1; feature_maps1 is then cut into 4 blocks x1, x2, x3 and x4; x1 is used as y1 without processing; x2 undergoes a 3x3 convolution operation to obtain K2, used as y2; x3 and K2 are feature-fused and then undergo a 3x3 convolution operation to obtain K3, used as y3; x4 and K3 are feature-fused and then undergo a 3x3 convolution operation to obtain K4, used as y4; y1, y2, y3 and y4 are spliced to obtain a second feature map feature_maps2; a convolution operation with a 1x1 convolution kernel is performed on feature_maps2 to obtain a third feature map feature_maps3; and finally the input feature map feature_maps_in and the third feature map feature_maps3 are fused to obtain a fourth feature map feature_maps_out.
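A sketch of this scale-4 A-channel operation (a Res2Net-style split) is given below; normalization and activation layers are omitted, and the channel split/fusion details are assumptions beyond what the description fixes:

```python
import torch
import torch.nn as nn

class ScaleFourChannelA(nn.Module):
    # The channel count must be divisible by 4.
    def __init__(self, channels):
        super().__init__()
        assert channels % 4 == 0
        branch = channels // 4
        self.conv1 = nn.Conv2d(channels, channels, 1)   # first 1x1 convolution
        self.convs_3x3 = nn.ModuleList(
            nn.Conv2d(branch, branch, 3, padding=1) for _ in range(3)
        )
        self.conv3 = nn.Conv2d(channels, channels, 1)   # third 1x1 convolution

    def forward(self, feature_maps_in):
        feature_maps1 = self.conv1(feature_maps_in)
        x1, x2, x3, x4 = torch.chunk(feature_maps1, 4, dim=1)  # cut along the scale dimension
        y1 = x1                                   # x1 passes through unchanged
        y2 = self.convs_3x3[0](x2)                # K2
        y3 = self.convs_3x3[1](x3 + y2)           # fuse x3 with K2, then 3x3 conv -> K3
        y4 = self.convs_3x3[2](x4 + y3)           # fuse x4 with K3, then 3x3 conv -> K4
        feature_maps2 = torch.cat([y1, y2, y3, y4], dim=1)     # splice y1..y4
        feature_maps3 = self.conv3(feature_maps2)
        return feature_maps_in + feature_maps3    # fuse the input with the third feature map
```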
In order to increase the running speed of the model and improve its detection efficiency, optionally, in some embodiments of the present application, a Deformable Convolution Network (DCN) may be used in the last stage (e.g., the C5 stage), so that detection accuracy can be greatly improved when the model processes complex targets. It will be appreciated that deformable convolution has been shown to be effective on many detection models. However, DCN adds extra parameters and computation (FLOPs) and is therefore noticeably slower, so as a trade-off the present application uses DCN only in the last stage (e.g., the C5 stage).
In some embodiments, the Detection Neck module operates as follows: the outputs of the C5, C4, C3 and C2 stages of the Backbone module are taken as the input of the Detection Neck module; the output of the C5 stage passes through several groups of convolutions to obtain P5; P5 is upsampled by a factor of N, fused with the output of the C4 stage, and passed through several groups of convolutions to obtain P4; P4 is upsampled by a factor of N, fused with the output of the C3 stage, and passed through several groups of convolutions to obtain P3; P3 is upsampled by a factor of N, fused with the output of the C2 stage, and passed through several groups of convolutions to obtain P2; P4, P3 and P2 are used as the inputs of the Detection Head block. As an example, the value of N may be 2.
It is understood that the Detection Neck module is usually constructed using Feature Pyramid Networks (FPN). An FPN is used to build a pyramid of features over the feature maps. In the prior art, C5, C4, C3 and C2 of the Backbone are generally used as the input of the FPN network; C5 passes through several groups of convolutions to obtain P5; P5 is upsampled by a factor of 2, fused with the output of C4, and passed through several groups of convolutions to obtain P4; P4 is upsampled by a factor of 2, fused with the output of C3, and passed through several groups of convolutions to obtain P3; P2 is obtained in the same way; and C5 is downsampled to obtain P6. Considering that dogs appear small in the monitoring scene to which the application applies, the application only uses C5, C4, C3 and C2 as the input of the FPN network to obtain the feature maps P5, P4, P3 and P2, which correspond to bounding boxes (bbox) of different scales, and does not compute P6; only P4, P3 and P2 are then used as inputs to the Detection Head block.
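A sketch of this Neck with N=2 upsampling follows; the "several groups of convolution" are reduced to one lateral and one smoothing convolution per level, and the channel width is an assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class CanineDetectionNeck(nn.Module):
    # C5..C2 in, P5..P2 built top-down; P6 is not built, and only
    # P4, P3, P2 are returned for the Detection Head.
    def __init__(self, c2, c3, c4, c5, out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in (c5, c4, c3, c2))
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in range(4))

    def forward(self, C2, C3, C4, C5):
        P5 = self.smooth[0](self.lateral[0](C5))
        P4 = self.smooth[1](self.lateral[1](C4) + F.interpolate(P5, scale_factor=2, mode="nearest"))
        P3 = self.smooth[2](self.lateral[2](C3) + F.interpolate(P4, scale_factor=2, mode="nearest"))
        P2 = self.smooth[3](self.lateral[3](C2) + F.interpolate(P3, scale_factor=2, mode="nearest"))
        return P4, P3, P2
```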
Optionally, in some embodiments of the present application, Spatial Pyramid Pooling (SPP) may be added between the two groups of convolutions that produce P5 from the C5 stage, which significantly increases the receptive field and separates out the most important contextual features without introducing additional parameters.
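A sketch of such an SPP block is shown below (a YOLO-style variant with parallel max-pooling branches; the kernel sizes are conventional assumptions, not values given by the patent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    # Parallel max-pooling branches concatenated with the input; pooling adds
    # no parameters, matching the "no additional parameters" property above.
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.kernel_sizes = kernel_sizes

    def forward(self, x):
        pooled = [F.max_pool2d(x, k, stride=1, padding=k // 2) for k in self.kernel_sizes]
        return torch.cat([x] + pooled, dim=1)
```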
In some embodiments, the inputs of the Detection Head block may be the outputs P4, P3 and P2 of the Detection Neck module. Each of the three feature maps P4, P3 and P2 of different sizes undergoes a convolution operation with a 3x3 kernel followed by a convolution operation with a 1x1 kernel, yielding bbox predictions at the corresponding scales.
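A sketch of this head follows; sharing one pair of convolutions across the three scales and the number of outputs per cell (box offsets + objectness + dog score) are assumptions:

```python
import torch.nn as nn

class CanineDetectionHead(nn.Module):
    # A 3x3 convolution followed by a 1x1 convolution applied to P4, P3, P2.
    def __init__(self, in_channels=256, outputs_per_cell=6):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.conv1x1 = nn.Conv2d(in_channels, outputs_per_cell, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, P4, P3, P2):
        # One bbox prediction map per scale.
        return [self.conv1x1(self.relu(self.conv3x3(p))) for p in (P4, P3, P2)]
```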
It should be noted that the Detection Loss block generally consists of a classification loss, an objectness loss and a bounding box regression loss. The bounding box regression loss usually adopts L1 Loss or L2 Loss. When L1 Loss or L2 Loss regresses the four coordinate points of a box, it assumes the four coordinates are independent of each other and does not consider their correlation, whereas the four coordinates of a real box are in fact correlated. Therefore, to address this problem, in some embodiments, the canine detection model of the present application uses an Intersection over Union loss (IoU Loss).
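A minimal sketch of a plain IoU loss (1 - IoU) is shown below; the (x1, y1, x2, y2) box encoding is an assumption:

```python
import torch

def iou_loss(pred_boxes, target_boxes, eps=1e-7):
    # Boxes are (x1, y1, x2, y2) tensors of shape (N, 4); the loss treats the
    # four coordinates of a box jointly rather than as independent L1/L2 terms.
    x1 = torch.max(pred_boxes[:, 0], target_boxes[:, 0])
    y1 = torch.max(pred_boxes[:, 1], target_boxes[:, 1])
    x2 = torch.min(pred_boxes[:, 2], target_boxes[:, 2])
    y2 = torch.min(pred_boxes[:, 3], target_boxes[:, 3])
    intersection = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_pred = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_target = (target_boxes[:, 2] - target_boxes[:, 0]) * (target_boxes[:, 3] - target_boxes[:, 1])
    union = area_pred + area_target - intersection + eps
    return (1.0 - intersection / union).mean()
```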
Therefore, through this structural design, the canine detection model has the following characteristics: it is suited to monitoring scenes with high-resolution images and small detection targets, and offers advantages such as fast detection and accurate small-area detection boxes.
Fig. 7 is a block diagram of a canine detection device in a monitoring scenario according to an embodiment of the present application. It should be noted that the canine detection device in a monitoring scenario of the embodiment of the present application may be configured in the electronic device of the embodiment of the present application. As shown in fig. 7, the canine detection apparatus 700 may include: a first acquisition module 710, a second acquisition module 720, a model training module 730, a third acquisition module 740, and a detection module 750.
Specifically, the first obtaining module 710 is configured to obtain a surveillance video sample.
The second obtaining module 720 is configured to obtain a detection data set for the dog according to the surveillance video sample and the target detection data set.
The model training module 730 is configured to train the target detection basic model according to the detection data set, obtain model parameters, and generate a dog detection model in a monitoring scene according to the model parameters. In some embodiments, the model training module 730 is further configured to train an initial classification pre-training model according to the target detection data set, obtain model parameters, and generate the target detection base model according to the model parameters.
The third obtaining module 740 is configured to obtain a surveillance video, and extract an image frame to be detected from the surveillance video.
The detection module 750 is configured to perform canine detection on the image frames to be detected according to the canine detection model. In some embodiments, the detection module 750 inputs the image frames to be detected into the canine detection model to obtain prediction boxes, compares the confidence of each prediction box with the confidence threshold, and performs an early-warning operation in response to the confidence being greater than the threshold.
In some embodiments, as shown in fig. 8, the second obtaining module 820 may include: a first acquisition unit 821, a second acquisition unit 822, a third acquisition unit 823, and a generation unit 824. The first obtaining unit 821 is configured to obtain a first image frame data set and a second image frame data set from the surveillance video sample; the first image frame data set is an image frame data set containing dogs, and the second image frame data set is a monitoring background image frame data set that does not contain dogs. The second obtaining unit 822 is configured to label each dog in each image frame in the first image frame data set and obtain a third image frame data set with labels. The third obtaining unit 823 is configured to obtain an image data set with annotations based on the target detection data set. The generating unit 824 is configured to generate a composite data set from the second image frame data set and the image data set, and to generate the canine detection data set from the composite data set and the third image frame data set.
In some embodiments, the specific implementation by which the generating unit 824 generates the composite data set from the second image frame data set and the image data set may be as follows: the canine target of each image in the image data set is fused to a random position of each image frame in the second image frame data set based on an image fusion technique to obtain the composite data set.
Wherein 801-804 in fig. 8 and 701-704 in fig. 7 have the same functions and structures.
In some embodiments, the canine detection model is a one-stage model; the canine detection model comprises a Backbone module, a Detection Neck module, a Detection Head block and a Detection Loss block. The Backbone module is responsible for feature extraction; the Detection Neck module is responsible for feature recombination, recombining the features of different layers; the Detection Head block is responsible for target detection; and the Detection Loss block is used to calculate the degree of difference between the obtained prediction results and the corresponding ground-truth annotation information.
In some embodiments, the Backbone module is a network constructed based on multiple target detection parameter Tricks; the Backbone module comprises a C1 stage, a C2 stage, a C3 stage, a C4 stage and a C5 stage, each stage consisting of a different number of residual blocks. Each residual block comprises an A channel and a B channel, where the A channel in a residual block comprises three convolution operations: the kernel size of the first convolution in the A channel is 1x1 with a stride of 1, the kernel size of the second convolution is 3x3 with a stride of 2, and the kernel size of the third convolution is 1x1 with a stride of 1. The B channel comprises an average pooling layer and a convolutional layer, where the stride of the average pooling layer is 2, and the kernel size of the convolutional layer is 1x1 with a stride of 1.
In some embodiments, the scale dimension of the A channel is 4, and the convolution operation of the A channel is as follows: a convolution operation is performed on the feature map of the previous layer using the first convolution to obtain a first feature map, and the first feature map is cut into x1, x2, x3 and x4 along that dimension; x1 is left unprocessed and used as y1; x2 undergoes the convolution operation of the second convolution to obtain K2, used as y2; x3 and K2 are feature-fused and then undergo the convolution operation of the second convolution to obtain K3, used as y3; x4 and K3 are feature-fused and then undergo the convolution operation of the second convolution to obtain K4, used as y4; y1, y2, y3 and y4 are spliced to obtain a second feature map; a convolution operation is performed on the second feature map using the third convolution to obtain a third feature map; and the input feature map and the third feature map are fused to obtain a fourth feature map.
In some embodiments, the Detection Neck module operates as follows: the outputs of the C5, C4, C3 and C2 stages of the Backbone module are taken as the input of the Detection Neck module; the output of the C5 stage passes through several groups of convolutions to obtain P5; P5 is upsampled by a factor of N, fused with the output of the C4 stage, and passed through several groups of convolutions to obtain P4; P4 is upsampled by a factor of N, fused with the output of the C3 stage, and passed through several groups of convolutions to obtain P3; P3 is upsampled by a factor of N, fused with the output of the C2 stage, and passed through several groups of convolutions to obtain P2; P4, P3 and P2 are used as the inputs of the Detection Head block.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The canine detection device in a monitoring scene of the embodiment of the application obtains a surveillance video sample, obtains a canine detection data set according to the surveillance video sample and a target detection data set, trains a target detection base model according to the detection data set to obtain model parameters, and forms a canine detection model for the monitoring scene according to the model parameters. In this way, the canine detection model can be used to perform canine detection on a new surveillance video, realizing canine detection in the smart city monitoring scene. Because the canine detection data set is obtained from surveillance video in the monitoring scene together with the target detection data set, the data set used for training the model is closer to the real application scene, the model trained on this detection data set is better suited to the smart city monitoring scene, and the detection results of the model are more accurate. In addition, because the application performs canine detection in the smart city monitoring scene through the canine detection model, detection efficiency and detection coverage are improved, canine management becomes simpler, and substantial labor costs are saved through efficiency and intelligence.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 9 is a block diagram of an electronic device for implementing the canine detection method in a monitoring scenario according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 9, the electronic apparatus includes: one or more processors 901, a memory 902, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 9 illustrates an example with one processor 901.
The memory 902 is a non-transitory computer-readable storage medium as provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the canine detection method in a monitoring scenario provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the canine detection method in a monitoring scenario provided by the present application.
The memory 902, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs and modules, such as the program instructions/modules corresponding to the canine detection method in a monitoring scenario of the embodiment of the present application (e.g., the first obtaining module 710, the second obtaining module 720, the model training module 730, the third obtaining module 740, and the detection module 750 shown in fig. 7). The processor 901 executes the various functional applications and data processing of the server by running the non-transitory software programs, instructions and modules stored in the memory 902, that is, it implements the canine detection method in a monitoring scenario of the above method embodiment.
The memory 902 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created through use of the electronic device implementing the canine detection method in a monitoring scene, and the like. Further, the memory 902 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 902 may optionally include memory located remotely from the processor 901, which may be connected via a network to the electronic device implementing the canine detection method in a monitoring scenario. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for implementing the dog detection method in the monitoring scene may further include: an input device 903 and an output device 904. The processor 901, the memory 902, the input device 903 and the output device 904 may be connected by a bus or other means, and fig. 9 illustrates the connection by a bus as an example.
The input device 903 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device to implement the dog detection method in the monitoring scenario, such as an input device like a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer, one or more mouse buttons, a track ball, a joystick, etc. The output devices 904 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server can be a cloud Server, also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service ("Virtual Private Server", or simply "VPS"). The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (20)

1. A canine detection method in a monitoring scene, comprising the following steps:
acquiring a surveillance video sample;
acquiring a canine detection data set according to the surveillance video sample and a target detection data set;
training a target detection base model according to the detection data set to obtain model parameters, and generating a canine detection model for the monitoring scene according to the model parameters;
acquiring a surveillance video, and extracting image frames to be detected from the surveillance video; and
performing canine detection on the image frames to be detected according to the canine detection model.
2. The canine detection method in a monitoring scene according to claim 1, wherein the acquiring of the canine detection data set according to the surveillance video sample and the target detection data set comprises:
acquiring a first image frame data set and a second image frame data set from the surveillance video sample, wherein the first image frame data set is an image frame data set containing dogs, and the second image frame data set is a monitoring background image frame data set that does not contain dogs;
labeling each dog in each image frame in the first image frame data set to obtain a third image frame data set with labels;
acquiring an image data set with annotations based on the target detection data set; and
generating a composite data set from the second image frame data set and the image data set, and generating the canine detection data set from the composite data set and the third image frame data set.
3. The method of claim 2, wherein the generating a composite data set from the second image frame data set and the image data set comprises:
fusing the dog target of each image in the image data set to a random position of each image frame in the second image frame data set based on an image fusion technology to obtain the composite data set.
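The fusion step above can be pictured with a short sketch. This is a minimal illustration only, assuming NumPy arrays, a binary mask for the dog target, a dog crop smaller than the background frame, and simple mask-based blending; the helper name paste_dog and the blending strategy are assumptions, not part of the claims.

import random
import numpy as np

def paste_dog(background: np.ndarray, dog_crop: np.ndarray, dog_mask: np.ndarray):
    """Paste one annotated dog target onto a random position of a background frame."""
    bh, bw = background.shape[:2]
    dh, dw = dog_crop.shape[:2]
    # Pick a random top-left corner that keeps the dog crop inside the frame
    # (assumes dh <= bh and dw <= bw).
    y = random.randint(0, bh - dh)
    x = random.randint(0, bw - dw)
    out = background.copy()
    roi = out[y:y + dh, x:x + dw]
    mask = dog_mask[..., None].astype(np.float32)  # 1 inside the dog, 0 outside
    out[y:y + dh, x:x + dw] = (mask * dog_crop + (1.0 - mask) * roi).astype(out.dtype)
    # Return the composite frame and the bounding-box annotation of the pasted dog.
    return out, (x, y, x + dw, y + dh)

Each composite frame produced this way carries the new bounding box as its annotation, so the composite data set can be merged directly with the labeled third image frame data set.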
4. The dog detection method in a monitoring scene according to claim 1, further comprising:
training an initial classification pre-training model according to the target detection data set to obtain model parameters, and generating the target detection basic model according to the model parameters.
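As a minimal sketch of this pre-training step, assuming a PyTorch workflow in which the classification model and the detector share backbone parameter names, the classifier's weights can seed the target detection basic model; the function name and the strict=False weight copy are illustrative assumptions, not the claimed procedure itself.

import torch.nn as nn

def build_target_detection_base_model(pretrained_classifier: nn.Module, detector: nn.Module) -> nn.Module:
    """Initialise a detector from classification pre-training weights."""
    # Copy parameters with matching names (typically the backbone layers);
    # strict=False ignores keys that exist in only one of the two models.
    detector.load_state_dict(pretrained_classifier.state_dict(), strict=False)
    return detector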
5. The dog detection method in a monitoring scene according to claim 1, wherein the dog detection model is a one-stage model; the dog detection model comprises a Backbone module, a Detection Neck module, a Detection Head module and a Detection Loss module, wherein:
the Backbone module is used for feature extraction;
the Detection Neck module is used for feature recombination, recombining features of different layers;
the Detection Head module is used for target detection; and
the Detection Loss module is used for calculating the degree of difference between the obtained prediction result and the corresponding ground-truth annotation information.
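Read together, the four modules form a straightforward pipeline. The following is a minimal sketch under the assumption of a PyTorch implementation; the class name DogDetector and the tensor interfaces are illustrative only, and the loss module is omitted because it is applied outside the forward pass during training.

import torch.nn as nn

class DogDetector(nn.Module):
    """One-stage layout: Backbone -> Detection Neck -> Detection Head."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # feature extraction
        self.neck = neck          # feature recombination across layers
        self.head = head          # target detection (boxes and confidences)

    def forward(self, images):
        features = self.backbone(images)  # tuple of multi-level features, e.g. (C2, C3, C4, C5)
        fused = self.neck(*features)      # recombined features, e.g. P2..P4
        return self.head(fused)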
6. The dog detection method in a monitoring scene according to claim 5, wherein the Backbone module is a network constructed based on a plurality of target detection Tricks; the Backbone module comprises a C1 stage, a C2 stage, a C3 stage, a C4 stage and a C5 stage, each stage being composed of a different number of residual blocks; each of the residual blocks comprises an A channel and a B channel, wherein the A channel in one residual block comprises three convolution operations: the kernel size of the first convolution in the A channel is 1x1 with a stride of 1, the kernel size of the second convolution in the A channel is 3x3 with a stride of 2, and the kernel size of the third convolution in the A channel is 1x1 with a stride of 1; the B channel comprises an average pooling layer and a convolutional layer, wherein the stride of the average pooling layer is 2, and the kernel size of the convolutional layer is 1x1 with a stride of 1.
7. The dog detection method in a monitoring scene according to claim 6, wherein the scale dimension of the A channel is 4, and the convolution operation of the A channel is as follows: performing the convolution operation of the first convolution on the feature map of the upper layer to obtain a first feature map, and cutting the first feature map into x1, x2, x3 and x4 according to the scale dimension; wherein x1 is left unprocessed and used as y1; x2 is subjected to the convolution operation of the second convolution to obtain K2, which is used as y2; x3 is feature-fused with K2 and subjected to the convolution operation of the second convolution to obtain K3, which is used as y3; x4 is feature-fused with K3 and subjected to the convolution operation of the second convolution to obtain K4, which is used as y4; y1, y2, y3 and y4 are spliced to obtain a second feature map; the convolution operation of the third convolution is performed on the second feature map to obtain a third feature map; and the feature map is fused with the third feature map to obtain a fourth feature map.
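The split-and-fuse operation of the A channel can be written compactly. The sketch below is illustrative only, assuming PyTorch, a channel count divisible by the scale dimension, and stride-1 3x3 convolutions so that all splits keep the same spatial size (the downsampling residual blocks of claim 6 additionally use a stride of 2); the class name ScaleSplitConv is not from the claims.

import torch
import torch.nn as nn

class ScaleSplitConv(nn.Module):
    """Scale-4 hierarchical split of the A channel: x1..x4 -> y1..y4 -> concat."""
    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        # The "second convolution" (3x3), one instance per split except x1.
        self.convs = nn.ModuleList([
            nn.Conv2d(width, width, kernel_size=3, stride=1, padding=1, bias=False)
            for _ in range(scale - 1)
        ])

    def forward(self, first_feature_map: torch.Tensor) -> torch.Tensor:
        # Cut the first feature map into x1..x4 along the channel dimension.
        splits = torch.chunk(first_feature_map, self.scale, dim=1)
        outputs = [splits[0]]          # x1 is passed through unchanged as y1
        prev = None
        for i, conv in enumerate(self.convs, start=1):
            x = splits[i] if prev is None else splits[i] + prev  # fuse x_i with K_{i-1}
            prev = conv(x)             # K_i
            outputs.append(prev)       # y_i
        # Splice y1..y4 together to obtain the second feature map.
        return torch.cat(outputs, dim=1)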
8. The dog detection method in a monitoring scene according to claim 5, wherein the Detection Neck module operates as follows:
taking the outputs of the C5 stage, the C4 stage, the C3 stage and the C2 stage in the Backbone module as the input of the Detection Neck module; the output of the C5 stage is subjected to multiple groups of convolution to obtain P5; P5 is upsampled by a factor of N, fused with the output of the C4 stage, and subjected to multiple groups of convolution to obtain P4; P4 is upsampled by a factor of N, fused with the output of the C3 stage, and subjected to multiple groups of convolution to obtain P3; P3 is upsampled by a factor of N, fused with the output of the C2 stage, and subjected to multiple groups of convolution to obtain P2; and P4, P3 and P2 are used as the input of the Detection Head module.
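A minimal sketch of this top-down fusion follows, assuming PyTorch, fusion by channel concatenation, two 3x3 convolutions standing in for each group of "multiple groups of convolution", and an upsampling factor N = 2; the class name DetectionNeck and these choices are assumptions, not mandated by the claim.

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_group(in_ch: int, out_ch: int) -> nn.Sequential:
    # Stand-in for "multiple groups of convolution" (an assumption): two 3x3 conv layers.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class DetectionNeck(nn.Module):
    def __init__(self, c2_ch: int, c3_ch: int, c4_ch: int, c5_ch: int, out_ch: int = 256, n: int = 2):
        super().__init__()
        self.n = n
        self.p5_convs = conv_group(c5_ch, out_ch)
        self.p4_convs = conv_group(out_ch + c4_ch, out_ch)
        self.p3_convs = conv_group(out_ch + c3_ch, out_ch)
        self.p2_convs = conv_group(out_ch + c2_ch, out_ch)

    def forward(self, c2, c3, c4, c5):
        p5 = self.p5_convs(c5)
        p4 = self.p4_convs(torch.cat([F.interpolate(p5, scale_factor=self.n), c4], dim=1))
        p3 = self.p3_convs(torch.cat([F.interpolate(p4, scale_factor=self.n), c3], dim=1))
        p2 = self.p2_convs(torch.cat([F.interpolate(p3, scale_factor=self.n), c2], dim=1))
        return p2, p3, p4  # passed on to the Detection Head module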
9. The dog detection method in a monitoring scene according to any one of claims 1 to 8, wherein the performing dog detection on the image frame to be detected according to the dog detection model comprises:
inputting the image frame to be detected into the dog detection model to obtain a prediction box; and
comparing the confidence of the prediction box with a confidence threshold, and executing an early warning operation in response to the confidence being greater than the confidence threshold.
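A minimal sketch of this thresholding step, under the assumption that the model returns prediction boxes as (x1, y1, x2, y2, confidence) tuples; the function name, the default threshold of 0.5, and the warn callback are illustrative only.

def detect_and_warn(frame, dog_model, confidence_threshold: float = 0.5, warn=print):
    """Run the dog detection model on one frame and warn when a confident box appears."""
    boxes = dog_model(frame)  # prediction boxes, each assumed (x1, y1, x2, y2, confidence)
    dogs = [box for box in boxes if box[4] > confidence_threshold]
    if dogs:
        # Early warning operation: here just a message; in practice an alert to the monitoring platform.
        warn(f"dog detected: {len(dogs)} box(es) above threshold {confidence_threshold}")
    return dogs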
10. A dog detection device in a monitoring scene, comprising:
a first acquisition module, used for acquiring a monitoring video sample;
a second acquisition module, used for acquiring a detection data set for dogs according to the monitoring video sample and a target detection data set;
a model training module, used for training a target detection basic model according to the detection data set to obtain model parameters, and generating a dog detection model in a monitoring scene according to the model parameters;
a third acquisition module, used for acquiring a monitoring video and extracting an image frame to be detected from the monitoring video; and
a detection module, used for performing dog detection on the image frame to be detected according to the dog detection model.
11. The dog detection device in a monitoring scene according to claim 10, wherein the second acquisition module comprises:
a first obtaining unit, configured to obtain a first image frame data set and a second image frame data set from the monitoring video sample, wherein the first image frame data set is an image frame data set containing dogs, and the second image frame data set is a monitoring background image frame data set that does not contain dogs;
a second obtaining unit, configured to label each dog in each image frame in the first image frame data set and obtain a third image frame data set with annotations;
a third obtaining unit, configured to obtain an image data set with annotations based on the target detection data set; and
a generating unit, configured to generate a composite data set from the second image frame data set and the image data set, and generate a detection data set for dogs from the composite data set and the third image frame data set.
12. The dog detection device in a monitoring scene according to claim 11, wherein the generating unit is specifically configured to:
fuse the dog target of each image in the image data set to a random position of each image frame in the second image frame data set based on an image fusion technology to obtain the composite data set.
13. The dog detection device in a monitoring scene according to claim 10, wherein the model training module is further configured to train an initial classification pre-training model according to the target detection data set to obtain model parameters, and generate the target detection basic model according to the model parameters.
14. The dog detection device in a monitoring scene according to claim 10, wherein the dog detection model is a one-stage model; the dog detection model comprises a Backbone module, a Detection Neck module, a Detection Head module and a Detection Loss module, wherein:
the Backbone module is used for feature extraction;
the Detection Neck module is used for feature recombination, recombining features of different layers;
the Detection Head module is used for target detection; and
the Detection Loss module is used for calculating the degree of difference between the obtained prediction result and the corresponding ground-truth annotation information.
15. The dog detection device in a monitoring scene according to claim 14, wherein the Backbone module is a network constructed based on a plurality of target detection Tricks; the Backbone module comprises a C1 stage, a C2 stage, a C3 stage, a C4 stage and a C5 stage, each stage being composed of a different number of residual blocks; each of the residual blocks comprises an A channel and a B channel, wherein the A channel in one residual block comprises three convolution operations: the kernel size of the first convolution in the A channel is 1x1 with a stride of 1, the kernel size of the second convolution in the A channel is 3x3 with a stride of 2, and the kernel size of the third convolution in the A channel is 1x1 with a stride of 1; the B channel comprises an average pooling layer and a convolutional layer, wherein the stride of the average pooling layer is 2, and the kernel size of the convolutional layer is 1x1 with a stride of 1.
16. The dog detection device in a monitoring scene according to claim 15, wherein the scale dimension of the A channel is 4, and the convolution operation of the A channel is as follows: performing the convolution operation of the first convolution on the feature map of the upper layer to obtain a first feature map, and cutting the first feature map into x1, x2, x3 and x4 according to the scale dimension; wherein x1 is left unprocessed and used as y1; x2 is subjected to the convolution operation of the second convolution to obtain K2, which is used as y2; x3 is feature-fused with K2 and subjected to the convolution operation of the second convolution to obtain K3, which is used as y3; x4 is feature-fused with K3 and subjected to the convolution operation of the second convolution to obtain K4, which is used as y4; y1, y2, y3 and y4 are spliced to obtain a second feature map; the convolution operation of the third convolution is performed on the second feature map to obtain a third feature map; and the feature map is fused with the third feature map to obtain a fourth feature map.
17. The dog detection device in a monitoring scene according to claim 14, wherein the Detection Neck module operates as follows:
taking the outputs of the C5 stage, the C4 stage, the C3 stage and the C2 stage in the Backbone module as the input of the Detection Neck module; the output of the C5 stage is subjected to multiple groups of convolution to obtain P5; P5 is upsampled by a factor of N, fused with the output of the C4 stage, and subjected to multiple groups of convolution to obtain P4; P4 is upsampled by a factor of N, fused with the output of the C3 stage, and subjected to multiple groups of convolution to obtain P3; P3 is upsampled by a factor of N, fused with the output of the C2 stage, and subjected to multiple groups of convolution to obtain P2; and P4, P3 and P2 are used as the input of the Detection Head module.
18. The dog detection device in a monitoring scene according to any one of claims 10 to 17, wherein the detection module is specifically configured to:
input the image frame to be detected into the dog detection model to obtain a prediction box; and
compare the confidence of the prediction box with a confidence threshold, and execute an early warning operation in response to the confidence being greater than the confidence threshold.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the dog detection method in a monitoring scene according to any one of claims 1 to 9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the dog detection method in a monitoring scene according to any one of claims 1 to 9.
CN202011379290.7A 2020-11-30 2020-11-30 Canine detection method and device in monitoring scene, electronic equipment and storage medium Pending CN112507832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011379290.7A CN112507832A (en) 2020-11-30 2020-11-30 Canine detection method and device in monitoring scene, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011379290.7A CN112507832A (en) 2020-11-30 2020-11-30 Canine detection method and device in monitoring scene, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112507832A true CN112507832A (en) 2021-03-16

Family

ID=74968623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011379290.7A Pending CN112507832A (en) 2020-11-30 2020-11-30 Canine detection method and device in monitoring scene, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112507832A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218610A (en) * 2013-04-28 2013-07-24 宁波江丰生物信息技术有限公司 Formation method of dogface detector and dogface detection method
WO2020164270A1 (en) * 2019-02-15 2020-08-20 平安科技(深圳)有限公司 Deep-learning-based pedestrian detection method, system and apparatus, and storage medium
CN111353393A (en) * 2020-02-19 2020-06-30 桂林电子科技大学 Dog only detects and early warning system based on neural network
CN111597900A (en) * 2020-04-16 2020-08-28 浙江工业大学 Illegal dog walking identification method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022227770A1 (en) * 2021-04-28 2022-11-03 北京百度网讯科技有限公司 Method for training target object detection model, target object detection method, and device
CN113361473A (en) * 2021-06-30 2021-09-07 北京百度网讯科技有限公司 Image processing method, model training method, device, apparatus, storage medium, and program
CN113361473B (en) * 2021-06-30 2023-12-08 北京百度网讯科技有限公司 Image processing method, model training method, image processing device, model training apparatus, storage medium, and program
CN114549942A (en) * 2022-04-27 2022-05-27 网思科技股份有限公司 Artificial intelligent security system test method and device, storage medium and test equipment

Similar Documents

Publication Publication Date Title
CN112507832A (en) Canine detection method and device in monitoring scene, electronic equipment and storage medium
CN111598164B (en) Method, device, electronic equipment and storage medium for identifying attribute of target object
CN111833340A (en) Image detection method, image detection device, electronic equipment and storage medium
CN110675635B (en) Method and device for acquiring external parameters of camera, electronic equipment and storage medium
CN111881908B (en) Target detection model correction method, detection device, equipment and medium
CN112347769A (en) Entity recognition model generation method and device, electronic equipment and storage medium
CN111753911A (en) Method and apparatus for fusing models
CN111967490A (en) Model training method for map detection and map detection method
CN113408662A (en) Image recognition method and device, and training method and device of image recognition model
CN111523515A (en) Method and device for evaluating environment cognitive ability of automatic driving vehicle and storage medium
CN112150462A (en) Method, device, equipment and storage medium for determining target anchor point
CN114443794A (en) Data processing and map updating method, device, equipment and storage medium
CN110866504A (en) Method, device and equipment for acquiring marked data
CN113011298B (en) Truncated object sample generation, target detection method, road side equipment and cloud control platform
CN112749701B (en) License plate offset classification model generation method and license plate offset classification method
CN111832658B (en) Point-of-interest information processing method and device, electronic equipment and storage medium
CN111783613B (en) Anomaly detection method, model training method, device, equipment and storage medium
CN111783635A (en) Image annotation method, device, equipment and storage medium
CN113361303B (en) Temporary traffic sign board identification method, device and equipment
CN111563541B (en) Training method and device of image detection model
CN112270532A (en) Data processing method and device, electronic equipment and storage medium
CN111597986A (en) Method, apparatus, device and storage medium for generating information
CN113297878A (en) Road intersection identification method and device, computer equipment and storage medium
CN113344121B (en) Method for training a sign classification model and sign classification
CN111967299B (en) Unmanned aerial vehicle inspection method, unmanned aerial vehicle inspection device, unmanned aerial vehicle inspection equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination