CN113011404B - Dog leash identification method and device based on time-space domain features - Google Patents

Dog leash identification method and device based on time-space domain features

Info

Publication number
CN113011404B
CN113011404B
Authority
CN
China
Prior art keywords
dog
layer
dimensional convolutional
time
space domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110568106.1A
Other languages
Chinese (zh)
Other versions
CN113011404A (en)
Inventor
杨帆
冯帅
刘利卉
胡建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaoshi Technology (Jiangsu) Co.,Ltd.
Original Assignee
Nanjing Zhenshi Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhenshi Intelligent Technology Co Ltd filed Critical Nanjing Zhenshi Intelligent Technology Co Ltd
Priority to CN202110568106.1A priority Critical patent/CN113011404B/en
Publication of CN113011404A publication Critical patent/CN113011404A/en
Application granted granted Critical
Publication of CN113011404B publication Critical patent/CN113011404B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dog leash identification method based on time-space domain features, which uses a dog leash identification model to identify whether a dog in a video is leashed or not. The invention also discloses a dog leash identification device based on time-space domain features. Compared with the prior art, the method can identify whether a dog is leashed more efficiently and accurately.

Description

Dog leash identification method and device based on time-space domain features
Technical Field
The invention belongs to the technical field of machine vision, and particularly relates to a dog leash identification method and device based on time-space domain characteristics.
Background
With social development, more and more people keep dogs as pets, and the accompanying uncivilized behaviors, such as walking dogs without a leash, seriously disturb others. Such behaviors have therefore been explicitly classified as illegal by local laws and regulations and by the recently issued Animal Epidemic Prevention Law of the People's Republic of China. For such violations, the conventional approach of manual monitoring and dissuasion is practically infeasible because it consumes a great deal of manpower. If the existing elevated security cameras on streets could be used to detect illegal dog walking, real-time monitoring could be achieved, labor and material costs could be saved, and the equipment would be easy to maintain and repair; an illegal dog walking detection system based on surveillance video therefore has good application and popularization value.
The core of an illegal dog walking detection system is to quickly and accurately identify whether a dog in a video image is leashed. At present, existing methods only analyze feature information in the spatial domain of static pictures, and these spatial features mainly concern the dog leash itself; however, under a surveillance viewing angle the leash often occupies very few effective pixels (it is distant or occluded). While a dog is being walked, the behaviors of the person and the dog vary widely but also follow specific patterns, and these patterns make it possible to extract representative motion posture features through deep learning and computer vision. For example, the dog tends to circle around the person like a satellite no matter how the person moves; the dog may run quickly in one direction, then be restrained by the leash and suddenly stop (or slow down) with its forequarters tilting upward; or the dog may move in one direction together with the person while the person's hand, the leash, and the dog stay roughly in a line. These features allow a human to judge, even from a long distance, whether a dog is being walked on a leash. For machine vision, however, it is difficult to obtain valid motion posture features from a single static image when the target is small, because these features require not only information in the spatial domain but also feature information across consecutive image sequences.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a dog leash identification method based on time-space domain features, which can efficiently and accurately capture the time-space domain motion posture features specific to walking a dog on a leash, and thereby quickly and accurately identify whether a dog in a video is leashed.
The invention specifically adopts the following technical scheme to solve the technical problems:
a dog leash identification method based on time-space domain features is characterized in that a dog leash identification model is used for identifying whether a dog in a video is leash or not, and the input of the dog leash identification model is an image sequence extracted from a video clip with the dog by the following method: acquiring an interested region which takes the dog as the center in a picture of the dog appearing in the video clip for the first time, and respectively intercepting corresponding interested regions from a plurality of pictures afterwards according to the position of the interested region, wherein the images of the series of interested regions form the image sequence; the output of the dog leash identification model is two categories of 'a dog is leashed' and 'the dog is not leashed'; the dog leash identification model comprises a local time-space domain feature extraction module, a global time-domain attention feature extraction module and an output layer, wherein the front end of the local time-space domain feature extraction module is used for extracting local time-space domain features and reducing dimensions, the rear end of the global time-space domain attention feature extraction module is used for extracting longer-term global features, and the output layer is finally used for outputting two classification results; the local time-space domain feature extraction module is composed of a plurality of groups of parallel three-dimensional convolutional neural networks capable of being re-parameterized and corresponding three-dimensional pooling and dimension conversion layers thereof, and the three-dimensional convolutional neural networks capable of being re-parameterized are composed of a plurality of Rep 3D CNN modules in series; the training structure of the Rep 3D CNN module comprises at least two layers of three-dimensional convolutional layers, wherein a batch normalization layer and an activation function layer are arranged behind each three-dimensional convolutional layer, each three-dimensional convolutional layer in the module is provided with a three-dimensional convolutional branch with a parallel convolutional core of K1=1x1x1, each three-dimensional convolutional layer except the first three-dimensional convolutional layer in the module is also provided with a parallel identical mapping branch, and the output of each three-dimensional convolutional layer is added with the output of each branch of the three-dimensional convolutional layer and is input into the next three-dimensional convolutional layer after passing through the activation layer; the prediction structure of the Rep 3D CNN module is obtained by carrying out the following reparameterization operation on a training structure: and fusing the three-dimensional convolutional layers and the batch normalization layer, and merging the 1x1x1 three-dimensional convolution branches and the identity mapping branches into the corresponding three-dimensional convolutional layers.
Preferably, the global time-domain attention feature extraction module is a Vision Transformer model composed of T+1 parallel Vision Transformer modules followed by an MLP Head layer, where T is the number of parallel re-parameterizable three-dimensional convolutional neural networks.
Preferably, the output layer is a Sigmoid activation layer.
Preferably, the region of interest centered on the dog is specifically: the dog's detection box expanded by 5 to 10 times, using the longest edge of the detection box as the reference.
Preferably, when the output of the dog leash identification model is 'dog leashed', a further decision is made based on the estimated distance between the dog and the person: the final recognition result 'dog leashed' is output only when the estimated distance is not greater than a preset threshold; otherwise the final recognition result 'distance exceeded' is output.
Based on the same inventive concept, the following technical scheme can be obtained:
a dog leash recognition device based on time-space domain features utilizes a dog leash recognition model to recognize whether a dog in a video is leash or not, and the input of the dog leash recognition model is an image sequence extracted from a video clip with the dog by the following method: acquiring an interested region which takes the dog as the center in a picture of the dog appearing in the video clip for the first time, and respectively intercepting corresponding interested regions from a plurality of pictures afterwards according to the position of the interested region, wherein the images of the series of interested regions form the image sequence; the output of the dog leash identification model is two categories of 'a dog is leashed' and 'the dog is not leashed'; the dog leash identification model comprises a local time-space domain feature extraction module, a global time-domain attention feature extraction module and an output layer, wherein the front end of the local time-space domain feature extraction module is used for extracting local time-space domain features and reducing dimensions, the rear end of the global time-space domain attention feature extraction module is used for extracting longer-term global features, and the output layer is finally used for outputting two classification results; the local time-space domain feature extraction module is composed of a plurality of groups of parallel three-dimensional convolutional neural networks capable of being re-parameterized and corresponding three-dimensional pooling and dimension conversion layers thereof, and the three-dimensional convolutional neural networks capable of being re-parameterized are composed of a plurality of Rep 3D CNN modules in series; the training structure of the Rep 3D CNN module comprises at least two layers of three-dimensional convolutional layers, wherein a batch normalization layer and an activation function layer are arranged behind each three-dimensional convolutional layer, each three-dimensional convolutional layer in the module is provided with a three-dimensional convolutional branch with a parallel convolutional core of K1=1x1x1, each three-dimensional convolutional layer except the first three-dimensional convolutional layer in the module is also provided with a parallel identical mapping branch, and the output of each three-dimensional convolutional layer is added with the output of each branch of the three-dimensional convolutional layer and is input into the next three-dimensional convolutional layer after passing through the activation layer; the prediction structure of the Rep 3D CNN module is obtained by carrying out the following reparameterization operation on a training structure: and fusing the three-dimensional convolutional layers and the batch normalization layer, and merging the 1x1x1 three-dimensional convolution branches and the identity mapping branches into the corresponding three-dimensional convolutional layers.
Preferably, the global time-domain attention feature extraction module is a Vision Transformer model composed of T+1 parallel Vision Transformer modules followed by an MLP Head layer, where T is the number of parallel re-parameterizable three-dimensional convolutional neural networks.
Preferably, the output layer is a Sigmoid activation layer.
Preferably, the region of interest centered on the dog is specifically: the dog's detection box expanded by 5 to 10 times, using the longest edge of the detection box as the reference.
Preferably, the device further comprises a dog and person distance estimation model for making a further decision when the output of the dog leash identification model is 'dog leashed': the final recognition result 'dog leashed' is output only when the estimated distance between the dog and the person is not greater than a preset threshold; otherwise the final recognition result 'distance exceeded' is output.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
according to the method, the time-space domain motion posture characteristic specific to the dog walking by tying is extracted from the video segment with the dog, whether the dog in the video is tied or not is rapidly and accurately identified according to the time-space domain motion posture characteristic, the problem that the dog tying identification is carried out based on characteristic information in a static picture analysis space domain in the prior art is solved, and a more accurate identification effect is achieved;
aiming at the extraction of the spatial-temporal motion attitude characteristics of the tethered dog walking, the invention provides a structure of a reparameterizable three-dimensional convolutional neural network (Rep 3D CNN network for short) combined with a global time-domain attention characteristic extraction module, and the structure has extremely high identification precision and extremely high identification efficiency.
Drawings
Fig. 1 is a schematic diagram of the training structure of a Rep 3D CNN module;
Fig. 2 is a schematic diagram of the prediction structure of a Rep 3D CNN module;
Fig. 3 is a schematic diagram of an embodiment of the dog leash identification model;
Fig. 4 shows the basic identification flow of a dog leash identification device based on time-space domain features according to an embodiment of the invention.
Detailed Description
To address the problem that the prior art performs dog leash identification only on spatial-domain feature information analyzed from static pictures, the basic idea of the invention is to extract, from a video segment containing a dog, the time-space domain motion posture features specific to walking a dog on a leash, and to quickly and accurately identify from these features whether the dog in the video is leashed.
For processing a continuous image sequence with contextual correlation and extracting time-space domain features from it, the common techniques are 3D CNNs and LSTM networks, or extracting motion features first with an optical flow method and then feeding them into a 2D CNN. Each of these techniques has its own drawbacks: 3D CNNs have a large number of parameters, and the features they obtain are mostly local time-space domain features; LSTMs can acquire long-term local features but are computationally inefficient (low parallelism); motion feature extraction methods such as optical flow have high computational complexity, lose some spatial features, and cannot extract globally representative features in the time domain.
To solve the above problems, the invention proposes a structure that combines a re-parameterizable three-dimensional convolutional neural network (Rep 3D CNN network for short) with a global time-domain attention feature extraction module. First, consecutive regions of interest (ROIs) centered on the dog are fed in groups into the Rep 3D CNN networks to obtain local time-space domain features and reduce dimensionality; the extracted features are then fed into the global time-domain attention feature extraction module to obtain features with a longer time span and better representation capability; finally, a binary classifier outputs the recognition result 'dog leashed' or 'dog not leashed'.
The dog leash identification method based on time-space domain features provided by the invention is specifically as follows:
A dog leash identification model is used to identify whether a dog in a video is leashed. The input to the dog leash identification model is an image sequence extracted from a video clip containing a dog as follows: a region of interest centered on the dog is obtained from the frame in which the dog first appears in the video clip, and corresponding regions of interest are then cropped at the same position from a number of subsequent frames; the images of this series of regions of interest form the image sequence. The output of the dog leash identification model is one of two categories, 'dog leashed' and 'dog not leashed'. The dog leash identification model comprises a local time-space domain feature extraction module at the front end for extracting local time-space domain features and reducing dimensionality, a global time-domain attention feature extraction module at the back end for extracting longer-term global features, and an output layer that finally outputs the binary classification result. The local time-space domain feature extraction module consists of several groups of parallel re-parameterizable three-dimensional convolutional neural networks together with their corresponding three-dimensional pooling and dimension-conversion layers, and each re-parameterizable three-dimensional convolutional neural network consists of several Rep 3D CNN modules connected in series. The training structure of a Rep 3D CNN module comprises at least two three-dimensional convolutional layers, each followed by a batch normalization layer and an activation function layer; every three-dimensional convolutional layer in the module has a parallel three-dimensional convolution branch with kernel K1 = 1x1x1, and every three-dimensional convolutional layer except the first one also has a parallel identity mapping branch; the output of each three-dimensional convolutional layer is added to the outputs of its branches and, after passing through the activation layer, is fed into the next three-dimensional convolutional layer. The prediction structure of the Rep 3D CNN module is obtained by applying the following re-parameterization operations to the training structure: fusing each three-dimensional convolutional layer with its batch normalization layer, and merging the 1x1x1 three-dimensional convolution branches and the identity mapping branches into the corresponding three-dimensional convolutional layers.
The dog leash identification device based on time-space domain features provided by the invention is specifically as follows:
A dog leash identification model is used to identify whether a dog in a video is leashed. The input to the dog leash identification model is an image sequence extracted from a video clip containing a dog as follows: a region of interest centered on the dog is obtained from the frame in which the dog first appears in the video clip, and corresponding regions of interest are then cropped at the same position from a number of subsequent frames; the images of this series of regions of interest form the image sequence. The output of the dog leash identification model is one of two categories, 'dog leashed' and 'dog not leashed'. The dog leash identification model comprises a local time-space domain feature extraction module at the front end for extracting local time-space domain features and reducing dimensionality, a global time-domain attention feature extraction module at the back end for extracting longer-term global features, and an output layer that finally outputs the binary classification result. The local time-space domain feature extraction module consists of several groups of parallel re-parameterizable three-dimensional convolutional neural networks together with their corresponding three-dimensional pooling and dimension-conversion layers, and each re-parameterizable three-dimensional convolutional neural network consists of several Rep 3D CNN modules connected in series. The training structure of a Rep 3D CNN module comprises at least two three-dimensional convolutional layers, each followed by a batch normalization layer and an activation function layer; every three-dimensional convolutional layer in the module has a parallel three-dimensional convolution branch with kernel K1 = 1x1x1, and every three-dimensional convolutional layer except the first one also has a parallel identity mapping branch; the output of each three-dimensional convolutional layer is added to the outputs of its branches and, after passing through the activation layer, is fed into the next three-dimensional convolutional layer. The prediction structure of the Rep 3D CNN module is obtained by applying the following re-parameterization operations to the training structure: fusing each three-dimensional convolutional layer with its batch normalization layer, and merging the 1x1x1 three-dimensional convolution branches and the identity mapping branches into the corresponding three-dimensional convolutional layers.
For ease of understanding, the technical solution of the invention is explained in detail below with reference to the accompanying drawings.
because the existing 3D CNN (such as C3D, R3D, SlowFast and the like) has the defect of slow calculation speed, the invention refers to the re-parameterization idea in repvgg to improve the structure of the 3D CNN. The repvgg is a reconfigurable convolutional network based on 2D CNN, the repvgg is a reparameterized network, residual connection similar to resnet exists only during training, the connection can be combined through reparameterization operation during reasoning, a very simple framework which is only 3x3 convolution and a vgg single-path structure similar to a relu layer is formed, and the efficiency is very high on a general GPU and an AI chip with an NPU acceleration unit. The invention applies the re-parameterization idea to the 3D CNN to form the Rep 3D CNN network.
The Rep 3D CNN network provided by the invention is formed by connecting a plurality of Rep 3D CNN modules in series.
The training structure of a single Rep 3D CNN module is shown in Fig. 1. It comprises at least two three-dimensional convolutional layers whose convolution kernels are Kn = (n, n, n), where n is the kernel size; each three-dimensional convolutional layer is followed by a batch normalization layer and an activation function layer. Every three-dimensional convolutional layer in the module has a parallel three-dimensional convolution branch with kernel K1 = 1x1x1, and every three-dimensional convolutional layer except the first one also has a parallel identity mapping (ID) branch; the first three-dimensional convolutional layer of the module is also responsible for the time-domain down-sampling operation. The output of each three-dimensional convolutional layer is added to the outputs of its branches and, after passing through the activation layer, is fed into the next three-dimensional convolutional layer.
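The following is a minimal PyTorch sketch of the training structure just described, not the patent's exact implementation: ReLU activations, a 3x3x3 main kernel (n = 3), two layers per module, and the channel/stride arguments are illustrative assumptions; branch-level batch norms (as used in RepVGG) are omitted because the text only places a batch normalization layer after the main convolution.

```python
# Sketch of one Rep 3D CNN module's *training* structure (assumptions noted above).
import torch
import torch.nn as nn

class Rep3DConvLayer(nn.Module):
    """One re-parameterizable 3D conv layer: nxnxn main branch (+BN),
    a parallel 1x1x1 conv branch, and an optional identity branch."""
    def __init__(self, in_ch, out_ch, kernel=3, stride=(1, 1, 1), first=False):
        super().__init__()
        pad = kernel // 2
        self.conv_n = nn.Conv3d(in_ch, out_ch, kernel, stride=stride, padding=pad, bias=False)
        self.bn = nn.BatchNorm3d(out_ch)
        self.conv_1 = nn.Conv3d(in_ch, out_ch, 1, stride=stride, bias=True)
        # identity branch only when shapes match and this is not the module's first layer
        self.use_identity = (not first) and in_ch == out_ch and stride == (1, 1, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.bn(self.conv_n(x)) + self.conv_1(x)   # main branch + 1x1x1 branch
        if self.use_identity:
            out = out + x                                # identity mapping branch
        return self.act(out)

class Rep3DCNNModule(nn.Module):
    """Two stacked Rep3DConvLayers; the first layer may also down-sample."""
    def __init__(self, in_ch, out_ch, spatial_stride=2):
        super().__init__()
        self.layer1 = Rep3DConvLayer(in_ch, out_ch,
                                     stride=(1, spatial_stride, spatial_stride), first=True)
        self.layer2 = Rep3DConvLayer(out_ch, out_ch)

    def forward(self, x):            # x: (N, C, T, H, W)
        return self.layer2(self.layer1(x))
```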
The prediction structure of a single Rep 3D CNN module is obtained by applying the following re-parameterization operations to the training structure: each three-dimensional convolutional layer is fused with its batch normalization layer, and the 1x1x1 three-dimensional convolution branches and the identity mapping branches are merged into the corresponding three-dimensional convolutional layers. As shown in Fig. 2, the merged prediction structure consists only of a trunk formed by several three-dimensional convolutional layers and activation layers connected in series. The convolution kernel obtained by fusing a three-dimensional convolutional layer with its batch normalization layer is denoted Kn-bn. The K1 three-dimensional convolution branch and the ID identity mapping branch are merged by converting each of them into an equivalent convolution kernel of size (n, n, n): K1 is converted into a kernel K1-n = (n, n, n) by zero padding; the output of the identity mapping layer equals its input, which is equivalent to a 1x1x1 convolution kernel and is likewise converted into a kernel ID-n = (n, n, n) by zero padding. Kn-bn, K1-n, and ID-n are then added to yield a single merged convolutional layer with kernel size (n, n, n).
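The fusion arithmetic above can be sketched as follows for the Rep3DConvLayer from the previous sketch; the function and variable names are illustrative, and the BN folding follows the standard conv-BN fusion formula rather than any code given in the patent.

```python
# Sketch of the re-parameterization: conv+BN fusion, zero-padding K1 to Kn, identity -> ID-n.
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv3d, bn: nn.BatchNorm3d):
    """Fold a BatchNorm3d into the preceding Conv3d, yielding the Kn-bn kernel and bias."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                                   # per-output-channel scale
    kernel = conv.weight * scale.reshape(-1, 1, 1, 1, 1)
    bias = bn.bias - bn.running_mean * scale
    if conv.bias is not None:
        bias = bias + conv.bias * scale
    return kernel, bias

def pad_1x1x1_to_n(kernel_1, n):
    """Zero-pad a (Cout, Cin, 1, 1, 1) kernel to (Cout, Cin, n, n, n): K1-n."""
    p = (n - 1) // 2
    return nn.functional.pad(kernel_1, [p, p, p, p, p, p])

def identity_to_kernel(channels, n, dtype=torch.float32):
    """Express the identity branch as an equivalent (C, C, n, n, n) kernel: ID-n."""
    k = torch.zeros(channels, channels, n, n, n, dtype=dtype)
    c = (n - 1) // 2
    for i in range(channels):
        k[i, i, c, c, c] = 1.0
    return k

def reparameterize_layer(layer, n=3):
    """Merge the branches of one trained Rep3DConvLayer into a single Conv3d."""
    k_main, b_main = fuse_conv_bn(layer.conv_n, layer.bn)
    k_1 = pad_1x1x1_to_n(layer.conv_1.weight, n)
    b_1 = layer.conv_1.bias if layer.conv_1.bias is not None else 0.0
    kernel, bias = k_main + k_1, b_main + b_1                 # Kn-bn + K1-n
    if layer.use_identity:
        kernel = kernel + identity_to_kernel(kernel.shape[0], n, kernel.dtype)  # + ID-n
    fused = nn.Conv3d(layer.conv_n.in_channels, layer.conv_n.out_channels, n,
                      stride=layer.conv_n.stride, padding=layer.conv_n.padding, bias=True)
    fused.weight.data = kernel.detach()
    fused.bias.data = bias.detach()
    return fused
```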
A specific embodiment of the dog leash identification model built by the invention is shown in Fig. 3. It comprises a local time-space domain feature extraction module at the front end for extracting local time-space domain features and reducing dimensionality, a global time-domain attention feature extraction module at the back end for extracting longer-term global features (a Vision Transformer model is adopted in this embodiment), and an output layer for outputting the binary classification result (a Sigmoid activation layer is adopted in this embodiment). As shown in Fig. 3, the dog leash identification model takes as input a sequence of consecutive regions of interest (ROIs) centered on the dog, extracted from a video segment containing the dog as follows: a region of interest centered on the dog is obtained from the frame in which the dog first appears in the video clip, and corresponding regions of interest are then cropped at the same position from a number of subsequent frames; the images of this series of regions of interest form the image sequence. In this embodiment the region of interest centered on the dog is the dog's detection box expanded by 5 to 10 times, using the longest edge of the detection box as the reference. The image sequence is divided equally into T parts, scaled to W x H x C x D, and fed into T parallel groups of Rep 3D CNN networks. Each Rep 3D CNN network is formed by connecting L Rep 3D CNN modules in series; the number of channels per module is not specifically limited, and each module down-samples only in the spatial domain through its convolutional layers, not in the time domain. The final output of each Rep 3D CNN network is fed into a global average pooling layer and, after dimension conversion (reshape), yields a 1xD-dimensional local time-space domain feature. The TxD-dimensional local spatial-temporal feature embedding is fed into the Transformer encoder part to obtain longer-term, global features in the time domain, then into an MLP Head layer, and finally a Sigmoid is connected as the classifier.
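A small sketch of how the dog-centered ROI described above might be computed; the expansion factor of 8 (within the 5-10 range stated), the box layout (center x, center y, width, height), and the clipping to image bounds are illustrative assumptions.

```python
def dog_centered_roi(dog_box, img_w, img_h, expand=8):
    """Expand the dog's detection box into a square ROI whose side is `expand`
    times the box's longest edge, centered on the dog and clipped to the image."""
    x_c, y_c, w, h = dog_box                 # center-x, center-y, width, height
    side = expand * max(w, h)                # 5-10x the longest edge per the text
    x1 = max(0, int(x_c - side / 2))
    y1 = max(0, int(y_c - side / 2))
    x2 = min(img_w, int(x_c + side / 2))
    y2 = min(img_h, int(y_c + side / 2))
    return x1, y1, x2, y2

# The same (x1, y1, x2, y2) window is then cropped from each subsequent frame to build
# the input image sequence, e.g. roi_seq = [frame[y1:y2, x1:x2] for frame in frames].
```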
Specifically, in this embodiment the input is N = 16 consecutive ROI regions centered on the dog acquired within T = 4 seconds, scaled to 224x224x3x16 and normalized to the range -1 to 1. The data is divided into 4 non-overlapping parts that are fed into 4 parallel groups of Rep 3D CNN networks; the extracted features, after three-dimensional global average pooling and dimension conversion, form the local time-space domain feature map, which is then fed into the Transformer encoder part to extract more global, longer-term, and more representative features in the time domain. Each Rep 3D CNN network consists of 5 Rep 3D CNN modules whose main internal branches use 3x3x3 convolutions; the Rep 3D CNN modules use the training structure during training, and during prediction the convolution, batch normalization, and identity mapping layers are merged, giving higher computational efficiency after merging. The numbers of convolution channels of the 5 modules are 32, 64, 128, 64, and 32, respectively. Each module down-samples only in the spatial domain, not in the time domain. After the 5 modules, the feature scale output by a Rep 3D CNN network is 32x4x7x7 (CxTxWxH). Global average pooling then yields 32x1x7x7 features, which after reshape become a 1x1568-dimensional embedding. The Rep 3D CNN network thus extracts the local time-space domain features, and the 4 Rep 3D CNN networks together form a 4x1568 local spatial-temporal feature embedding. The local time-space domain features are added to learnable position codes, a class token dimension is added, and the output dimension becomes 5x1568; this is then fed into the global attention feature extraction module, i.e. 5 parallel Vision Transformer modules, then into the MLP Head layer, and finally a Sigmoid is connected as the classifier.
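A hedged sketch of this embodiment's overall data flow, reusing Rep3DCNNModule from the earlier sketch. The token layout (4 local embeddings plus 1 class token, dimension 1568) and the channel schedule follow the description above; the Transformer encoder settings (number of heads, depth, feed-forward size) and the single-layer encoder are assumptions not fixed by the patent.

```python
# Sketch of the embodiment's forward pass and tensor shapes (assumptions noted above).
import torch
import torch.nn as nn

class DogLeashNet(nn.Module):
    def __init__(self, groups=4, embed_dim=32 * 7 * 7, channels=(32, 64, 128, 64, 32)):
        super().__init__()
        def branch():
            layers, in_ch = [], 3
            for out_ch in channels:                        # 5 Rep 3D CNN modules per branch
                layers.append(Rep3DCNNModule(in_ch, out_ch, spatial_stride=2))
                in_ch = out_ch
            return nn.Sequential(*layers)
        self.branches = nn.ModuleList([branch() for _ in range(groups)])
        self.pool = nn.AdaptiveAvgPool3d((1, 7, 7))        # 32x4x7x7 -> 32x1x7x7
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, groups + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                                   dim_feedforward=2048, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.head = nn.Sequential(nn.Linear(embed_dim, 1), nn.Sigmoid())  # MLP Head + Sigmoid

    def forward(self, clips):                              # clips: (B, groups, 3, 4, 224, 224)
        tokens = []
        for i, branch in enumerate(self.branches):
            feat = self.pool(branch(clips[:, i]))          # (B, 32, 1, 7, 7)
            tokens.append(feat.flatten(1))                 # (B, 1568) local embedding
        x = torch.stack(tokens, dim=1)                     # (B, 4, 1568)
        x = torch.cat([self.cls_token.expand(x.size(0), -1, -1), x], dim=1) + self.pos_embed
        x = self.encoder(x)                                # (B, 5, 1568)
        return self.head(x[:, 0])                          # class-token output in (0, 1)
```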
The training process of the dog leash recognition model of the embodiment is specifically as follows:
the Sigmoid cross soil moisture Loss function is adopted, ADAM is adopted by the optimizer, the initial learning rate (learning rate) is set to be 0.001, and the weight attenuation parameter is set to be 0.000005. The data was amplified as: randomly sampling according to a time sequence in a single sample group, rotating by a small amplitude (plus or minus 15 degrees horizontally), randomly cutting, dithering colors, and converting a color image into a gray image with a certain probability. Training was randomly initialized using 8 NVIDIA GPUs, with the batch size set to 64, for a total of 120 iterations, and the learning rate divided by 10 every 30 rounds.
Based on the dog leash identification model established above, whether a dog is leashed can be accurately identified in real time; this is further explained below through a specific embodiment. The dog leash identification device based on time-space domain features of this embodiment comprises the dog leash identification model established above, a dog and person detector model for detecting persons and dogs, and a dog and person distance estimation model for estimating the distance between the dog and the person.
For the dog and person detector model, existing classical object detection models such as Faster R-CNN, SSD, RetinaNet, FPN, etc. can be used; for efficiency, the YOLOv4 model trained on the COCO dataset, whose categories already include dog and person, is preferably used. The invention uses the YOLOv4 model only as a pre-trained model, reduces the number of categories from 80 to 2, and then fine-tunes it on a labeled dataset; prediction uses a relatively large input scale (960 x 576), further improving the accuracy of the method in surveillance scenes and for small-target detection.
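The patent fine-tunes a COCO-pretrained YOLOv4 but does not name a specific training framework, so the sketch below substitutes torchvision's COCO-pretrained Faster R-CNN (also listed above as a usable detector) to illustrate the same "keep the pretrained backbone, shrink the head to 2 classes, fine-tune" recipe; it is not the patent's implementation.

```python
# Sketch: restrict a COCO-pretrained detector to person/dog and fine-tune (Faster R-CNN stand-in).
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = detector.roi_heads.box_predictor.cls_score.in_features
detector.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=3)  # background + person + dog
# ... fine-tune on a labelled person/dog dataset, then run inference on 960x576 frames.
```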
Because the monocular cameras commonly used for security surveillance cannot acquire depth information, a high-precision distance estimate cannot be obtained from depth, so the distance between the dog and the person must be estimated. Various existing methods can be used for this estimation; the dog and person distance estimation model in this embodiment estimates the distance from prior information about the actual width of a pedestrian, as follows. Assume the person detection box obtained by the dog and person detector model is person(x_person, y_person, w_person, h_person), where x_person and y_person are the horizontal and vertical coordinates of the center of the person box and w_person and h_person are its width and height, and the dog detection box is dog(x_dog, y_dog, w_dog, h_dog), where x_dog and y_dog are the coordinates of the center of the dog box and w_dog and h_dog are its width and height. Under a conventional surveillance viewing angle (camera height above 2 meters, depression angle above 15 degrees) the prior range of a pedestrian's actual width is 0.25 to 0.5 meters (from side view to front view), and the mean value meanWidth = 0.375 meters is taken as the reference body width. The horizontal pixel distance between the person and the dog is person_dog_pixel_distance = |x_person - x_dog|, and the estimated distance between the dog and the person is then person_dog_real_distance = person_dog_pixel_distance x meanWidth / w_person.
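The estimate above transcribes directly into a small function; the box layout (center x, center y, width, height) follows the notation in the text, and the example call and its numbers are purely illustrative.

```python
def estimate_person_dog_distance(person_box, dog_box, mean_width=0.375):
    """Estimate the dog-person distance in metres from detection boxes (cx, cy, w, h)."""
    x_person, y_person, w_person, h_person = person_box
    x_dog, y_dog, w_dog, h_dog = dog_box
    pixel_distance = abs(x_person - x_dog)            # horizontal pixel distance
    # pixels -> metres, using the pedestrian's prior width as the scale reference
    return pixel_distance * mean_width / w_person

# e.g. estimate_person_dog_distance((400, 300, 60, 160), (520, 330, 40, 30)) ≈ 0.75 m
```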
As shown in Fig. 4, the dog leash identification process in this embodiment specifically includes the following steps (a condensed code sketch of this loop is given after the list):
Step 1, inputting a picture frame after video stream decoding;
Step 2, reducing the image size to 960x576 and detecting with the dog and person detector;
Step 3, judging whether continuous ROI acquisition is in progress (the default at initialization is no); if not, proceeding to step 4; if so, proceeding to step 7;
Step 4, judging whether a dog is present in the image; if not, returning the "normal" state; if so, proceeding to step 5;
Step 5, judging whether a person is present in the image; if not, returning the "dog not leashed" state; if so, proceeding to step 6;
Step 6, calculating the ROI coordinates of the region centered on the dog, and setting the continuous-ROI-acquisition state to yes;
Step 7, acquiring the continuous ROI regions and jumping back to step 1 until the required number of regions is reached;
Step 8, inputting the image sequence into the dog leash identification model and judging whether the dog is leashed; if so, proceeding to step 9; otherwise outputting "dog not leashed";
Step 9, estimating the distance between the dog and the person with the dog and person distance estimation model and judging whether it exceeds the threshold; if so, returning "distance exceeded"; otherwise returning "dog leashed".
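The condensed sketch below expresses steps 1-9 as a per-frame loop. It assumes a `detector` callable returning (person_boxes, dog_boxes) in (cx, cy, w, h) form on the resized frame, a `leash_model` callable classifying an ROI sequence, the `dog_centered_roi` and `estimate_person_dog_distance` helpers sketched earlier, a 16-frame window as in the embodiment, and an illustrative distance threshold; none of these specifics beyond the listed steps are taken from the patent.

```python
import cv2

def process_stream(frames, detector, leash_model, distance_threshold=2.0, n_roi=16):
    roi_window, roi_frames, persons, dogs = None, [], [], []
    for frame in frames:                                          # step 1: decoded frame
        small = cv2.resize(frame, (960, 576))                     # step 2: detect persons/dogs
        persons, dogs = detector(small)
        if roi_window is None:                                    # step 3: not yet collecting ROIs
            if not dogs:                                          # step 4: no dog -> normal
                yield "normal"
                continue
            if not persons:                                       # step 5: dog without person
                yield "dog not leashed"
                continue
            h, w = small.shape[:2]                                # step 6: fix dog-centered ROI
            roi_window = dog_centered_roi(dogs[0], w, h)
        x1, y1, x2, y2 = roi_window
        roi_frames.append(small[y1:y2, x1:x2])                    # step 7: accumulate ROIs
        if len(roi_frames) < n_roi:
            continue
        if not leash_model(roi_frames):                           # step 8: classify the sequence
            yield "dog not leashed"
        else:                                                     # step 9: distance check
            dist = estimate_person_dog_distance(persons[0], dogs[0])
            yield "distance exceeded" if dist > distance_threshold else "dog leashed"
        roi_window, roi_frames = None, []                         # reset for the next clip
```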
The dog leash identification method and device based on time-space domain features can be used independently, or as one link or module combined with other recognition algorithms, such as identification of prohibited large dog breeds or recognition of dog-waste cleanup behavior, to form a complete dog walking behavior monitoring system that monitors various uncivilized or illegal dog walking behaviors.

Claims (8)

1. A dog leash identification method based on time-space domain features, which uses a dog leash identification model to identify whether a dog in a video is leashed, characterized in that the input to the dog leash identification model is an image sequence extracted from a video clip containing a dog as follows: a region of interest centered on the dog is obtained from the frame in which the dog first appears in the video clip, and corresponding regions of interest are then cropped at the same position from a number of subsequent frames; the images of this series of regions of interest form the image sequence; the output of the dog leash identification model is one of two categories, 'dog leashed' and 'dog not leashed'; the dog leash identification model comprises a local time-space domain feature extraction module at the front end for extracting local time-space domain features and reducing dimensionality, a global time-domain attention feature extraction module at the back end for extracting longer-term global features, and an output layer that finally outputs the binary classification result; the local time-space domain feature extraction module consists of several groups of parallel re-parameterizable three-dimensional convolutional neural networks together with their corresponding three-dimensional pooling and dimension-conversion layers, and each re-parameterizable three-dimensional convolutional neural network consists of several Rep 3D CNN modules connected in series; the training structure of the Rep 3D CNN module comprises at least two three-dimensional convolutional layers, each followed by a batch normalization layer and an activation function layer; every three-dimensional convolutional layer in the module has a parallel three-dimensional convolution branch with kernel K1 = 1x1x1, and every three-dimensional convolutional layer except the first one also has a parallel identity mapping branch; the output of each three-dimensional convolutional layer is added to the outputs of its branches and, after passing through the activation layer, is fed into the next three-dimensional convolutional layer; the prediction structure of the Rep 3D CNN module is obtained by applying the following re-parameterization operations to the training structure: fusing each three-dimensional convolutional layer with its batch normalization layer, and merging the 1x1x1 three-dimensional convolution branches and the identity mapping branches into the corresponding three-dimensional convolutional layers; the global time-domain attention feature extraction module is a Vision Transformer model composed of T+1 parallel Vision Transformer modules and an MLP Head layer, where T is the number of parallel re-parameterizable three-dimensional convolutional neural networks.
2. The dog leash identification method based on time-space domain features according to claim 1, characterized in that the output layer is a Sigmoid activation layer.
3. The dog leash identification method based on time-space domain features according to claim 1, characterized in that the region of interest centered on the dog is specifically: the dog's detection box expanded by 5 to 10 times, using the longest edge of the detection box as the reference.
4. The dog leash identification method based on time-space domain features according to claim 1, characterized in that when the output of the dog leash identification model is 'dog leashed', a further decision is made based on the estimated distance between the dog and the person: the final recognition result 'dog leashed' is output only when the estimated distance is not greater than a preset threshold; otherwise the final recognition result 'distance exceeded' is output.
5. A dog leash identification device based on time-space domain features, which uses a dog leash identification model to identify whether a dog in a video is leashed, characterized in that the input to the dog leash identification model is an image sequence extracted from a video clip containing a dog as follows: a region of interest centered on the dog is obtained from the frame in which the dog first appears in the video clip, and corresponding regions of interest are then cropped at the same position from a number of subsequent frames; the images of this series of regions of interest form the image sequence; the output of the dog leash identification model is one of two categories, 'dog leashed' and 'dog not leashed'; the dog leash identification model comprises a local time-space domain feature extraction module at the front end for extracting local time-space domain features and reducing dimensionality, a global time-domain attention feature extraction module at the back end for extracting longer-term global features, and an output layer that finally outputs the binary classification result; the local time-space domain feature extraction module consists of several groups of parallel re-parameterizable three-dimensional convolutional neural networks together with their corresponding three-dimensional pooling and dimension-conversion layers, and each re-parameterizable three-dimensional convolutional neural network consists of several Rep 3D CNN modules connected in series; the training structure of the Rep 3D CNN module comprises at least two three-dimensional convolutional layers, each followed by a batch normalization layer and an activation function layer; every three-dimensional convolutional layer in the module has a parallel three-dimensional convolution branch with kernel K1 = 1x1x1, and every three-dimensional convolutional layer except the first one also has a parallel identity mapping branch; the output of each three-dimensional convolutional layer is added to the outputs of its branches and, after passing through the activation layer, is fed into the next three-dimensional convolutional layer; the prediction structure of the Rep 3D CNN module is obtained by applying the following re-parameterization operations to the training structure: fusing each three-dimensional convolutional layer with its batch normalization layer, and merging the 1x1x1 three-dimensional convolution branches and the identity mapping branches into the corresponding three-dimensional convolutional layers; the global time-domain attention feature extraction module is a Vision Transformer model composed of T+1 parallel Vision Transformer modules and an MLP Head layer, where T is the number of parallel re-parameterizable three-dimensional convolutional neural networks.
6. The dog leash identification device based on time-space domain features according to claim 5, characterized in that the output layer is a Sigmoid activation layer.
7. The dog leash identification device based on time-space domain features according to claim 5, characterized in that the region of interest centered on the dog is specifically: the dog's detection box expanded by 5 to 10 times, using the longest edge of the detection box as the reference.
8. The dog leash identification device based on time-space domain features according to claim 5, characterized in that the device further comprises a dog and person distance estimation model for making a further decision when the output of the dog leash identification model is 'dog leashed': the final recognition result 'dog leashed' is output only when the estimated distance between the dog and the person is not greater than a preset threshold; otherwise the final recognition result 'distance exceeded' is output.
CN202110568106.1A 2021-05-25 2021-05-25 Dog leash identification method and device based on time-space domain features Active CN113011404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110568106.1A CN113011404B (en) 2021-05-25 2021-05-25 Dog leash identification method and device based on time-space domain features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110568106.1A CN113011404B (en) 2021-05-25 2021-05-25 Dog leash identification method and device based on time-space domain features

Publications (2)

Publication Number Publication Date
CN113011404A CN113011404A (en) 2021-06-22
CN113011404B true CN113011404B (en) 2021-08-24

Family

ID=76380804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110568106.1A Active CN113011404B (en) 2021-05-25 2021-05-25 Dog leash identification method and device based on time-space domain features

Country Status (1)

Country Link
CN (1) CN113011404B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299128A (en) * 2021-12-30 2022-04-08 咪咕视讯科技有限公司 Multi-view positioning detection method and device
CN116863298B (en) * 2023-06-29 2024-05-10 深圳市快瞳科技有限公司 Training and early warning sending method, system, device, equipment and medium
CN116994116B (en) * 2023-08-04 2024-04-16 北京泰策科技有限公司 Target detection method and system based on self-attention model and yolov5

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259795A (en) * 2020-01-16 2020-06-09 河南职业技术学院 Human behavior recognition method based on multi-stream deep learning
CN111447410A (en) * 2020-03-24 2020-07-24 安徽工程大学 Dog state identification monitoring system and method
CN111967310A (en) * 2020-07-03 2020-11-20 上海交通大学 Spatiotemporal feature aggregation method and system based on combined attention machine system and terminal

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259795A (en) * 2020-01-16 2020-06-09 河南职业技术学院 Human behavior recognition method based on multi-stream deep learning
CN111447410A (en) * 2020-03-24 2020-07-24 安徽工程大学 Dog state identification monitoring system and method
CN111967310A (en) * 2020-07-03 2020-11-20 上海交通大学 Spatiotemporal feature aggregation method and system based on combined attention machine system and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RepVGG: Making VGG-style ConvNets Great Again; Xiaohan Ding et al.; CVPR 2021; 2021-01-11; pp. 1-10 *

Also Published As

Publication number Publication date
CN113011404A (en) 2021-06-22

Similar Documents

Publication Publication Date Title
CN113011404B (en) Dog leash identification method and device based on time-space domain features
EP3614308B1 (en) Joint deep learning for land cover and land use classification
US10019652B2 (en) Generating a virtual world to assess real-world video analysis performance
CN107016357B (en) Video pedestrian detection method based on time domain convolutional neural network
CN109766856B (en) Method for recognizing postures of lactating sows through double-current RGB-D Faster R-CNN
US8744125B2 (en) Clustering-based object classification
CN105531995A (en) System and method for object and event identification using multiple cameras
US11694354B2 (en) Geospatial object geometry extraction from imagery
CN111191667B (en) Crowd counting method based on multiscale generation countermeasure network
CN112906678B (en) Illegal dog walking event detection method and device based on monitoring video
CN105678284A (en) Fixed-position human behavior analysis method
CN108830246B (en) Multi-dimensional motion feature visual extraction method for pedestrians in traffic environment
Brahmbhatt et al. StuffNet: Using ‘Stuff’to improve object detection
KR20210137213A (en) Image processing method and apparatus, processor, electronic device, storage medium
Tahboub et al. Quality-adaptive deep learning for pedestrian detection
Zhao et al. Image dehazing based on haze degree classification
CN107665325A (en) Video accident detection method and system based on atomic features bag model
Savakis et al. Semantic background estimation in video sequences
Veit et al. Space-time a contrario clustering for detecting coherent motions
CN110175521A (en) Method based on double camera linkage detection supervision indoor human body behavior
Perko et al. Protocol design issues for object density estimation and counting in remote sensing
Di et al. Stacked hourglass deep learning networks based on attention mechanism in multi-person pose estimation
CN109145744A (en) A kind of LSTM network pedestrian recognition methods again based on adaptive prediction mode
CN113780109B (en) Zebra crossing detection method and system based on quaternion space-time convolutional neural network
KR102435962B1 (en) System for analyzing information using video and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 210000 Longmian Avenue 568, High-tech Park, Jiangning District, Nanjing City, Jiangsu Province

Patentee after: Xiaoshi Technology (Jiangsu) Co.,Ltd.

Address before: 210000 Longmian Avenue 568, High-tech Park, Jiangning District, Nanjing City, Jiangsu Province

Patentee before: NANJING ZHENSHI INTELLIGENT TECHNOLOGY Co.,Ltd.