CN111639563B - Basketball video event and target online detection method based on multitasking - Google Patents

Basketball video event and target online detection method based on multitasking

Info

Publication number
CN111639563B
CN111639563B
Authority
CN
China
Prior art keywords
event
time
target
frame
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010419217.1A
Other languages
Chinese (zh)
Other versions
CN111639563A (en
Inventor
华璟
王腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202010419217.1A priority Critical patent/CN111639563B/en
Publication of CN111639563A publication Critical patent/CN111639563A/en
Application granted granted Critical
Publication of CN111639563B publication Critical patent/CN111639563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/44Event detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multitask-based method for online detection of basketball video events and targets, which uses a deep convolutional neural network with weights shared across tasks and can detect events and targets in basketball game videos in either an online or an offline mode. Based on a multi-task hybrid loss function, the loss of each task is back-propagated to its corresponding branch, which accelerates the learning of each task branch. The losses of the two tasks are also added with specific weights to obtain an overall loss that is back-propagated jointly, so that the backbone network learns a latent feature-induction mode shared by the two tasks. Semi-supervised pseudo-label mining expands the training data and effectively suppresses low-quality event prediction boxes that deviate from the event climax moment and low-quality bounding boxes that deviate from the target's geometric center. The spatio-temporal multi-scale network structure makes full use of multi-stride temporal information and summarizes multi-scale historical features, effectively improving the recall and precision of event detection.

Description

Basketball video event and target online detection method based on multitasking
Technical Field
The invention belongs to the technical field of video event detection and video target detection, and in particular relates to a multitask-based method for online detection of events and targets in basketball game videos.
Background
Video event and object detection are key technologies in video understanding. With the rapid development of communication and computer technology and the steady growth of the sports industry, the volume of amateur and professional sports game video data has grown explosively. Sports game video is a video resource containing a large number of events and targets; it has a very wide audience and great research value, and it also places higher demands on the fine-grained processing, archiving and sharing of video material. In recent years, progress in deep learning based on convolutional neural networks and in high-performance parallel computing devices has made it possible to meet this demand.
Video target detection aims to accurately detect the targets appearing in video frames and is important for crowd monitoring, automatic driving and other applications. The simplest approach is to run an ordinary still-image object detection network on every frame, but this ignores inter-frame temporal information, and in practice bounding-box jitter and abrupt classification changes easily occur between frames. Using 3D convolutions, optical-flow maps, LSTM layers, multi-stream architectures and the like can exploit inter-frame temporal information well, but they introduce a large amount of computation and place high demands on computing devices.
Existing video event detection research mainly detects events in game videos in an offline processing mode and lacks practicality for online scenarios such as live broadcasting and rebroadcasting. Existing online video event detection only classifies single frames and does not predict the start and end times of the current event. In addition, existing video event detection research is largely separated from video target detection research, leading to redundant computation on the same video, and event detection fails to exploit target position and motion information. A method that can simultaneously and efficiently detect events and targets in basketball videos online is therefore still lacking.
Disclosure of Invention
In view of the defects of the prior art and the demand for improvement, the invention aims to provide a method for efficiently detecting events and targets in basketball videos online, based on sharing the bottom-layer weights of a convolutional neural network across multiple tasks.
In order to achieve the above purpose, the present invention is realized by the following technical scheme: a basketball video event and target online detection method based on multitasking comprises the following steps:
s1: neural network construction based on multi-scale feature induction and expression:
the backbone network layer is a Resnet network, and a time domain replacement module is added to the backbone network layer; the time domain replacement module is added before the non-bypass convolution of each residual structure in the Resnet network: the first 1/m of the channels of the feature map at the current moment are replaced by the values stored in the cache from the previous moment, and the first 1/m channels of the current frame's feature map at that layer are updated into the cache, where m > 1;
taking the output of the conv3_x layer, the conv4_x layer and the conv5_x layer of the network as the input of the feature pyramid network to obtain five space-time features F3-F7 with different scales;
inputting the two lowest-resolution feature maps F6 and F7 into an event detection head, which splits into two paths, each passing through 4 convolution layers and a global average pooling layer; one path outputs an event classification score of size 1 × C_e, where C_e is the number of event classes; the other path splits into two sub-paths, one outputting the event start and end time offsets of size 1 × 2 and the other outputting the event climax score of size 1 × 1;
inputting the feature maps F3-F7 of different scales into a target detection head, which splits into two paths, each passing through 4 convolution layers; one path outputs a classification score of size H × W × C_o, where C_o is the number of target classes; the other path splits into two sub-paths, one outputting regression box coordinates of size H × W × 4 and the other outputting a target center score of size H × W × 1, where H × W is the resolution of the feature map output by the preceding layer;
s2: training a neural network:
the target detection loss comprises a classification loss, a regression loss and a semi-supervised center-offset loss; these losses are added with specific weights to obtain the total target detection loss;
the event detection loss comprises a classification loss, a regression loss and a semi-supervised climax-offset loss; these losses are added with specific weights to obtain the total event detection loss;
the target detection loss and the event detection loss are computed independently, and the loss of each task is back-propagated to its corresponding branch, which accelerates the learning of each task branch; the losses of the two tasks are also added with specific weights to obtain a multi-task hybrid loss function that is back-propagated jointly, so that the bottom backbone network learns a latent feature-induction mode shared by the two tasks; gradient descent is used to minimize the multi-task hybrid loss function to find the optimal network model parameters.
S3: reasoning and result processing:
obtaining target detection and event detection results through multi-branch forward propagation by using a trained neural network;
the target classification score is multiplied by the target center score to suppress the large number of low-quality regression boxes that deviate from the center, giving the final target classification score, and the class with the highest final score is taken as the class of the target;
and multiplying the event classification score by the event climax scoring value to obtain a final event classification score value, and taking the classification with the highest final score value as the event classification corresponding to the current frame.
Further, after the time domain replacement module is added, the feature map of the j-th layer at time t is computed as
F_{j,t} = f_conv(f_concat(F_{i,t-n}, F_{i,t}))
wherein: F_{i,t} is the feature map at time t output by the i-th layer, F_{i,t-n} is the feature map at time t-n output by the i-th layer, f_conv is the residual-structure operation, f_concat is the splicing operation along the channel dimension, F_{j,t} is the output at time t obtained by the j-th layer, and j is the input of the next residual block after i;
the feature map of the j-th layer at time t-n is computed as
F_{j,t-n} = f_conv(f_concat(F_{i,t-2n}, F_{i,t-n}))
wherein: F_{i,t-n} is the feature map at time t-n output by the i-th layer, F_{i,t-2n} is the feature map at time t-2n output by the i-th layer, and F_{j,t-n} is the output at time t-n obtained by the j-th layer;
by expanding the network structure in this way, the feature map at time t output by the k-th layer, which contains information from multiple time steps, is obtained, where k is the input of the next residual block after j:
F_{k,t} = f_conv(f_concat(F_{j,t-n}, F_{j,t}))
further, the process of obtaining the five spatio-temporal features F3-F7 with different scales specifically includes:
the conv3_x, conv4_x and conv5_x layers of the Resnet network are denoted C3, C4 and C5; a convolution operation is applied to each of the C3, C4 and C5 layers to obtain C3', C4' and C5'; C5' is downsampled twice in succession to obtain F6 and F7; C5' is output directly as F5; C5' is upsampled and added to C4' to obtain F4; F4 is upsampled and added to C3' to obtain F3; and F3-F7 constitute the pyramid feature map structure.
Further, for five space-time features F3-F7 with different scales, targets with different sizes are distributed to feature maps F3-F7 with different scales for detection, small targets are mainly extracted from a high-resolution bottom-layer feature map, usually F3 and F4 layers, large targets are mainly extracted from a lower-resolution middle-high-layer feature map, usually F4-F7 layers, and two-layer feature maps F6 and F7 with the lowest resolution are connected with an event detection head for expression and extraction of event features.
Further, for the acquired video stream, an image is extracted every n frames and converted into the RGB color space; the image is resampled, preserving the aspect ratio, so that its short side is 800 pixels, and the ImageNet mean on the three RGB channels is subtracted from the resampled image, which serves as the input of the neural network.
Further, a training set is expanded using a semi-supervised pseudo-label mining method, supplementing the labels originally annotated every n frames with target detection pseudo-labels for the remaining frames, specifically: a SOTA pedestrian re-identification model is used to extract the feature expression of each target in a frame, and multi-target similarity matching is performed on the feature vectors of two frames n frames apart, where the similarity is computed as cos θ_{i,j} = (o_t^i · o_{t+n}^j) / (‖o_t^i‖ ‖o_{t+n}^j‖), with o_t^i ∈ O_t and o_{t+n}^j ∈ O_{t+n}; O_t is the set of feature vectors of targets in the frame at time t and O_{t+n} is the set of feature vectors of targets in the frame at time t+n; if cos θ_{i,j} ≥ threshold, the two annotation boxes are considered to belong to the same person and the pair is successfully matched; when the proportion P_success of successfully matched targets between the two frames exceeds a threshold T, the two frames are considered successfully matched, and for each matched pair the box size and position in the unlabeled intermediate frames are computed by linear interpolation, yielding pseudo-labels for the missing frames; if P_success is less than or equal to the threshold T, no pseudo-labels are generated; the proportion of successfully matched targets P_success is computed from O_success, the number of target pairs successfully matched between the frame at time t and the frame at time t+n, relative to the numbers of targets in O_t and O_{t+n}, which have the same meaning as in the previous formula.
Further, a training set is expanded using the semi-supervised pseudo-label mining method; in order to effectively suppress low-quality event prediction boxes that deviate from the event climax moment, a climax-score pseudo-label is obtained by computing how close the current moment t is to the climax moment within the start and end times of the event, using T_b*, the duration from the event start time to the climax moment, T_l*, the duration from the event start time to the current moment t, T_f*, the duration from the climax moment to the event end time, and T_r*, the duration from the current moment t to the event end time; the pseudo-label value at the true climax moment is 1, and the constructed pseudo-label values decrease nonlinearly along the time axis from the climax point towards both sides within the event until they coincide with the event start and end times, where the labels are 0.
Further, a training set is expanded using the semi-supervised pseudo-label mining method; in order to effectively suppress low-quality prediction boxes that deviate from the geometric center of the target, a target-center-score pseudo-label is obtained by computing how close the current position is to the geometric center of the annotation box containing it, using left*, the distance from the current position to the left side of the annotation box, right*, the distance to the right side, top*, the distance to the top, and bottom*, the distance to the bottom, together with a parameter ε that adjusts the distribution of the values; the pseudo-label value at the true geometric center is 1+ε, and the constructed pseudo-label values decrease radially from the geometric center outwards within the target annotation box until they reach the box boundary, where the labels are 0+ε.
Further, the target detection loss L_object combines a classification loss, a regression loss and a center-score loss, normalized by the number of positive samples N_pos_obj and weighted by α and β,
wherein: N_pos_obj is the number of positive target samples, c_{x,y} is the target class predicted by the feature point at coordinates (x, y), c*_{x,y} is the annotated target class of the feature point at (x, y), α and β are weight parameters, b_{x,y} are the predicted box parameters corresponding to the feature point at (x, y), b*_{x,y} are the annotated box parameters corresponding to the feature point at (x, y), r_{x,y} is the target center score predicted by the feature point at (x, y), r*_{x,y} is the pseudo-label target center score of the feature point at (x, y), L_cls is the cross-entropy classification loss function, L_reg is the GIoU loss function, and L_ctr is a binary loss function;
the event detection loss L_event likewise combines a classification loss, a regression loss and a climax-score loss, normalized by the number of positive samples N_pos_ev and weighted by γ and δ,
wherein: N_pos_ev is the number of positive event samples, e_t is the event class predicted for the frame at time t, e*_t is the annotated class of the frame at time t, γ and δ are weight parameters, l_t is the predicted duration from the event start time to the current time t, r_t is the predicted duration from the current time t to the event end time, l* is the annotated duration from the event start time to the climax moment, r* is the annotated duration from the climax moment to the event end time, t* is the annotated climax moment, measured as an offset from the beginning of the whole video, h_t is the climax score predicted for the frame at time t, h*_t is the pseudo-label climax score of the frame at time t, L_cls is the cross-entropy classification loss function, L_hot is a binary loss function, and L_reg is the intersection-over-union, on the time axis, between the detected event span and the ground-truth span;
the total loss is L_total = L_object + λ·L_event, where λ is a weight parameter.
Further, in step S3, for the output of the network's target detection part, the four regression-box parameters are converted into the usual diagonal two-point coordinate form of a calibration box, and the original-image coordinates corresponding to a coordinate point (x, y) in a feature map of a given scale are computed from s, the reduction factor of the current feature map relative to the original image; the regressed prediction boxes are then suppressed using NMS;
for the output of the network's event detection part, if the final event score is below the threshold it is suppressed and no event is considered to be occurring; in online processing, the current event is judged to be occurring when the score exceeds the threshold for three consecutive frames; in offline processing, events that overlap on the time axis are deduplicated and merged.
Compared with the prior art, the invention has the following advantages: the method shares weights across tasks in a deep convolutional neural network and can perform event detection and target detection on basketball game videos in either an online or an offline mode. Based on the multi-task hybrid loss function, the loss of each task is back-propagated to its corresponding branch, which accelerates the learning of each task branch and improves the detection capability of each branch. In addition, the losses of the two tasks are added with specific weights to obtain an overall loss that is back-propagated jointly, so that the underlying backbone network learns a latent feature-induction mode shared by the two tasks. Semi-supervised pseudo-label mining expands the training data, effectively suppresses low-quality event prediction boxes that deviate from the event climax moment and low-quality bounding boxes that deviate from the target's geometric center, improves the accuracy of target detection and reduces the false-alarm rate of event detection. The spatio-temporal multi-scale network structure makes full use of multi-stride temporal information to summarize multi-scale historical features: small targets are mainly extracted from the high-resolution bottom-level feature maps, large targets from the lower-resolution middle and high-level feature maps, and the two lowest-resolution feature maps, which summarize global motion information, effectively improve the recall and precision of event detection.
Drawings
FIG. 1 is a flow chart of a method for online detection of basketball video events and targets based on multitasking in accordance with an embodiment of the present invention;
FIG. 2 is a neural network architecture diagram of a method for online detection of basketball video events and targets based on multitasking in accordance with an embodiment of the invention.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
As shown in fig. 1, the method for online detecting basketball video events and targets based on multitasking provided by the application comprises the following steps:
S1: Neural network construction based on multi-scale feature induction and expression. As shown in fig. 2, the backbone network is a Resnet-50 network.
S1.1: In order to make full use of multi-stride temporal information and summarize multi-scale historical features for temporal-information extraction, a time domain replacement module is added to the backbone network layer, which strengthens the network's ability to learn along the time dimension and improves the stability of target detection and the accuracy of event detection. Since adding a time domain replacement module before or on the bypass (shortcut) of a residual structure in Resnet would destroy spatial information, the module is added before the non-bypass convolution of each residual structure in the Resnet network: at each moment, the first 1/m of the channels of the feature map are replaced by the values stored in the cache from the previous processed moment, and the first 1/m channels of the current frame's feature map at that layer are updated into the cache, where m > 1. The feature map of the j-th layer at time t is then computed as
F_{j,t} = f_conv(f_concat(F_{i,t-n}, F_{i,t}))
wherein: F_{i,t} is the feature map at time t output by the i-th layer, F_{i,t-n} is the feature map at time t-n output by the i-th layer, f_conv is the residual-structure operation, f_concat is the splicing operation along the channel dimension, F_{j,t} is the output at time t obtained by the j-th layer, and j is the input of the next residual block after i. Similarly, by expanding along time, the feature map of the j-th layer at time t-n is
F_{j,t-n} = f_conv(f_concat(F_{i,t-2n}, F_{i,t-n}))
wherein: F_{i,t-n} is the feature map at time t-n output by the i-th layer, F_{i,t-2n} is the feature map at time t-2n output by the i-th layer, and F_{j,t-n} is the output at time t-n obtained by the j-th layer.
Therefore, by expanding the network structure, the feature map at time t output by the k-th layer, which contains information from multiple time steps, can be obtained, where k is the input of the next residual block after j:
F_{k,t} = f_conv(f_concat(F_{j,t-n}, F_{j,t}))
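The cached-channel replacement can be made concrete with a short sketch. The module below is a minimal, hypothetical PyTorch illustration of the idea described above (swap the first 1/m channels with the values cached from the previously processed frame, then refresh the cache); the class and parameter names are our own, and the patent's exact implementation may differ.
```python
import torch
import torch.nn as nn

class TemporalReplace(nn.Module):
    """Minimal sketch of the time domain replacement module.

    Placed before the non-bypass convolution of a residual block: the first
    1/m of the channels are swapped with the values cached from the previously
    processed frame, and the cache is then refreshed with the current frame.
    """
    def __init__(self, channels: int, m: int = 8):
        super().__init__()
        assert m > 1
        self.split = channels // m   # number of channels taken from the cache
        self.cache = None            # first 1/m channels of the last processed frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, C, H, W)
        cur = x[:, :self.split].detach().clone()          # channels to cache for this frame
        if self.cache is None or self.cache.shape != cur.shape:
            past = torch.zeros_like(x[:, :self.split])    # no history at the first frame
        else:
            past = self.cache
        self.cache = cur                                   # refresh the cache
        return torch.cat([past, x[:, self.split:]], dim=1)

# Usage inside a residual block (sketch): out = conv(TemporalReplace(C)(x)) + shortcut(x)
```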
s1.2, respectively marking a conv3_x stage layer, a conv4_x stage layer and a conv5_x stage layer of a Resnet network as C3, C4 and C5, and carrying out feature fusion on the C3, C4 and C5 layers to fully utilize multi-scale airspace information, wherein the fusion method is to respectively carry out convolution operation on the C3, C4 and C5 layers (1X 1 convolution is applied to the output of the C3, C4 and C5 layers) to obtain C3', C4 and C5', carrying out downsampling on the C5' twice to respectively obtain F6 and F7, directly outputting the C5' as F5, carrying out upsampling on the C5' and adding the C4' to obtain F4, and adding the upsampled F4 and the C3' to obtain F3, wherein the F3 is calculated in the following way: f3 =f up (f up (C5 ') +c4 ') +c3'; wherein f up For the upsampling method, F4, F5, F6, F7 will not be described again. f3-F7 constitute a pyramid feature map structure that facilitates the expression of multiple scale features.
Targets of different sizes are assigned to feature maps of different scales for detection: small targets are mainly extracted from the high-resolution bottom-level feature maps (usually the F3 and F4 layers), and large targets are mainly extracted from the lower-resolution middle and high-level feature maps (usually the F4-F7 layers). The two lowest-resolution feature maps F6 and F7, because they summarize global motion information, carry latent high-level semantics and are connected to the event detection head for the expression and extraction of event features.
S1.3: The event detection head and the target detection head are used to obtain the multi-task outputs.
The inputs of the event detection head are the feature maps F6 and F7, which have already passed through several convolution layers. The head splits into two paths, each passing through 4 convolution layers and a global average pooling layer; one path outputs a score of size 1 × C_e, where C_e is the number of event classes; the other path splits into two sub-paths, one outputting the event start and end time offsets of size 1 × 2 and the other outputting the event climax score of size 1 × 1. The global average pooling layer lets the network adapt to video inputs of different resolutions and gives good robustness to videos of different resolutions: it averages the unpooled event classification feature map over its H × W spatial positions to produce the event classification output, where H × W is the resolution of the feature map output by the preceding layer and C_e, the number of channels of that feature map, equals the number of event classes; the event climax score output and the regression output of the event start and end time offsets are obtained in the same way from their respective sub-path feature maps.
The inputs of the target detection head are the feature maps F3, F4, F5, F6 and F7, which have already passed through several convolution layers. The head likewise splits into two paths, each passing through 4 convolution layers; one path outputs a classification score of size H × W × C_o, where C_o is the number of target classes; the other path splits into two sub-paths, one outputting regression box coordinates of size H × W × 4 and the other outputting a target center score of size H × W × 1, where H × W is the resolution of the feature map output by the preceding layer.
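To make the two head layouts concrete, the sketch below wires up an event head (4 convolutions plus global average pooling, with classification, start/end offset and climax-score outputs) and a target head (4 convolutions, with classification, box regression and center-score outputs); channel widths, kernel sizes and activations are assumptions not specified by the patent.
```python
import torch.nn as nn

def conv_stack(ch, n=4):
    """n convolution layers forming one path of a detection head."""
    layers = []
    for _ in range(n):
        layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class EventHead(nn.Module):
    def __init__(self, ch=256, num_events=10):
        super().__init__()
        self.cls_path, self.reg_path = conv_stack(ch), conv_stack(ch)
        self.cls = nn.Conv2d(ch, num_events, 3, padding=1)   # -> 1 x C_e after pooling
        self.offsets = nn.Conv2d(ch, 2, 3, padding=1)        # -> 1 x 2 (start/end offsets)
        self.climax = nn.Conv2d(ch, 1, 3, padding=1)         # -> 1 x 1 (climax score)

    def forward(self, f):
        gap = lambda t: t.mean(dim=(2, 3))                   # global average pooling
        c, r = self.cls_path(f), self.reg_path(f)
        return gap(self.cls(c)), gap(self.offsets(r)), gap(self.climax(r))

class TargetHead(nn.Module):
    def __init__(self, ch=256, num_classes=3):
        super().__init__()
        self.cls_path, self.reg_path = conv_stack(ch), conv_stack(ch)
        self.cls = nn.Conv2d(ch, num_classes, 3, padding=1)  # -> H x W x C_o
        self.box = nn.Conv2d(ch, 4, 3, padding=1)            # -> H x W x 4
        self.ctr = nn.Conv2d(ch, 1, 3, padding=1)            # -> H x W x 1

    def forward(self, f):
        c, r = self.cls_path(f), self.reg_path(f)
        return self.cls(c), self.box(r), self.ctr(r)
```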
S2: neural network training
For the acquired video stream, one image is extracted every n frames and converted to the RGB color space; the image is resampled, preserving the aspect ratio, so that its short side is 800 pixels, and the ImageNet mean on the three RGB channels is subtracted from the resampled image, which then serves as the input of the neural network. Taking a frame of size 800 × 1024 as an example, the C3, C4 and C5 output layers of the network have sizes 100 × 128, 50 × 64 and 25 × 32, with 512, 1024 and 2048 channels respectively. The outputs of the C3, C4 and C5 layers are used as the inputs of the FPN (feature pyramid network), and downsampled and upsampled results are fused so that the backbone network obtains 5 spatio-temporal features F3-F7 of different scales, with sizes 100 × 128, 50 × 64, 25 × 32, 13 × 16 and 7 × 8 respectively. The training input data are augmented using a semi-supervised pseudo-label mining method, and the multi-task hybrid loss function is minimized by gradient descent with the Adam optimizer.
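A possible preprocessing routine consistent with the paragraph above (sample one frame every n frames, resize the short side to 800 pixels while preserving the aspect ratio, subtract the ImageNet channel means) might look as follows; the use of OpenCV and the mean values expressed on the 0-255 scale are assumptions.
```python
import cv2
import numpy as np

IMAGENET_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)  # RGB, 0-255 scale

def preprocess(frame_bgr: np.ndarray, short_side: int = 800) -> np.ndarray:
    """Convert a BGR frame into the network input described above."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    h, w = rgb.shape[:2]
    scale = short_side / min(h, w)                       # keep the aspect ratio
    rgb = cv2.resize(rgb, (int(round(w * scale)), int(round(h * scale))))
    return rgb.astype(np.float32) - IMAGENET_MEAN        # subtract per-channel mean

def sample_frames(video_path: str, n: int = 5):
    """Yield one preprocessed frame every n frames of the video stream."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % n == 0:
            yield preprocess(frame)
        idx += 1
    cap.release()
```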
S2.1: Expanding the training set by semi-supervised pseudo-label mining
In order to effectively augment the training data, the labels originally annotated every n frames are supplemented with target detection pseudo-labels for the remaining frames. A SOTA pedestrian re-identification (ReID) model is used to extract the feature expression of each target in a frame, each target yielding a 4096-dimensional vector, and multi-target similarity matching is performed on the feature vectors of two frames n frames apart, where the similarity is computed as cos θ_{i,j} = (o_t^i · o_{t+n}^j) / (‖o_t^i‖ ‖o_{t+n}^j‖), with o_t^i ∈ O_t and o_{t+n}^j ∈ O_{t+n}; O_t is the set of feature vectors of targets in the frame at time t and O_{t+n} is the set of feature vectors of targets in the frame at time t+n. If cos θ_{i,j} ≥ threshold, the two annotation boxes are considered to belong to the same person and the pair is successfully matched. When the proportion P_success of successfully matched targets between the two frames exceeds a threshold T, the two frames are considered successfully matched, and for each matched pair the box size and position in the unlabeled intermediate frames are computed by linear interpolation, yielding pseudo-labels for the missing frames; if P_success is less than or equal to the threshold T, no pseudo-labels are generated. The proportion of successfully matched targets P_success is computed from O_success, the number of target pairs successfully matched between the frame at time t and the frame at time t+n, relative to the numbers of targets in O_t and O_{t+n}, which have the same meaning as in the previous formula.
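The matching and interpolation step can be sketched as follows; the greedy pairing, the exact definition of the matched-target proportion (here 2·matched / (|O_t| + |O_{t+n}|)) and the threshold values are our assumptions about details the text leaves open.
```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def mine_pseudo_boxes(feats_t, boxes_t, feats_tn, boxes_tn, n,
                      sim_thr=0.8, match_thr=0.6):
    """Greedy ReID matching between frame t and frame t+n, then linear
    interpolation of matched boxes for the n-1 unlabeled frames in between.
    feats_*: lists of 4096-d ReID vectors, boxes_*: lists of (x1, y1, x2, y2)."""
    pairs, used = [], set()
    for i, fi in enumerate(feats_t):
        best_j, best_s = -1, sim_thr
        for j, fj in enumerate(feats_tn):
            if j in used:
                continue
            s = cosine(fi, fj)
            if s >= best_s:
                best_j, best_s = j, s
        if best_j >= 0:
            pairs.append((i, best_j))
            used.add(best_j)

    # assumed definition of the proportion of successfully matched targets
    p_success = 2 * len(pairs) / max(len(feats_t) + len(feats_tn), 1)
    if p_success <= match_thr:
        return {}                                  # frames not matched: no pseudo-labels

    pseudo = {k: [] for k in range(1, n)}          # frame offset -> interpolated boxes
    for i, j in pairs:
        b0, b1 = np.asarray(boxes_t[i], float), np.asarray(boxes_tn[j], float)
        for k in range(1, n):
            alpha = k / n
            pseudo[k].append((1 - alpha) * b0 + alpha * b1)   # linear interpolation
    return pseudo
```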
Meanwhile, in order to effectively suppress low-quality event prediction boxes that deviate from the event climax moment, a climax-score pseudo-label is obtained by computing how close the current moment t is to the climax moment within the start and end times of the event, using T_b*, the duration from the event start time to the climax moment, T_l*, the duration from the event start time to the current moment t, T_f*, the duration from the climax moment to the event end time, and T_r*, the duration from the current moment t to the event end time; a square-root operation makes the attenuation of the hotness label slow at first and then steep, reducing the semi-supervised penalty overall. The pseudo-label value at the true climax moment is therefore 1, and the constructed pseudo-label values decrease nonlinearly along the time axis from the climax point towards both sides within the event until they coincide with the event start and end times, where the labels are 0, effectively suppressing low-quality event predictions that deviate from the event climax moment.
In order to effectively suppress low-quality prediction boxes that deviate from the geometric center of the target, a position score (i.e., the target center score) pseudo-label is obtained by computing how close the current position is to the geometric center of the annotation box containing it, using left*, the distance from the current position to the left side of the annotation box, right*, the distance to the right side, top*, the distance to the top, and bottom*, the distance to the bottom, together with a parameter ε that adjusts the distribution of the values. The pseudo-label value at the true geometric center is therefore 1+ε, and the constructed pseudo-label values decrease radially from the geometric center outwards within the target annotation box until they reach the box boundary, where the labels are 0+ε, effectively suppressing low-quality prediction boxes that deviate from the geometric center of the target. A sketch of both pseudo-label scores is given below.
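The climax-score and center-score pseudo-labels are given by formulas in the source, but the original equations are only available as images; the sketch below therefore reconstructs them under the assumption of an FCOS-style square-rooted min/max ratio, which matches the stated properties (value 1 at the climax or geometric center, nonlinear decay to 0 at the event boundaries or box edges, decay that is slow at first and then steep). Treat the exact expressions as assumptions, not the patent's formulas.
```python
import math

def climax_score(t_l: float, t_b: float, t_r: float, t_f: float) -> float:
    """Assumed climax (hotness) pseudo-label for the current moment t.
    t_b: event start -> climax, t_l: event start -> t,
    t_f: climax -> event end,  t_r: t -> event end."""
    if t_b <= 0 or t_f <= 0:
        return 0.0
    left = min(t_l, t_b) / max(t_l, t_b)       # 1 at the climax, 0 at the event start
    right = min(t_r, t_f) / max(t_r, t_f)      # 1 at the climax, 0 at the event end
    return math.sqrt(left * right)             # sqrt: decay is slow at first, then steep

def center_score(left: float, right: float, top: float, bottom: float,
                 eps: float = 0.0) -> float:
    """Assumed center pseudo-label for a position inside its annotation box;
    eps shifts the whole distribution as described in the text."""
    lr, tb = max(left, right), max(top, bottom)
    if min(left, right, top, bottom) < 0 or lr <= 0 or tb <= 0:
        return eps
    ratio = (min(left, right) / lr) * (min(top, bottom) / tb)
    return math.sqrt(ratio) + eps              # 1 + eps at the geometric center
```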
S2.2: Minimizing the multi-task hybrid loss function
The target detection loss L_object comprises a classification loss, a regression loss and a semi-supervised center-offset loss, which are added with specific weights to obtain the total target detection loss. The event detection loss L_event comprises a classification loss, a regression loss and a semi-supervised climax-offset loss, which are added with specific weights to obtain the total event detection loss.
The target detection loss and the event detection loss are computed independently, and the loss of each task is back-propagated to its corresponding branch, which accelerates the learning of each task branch and improves the detection capability of each branch. In addition, the losses of the two tasks are added with specific weights to obtain an overall loss that is back-propagated jointly, so that the underlying backbone network learns a latent feature-induction mode shared by the two tasks.
The target detection loss L_object combines the classification loss, the regression loss and the center-score loss, normalized by the number of positive samples N_pos_obj and weighted by α and β,
wherein: N_pos_obj is the number of positive target samples, c_{x,y} is the target class predicted by the feature point at coordinates (x, y), c*_{x,y} is the annotated target class of the feature point at (x, y), α and β are weight parameters, b_{x,y} are the predicted box parameters corresponding to the feature point at (x, y), b*_{x,y} are the annotated box parameters corresponding to the feature point at (x, y), r_{x,y} is the target center score predicted by the feature point at (x, y), r*_{x,y} is the pseudo-label target center score of the feature point at (x, y), L_cls is the cross-entropy classification loss function, L_reg is the GIoU loss function, and L_ctr is a binary loss function.
The event detection loss L_event likewise combines the classification loss, the regression loss and the climax-score loss, normalized by the number of positive samples N_pos_ev and weighted by γ and δ,
wherein: N_pos_ev is the number of positive event samples, e_t is the event class predicted for the frame at time t, e*_t is the annotated class of the frame at time t, γ and δ are weight parameters, l_t is the predicted duration from the event start time to the current time t, r_t is the predicted duration from the current time t to the event end time, l* is the annotated duration from the event start time to the climax moment, r* is the annotated duration from the climax moment to the event end time, t* is the annotated climax moment, measured as an offset from the beginning of the whole video, h_t is the climax score predicted for the frame at time t, h*_t is the pseudo-label climax score of the frame at time t, L_cls is the cross-entropy classification loss function, L_hot is a binary loss function, and L_reg is the intersection-over-union, on the time axis, between the detected event span and the ground-truth span.
In summary, the total loss is L_total = L_object + λ·L_event, where λ is a weight parameter, L_object is the target detection loss and L_event is the event detection loss.
During training, the multi-task hybrid loss function L_total is minimized by gradient descent to find the best network model parameters.
S3: inference and result processing
And obtaining target detection and event detection results through multi-branch forward propagation by using the trained neural network.
The target classification score is multiplied by the center score to suppress the large number of low-quality regression boxes that deviate from the center, giving the final target classification score, and the class with the highest final score is taken as the class of the target. The four regression-box parameters are converted into the usual diagonal two-point coordinate form of a calibration box. The original-image coordinates corresponding to a coordinate point (x, y) in a feature map of a given scale are computed from s, the reduction factor of the current feature map relative to the original image. The regressed prediction boxes are then suppressed using NMS.
The event classification score is multiplied by the event climax (hotness) score to obtain the final event classification score, and the class with the highest final score is taken as the event class corresponding to the current frame. If the final score is below the threshold it is suppressed, and no event is considered to be occurring. In online processing, the current event is judged to be occurring when the score exceeds the threshold for three consecutive frames. In offline processing, events that overlap on the time axis are deduplicated and merged. A sketch of this post-processing is given below.
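One way to realize the post-processing just described (score × center/climax multiplication, mapping feature-map coordinates back to image coordinates with the stride s, NMS, a score threshold and the three-consecutive-frames rule for online event confirmation) is sketched below; the half-stride offset in the coordinate mapping and all threshold values are assumptions.
```python
import torch
from torchvision.ops import nms

def decode_targets(cls_scores, ctr_scores, ltrb, stride, score_thr=0.3, iou_thr=0.5):
    """cls_scores: (H, W, C), ctr_scores: (H, W), ltrb: (H, W, 4) for one feature level."""
    h, w, c = cls_scores.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # map feature-map points back to original-image coordinates (offset assumed)
    cx = xs.float() * stride + stride // 2
    cy = ys.float() * stride + stride // 2
    boxes = torch.stack([cx - ltrb[..., 0], cy - ltrb[..., 1],
                         cx + ltrb[..., 2], cy + ltrb[..., 3]], dim=-1).reshape(-1, 4)
    scores, labels = (cls_scores * ctr_scores[..., None]).reshape(-1, c).max(dim=-1)
    keep = scores > score_thr
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = nms(boxes, scores, iou_thr)                      # suppress overlapping boxes
    return boxes[keep], scores[keep], labels[keep]

class OnlineEventDecider:
    """Declare an event only after its final score exceeds the threshold
    for three consecutive processed frames."""
    def __init__(self, thr=0.5, need=3):
        self.thr, self.need, self.run, self.last = thr, need, 0, None

    def update(self, event_cls_scores, climax_score):
        final = event_cls_scores * climax_score
        label, score = int(final.argmax()), float(final.max())
        if score >= self.thr and label == self.last:
            self.run += 1
        else:
            self.run = 1 if score >= self.thr else 0
        self.last = label if score >= self.thr else None
        return label if self.run >= self.need else None
```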
The foregoing is merely a preferred embodiment of the present invention; it should be noted that modifications and adaptations may be made by those skilled in the art without departing from the concept of the present invention, and these are intended to fall within the scope of the present invention.

Claims (10)

1. The online basketball video event and target detection method based on the multitasking is characterized by comprising the following steps of:
s1: neural network construction based on multi-scale feature induction and expression:
the backbone network layer is a Resnet network, and a time domain replacement module is added to the backbone network layer; the time domain replacement module is added before the non-bypass convolution of each residual structure in the Resnet network: the first 1/m of the channels of the feature map at the current moment are replaced by the values stored in the cache from the previous moment, and the first 1/m channels of the current frame's feature map at that layer are updated into the cache, where m > 1;
taking the output of the conv3_x layer, the conv4_x layer and the conv5_x layer of the network as the input of the feature pyramid network to obtain five space-time features F3-F7 with different scales;
inputting the two lowest-resolution feature maps F6 and F7 into an event detection head, which splits into two paths, each passing through 4 convolution layers and a global average pooling layer; one path outputs an event classification score of size 1 × C_e, where C_e is the number of event classes; the other path splits into two sub-paths, one outputting the event start and end time offsets of size 1 × 2 and the other outputting the event climax score of size 1 × 1;
inputting the feature maps F3-F7 of different scales into a target detection head, which splits into two paths, each passing through 4 convolution layers; one path outputs a classification score of size H × W × C_o, where C_o is the number of target classes; the other path splits into two sub-paths, one outputting regression box coordinates of size H × W × 4 and the other outputting a target center score of size H × W × 1, where H × W is the resolution of the feature map output by the preceding layer;
s2: training a neural network:
the target detection loss comprises a classification loss, a regression loss and a semi-supervised center-offset loss; these losses are added with specific weights to obtain the total target detection loss;
the event detection loss comprises a classification loss, a regression loss and a semi-supervised climax-offset loss; these losses are added with specific weights to obtain the total event detection loss;
the target detection loss and the event detection loss are computed independently, and the loss of each task is back-propagated to its corresponding branch, which accelerates the learning of each task branch; the losses of the two tasks are also added with specific weights to obtain a multi-task hybrid loss function that is back-propagated jointly, so that the bottom backbone network learns a latent feature-induction mode shared by the two tasks; gradient descent is used to minimize the multi-task hybrid loss function to find the optimal network model parameters;
s3: reasoning and result processing:
obtaining target detection and event detection results through multi-branch forward propagation by using a trained neural network;
the target classification score is multiplied by the target center score to suppress the large number of low-quality regression boxes that deviate from the center, giving the final target classification score, and the class with the highest final score is taken as the class of the target;
and multiplying the event classification score by the event climax scoring value to obtain a final event classification score value, and taking the classification with the highest final score value as the event classification corresponding to the current frame.
2. The online detection method of basketball video events and targets based on multitasking according to claim 1, wherein after the time domain replacement module is added, the feature map of the j-th layer at time t is computed as
F_{j,t} = f_conv(f_concat(F_{i,t-n}, F_{i,t}))
wherein: F_{i,t} is the feature map at time t output by the i-th layer, F_{i,t-n} is the feature map at time t-n output by the i-th layer, f_conv is the residual-structure operation, f_concat is the splicing operation along the channel dimension, F_{j,t} is the output at time t obtained by the j-th layer, and j is the input of the next residual block after i;
the feature map of the j-th layer at time t-n is computed as
F_{j,t-n} = f_conv(f_concat(F_{i,t-2n}, F_{i,t-n}))
wherein: F_{i,t-n} is the feature map at time t-n output by the i-th layer, F_{i,t-2n} is the feature map at time t-2n output by the i-th layer, and F_{j,t-n} is the output at time t-n obtained by the j-th layer;
by expanding the network structure, the feature map at time t output by the k-th layer, which contains information from multiple time steps, is obtained, where k is the input of the next residual block after j:
F_{k,t} = f_conv(f_concat(F_{j,t-n}, F_{j,t}))
3. the online detection method for basketball video events and targets based on multitasking according to claim 1, wherein the process of obtaining the spatio-temporal features F3-F7 of five different scales is specifically:
the conv3_x, conv4_x and conv5_x layers of the Resnet network are denoted C3, C4 and C5; a convolution operation is applied to each of the C3, C4 and C5 layers to obtain C3', C4' and C5'; C5' is downsampled twice in succession to obtain F6 and F7; C5' is output directly as F5; C5' is upsampled and added to C4' to obtain F4; F4 is upsampled and added to C3' to obtain F3; and F3-F7 constitute the pyramid feature map structure.
4. The online detection method for basketball video events and targets based on multiple tasks according to claim 1, wherein for five space-time features F3-F7 with different scales, targets with different sizes are distributed to feature maps F3-F7 with different scales for detection, small targets are mainly extracted from high-resolution bottom feature maps, usually F3 and F4 layers, large targets are mainly extracted from lower-resolution middle-high layer feature maps, usually F4-F7 layers, and two-layer feature maps F6 and F7 with the lowest resolution are connected with event detection heads for event feature expression and extraction.
5. The method for online detection of basketball video events and targets based on multiple tasks of claim 1, wherein for the acquired video stream, an image is extracted every n frames and converted into the RGB color space; the image is resampled, preserving the aspect ratio, so that its short side is 800 pixels, and the ImageNet mean on the three RGB channels is subtracted from the resampled image, which serves as the input of the neural network.
6. The online basketball video event and target detection method based on multitasking of claim 1, wherein a training set is expanded using a semi-supervised pseudo-label mining method, and the labels annotated every n frames are supplemented with target detection pseudo-labels for the remaining frames, specifically: a SOTA pedestrian re-identification model is used to extract the feature expression of each target in a frame, and multi-target similarity matching is performed on the feature vectors of two frames n frames apart, where the similarity is computed as cos θ_{i,j} = (o_t^i · o_{t+n}^j) / (‖o_t^i‖ ‖o_{t+n}^j‖), with o_t^i ∈ O_t and o_{t+n}^j ∈ O_{t+n}; O_t is the set of feature vectors of targets in the frame at time t and O_{t+n} is the set of feature vectors of targets in the frame at time t+n; if cos θ_{i,j} ≥ threshold, the two annotation boxes are considered to belong to the same person and the pair is successfully matched; when the proportion P_success of successfully matched targets between the two frames exceeds a threshold T, the two frames are considered successfully matched, and for each matched pair the box size and position in the unlabeled intermediate frames are computed by linear interpolation, yielding pseudo-labels for the missing frames; if P_success is less than or equal to the threshold T, no pseudo-labels are generated; the proportion of successfully matched targets P_success is computed from O_success, the number of target pairs successfully matched between the frame at time t and the frame at time t+n, relative to the numbers of targets in O_t and O_{t+n}, which have the same meaning as in the previous formula.
7. The method for online detection of basketball video events and targets based on multiple tasks according to claim 1, wherein a semi-supervised pseudo-label mining method is used to expand a training set, and, in order to effectively suppress low-quality event prediction boxes that deviate from the event climax moment, a climax-score pseudo-label is obtained by computing how close the current moment t is to the climax moment within the start and end times of the event, using T_b*, the duration from the event start time to the climax moment, T_l*, the duration from the event start time to the current moment t, T_f*, the duration from the climax moment to the event end time, and T_r*, the duration from the current moment t to the event end time; the pseudo-label value at the true climax moment is 1, and the constructed pseudo-label values decrease nonlinearly along the time axis from the climax point towards both sides within the event until they coincide with the event start and end times, where the labels are 0.
8. The online detection method of basketball video events and targets based on multiple tasks according to claim 1, wherein a training set is expanded using a semi-supervised pseudo-label mining method, and, in order to effectively suppress low-quality prediction boxes that deviate from the geometric center of the target, a target-center-score pseudo-label is obtained by computing how close the current position is to the geometric center of the annotation box containing it, using left*, the distance from the current position to the left side of the annotation box, right*, the distance to the right side, top*, the distance to the top, and bottom*, the distance to the bottom, together with a parameter ε that adjusts the distribution of the values; the pseudo-label value at the true geometric center is 1, and the constructed pseudo-label values decrease radially from the geometric center outwards within the target annotation box until the labels on the box boundary are 0.
9. The method for online detection of basketball video events and targets based on multiple tasks of claim 1, wherein the target detection loss L_object combines a classification loss, a regression loss and a center-score loss, normalized by the number of positive samples N_pos_obj and weighted by α and β,
wherein: N_pos_obj is the number of positive target samples, c_{x,y} is the target class predicted by the feature point at coordinates (x, y), c*_{x,y} is the annotated target class of the feature point at (x, y), α and β are weight parameters, b_{x,y} are the predicted box parameters corresponding to the feature point at (x, y), b*_{x,y} are the annotated box parameters corresponding to the feature point at (x, y), r_{x,y} is the target center score predicted by the feature point at (x, y), r*_{x,y} is the pseudo-label target center score of the feature point at (x, y), L_cls is the cross-entropy classification loss function, L_reg is the GIoU loss function, and L_ctr is a binary loss function;
the event detection loss L_event likewise combines a classification loss, a regression loss and a climax-score loss, normalized by the number of positive samples N_pos_ev and weighted by γ and δ,
wherein: N_pos_ev is the number of positive event samples, e_t is the event class predicted for the frame at time t, e*_t is the annotated class of the frame at time t, γ and δ are weight parameters, l_t is the predicted duration from the event start time to the current time t, r_t is the predicted duration from the current time t to the event end time, l* is the annotated duration from the event start time to the climax moment, r* is the annotated duration from the climax moment to the event end time, t* is the annotated climax moment, measured as an offset from the beginning of the whole video, h_t is the climax score predicted for the frame at time t, h*_t is the pseudo-label climax score of the frame at time t, L_cls is the cross-entropy classification loss function, L_hot is a binary loss function, and L_reg is the intersection-over-union, on the time axis, between the detected event span and the ground-truth span;
the total loss is L_total = L_object + λ·L_event, where λ is a weight parameter.
10. The online basketball video event and target detection method according to claim 1, wherein in the step S3, four parameters of regression frame coordinates are converted into a common calibration frame diagonal two-point coordinate form for the output of the network target detection part, and original image coordinate calculation formulas corresponding to coordinate points (x, y) in different scale feature graphs are as follows:wherein s is the reduced multiple of the current feature map relative to the original map; for the regressed prediction frame, suppressing the prediction frame by using NMS;
for the output of the network's event detection branch, if the final event score is below the threshold it is suppressed and no event is considered to be occurring at present; in online processing, the current event is judged to have occurred when three consecutive frames score above the threshold; in offline processing, events that overlap on the time axis are deduplicated and merged (a post-processing sketch follows this claim).
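A minimal sketch of the detection-branch decoding in claim 10, assuming the FCOS-style conventions that a feature-map location (x, y) at down-sampling factor s maps back to original-image coordinates (⌊s/2⌋ + x·s, ⌊s/2⌋ + y·s) and that the four regressed parameters are distances (l, t, r, b) from that location to the four box sides; the patent's exact mapping formula is not reproduced in this text, so these conventions and the greedy IoU-based NMS are illustrative assumptions.

```python
def decode_box(x, y, ltrb, s):
    """Map feature-map location (x, y) at stride s back to the original image and
    convert the regressed (l, t, r, b) distances into diagonal-corner (x1, y1, x2, y2) form."""
    cx, cy = s // 2 + x * s, s // 2 + y * s   # assumed feature-to-image mapping
    l, t, r, b = ltrb
    return (cx - l, cy - t, cx + r, cy + b)

def iou(a, b):
    # IoU of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box and drop any
    remaining box whose IoU with a kept box reaches the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```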
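A minimal sketch of the event post-processing in this claim: online, an event is reported only once three consecutive frames exceed the score threshold; offline, detections of the same event class that overlap on the time axis are deduplicated and merged. The per-frame score list and the (start, end, class) tuple layout are illustrative assumptions.

```python
def online_event_trigger(frame_scores, threshold):
    """Yield the index of the frame at which an event is declared: the third of
    three consecutive frames whose event score exceeds the threshold."""
    run = 0
    for i, score in enumerate(frame_scores):
        run = run + 1 if score > threshold else 0
        if run == 3:
            yield i

def merge_offline_events(events):
    """Deduplicate and merge same-class events that overlap on the time axis.
    Each event is a (start, end, cls) tuple; returns the merged list."""
    merged = []
    for start, end, cls in sorted(events, key=lambda e: (e[2], e[0])):
        if merged and merged[-1][2] == cls and start <= merged[-1][1]:
            last_start, last_end, _ = merged[-1]
            merged[-1] = (last_start, max(last_end, end), cls)
        else:
            merged.append((start, end, cls))
    return merged

# Frames 4-6 are the first three consecutive frames above 0.5, so the event fires at index 6.
print(list(online_event_trigger([0.2, 0.6, 0.6, 0.3, 0.7, 0.8, 0.9], 0.5)))  # [6]
print(merge_offline_events([(10, 20, "shot"), (18, 25, "shot"), (40, 50, "dunk")]))
```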
CN202010419217.1A 2020-05-18 2020-05-18 Basketball video event and target online detection method based on multitasking Active CN111639563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010419217.1A CN111639563B (en) 2020-05-18 2020-05-18 Basketball video event and target online detection method based on multitasking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010419217.1A CN111639563B (en) 2020-05-18 2020-05-18 Basketball video event and target online detection method based on multitasking

Publications (2)

Publication Number Publication Date
CN111639563A CN111639563A (en) 2020-09-08
CN111639563B true CN111639563B (en) 2023-07-18

Family

ID=72331022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010419217.1A Active CN111639563B (en) 2020-05-18 2020-05-18 Basketball video event and target online detection method based on multitasking

Country Status (1)

Country Link
CN (1) CN111639563B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201726B (en) * 2020-09-18 2023-02-10 深圳先进技术研究院 Convolution operation optimization method, system, terminal and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304808A (en) * 2018-02-06 2018-07-20 广东顺德西安交通大学研究院 A kind of monitor video method for checking object based on space time information Yu depth network
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 A kind of Basketball Match Context event recognition methods of fusion domain knowledge and multistage depth characteristic
CN110378208A (en) * 2019-06-11 2019-10-25 杭州电子科技大学 A kind of Activity recognition method based on depth residual error network
CN110765886A (en) * 2019-09-29 2020-02-07 深圳大学 Road target detection method and device based on convolutional neural network
WO2020088763A1 (en) * 2018-10-31 2020-05-07 Huawei Technologies Co., Ltd. Device and method for recognizing activity in videos

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11106947B2 (en) * 2017-12-13 2021-08-31 Canon Kabushiki Kaisha System and method of classifying an action or event
US11638854B2 (en) * 2018-06-01 2023-05-02 NEX Team, Inc. Methods and systems for generating sports analytics with a mobile device
US11538143B2 (en) * 2018-10-26 2022-12-27 Nec Corporation Fully convolutional transformer based generative adversarial networks

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304808A (en) * 2018-02-06 2018-07-20 广东顺德西安交通大学研究院 A kind of monitor video method for checking object based on space time information Yu depth network
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 A kind of Basketball Match Context event recognition methods of fusion domain knowledge and multistage depth characteristic
WO2020088763A1 (en) * 2018-10-31 2020-05-07 Huawei Technologies Co., Ltd. Device and method for recognizing activity in videos
CN110378208A (en) * 2019-06-11 2019-10-25 杭州电子科技大学 A kind of Activity recognition method based on depth residual error network
CN110765886A (en) * 2019-09-29 2020-02-07 深圳大学 Road target detection method and device based on convolutional neural network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Kai Kang et al. Object Detection from Video Tubelets with Convolutional Neural Networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 817-825. *
Xizhou Zhu et al. Towards High Performance Video Object Detection. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7210-7218. *
Shan Yi et al. Small Object Detection Based on a Skip-Connection Pyramid Model. CAAI Transactions on Intelligent Systems, 2019, Vol. 14, No. 6, pp. 1144-1151. *
Sun Minghua et al. Inspection Video Analysis of Metro Tunnels Based on Depthwise Separable Convolution. Computer Engineering & Science, 2020, Vol. 42, No. 4, pp. 691-698. *
Wang Huiyan et al. Deep-Learning-Assisted Multi-Pedestrian Tracking Algorithm. Journal of Image and Graphics, 2017, Vol. 22, No. 3, pp. 349-357. *
Geng Yue. Detection and Recognition of Abnormal Events in Tourist Attraction Videos. China Masters' Theses Full-text Database (Information Science and Technology), 2019, No. 8, pp. I138-860. *

Also Published As

Publication number Publication date
CN111639563A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN111639692B (en) Shadow detection method based on attention mechanism
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN111259779B (en) Video motion detection method based on center point track prediction
CN111210446B (en) Video target segmentation method, device and equipment
CN111368634B (en) Human head detection method, system and storage medium based on neural network
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN111723660A (en) Detection method for long ground target detection network
CN113537462A (en) Data processing method, neural network quantization method and related device
Jayasinghe et al. SwiftLane: towards fast and efficient lane detection
CN111639563B (en) Basketball video event and target online detection method based on multitasking
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
Ge et al. Improving road extraction for autonomous driving using swin transformer unet
CN115239765B (en) Infrared image target tracking system and method based on multi-scale deformable attention
CN116758340A (en) Small target detection method based on super-resolution feature pyramid and attention mechanism
Dahirou et al. Motion Detection and Object Detection: Yolo (You Only Look Once)
Nag et al. ARCN: a real-time attention-based network for crowd counting from drone images
İsa Performance Evaluation of Jaccard-Dice Coefficient on Building Segmentation from High Resolution Satellite Images
Zhang et al. Boosting the speed of real-time multi-object trackers
Guo et al. ANMS: attention-based non-maximum suppression
WO2022047736A1 (en) Convolutional neural network-based impairment detection method
Fu et al. Foreground gated network for surveillance object detection
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion
Sivaprakash et al. A Convolutional Neural Network Approach for Crowd Counting
Tian et al. Lightweight dual-task networks for crowd counting in aerial images
Shi et al. Attention-YOLOX: Improvement in On-Road Object Detection by Introducing Attention Mechanisms to YOLOX

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant