CN111639563B - Basketball video event and target online detection method based on multitasking - Google Patents

Basketball video event and target online detection method based on multitasking

Info

Publication number
CN111639563B
CN111639563B
Authority
CN
China
Prior art keywords
event
time
target
frame
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010419217.1A
Other languages
Chinese (zh)
Other versions
CN111639563A (en
Inventor
华璟
王腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202010419217.1A priority Critical patent/CN111639563B/en
Publication of CN111639563A publication Critical patent/CN111639563A/en
Application granted granted Critical
Publication of CN111639563B publication Critical patent/CN111639563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/44Event detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multitask-based method for online detection of basketball video events and targets, which uses a deep convolutional neural network with weights shared across tasks and can detect events and targets in basketball game videos in either an online or an offline mode. Based on a multi-task hybrid loss function, the loss of each task is back-propagated to its corresponding branch, which accelerates the learning of each task branch. The losses of the two tasks are also added with specific weights to obtain an overall loss that is back-propagated jointly, so that the backbone network learns a latent feature-induction mode shared by the two tasks. Semi-supervised pseudo-label mining expands the training data and effectively suppresses low-quality event prediction boxes that deviate from the event climax moment and low-quality bounding boxes that deviate from the target's geometric center. The spatio-temporal multi-scale network structure makes full use of multi-stride temporal information and summarizes multi-scale historical features, effectively improving the recall and precision of event detection.

Description

Basketball video event and target online detection method based on multitasking
Technical Field
The invention belongs to the technical field of video event detection and video target detection, and in particular relates to a multitask-based method for online detection of events and targets in basketball game videos.
Background
Video event and object detection are key technologies in video understanding. With the rapid development of communication and computer technology and the steady growth of the sports industry, the volume of amateur and professional sports game video data has grown explosively. Sports game video is a video resource containing a large number of events and targets; it has a very wide audience and great research value, and it also places higher demands on the fine-grained processing, archiving and sharing of video material. In recent years, progress in deep learning based on convolutional neural networks and in high-performance parallel computing devices has made it possible to meet this demand.
Video target detection aims to accurately detect the targets appearing in video frames and is important for crowd monitoring, automatic driving and other applications. The simplest approach is to run an ordinary still-image object detection network on every frame, but this ignores inter-frame temporal information, and in practice bounding-box jitter and abrupt classification changes easily occur between frames. Using 3D convolutions, optical-flow maps, LSTM layers, multi-stream architectures and the like can exploit inter-frame temporal information well, but they introduce a large amount of computation and place high demands on computing devices.
Existing video event detection research mainly detects events in game videos in an offline processing mode and lacks practicality for online scenarios such as live broadcasting and rebroadcasting. Existing online video event detection only classifies single frames and does not predict the start and end times of the current event. In addition, existing video event detection research is largely separated from video target detection research, leading to redundant computation on the same video, and event detection fails to exploit target position and motion information. A method that can simultaneously and efficiently detect events and targets in basketball videos online is therefore still lacking.
Disclosure of Invention
In view of the defects of the prior art and the demand for improvement, the invention aims to provide a method for efficiently detecting events and targets in basketball videos online, based on sharing the bottom-layer weights of a convolutional neural network across multiple tasks.
In order to achieve the above purpose, the present invention is realized by the following technical scheme: a basketball video event and target online detection method based on multitasking comprises the following steps:
s1: neural network construction based on multi-scale feature induction and expression:
the backbone network layer is a Resnet network, and a time domain replacement module is added to the backbone network layer; the time domain replacement module is added before the non-bypass convolution of each residual structure in the Resnet network: the first 1/m of the channels of the feature map at the current moment are replaced by the values stored in the cache from the previous moment, and the first 1/m channels of the current frame's feature map at that layer are updated into the cache, where m > 1;
taking the output of the conv3_x layer, the conv4_x layer and the conv5_x layer of the network as the input of the feature pyramid network to obtain five space-time features F3-F7 with different scales;
inputting the two lowest-resolution feature maps F6 and F7 into an event detection head, which splits into two paths, each passing through 4 convolution layers and a global average pooling layer; one path outputs an event classification score of size 1 × C_e, where C_e is the number of event classes; the other path splits into two sub-paths, one outputting the event start and end time offsets of size 1 × 2 and the other outputting the event climax score of size 1 × 1;
inputting the feature maps F3-F7 of different scales into a target detection head, which splits into two paths, each passing through 4 convolution layers; one path outputs a classification score of size H × W × C_o, where C_o is the number of target classes; the other path splits into two sub-paths, one outputting regression box coordinates of size H × W × 4 and the other outputting a target center score of size H × W × 1, where H × W is the resolution of the feature map output by the preceding layer;
s2: training a neural network:
the target detection loss comprises a classification loss, a regression loss and a semi-supervised center-offset loss; these losses are added with specific weights to obtain the total target detection loss;
the event detection loss comprises a classification loss, a regression loss and a semi-supervised climax-offset loss; these losses are added with specific weights to obtain the total event detection loss;
the target detection loss and the event detection loss are computed independently, and the loss of each task is back-propagated to its corresponding branch, which accelerates the learning of each task branch; the losses of the two tasks are also added with specific weights to obtain a multi-task hybrid loss function that is back-propagated jointly, so that the bottom backbone network learns a latent feature-induction mode shared by the two tasks; gradient descent is used to minimize the multi-task hybrid loss function to find the optimal network model parameters.
S3: reasoning and result processing:
obtaining target detection and event detection results through multi-branch forward propagation by using a trained neural network;
the target classification score is multiplied by the target center score to suppress the large number of low-quality regression boxes that deviate from the center, giving the final target classification score, and the class with the highest final score is taken as the class of the target;
and multiplying the event classification score by the event climax scoring value to obtain a final event classification score value, and taking the classification with the highest final score value as the event classification corresponding to the current frame.
Further, after the time domain replacement module is added, the feature map of the j-th layer at time t is computed as
F_{j,t} = f_conv(f_concat(F_{i,t-n}, F_{i,t}))
wherein: F_{i,t} is the feature map at time t output by the i-th layer, F_{i,t-n} is the feature map at time t-n output by the i-th layer, f_conv is the residual-structure operation, f_concat is the splicing operation along the channel dimension, F_{j,t} is the output at time t obtained by the j-th layer, and j is the input of the next residual block after i;
the feature map of the j-th layer at time t-n is computed as
F_{j,t-n} = f_conv(f_concat(F_{i,t-2n}, F_{i,t-n}))
wherein: F_{i,t-n} is the feature map at time t-n output by the i-th layer, F_{i,t-2n} is the feature map at time t-2n output by the i-th layer, and F_{j,t-n} is the output at time t-n obtained by the j-th layer;
by expanding the network structure in this way, the feature map at time t output by the k-th layer, which contains information from multiple time steps, is obtained, where k is the input of the next residual block after j:
F_{k,t} = f_conv(f_concat(F_{j,t-n}, F_{j,t}))
further, the process of obtaining the five spatio-temporal features F3-F7 with different scales specifically includes:
the conv3_x, conv4_x and conv5_x layers of the Resnet network are denoted C3, C4 and C5; a convolution operation is applied to each of the C3, C4 and C5 layers to obtain C3', C4' and C5'; C5' is downsampled twice in succession to obtain F6 and F7; C5' is output directly as F5; C5' is upsampled and added to C4' to obtain F4; F4 is upsampled and added to C3' to obtain F3; and F3-F7 constitute the pyramid feature map structure.
Further, for five space-time features F3-F7 with different scales, targets with different sizes are distributed to feature maps F3-F7 with different scales for detection, small targets are mainly extracted from a high-resolution bottom-layer feature map, usually F3 and F4 layers, large targets are mainly extracted from a lower-resolution middle-high-layer feature map, usually F4-F7 layers, and two-layer feature maps F6 and F7 with the lowest resolution are connected with an event detection head for expression and extraction of event features.
Further, for the acquired video stream, an image is extracted every n frames and converted into the RGB color space; the image is resampled, preserving the aspect ratio, so that its short side is 800 pixels, and the ImageNet mean on the three RGB channels is subtracted from the resampled image, which serves as the input of the neural network.
Further, a training set is expanded using a semi-supervised pseudo-label mining method, supplementing the labels originally annotated every n frames with target detection pseudo-labels for the remaining frames, specifically: a SOTA pedestrian re-identification model is used to extract the feature expression of each target in a frame, and multi-target similarity matching is performed on the feature vectors of two frames n frames apart, where the similarity is computed as cos θ_{i,j} = (o_t^i · o_{t+n}^j) / (‖o_t^i‖ ‖o_{t+n}^j‖), with o_t^i ∈ O_t and o_{t+n}^j ∈ O_{t+n}; O_t is the set of feature vectors of targets in the frame at time t and O_{t+n} is the set of feature vectors of targets in the frame at time t+n; if cos θ_{i,j} ≥ threshold, the two annotation boxes are considered to belong to the same person and the pair is successfully matched; when the proportion P_success of successfully matched targets between the two frames exceeds a threshold T, the two frames are considered successfully matched, and for each matched pair the box size and position in the unlabeled intermediate frames are computed by linear interpolation, yielding pseudo-labels for the missing frames; if P_success is less than or equal to the threshold T, no pseudo-labels are generated; the proportion of successfully matched targets P_success is computed from O_success, the number of target pairs successfully matched between the frame at time t and the frame at time t+n, relative to the numbers of targets in O_t and O_{t+n}, which have the same meaning as in the previous formula.
Further, a training set is expanded using the semi-supervised pseudo-label mining method; in order to effectively suppress low-quality event prediction boxes that deviate from the event climax moment, a climax-score pseudo-label is obtained by computing how close the current moment t is to the climax moment within the start and end times of the event, using T_b*, the duration from the event start time to the climax moment, T_l*, the duration from the event start time to the current moment t, T_f*, the duration from the climax moment to the event end time, and T_r*, the duration from the current moment t to the event end time; the pseudo-label value at the true climax moment is 1, and the constructed pseudo-label values decrease nonlinearly along the time axis from the climax point towards both sides within the event until they coincide with the event start and end times, where the labels are 0.
Further, a training set is expanded using the semi-supervised pseudo-label mining method; in order to effectively suppress low-quality prediction boxes that deviate from the geometric center of the target, a target-center-score pseudo-label is obtained by computing how close the current position is to the geometric center of the annotation box containing it, using left*, the distance from the current position to the left side of the annotation box, right*, the distance to the right side, top*, the distance to the top, and bottom*, the distance to the bottom, together with a parameter ε that adjusts the distribution of the values; the pseudo-label value at the true geometric center is 1+ε, and the constructed pseudo-label values decrease radially from the geometric center outwards within the target annotation box until they reach the box boundary, where the labels are 0+ε.
Further, the target detection loss L_object combines a classification loss, a regression loss and a center-score loss, normalized by the number of positive samples N_pos_obj and weighted by α and β,
wherein: N_pos_obj is the number of positive target samples, c_{x,y} is the target class predicted by the feature point at coordinates (x, y), c*_{x,y} is the annotated target class of the feature point at (x, y), α and β are weight parameters, b_{x,y} are the predicted box parameters corresponding to the feature point at (x, y), b*_{x,y} are the annotated box parameters corresponding to the feature point at (x, y), r_{x,y} is the target center score predicted by the feature point at (x, y), r*_{x,y} is the pseudo-label target center score of the feature point at (x, y), L_cls is the cross-entropy classification loss function, L_reg is the GIoU loss function, and L_ctr is a binary loss function;
the event detection loss L_event likewise combines a classification loss, a regression loss and a climax-score loss, normalized by the number of positive samples N_pos_ev and weighted by γ and δ,
wherein: N_pos_ev is the number of positive event samples, e_t is the event class predicted for the frame at time t, e*_t is the annotated class of the frame at time t, γ and δ are weight parameters, l_t is the predicted duration from the event start time to the current time t, r_t is the predicted duration from the current time t to the event end time, l* is the annotated duration from the event start time to the climax moment, r* is the annotated duration from the climax moment to the event end time, t* is the annotated climax moment, measured as an offset from the beginning of the whole video, h_t is the climax score predicted for the frame at time t, h*_t is the pseudo-label climax score of the frame at time t, L_cls is the cross-entropy classification loss function, L_hot is a binary loss function, and L_reg is the intersection-over-union, on the time axis, between the detected event span and the ground-truth span;
the total loss is L_total = L_object + λ·L_event, where λ is a weight parameter.
Further, in step S3, for the output of the network's target detection part, the four regression-box parameters are converted into the usual diagonal two-point coordinate form of a calibration box, and the original-image coordinates corresponding to a coordinate point (x, y) in a feature map of a given scale are computed from s, the reduction factor of the current feature map relative to the original image; the regressed prediction boxes are then suppressed using NMS;
for the output of the network's event detection part, if the final event score is below the threshold it is suppressed and no event is considered to be occurring; in online processing, the current event is judged to be occurring when the score exceeds the threshold for three consecutive frames; in offline processing, events that overlap on the time axis are deduplicated and merged.
Compared with the prior art, the invention has the following advantages: the method shares weights across tasks in a deep convolutional neural network and can perform event detection and target detection on basketball game videos in either an online or an offline mode. Based on the multi-task hybrid loss function, the loss of each task is back-propagated to its corresponding branch, which accelerates the learning of each task branch and improves the detection capability of each branch. In addition, the losses of the two tasks are added with specific weights to obtain an overall loss that is back-propagated jointly, so that the underlying backbone network learns a latent feature-induction mode shared by the two tasks. Semi-supervised pseudo-label mining expands the training data, effectively suppresses low-quality event prediction boxes that deviate from the event climax moment and low-quality bounding boxes that deviate from the target's geometric center, improves the accuracy of target detection and reduces the false-alarm rate of event detection. The spatio-temporal multi-scale network structure makes full use of multi-stride temporal information to summarize multi-scale historical features: small targets are mainly extracted from the high-resolution bottom-level feature maps, large targets from the lower-resolution middle and high-level feature maps, and the two lowest-resolution feature maps, which summarize global motion information, effectively improve the recall and precision of event detection.
Drawings
FIG. 1 is a flow chart of a method for online detection of basketball video events and targets based on multitasking in accordance with an embodiment of the present invention;
FIG. 2 is a neural network architecture diagram of a method for online detection of basketball video events and targets based on multitasking in accordance with an embodiment of the invention.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
As shown in fig. 1, the method for online detecting basketball video events and targets based on multitasking provided by the application comprises the following steps:
S1: Neural network construction based on multi-scale feature induction and expression. As shown in fig. 2, the backbone network is a Resnet-50 network.
S1.1: In order to make full use of multi-stride temporal information and summarize multi-scale historical features for temporal-information extraction, a time domain replacement module is added to the backbone network layer, which strengthens the network's ability to learn along the time dimension and improves the stability of target detection and the accuracy of event detection. Since adding a time domain replacement module before or on the bypass (shortcut) of a residual structure in Resnet would destroy spatial information, the module is added before the non-bypass convolution of each residual structure in the Resnet network: at each moment, the first 1/m of the channels of the feature map are replaced by the values stored in the cache from the previous processed moment, and the first 1/m channels of the current frame's feature map at that layer are updated into the cache, where m > 1. The feature map of the j-th layer at time t is then computed as
F_{j,t} = f_conv(f_concat(F_{i,t-n}, F_{i,t}))
wherein: F_{i,t} is the feature map at time t output by the i-th layer, F_{i,t-n} is the feature map at time t-n output by the i-th layer, f_conv is the residual-structure operation, f_concat is the splicing operation along the channel dimension, F_{j,t} is the output at time t obtained by the j-th layer, and j is the input of the next residual block after i. Similarly, by expanding along time, the feature map of the j-th layer at time t-n is
F_{j,t-n} = f_conv(f_concat(F_{i,t-2n}, F_{i,t-n}))
wherein: F_{i,t-n} is the feature map at time t-n output by the i-th layer, F_{i,t-2n} is the feature map at time t-2n output by the i-th layer, and F_{j,t-n} is the output at time t-n obtained by the j-th layer.
Therefore, by expanding the network structure, the feature map at time t output by the k-th layer, which contains information from multiple time steps, can be obtained, where k is the input of the next residual block after j:
F_{k,t} = f_conv(f_concat(F_{j,t-n}, F_{j,t}))
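The cached-channel replacement can be made concrete with a short sketch. The module below is a minimal, hypothetical PyTorch illustration of the idea described above (swap the first 1/m channels with the values cached from the previously processed frame, then refresh the cache); the class and parameter names are our own, and the patent's exact implementation may differ.
```python
import torch
import torch.nn as nn

class TemporalReplace(nn.Module):
    """Minimal sketch of the time domain replacement module.

    Placed before the non-bypass convolution of a residual block: the first
    1/m of the channels are swapped with the values cached from the previously
    processed frame, and the cache is then refreshed with the current frame.
    """
    def __init__(self, channels: int, m: int = 8):
        super().__init__()
        assert m > 1
        self.split = channels // m   # number of channels taken from the cache
        self.cache = None            # first 1/m channels of the last processed frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, C, H, W)
        cur = x[:, :self.split].detach().clone()          # channels to cache for this frame
        if self.cache is None or self.cache.shape != cur.shape:
            past = torch.zeros_like(x[:, :self.split])    # no history at the first frame
        else:
            past = self.cache
        self.cache = cur                                   # refresh the cache
        return torch.cat([past, x[:, self.split:]], dim=1)

# Usage inside a residual block (sketch): out = conv(TemporalReplace(C)(x)) + shortcut(x)
```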
s1.2, respectively marking a conv3_x stage layer, a conv4_x stage layer and a conv5_x stage layer of a Resnet network as C3, C4 and C5, and carrying out feature fusion on the C3, C4 and C5 layers to fully utilize multi-scale airspace information, wherein the fusion method is to respectively carry out convolution operation on the C3, C4 and C5 layers (1X 1 convolution is applied to the output of the C3, C4 and C5 layers) to obtain C3', C4 and C5', carrying out downsampling on the C5' twice to respectively obtain F6 and F7, directly outputting the C5' as F5, carrying out upsampling on the C5' and adding the C4' to obtain F4, and adding the upsampled F4 and the C3' to obtain F3, wherein the F3 is calculated in the following way: f3 =f up (f up (C5 ') +c4 ') +c3'; wherein f up For the upsampling method, F4, F5, F6, F7 will not be described again. f3-F7 constitute a pyramid feature map structure that facilitates the expression of multiple scale features.
Targets of different sizes are assigned to feature maps of different scales for detection: small targets are mainly extracted from the high-resolution bottom-level feature maps (usually the F3 and F4 layers), and large targets are mainly extracted from the lower-resolution middle and high-level feature maps (usually the F4-F7 layers). The two lowest-resolution feature maps F6 and F7, because they summarize global motion information, carry latent high-level semantics and are connected to the event detection head for the expression and extraction of event features.
S1.3: The event detection head and the target detection head are used to obtain the multi-task outputs.
The inputs of the event detection head are the feature maps F6 and F7, which have already passed through several convolution layers. The head splits into two paths, each passing through 4 convolution layers and a global average pooling layer; one path outputs a score of size 1 × C_e, where C_e is the number of event classes; the other path splits into two sub-paths, one outputting the event start and end time offsets of size 1 × 2 and the other outputting the event climax score of size 1 × 1. The global average pooling layer lets the network adapt to video inputs of different resolutions and gives good robustness to videos of different resolutions: it averages the unpooled event classification feature map over its H × W spatial positions to produce the event classification output, where H × W is the resolution of the feature map output by the preceding layer and C_e, the number of channels of that feature map, equals the number of event classes; the event climax score output and the regression output of the event start and end time offsets are obtained in the same way from their respective sub-path feature maps.
The inputs of the target detection head are the feature maps F3, F4, F5, F6 and F7, which have already passed through several convolution layers. The head likewise splits into two paths, each passing through 4 convolution layers; one path outputs a classification score of size H × W × C_o, where C_o is the number of target classes; the other path splits into two sub-paths, one outputting regression box coordinates of size H × W × 4 and the other outputting a target center score of size H × W × 1, where H × W is the resolution of the feature map output by the preceding layer.
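To make the two head layouts concrete, the sketch below wires up an event head (4 convolutions plus global average pooling, with classification, start/end offset and climax-score outputs) and a target head (4 convolutions, with classification, box regression and center-score outputs); channel widths, kernel sizes and activations are assumptions not specified by the patent.
```python
import torch.nn as nn

def conv_stack(ch, n=4):
    """n convolution layers forming one path of a detection head."""
    layers = []
    for _ in range(n):
        layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class EventHead(nn.Module):
    def __init__(self, ch=256, num_events=10):
        super().__init__()
        self.cls_path, self.reg_path = conv_stack(ch), conv_stack(ch)
        self.cls = nn.Conv2d(ch, num_events, 3, padding=1)   # -> 1 x C_e after pooling
        self.offsets = nn.Conv2d(ch, 2, 3, padding=1)        # -> 1 x 2 (start/end offsets)
        self.climax = nn.Conv2d(ch, 1, 3, padding=1)         # -> 1 x 1 (climax score)

    def forward(self, f):
        gap = lambda t: t.mean(dim=(2, 3))                   # global average pooling
        c, r = self.cls_path(f), self.reg_path(f)
        return gap(self.cls(c)), gap(self.offsets(r)), gap(self.climax(r))

class TargetHead(nn.Module):
    def __init__(self, ch=256, num_classes=3):
        super().__init__()
        self.cls_path, self.reg_path = conv_stack(ch), conv_stack(ch)
        self.cls = nn.Conv2d(ch, num_classes, 3, padding=1)  # -> H x W x C_o
        self.box = nn.Conv2d(ch, 4, 3, padding=1)            # -> H x W x 4
        self.ctr = nn.Conv2d(ch, 1, 3, padding=1)            # -> H x W x 1

    def forward(self, f):
        c, r = self.cls_path(f), self.reg_path(f)
        return self.cls(c), self.box(r), self.ctr(r)
```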
S2: neural network training
For the acquired video stream, one image is extracted every n frames and converted to the RGB color space; the image is resampled, preserving the aspect ratio, so that its short side is 800 pixels, and the ImageNet mean on the three RGB channels is subtracted from the resampled image, which then serves as the input of the neural network. Taking a frame of size 800 × 1024 as an example, the C3, C4 and C5 output layers of the network have sizes 100 × 128, 50 × 64 and 25 × 32, with 512, 1024 and 2048 channels respectively. The outputs of the C3, C4 and C5 layers are used as the inputs of the FPN (feature pyramid network), and downsampled and upsampled results are fused so that the backbone network obtains 5 spatio-temporal features F3-F7 of different scales, with sizes 100 × 128, 50 × 64, 25 × 32, 13 × 16 and 7 × 8 respectively. The training input data are augmented using a semi-supervised pseudo-label mining method, and the multi-task hybrid loss function is minimized by gradient descent with the Adam optimizer.
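A possible preprocessing routine consistent with the paragraph above (sample one frame every n frames, resize the short side to 800 pixels while preserving the aspect ratio, subtract the ImageNet channel means) might look as follows; the use of OpenCV and the mean values expressed on the 0-255 scale are assumptions.
```python
import cv2
import numpy as np

IMAGENET_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)  # RGB, 0-255 scale

def preprocess(frame_bgr: np.ndarray, short_side: int = 800) -> np.ndarray:
    """Convert a BGR frame into the network input described above."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    h, w = rgb.shape[:2]
    scale = short_side / min(h, w)                       # keep the aspect ratio
    rgb = cv2.resize(rgb, (int(round(w * scale)), int(round(h * scale))))
    return rgb.astype(np.float32) - IMAGENET_MEAN        # subtract per-channel mean

def sample_frames(video_path: str, n: int = 5):
    """Yield one preprocessed frame every n frames of the video stream."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % n == 0:
            yield preprocess(frame)
        idx += 1
    cap.release()
```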
S2.1: Expanding the training set by semi-supervised pseudo-label mining
In order to effectively augment the training data, the labels originally annotated every n frames are supplemented with target detection pseudo-labels for the remaining frames. A SOTA pedestrian re-identification (ReID) model is used to extract the feature expression of each target in a frame, each target yielding a 4096-dimensional vector, and multi-target similarity matching is performed on the feature vectors of two frames n frames apart, where the similarity is computed as cos θ_{i,j} = (o_t^i · o_{t+n}^j) / (‖o_t^i‖ ‖o_{t+n}^j‖), with o_t^i ∈ O_t and o_{t+n}^j ∈ O_{t+n}; O_t is the set of feature vectors of targets in the frame at time t and O_{t+n} is the set of feature vectors of targets in the frame at time t+n. If cos θ_{i,j} ≥ threshold, the two annotation boxes are considered to belong to the same person and the pair is successfully matched. When the proportion P_success of successfully matched targets between the two frames exceeds a threshold T, the two frames are considered successfully matched, and for each matched pair the box size and position in the unlabeled intermediate frames are computed by linear interpolation, yielding pseudo-labels for the missing frames; if P_success is less than or equal to the threshold T, no pseudo-labels are generated. The proportion of successfully matched targets P_success is computed from O_success, the number of target pairs successfully matched between the frame at time t and the frame at time t+n, relative to the numbers of targets in O_t and O_{t+n}, which have the same meaning as in the previous formula.
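The matching and interpolation step can be sketched as follows; the greedy pairing, the exact definition of the matched-target proportion (here 2·matched / (|O_t| + |O_{t+n}|)) and the threshold values are our assumptions about details the text leaves open.
```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def mine_pseudo_boxes(feats_t, boxes_t, feats_tn, boxes_tn, n,
                      sim_thr=0.8, match_thr=0.6):
    """Greedy ReID matching between frame t and frame t+n, then linear
    interpolation of matched boxes for the n-1 unlabeled frames in between.
    feats_*: lists of 4096-d ReID vectors, boxes_*: lists of (x1, y1, x2, y2)."""
    pairs, used = [], set()
    for i, fi in enumerate(feats_t):
        best_j, best_s = -1, sim_thr
        for j, fj in enumerate(feats_tn):
            if j in used:
                continue
            s = cosine(fi, fj)
            if s >= best_s:
                best_j, best_s = j, s
        if best_j >= 0:
            pairs.append((i, best_j))
            used.add(best_j)

    # assumed definition of the proportion of successfully matched targets
    p_success = 2 * len(pairs) / max(len(feats_t) + len(feats_tn), 1)
    if p_success <= match_thr:
        return {}                                  # frames not matched: no pseudo-labels

    pseudo = {k: [] for k in range(1, n)}          # frame offset -> interpolated boxes
    for i, j in pairs:
        b0, b1 = np.asarray(boxes_t[i], float), np.asarray(boxes_tn[j], float)
        for k in range(1, n):
            alpha = k / n
            pseudo[k].append((1 - alpha) * b0 + alpha * b1)   # linear interpolation
    return pseudo
```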
Meanwhile, in order to effectively suppress low-quality event prediction boxes that deviate from the event climax moment, a climax-score pseudo-label is obtained by computing how close the current moment t is to the climax moment within the start and end times of the event, using T_b*, the duration from the event start time to the climax moment, T_l*, the duration from the event start time to the current moment t, T_f*, the duration from the climax moment to the event end time, and T_r*, the duration from the current moment t to the event end time; a square-root operation makes the attenuation of the hotness label slow at first and then steep, reducing the semi-supervised penalty overall. The pseudo-label value at the true climax moment is therefore 1, and the constructed pseudo-label values decrease nonlinearly along the time axis from the climax point towards both sides within the event until they coincide with the event start and end times, where the labels are 0, effectively suppressing low-quality event predictions that deviate from the event climax moment.
In order to effectively suppress low-quality prediction boxes that deviate from the geometric center of the target, a position score (i.e., the target center score) pseudo-label is obtained by computing how close the current position is to the geometric center of the annotation box containing it, using left*, the distance from the current position to the left side of the annotation box, right*, the distance to the right side, top*, the distance to the top, and bottom*, the distance to the bottom, together with a parameter ε that adjusts the distribution of the values. The pseudo-label value at the true geometric center is therefore 1+ε, and the constructed pseudo-label values decrease radially from the geometric center outwards within the target annotation box until they reach the box boundary, where the labels are 0+ε, effectively suppressing low-quality prediction boxes that deviate from the geometric center of the target. A sketch of both pseudo-label scores is given below.
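The climax-score and center-score pseudo-labels are given by formulas in the source, but the original equations are only available as images; the sketch below therefore reconstructs them under the assumption of an FCOS-style square-rooted min/max ratio, which matches the stated properties (value 1 at the climax or geometric center, nonlinear decay to 0 at the event boundaries or box edges, decay that is slow at first and then steep). Treat the exact expressions as assumptions, not the patent's formulas.
```python
import math

def climax_score(t_l: float, t_b: float, t_r: float, t_f: float) -> float:
    """Assumed climax (hotness) pseudo-label for the current moment t.
    t_b: event start -> climax, t_l: event start -> t,
    t_f: climax -> event end,  t_r: t -> event end."""
    if t_b <= 0 or t_f <= 0:
        return 0.0
    left = min(t_l, t_b) / max(t_l, t_b)       # 1 at the climax, 0 at the event start
    right = min(t_r, t_f) / max(t_r, t_f)      # 1 at the climax, 0 at the event end
    return math.sqrt(left * right)             # sqrt: decay is slow at first, then steep

def center_score(left: float, right: float, top: float, bottom: float,
                 eps: float = 0.0) -> float:
    """Assumed center pseudo-label for a position inside its annotation box;
    eps shifts the whole distribution as described in the text."""
    lr, tb = max(left, right), max(top, bottom)
    if min(left, right, top, bottom) < 0 or lr <= 0 or tb <= 0:
        return eps
    ratio = (min(left, right) / lr) * (min(top, bottom) / tb)
    return math.sqrt(ratio) + eps              # 1 + eps at the geometric center
```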
S2.2: Minimizing the multi-task hybrid loss function
The target detection loss L_object comprises a classification loss, a regression loss and a semi-supervised center-offset loss, which are added with specific weights to obtain the total target detection loss. The event detection loss L_event comprises a classification loss, a regression loss and a semi-supervised climax-offset loss, which are added with specific weights to obtain the total event detection loss.
The target detection loss and the event detection loss are computed independently, and the loss of each task is back-propagated to its corresponding branch, which accelerates the learning of each task branch and improves the detection capability of each branch. In addition, the losses of the two tasks are added with specific weights to obtain an overall loss that is back-propagated jointly, so that the underlying backbone network learns a latent feature-induction mode shared by the two tasks.
The target detection loss L_object combines the classification loss, the regression loss and the center-score loss, normalized by the number of positive samples N_pos_obj and weighted by α and β,
wherein: N_pos_obj is the number of positive target samples, c_{x,y} is the target class predicted by the feature point at coordinates (x, y), c*_{x,y} is the annotated target class of the feature point at (x, y), α and β are weight parameters, b_{x,y} are the predicted box parameters corresponding to the feature point at (x, y), b*_{x,y} are the annotated box parameters corresponding to the feature point at (x, y), r_{x,y} is the target center score predicted by the feature point at (x, y), r*_{x,y} is the pseudo-label target center score of the feature point at (x, y), L_cls is the cross-entropy classification loss function, L_reg is the GIoU loss function, and L_ctr is a binary loss function.
The event detection loss L_event likewise combines the classification loss, the regression loss and the climax-score loss, normalized by the number of positive samples N_pos_ev and weighted by γ and δ,
wherein: N_pos_ev is the number of positive event samples, e_t is the event class predicted for the frame at time t, e*_t is the annotated class of the frame at time t, γ and δ are weight parameters, l_t is the predicted duration from the event start time to the current time t, r_t is the predicted duration from the current time t to the event end time, l* is the annotated duration from the event start time to the climax moment, r* is the annotated duration from the climax moment to the event end time, t* is the annotated climax moment, measured as an offset from the beginning of the whole video, h_t is the climax score predicted for the frame at time t, h*_t is the pseudo-label climax score of the frame at time t, L_cls is the cross-entropy classification loss function, L_hot is a binary loss function, and L_reg is the intersection-over-union, on the time axis, between the detected event span and the ground-truth span.
In summary, the total loss is L_total = L_object + λ·L_event, where λ is a weight parameter, L_object is the target detection loss and L_event is the event detection loss.
During training, the multi-task hybrid loss function L_total is minimized by gradient descent to find the best network model parameters.
S3: inference and result processing
And obtaining target detection and event detection results through multi-branch forward propagation by using the trained neural network.
The target classification score is multiplied by the center score to suppress the large number of low-quality regression boxes that deviate from the center, giving the final target classification score, and the class with the highest final score is taken as the class of the target. The four regression-box parameters are converted into the usual diagonal two-point coordinate form of a calibration box. The original-image coordinates corresponding to a coordinate point (x, y) in a feature map of a given scale are computed from s, the reduction factor of the current feature map relative to the original image. The regressed prediction boxes are then suppressed using NMS.
The event classification score is multiplied by the event climax (hotness) score to obtain the final event classification score, and the class with the highest final score is taken as the event class corresponding to the current frame. If the final score is below the threshold it is suppressed, and no event is considered to be occurring. In online processing, the current event is judged to be occurring when the score exceeds the threshold for three consecutive frames. In offline processing, events that overlap on the time axis are deduplicated and merged. A sketch of this post-processing is given below.
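One way to realize the post-processing just described (score × center/climax multiplication, mapping feature-map coordinates back to image coordinates with the stride s, NMS, a score threshold and the three-consecutive-frames rule for online event confirmation) is sketched below; the half-stride offset in the coordinate mapping and all threshold values are assumptions.
```python
import torch
from torchvision.ops import nms

def decode_targets(cls_scores, ctr_scores, ltrb, stride, score_thr=0.3, iou_thr=0.5):
    """cls_scores: (H, W, C), ctr_scores: (H, W), ltrb: (H, W, 4) for one feature level."""
    h, w, c = cls_scores.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # map feature-map points back to original-image coordinates (offset assumed)
    cx = xs.float() * stride + stride // 2
    cy = ys.float() * stride + stride // 2
    boxes = torch.stack([cx - ltrb[..., 0], cy - ltrb[..., 1],
                         cx + ltrb[..., 2], cy + ltrb[..., 3]], dim=-1).reshape(-1, 4)
    scores, labels = (cls_scores * ctr_scores[..., None]).reshape(-1, c).max(dim=-1)
    keep = scores > score_thr
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = nms(boxes, scores, iou_thr)                      # suppress overlapping boxes
    return boxes[keep], scores[keep], labels[keep]

class OnlineEventDecider:
    """Declare an event only after its final score exceeds the threshold
    for three consecutive processed frames."""
    def __init__(self, thr=0.5, need=3):
        self.thr, self.need, self.run, self.last = thr, need, 0, None

    def update(self, event_cls_scores, climax_score):
        final = event_cls_scores * climax_score
        label, score = int(final.argmax()), float(final.max())
        if score >= self.thr and label == self.last:
            self.run += 1
        else:
            self.run = 1 if score >= self.thr else 0
        self.last = label if score >= self.thr else None
        return label if self.run >= self.need else None
```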
The foregoing is merely a preferred embodiment of the present invention; it should be noted that modifications and adaptations may be made by those skilled in the art without departing from the concept of the present invention, and these are intended to fall within the scope of the present invention.

Claims (10)

1. The online basketball video event and target detection method based on the multitasking is characterized by comprising the following steps of:
s1: neural network construction based on multi-scale feature induction and expression:
the backbone network layer is a Resnet network, and a time domain replacement module is added to the backbone network layer; the time domain replacement module is added before the non-bypass convolution of each residual structure in the Resnet network: the first 1/m of the channels of the feature map at the current moment are replaced by the values stored in the cache from the previous moment, and the first 1/m channels of the current frame's feature map at that layer are updated into the cache, where m > 1;
taking the output of the conv3_x layer, the conv4_x layer and the conv5_x layer of the network as the input of the feature pyramid network to obtain five space-time features F3-F7 with different scales;
inputting the two lowest-resolution feature maps F6 and F7 into an event detection head, which splits into two paths, each passing through 4 convolution layers and a global average pooling layer; one path outputs an event classification score of size 1 × C_e, where C_e is the number of event classes; the other path splits into two sub-paths, one outputting the event start and end time offsets of size 1 × 2 and the other outputting the event climax score of size 1 × 1;
inputting the feature maps F3-F7 of different scales into a target detection head, which splits into two paths, each passing through 4 convolution layers; one path outputs a classification score of size H × W × C_o, where C_o is the number of target classes; the other path splits into two sub-paths, one outputting regression box coordinates of size H × W × 4 and the other outputting a target center score of size H × W × 1, where H × W is the resolution of the feature map output by the preceding layer;
s2: training a neural network:
the target detection loss comprises a classification loss, a regression loss and a semi-supervised center-offset loss; these losses are added with specific weights to obtain the total target detection loss;
the event detection loss comprises a classification loss, a regression loss and a semi-supervised climax-offset loss; these losses are added with specific weights to obtain the total event detection loss;
the target detection loss and the event detection loss are computed independently, and the loss of each task is back-propagated to its corresponding branch, which accelerates the learning of each task branch; the losses of the two tasks are also added with specific weights to obtain a multi-task hybrid loss function that is back-propagated jointly, so that the bottom backbone network learns a latent feature-induction mode shared by the two tasks; gradient descent is used to minimize the multi-task hybrid loss function to find the optimal network model parameters;
s3: reasoning and result processing:
obtaining target detection and event detection results through multi-branch forward propagation by using a trained neural network;
the target classification score is multiplied by the target center score to suppress the large number of low-quality regression boxes that deviate from the center, giving the final target classification score, and the class with the highest final score is taken as the class of the target;
and multiplying the event classification score by the event climax scoring value to obtain a final event classification score value, and taking the classification with the highest final score value as the event classification corresponding to the current frame.
2. The online detection method of basketball video events and targets based on multitasking according to claim 1, wherein after the time domain replacement module is added, the feature map of the j-th layer at time t is computed as
F_{j,t} = f_conv(f_concat(F_{i,t-n}, F_{i,t}))
wherein: F_{i,t} is the feature map at time t output by the i-th layer, F_{i,t-n} is the feature map at time t-n output by the i-th layer, f_conv is the residual-structure operation, f_concat is the splicing operation along the channel dimension, F_{j,t} is the output at time t obtained by the j-th layer, and j is the input of the next residual block after i;
the feature map of the j-th layer at time t-n is computed as
F_{j,t-n} = f_conv(f_concat(F_{i,t-2n}, F_{i,t-n}))
wherein: F_{i,t-n} is the feature map at time t-n output by the i-th layer, F_{i,t-2n} is the feature map at time t-2n output by the i-th layer, and F_{j,t-n} is the output at time t-n obtained by the j-th layer;
by expanding the network structure, the feature map at time t output by the k-th layer, which contains information from multiple time steps, is obtained, where k is the input of the next residual block after j:
F_{k,t} = f_conv(f_concat(F_{j,t-n}, F_{j,t}))
3. the online detection method for basketball video events and targets based on multitasking according to claim 1, wherein the process of obtaining the spatio-temporal features F3-F7 of five different scales is specifically:
the conv3_x, conv4_x and conv5_x layers of the Resnet network are denoted C3, C4 and C5; a convolution operation is applied to each of the C3, C4 and C5 layers to obtain C3', C4' and C5'; C5' is downsampled twice in succession to obtain F6 and F7; C5' is output directly as F5; C5' is upsampled and added to C4' to obtain F4; F4 is upsampled and added to C3' to obtain F3; and F3-F7 constitute the pyramid feature map structure.
4. The online detection method for basketball video events and targets based on multiple tasks according to claim 1, wherein for five space-time features F3-F7 with different scales, targets with different sizes are distributed to feature maps F3-F7 with different scales for detection, small targets are mainly extracted from high-resolution bottom feature maps, usually F3 and F4 layers, large targets are mainly extracted from lower-resolution middle-high layer feature maps, usually F4-F7 layers, and two-layer feature maps F6 and F7 with the lowest resolution are connected with event detection heads for event feature expression and extraction.
5. The method for online detection of basketball video events and targets based on multiple tasks of claim 1, wherein for the acquired video stream, an image is extracted every n frames and converted into the RGB color space; the image is resampled, preserving the aspect ratio, so that its short side is 800 pixels, and the ImageNet mean on the three RGB channels is subtracted from the resampled image, which serves as the input of the neural network.
6. The online basketball video event and target detection method based on multitasking of claim 1, wherein a training set is expanded using a semi-supervised pseudo-label mining method, and the labels annotated every n frames are supplemented with target detection pseudo-labels for the remaining frames, specifically: a SOTA pedestrian re-identification model is used to extract the feature expression of each target in a frame, and multi-target similarity matching is performed on the feature vectors of two frames n frames apart, where the similarity is computed as cos θ_{i,j} = (o_t^i · o_{t+n}^j) / (‖o_t^i‖ ‖o_{t+n}^j‖), with o_t^i ∈ O_t and o_{t+n}^j ∈ O_{t+n}; O_t is the set of feature vectors of targets in the frame at time t and O_{t+n} is the set of feature vectors of targets in the frame at time t+n; if cos θ_{i,j} ≥ threshold, the two annotation boxes are considered to belong to the same person and the pair is successfully matched; when the proportion P_success of successfully matched targets between the two frames exceeds a threshold T, the two frames are considered successfully matched, and for each matched pair the box size and position in the unlabeled intermediate frames are computed by linear interpolation, yielding pseudo-labels for the missing frames; if P_success is less than or equal to the threshold T, no pseudo-labels are generated; the proportion of successfully matched targets P_success is computed from O_success, the number of target pairs successfully matched between the frame at time t and the frame at time t+n, relative to the numbers of targets in O_t and O_{t+n}, which have the same meaning as in the previous formula.
7. The method for online detection of basketball video events and targets based on multiple tasks according to claim 1, wherein a semi-supervised pseudo-label mining method is used to expand a training set, and, in order to effectively suppress low-quality event prediction boxes that deviate from the event climax moment, a climax-score pseudo-label is obtained by computing how close the current moment t is to the climax moment within the start and end times of the event, using T_b*, the duration from the event start time to the climax moment, T_l*, the duration from the event start time to the current moment t, T_f*, the duration from the climax moment to the event end time, and T_r*, the duration from the current moment t to the event end time; the pseudo-label value at the true climax moment is 1, and the constructed pseudo-label values decrease nonlinearly along the time axis from the climax point towards both sides within the event until they coincide with the event start and end times, where the labels are 0.
8. The online detection method of basketball video events and targets based on multiple tasks according to claim 1, wherein a training set is expanded using a semi-supervised pseudo-label mining method, and, in order to effectively suppress low-quality prediction boxes that deviate from the geometric center of the target, a target-center-score pseudo-label is obtained by computing how close the current position is to the geometric center of the annotation box containing it, using left*, the distance from the current position to the left side of the annotation box, right*, the distance to the right side, top*, the distance to the top, and bottom*, the distance to the bottom, together with a parameter ε that adjusts the distribution of the values; the pseudo-label value at the true geometric center is 1, and the constructed pseudo-label values decrease radially from the geometric center outwards within the target annotation box until the labels on the box boundary are 0.
9. The method for online detection of basketball video events and targets based on multiple tasks of claim 1, wherein the target detection loss L_object combines a classification loss, a regression loss and a center-score loss, normalized by the number of positive samples N_pos_obj and weighted by α and β,
wherein: N_pos_obj is the number of positive target samples, c_{x,y} is the target class predicted by the feature point at coordinates (x, y), c*_{x,y} is the annotated target class of the feature point at (x, y), α and β are weight parameters, b_{x,y} are the predicted box parameters corresponding to the feature point at (x, y), b*_{x,y} are the annotated box parameters corresponding to the feature point at (x, y), r_{x,y} is the target center score predicted by the feature point at (x, y), r*_{x,y} is the pseudo-label target center score of the feature point at (x, y), L_cls is the cross-entropy classification loss function, L_reg is the GIoU loss function, and L_ctr is a binary loss function;
the event detection loss L_event likewise combines a classification loss, a regression loss and a climax-score loss, normalized by the number of positive samples N_pos_ev and weighted by γ and δ,
wherein: N_pos_ev is the number of positive event samples, e_t is the event class predicted for the frame at time t, e*_t is the annotated class of the frame at time t, γ and δ are weight parameters, l_t is the predicted duration from the event start time to the current time t, r_t is the predicted duration from the current time t to the event end time, l* is the annotated duration from the event start time to the climax moment, r* is the annotated duration from the climax moment to the event end time, t* is the annotated climax moment, measured as an offset from the beginning of the whole video, h_t is the climax score predicted for the frame at time t, h*_t is the pseudo-label climax score of the frame at time t, L_cls is the cross-entropy classification loss function, L_hot is a binary loss function, and L_reg is the intersection-over-union, on the time axis, between the detected event span and the ground-truth span;
the total loss is L_total = L_object + λ·L_event, where λ is a weight parameter.
10. The online basketball video event and target detection method according to claim 1, wherein in the step S3, four parameters of regression frame coordinates are converted into a common calibration frame diagonal two-point coordinate form for the output of the network target detection part, and original image coordinate calculation formulas corresponding to coordinate points (x, y) in different scale feature graphs are as follows:wherein s is the reduced multiple of the current feature map relative to the original map; for the regressed prediction frame, suppressing the prediction frame by using NMS;
for the output of the network's event detection branch, if the final event score is below the threshold it is suppressed and no event is considered to be occurring at present; in online processing, the current event is judged to have occurred when three consecutive frames score above the threshold; in offline processing, events that overlap on the time axis are deduplicated and merged (a post-processing sketch follows this claim).
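A minimal sketch of the detection-branch decoding in claim 10, assuming the FCOS-style conventions that a feature-map location (x, y) at down-sampling factor s maps back to original-image coordinates (⌊s/2⌋ + x·s, ⌊s/2⌋ + y·s) and that the four regressed parameters are distances (l, t, r, b) from that location to the four box sides; the patent's exact mapping formula is not reproduced in this text, so these conventions and the greedy IoU-based NMS are illustrative assumptions.

```python
def decode_box(x, y, ltrb, s):
    """Map feature-map location (x, y) at stride s back to the original image and
    convert the regressed (l, t, r, b) distances into diagonal-corner (x1, y1, x2, y2) form."""
    cx, cy = s // 2 + x * s, s // 2 + y * s   # assumed feature-to-image mapping
    l, t, r, b = ltrb
    return (cx - l, cy - t, cx + r, cy + b)

def iou(a, b):
    # IoU of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box and drop any
    remaining box whose IoU with a kept box reaches the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```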
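A minimal sketch of the event post-processing in this claim: online, an event is reported only once three consecutive frames exceed the score threshold; offline, detections of the same event class that overlap on the time axis are deduplicated and merged. The per-frame score list and the (start, end, class) tuple layout are illustrative assumptions.

```python
def online_event_trigger(frame_scores, threshold):
    """Yield the index of the frame at which an event is declared: the third of
    three consecutive frames whose event score exceeds the threshold."""
    run = 0
    for i, score in enumerate(frame_scores):
        run = run + 1 if score > threshold else 0
        if run == 3:
            yield i

def merge_offline_events(events):
    """Deduplicate and merge same-class events that overlap on the time axis.
    Each event is a (start, end, cls) tuple; returns the merged list."""
    merged = []
    for start, end, cls in sorted(events, key=lambda e: (e[2], e[0])):
        if merged and merged[-1][2] == cls and start <= merged[-1][1]:
            last_start, last_end, _ = merged[-1]
            merged[-1] = (last_start, max(last_end, end), cls)
        else:
            merged.append((start, end, cls))
    return merged

# Frames 4-6 are the first three consecutive frames above 0.5, so the event fires at index 6.
print(list(online_event_trigger([0.2, 0.6, 0.6, 0.3, 0.7, 0.8, 0.9], 0.5)))  # [6]
print(merge_offline_events([(10, 20, "shot"), (18, 25, "shot"), (40, 50, "dunk")]))
```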
CN202010419217.1A 2020-05-18 2020-05-18 Basketball video event and target online detection method based on multitasking Active CN111639563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010419217.1A CN111639563B (en) 2020-05-18 2020-05-18 Basketball video event and target online detection method based on multitasking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010419217.1A CN111639563B (en) 2020-05-18 2020-05-18 Basketball video event and target online detection method based on multitasking

Publications (2)

Publication Number Publication Date
CN111639563A CN111639563A (en) 2020-09-08
CN111639563B true CN111639563B (en) 2023-07-18

Family

ID=72331022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010419217.1A Active CN111639563B (en) 2020-05-18 2020-05-18 Basketball video event and target online detection method based on multitasking

Country Status (1)

Country Link
CN (1) CN111639563B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201726B (en) * 2020-09-18 2023-02-10 深圳先进技术研究院 Convolution operation optimization method, system, terminal and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304808A (en) * 2018-02-06 2018-07-20 广东顺德西安交通大学研究院 A kind of monitor video method for checking object based on space time information Yu depth network
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 A kind of Basketball Match Context event recognition methods of fusion domain knowledge and multistage depth characteristic
CN110378208A (en) * 2019-06-11 2019-10-25 杭州电子科技大学 A kind of Activity recognition method based on depth residual error network
CN110765886A (en) * 2019-09-29 2020-02-07 深圳大学 Road target detection method and device based on convolutional neural network
WO2020088763A1 (en) * 2018-10-31 2020-05-07 Huawei Technologies Co., Ltd. Device and method for recognizing activity in videos

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11106947B2 (en) * 2017-12-13 2021-08-31 Canon Kabushiki Kaisha System and method of classifying an action or event
US11638854B2 (en) * 2018-06-01 2023-05-02 NEX Team, Inc. Methods and systems for generating sports analytics with a mobile device
US11538143B2 (en) * 2018-10-26 2022-12-27 Nec Corporation Fully convolutional transformer based generative adversarial networks

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304808A (en) * 2018-02-06 2018-07-20 广东顺德西安交通大学研究院 A kind of monitor video method for checking object based on space time information Yu depth network
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 A kind of Basketball Match Context event recognition methods of fusion domain knowledge and multistage depth characteristic
WO2020088763A1 (en) * 2018-10-31 2020-05-07 Huawei Technologies Co., Ltd. Device and method for recognizing activity in videos
CN110378208A (en) * 2019-06-11 2019-10-25 杭州电子科技大学 A kind of Activity recognition method based on depth residual error network
CN110765886A (en) * 2019-09-29 2020-02-07 深圳大学 Road target detection method and device based on convolutional neural network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Kai Kang et al. Object Detection from Video Tubelets with Convolutional Neural Networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 817-825. *
Xizhou Zhu et al. Towards High Performance Video Object Detection. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7210-7218. *
Shan Yi et al. Small Object Detection Based on a Skip-Connection Pyramid Model. CAAI Transactions on Intelligent Systems, 2019, Vol. 14, No. 6, pp. 1144-1151. *
Sun Minghua et al. Inspection Video Analysis of Metro Tunnels Based on Depthwise Separable Convolution. Computer Engineering & Science, 2020, Vol. 42, No. 4, pp. 691-698. *
Wang Huiyan et al. Deep-Learning-Assisted Multi-Pedestrian Tracking Algorithm. Journal of Image and Graphics, 2017, Vol. 22, No. 3, pp. 349-357. *
Geng Yue. Detection and Recognition of Abnormal Events in Tourist Attraction Videos. China Masters' Theses Full-text Database (Information Science and Technology), 2019, No. 8, pp. I138-860. *

Also Published As

Publication number Publication date
CN111639563A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
CN111639692B (en) Shadow detection method based on attention mechanism
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN111259779B (en) Video motion detection method based on center point track prediction
CN111210446B (en) Video target segmentation method, device and equipment
CN111368634B (en) Human head detection method, system and storage medium based on neural network
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN111723660A (en) Detection method for long ground target detection network
CN113537462A (en) Data processing method, neural network quantization method and related device
Jayasinghe et al. SwiftLane: towards fast and efficient lane detection
CN111639563B (en) Basketball video event and target online detection method based on multitasking
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
Ge et al. Improving road extraction for autonomous driving using swin transformer unet
CN115239765B (en) Infrared image target tracking system and method based on multi-scale deformable attention
CN116758340A (en) Small target detection method based on super-resolution feature pyramid and attention mechanism
Dahirou et al. Motion Detection and Object Detection: Yolo (You Only Look Once)
Nag et al. ARCN: a real-time attention-based network for crowd counting from drone images
İsa Performance Evaluation of Jaccard-Dice Coefficient on Building Segmentation from High Resolution Satellite Images
Zhang et al. Boosting the speed of real-time multi-object trackers
Guo et al. ANMS: attention-based non-maximum suppression
WO2022047736A1 (en) Convolutional neural network-based impairment detection method
Fu et al. Foreground gated network for surveillance object detection
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion
Sivaprakash et al. A Convolutional Neural Network Approach for Crowd Counting
Tian et al. Lightweight dual-task networks for crowd counting in aerial images
Shi et al. Attention-YOLOX: Improvement in On-Road Object Detection by Introducing Attention Mechanisms to YOLOX

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant