WO2020088763A1 - Device and method for recognizing activity in videos - Google Patents

Device and method for recognizing activity in videos

Info

Publication number
WO2020088763A1
Authority
WO
WIPO (PCT)
Prior art keywords
label
video
predictions
deep
rgb
Prior art date
Application number
PCT/EP2018/079890
Other languages
English (en)
Inventor
Milan REDZIC
Tarik CHOWDHURY
Shaoqing Liu
Bing Yu
Peng Yuan
Hamdi OZBAYBURTLU
Hongbin Wang
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2018/079890 priority Critical patent/WO2020088763A1/fr
Priority to CN201880098842.1A priority patent/CN112912888A/zh
Publication of WO2020088763A1 publication Critical patent/WO2020088763A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • Embodiments of the present invention relate to action recognition in videos.
  • Embodiments of the invention provide a device and method for recognizing one or more activities in a video, wherein the device and method employ a deep-learning network.
  • Embodiments of the invention are also concerned with designing an effective deep-learning network architecture, which is particularly suited for recognizing activities in videos.
  • Embodiments of the invention are applicable, for instance, to video surveillance systems and cameras.
  • BA behavior analysis
  • UBA unusual behavior analysis
  • A model can be saved and transfer learning can be used to fine-tune the weights, in order to achieve robustness on different datasets.
  • embodiments of the invention aim to improve conventional approaches for action recognition in videos.
  • An objective is to provide a device and method able to recognize activities in a video more efficiently, more accurately, and more reliably, or in other words with improved robustness, than conventional approaches.
  • The device and method should be able to recognize different types of user activities (e.g. classes) given in video or image form (i.e. so-called action events in a video) by associating a label with each activity identified in the video. Based on such labels, the device and method should also be able to analyze the behavior of people present in these videos.
  • Video surveillance systems and applications target event tasks typically performed by security personnel, i.e. detection of loitering, perimeter breach, detection of unattended objects, etc.
  • The device and method should be capable of detecting specifically those aforementioned activities in a video.
  • The device and method should be based on deep-learning techniques.
  • Embodiments of the invention are defined in the enclosed independent claims. Advantageous implementations of the embodiments of the invention are further defined in the dependent claims.
  • Embodiments of the invention allow realizing a BA module based on an activity recognition heuristic, which takes into account several conventional methods but puts them into a new unified framework. In particular, a late fusion function is employed as one main idea, in order to derive more information about the input video.
  • Embodiments of the invention further take into account principles of effective deep-learning network architectures for action recognition in a video, and are able to learn network models given only limited training samples.
  • RGB frames are individual images of the video, extracted at a particular frame-rate.
  • the OF can, for example, be calculated by determining a pattern of apparent motion of image objects between two consecutive frames caused by the movement of the objects and/or a camera.
  • OF can be described as a two dimensional (2D) vector field where each vector is a displacement vector showing the movement of points from a first frame to a second frame.
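  • As an illustration only (the patent does not prescribe a particular OF algorithm or library), dense OF between two consecutive frames can be computed, for example, with OpenCV's Farneback method and rescaled to the 0-255 range so the displacement fields can be handled like image channels; the parameter values below are assumptions:

```python
# Minimal sketch, assuming OpenCV is used; any dense OF method would do.
import cv2
import numpy as np

def optical_flow_frames(prev_bgr, next_bgr):
    """Return the x/y displacement fields between two consecutive frames,
    linearly rescaled to [0, 255] so they can be stacked like image channels."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Dense Farneback flow: one 2D displacement vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Clip large displacements and map [-bound, bound] -> [0, 255] (bound is an assumed value).
    bound = 20.0
    flow = np.clip(flow, -bound, bound)
    flow = ((flow + bound) * (255.0 / (2 * bound))).astype(np.uint8)
    return flow[..., 0], flow[..., 1]   # horizontal and vertical components
```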
  • 2D two dimensional
  • A first aspect of the invention provides a device for recognizing one or more activities in a video, each activity being associated with a predetermined label, wherein the device is configured to employ a deep-learning network and, in an inference phase, to: receive the video, separate the video into an RGB part and an OF part, employ a spatial part of the deep-learning network to calculate a plurality of spatial label predictions based on the RGB part, employ a temporal part of the deep-learning network to calculate a plurality of temporal label predictions based on the OF part, and fuse the spatial and temporal label predictions to obtain a label associated with an activity in the video.
  • A “deep-learning network” includes, for instance, a neural network like a Convolutional Neural Network (CNN) or Convolutional Networks (ConvNets), and/or includes one or more skip connections as proposed in the Residual Networks (ResNet), and/or a batch normalization (bn)-inception type of network.
  • A deep-learning network can be trained in a training phase of the device, and can be used for recognizing activities in the video during an inference phase of the device.
  • A “label” or “class label” identifies an activity or a class of an activity (e.g. “loitering” or “perimeter breach”). That is, a label is directly associated with an activity. Labels can be determined before operating the device for activity recognition.
  • A “label prediction” is a predicted label, i.e. at least one preliminary label, and may typically include a prediction of multiple labels, e.g. label candidates, each associated with a different probability of being the correct label.
  • Temporal is based on the OF, i.e. refers to the motion in the video.
  • Spatial is based on the RGB, i.e. refers to the spatial distribution of features (e.g. colors, brightness, etc.) of e.g. pixels or areas in the video.
  • The device is further configured to: extract a plurality of RGB snippets and a plurality of OF snippets from the video, in order to separate the video into the RGB part and OF part, employ the spatial part of the deep-learning network to calculate a plurality of label predictions for each of the RGB snippets, employ the temporal part of the deep-learning network to calculate a plurality of label predictions for each of the OF snippets, calculate the plurality of spatial label predictions based on the label predictions of the RGB snippets, and calculate the plurality of temporal label predictions based on the label predictions of the OF snippets.
  • A“snippet” is a short segment or piece of the video, which may for instance be randomly sampled from the video.
  • An “RGB snippet” consists of RGB frames extracted from the video snippet, while an “OF snippet” consists of OF frames extracted from the video snippet.
  • The device is further configured to: employ the spatial part of the deep-learning network to calculate a plurality of label predictions for each RGB frame in a given RGB snippet and calculate the plurality of label predictions for the given RGB snippet based on the label predictions of the RGB frames, and/or employ the temporal part of the deep-learning network to calculate a plurality of label predictions for each OF frame in a given OF snippet and calculate the plurality of label predictions for the given OF snippet based on the label predictions of the OF frames.
  • The RGB part, and each RGB snippet, includes a plurality of frames, i.e. “RGB frames”.
  • a frame is an image or picture of the video, i.e. a label prediction for a frame takes into account that picture of the video to predict one or more labels associated with activities.
  • The device is further configured to, in order to fuse the spatial and temporal label predictions: calculate a sum of normalized label predictions for the same label from a determined number of the plurality of spatial label predictions and a determined number of the plurality of temporal label predictions, and select a normalized label prediction having the highest score as the label.
  • the label can be predicted even more accurately and efficiently.
  • the device is further configured to: calculate, as the sum of normalized label predictions for the same label, a sum of a normalized scaled frequency of appearance of all spatial and temporal label predictions for the same label.
  • “Frequency of appearance” means how often a spatial or temporal label (candidate), i.e. label prediction, is predicted.
  • the device is further configured to: obtain a label for each of a plurality of videos in a dataset, and calculate an accuracy for the dataset based on the obtained labels.
  • The deep-learning network is a Temporal Segment Network (TSN)-bn-inception type of network, enhanced with skip connections from a residual network (ResNet).
  • TSN Temporal Segment Network
  • ResNet residual network
  • The device is able to efficiently yet accurately obtain the label(s) based on deep learning.
  • A skip connection shortcuts layers in the deep-learning network.
  • the spatial part and/or the temporal part of the deep-learning network comprises a plurality of connected input layers, a plurality of connected output layers, and a plurality of skip connections, each skip connection connecting an input layer to an output layer.
  • the device is further configured to, in a training and testing phase: receive a training/testing video, and output a result including a ranked list of predicted labels based on the training/testing video, each predicted label being associated with a confidence value score.
  • the result further includes a calculated loss.
  • The device is further configured to: interrupt the training phase if a loss of a predetermined value is calculated. In a further implementation form of the first aspect, the device is further configured to: obtain a pre-trained network model of the deep-learning network at the end of the training phase.
  • A second aspect of the invention provides a method for recognizing one or more activities in a video, each activity being associated with a predetermined label, wherein the method employs a deep-learning network and comprises in an inference phase: receiving the video, separating the video into an RGB part and an OF part, employing a spatial part of the deep-learning network to calculate a plurality of spatial label predictions based on the RGB part, employing a temporal part of the deep-learning network to calculate a plurality of temporal label predictions based on the OF part, and fusing the spatial and temporal label predictions to obtain a label associated with an activity in the video.
  • The method further comprises: extracting a plurality of RGB snippets and a plurality of OF snippets from the video, in order to separate the video into the RGB part and OF part, employing the spatial part of the deep-learning network to calculate a plurality of label predictions for each of the RGB snippets, employing the temporal part of the deep-learning network to calculate a plurality of label predictions for each of the OF snippets, calculating the plurality of spatial label predictions based on the label predictions of the RGB snippets, and calculating the plurality of temporal label predictions based on the label predictions of the OF snippets.
  • The method further comprises: employing the spatial part of the deep-learning network to calculate a plurality of label predictions for each RGB frame in a given RGB snippet and calculate the plurality of label predictions for the given RGB snippet based on the label predictions of the RGB frames, and/or employing the temporal part of the deep-learning network to calculate a plurality of label predictions for each OF frame in a given OF snippet and calculate the plurality of label predictions for the given OF snippet based on the label predictions of the OF frames.
  • The method further comprises, in order to fuse the spatial and temporal label predictions: outputting and calculating a sum of normalized label predictions for the same label from a determined number of the plurality of spatial label predictions and a determined number of the plurality of temporal label predictions, and selecting a normalized label prediction having the highest score as the label.
  • the method further comprises: calculating, as the sum of normalized label predictions for the same label, a sum of a normalized scaled frequency of appearance of all spatial and temporal label predictions for the same label.
  • the method further comprises: obtaining a label for each of a plurality of videos in a dataset, and calculating an accuracy for the dataset based on the obtained labels.
  • The deep-learning network is a TSN-bn-inception type of network, enhanced with skip connections from a residual network.
  • the spatial part and/or the temporal part of the deep-learning network comprises a plurality of connected input layers, a plurality of connected output layers, and a plurality of skip connections, each skip connection connecting an input layer to an output layer.
  • the method further comprises, in a training/testing phase: receiving a training/testing video, and outputting a result including a ranked list of predicted labels based on the training/testing video, each predicted label being associated with a confidence value score.
  • the result further includes a calculated loss.
  • The method further comprises: interrupting the training phase if a loss of a predetermined value is calculated. In a further implementation form of the second aspect, the method further comprises: obtaining a pre-trained network model of the deep-learning network at the end of the training phase.
  • a third aspect of the invention provides a computer program product comprising program code for controlling a device to perform the method of the second aspect or any of its implementation forms, when the program code is executed by one or more processors of the device.
  • the fusing (late fusion function) of the spatial and temporal label predictions may be performed taking into account predictions of both the RGB and OF parts, particularly fusing the output predictions of the two data streams, respectively.
  • This fusion function may be based on the top k predictions (k > 1) for each stream separately: e.g. for each RGB frame of an input video, the top k predictions of the network output may be found. Then all the predictions may be grouped based on one source of information (all the RGB frames), and the top k predictions may again be chosen (based on a majority of votes or a frequency of appearance).
  • The first-ranked (most likely) prediction may be taken as the correct one, and may be compared to its label (ground truth prediction).
  • For the OF part, the same process may be repeated, and the prediction based on this part only may be obtained.
  • The processes for RGB and OF may then be repeated, but this time taking the top m (m > 1 and preferably > k) predictions into account. Then a union (sum) of normalized predictions for the same label may be found from both parts, and the one with the most votes may be chosen.
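  • A minimal Python sketch of this late-fusion heuristic, under the assumption that per-frame scores are available as label-to-score dictionaries; the function names, the exact normalization and the tie-breaking are illustrative choices, since the text only specifies top-k/top-m selection, a normalized frequency of appearance, and a majority-style vote:

```python
from collections import Counter

def top_label_votes(per_frame_predictions, k):
    """per_frame_predictions: list of dicts {label: score}, one per RGB or OF frame.
    Returns a Counter of how often each label appears among the per-frame top-k."""
    votes = Counter()
    for scores in per_frame_predictions:
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        votes.update(top_k)
    return votes

def late_fusion(rgb_predictions, of_predictions, m=5):
    """Fuse the two streams: take the top-m votes of each stream, normalize the
    vote frequencies, sum them per label, and return the label with the most votes."""
    fused = Counter()
    for votes in (top_label_votes(rgb_predictions, m), top_label_votes(of_predictions, m)):
        total = sum(votes.values()) or 1
        for label, count in votes.most_common(m):
            fused[label] += count / total          # normalized frequency of appearance
    return fused.most_common(1)[0][0]              # label with the highest fused score

# Example with two RGB frames and two OF frames (invented scores for illustration).
rgb = [{"loitering": 0.7, "walking": 0.2, "breach": 0.1},
       {"loitering": 0.6, "breach": 0.3, "walking": 0.1}]
of  = [{"loitering": 0.5, "breach": 0.4, "walking": 0.1},
       {"breach": 0.6, "loitering": 0.3, "walking": 0.1}]
print(late_fusion(rgb, of, m=2))   # -> "loitering"
```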
  • this fusion heuristic does not actually depend on a type of the input data.
  • The main improvement provided by the embodiments of the invention is an increase of accuracy and an improvement of efficiency, as compared to conventional approaches of video action recognition.
  • The accuracy improvement is reflected on three different datasets tested, while there is also a slight improvement of the training speed.
  • FIG. 1 shows a device according to an embodiment of the invention.
  • FIG. 2 shows an inference phase of a device according to an embodiment of the invention.
  • FIG. 3 shows a training phase of a device according to an embodiment of the invention.
  • FIG. 4 shows an exemplary inference phase procedure, implemented by a device according to an embodiment of the invention.
  • FIG. 5 shows a basic block example of a deep-learning network using skip connections for a device according to an embodiment of the invention.
  • FIG. 6 shows an example of a part of a deep-learning network used by a device according to an embodiment of the invention.
  • FIG. 7 shows a method according to an embodiment of the invention.
  • FIG. 1 shows a device 100 according to an embodiment of the invention.
  • the device 100 is configured to recognize one or more activities in a (input) video 101.
  • the device 100 may be implemented in a video surveillance system, and/or may receive the video 101 from a camera, particularly a camera of a video surveillance system. However, the device 100 is able to perform action recognition on any kind of input video, regardless of its origin.
  • Each activity is associated with a predetermined label 104.
  • A number of predetermined labels 104 may be known to the device 100 and/or predetermined labels 104 may be learned or trained by the device 100.
  • The device 100 is specifically configured to employ a deep-learning network 102, and can accordingly be operated in an inference phase and a training phase.
  • The deep-learning network may be implemented by at least one processor or processing circuitry of the device 100.
  • the device 100 is particularly configured to receive the video 101 (e.g. from a video camera or from a video post-processing device, which outputs a post-processed video), in which video activities are to be recognized by the device 100.
  • The device 100 is configured to first separate the video 101 into an RGB part 101a and an OF part 101b, respectively.
  • The RGB part represents spatial features (e.g. colors, contrast, shapes etc.) in the video 101.
  • The OF part 101b represents temporal features (i.e. motion features) in the video 101.
  • The device 100 is configured to employ the deep-learning network 102, which includes a spatial part 102a and a temporal part 102b.
  • The deep-learning network 102 may be software-implemented in the device 100.
  • The spatial part 102a is employed to calculate a plurality of spatial label predictions 103a based on the RGB part 101a of the video 101.
  • The temporal part 102b is employed to calculate a plurality of temporal label predictions 103b based on the OF part 101b of the video 101.
  • The device 100 is configured to fuse the spatial and temporal label predictions 103a, 103b, in order to obtain a (fused) label 104 associated with an activity in the video 101.
  • This fusing is also referred to as late fusion, since it operates on label predictions, i.e. on preliminary results.
  • the label 104 classifies an activity in the video 101, i.e. an activity in the video 101 has been recognized.
  • FIG. 2 shows in particular a device 100 according to an embodiment of the invention, which builds on the device 100 shown in FIG. 1, and is operated in the inference phase.
  • FIG. 3 also shows a device 100 according to an embodiment of the invention, which builds on the device 100 shown in FIG. 1, but is operated in a training phase.
  • the devices 100 of FIG. 2 and FIG. 3 may be the same.
  • Training and inference (also referred to as testing) phases (also called stages) can be distinguished, because the device 100 is deep-learning network based.
  • A block diagram with respect to the testing/inference phase of the device 100 is shown in FIG. 2. It can be seen that the device 100 is configured to extract a plurality of RGB snippets 200a and a plurality of OF snippets 200b, respectively, from the video 101, in order to separate the video into the RGB part 101a and the OF part 101b. The RGB and OF snippets 200a, 200b then propagate through the corresponding deep-learning network 102 parts (i.e. the spatial part 102a and the temporal part 102b, respectively).
  • The spatial part 102a calculates a plurality of label predictions 201a for each of the RGB snippets 200a.
  • The temporal part 102b calculates a plurality of label predictions 201b for each of the OF snippets 200b.
  • Spatial and temporal consensus predictions are obtained, wherein the device 100 is configured to calculate the plurality of spatial label predictions 103a based on the label predictions 201a of the RGB snippets 200a, and to calculate the plurality of temporal label predictions 103b based on the label predictions 201b of the OF snippets 200b.
  • The spatial and temporal label predictions 103a, 103b are fused (late fusion) to obtain at least one label 104 associated with at least one activity in the video 101, i.e. a final prediction of the activity is made.
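  • The inference flow of FIG. 2 can be summarized by the following Python sketch; the two stream models are stand-in stubs, and the simple score averaging used here for consensus and fusion is an illustrative placeholder for the vote-based late-fusion heuristic described elsewhere in this document:

```python
# High-level sketch of the inference flow (stub models; names are not from the patent).
import numpy as np

NUM_CLASSES = 5
rng = np.random.default_rng(0)

def spatial_net(rgb_snippet):      # placeholder for the spatial part 102a
    return rng.random((len(rgb_snippet), NUM_CLASSES))   # per-frame class scores

def temporal_net(of_snippet):      # placeholder for the temporal part 102b
    return rng.random((len(of_snippet), NUM_CLASSES))

def stream_consensus(snippets, net):
    """Average per-frame scores into per-snippet scores, then average over snippets."""
    snippet_scores = [net(s).mean(axis=0) for s in snippets]
    return np.mean(snippet_scores, axis=0)                # stream-level label predictions

def recognize(rgb_snippets, of_snippets):
    spatial = stream_consensus(rgb_snippets, spatial_net)   # predictions 103a
    temporal = stream_consensus(of_snippets, temporal_net)  # predictions 103b
    fused = spatial + temporal                              # simplified late fusion
    return int(np.argmax(fused))                            # index of the obtained label 104

rgb_snippets = [rng.random((3, 224, 224, 3)) for _ in range(4)]   # dummy RGB snippets 200a
of_snippets = [rng.random((3, 224, 224, 2)) for _ in range(4)]    # dummy OF snippets 200b
print(recognize(rgb_snippets, of_snippets))
```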
  • the label 104, or multiple labels 104 may be provided to a watch-list ranking block 202. Multiple predictions for multiple videos available (from a dataset) may be processed to obtain a final accuracy (on the whole dataset).
  • the device 100 may obtain at least one label 104 for each of a plurality of videos 101 in the dataset, and may calculate an accuracy for the dataset based on the obtained labels 104.
  • a validation output 301 which includes a ranked list of (the current) predicted labels 104 based on input video frames (images) of a training/testing video 300, may be used to calculate a validation accuracy result, in addition to corresponding confidence value scores.
  • the device 100 may output the result 301 including a ranked list of predicted labels 104 based on the training/testing video 300, wherein each predicted label 104 is associated with a confidence value score.
  • A loss may be calculated, and the overall training process may be repeated, until the process either finishes with the last training iteration or reaches a particular (predefined) loss value.
  • The result 301 output by the device 100 in the training phase may further include a calculated loss, and the device 100 may be configured to interrupt the training phase if a loss of a predetermined value is calculated.
  • the device 100 obtains a pre-trained network model of the deep-learning network 102 at the end of the training phase (i.e. at least a network graph and trained network weights), which can be used by the device 100 during the testing (inference) phase, e.g. as shown in FIG. 2, for recognizing activities in the video 101.
  • The deep-learning network 102 employed by the device 100 may be a TSN-bn-inception type of network, enhanced with skip connections from a ResNet.
  • The deep-learning network 102 may be a modification and/or combination of different building blocks; in particular, it may be based on a combination of a TSN and a bn-inception type network with skip connections as proposed in the ResNets. Below, first the individual building blocks and then the combined deep-learning network 102 are described.
  • The TSN may be chosen as one building block of the deep-learning network 102.
  • the TSN is generally enabled to model dynamics throughout a video.
  • the TSN may to this end be composed of spatial stream ConvNets and temporal stream ConvNets.
  • A general approach for performing action recognition in a video with a device 100 employing such a TSN is shown in FIG. 4.
  • an inference phase of such a device 100 is shown.
  • The input video is divided 400 into a plurality of segments (also referred to as slices or chunks), and then short snippets are extracted 401 from each segment, wherein a snippet comprises more than one frame, i.e. a plurality of frames.
  • the TSN operates on a sequence of short snippets sparsely sampled (in time and/or spatial domain, for example depending on the video size) from the entire video.
  • Each snippet in the sequence may produce its own preliminary prediction of action classes (class scores 402).
  • Class scores 402 of different snippets may be fused 403 by a segmental consensus function to yield segmental consensus, which is a video-level prediction. Predictions from all modalities are then fused 404 to produce the final prediction.
  • ConvNets on all snippets may share parameters.
  • the loss values of video-level predictions may be optimized by iteratively updating the model parameters.
  • A given video V may be divided into K segments {S1, S2, ..., SK} of equal durations.
  • the TSN may model a sequence of snippets as follows:
  • TSN(T1, T2, ..., TK) = M(G(F(T1; W), F(T2; W), ..., F(TK; W))).
  • T1, T2, ..., TK is a sequence of snippets.
  • Each snippet Tk may be randomly sampled from a corresponding segment Sk, wherein k is an integer index in a range from 1 to K.
  • F(Tk;W) may define a function representing a ConvNet with parameters W, which operates on the short snippet Tk and produces class scores for all the classes.
  • the segmental consensus function G combines the outputs from multiple short snippets to obtain a consensus of class hypothesis among them. Based on this consensus, the prediction function M (Softmax function) predicts the probability of each action class for the whole video. Combining with standard categorical cross-entropy loss, the final loss function regarding the segmental consensus may read:
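  • The formula itself did not survive extraction here; based on the definitions above and the cited TSN formulation, a plausible reconstruction (an assumption, not verbatim text of the application) is:

```latex
\mathcal{L}(y, G) = -\sum_{i=1}^{C} y_i \Big( G_i - \log \sum_{j=1}^{C} \exp G_j \Big)
```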
  • C is the number of action classes and yi the ground-truth label concerning class i.
  • a class score Gi is inferred from the scores of the same class on all the snippets, using an aggregation function.
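  • A minimal sketch of this aggregation, assuming averaging as the aggregation function (one common choice, not mandated by the text) followed by the Softmax prediction function M:

```python
import numpy as np

def segmental_consensus(snippet_scores):
    """snippet_scores: array of shape (K, C) holding F(T_k; W) for each snippet.
    G_i is taken here as the average of class-i scores over the K snippets (assumed aggregation)."""
    G = np.mean(snippet_scores, axis=0)                 # consensus scores, shape (C,)
    # Prediction function M: Softmax over the consensus scores.
    exp_G = np.exp(G - G.max())                         # numerically stable Softmax
    return exp_G / exp_G.sum()                          # probability of each action class

scores = np.array([[2.0, 0.5, 0.1],    # snippet T1
                   [1.5, 0.7, 0.2],    # snippet T2
                   [1.8, 0.4, 0.3]])   # snippet T3
print(segmental_consensus(scores))     # video-level class probabilities
```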
  • Inception with Batch Normalization may be chosen as another building block of the deep-learning network 102. That is, the deep-learning network 102 may particularly be or include a bn-inception type of network. The bn-inception type of network may be specifically chosen, due to its good balance between accuracy and efficiency.
  • the bn-inception architecture may be specifically designed for the two-stream ConvNets as the first building block.
  • the spatial stream ConvNet may operate on a single RGB image, and the temporal stream ConvNet may take a stack of consecutive OF fields as input.
  • the two-stream ConvNets may use RGB images for the spatial stream and stacked OF fields for the temporal stream.
  • a single RGB image usually encodes a static appearance at a specific time point, and lacks the contextual information about previous and next frames.
  • The temporal stream ConvNet takes the OF fields as input, and aims to capture the motion information. In realistic videos, however, there usually exists camera motion, and the OF fields may not concentrate on the human movements.
  • A ResNet framework may be chosen as another building block of the deep-learning network 102. Although deep networks have better performance in classification most of the time, they are harder to train than ResNets, which is mainly due to two reasons:
  • 1. Vanishing / exploding gradients: sometimes a neuron dies during the training process, and depending on its activation function it might never be in operation again. This problem can be resolved by employing some initialization techniques. 2. Harder optimization: when the model introduces more parameters, it becomes more difficult to train the network.
  • ResNets have shortcut connections parallel to their normal convolutional layers. This results in a faster training and also provides a clear path for gradients to back propagate to early layers of the network. This makes the learning process faster by avoiding vanishing gradients or dead neurons.
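  • As a minimal illustration of such a shortcut (skip) connection, the following generic PyTorch basic block adds the input back onto the convolutional path; it is an illustrative sketch, not the embodiment's actual 202-layer TSN-bn-inception network described below:

```python
# Generic residual basic block with an identity skip connection (illustrative only).
import torch
import torch.nn as nn

class BasicSkipBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                         # skip connection: shortcut past the conv layers
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                 # addition layer merges input and conv path
        return self.relu(out)                # ReLU after the addition

x = torch.randn(1, 64, 56, 56)
print(BasicSkipBlock(64)(x).shape)           # torch.Size([1, 64, 56, 56])
```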
  • a ResNet model specifically designed for the deep-learning network 102 of the device 100 may accept images and classify them.
  • A naive method could be just up-sampling an image and then giving it to the trained model, or just skipping the first layer and inserting the original image as the input of the second convolutional layer, and then fine-tuning a few of the last layers to get higher accuracy.
  • The deep-learning network 102 of the device 100 may be based on TSN-bn-inception with skip connections (as proposed in the ResNets).
  • a two-step approach may be applied as described in the following:
  • skip connections and deep residual layers may be used to allow the network to learn deviations from the identity layer.
  • the network may be simplified by reducing the layers and approximating them with layers which can better distinguish the features and improve the accuracy.
  • An embodiment of the deep learning network consists, for example, of 202 layers in total.
  • the residual connections are also part of the network.
  • The Rectified Linear Unit (ReLU) layer is connected to the convolutional layer of every sub-inception unit. Also, the convolutional layer of this unit is connected to the output of the 8th unit placed in the middle of the large network.
  • An addition layer connects the input with a batch normalization layer and can lead to a ReLU layer after the addition process. Modifications were done on the following parts for both the RGB and OF streams and throughout the network:
  • 1. An addition layer is placed between the convolutional inception_3a_1x1 and bn inception_3a_1x1_bn layers and connects an input directly to the addition layer.
  • A basic block example of the network is shown in FIG. 5, and a zoomed-in part of the network looks, for example, like the one shown in FIG. 6.
  • Data augmentation is an effective method to expand the training data by applying transformations and deformations to the labeled data, resulting in new samples as additional training data.
  • The following data augmentation techniques were used: random brightness, random flip (left-to-right flipping) and a bit of random contrast.
  • The addition of these varying versions of images enables the networks to model discriminative characteristics pertaining to this variety of representations.
  • The training of the deep networks with the augmented data will improve their generalization on unseen samples.
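  • A sketch of these augmentations with torchvision; the library choice and the parameter values are assumptions, since the text only names random brightness, left-to-right flipping and a bit of random contrast:

```python
# Illustrative augmentation pipeline (parameter values are assumed, not from the patent).
from torchvision import transforms

train_augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                # random left-to-right flip
    transforms.ColorJitter(brightness=0.4, contrast=0.1),  # random brightness, a bit of contrast
    transforms.ToTensor(),
])

# Applied per training frame, e.g. augmented = train_augmentation(pil_frame)  # pil_frame is hypothetical
```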
  • details of the network training are described.
  • A cross-modality pre-training technique is applied, in which RGB models are utilized to initialize the temporal networks.
  • The OF fields (OF snippets) are discretized into the interval from 0 to 255 by a linear transformation. This step makes the range of the OF fields the same as that of RGB images (RGB snippets).
  • The weights of the first convolution layer of the RGB models are modified to handle the input of OF fields. Specifically, the weights are averaged across the RGB channels and this average is replicated according to the channel number of the temporal network input.
  • Batch normalization will estimate the activation mean and variance within each batch and use them to transform these activation values into a standard Gaussian distribution. This operation speeds up the convergence of training but also leads to over-fitting in the transferring process, due to the biased estimation of activation distributions from a limited number of training samples. Therefore, after initialization with pre-trained models, the mean and variance parameters of all Batch Normalization layers except the first one are frozen. As the distribution of OF is different from that of the RGB images, the activation values of the first convolution layer will have a different distribution and the mean and variance need to be re-estimated accordingly. An extra dropout layer is added after the global pooling layer in the bn-inception architecture to further reduce the effect of over-fitting.
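  • A minimal PyTorch sketch of this cross-modality initialization and partial batch-normalization freezing; the helper names and the number of stacked OF input channels (10, e.g. five flow frames with x/y components) are illustrative assumptions:

```python
import torch.nn as nn

def cross_modality_init(rgb_first_conv: nn.Conv2d, of_channels: int = 10) -> nn.Conv2d:
    """Build the first conv layer of the temporal net from a pre-trained RGB conv layer:
    average its weights across the 3 RGB input channels and replicate the average
    to match the number of stacked OF input channels."""
    w = rgb_first_conv.weight.data                        # shape (out, 3, kH, kW)
    w_avg = w.mean(dim=1, keepdim=True)                   # average across the RGB channels
    of_conv = nn.Conv2d(of_channels, w.shape[0],
                        kernel_size=rgb_first_conv.kernel_size,
                        stride=rgb_first_conv.stride,
                        padding=rgb_first_conv.padding,
                        bias=rgb_first_conv.bias is not None)
    of_conv.weight.data = w_avg.repeat(1, of_channels, 1, 1)   # replicate per OF channel
    return of_conv

def freeze_bn_except_first(model: nn.Module):
    """Keep all BatchNorm layers except the first in eval mode so their running
    mean/variance stop updating; the first layer is re-estimated for the OF distribution."""
    first = True
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            if first:
                first = False
                continue
            m.eval()
```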
  • Data augmentation can generate diverse training samples and prevent severe over-fitting.
  • random left to right flipping in addition to random contrast and brightness are employed to augment training samples.
  • FIG. 7 shows a method according to an embodiment of the invention.
  • the method 700 is for recognizing one or more activities in a video 101, each activity being associated with a predetermined label 104.
  • the method 700 employs a deep-learning network 102 and may be carried out by the device 100 shown in FIG. 1 or FIG. 2.
  • The method 700 comprises: a step 701 of receiving the video 101; a step 702 of separating the video 101 into an RGB part 101a and an OF part 101b; a step 703 of employing a spatial part 102a of the deep-learning network 102 to calculate a plurality of spatial label predictions 103a based on the RGB part 101a; a step 704 of employing a temporal part 102b of the deep-learning network 102 to calculate a plurality of temporal label predictions 103b based on the OF part 101b; and a step 705 of fusing the spatial and temporal label predictions 103a, 103b to obtain a label 104 associated with an activity in the video 101.
  • Embodiments of the invention may be implemented in hardware, software or any combination thereof.
  • Embodiments of the invention e.g. the device and/or the hardware implementation, may be implemented as any of a variety of suitable circuitry, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, etc., or any combinations thereof.
  • Embodiments may comprise computer program products comprising program code for performing, when implemented on a processor, any of the methods described herein.
  • Further embodiments may comprise at least one memory and at least one processor, which are configured to store and execute program code to perform any of the methods described herein.
  • Embodiments may comprise a device configured to store instructions for software in a suitable, non-transitory computer-readable storage medium and to execute the instructions in hardware using one or more processors to perform any of the methods described herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

According to embodiments, the present invention relates to action recognition in videos. To this end, an embodiment of the invention concerns a device and a method for recognizing one or more activities in a video, the device and method employing a deep-learning network. The device is configured to: receive the video; separate the video into an RGB part and an optical flow (OF) part; employ a spatial part of the deep-learning network to calculate a plurality of spatial label predictions based on the RGB part; employ a temporal part of the deep-learning network to calculate a plurality of temporal label predictions based on the OF part; and fuse the spatial and temporal label predictions to obtain a label associated with an activity in the video.
PCT/EP2018/079890 2018-10-31 2018-10-31 Dispositif et procédé de reconnaissance d'activité dans des vidéos WO2020088763A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2018/079890 WO2020088763A1 (fr) 2018-10-31 2018-10-31 Dispositif et procédé de reconnaissance d'activité dans des vidéos
CN201880098842.1A CN112912888A (zh) 2018-10-31 2018-10-31 识别视频活动的设备和方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/079890 WO2020088763A1 (fr) 2018-10-31 2018-10-31 Dispositif et procédé de reconnaissance d'activité dans des vidéos

Publications (1)

Publication Number Publication Date
WO2020088763A1 true WO2020088763A1 (fr) 2020-05-07

Family

ID=64109862

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/079890 WO2020088763A1 (fr) 2018-10-31 2018-10-31 Dispositif et procédé de reconnaissance d'activité dans des vidéos

Country Status (2)

Country Link
CN (1) CN112912888A (fr)
WO (1) WO2020088763A1 (fr)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Serious Games", vol. 9912, 2016, SPRINGER INTERNATIONAL PUBLISHING, Cham, ISBN: 978-3-540-37274-5, ISSN: 0302-9743, article LIMIN WANG ET AL: "Temporal Segment Networks: Towards Good Practices for Deep Action Recognition", pages: 20 - 36, XP055551834, 032682, DOI: 10.1007/978-3-319-46484-8_2 *
ANDREJ KARPATHY ET AL: "Large-Scale Video Classification with Convolutional Neural Networks", 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, June 2014 (2014-06-01), pages 1725 - 1732, XP055560536, ISBN: 978-1-4799-5118-5, DOI: 10.1109/CVPR.2014.223 *
KAREN SIMONYAN ET AL: "Two-Stream Convolutional Networks for Action Recognition in Videos", 9 June 2014 (2014-06-09), XP055324674, Retrieved from the Internet <URL:http://papers.nips.cc/paper/5353-two-stream-convolutional-networks-for-action-recognition-in-videos.pdf> [retrieved on 20180604] *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639563A (zh) * 2020-05-18 2020-09-08 浙江工商大学 一种基于多任务的篮球视频事件与目标在线检测方法
CN111639563B (zh) * 2020-05-18 2023-07-18 浙江工商大学 一种基于多任务的篮球视频事件与目标在线检测方法
CN111738171A (zh) * 2020-06-24 2020-10-02 北京奇艺世纪科技有限公司 视频片段检测方法、装置、电子设备及存储介质
CN111709410A (zh) * 2020-08-20 2020-09-25 深兰人工智能芯片研究院(江苏)有限公司 一种强动态视频的行为识别方法
CN111709410B (zh) * 2020-08-20 2020-12-01 深兰人工智能芯片研究院(江苏)有限公司 一种强动态视频的行为识别方法
WO2022134576A1 (fr) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Procédé, appareil et dispositif de positionnement de comportement de moment de vidéo infrarouge, et support de stockage
CN113095128A (zh) * 2021-03-01 2021-07-09 西安电子科技大学 基于k最远交叉一致性正则化的半监督时序行为定位方法
CN113095128B (zh) * 2021-03-01 2023-09-19 西安电子科技大学 基于k最远交叉一致性正则化的半监督时序行为定位方法
CN113139479A (zh) * 2021-04-28 2021-07-20 山东大学 一种基于光流和rgb模态对比学习的微表情识别方法及系统
CN113255489A (zh) * 2021-05-13 2021-08-13 东南大学 一种基于标记分布学习的多模态跳水赛事智能评估方法
CN113255489B (zh) * 2021-05-13 2024-04-16 东南大学 一种基于标记分布学习的多模态跳水赛事智能评估方法

Also Published As

Publication number Publication date
CN112912888A (zh) 2021-06-04

Similar Documents

Publication Publication Date Title
Zitouni et al. Advances and trends in visual crowd analysis: A systematic survey and evaluation of crowd modelling techniques
Xiong et al. Spatiotemporal modeling for crowd counting in videos
US10628683B2 (en) System and method for CNN layer sharing
Elharrouss et al. Gait recognition for person re-identification
US10402655B2 (en) System and method for visual event description and event analysis
WO2020088763A1 (fr) Dispositif et procédé de reconnaissance d&#39;activité dans des vidéos
WO2018192570A1 (fr) Procédé et système de détection de mouvement dans le domaine temporel, dispositif électronique et support de stockage informatique
Zhao et al. Robust unsupervised motion pattern inference from video and applications
Asad et al. Anomaly3D: Video anomaly detection based on 3D-normality clusters
Singh et al. A deep learning based technique for anomaly detection in surveillance videos
Ratre Taylor series based compressive approach and Firefly support vector neural network for tracking and anomaly detection in crowded videos
Ahmed et al. Crowd Detection and Analysis for Surveillance Videos using Deep Learning
Hu et al. Two-stage unsupervised video anomaly detection using low-rank based unsupervised one-class learning with ridge regression
Thai et al. Real-time masked face classification and head pose estimation for RGB facial image via knowledge distillation
Singh et al. Stemgan: spatio-temporal generative adversarial network for video anomaly detection
WO2020192868A1 (fr) Détection d&#39;événement
Latha et al. Human action recognition using deep learning methods (CNN-LSTM) without sensors
Veluchamy et al. Detection and localization of abnormalities in surveillance video using timerider-based neural network
Zhang et al. Video entity resolution: Applying er techniques for smart video surveillance
Aarthy et al. Crowd violence detection in videos using deep learning architecture
Vahora et al. Comprehensive analysis of crowd behavior techniques: A thorough exploration
Farrajota et al. A deep neural network video framework for monitoring elderly persons
Seemanthini et al. Recognition of trivial humanoid group event using clustering and higher order local auto-correlation techniques
Radulescu et al. Model of human actions recognition based on 2D Kernel
Cuong Noisy-label propagation for video anomaly detection with graph transformer network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18796913

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18796913

Country of ref document: EP

Kind code of ref document: A1