WO2016183770A1 - A system and a method for predicting crowd attributes - Google Patents
A system and a method for predicting crowd attributes
- Publication number
- WO2016183770A1 (PCT/CN2015/079190)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- crowd
- motion
- video
- attributes
- features
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
Definitions
- Fig. 1 illustrates a system 1000 for predicting crowd attributes.
- the proposed system 1000 is capable of understanding crowded scenes in computer vision at the attribute level, characterizing a crowded scene by predicting a plurality of attributes rather than by discriminative assignment to a single specific category. This is significant in many applications, e.g. video surveillance and video search engines.
- the system 1000 comprises a feature extracting device 100 and a prediction device 200.
- Fig. 2 illustrates a schematic diagram illustrating a flow chart for the system 1000 according to one embodiment of the present application.
- at step S201, the feature extracting device 100 obtains a video with crowd scenes and extracts appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; then, at step S202, the prediction device 200 predicts attributes of the crowd in the video based on the extracted motion features and the extracted appearance features, as will be further discussed later.
- the feature extracting device 100 may deeply learn the appearance and motion representation across different crowded scenes.
- Fig. 3 illustrates a schematic block diagram of the feature extracting device 100 according to an embodiment of the present application.
- the feature extracting device 100 comprises an appearance feature extracting unit 101 configured to extract the RGB components of each frame from the input video.
- the feature extracting device 100 further comprises a motion feature extracting unit 102 configured to extract motion features from the obtained video.
- the motion feature extracting unit 102 further comprises a tracklet detection module 1021 to detect crowd tracklets (i.e., short trajectories) for each frame in the obtained video with crowd scene.
- the tracklet detection module 1021 may utilize the well-known KLT feature point tracker to detect several key points for each frame in the obtained video.
- the detected key points are tracked with the matching algorithm predefined by the KLT, and the corresponding key points across consecutive frames are matched to extract the tracklets.
- a plurality of key points are detected in one person in the crowd in each frame.
- each of the motion features is computed on a certain number of (for example, 75) frames of the obtained video.
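The tracklet extraction described above can be sketched as follows. This is an illustrative simplification, not the patent's actual implementation: it assumes key points have already been detected in each frame (e.g. by a KLT tracker), and a greedy nearest-neighbour match stands in for KLT's own matching algorithm.

```python
import numpy as np

def link_tracklets(frames_pts, max_dist=5.0):
    """Link key points across consecutive frames into short tracklets.

    frames_pts: list of (N_t, 2) arrays of key-point coordinates per frame.
    Returns a list of tracklets, each an array of (x, y) positions over time.
    """
    # Start one tracklet per key point in the first frame.
    tracklets = [[tuple(p)] for p in frames_pts[0]]
    active = list(range(len(tracklets)))  # tracklets still being extended
    for pts in frames_pts[1:]:
        next_active = []
        used = set()
        for ti in active:
            last = np.array(tracklets[ti][-1])
            # Greedy nearest-neighbour match, standing in for KLT's matcher.
            d = np.linalg.norm(pts - last, axis=1)
            j = int(np.argmin(d))
            if d[j] <= max_dist and j not in used:
                tracklets[ti].append(tuple(pts[j]))
                used.add(j)
                next_active.append(ti)
        active = next_active
    return [np.array(t) for t in tracklets]
```

A tracklet ends as soon as no nearby match is found in the next frame, which matches the "short trajectory" character of tracklets.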
- the motion feature extracting unit 102 further comprises a motion distribution determination module 1022 to compute physical relationships between each tracklet and its neighbors to determine the motion distribution in each frame.
- the scene-independent properties for groups in crowd exist in the whole scene space and can be quantified from scene-level.
- three properties namely collectiveness, stability and conflict are computed for the frames.
- the collectiveness indicates the degree of individuals in the whole scene acting as a union in collective motion
- the stability characterizes whether the whole scene can keep its topological structures
- conflict measures the interaction/friction between each pair of nearest neighbors of interest points.
- examples shown in Fig. 4 illustrate each property intuitively. Referring to Fig. 4, for each channel, two examples are shown in the first and second rows.
- in Fig. 4-a, people in the crowded scene walk randomly towards different destinations, and thus exhibit low collectiveness.
- in Fig. 4-b, a marathon video shows people running coherently towards the same destination, exhibiting high collectiveness.
- the present application is not restricted to the three proposed properties; other properties can be generated if required.
- the motion distribution determination module 1022 operates to define a K-NN graph G (V, E) over the whole point set of the tracklets detected by the tracklet detection module 1021, whose vertices V represent the tracklet points and whose edges E connect pairs of tracklet points.
- the motion distribution determination module 1022 then extracts three motion maps, namely the collectiveness distribution, the stability distribution, and the conflict distribution, for each frame.
- the collectiveness distribution (or map) can be computed by integrating path similarities among crowds on collective manifold.
- B. Zhou, X. Tang, H. Zhang, and X. Wang have proposed the Collective Merging algorithm to detect collective motions from random motions by modeling collective motions on the manifold, in “Measuring crowd collectiveness” (TPAMI, 36 (8): 1586-1599, 2014).
- the stability distribution is extracted by counting and averaging the number of invariant neighbors of each point in the K-NN graph.
- for each member i, its K-NN set in the first frame is compared with its K-NN set in the τ-th frame. A member has high stability if its neighbor set varies little across frames; thus, the larger the variation between the two neighbor sets is, the lower the stability the member has.
- the conflict distribution is extracted by computing the velocity correlation between each pair of nearby tracklet points ⁇ z, z * ⁇ within the K-NN graph.
- for each member i, if the velocity of each member in its K-NN set is similar to its own velocity, it has low conflict: its neighbors move coherently with it without generating conflict.
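The stability and conflict properties described above might be computed per tracklet point roughly as follows. This is a hedged sketch, since the text does not give exact formulas: the fraction of K-NN neighbours kept between two frames stands in for stability, and one minus the mean velocity correlation with the K-NN neighbours stands in for conflict, both over a brute-force K-NN graph.

```python
import numpy as np

def knn_sets(pts, k):
    """Indices of the k nearest neighbours of each point (brute force)."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    return [set(np.argsort(row)[:k]) for row in d]

def stability(pts_first, pts_tau, k=3):
    """Per-point stability: fraction of K-NN neighbours kept between the
    first and the tau-th frame; larger variation means lower stability."""
    n1, nt = knn_sets(pts_first, k), knn_sets(pts_tau, k)
    return np.array([len(a & b) / k for a, b in zip(n1, nt)])

def conflict(pts, vel, k=3):
    """Per-point conflict: one minus the mean cosine similarity between a
    point's velocity and the velocities of its k nearest neighbours."""
    nbrs = knn_sets(pts, k)
    unit = vel / (np.linalg.norm(vel, axis=1, keepdims=True) + 1e-9)
    return np.array([1.0 - np.mean([unit[i] @ unit[j] for j in nb])
                     for i, nb in enumerate(nbrs)])
```

Coherent motion (all velocities aligned) yields conflict near zero, and unchanged neighbour sets yield stability one, matching the intuitions in the text.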
- the motion feature extracting unit 102 further comprises a continuous motion channel generation module 1023 to average the per-frame motion maps (for example, the collectiveness maps, the stability maps and the conflict maps) across the temporal domain, and to interpolate the sparse tracklet points to output three complete and continuous motion channels.
- although a single frame contains tens or hundreds of tracklets, the tracklet points are still spatially sparse.
- a Gaussian kernel can be utilized to interpolate the averaged motion maps so as to obtain continuous motion channels.
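The averaging and Gaussian interpolation steps above can be sketched as follows. Rasterizing the tracklet points onto a fixed grid and the kernel width are assumptions of this sketch, not details taken from the text.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel."""
    ax = np.arange(-radius, radius + 1)
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def continuous_channel(maps, shape, sigma=2.0):
    """Average sparse per-frame motion maps across the temporal domain and
    interpolate them with a Gaussian kernel into one dense channel.

    maps: list of (points, values) pairs, one per frame, where points is an
    (N, 2) integer array of (row, col) tracklet locations and values holds
    the per-point property (e.g. collectiveness) for that frame.
    """
    acc = np.zeros(shape)
    for points, values in maps:
        frame = np.zeros(shape)
        frame[points[:, 0], points[:, 1]] = values
        acc += frame
    acc /= len(maps)  # temporal averaging of the per-frame maps
    # Separable Gaussian smoothing interpolates the sparse tracklet points.
    g = gaussian_kernel(sigma, radius=int(3 * sigma))
    sm = np.apply_along_axis(lambda r: np.convolve(r, g, mode='same'), 1, acc)
    sm = np.apply_along_axis(lambda c: np.convolve(c, g, mode='same'), 0, sm)
    return sm
```

The same routine would be run once per property to produce the three continuous motion channels.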
- the system 1000 further comprises a prediction device 200.
- the prediction device 200 is in electronic communication with the feature extracting device 100 and is configured to obtain the appearance features of the video, receive the extracted motion features from the feature extracting device 100, and predict attributes of the crowd in the video based on the received motion features and/or the obtained appearance features.
- with this function, the system can effectively detect attributes, including the roles of people, their activities and the locations, from crowd videos, so as to describe the content of the crowd videos. Therefore, crowd videos with the same set of attributes can be retrieved, and the similarity of different crowd videos can be measured by their attribute sets. Furthermore, there are a large number of possible interactions among these attributes: some attributes are likely to be detected simultaneously, while others are mutually exclusive.
- for example, the scenario attribute “street” is likely to co-occur with the subject “pedestrian” when the subject is “walking”, and also likely to co-occur with the subject “mob” when the subject is “fighting”, but it is not related to the subject “swimmer” because the subject cannot “swim” on a “street”.
- the prediction device 200 may be configured as a model with a convolutional neural network structure, as shown in Fig. 5.
- as shown in Fig. 5, two branches are included in the convolutional neural network structure.
- the number of branches is not limited to the proposed two; it can be generalized to more branches. The number of each type of layer and the number of parameters can also be tuned according to different tasks and objectives.
- the network comprises: one or more data layers 501, one or more convolution layers 502, one or more max/sum pooling layers 503, one or more normalization layers 504 and a fully-connected layer 505.
- the data layer 501 of the top appearance branch contains the RGB components (or channels) of the images and their labels (for example, the label dimension is 94);
- the data layer 501 of the bottom motion branch contains at least one motion feature (for example, the three proposed motion channels discussed above: collectiveness, stability and conflict) and labels identical to those of the top branch.
- this layer 501 provides images and their labels, where x_ij is the j-th element of the d-dimensional feature vector of the i-th input image region, and y_ij is the j-th bit of the n-dimensional label vector of the i-th input image region.
- the convolution layer 502 receives the outputs (the feature vectors and their labels) from the data layer 501 and performs convolution, padding, and non-linear transformation operations.
- the convolution operation in each convolutional layer may be expressed as

  y_j = f ( Σ_i x_i * k_ij + b_j )    (1)

  where * denotes the convolution operation and f ( · ) is the non-linear transformation (ReLU);
- x_i and y_j are the i-th input feature map and the j-th output feature map, respectively;
- k_ij is the convolution kernel between the i-th input feature map and the j-th output feature map; and
- b_j is the bias of the j-th output feature map.
- the convolution operation can extract features from the input image, such as edges, curves and dots. These features are not predefined manually but are learned from the training data.
- when the convolution kernel k_ij operates on the marginal pixels of x_i, it exceeds the border of x_i. In this case, the values that exceed the border of x_i are set to 0 so as to make the operation valid. This operation is also called “padding”.
- the order of the above operations is: padding -> convolutions ->non-linear transformation (ReLU) .
- the input to “padding” is x i in equation (1) .
- Each step uses the output of the previous step.
- the non-linear transformation produces y_j in equation (1).
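The padding, convolution and ReLU steps described above can be sketched in NumPy. This is a simplified illustration of y_j = ReLU(Σ_i x_i * k_ij + b_j); the array layout, odd kernel sizes and "same"-size zero padding are assumptions of the sketch.

```python
import numpy as np

def conv_layer(x, k, b):
    """Forward pass of one convolutional layer.

    x: (I, H, W) input feature maps; k: (I, J, Kh, Kw) kernels with odd
    Kh, Kw; b: (J,) biases. Order of operations: padding -> convolution
    -> non-linear transformation (ReLU), as described in the text.
    """
    I, J, Kh, Kw = k.shape
    H, W = x.shape[1], x.shape[2]
    ph, pw = Kh // 2, Kw // 2
    # Padding: values that exceed the border of x_i are set to 0.
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    y = np.zeros((J, H, W))
    for j in range(J):
        for i in range(I):
            for u in range(H):
                for v in range(W):
                    # Sliding-window (cross-correlation) form of convolution.
                    y[j, u, v] += np.sum(xp[i, u:u + Kh, v:v + Kw] * k[i, j])
        y[j] += b[j]  # bias of the j-th output feature map
    return np.maximum(y, 0.0)  # non-linear transformation (ReLU)
```

Real implementations vectorize these loops, but the arithmetic is the same.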
- the max pooling layer keeps the maximum value in a local window and discards the other values, so the dimension of the output is smaller than that of the input; this may be formulated as

  y_i (m, n) = max { x_i (m·s + p, n·s + q) : 0 ≤ p < M, 0 ≤ q < N }    (2)
- each neuron in the i-th output feature map y_i pools over an M × N local region in the i-th input feature map x_i, with s as the step size.
- the spatial invariance means that if the input shifts by several pixels, the output of the layer won’t change much.
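The max-pooling operation described above can be sketched directly from its definition, with M × N windows and step size s:

```python
import numpy as np

def max_pool(x, M=2, N=2, s=2):
    """Max pooling: each output neuron keeps the maximum over an M x N
    local region of the input feature map x, moved with step size s."""
    H = (x.shape[0] - M) // s + 1
    W = (x.shape[1] - N) // s + 1
    y = np.zeros((H, W))
    for m in range(H):
        for n in range(W):
            # Maximum over the local window; other values are discarded.
            y[m, n] = x[m * s:m * s + M, n * s:n * s + N].max()
    return y
```

Because only the window maximum survives, shifting the input by a few pixels changes the output little, which is the spatial invariance noted above.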
- This layer normalizes the responses in local regions of input feature maps.
- the output dimensionality of this layer is equal to the input dimensionality.
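The normalization layer might be implemented as a cross-channel local response normalization. The text does not give a formula, so the AlexNet-style form and constants below are assumptions of this sketch:

```python
import numpy as np

def local_response_norm(x, local_size=5, k=2.0, alpha=1e-4, beta=0.75):
    """Cross-channel local response normalization (assumed AlexNet-style):
    each activation is divided by (k + alpha * sum of squares over
    `local_size` adjacent channels) ** beta.
    x: (C, H, W) feature maps; output has the same dimensionality."""
    C = x.shape[0]
    half = local_size // 2
    y = np.empty_like(x)
    for c in range(C):
        lo, hi = max(0, c - half), min(C, c + half + 1)
        denom = (k + alpha * np.sum(x[lo:hi] ** 2, axis=0)) ** beta
        y[c] = x[c] / denom  # responses normalized by their local region
    return y
```

As the text notes, the output dimensionality equals the input dimensionality.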
- the fully-connected layer takes the feature vector from the previous layer as input, computes the inner product between the feature x and the weights w, and then applies a non-linear transformation to the product, which may be formulated as

  y = ReLU ( w x )    (3)
- x denotes neural inputs (features) .
- y denotes neural outputs (features) in the current fully-connected layer.
- w denotes the neural weights in the current fully-connected layer. Neurons in the fully-connected layer linearly combine the features from the previous feature-extraction module, followed by the ReLU non-linearity.
- the fully-connected layer is configured to extract global features (features extracted from the entire input feature maps) from the previous layer.
- the fully-connected layer also performs feature dimension reduction by restricting the number of neurons in it.
- at least two fully-connected layers are provided so as to increase the nonlinearity of the neural network, which in turn makes fitting the data easier.
- the convolutional layer and the max pooling layer only provide local transformations, which means that they only operate on a local window of the input (local region of the input image) .
- the fully-connected layer provides a global transformation, which takes features from the whole space of the input image and conducts the transformation discussed in equation (3) above.
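The inner-product-plus-ReLU operation of the fully-connected layer reduces to one line; the bias term is omitted here because the text describes only weights:

```python
import numpy as np

def fc_layer(x, w):
    """Fully-connected layer: inner product of the feature vector x with
    the weight matrix w, followed by the ReLU non-linear transformation.
    Choosing fewer rows in w than entries in x performs the dimension
    reduction mentioned above."""
    return np.maximum(w @ x, 0.0)
```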
- the two branches then fuse together to one fully-connected layer.
- Conv (N, K, S) for convolutional layers with N outputs, kernel size K and stride size S
- Pool (T, K, S) for pooling layers with type T, kernel size K and stride size S
- Norm (K) for local response normalization layers with local size K
- FC (N) for fully-connected layers with N outputs
- the output fully-connected layers of the two branches are concatenated to form FC (8192), followed by FC (8192) -FC (94) -Sig, producing a plurality of (for example, 94) attribute probability predictions.
- the output of the final fully-connected layer may be 94 attributes, for example, {street, temple, ...} belonging to “where”, {star, protester, ...} belonging to “who”, and {walk, board, ...} belonging to “why”.
- the 94 attributes output from the final fully-connected layer may be of three types: “where” (e.g. street, temple, and classroom); “who” (e.g. star, protester, and skater); and “why” (e.g. walk, board, and ceremony).
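The fusion of the two branches into per-attribute probabilities can be sketched as follows. The dimensions below are deliberately tiny stand-ins for the actual FC (8192) and FC (94) sizes, and the weight shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_and_predict(f_app, f_motion, w_fuse, w_out):
    """Sketch of the fusion head: the two branches' fully-connected
    outputs are concatenated, passed through one more fully-connected
    layer with ReLU, and a sigmoid gives one independent probability per
    attribute (multi-label, unlike a single softmax category)."""
    f = np.concatenate([f_app, f_motion])  # concatenated branch features
    h = np.maximum(w_fuse @ f, 0.0)        # fused fully-connected layer
    return sigmoid(w_out @ h)              # per-attribute probabilities
```

A sigmoid per output is what allows co-occurring attributes such as “street”, “pedestrian” and “walk” to all be predicted with high probability at once.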
- the system 1000 may further comprise a training device 300.
- the training device 300 is used to train the convolutional neural network by using the following two inputs to obtain a fine-tuned convolutional neural network which produces predictions of crowd attributes:
- a pre-training set contains images with different objects and the corresponding ground truth object labels.
- the label set encompasses m object classes.
- a fine-tuning set contains crowd videos with appearance as well as motion channels, and the corresponding ground truth attribute labels.
- the label set encompasses n attribute classes.
- Fig. 6 is a schematic diagram illustrating a flow chart for constructing a network with the appearance and motion branches according to one embodiment of the present application.
- two convolutional neural networks are provided with the same structure but different numbers of branches: the first is used for pre-training with only one branch, and the second is used for fine-tuning with two branches.
- the first convolutional neural network, with one branch of convolutional neural layers, may be constructed by conventional means.
- the second convolutional neural network, with two branches of convolutional neural layers, is constructed based on the first convolutional neural network.
- at step S601, the training device 300 operates to pre-train the first convolutional neural network on an ImageNet detection task, which can be done by conventional means or algorithms.
- the network parameters of the appearance branch are initialized using the pre-trained model stated in step S601.
- the parameters may be randomly initialized.
- the input of the motion branch in the first convolutional neural network is replaced by the proposed motion distributions, i.e. , collectiveness distributions, stability distributions and conflict distributions.
- the network parameters of the motion branch of the first convolutional neural network with the proposed motion channels are randomly initialized without pre-training.
- the second convolution neural network with two branches (i.e. the appearance channel and the motion channels) is constructed.
- the second network is constructed by combining the first convolutional neural network initialized with the appearance parameters at step S602 and the first convolutional neural network initialized with the motion parameters at step S604, as shown in Fig. 6.
- Fig. 7 is a schematic diagram illustrating a flow chart for the training device 300 to fine-tune the second network using the appearance and motion channels of videos in the fine-tuning set.
- at step S701, parameters including the convolution filters, deformational layer weights, fully connected weights, and biases are initialized randomly by the training device 300.
- the training tries to minimize the loss function and can be divided into many updating steps. Therefore, at step S702, the loss is calculated; then, at step S703, the algorithm calculates the gradients with respect to all the neural network parameters, including the convolution filters, deformational layer weights, fully connected weights, and biases, based on the calculated loss.
- the gradient of any network parameters can be calculated with the chain rule.
- the output of a layer L_k in the network can be expressed by a general function

  y_k = f_k ( y_k-1 , w_k )

  where y_k is the output of the layer L_k, y_k-1 is the output of the previous layer L_k-1, w_k denotes the weights of L_k, and f_k is the function computed by L_k.
- the derivatives of y_k with respect to y_k-1 and w_k are both known.
- the loss function C of the network is defined on the output y_n of the last layer L_n and the ground-truth label t, i.e.

  c = C ( y_n , t )

- the derivative of c with respect to y_n is also known.
- the chain rule can then be applied:

  ∂c / ∂w_k = ( ∂c / ∂y_n ) · ( ∂y_n / ∂y_n-1 ) · … · ( ∂y_k+1 / ∂y_k ) · ( ∂y_k / ∂w_k )

- thus, the gradient of the cost c with respect to any weights in the network can be calculated.
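The chain-rule gradient computation described above can be illustrated on a tiny two-layer network. This is only a sketch: a squared-error loss and a two-layer ReLU network stand in for the actual architecture.

```python
import numpy as np

def forward(x, w1, w2):
    h = np.maximum(w1 @ x, 0.0)  # layer 1: y_1 = f_1(x, w_1), ReLU
    y = w2 @ h                   # layer 2: y_2 = f_2(y_1, w_2), linear
    return h, y

def gradients(x, t, w1, w2):
    """Chain-rule (backpropagation) gradients of the loss
    c = 0.5 * ||y_2 - t||^2 with respect to both layers' weights."""
    h, y = forward(x, w1, w2)
    dy = y - t                    # dc/dy_2, known from the loss
    dw2 = np.outer(dy, h)         # dc/dw_2
    dh = (w2.T @ dy) * (h > 0)    # dc/dy_1, through the ReLU
    dw1 = np.outer(dh, x)         # dc/dw_1, completing the chain
    return dw1, dw2
```

A finite-difference check confirms the analytic gradients, which is the standard way to validate a backpropagation implementation.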
- at step S704, the algorithm updates the convolution filters, deformational layer weights, fully connected weights, and biases by the rule

  w_k ← w_k − η · ∂c / ∂w_k
- η is the learning rate, a predefined value.
- updates of the parameters are performed using the product of the prefixed learning rate and the corresponding gradients.
- at step S705, it is determined whether the stopping criterion is satisfied: for example, if the variation of the loss is less than a predetermined value, the process terminates; otherwise, the process returns to step S702.
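The loop of steps S702 to S705 (loss, gradients, update, stopping check) can be sketched generically; the quadratic toy loss in the usage line is only a stand-in for the network's loss function:

```python
import numpy as np

def train(w, grad_fn, loss_fn, eta=0.1, tol=1e-6, max_steps=10000):
    """Training loop sketch: repeatedly compute the loss and gradients,
    update the parameters with w <- w - eta * dc/dw, and stop when the
    variation of the loss is less than the predetermined value tol."""
    prev = loss_fn(w)
    for _ in range(max_steps):
        w = w - eta * grad_fn(w)   # update: learning rate times gradient
        cur = loss_fn(w)
        if abs(prev - cur) < tol:  # stopping criterion on loss variation
            break
        prev = cur
    return w

# Usage sketch: minimizing a simple quadratic as a stand-in for the loss.
w_star = train(np.array([5.0]),
               grad_fn=lambda w: 2 * w,
               loss_fn=lambda w: float(np.sum(w ** 2)))
```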
- the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects, which may all generally be referred to herein as a “unit”, “circuit”, “module” or “system”.
- the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects.
- the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
- Fig. 8 illustrates a system 3000 for predicting crowd attributes according to one embodiment of the present application, in which the functions of the present invention are carried out by the software.
- the system 3000 comprises a memory 3001 that stores executable components and a processor 3002, electrically coupled to the memory 3001 to execute the executable components to perform operations of the system 3000.
- the executable components may comprise: a feature extracting component 3003 obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction component 3004 predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
- the functions of the components 3003 and 3004 are similar to those of the devices 100 and 200, respectively, and thus detailed descriptions thereof are omitted herein.
Abstract
Disclosed is a system for predicting crowd attributes, comprising: a feature extracting device obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction device in electronic communication with the feature extracting device and predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
Description
The disclosures relate to a system for predicting crowd attributes and a method thereof.
During the last decade, the field of crowd analysis has seen a remarkable evolution in crowded scene understanding, including crowd behavior analysis, crowd tracking, and crowd segmentation. Much of this progress was sparked by the creation of crowd datasets as well as by new and robust features and models for profiling intrinsic crowd properties. Most of the above studies on crowd understanding are scene-specific; that is, the crowd model is learned from a specific scene and thus generalizes poorly to other scenes. Attributes are particularly effective at characterizing generic properties across scenes.
In recent years, studies in attribute-based representations of objects, faces, actions, and scenes have drawn much attention as an alternative or complement to categorical representations, as they characterize the target subject by several attributes rather than by discriminative assignment to a single specific category, which is too restrictive to describe the nature of the target subject. Furthermore, scientific studies have shown that different crowd systems share similar principles that can be characterized by some common properties or attributes. Indeed, attributes can express more information about a crowd video, as they can describe a video by answering “Who is in the crowd?”, “Where is the crowd?”, and “Why is the crowd here?”, rather than merely assigning a categorical scene label or event label to it. For instance, an attribute-based representation might describe a crowd video as the “conductor” and “choir” performing on the “stage” with the “audience” “applauding”, in contrast to a categorical label like “chorus”. Recently, some works have made efforts on crowd attribute profiling, but the number of attributes in these works is limited (only four or fewer), and the datasets are also small in terms of scene diversity.
Summary
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. This summary is not an extensive overview of the disclosure. It is intended to neither identify key or critical elements of the disclosure nor delineate any scope of particular embodiments of the disclosure, or any scope of the claims. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect, disclosed is a system for predicting crowd attributes, comprising: a feature extracting device obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction device in electronic communication with the feature extracting device and predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
In yet another aspect, disclosed is a method for understanding crowd scenes, comprising: obtaining a video with crowd scenes; extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
In yet another aspect, disclosed is a system for predicting crowd attributes, comprising:
a memory that stores executable components; and
a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein, the executable components comprise:
a feature extracting component obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and
a prediction component predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
In one embodiment, the prediction device/component is configured with a convolutional neural network having:
a first branch configured to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features; and
a second branch configured to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features,
wherein the predicted features from the first branch and the predicted features from the second branch are fused together to form a prediction of the attributes of the crowd in the video.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating a system for predicting crowd attributes according to an embodiment of the present application.
Fig. 2 is a schematic diagram illustrating a flow chart for the system according to one embodiment of the present application.
Fig. 3 illustrates a schematic block diagram of the feature extracting device according to an embodiment of the present application.
Fig. 4 is a schematic diagram illustrating motion channels in scenarios consistent with some disclosed embodiments.
Fig. 5 is a schematic diagram illustrating a convolutional neural network structure included in the prediction device according to some disclosed embodiments.
Fig. 6 is a schematic diagram illustrating a flow chart for constructing a network with the appearance and motion branches according to one embodiment of the present application.
Fig. 7 is a schematic diagram illustrating a flow chart for the training device to fine-tune the second network using the appearance and motion channels of videos in the fine-tuning set.
Fig. 8 illustrates a system for predicting crowd attributes according to one embodiment of the present application, in which the functions of the present invention are carried out by the software.
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 illustrates a system 1000 for predicting crowd attributes. The proposed system 1000 is capable of understanding crowded scenes in computer vision at the attribute level, characterizing a crowded scene by predicting a plurality of attributes rather than discriminatively assigning it to a single specific category. This is significant in many applications, e.g., video surveillance and video search engines.
As shown in Fig. 1, the system 1000 comprises a feature extracting device 100 and a prediction device 200. Fig. 2 is a schematic diagram illustrating a flow chart for the system 1000 according to one embodiment of the present application. At step S201, the feature extracting device 100 obtains a video with crowd scenes and extracts appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and then at step S202, the prediction device 200 predicts attributes of the crowd in the video based on the extracted motion features and the extracted appearance features, which will be further discussed later.
In one example of the present application, the feature extracting device 100 may deeply learn appearance and motion representations across different crowded scenes. Fig. 3 illustrates a schematic block diagram of the feature extracting device 100 according to an embodiment of the present application. The feature extracting device 100 comprises an appearance feature extracting unit 101 configured to extract the RGB components of each frame from the input video.
The feature extracting device 100 further comprises a motion feature extracting unit 102 to extract motion features from the obtained video. To be specific, the motion feature extracting unit 102 further comprises a tracklet detection module 1021 to detect crowd tracklets (i.e., short trajectories) for each frame in the obtained video with crowd scenes. For example, the tracklet detection module 1021 may utilize the well-known KLT feature point tracker to detect several key points for each frame in the obtained video. To be specific, the detected key points are tracked with the matching algorithm predefined by the KLT, and the corresponding key points across consecutive frames are matched to extract the tracklets. In the non-limiting embodiments of the present application, a plurality of key points are detected on each person in the crowd in each frame. In a preferred embodiment, each of the motion features is computed on a certain number of (for example, 75) frames of the obtained video.
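As an illustration of the tracklet extraction step, the following sketch links key points across consecutive frames into short trajectories. It is a deliberately simplified, NumPy-only stand-in for the KLT matching described above (greedy nearest-neighbor linking with a hypothetical displacement threshold), not the KLT algorithm itself:

```python
import numpy as np

def link_tracklets(frames_points, max_disp=5.0):
    """Greedily link key points across consecutive frames into tracklets.

    A simplified stand-in for KLT matching: each active point in frame t is
    linked to its nearest point in frame t+1 if it moved less than max_disp
    pixels. frames_points: list of (N_t, 2) arrays of key-point coordinates.
    Returns a list of tracklets, each an array of (x, y) positions over time.
    """
    tracklets = [[p] for p in frames_points[0]]
    active = list(range(len(tracklets)))
    for pts in frames_points[1:]:
        used, still_active = set(), []
        for ti in active:
            last = tracklets[ti][-1]
            d = np.linalg.norm(pts - last, axis=1)
            j = int(np.argmin(d))
            if d[j] < max_disp and j not in used:
                tracklets[ti].append(pts[j])
                used.add(j)
                still_active.append(ti)
        active = still_active
    return [np.array(t) for t in tracklets]

# Two key points drifting right-down by 1 px per frame over 3 frames
frames = [np.array([[0.0, 0.0], [10.0, 10.0]]) + i for i in range(3)]
tracks = link_tracklets(frames)
```

In a real deployment the per-frame key points and their matches would come from a KLT implementation; the greedy linker above only demonstrates how matched points accumulate into short trajectories.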
The motion feature extracting unit 102 further comprises a motion distribution determination module 1022 to compute physical relationships between each tracklet and its neighbors to determine the motion distributions in each frame. The scene-independent properties for groups in a crowd exist in the whole scene space and can be quantified at the scene level.
According to one embodiment, three properties, namely collectiveness, stability and conflict, are computed for the frames. The collectiveness indicates the degree to which individuals in the whole scene act as a union in collective motion; the stability characterizes whether the whole scene can keep its topological structures; and the conflict measures the interaction/friction between each pair of nearest neighbors of interest points.
Examples shown in Fig. 4 illustrate each property intuitively. Referring to Fig. 4, for each channel, two examples are shown in the first and second rows.
Individuals in a crowd moving randomly indicate low collectiveness, while coherent crowd motion reveals high collectiveness. In Fig. 4-a, people in the crowded scene walk randomly toward different destinations, and thus exhibit low collectiveness. In Fig. 4-b, a marathon video has people running coherently towards the same destination, exhibiting high collectiveness.
Individuals have low stability if their topological structure changes a lot, and high stability if it changes little. In Fig. 4-c, skate dancers change their formation considerably from the first frame to the fiftieth frame, which means low stability; while in Fig. 4-d, the dancers in the bottom example keep their topological formation unchanged and exhibit high stability.
Conflict occurs when individuals move towards different directions. In Fig. 4-e, there is one group of horse-riders parading without any friction with other groups; while in Fig. 4-f, several groups of people cross paths while walking, generating conflict with each other.
The present application is not restricted to the proposed three properties; other properties can be generated if required.
In one example of the present application, the motion distribution determination module 1022 operates to define a K-NN graph G (V, E) for the whole point set of the tracklets detected by the tracklet detection module 1021, whose vertices V represent the tracklet points, and tracklet point pairs are connected by edges E. We denote the set of nearest neighbors of a tracklet point z∈V as N (z) at every frame of a given video clip.
The motion distribution module 1022 then extracts three motion maps namely collectiveness distribution, stability distribution, and conflict distribution for each frame.
The collectiveness distribution (or map) can be computed by integrating path similarities among crowds on the collective manifold. B. Zhou, X. Tang, H. Zhang, and X. Wang have proposed the Collective Merging algorithm to detect collective motions from random motions by modeling collective motions on the manifold in "Measuring crowd collectiveness" (TPAMI, 36 (8): 1586-1599, 2014).
The stability distribution is extracted by counting and averaging the number of invariant neighbors of each point in the K-NN graph.
For each member i, its K-NN set is N (i) in the first frame and N_τ (i) in the τ-th frame. It has high stability if its neighbor sets vary little across frames. Thus, the larger the variation between N (i) and N_τ (i) is, the lower stability the member has.
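A minimal sketch of such a per-point stability score, assuming stability is taken as the fraction of K nearest neighbors that remain unchanged between two frames (the embodiment's exact formulation may differ):

```python
import numpy as np

def knn_sets(points, k):
    """Index sets of the k nearest neighbors of each point (excluding itself)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return [set(np.argsort(row)[:k]) for row in d]

def stability(points_t0, points_t1, k=3):
    """Per-point stability: fraction of k-NN neighbors that are invariant
    between the first and the tau-th frame (1.0 = fully stable)."""
    n0, n1 = knn_sets(points_t0, k), knn_sets(points_t1, k)
    return np.array([len(a & b) / k for a, b in zip(n0, n1)])

# A rigid translation keeps every neighborhood intact -> stability 1 everywhere
pts = np.random.RandomState(0).rand(10, 2)
s = stability(pts, pts + 0.5, k=3)
```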
The conflict distribution is extracted by computing the velocity correlation between each pair of nearby tracklet points {z, z*} within the K-NN graph.
For each member i, if the velocity of each member in its K-NN set is similar to its own, it will have low conflict; that is, its neighbors move coherently with it without generating conflict.
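A minimal sketch of a conflict score along these lines, assuming conflict is measured as one minus the cosine similarity of neighboring velocities, rescaled to [0, 1] (the embodiment's exact velocity-correlation measure may differ):

```python
import numpy as np

def conflict(points, velocities, k=3):
    """Per-point conflict: average (1 - cosine similarity)/2 between a point's
    velocity and the velocities of its k nearest neighbors.
    0 = neighbors move coherently, 1 = neighbors move in opposite directions."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    v = velocities / np.linalg.norm(velocities, axis=1, keepdims=True)
    out = []
    for i, row in enumerate(d):
        nbrs = np.argsort(row)[:k]
        cos = v[nbrs] @ v[i]
        out.append(float(np.mean((1.0 - cos) / 2.0)))
    return np.array(out)

# Everyone moving the same way -> zero conflict everywhere
pts = np.random.RandomState(1).rand(8, 2)
vel = np.tile([1.0, 0.0], (8, 1))
c = conflict(pts, vel, k=3)
```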
Returning to Fig. 3, the motion feature extracting unit 102 further comprises a continuous motion channel generation module 1023 to average the per-frame motion maps (for example, the collectiveness maps, the stability maps and the conflict maps) across the temporal domain, and to interpolate the sparse tracklet points to output three complete and continuous motion channels. Although a single frame owns tens or hundreds of tracklets, the total tracklet points are still sparse. A Gaussian kernel can be utilized to interpolate the averaged motion maps to get continuous motion channels.
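The Gaussian-kernel interpolation can be sketched as a normalized weighted average of the sparse tracklet-point values over a dense pixel grid, with a hypothetical kernel width sigma:

```python
import numpy as np

def densify(points, values, shape, sigma=4.0):
    """Interpolate sparse per-tracklet-point values into a dense motion
    channel using a Gaussian kernel (normalized weighted average)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    num = np.zeros(shape)
    den = np.zeros(shape)
    for (px, py), val in zip(points, values):
        # Gaussian weight of every pixel with respect to this tracklet point
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        num += g * val
        den += g
    return num / np.maximum(den, 1e-12)

# Two sparse points with values 0 and 1 spread into a continuous 10x10 channel
channel = densify([(2, 2), (7, 7)], [0.0, 1.0], (10, 10))
```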
Returning to Fig. 1, the system 1000 further comprises a prediction device 200. The prediction device 200 is electronically communicated with the feature extracting device 100 and is configured to obtain the appearance features of the video, receive the extracted motion features from the feature extracting device 100, and predict attributes of the crowd in the video based on the received motion features and/or the obtained appearance features. With this function, it can effectively detect the attributes, including the roles of people, their activities and the locations, from crowd videos, so as to describe the content of the crowd videos. Therefore, crowd videos with the same set of attributes can be retrieved, and the similarity of different crowd videos can be measured by their attribute sets. Furthermore, there are a large number of possible interactions among these attributes: some attributes are likely to be detected simultaneously, while some are mutually exclusive. For example, the scenario attribute "street" is likely to co-occur with the subject "pedestrian" when the subject is "walking", and also likely to co-occur with the subject "mob" when the subject is "fighting", but is not related to the subject "swimmer" because the subject cannot "swim" on a "street".
From a model perspective, the prediction device 200 may be configured as a model with the convolutional neural network structure shown in Fig. 5. For purposes of illustration, Fig. 5 shows two branches included in the convolutional neural network structure. However, the number of branches is not limited to the proposed two, and it can be generalized to more branches. The number of each type of layers and the number of parameters can also be tuned according to different tasks and objectives.
As shown in Fig. 5, the network comprises: one or more data layers 501, one or more convolution layers 502, one or more max/sum pooling layers 503, one or more normalization layers 504 and a fully-connected layer 505.
In this exemplified embodiment as shown in Fig. 5, the data layer of the top appearance branch contains the RGB components (or channels) of the images and their labels (for example, of dimension 94), and the data layer of the bottom motion branch contains at least one motion feature (for example, the proposed three motion channels as discussed above: the collectiveness, the stability and the conflict) and their labels, which are the same as the labels of the top branch.
Specifically, this layer 501 provides images {xi} and their labels {yi}, where xij is the j-th bit value of the d-dimension feature vector of the i-th input image region, and yij is the j-th bit value of the n-dimension label vector of the i-th input image region.
The convolution layer 502 receives the output from the data layer 501 and performs convolution, padding, and non-linear transformation operations.
The operations in each convolutional layer may be expressed as

xi′ = padding (xi) 1)

yj′ = bj + Σi kij * xi′ 2)

yj = max (0, yj′) 3)

where,
xi and yj are the i-th input feature map and the j-th output feature map, respectively;
kij is the convolution kernel between the i-th input feature map and the j-th output feature map;
* denotes convolution;
bj is the bias of the j-th output feature map; and
ReLU nonlinearity y=max (0, x) is used for neurons.
The convolution operation can extract features from the input image, such as edges, curves, dots, etc. These features are not predefined manually but are learned from the training data.
When the convolution kernel kij operates on the marginal pixels of xi, it will exceed the border of xi. In this case, it sets the values that exceed the border of xi to be 0 so as to make the operation valid. This operation is also called “padding” .
The order of the above operations is: padding -> convolution -> non-linear transformation (ReLU). The input to "padding" is xi in equation 1). Each step uses the output of the previous step. The non-linear transformation produces yj in equation 3).
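The padding -> convolution -> ReLU pipeline can be sketched in NumPy as follows; this is a naive loop implementation for clarity, with hypothetical shapes, not an optimized convolution:

```python
import numpy as np

def conv_layer(x_maps, kernels, biases, pad=1):
    """Padding -> convolution -> ReLU for one convolutional layer.
    x_maps: (C_in, H, W) input feature maps.
    kernels: (C_out, C_in, K, K) convolution kernels.
    biases: (C_out,) one bias per output feature map.
    """
    c_in, h, w = x_maps.shape
    c_out, _, kk, _ = kernels.shape
    # padding: values beyond the border of x_i are set to 0
    xp = np.pad(x_maps, ((0, 0), (pad, pad), (pad, pad)))
    oh, ow = h + 2 * pad - kk + 1, w + 2 * pad - kk + 1
    y = np.zeros((c_out, oh, ow))
    for j in range(c_out):
        for r in range(oh):
            for c in range(ow):
                patch = xp[:, r:r + kk, c:c + kk]
                y[j, r, c] = biases[j] + np.sum(kernels[j] * patch)
    return np.maximum(y, 0.0)  # ReLU non-linearity y = max(0, x)

# One 4x4 input map of ones, two 3x3 all-ones kernels, zero bias
x = np.ones((1, 4, 4))
k = np.ones((2, 1, 3, 3))
b = np.zeros(2)
y = conv_layer(x, k, b, pad=1)
```

Interior outputs sum a full 3x3 window of ones (9), while corner outputs only overlap four input pixels (4), showing the effect of zero padding at the border.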
Max pooling layer 503:
This layer keeps the maximum value in a local window and discards the other values; the dimension of the output is thus smaller than that of the input. The operation may be formulated as

yi (m, n) = max { xi (m·s + u, n·s + v) : 0 ≤ u < M, 0 ≤ v < N } 4)
where each neuron in the i-th output feature map yi pools over an M×N local region in the i-th input feature map xi , with s as the step size.
In other words, it reduces the feature dimensions and provides spatial invariance. The spatial invariance means that if the input shifts by several pixels, the output of the layer won’t change much.
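A minimal NumPy sketch of the max pooling operation with an M×N window and step size s, as described above:

```python
import numpy as np

def max_pool(x, m=2, n=2, s=2):
    """Max pooling: each output neuron keeps the maximum over an MxN local
    window of the input feature map, moved with step size s."""
    h, w = x.shape
    oh, ow = (h - m) // s + 1, (w - n) // s + 1
    y = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            y[i, j] = x[i * s:i * s + m, j * s:j * s + n].max()
    return y

# A 4x4 map pooled with a 2x2 window and stride 2 becomes 2x2
x = np.arange(16.0).reshape(4, 4)
y = max_pool(x, 2, 2, 2)
```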
Normalization layer 504:
This layer normalizes the responses in local regions of input feature maps. The output dimensionality of this layer is equal to the input dimensionality.
Fully-connected layer 505:
This layer takes the feature vector from the previous layer as input and computes the inner product between the feature x and the weights w; one non-linear transformation is then applied to the product, which may be formulated as

y = max (0, w^T x) 5)
Where,
x denotes neural inputs (features) .
y denotes neural outputs (features) in the current fully-connected layer.
w denotes neural weights in current fully-connected layer. Neurons in the fully-connected layer linearly combine features in previous feature extraction module, followed by ReLU non-linearity.
The fully-connected layer is configured to extract global features (features extracted from the entire input feature maps) from the previous layer. The fully-connected layer also has the function of feature dimension reduction by restricting the number of neurons in it. In one embodiment of the present application, there are provided at least two fully-connected layers so as to increase the nonlinearity of the neural network, which in turn makes fitting the data easier.
The convolutional layer and the max pooling layer only provide local transformations, which means that they only operate on a local window of the input (a local region of the input image). However, the fully-connected layer provides a global transformation, which takes features from the whole space of the inputted image and conducts a transformation as discussed in equation 5) above.
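A one-line sketch of the fully-connected layer as an inner product followed by ReLU; note how restricting the rows of w (here 2 rows for a 3-dimensional input) also reduces the feature dimension, as discussed above:

```python
import numpy as np

def fc_layer(x, w):
    """Fully-connected layer: inner product of the feature vector with the
    weights, followed by ReLU non-linearity (y = max(0, w @ x))."""
    return np.maximum(w @ x, 0.0)

# 3-dimensional input reduced to 2 outputs; negative activation clipped by ReLU
x = np.array([1.0, -2.0, 3.0])
w = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y = fc_layer(x, w)
```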
In the end, the two branches fuse together into one fully-connected layer. If simple notations are used to represent parameters in the networks: (1) Conv (N, K, S) for convolutional layers with N outputs, kernel size K and stride size S; (2) Pool (T, K, S) for pooling layers with type T, kernel size K and stride size S; (3) Norm (K) for local response normalization layers with local size K; (4) FC (N) for fully-connected layers with N outputs; and (5) ReLU for the rectified linear unit and Sig for the sigmoid activation functions, then, taking Conv (96, 7, 2) (N=96, K=7, S=2) as an example of the notation, each of the two branches has parameters: Conv (96, 7, 2) -ReLU-Pool (3, 2) -Norm (5) -Conv (256, 5, 2) -ReLU-Pool (3, 2) -Norm (5) -Conv (384, 3, 1) -ReLU-Conv (384, 3, 1) -ReLU-Conv (256, 3, 1) -ReLU-Pool (3, 2) -FC (4096) .
The output fully-connected layers of the two branches are concatenated to form FC (8192) . Finally, we have FC (8192) -FC (94) -Sig, producing a plurality of (for example, 94) attribute probability predictions. In one embodiment of the present application, the output of the FC 505 may be 94 attributes, for example, {street, temple, ... } belonging to "where", {star, protester, ... } belonging to "who", and {walk, board, ... } belonging to "why". Accordingly, the 94 attributes outputted from the FC 505 may be of three types: "where" (e.g. street, temple, and classroom) ; "who" (e.g. star, protester, and skater) ; and "why" (e.g. walk, board, and ceremony) .
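The spatial size of the feature maps along one branch can be traced with the standard output-size formula. The input resolution and padding are not stated in this document, so a 224×224 input with no padding is assumed purely for illustration:

```python
def out_size(n, k, s, p=0):
    """Spatial size after a conv/pool layer with kernel k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1

# Trace one branch: Conv(96,7,2)-Pool(3,2)-Conv(256,5,2)-Pool(3,2)-
# Conv(384,3,1)-Conv(384,3,1)-Conv(256,3,1)-Pool(3,2), zero padding assumed
n = 224  # hypothetical input resolution
for k, s in [(7, 2), (3, 2), (5, 2), (3, 2), (3, 1), (3, 1), (3, 1), (3, 2)]:
    n = out_size(n, k, s)
```

Under these assumptions the branch ends with small spatial maps (2×2 here) of 256 channels, which are then flattened into the FC (4096) layer.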
Returning to Fig. 1, the system 1000 may further comprise a training device 300. The training device 300 is used to train the convolutional neural network by using the following two inputs to obtain a fine-tuned convolutional neural network which produces predictions of crowd attributes:
i. A pre-training set contains images with different objects and the corresponding ground truth object labels. The label set encompasses m object classes.
ii. A fine-tuning set contains crowd videos with appearance as well as motion channels, and the corresponding ground truth attribute labels. The label set encompasses n attribute classes.
Fig. 6 is a schematic diagram illustrating a flow chart for constructing a network with the appearance and motion branches according to one embodiment of the present application.
In this embodiment, two convolutional neural networks are provided with the same structure but different numbers of branches: the first one is used for pre-training with only one branch, and the second one is used for fine-tuning with two branches. The first convolutional neural network with one branch of convolutional neural layers may be constructed by conventional means. The second convolutional neural network with two branches of convolutional neural layers is constructed based on the first convolutional neural network.
As shown, at step S601, the device 300 operates to pre-train the first convolutional neural network with the ImageNet detection task, which can be done by conventional means or algorithms.
At step S602, the network parameters of the appearance branch are initialized using the pre-trained model stated in step S601.
At step S603, the input of the motion branch in the first convolutional neural network is replaced by the proposed motion distributions, i.e. , collectiveness distributions, stability distributions and conflict distributions.
At step S604, the network parameters of the motion branch of the first convolutional neural network with the proposed motion channels are randomly initialized without pre-training.
At step S605, the second convolutional neural network with two branches (i.e., the appearance channel and the motion channels) is constructed. In particular, the second network is constructed by combining the first convolutional neural network initialized with the appearance parameters at step S602 and the first convolutional neural network initialized with the motion parameters at step S604, as shown in Fig. 6.
Fig. 7 is a schematic diagram illustrating a flow chart for the training device 300 to fine-tune the second network using the appearance and motion channels of videos in the fine-tuning set.
At step S701, parameters, including the convolution filters, deformational layer weights, fully connected weights, and bias are initialized randomly by the training device 300. The training tries to minimize the loss function and can be divided into many updating steps. Therefore, at step S702, the loss is calculated, and then at step S703, the algorithm calculates the gradient with respect to all the neural network parameters based on the calculated loss, including the convolution filters, deformational layer weights, fully connected weights, and bias.
The gradient of any network parameters can be calculated with the chain rule. Suppose the network has n layers and they are denoted by Li, i=1, 2, ... , n. The output of a layer Lk in the network can be expressed by a general function
yk=fk (yk-1, wk) 6)
where yk is the output of the layer Lk, yk-1 is the output of the previous layer Lk-1, wk denotes the weights of Lk, and fk is the function for Lk. The derivatives of yk with respect to yk-1 and wk are both known. The loss function C of the network is defined on the output of the last layer Ln and the ground truth label t,
c=C (yn, t) 7)
The derivative of c with respect to yn is also known. To calculate the gradient of c with respect to the weights wn, the chain rule can be applied:

∂c/∂wn = (∂c/∂yn) · (∂yn/∂wn) 8)

To calculate the gradient of c with respect to yk, the chain rule can also be applied:

∂c/∂yk = (∂c/∂yk+1) · (∂yk+1/∂yk) 9)

which is recursive. To calculate the gradient of c with respect to an arbitrary weight wk, we can use

∂c/∂wk = (∂c/∂yk) · (∂yk/∂wk) 10)
In this procedure, the gradient of the cost c with respect to any weights in the network can be calculated.
At step S704, the algorithm updates the convolution filters, deformational layer weights, fully connected weights, and bias by the rule of

wk ← wk − η · ∂c/∂wk 11)

where η is the learning rate, a predefined value.
Updates of the parameters are performed using the product of a prefixed learning rate and the corresponding gradients.
At step S705, it is determined whether the stopping criterion is satisfied. For example, if the variation of the loss is less than a predetermined value, the process terminates; otherwise, the process returns to step S702.
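The training loop of steps S701-S705 can be sketched on a toy one-parameter model. The structure (random initialization, loss, gradient, update with a prefixed learning rate, loss-variation stopping criterion) mirrors the flow chart above, while the model itself is hypothetical:

```python
import numpy as np

def train(x, t, lr=0.1, tol=1e-8, max_steps=10000):
    """Minimal gradient-descent loop mirroring steps S701-S705 on a toy
    scalar linear model y = w * x with squared loss."""
    rng = np.random.RandomState(0)
    w = rng.randn()                      # S701: random initialization
    prev_loss = np.inf
    for _ in range(max_steps):
        y = w * x
        loss = np.mean((y - t) ** 2)     # S702: compute the loss
        grad = np.mean(2 * (y - t) * x)  # S703: gradient via the chain rule
        w -= lr * grad                   # S704: update = lr * gradient
        if abs(prev_loss - loss) < tol:  # S705: loss-variation stopping test
            break
        prev_loss = loss
    return w, loss

x = np.array([1.0, 2.0, 3.0])
t = 2.0 * x                              # ground truth corresponds to w = 2
w, loss = train(x, t)
```

The loop recovers w ≈ 2 and terminates once the loss change falls below the predetermined tolerance, exactly the criterion described at step S705.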
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, with hardware aspects that may all generally be referred to herein as a "unit", "circuit," "module" or "system." Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with or in integrated circuits (ICs), such as a digital signal processor and software therefor, or application-specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments.
In addition, the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium. Fig. 8 illustrates a system 3000 for predicting crowd attributes according to one embodiment of the present application, in which the functions of the present invention are carried out by software. Referring to Fig. 8, the system 3000 comprises a memory 3001 that stores executable components and a processor 3002, electrically coupled to the memory 3001 to execute the executable components to perform operations of the system 3000. The executable components may comprise: a feature extracting component 3003 obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction component 3004 predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features. The functions of the components 3003 and 3004 are similar to those of the devices 100 and 200, respectively, and thus the detailed descriptions thereof are omitted herein.
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon learning of the basic inventive concept. The appended claims are intended to be construed as comprising the preferred examples and all the variations or modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and equivalent techniques, they also fall within the scope of the present invention.
Claims (20)
- A system for predicting crowd attributes, comprising:
a feature extracting device obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and
a prediction device being electronically communicated with the feature extracting device and predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
- The system according to claim 1, wherein the feature extracting device further comprises a motion feature extracting unit comprising:
a tracklet detection module (1021) detecting short trajectories of the crowd in each frame of the video;
a motion maps determination module (1022) computing physical relationships between each of the short trajectories and its neighbors to determine one or more motion distributions for the crowd in each frame of the video; and
a continuous motion channel generation module (1023) averaging the determined motion distributions across the temporal domain, and interpolating one or more sparse short trajectory points into the averaged distributions to form one or more continuous motion channels forming the motion features.
- The system according to claim 2, wherein the motion distribution comprises at least one of:
a collectiveness distribution that indicates a degree of individuals in a whole scene acting as a union in a collective motion;
a stability distribution that indicates whether the whole scene can keep a topological structure for the crowd in the whole scene; and
a conflict distribution that indicates an interaction/friction between each pair of nearest neighbors of the short trajectories for the crowd in the scene.
- The system according to claim 1, wherein the predicted attributes at least indicate a role of the people in the crowd, a place of the crowd and a reason why people are in the crowd.
- The system according to any one of claims 1-4, wherein the prediction device is configured with a convolutional neural network having:
a first branch configured to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features; and
a second branch configured to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features,
wherein the predicted crowd attributes from the first branch and the predicted crowd attributes from the second branch are fused together to output a prediction of the attributes of the crowd in the video.
- The system according to claim 5, further comprising a training device for training the second neural network by:
randomly initializing parameters for the second neural network;
calculating a loss of the parameters in the second neural network;
calculating a gradient with respect to all said parameters based on the calculated loss;
updating the parameters by using a product of a prefixed learning rate and the corresponding gradients;
determining if a stopping criterion is satisfied; and
if not, returning to the step of calculating the loss.
- The system according to claim 6, wherein the training device trains the first neural network by:
initializing parameters for the first neural network with pre-trained data sets;
calculating a loss of the parameters in the first neural network;
calculating a gradient with respect to all said parameters based on the calculated loss;
updating the parameters by using a product of a prefixed learning rate and the corresponding gradients;
determining if a stopping criterion is satisfied; and
if not, returning to the step of calculating the loss.
- The system according to claim 7, wherein the trained first neural network and the trained second neural network are connected together, and the training device further inputs a fine-tuning set into the connected networks to fine-tune the connected networks.
- A method for understanding a crowd scene, comprising:
obtaining a video with crowd scenes;
extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and
predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
- The method according to claim 9, wherein the extracting further comprises:
detecting short trajectories of the crowd in the video frames;
computing physical relationships between each of the short trajectories and its neighbors to determine one or more motion distributions for the crowd in each frame; and
averaging the determined motion distributions across the temporal domain, and interpolating one or more sparse short trajectory points into the averaged distributions to form one or more continuous motion channels forming the motion features.
- The method according to claim 10, wherein the motion distribution comprises at least one of:
a collectiveness distribution that indicates a degree of individuals in a whole scene acting as a union in a collective motion;
a stability distribution that indicates whether the whole scene can keep a topological structure for the crowd in the whole scene; and
a conflict distribution that indicates an interaction/friction between each pair of nearest neighbors of the short trajectories for the crowd in the whole scene.
- The method according to claim 9, wherein the predicted attributes at least indicate a role of the people in the crowd, a place of the crowd and a reason why people are in the crowd.
- The method according to any one of claims 9-12, wherein the predicting further comprises:
configuring a first branch to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features;
configuring a second branch to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features; and
connecting the predicted crowd attributes from the first branch and the predicted crowd attributes from the second branch to output a prediction of the attributes of the crowd in the video.
- The method according to claim 13, further comprising: randomly initializing parameters for the second neural network; calculating a loss of the parameters in the second neural network; calculating a gradient with respect to all said parameters based on the calculated loss; updating the parameters by using a product of a prefixed learning rate and the corresponding gradients; determining if a stopping criterion is satisfied; and if not, returning to the step of calculating the loss.
- The method according to claim 14, further comprising: initializing parameters for the first neural network with pre-trained data sets; calculating a loss of the parameters in the first neural network; calculating a gradient with respect to all said parameters based on the calculated loss; updating the parameters by using a product of a prefixed learning rate and the corresponding gradients; determining if a stopping criterion is satisfied; and if not, returning to the step of calculating the loss.
- The method according to claim 15, further comprising: connecting the trained first neural network and the trained second neural network together; and fine-tuning the connected networks by inputting a fine-tuning set into the connected networks.
- A system for predicting crowd attributes, comprising: a memory that stores executable components; and a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein the executable components comprise: a feature extracting component obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction component predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
- The system according to claim 17, wherein the feature extracting component is configured to: detect short trajectories of the crowd in each frame in the video; compute physical relationships between each of the trajectories and its neighbors to determine one or more motion distributions for each frame in the video; and average the determined motion distributions across the temporal domain, and interpolate one or more sparse tracklet points into the averaged distributions to form one or more continuous motion channels forming the motion features.
- The system according to claim 18, wherein the motion distribution comprises at least one of: a collectiveness distribution that indicates a degree to which individuals in a whole scene act as a union in a collective motion; a stability distribution that indicates whether the whole scene can keep a topological structure for the crowd in the whole scene; and a conflict distribution that indicates an interaction/friction between each pair of nearest-neighbor trajectories for the crowd in the whole scene.
- The system according to claim 19, wherein the prediction component is further configured to: configure a first branch to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features; configure a second branch to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features; and connect the predicted crowd attributes from the first branch and the predicted crowd attributes from the second branch to output a prediction of the attributes of the crowd in the video.
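Claims 10, 11, 18, and 19 describe deriving motion distributions from the physical relationships between a tracklet and its neighbors. As an illustrative sketch only (not the claimed implementation, which the claims leave unspecified), a per-frame conflict score can be approximated as one minus the cosine similarity between a tracklet's velocity and that of its nearest neighbor, so aligned motion scores near 0 and opposed motion near 2; the function names and toy geometry below are hypothetical:

```python
import math

def velocity_conflict(v1, v2):
    """Conflict between two neighboring tracklet velocities:
    1 minus cosine similarity (aligned -> 0.0, opposed -> 2.0)."""
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 0.0  # a stationary tracklet contributes no conflict
    return 1.0 - dot / (n1 * n2)

def conflict_map(points, velocities):
    """For each tracklet point in one frame, conflict with its
    nearest spatial neighbor (squared Euclidean distance)."""
    scores = []
    for i, p in enumerate(points):
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: (points[k][0] - p[0]) ** 2
                              + (points[k][1] - p[1]) ** 2)
        scores.append(velocity_conflict(velocities[i], velocities[j]))
    return scores

# Two tracklets moving together, one moving against them far away.
frame_scores = conflict_map([(0, 0), (1, 0), (10, 10)],
                            [(1, 0), (1, 0), (-1, 0)])
```

Per claims 10 and 18, such per-frame maps would then be averaged across the temporal domain and interpolated into continuous motion channels.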
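Claims 13 and 20 recite a two-branch architecture: one branch scores the motion features, the other scores the appearance features, and the two sets of predicted attributes are connected into a final prediction. The following toy sketch illustrates that wiring with single linear layers standing in for the two neural networks; the class name, layer sizes, and random initialization are our own illustrative choices, not the patent's:

```python
import random

def linear(x, W, b):
    """y = W @ x + b for plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

class TwoBranchFusion:
    """Toy stand-in for the claimed two-branch predictor: a motion
    branch and an appearance branch each emit attribute scores, which
    are concatenated and fused by a final linear layer."""
    def __init__(self, n_motion, n_appear, n_attr, seed=0):
        rng = random.Random(seed)
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]
        self.Wm, self.bm = mat(n_attr, n_motion), [0.0] * n_attr
        self.Wa, self.ba = mat(n_attr, n_appear), [0.0] * n_attr
        self.Wf, self.bf = mat(n_attr, 2 * n_attr), [0.0] * n_attr

    def predict(self, motion_feat, appear_feat):
        m = linear(motion_feat, self.Wm, self.bm)  # first-branch scores
        a = linear(appear_feat, self.Wa, self.ba)  # second-branch scores
        return linear(m + a, self.Wf, self.bf)     # fuse concatenation

model = TwoBranchFusion(n_motion=4, n_appear=6, n_attr=3)
scores = model.predict([1.0] * 4, [0.5] * 6)
```

The key design point the claims capture is late fusion: each branch is trained on its own modality before the concatenated outputs are combined.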
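Claims 14 and 15 recite a standard fixed-learning-rate gradient-descent loop: compute the loss, compute gradients, update parameters by the product of the learning rate and the gradients, and repeat until a stopping criterion holds. A minimal generic sketch, assuming an update-magnitude stopping criterion (the claims leave the criterion unspecified) and our own toy quadratic loss:

```python
def train(params, grad_fn, lr=0.1, tol=1e-6, max_steps=1000):
    """Fixed-learning-rate gradient descent following the claimed
    steps; stops when every parameter update falls below tol."""
    for step in range(max_steps):
        grads = grad_fn(params)
        # update = product of prefixed learning rate and gradient
        params = [p - lr * g for p, g in zip(params, grads)]
        if max(abs(lr * g) for g in grads) < tol:  # stopping criterion
            return params, step + 1
    return params, max_steps

# Toy loss L = sum((p - t)^2), whose gradient is 2 * (p - t).
target = [3.0, -1.0]
grad = lambda ps: [2 * (p - t) for p, t in zip(ps, target)]
final, steps = train([0.0, 0.0], grad)
```

Per the claims, the appearance branch would start from random initialization (claim 14) while the motion branch starts from pre-trained parameters (claim 15), before the connected networks are jointly fine-tuned (claim 16).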
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201580080179.9A CN107615272B (en) | 2015-05-18 | 2015-05-18 | System and method for predicting crowd attributes |
PCT/CN2015/079190 WO2016183770A1 (en) | 2015-05-18 | 2015-05-18 | A system and a method for predicting crowd attributes |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2015/079190 WO2016183770A1 (en) | 2015-05-18 | 2015-05-18 | A system and a method for predicting crowd attributes |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016183770A1 (en) | 2016-11-24 |
Family
ID=57319155
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/079190 WO2016183770A1 (en) | 2015-05-18 | 2015-05-18 | A system and a method for predicting crowd attributes |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107615272B (en) |
WO (1) | WO2016183770A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110415323B (en) * | 2019-07-30 | 2023-05-26 | 成都数字天空科技有限公司 | Fusion deformation coefficient obtaining method, fusion deformation coefficient obtaining device and storage medium |
CN111339364B (en) * | 2020-02-28 | 2023-09-29 | 网易(杭州)网络有限公司 | Video classification method, medium, device and computing equipment |
CN111429185B (en) * | 2020-03-27 | 2023-06-02 | 京东城市(北京)数字科技有限公司 | Crowd figure prediction method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040230452A1 (en) * | 2003-05-15 | 2004-11-18 | Yuichi Abe | Regional attribute determination method, regional attribute determination device, and regional attribute determination program |
CN101561928A (en) * | 2009-05-27 | 2009-10-21 | 湖南大学 | Multi-human body tracking method based on attribute relational graph appearance model |
CN103150375A (en) * | 2013-03-11 | 2013-06-12 | 浙江捷尚视觉科技有限公司 | Quick video retrieval system and quick video retrieval method for video detection |
CN104537685A (en) * | 2014-12-12 | 2015-04-22 | 浙江工商大学 | Method for conducting automatic passenger flow statistical analysis on basis of video images |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9202225B2 (en) * | 2010-05-28 | 2015-12-01 | Red Hat, Inc. | Aggregate monitoring of utilization data for vendor products in cloud networks |
CN102201065B (en) * | 2011-05-16 | 2012-11-21 | 天津大学 | Method for detecting monitored video abnormal event based on trace analysis |
CN102508923B (en) * | 2011-11-22 | 2014-06-11 | 北京大学 | Automatic video annotation method based on automatic classification and keyword marking |
CN105095908B (en) * | 2014-05-16 | 2018-12-14 | 华为技术有限公司 | Group behavior characteristic processing method and apparatus in video image |
CN104598890B (en) * | 2015-01-30 | 2017-07-28 | 南京邮电大学 | A kind of Human bodys' response method based on RGB D videos |
- 2015-05-18 WO PCT/CN2015/079190 patent/WO2016183770A1/en active Application Filing
- 2015-05-18 CN CN201580080179.9A patent/CN107615272B/en active Active
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109615140A (en) * | 2018-12-14 | 2019-04-12 | 中国科学技术大学 | A kind of method and device for predicting pedestrian movement |
CN109615140B (en) * | 2018-12-14 | 2024-01-09 | 中国科学技术大学 | Method and device for predicting pedestrian movement |
CN109977800A (en) * | 2019-03-08 | 2019-07-05 | 上海电力学院 | A kind of intensive scene crowd of combination multiple features divides group's detection method |
CN110210603A (en) * | 2019-06-10 | 2019-09-06 | 长沙理工大学 | Counter model construction method, method of counting and the device of crowd |
CN111933298A (en) * | 2020-08-14 | 2020-11-13 | 医渡云(北京)技术有限公司 | Crowd relation determination method, device, electronic equipment and medium |
CN111933298B (en) * | 2020-08-14 | 2024-02-13 | 医渡云(北京)技术有限公司 | Crowd relation determining method and device, electronic equipment and medium |
CN113792930A (en) * | 2021-04-26 | 2021-12-14 | 青岛大学 | Blind person walking track prediction method, electronic device and storage medium |
CN113792930B (en) * | 2021-04-26 | 2023-08-22 | 青岛大学 | Blind person walking track prediction method, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107615272B (en) | 2021-09-03 |
CN107615272A (en) | 2018-01-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2016183770A1 (en) | A system and a method for predicting crowd attributes | |
Zhang et al. | Attentional neural fields for crowd counting | |
Shen et al. | Multiobject tracking by submodular optimization | |
Xiong et al. | Spatiotemporal modeling for crowd counting in videos | |
CN108960184B (en) | Pedestrian re-identification method based on heterogeneous component deep neural network | |
Somasundaram et al. | Action recognition using global spatio-temporal features derived from sparse representations | |
CN106469299A (en) | A kind of vehicle search method and device | |
Karavasilis et al. | Visual tracking using the Earth Mover's Distance between Gaussian mixtures and Kalman filtering | |
Ma et al. | Counting people crossing a line using integer programming and local features | |
CN111178284A (en) | Pedestrian re-identification method and system based on spatio-temporal union model of map data | |
Banerjee et al. | Efficient pooling of image based CNN features for action recognition in videos | |
WO2020088763A1 (en) | Device and method for recognizing activity in videos | |
Barkoky et al. | Complex Network-based features extraction in RGB-D human action recognition | |
Xie et al. | Event-based stereo matching using semiglobal matching | |
Zhang et al. | Joint discriminative representation learning for end-to-end person search | |
Islam et al. | Representation for action recognition with motion vector termed as: SDQIO | |
Ji et al. | Semisupervised hyperspectral image classification using spatial-spectral information and landscape features | |
Bakour et al. | Soft-CSRNet: real-time dilated convolutional neural networks for crowd counting with drones | |
Behera et al. | Person re-identification: A taxonomic survey and the path ahead | |
Yadav et al. | DroneAttention: Sparse weighted temporal attention for drone-camera based activity recognition | |
Babu et al. | Subject independent human action recognition using spatio-depth information and meta-cognitive RBF network | |
Pehlivan et al. | Recognizing activities in multiple views with fusion of frame judgments | |
Zhu et al. | Correspondence-free dictionary learning for cross-view action recognition | |
WO2020192868A1 (en) | Event detection | |
Narayan et al. | Learning deep features for online person tracking using non-overlapping cameras: A survey |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15892149; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 15892149; Country of ref document: EP; Kind code of ref document: A1 |