WO2016183770A1 - A system and a method for predicting crowd attributes - Google Patents

A system and a method for predicting crowd attributes

Info

Publication number: WO2016183770A1
Application number: PCT/CN2015/079190
Authority: WO (WIPO/PCT)
Prior art keywords: crowd, motion, video, attributes, features
Other languages: French (fr)
Inventors: Xiaogang Wang, Chen Change Loy, Jing SHAO, Kai Kang
Original Assignee: Xiaogang Wang
Application filed by Xiaogang Wang
Priority to CN201580080179.9A (CN107615272B)
Priority to PCT/CN2015/079190 (WO2016183770A1)
Publication of WO2016183770A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]

Abstract

Disclosed is a system for predicting crowd attributes, comprising: a feature extracting device that obtains a video with crowd scenes and extracts appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction device in electronic communication with the feature extracting device that predicts attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.

Description

A SYSTEM AND A METHOD FOR PREDICTING CROWD ATTRIBUTES

Technical Field
The present disclosure relates to a system for predicting crowd attributes and a method thereof.
Background
During the last decade, the field of crowd analysis has undergone a remarkable evolution in crowded scene understanding, including crowd behavior analysis, crowd tracking, and crowd segmentation. Much of this progress was sparked by the creation of crowd datasets as well as new and robust features and models for profiling intrinsic crowd properties. Most of the above studies on crowd understanding are scene-specific, that is, the crowd model is learned from a specific scene and therefore generalizes poorly to other scenes. Attributes, by contrast, are particularly effective at characterizing generic properties across scenes.
In recent years, studies on attribute-based representations of objects, faces, actions, and scenes have drawn considerable attention as an alternative or complement to categorical representations, since they characterize the target subject by several attributes rather than by discriminative assignment into a single specific category, which is too restrictive to describe the nature of the target subject. Furthermore, scientific studies have shown that different crowd systems share similar principles that can be characterized by some common properties or attributes. Indeed, attributes can express more information about a crowd video, as they describe the video by answering "Who is in the crowd?", "Where is the crowd?", and "Why is the crowd here?", rather than merely assigning a categorical scene label or event label to it. For instance, an attribute-based representation might describe a crowd video as the "conductor" and "choir" performing on the "stage" with the "audience" "applauding", in contrast to a categorical label like "chorus". Recently, some works have made efforts toward crowd attribute profiling, but the number of attributes in these works is limited (four or fewer), and the datasets are small in terms of scene diversity.
Summary
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. This summary is not an extensive overview of the disclosure. It is intended to neither identify key or critical elements of the disclosure nor delineate any scope of particular embodiments of the disclosure, or any scope of the claims. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect, disclosed is a system for predicting crowd attributes, comprising: a feature extracting device obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction device in electronic communication with the feature extracting device and predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
In yet another aspect, disclosed is a method for understanding crowd scenes, comprising: obtaining a video with crowd scenes; extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
In yet another aspect, disclosed is a system for predicting crowd attributes, comprising:
a memory that stores executable components; and
a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein, the executable components comprise:
a feature extracting component obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and
a prediction component predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
In one embodiment, the prediction device/component is configured with a convolutional neural network having:
a first branch configured to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features; and
a second branch configured to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features,
wherein the predicted features from the first branch and the predicted features from the second branch are fused together to form a prediction of the attributes of the crowd in the video.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating a system for predicting crowd attributes according to an embodiment of the present application.
Fig. 2 is a schematic diagram illustrating a flow chart for the system according to one embodiment of the present application.
Fig. 3 illustrates a schematic block diagram of the feature extracting device according to an embodiment of the present application.
Fig. 4 is a schematic diagram illustrating motion channels in scenarios consistent with some disclosed embodiments.
Fig. 5 is a schematic diagram illustrating a convolutional neural network structure included in the prediction device according to some disclosed embodiments.
Fig. 6 is a schematic diagram illustrating a flow chart for constructing a network with the appearance and motion branches according to one embodiment of the present application.
Fig. 7 is a schematic diagram illustrating a flow chart for the training device to fine-tune the second network using the appearance and motion channels of videos in the fine-tuning set.
Fig. 8 illustrates a system for predicting crowd attributes according to one embodiment of the present application, in which the functions of the present invention are carried out by the software.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a" , "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises"  and/or "comprising, " when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 illustrates a system 1000 for predicting crowd attributes. The proposed system 1000 is capable of understanding crowded scenes in computer vision at the attribute level, characterizing a crowded scene by predicting a plurality of attributes rather than by discriminative assignment into a single specific category. This is significant in many applications, e.g. video surveillance and video search engines.
As shown in Fig. 1, the system 1000 comprises a feature extracting device 100 and a prediction device 200. Fig. 2 illustrates a flow chart for the system 1000 according to one embodiment of the present application. At step S201, the feature extracting device 100 obtains a video with crowd scenes and extracts appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; then, at step S202, the prediction device 200 predicts attributes of the crowd in the video based on the extracted motion features and the extracted appearance features, as will be further discussed later.
In one example of the present application, the feature extracting device 100 may deeply learn the appearance and motion representations across different crowded scenes. Fig. 3 illustrates a schematic block diagram of the feature extracting device 100 according to an embodiment of the present application. The feature extracting device 100 comprises an appearance feature extracting unit 101 configured to extract the RGB components of each frame from the input video.
The feature extracting device 100 further comprises a motion feature extracting unit 102 to extract motion features from the obtained video. To be specific, the motion feature extracting unit 102 comprises a tracklet detection module 1021 to detect crowd tracklets (i.e., short trajectories) for each frame of the obtained video with crowd scenes. For example, the tracklet detection module 1021 may utilize the well-known KLT feature point tracker to detect several key points in each frame of the obtained video. The detected key points are tracked with the matching algorithm predefined by the KLT tracker, and the corresponding key points across consecutive frames are matched to extract the tracklets. In the non-limiting embodiments of the present application, a plurality of key points are detected on each person in the crowd in each frame. In a preferred embodiment, each of the motion features is computed over a certain number of (for example, 75) frames of the obtained video.
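By way of illustration only, a minimal sketch of such KLT-based tracklet extraction with OpenCV might look as follows; the helper name, parameter values and the simplistic handling of lost points are assumptions made for this sketch, not details taken from the patent.

```python
# Illustrative sketch of KLT tracklet extraction (parameter values, the helper
# name and the handling of lost points are assumptions, not the patent's code).
import cv2
import numpy as np

def extract_tracklets(frames, max_corners=1000):
    """Track key points across consecutive frames; returns (num_points, num_frames, 2)."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    # Detect several key points per frame (typically several per person in the crowd).
    pts = cv2.goodFeaturesToTrack(gray[0], maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=5)
    tracks = [pts.reshape(-1, 2)]
    for prev, curr in zip(gray[:-1], gray[1:]):
        # Match corresponding key points between consecutive frames with the KLT matcher.
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts, None,
                                                  winSize=(15, 15), maxLevel=2)
        tracks.append(pts.reshape(-1, 2))   # lost points are kept for simplicity
    return np.stack(tracks, axis=1)
```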
The motion feature extracting unit 102 further comprises a motion distribution determination module 1022 to compute physical relationships between each tracklet and its neighbors to determine motion distributions in each frame. The scene-independent properties of groups in a crowd exist over the whole scene space and can be quantified at the scene level.
According to one embodiment, three properties, namely collectiveness, stability and conflict, are computed for the frames. The collectiveness indicates the degree to which individuals in the whole scene act as a union in collective motion, the stability characterizes whether the whole scene keeps its topological structure, and the conflict measures the interaction/friction between each pair of nearest neighbors of interest points.
The examples shown in Fig. 4 illustrate each property intuitively. Referring to Fig. 4, for each channel, two examples are shown in the first and second rows.
Individuals in a crowd moving randomly indicate low collectiveness, while coherent motion of the crowd reveals high collectiveness. In Fig. 4-a, people in the crowded scene walk randomly toward different destinations and thus exhibit low collectiveness. In Fig. 4-b, a marathon video shows people running coherently towards the same destination, exhibiting high collectiveness.
Individuals have low stability if their topological structure changes a lot, and high stability if it changes little. In Fig. 4-c, skate dancers change their formation considerably from the first frame to the fiftieth frame, which means low stability; while in Fig. 4-d, the dancers in the bottom example keep their topological formation unchanged, exhibiting high stability.
Conflict occurs when individuals move in different directions. In Fig. 4-e, a single group of horse riders parades without any friction; while in Fig. 4-f, several groups of people cross paths while walking, generating conflict with each other.
The present application is not restricted to the three proposed properties; other properties can be generated if required.
In one example of the present application, the motion distribution determination module 1022 operates to define a K-NN graph G(V, E) over the whole set of tracklet points detected by the tracklet detection module 1021, whose vertices V represent the tracklet points and whose edges E connect pairs of tracklet points. The set of nearest neighbors of a tracklet point z ∈ V is denoted as N(z) at every frame of a given video clip.
The motion distribution determination module 1022 then extracts three motion maps, namely the collectiveness distribution, the stability distribution, and the conflict distribution, for each frame.
The collectiveness distribution (or map) can be computed by integrating path similarities among crowds on a collective manifold. B. Zhou, X. Tang, H. Zhang, and X. Wang proposed the Collective Merging algorithm to detect collective motions from random motions by modeling collective motions on the manifold in "Measuring crowd collectiveness" (TPAMI, 36(8):1586-1599, 2014).
The stability distribution is extracted by counting and averaging, over time, the number of invariant neighbors of each point in the K-NN graph. For each member i, let N_1(i) denote its K-NN set in the first frame and N_τ(i) its K-NN set in the τ-th frame. A member has high stability if its neighbor sets vary little across frames; the more the neighbor set N_τ(i) deviates from N_1(i), the lower the stability of the member.
The conflict distribution is extracted by computing the velocity correlation between each pair of nearby tracklet points {z, z*} within the K-NN graph. For each member i, if the velocity of each member in its K-NN set is similar to its own, the member has low conflict; that is, its neighbors move coherently with it without generating conflict.
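As a concrete illustration of the K-NN graph, stability and conflict descriptions above, the following sketch derives per-point scores from a tracklet array such as the one produced earlier. It follows the verbal definitions only (stability from the invariance of the K-NN set across frames, conflict from the velocity correlation with K-NN members); the choice of K, the normalizations and the exact formulas are assumptions rather than the patent's equations.

```python
# Sketch of per-point stability and conflict scores from tracklets of shape
# (num_points, num_frames, 2); formulas follow the verbal description only.
import numpy as np
from scipy.spatial import cKDTree

def knn_sets(points, k=10):
    """Indices of the K nearest neighbours of every point (excluding the point itself)."""
    _, idx = cKDTree(points).query(points, k=k + 1)
    return idx[:, 1:]

def stability_scores(tracklets, k=10):
    """High when a point keeps roughly the same K-NN set from the first to the last frame."""
    first = knn_sets(tracklets[:, 0], k)
    last = knn_sets(tracklets[:, -1], k)
    return np.array([len(set(a) & set(b)) / k for a, b in zip(first, last)])

def conflict_scores(tracklets, k=10):
    """High when a point's velocity disagrees with the velocities of its K-NN members."""
    vel = tracklets[:, -1] - tracklets[:, 0]            # net displacement used as velocity
    vel /= np.linalg.norm(vel, axis=1, keepdims=True) + 1e-8
    nbrs = knn_sets(tracklets[:, 0], k)
    corr = np.einsum('id,ikd->ik', vel, vel[nbrs])      # cosine similarity to each neighbour
    return 1.0 - corr.mean(axis=1)                      # low correlation -> high conflict
```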
Returning to Fig. 3, the motion feature extracting unit 102 further comprises a continuous motion channel generation module 1023 to average the per-frame motion maps (for example, the collectiveness maps, the stability maps and the conflict maps) across the temporal domain, and to interpolate the sparse tracklet points to output three complete and continuous motion channels. Although a single frame contains tens or hundreds of tracklets, the tracklet points are still spatially sparse. A Gaussian kernel can be utilized to interpolate the averaged motion maps to obtain continuous motion channels.
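A minimal sketch of this step, under the assumption that per-point scores are splatted onto the image grid, averaged over time and then smoothed with a Gaussian kernel, might look as follows; the grid handling and the kernel width are illustrative choices.

```python
# Sketch of continuous motion channel generation: splat sparse per-point scores
# onto the image grid, average across frames, then smooth with a Gaussian kernel.
import numpy as np
from scipy.ndimage import gaussian_filter

def continuous_channel(points_per_frame, scores_per_frame, height, width, sigma=10.0):
    acc = np.zeros((height, width), dtype=np.float32)
    cnt = np.zeros((height, width), dtype=np.float32)
    for pts, scores in zip(points_per_frame, scores_per_frame):
        for (x, y), s in zip(pts.astype(int), scores):
            if 0 <= y < height and 0 <= x < width:
                acc[y, x] += s
                cnt[y, x] += 1.0
    averaged = np.where(cnt > 0, acc / np.maximum(cnt, 1.0), 0.0)   # temporal average
    return gaussian_filter(averaged, sigma=sigma)                   # Gaussian interpolation
```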
Returning to Fig. 1, the system 1000 further comprises a prediction device 200. The prediction device 200 is in electronic communication with the feature extracting device 100 and is configured to obtain the appearance features of the video, receive the extracted motion features from the feature extracting device 100, and predict attributes of the crowd in the video based on the received motion features and/or the obtained appearance features. With this function, the system can effectively detect the attributes, including the roles of people, their activities and the locations, from crowd videos, so as to describe the content of the crowd videos. Therefore, crowd videos with the same set of attributes can be retrieved, and the similarity of different crowd videos can be measured by their attribute sets. Furthermore, there is a large number of possible interactions among these attributes: some attributes are likely to be detected simultaneously, while others are mutually exclusive. For example, the scenario attribute "street" is likely to co-occur with the subject "pedestrian" when the subject is "walking", and also likely to co-occur with the subject "mob" when the subject is "fighting", but it is not related to the subject "swimmer" because a subject cannot "swim" on a "street".
From a model perspective, the prediction device 200 may be configured as a model with the convolutional neural network structure shown in Fig. 5. For purposes of illustration, Fig. 5 shows two branches included in the convolutional neural network structure. However, the number of branches is not limited to the proposed two and can be generalized to more branches. The number of each type of layer and the number of parameters can also be tuned according to different tasks and objectives.
As shown in Fig. 5, the network comprises: one or more data layers 501, one or more convolution layers 502, one or more max/sum pooling layers 503, one or more normalization layers 504 and a fully-connected layer 505.
Data layer 501
In the exemplified embodiment shown in Fig. 5, this layer of the top appearance branch contains the RGB components (or channels) of the images and their labels (for example, the label dimension is 94), and that of the bottom motion branch contains at least one motion feature (for example, the three proposed motion channels discussed above: collectiveness, stability and conflict) and their labels, which are the same as the labels of the top branch.
Specifically, this layer 501 provides images x_i = (x_i1, ..., x_id) and their labels y_i = (y_i1, ..., y_in), where x_ij is the j-th component of the d-dimensional feature vector of the i-th input image region, and y_ij is the j-th component of the n-dimensional label vector of the i-th input image region.
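For illustration, the two data-layer inputs and the shared multi-label attribute vector could be assembled as in the following sketch; the array layout and the helper name are assumptions.

```python
# Sketch of assembling one training sample for the two data layers: an RGB
# appearance tensor, a 3-channel motion tensor, and one shared 94-dim label.
import numpy as np

def make_sample(rgb_frame, coll_map, stab_map, conf_map, attribute_ids, n_attr=94):
    appearance = rgb_frame.transpose(2, 0, 1).astype(np.float32)          # (3, H, W) RGB channels
    motion = np.stack([coll_map, stab_map, conf_map]).astype(np.float32)  # (3, H, W) motion channels
    label = np.zeros(n_attr, dtype=np.float32)
    label[list(attribute_ids)] = 1.0                                      # multi-label ground truth
    return appearance, motion, label
```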
Convolution layer 502
The convolution layer 502 receives the outputs (x_i and their labels y_i) from the data layer 501 and performs convolution, padding, and non-linear transformation operations on them. The convolution operation in each convolutional layer may be expressed as

y_j = max(0, b_j + Σ_i k_ij * x_i)   1)
Where,
xi and yj are the i-th input feature map and the j-th output feature map, respectively;
kij is the convolution kernel between the i-th input feature map and the j-th output feature map;
* denotes convolution;
bj is the bias of the j-th output feature map; and
ReLU nonlinearity y=max (0, x) is used for neurons.
The convolution operation can extract features such as edges, curves and dots from the input image. These features are not predefined manually but are learned from the training data.
When the convolution kernel k_ij operates on the marginal pixels of x_i, it exceeds the border of x_i. In this case, the values beyond the border of x_i are set to 0 so as to make the operation valid. This operation is also called "padding".
The order of the above operations is: padding -> convolution -> non-linear transformation (ReLU). The input to "padding" is x_i in equation 1). Each step uses the output of the previous step, and the non-linear transformation produces y_j in equation 1).
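The padding, convolution and ReLU operations of equation 1) can be sketched directly (and inefficiently) in plain numpy as follows; the kernel and bias shapes follow the usual CNN convention and are assumptions of this sketch, and the loop computes the cross-correlation form customarily used in CNNs.

```python
# Naive numpy sketch of equation 1): zero padding, convolution, then ReLU.
import numpy as np

def conv_layer(x, k, b, pad=1):
    """x: (C_in, H, W), k: (C_out, C_in, Kh, Kw), b: (C_out,)."""
    c_out, _, kh, kw = k.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))       # values beyond the border set to 0
    h_out = xp.shape[1] - kh + 1
    w_out = xp.shape[2] - kw + 1
    y = np.zeros((c_out, h_out, w_out), dtype=x.dtype)
    for j in range(c_out):                                  # j-th output feature map
        for r in range(h_out):
            for c in range(w_out):
                y[j, r, c] = b[j] + np.sum(k[j] * xp[:, r:r + kh, c:c + kw])
    return np.maximum(y, 0.0)                               # ReLU non-linearity
```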
Max pooling layer 503
The max pooling layer keeps the maximum value in a local window and discards the other values; the dimension of the output is thus smaller than that of the input. The operation may be formulated as

y_i(j, k) = max over 0 ≤ m < M, 0 ≤ n < N of x_i(j·s + m, k·s + n)   4)

where each neuron in the i-th output feature map y_i pools over an M×N local region in the i-th input feature map x_i, with s as the step size.
In other words, it reduces the feature dimensions and provides spatial invariance. Spatial invariance means that if the input shifts by several pixels, the output of the layer will not change much.
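The M×N max pooling with step size s of equation 4) can likewise be sketched for a single feature map; the window and step sizes below are illustrative defaults.

```python
# Numpy sketch of M x N max pooling with step size s over one feature map.
import numpy as np

def max_pool(x, m=3, n=3, s=2):
    h_out = (x.shape[0] - m) // s + 1
    w_out = (x.shape[1] - n) // s + 1
    y = np.empty((h_out, w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            y[i, j] = x[i * s:i * s + m, j * s:j * s + n].max()   # keep only the maximum
    return y
```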
Normalization layer 504:
This layer normalizes the responses in local regions of input feature maps. The output dimensionality of this layer is equal to the input dimensionality.
Fully-connected layer 505
The fully-connected layer takes the feature vector from the previous layer as input and computes the inner product between the feature x and the weights w; one non-linear transformation is then applied to the product, which may be formulated as

y = max(0, w^T x)   5)

where x denotes the neural inputs (features), y denotes the neural outputs (features) of the current fully-connected layer, and w denotes the neural weights of the current fully-connected layer. Neurons in the fully-connected layer linearly combine features from the previous feature extraction module, followed by the ReLU non-linearity.
The fully-connected layer is configured to extract global features (features extracted from the entire input feature maps) from the previous layer. The fully-connected layer also performs feature dimension reduction by restricting the number of neurons in it. In one embodiment of the present application, at least two fully-connected layers are provided so as to increase the nonlinearity of the neural network, which in turn makes fitting the data easier.
The convolutional layer and the max pooling layer only provide local transformations, which means that they only operate on a local window of the input (a local region of the input image). The fully-connected layer, however, provides a global transformation, which takes features from the whole space of the input image and conducts a transformation as discussed in equation 5) above.
In the end, the two branches fuse into one fully-connected layer. If simple notations are used to represent parameters in the networks: (1) Conv(N, K, S) for convolutional layers with N outputs, kernel size K and stride size S; (2) Pool(T, K, S) for pooling layers with type T, kernel size K and stride size S; (3) Norm(K) for local response normalization layers with local size K; (4) FC(N) for fully-connected layers with N outputs; and (5) ReLU for the rectified linear unit and Sig for the sigmoid activation function; then, taking N=96, K=7 and S=2 as an example, each of the two branches has the parameters: Conv(96, 7, 2)-ReLU-Pool(3, 2)-Norm(5)-Conv(256, 5, 2)-ReLU-Pool(3, 2)-Norm(5)-Conv(384, 3, 1)-ReLU-Conv(384, 3, 1)-ReLU-Conv(256, 3, 1)-ReLU-Pool(3, 2)-FC(4096).
The output fully-connected layers of the two branches are concatenated to form FC(8192). Finally, FC(8192)-FC(94)-Sig produces a plurality of (for example, 94) attribute probability predictions. In one embodiment of the present application, the output of the FC layer 505 may be 94 attributes, for example, {street, temple, ...} belonging to "where", {star, protester, ...} belonging to "who", and {walk, board, ...} belonging to "why". Accordingly, the 94 attributes output from the FC layer 505 may be of three types: "where" (e.g. street, temple, and classroom); "who" (e.g. star, protester, and skater); and "why" (e.g. walk, board, and ceremony).
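A sketch of the two-branch network described by this Conv/Pool/Norm/FC configuration, written in PyTorch, is given below. The padding values, the assumed 227x227 input size (which fixes the size of the first fully-connected layer), and the sigmoid multi-label output follow the text and common practice, but they are illustrative assumptions rather than the patent's exact model.

```python
# PyTorch sketch of the two-branch appearance/motion network (assumptions:
# padding values, 227x227 inputs, and sigmoid multi-label outputs).
import torch
import torch.nn as nn

def make_branch(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(),
        nn.MaxPool2d(3, 2), nn.LocalResponseNorm(5),
        nn.Conv2d(96, 256, kernel_size=5, stride=2, padding=1), nn.ReLU(),
        nn.MaxPool2d(3, 2), nn.LocalResponseNorm(5),
        nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.MaxPool2d(3, 2), nn.Flatten(),
        nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),      # FC(4096); 6x6 assumes 227x227 inputs
    )

class TwoBranchCrowdNet(nn.Module):
    def __init__(self, n_attributes=94):
        super().__init__()
        self.appearance = make_branch(3)   # RGB channels
        self.motion = make_branch(3)       # collectiveness, stability, conflict channels
        self.classifier = nn.Linear(8192, n_attributes)

    def forward(self, rgb, motion):
        fused = torch.cat([self.appearance(rgb), self.motion(motion)], dim=1)  # FC(8192)
        return torch.sigmoid(self.classifier(fused))   # 94 attribute probabilities
```

Calling the model on a batch of RGB frames and the corresponding motion channels then yields one 94-dimensional vector of attribute probabilities per sample.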
Returning to Fig. 1, the system 1000 may further comprise a training device 300. The training device 300 is used to train the convolutional neural network using the following two inputs to obtain a fine-tuned convolutional neural network that produces predictions of crowd attributes:
i. A pre-training set contains images with different objects and the corresponding ground truth object labels. The label set encompasses m object classes.
ii. A fine-tuning set contains crowd videos with appearance as well as motion channels, and the corresponding ground truth attribute labels. The label set encompasses n attribute classes.
Fig. 6 is a schematic diagram illustrating a flow chart for constructing a network with the appearance and motion branches according to one embodiment of the present application.
In this embodiment, two convolutional neural networks are provided with the same structure but different numbers of branches: the first is used for pre-training with only one branch, and the second is used for fine-tuning with two branches. The first convolutional neural network, with one branch of convolutional neural layers, may be constructed by conventional means. The second convolutional neural network, with two branches of convolutional neural layers, is constructed based on the first convolutional neural network.
As shown, at step S601, the device 300 operates to pre-train the first convolutional neural network on an ImageNet detection task, which can be done by conventional means or algorithms.
At step S602, the network parameters of the appearance branch are initialized using the pre-trained model from step S601. Alternatively, the parameters may be randomly initialized.
At step S603, the input of the motion branch in the first convolutional neural network is replaced by the proposed motion distributions, i.e. , collectiveness distributions, stability distributions and conflict distributions.
At step S604, the network parameters of the motion branch of the first convolutional neural network with the proposed motion channels are randomly initialized without pre-training.
At step S605, the second convolutional neural network with two branches (i.e. the appearance channel and the motion channels) is constructed. In particular, the second network is constructed by combining the first convolutional neural network initialized with the appearance parameters at step S602 and the first convolutional neural network initialized with the motion parameters at step S604, as shown in Fig. 6.
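Reusing the definitions from the network sketch above, steps S601 to S605 could be illustrated as follows: a single appearance branch is pre-trained, its weights are copied into the appearance branch of the two-branch model, and the motion branch keeps its random initialization. The pre-training itself is only indicated by a comment.

```python
# Sketch of constructing the second, two-branch network (steps S601-S605),
# reusing make_branch and TwoBranchCrowdNet from the sketch above.
pretrained_branch = make_branch(3)
# ... step S601: pre-train `pretrained_branch` on an ImageNet detection task ...

model = TwoBranchCrowdNet()
model.appearance.load_state_dict(pretrained_branch.state_dict())  # step S602: copy appearance weights
# step S604: model.motion keeps PyTorch's default random initialization (no pre-training)
```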
Fig. 7 is a schematic diagram illustrating a flow chart for the training device 300 to fine-tune the second network using the appearance and motion channels of videos in the fine-tuning set.
At step S701, the parameters, including the convolution filters, deformational layer weights, fully-connected weights, and biases, are randomly initialized by the training device 300. The training tries to minimize the loss function and can be divided into many updating steps. Therefore, at step S702, the loss is calculated, and then at step S703, the algorithm calculates the gradient with respect to all the neural network parameters, including the convolution filters, deformational layer weights, fully-connected weights, and biases, based on the calculated loss.
The gradient of any network parameter can be calculated with the chain rule. Suppose the network has n layers, denoted by L_i, i = 1, 2, ..., n. The output of a layer L_k in the network can be expressed by a general function

y_k = f_k(y_k-1, w_k)   6)

where y_k is the output of the layer L_k, y_k-1 is the output of the previous layer L_k-1, w_k is the weights of L_k, and f_k is the function for L_k. The derivatives of y_k with respect to y_k-1 and w_k are both known. The loss function C of the network is defined on the output of the last layer L_n and the ground truth label t,

c = C(y_n, t)   7)
The derivative of c with respect to y_n is also known. To calculate the gradient of c with respect to the weights w_n, the chain rule can be applied:

∂c/∂w_n = (∂c/∂y_n) · (∂y_n/∂w_n)   8)

To calculate the gradient of c with respect to y_k, the chain rule can also be applied recursively:

∂c/∂y_k = (∂c/∂y_k+1) · (∂y_k+1/∂y_k)   9)

To calculate the gradient of c with respect to an arbitrary weight w_k, we can use

∂c/∂w_k = (∂c/∂y_k) · (∂y_k/∂w_k)   10)
In this procedure, the gradient of the cost c with respect to any weights in the network can be calculated.
At step S704, the algorithm updates the convolution filters, deformational layer weights, fully-connected weights, and biases by the rule

w_k ← w_k − η · (∂c/∂w_k)   11)

where η is the learning rate, a predefined value. Updates of the parameters are performed using the product of a prefixed learning rate and the corresponding gradients.
At step S705, it is determined whether the stopping criterion is satisfied. For example, if the variation of the loss is less than a predetermined value, the process terminates; otherwise, the process returns to step S702.
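A minimal fine-tuning loop covering steps S702 to S705 is sketched below; the binary cross-entropy loss for the sigmoid multi-label outputs, the learning rate, the stopping threshold and the `loader` iterating over the fine-tuning set are assumptions of this sketch, reusing the `model` defined earlier.

```python
# Sketch of the fine-tuning loop: loss (S702), gradients (S703), the update
# w <- w - eta * dc/dw (S704), and a simple stopping check (S705).
import torch

eta = 1e-3                                   # prefixed learning rate (assumed value)
criterion = torch.nn.BCELoss()               # multi-label attribute loss (assumed)
prev_loss = float('inf')

for rgb, motion, labels in loader:           # batches from the fine-tuning set
    loss = criterion(model(rgb, motion), labels)          # step S702
    model.zero_grad()
    loss.backward()                                        # step S703: chain-rule gradients
    with torch.no_grad():
        for w in model.parameters():                       # step S704
            w -= eta * w.grad
    if abs(prev_loss - loss.item()) < 1e-6:                # step S705: stopping criterion
        break
    prev_loss = loss.item()
```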
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, a method or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment or an embodiment combining software and hardware aspects, which may all generally be referred to herein as a "unit", "circuit", "module" or "system". Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with or in integrated circuits (ICs), such as a digital signal processor and software therefor, or application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments.
In addition, the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc. ) or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium. Fig. 8 illustrates a system 3000 for predicting crowd attributes according to one embodiment of the present application, in which the functions of the present invention are carried out by software. Referring to Fig. 8, the system 3000 comprises a memory 3001 that stores executable components and a processor 3002, electrically coupled to the memory 3001, to execute the executable components to perform operations of the system 3000. The executable components may comprise: a feature extracting component 3003 obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction component 3004 predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features. The functions of the components 3003 and 3004 are similar to those of the units 100 and 200, respectively, and thus the detailed descriptions thereof are omitted herein.
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples once they learn of the basic inventive concept. The appended claims are intended to be construed as covering the preferred examples and all variations or modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications fall within the scope of the claims and their equivalents, they are also intended to fall within the scope of the present invention.

Claims (20)

  1. A system for predicting crowd attributes, comprising:
    a feature extracting device obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and
    a prediction device in electronic communication with the feature extracting device and predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
  2. The system according to claim 1, wherein the feature extracting device further comprises a motion feature extracting unit comprising:
    a tracklet detection module (1021) detecting short trajectories of the crowd in each frame of the video;
    a motion maps determination module (1022) computing physical relationships between each of the short trajectories and its neighbors to determine one or more motion distributions for the crowd in each frame of the video; and
    a continuous motion channel generation module (1023) averaging the determined motion distributions across the temporal domain, and interpolating one or more sparse short trajectory points into the averaged distributions to form one or more continuous motion channels forming the motion features.
  3. The system according to claim 2, wherein the one or more motion distributions comprise at least one of:
    a collectiveness distribution that indicates a degree to which individuals in a whole scene act as a union in a collective motion,
    a stability distribution that indicates whether the whole scene can keep a topological structure for the crowd in the whole scene; and
    a conflict distribution that indicates an interaction/friction between each pair of nearest neighbors of the short trajectories for the crowd in the scene.
  4. The system according to claim 1, wherein the predicted attributes at least indicate a role of the people in the crowd, a place of the crowd and a reason why people are in the crowd.
  5. The system according to any one of claims 1-4, wherein the prediction device is configured with a convolutional neural network having:
    a first branch configured to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features; and
    a second branch configured to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features,
    wherein, the predicted crowd attributes from the first branch and the predicted crowd attributes from the second branch are fused together to output a prediction of the attributes of the crowd in the video.
  6. The system according to claim 5, further comprising a training device for training the second neural network by:
    initializing randomly parameters for the second neural network;
    calculating a loss of the parameters in the second neural network;
    calculating a gradient with respect to all said parameters based on the calculated loss;
    updating the parameters by using a product of a predefined learning rate and the corresponding gradients;
    determining if a stopping criterion is satisfied;
    if not, returning to the step of calculating.
  7. The system according to claim 6, wherein the training device trains the first neural network by:
    initializing parameters for the first neural network with pre-trained data sets;
    calculating a loss of the parameters in the first neural network;
    calculating a gradient with respect to all said parameters based on the calculated loss;
    updating the parameters by using a product of a predefined learning rate and the corresponding gradients;
    determining if a stopping criterion is satisfied;
    if not, returning to the step of calculating.
  8. The system according to claim 7, wherein the trained first neural network and the trained second neural network are connected together, the training device further inputs a fine-tuning set into the connected networks to fine-tune the connected networks.
  9. A method for understanding a crowd scene, comprising:
    obtaining a video with crowd scenes;
    extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and
    predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
  10. The method according to claim 9, wherein the extracting further comprises:
    detecting short trajectories of the crowd in the video frames;
    computing physical relationships between each of the short trajectories and its neighbors to determine one or more motion distributions for the crowd in each frame; and averaging the determined motion distributions across the temporal domain, and interpolating one or more sparse short trajectory points into the averaged distributions to form one or more continuous motion channels forming the motion features.
  11. The method according to claim 10, wherein the one or more motion distributions comprise at least one of:
    a collectiveness distribution that indicates a degree to which individuals in a whole scene act as a union in a collective motion,
    a stability distribution that indicates whether the whole scene can keep a topological structure for the crowd in the whole scene; and
    a conflict distribution that indicates an interaction/friction between each pair of nearest neighbors of short trajectories for the crowd in the whole scene.
  12. The method according to claim 9, wherein the predicted attributes at least indicate a role of the people in the crowd, a place of the crowd and a reason why people are in the crowd.
  13. The method according to any one of claims 9-12, wherein the predicting further comprises:
    configuring a first branch to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features;
    configuring a second branch to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features; and
    connecting the predicted crowd attributes from the first branch and the predicted crowd attributes from the second branch to output a prediction of the attributes of the crowd in the video.
  14. The method according to claim 13, further comprising:
    initializing randomly parameters for the second neural network;
    calculating a loss of the parameters in the second neural network;
    calculating a gradient with respect to all said parameters based on the calculated loss;
    updating the parameters by using a product of a predefined learning rate and the corresponding gradients;
    determining if a stopping criterion is satisfied;
    if not, returning to the step of calculating.
  15. The method according to claim 14, further comprising:
    initializing parameters for the first neural network with pre-trained data sets;
    calculating a loss of the parameters in the first neural network;
    calculating a gradient with respect to all said parameters based on the calculated loss;
    updating the parameters by using a product of a predefined learning rate and the corresponding gradients;
    determining if a stopping criterion is satisfied;
    if not, returning to the step of calculating.
  16. The method according to claim 15, further comprising:
    connecting the trained first neural network and the trained second neural network together; and
    fine-tuning the connected networks by inputting a fine-tuning set into the connected networks.
  17. A system for predicting crowd attributes, comprising:
    a memory that stores executable components; and
    a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein, the executable components comprise:
    a feature extracting component obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and
    a prediction component predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
  18. The system according to claim 17, wherein the feature extracting component is configured to:
    detect short trajectories of the crowd in each frame in the video;
    compute physical relationships between each of the trajectories and its neighbors to determine one or more motion distributions for each frame in the video; and
    average the determined motion distributions across the temporal domain, and interpolate one or more sparse tracklet points into the averaged distributions to form one or more continuous motion channels forming the motion features.
  19. The system according to claim 18, wherein the one or more motion distributions comprise at least one of:
    a collectiveness distribution that indicates a degree to which individuals in a whole scene act as a union in a collective motion,
    a stability distribution that indicates whether the whole scene can keep a topological structure for the crowd in the whole scene; and
    a conflict distribution that indicates an interaction/friction between each pair of nearest neighbors of trajectories for the crowd in the whole scene.
  20. The system according to claim 19, wherein the prediction component is further configured for:
    configuring a first branch to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features;
    configuring a second branch to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features; and
    connecting the predicted crowd attributes from the first branch and the predicted crowd attributes from the second branch to output a prediction of the attributes of the crowd in the video.
PCT/CN2015/079190 2015-05-18 2015-05-18 A system and a method for predicting crowd attributes WO2016183770A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580080179.9A CN107615272B (en) 2015-05-18 2015-05-18 System and method for predicting crowd attributes
PCT/CN2015/079190 WO2016183770A1 (en) 2015-05-18 2015-05-18 A system and a method for predicting crowd attributes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/079190 WO2016183770A1 (en) 2015-05-18 2015-05-18 A system and a method for predicting crowd attributes

Publications (1)

Publication Number Publication Date
WO2016183770A1 true WO2016183770A1 (en) 2016-11-24

Family

ID=57319155

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/079190 WO2016183770A1 (en) 2015-05-18 2015-05-18 A system and a method for predicting crowd attributes

Country Status (2)

Country Link
CN (1) CN107615272B (en)
WO (1) WO2016183770A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415323B (en) * 2019-07-30 2023-05-26 成都数字天空科技有限公司 Fusion deformation coefficient obtaining method, fusion deformation coefficient obtaining device and storage medium
CN111339364B (en) * 2020-02-28 2023-09-29 网易(杭州)网络有限公司 Video classification method, medium, device and computing equipment
CN111429185B (en) * 2020-03-27 2023-06-02 京东城市(北京)数字科技有限公司 Crowd figure prediction method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202225B2 (en) * 2010-05-28 2015-12-01 Red Hat, Inc. Aggregate monitoring of utilization data for vendor products in cloud networks
CN102201065B (en) * 2011-05-16 2012-11-21 天津大学 Method for detecting monitored video abnormal event based on trace analysis
CN102508923B (en) * 2011-11-22 2014-06-11 北京大学 Automatic video annotation method based on automatic classification and keyword marking
CN105095908B (en) * 2014-05-16 2018-12-14 华为技术有限公司 Group behavior characteristic processing method and apparatus in video image
CN104598890B (en) * 2015-01-30 2017-07-28 南京邮电大学 A kind of Human bodys' response method based on RGB D videos

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040230452A1 (en) * 2003-05-15 2004-11-18 Yuichi Abe Regional attribute determination method, regional attribute determination device, and regional attribute determination program
CN101561928A (en) * 2009-05-27 2009-10-21 湖南大学 Multi-human body tracking method based on attribute relational graph appearance model
CN103150375A (en) * 2013-03-11 2013-06-12 浙江捷尚视觉科技有限公司 Quick video retrieval system and quick video retrieval method for video detection
CN104537685A (en) * 2014-12-12 2015-04-22 浙江工商大学 Method for conducting automatic passenger flow statistical analysis on basis of video images

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109615140A (en) * 2018-12-14 2019-04-12 中国科学技术大学 A kind of method and device for predicting pedestrian movement
CN109615140B (en) * 2018-12-14 2024-01-09 中国科学技术大学 Method and device for predicting pedestrian movement
CN109977800A (en) * 2019-03-08 2019-07-05 上海电力学院 A kind of intensive scene crowd of combination multiple features divides group's detection method
CN110210603A (en) * 2019-06-10 2019-09-06 长沙理工大学 Counter model construction method, method of counting and the device of crowd
CN111933298A (en) * 2020-08-14 2020-11-13 医渡云(北京)技术有限公司 Crowd relation determination method, device, electronic equipment and medium
CN111933298B (en) * 2020-08-14 2024-02-13 医渡云(北京)技术有限公司 Crowd relation determining method and device, electronic equipment and medium
CN113792930A (en) * 2021-04-26 2021-12-14 青岛大学 Blind person walking track prediction method, electronic device and storage medium
CN113792930B (en) * 2021-04-26 2023-08-22 青岛大学 Blind person walking track prediction method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN107615272A (en) 2018-01-19
CN107615272B (en) 2021-09-03

Similar Documents

Publication Publication Date Title
WO2016183770A1 (en) A system and a method for predicting crowd attributes
Zhang et al. Attentional neural fields for crowd counting
Xiong et al. Spatiotemporal modeling for crowd counting in videos
CN107624189B (en) Method and apparatus for generating a predictive model
US20170103264A1 (en) System and Method for Visual Event Description and Event Analysis
CN108960184B (en) Pedestrian re-identification method based on heterogeneous component deep neural network
Karavasilis et al. Visual tracking using the Earth Mover's Distance between Gaussian mixtures and Kalman filtering
Hou et al. Human tracking over camera networks: a review
Lian et al. Spatial–temporal consistent labeling of tracked pedestrians across non-overlapping camera views
Ma et al. Counting people crossing a line using integer programming and local features
CN111178284A (en) Pedestrian re-identification method and system based on spatio-temporal union model of map data
Banerjee et al. Efficient pooling of image based CNN features for action recognition in videos
CN114240997B (en) Intelligent building online trans-camera multi-target tracking method
WO2020088763A1 (en) Device and method for recognizing activity in videos
Xie et al. Event-based stereo matching using semiglobal matching
Ji et al. Semisupervised hyperspectral image classification using spatial-spectral information and landscape features
Zhang et al. Joint discriminative representation learning for end-to-end person search
Islam et al. Representation for action recognition with motion vector termed as: SDQIO
Behera et al. Person re-identification: A taxonomic survey and the path ahead
Pehlivan et al. Recognizing activities in multiple views with fusion of frame judgments
Babu et al. Subject independent human action recognition using spatio-depth information and meta-cognitive RBF network
Yadav et al. DroneAttention: Sparse weighted temporal attention for drone-camera based activity recognition
Zhu et al. Correspondence-free dictionary learning for cross-view action recognition
Srilakshmi et al. A-DQRBRL: attention based deep Q reinforcement battle royale learning model for sports video classification
WO2020192868A1 (en) Event detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15892149

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15892149

Country of ref document: EP

Kind code of ref document: A1