WO2016183770A1 - A system and a method for predicting crowd attributes - Google Patents
A system and a method for predicting crowd attributes
- Publication number
- WO2016183770A1 (PCT/CN2015/079190)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- crowd
- motion
- video
- attributes
- features
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
Definitions
- Fig. 1 illustrates a system 1000 for predicting crowd attributes.
- the proposed system 1000 is capable of understanding crowded scenes in computer vision at the attribute level, characterizing a crowded scene by predicting a plurality of attributes rather than by discriminative assignment to a single specific category. This is significant in many applications, e.g. video surveillance and video search engines.
- the system 1000 comprises a feature extracting device 100 and a prediction device 200.
- Fig. 2 illustrates a schematic diagram illustrating a flow chart for the system 1000 according to one embodiment of the present application.
- at step S201, the feature extracting device 100 obtains a video with crowd scenes and extracts appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; then, at step S202, the prediction device 200 predicts attributes of the crowd in the video based on the extracted motion features and the extracted appearance features, as will be further discussed later.
- the feature extracting device 100 may deeply learn the appearance and motion representation across different crowded scenes.
- Fig. 3 illustrates a schematic block diagram of the feature extracting device 100 according to an embodiment of the present application.
- the feature extracting device 100 comprises an appearance feature extracting unit 101 configured to extract the RGB components of each frame from the input video.
- the feature extracting device 100 further comprises a motion feature extracting unit 102 configured to extract motion features from the obtained video.
- the motion feature extracting unit 102 further comprises a tracklet detection module 1021 to detect crowd tracklets (i.e., short trajectories) for each frame in the obtained video with crowd scene.
- the tracklet detection module 1021 may utilize the well-known KLT feature point tracker to detect several key points for each frame in the obtained video.
- the detected key points are tracked with the matching algorithm predefined by the KLT, and the corresponding key points across consecutive frames are matched to extract the tracklets.
- a plurality of key points are detected in one person in the crowd in each frame.
- each of the motion features is computed on a certain number of (for example, 75) frames of the obtained video.
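The tracklet extraction described above can be sketched as follows. This is an illustrative simplification, not the patent's actual implementation: it assumes key points have already been detected in each frame (e.g. by a KLT tracker), and a greedy nearest-neighbour match stands in for KLT's own matching algorithm.

```python
import numpy as np

def link_tracklets(frames_pts, max_dist=5.0):
    """Link key points across consecutive frames into short tracklets.

    frames_pts: list of (N_t, 2) arrays of key-point coordinates per frame.
    Returns a list of tracklets, each an array of (x, y) positions over time.
    """
    # Start one tracklet per key point in the first frame.
    tracklets = [[tuple(p)] for p in frames_pts[0]]
    active = list(range(len(tracklets)))  # tracklets still being extended
    for pts in frames_pts[1:]:
        next_active = []
        used = set()
        for ti in active:
            last = np.array(tracklets[ti][-1])
            # Greedy nearest-neighbour match, standing in for KLT's matcher.
            d = np.linalg.norm(pts - last, axis=1)
            j = int(np.argmin(d))
            if d[j] <= max_dist and j not in used:
                tracklets[ti].append(tuple(pts[j]))
                used.add(j)
                next_active.append(ti)
        active = next_active
    return [np.array(t) for t in tracklets]
```

A tracklet ends as soon as no nearby match is found in the next frame, which matches the "short trajectory" character of tracklets.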
- the motion feature extracting unit 102 further comprises a motion distribution determination module 1022 to compute physical relationships between each tracklet and its neighbors to determine the motion distribution in each frame.
- the scene-independent properties for groups in crowd exist in the whole scene space and can be quantified from scene-level.
- three properties namely collectiveness, stability and conflict are computed for the frames.
- the collectiveness indicates the degree of individuals in the whole scene acting as a union in collective motion
- the stability characterizes whether the whole scene can keep its topological structures
- conflict measures the interaction/friction between each pair of nearest neighbors of interest points.
- examples shown in Fig. 4 illustrate each property intuitively. Referring to Fig. 4, for each channel, two examples are shown in the first and second rows.
- in Fig. 4-a, people in the crowded scene walk randomly towards different destinations, and thus exhibit low collectiveness.
- in Fig. 4-b, a marathon video shows people running coherently towards the same destination, exhibiting high collectiveness.
- the present application is not restricted to the three proposed properties; other properties can be generated if required.
- the motion distribution determination module 1022 operates to define a K-NN graph G (V, E) over the whole point set of the tracklets detected by the tracklet detection module 1021, whose vertices V represent the tracklet points and whose edges E connect pairs of tracklet points.
- the motion distribution determination module 1022 then extracts three motion maps, namely the collectiveness distribution, the stability distribution, and the conflict distribution, for each frame.
- the collectiveness distribution (or map) can be computed by integrating path similarities among crowds on collective manifold.
- B. Zhou, X. Tang, H. Zhang, and X. Wang have proposed the Collective Merging algorithm to detect collective motions from random motions by modeling collective motions on the manifold, in “Measuring crowd collectiveness” (TPAMI, 36 (8): 1586-1599, 2014).
- the stability distribution is extracted by counting and averaging the number of invariant neighbors of each point in the K-NN graph.
- for each member i, its K-NN set in the first frame is compared with its K-NN set in the τ-th frame. A member has high stability if its neighbor set varies little across frames; thus, the larger the variation between the two neighbor sets is, the lower the stability the member has.
- the conflict distribution is extracted by computing the velocity correlation between each pair of nearby tracklet points ⁇ z, z * ⁇ within the K-NN graph.
- for each member i, if the velocity of each member in its K-NN set is similar to its own velocity, it has low conflict: its neighbors move coherently with it without generating conflict.
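The stability and conflict properties described above might be computed per tracklet point roughly as follows. This is a hedged sketch, since the text does not give exact formulas: the fraction of K-NN neighbours kept between two frames stands in for stability, and one minus the mean velocity correlation with the K-NN neighbours stands in for conflict, both over a brute-force K-NN graph.

```python
import numpy as np

def knn_sets(pts, k):
    """Indices of the k nearest neighbours of each point (brute force)."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    return [set(np.argsort(row)[:k]) for row in d]

def stability(pts_first, pts_tau, k=3):
    """Per-point stability: fraction of K-NN neighbours kept between the
    first and the tau-th frame; larger variation means lower stability."""
    n1, nt = knn_sets(pts_first, k), knn_sets(pts_tau, k)
    return np.array([len(a & b) / k for a, b in zip(n1, nt)])

def conflict(pts, vel, k=3):
    """Per-point conflict: one minus the mean cosine similarity between a
    point's velocity and the velocities of its k nearest neighbours."""
    nbrs = knn_sets(pts, k)
    unit = vel / (np.linalg.norm(vel, axis=1, keepdims=True) + 1e-9)
    return np.array([1.0 - np.mean([unit[i] @ unit[j] for j in nb])
                     for i, nb in enumerate(nbrs)])
```

Coherent motion (all velocities aligned) yields conflict near zero, and unchanged neighbour sets yield stability one, matching the intuitions in the text.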
- the motion feature extracting unit 102 further comprises a continuous motion channel generation module 1023 to average the per-frame motion maps (for example, the collectiveness maps, the stability maps and the conflict maps) across the temporal domain, and to interpolate the sparse tracklet points to output three complete and continuous motion channels.
- although a single frame contains tens or hundreds of tracklets, the tracklet points are still spatially sparse.
- a Gaussian kernel can be utilized to interpolate the averaged motion maps so as to obtain continuous motion channels.
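The averaging and Gaussian interpolation steps above can be sketched as follows. Rasterizing the tracklet points onto a fixed grid and the kernel width are assumptions of this sketch, not details taken from the text.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel."""
    ax = np.arange(-radius, radius + 1)
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def continuous_channel(maps, shape, sigma=2.0):
    """Average sparse per-frame motion maps across the temporal domain and
    interpolate them with a Gaussian kernel into one dense channel.

    maps: list of (points, values) pairs, one per frame, where points is an
    (N, 2) integer array of (row, col) tracklet locations and values holds
    the per-point property (e.g. collectiveness) for that frame.
    """
    acc = np.zeros(shape)
    for points, values in maps:
        frame = np.zeros(shape)
        frame[points[:, 0], points[:, 1]] = values
        acc += frame
    acc /= len(maps)  # temporal averaging of the per-frame maps
    # Separable Gaussian smoothing interpolates the sparse tracklet points.
    g = gaussian_kernel(sigma, radius=int(3 * sigma))
    sm = np.apply_along_axis(lambda r: np.convolve(r, g, mode='same'), 1, acc)
    sm = np.apply_along_axis(lambda c: np.convolve(c, g, mode='same'), 0, sm)
    return sm
```

The same routine would be run once per property to produce the three continuous motion channels.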
- the system 1000 further comprises a prediction device 200.
- the prediction device 200 is in electronic communication with the feature extracting device 100 and is configured to obtain the appearance features of the video, receive the extracted motion features from the feature extracting device 100, and predict attributes of the crowd in the video based on the received motion features and/or the obtained appearance features.
- with this function, the system can effectively detect attributes, including the roles of people, their activities and the locations, from crowd videos, so as to describe the content of the crowd videos. Therefore, crowd videos with the same set of attributes can be retrieved, and the similarity of different crowd videos can be measured by their attribute sets. Furthermore, there are a large number of possible interactions among these attributes: some attributes are likely to be detected simultaneously, while others are mutually exclusive.
- for example, the scenario attribute “street” is likely to co-occur with the subject “pedestrian” when the subject is “walking”, and also likely to co-occur with the subject “mob” when the subject is “fighting”, but it is not related to the subject “swimmer” because the subject cannot “swim” on a “street”.
- the prediction device 200 may be configured as a model with a convolutional neural network structure, as shown in Fig. 5.
- as shown in Fig. 5, two branches are included in the convolutional neural network structure.
- the number of branches is not limited to the proposed two; it can be generalized to more branches. The number of each type of layer and the number of parameters can also be tuned according to different tasks and objectives.
- the network comprises: one or more data layers 501, one or more convolution layers 502, one or more max/sum pooling layers 503, one or more normalization layers 504 and a fully-connected layer 505.
- the data layer 501 of the top appearance branch contains the RGB components (or channels) of the images and their labels (for example, the label dimension is 94);
- the data layer 501 of the bottom motion branch contains at least one motion feature (for example, the three proposed motion channels discussed above: collectiveness, stability and conflict) and labels identical to those of the top branch.
- this layer 501 provides images and their labels, where x_ij is the j-th element of the d-dimensional feature vector of the i-th input image region, and y_ij is the j-th bit of the n-dimensional label vector of the i-th input image region.
- the convolution layer 502 receives the outputs (the feature vectors and their labels) from the data layer 501 and performs convolution, padding, and non-linear transformation operations.
- the convolution operation in each convolutional layer may be expressed as

  y_j = f ( Σ_i x_i * k_ij + b_j )    (1)

  where * denotes the convolution operation and f ( · ) is the non-linear transformation (ReLU);
- x_i and y_j are the i-th input feature map and the j-th output feature map, respectively;
- k_ij is the convolution kernel between the i-th input feature map and the j-th output feature map; and
- b_j is the bias of the j-th output feature map.
- the convolution operation can extract features from the input image, such as edges, curves and dots. These features are not predefined manually but are learned from the training data.
- when the convolution kernel k_ij operates on the marginal pixels of x_i, it exceeds the border of x_i. In this case, the values that exceed the border of x_i are set to 0 so as to make the operation valid. This operation is also called “padding”.
- the order of the above operations is: padding -> convolutions ->non-linear transformation (ReLU) .
- the input to “padding” is x i in equation (1) .
- Each step uses the output of the previous step.
- the non-linear transformation produces y_j in equation (1).
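The padding, convolution and ReLU steps described above can be sketched in NumPy. This is a simplified illustration of y_j = ReLU(Σ_i x_i * k_ij + b_j); the array layout, odd kernel sizes and "same"-size zero padding are assumptions of the sketch.

```python
import numpy as np

def conv_layer(x, k, b):
    """Forward pass of one convolutional layer.

    x: (I, H, W) input feature maps; k: (I, J, Kh, Kw) kernels with odd
    Kh, Kw; b: (J,) biases. Order of operations: padding -> convolution
    -> non-linear transformation (ReLU), as described in the text.
    """
    I, J, Kh, Kw = k.shape
    H, W = x.shape[1], x.shape[2]
    ph, pw = Kh // 2, Kw // 2
    # Padding: values that exceed the border of x_i are set to 0.
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    y = np.zeros((J, H, W))
    for j in range(J):
        for i in range(I):
            for u in range(H):
                for v in range(W):
                    # Sliding-window (cross-correlation) form of convolution.
                    y[j, u, v] += np.sum(xp[i, u:u + Kh, v:v + Kw] * k[i, j])
        y[j] += b[j]  # bias of the j-th output feature map
    return np.maximum(y, 0.0)  # non-linear transformation (ReLU)
```

Real implementations vectorize these loops, but the arithmetic is the same.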
- the max pooling layer keeps the maximum value in a local window and discards the other values, so the dimension of the output is smaller than that of the input; this may be formulated as

  y_i (m, n) = max { x_i (m·s + p, n·s + q) : 0 ≤ p < M, 0 ≤ q < N }    (2)
- each neuron in the i-th output feature map y_i pools over an M × N local region in the i-th input feature map x_i, with s as the step size.
- the spatial invariance means that if the input shifts by several pixels, the output of the layer won’t change much.
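The max-pooling operation described above can be sketched directly from its definition, with M × N windows and step size s:

```python
import numpy as np

def max_pool(x, M=2, N=2, s=2):
    """Max pooling: each output neuron keeps the maximum over an M x N
    local region of the input feature map x, moved with step size s."""
    H = (x.shape[0] - M) // s + 1
    W = (x.shape[1] - N) // s + 1
    y = np.zeros((H, W))
    for m in range(H):
        for n in range(W):
            # Maximum over the local window; other values are discarded.
            y[m, n] = x[m * s:m * s + M, n * s:n * s + N].max()
    return y
```

Because only the window maximum survives, shifting the input by a few pixels changes the output little, which is the spatial invariance noted above.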
- This layer normalizes the responses in local regions of input feature maps.
- the output dimensionality of this layer is equal to the input dimensionality.
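The normalization layer might be implemented as a cross-channel local response normalization. The text does not give a formula, so the AlexNet-style form and constants below are assumptions of this sketch:

```python
import numpy as np

def local_response_norm(x, local_size=5, k=2.0, alpha=1e-4, beta=0.75):
    """Cross-channel local response normalization (assumed AlexNet-style):
    each activation is divided by (k + alpha * sum of squares over
    `local_size` adjacent channels) ** beta.
    x: (C, H, W) feature maps; output has the same dimensionality."""
    C = x.shape[0]
    half = local_size // 2
    y = np.empty_like(x)
    for c in range(C):
        lo, hi = max(0, c - half), min(C, c + half + 1)
        denom = (k + alpha * np.sum(x[lo:hi] ** 2, axis=0)) ** beta
        y[c] = x[c] / denom  # responses normalized by their local region
    return y
```

As the text notes, the output dimensionality equals the input dimensionality.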
- the fully-connected layer takes the feature vector from the previous layer as input, computes the inner product between the feature x and the weights w, and then applies a non-linear transformation to the product, which may be formulated as

  y = ReLU ( w x )    (3)
- x denotes neural inputs (features) .
- y denotes neural outputs (features) in the current fully-connected layer.
- w denotes the neural weights in the current fully-connected layer. Neurons in the fully-connected layer linearly combine the features from the previous feature-extraction module, followed by the ReLU non-linearity.
- the fully-connected layer is configured to extract global features (features extracted from the entire input feature maps) from the previous layer.
- the fully-connected layer also performs feature dimension reduction by restricting the number of neurons in it.
- at least two fully-connected layers are provided so as to increase the nonlinearity of the neural network, which in turn makes fitting the data easier.
- the convolutional layer and the max pooling layer only provide local transformations, which means that they only operate on a local window of the input (local region of the input image) .
- the fully-connected layer provides a global transformation, which takes features from the whole space of the input image and conducts the transformation discussed in equation (3) above.
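The inner-product-plus-ReLU operation of the fully-connected layer reduces to one line; the bias term is omitted here because the text describes only weights:

```python
import numpy as np

def fc_layer(x, w):
    """Fully-connected layer: inner product of the feature vector x with
    the weight matrix w, followed by the ReLU non-linear transformation.
    Choosing fewer rows in w than entries in x performs the dimension
    reduction mentioned above."""
    return np.maximum(w @ x, 0.0)
```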
- the two branches then fuse together to one fully-connected layer.
- Conv (N, K, S) for convolutional layers with N outputs, kernel size K and stride size S
- Pool (T, K, S) for pooling layers with type T, kernel size K and stride size S
- Norm (K) for local response normalization layers with local size K
- FC (N) for fully-connected layers with N outputs
- the output fully-connected layers of the two branches are concatenated to form FC (8192), followed by FC (8192) -FC (94) -Sig, producing a plurality of (for example, 94) attribute probability predictions.
- the output of the final fully-connected layer may be 94 attributes, for example, {street, temple, ...} belonging to “where”, {star, protester, ...} belonging to “who”, and {walk, board, ...} belonging to “why”.
- the 94 attributes output from the final fully-connected layer may be of three types: “where” (e.g. street, temple, and classroom); “who” (e.g. star, protester, and skater); and “why” (e.g. walk, board, and ceremony).
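The fusion of the two branches into per-attribute probabilities can be sketched as follows. The dimensions below are deliberately tiny stand-ins for the actual FC (8192) and FC (94) sizes, and the weight shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_and_predict(f_app, f_motion, w_fuse, w_out):
    """Sketch of the fusion head: the two branches' fully-connected
    outputs are concatenated, passed through one more fully-connected
    layer with ReLU, and a sigmoid gives one independent probability per
    attribute (multi-label, unlike a single softmax category)."""
    f = np.concatenate([f_app, f_motion])  # concatenated branch features
    h = np.maximum(w_fuse @ f, 0.0)        # fused fully-connected layer
    return sigmoid(w_out @ h)              # per-attribute probabilities
```

A sigmoid per output is what allows co-occurring attributes such as “street”, “pedestrian” and “walk” to all be predicted with high probability at once.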
- the system 1000 may further comprise a training device 300.
- the training device 300 is used to train the convolutional neural network by using the following two inputs to obtain a fine-tuned convolutional neural network which produces predictions of crowd attributes:
- a pre-training set contains images with different objects and the corresponding ground truth object labels.
- the label set encompasses m object classes.
- a fine-tuning set contains crowd videos with appearance as well as motion channels, and the corresponding ground truth attribute labels.
- the label set encompasses n attribute classes.
- Fig. 6 is a schematic diagram illustrating a flow chart for constructing a network with the appearance and motion branches according to one embodiment of the present application.
- two convolutional neural networks are provided with the same structure but different numbers of branches: the first is used for pre-training with only one branch, and the second is used for fine-tuning with two branches.
- the first convolutional neural network, with one branch of convolutional neural layers, may be constructed by conventional means.
- the second convolutional neural network, with two branches of convolutional neural layers, is constructed based on the first convolutional neural network.
- at step S601, the training device 300 operates to pre-train the first convolutional neural network on an ImageNet detection task, which can be done by conventional means or algorithms.
- the network parameters of the appearance branch are initialized using the pre-trained model stated in step S601.
- the parameters may be randomly initialized.
- the input of the motion branch in the first convolutional neural network is replaced by the proposed motion distributions, i.e. , collectiveness distributions, stability distributions and conflict distributions.
- the network parameters of the motion branch of the first convolutional neural network with the proposed motion channels are randomly initialized without pre-training.
- the second convolution neural network with two branches (i.e. the appearance channel and the motion channels) is constructed.
- the second network is constructed by combining the first convolutional neural network initialized with the appearance parameters at step S602 and the first convolutional neural network initialized with the motion parameters at step S604, as shown in Fig. 6.
- Fig. 7 is a schematic diagram illustrating a flow chart for the training device 300 to fine-tune the second network using the appearance and motion channels of videos in the fine-tuning set.
- at step S701, parameters including the convolution filters, deformational layer weights, fully connected weights, and biases are initialized randomly by the training device 300.
- the training tries to minimize the loss function and can be divided into many updating steps. Therefore, at step S702, the loss is calculated; then, at step S703, the algorithm calculates the gradients with respect to all the neural network parameters, including the convolution filters, deformational layer weights, fully connected weights, and biases, based on the calculated loss.
- the gradient of any network parameters can be calculated with the chain rule.
- the output of a layer L_k in the network can be expressed by a general function

  y_k = f_k ( y_k-1 , w_k )

  where y_k is the output of the layer L_k, y_k-1 is the output of the previous layer L_k-1, w_k denotes the weights of L_k, and f_k is the function computed by L_k.
- the derivatives of y_k with respect to y_k-1 and w_k are both known.
- the loss function C of the network is defined on the output y_n of the last layer L_n and the ground-truth label t, i.e.

  c = C ( y_n , t )

- the derivative of c with respect to y_n is also known.
- the chain rule can then be applied:

  ∂c / ∂w_k = ( ∂c / ∂y_n ) · ( ∂y_n / ∂y_n-1 ) · … · ( ∂y_k+1 / ∂y_k ) · ( ∂y_k / ∂w_k )

- thus, the gradient of the cost c with respect to any weights in the network can be calculated.
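The chain-rule gradient computation described above can be illustrated on a tiny two-layer network. This is only a sketch: a squared-error loss and a two-layer ReLU network stand in for the actual architecture.

```python
import numpy as np

def forward(x, w1, w2):
    h = np.maximum(w1 @ x, 0.0)  # layer 1: y_1 = f_1(x, w_1), ReLU
    y = w2 @ h                   # layer 2: y_2 = f_2(y_1, w_2), linear
    return h, y

def gradients(x, t, w1, w2):
    """Chain-rule (backpropagation) gradients of the loss
    c = 0.5 * ||y_2 - t||^2 with respect to both layers' weights."""
    h, y = forward(x, w1, w2)
    dy = y - t                    # dc/dy_2, known from the loss
    dw2 = np.outer(dy, h)         # dc/dw_2
    dh = (w2.T @ dy) * (h > 0)    # dc/dy_1, through the ReLU
    dw1 = np.outer(dh, x)         # dc/dw_1, completing the chain
    return dw1, dw2
```

A finite-difference check confirms the analytic gradients, which is the standard way to validate a backpropagation implementation.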
- at step S704, the algorithm updates the convolution filters, deformational layer weights, fully connected weights, and biases by the rule

  w_k ← w_k − η · ∂c / ∂w_k
- η is the learning rate, a predefined value.
- updates of the parameters are performed using the product of the prefixed learning rate and the corresponding gradients.
- at step S705, it is determined whether the stopping criterion is satisfied: for example, if the variation of the loss is less than a predetermined value, the process terminates; otherwise, the process returns to step S702.
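The loop of steps S702 to S705 (loss, gradients, update, stopping check) can be sketched generically; the quadratic toy loss in the usage line is only a stand-in for the network's loss function:

```python
import numpy as np

def train(w, grad_fn, loss_fn, eta=0.1, tol=1e-6, max_steps=10000):
    """Training loop sketch: repeatedly compute the loss and gradients,
    update the parameters with w <- w - eta * dc/dw, and stop when the
    variation of the loss is less than the predetermined value tol."""
    prev = loss_fn(w)
    for _ in range(max_steps):
        w = w - eta * grad_fn(w)   # update: learning rate times gradient
        cur = loss_fn(w)
        if abs(prev - cur) < tol:  # stopping criterion on loss variation
            break
        prev = cur
    return w

# Usage sketch: minimizing a simple quadratic as a stand-in for the loss.
w_star = train(np.array([5.0]),
               grad_fn=lambda w: 2 * w,
               loss_fn=lambda w: float(np.sum(w ** 2)))
```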
- the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects, which may all generally be referred to herein as a “unit”, “circuit”, “module” or “system”.
- the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects.
- the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
- Fig. 8 illustrates a system 3000 for predicting crowd attributes according to one embodiment of the present application, in which the functions of the present invention are carried out by the software.
- the system 3000 comprises a memory 3001 that stores executable components and a processor 3002, electrically coupled to the memory 3001 to execute the executable components to perform operations of the system 3000.
- the executable components may comprise: a feature extracting component 3003 obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction component 3004 predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
- the functions of the components 3003 and 3004 are similar to those of the devices 100 and 200, respectively, and thus detailed descriptions thereof are omitted herein.
Abstract
Disclosed is a system for predicting crowd attributes, comprising: a feature extracting device obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction device in electronic communication with the feature extracting device and predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
Description
The disclosures relate to a system for predicting crowd attributes and a method thereof.
During the last decade, the field of crowd analysis has seen a remarkable evolution in crowded scene understanding, including crowd behavior analysis, crowd tracking, and crowd segmentation. Much of this progress was sparked by the creation of crowd datasets as well as by new and robust features and models for profiling intrinsic crowd properties. Most of the above studies on crowd understanding are scene-specific; that is, the crowd model is learned from a specific scene and thus generalizes poorly to other scenes. Attributes are particularly effective at characterizing generic properties across scenes.
In recent years, studies in attribute-based representations of objects, faces, actions, and scenes have drawn much attention as an alternative or complement to categorical representations, as they characterize the target subject by several attributes rather than by discriminative assignment to a single specific category, which is too restrictive to describe the nature of the target subject. Furthermore, scientific studies have shown that different crowd systems share similar principles that can be characterized by some common properties or attributes. Indeed, attributes can express more information about a crowd video, as they can describe a video by answering “Who is in the crowd?”, “Where is the crowd?”, and “Why is the crowd here?”, rather than merely assigning a categorical scene label or event label to it. For instance, an attribute-based representation might describe a crowd video as the “conductor” and “choir” performing on the “stage” with the “audience” “applauding”, in contrast to a categorical label like “chorus”. Recently, some works have made efforts on crowd attribute profiling, but the number of attributes in these works is limited (only four or fewer), and the datasets are also small in terms of scene diversity.
Summary
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. This summary is not an extensive overview of the disclosure. It is intended to neither identify key or critical elements of the disclosure nor delineate any scope of particular embodiments of the disclosure, or any scope of the claims. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect, disclosed is a system for predicting crowd attributes, comprising: a feature extracting device obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction device in electronic communication with the feature extracting device and predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
In yet another aspect, disclosed is a method for understanding crowd scenes, comprising: obtaining a video with crowd scenes; extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
In yet another aspect, disclosed is a system for predicting crowd attributes, comprising:
a memory that stores executable components; and
a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein, the executable components comprise:
a feature extracting component obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and
a prediction component predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
In one embodiment, the prediction device/component is configured with a convolutional neural network having:
a first branch configured to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features; and
a second branch configured to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features,
wherein the predicted features from the first branch and the predicted features from the second branch are fused together to form a prediction of the attributes of the crowd in the video.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating a system for predicting crowd attributes according to an embodiment of the present application.
Fig. 2 is a schematic diagram illustrating a flow chart for the system according to one embodiment of the present application.
Fig. 3 illustrates a schematic block diagram of the feature extracting device according to an embodiment of the present application.
Fig. 4 is a schematic diagram illustrating motion channels in scenarios consistent with some disclosed embodiments.
Fig. 5 is a schematic diagram illustrating a convolutional neural network structure included in the prediction device according to some disclosed embodiments.
Fig. 6 is a schematic diagram illustrating a flow chart for constructing a network with the appearance and motion branches according to one embodiment of the present application.
Fig. 7 is a schematic diagram illustrating a flow chart for the training device to fine-tune the second network using the appearance and motion channels of videos in the fine-tuning set.
Fig. 8 illustrates a system for predicting crowd attributes according to one embodiment of the present application, in which the functions of the present invention are carried out by the software.
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 illustrates a system 1000 for predicting crowd attributes. The proposed system 1000 is capable of understanding crowded scenes in computer vision at the attribute level, characterizing a crowded scene by predicting a plurality of attributes rather than discriminatively assigning it to a single specific category. This is significant in many applications, e.g., video surveillance and video search engines.
As shown in Fig. 1, the system 1000 comprises a feature extracting device 100 and a prediction device 200. Fig. 2 is a schematic diagram illustrating a flow chart for the system 1000 according to one embodiment of the present application. At step S201, the feature extracting device 100 obtains a video with crowd scenes and extracts appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and then at step S202, the prediction device 200 predicts attributes of the crowd in the video based on the extracted motion features and the extracted appearance features, which will be further discussed later.
In one example of the present application, the feature extracting device 100 may deeply learn appearance and motion representations across different crowded scenes. Fig. 3 illustrates a schematic block diagram of the feature extracting device 100 according to an embodiment of the present application. The feature extracting device 100 comprises an appearance feature extracting unit 101 configured to extract the RGB components of each frame from the input video.
The feature extracting device 100 further comprises a motion feature extracting unit 102 to extract motion features from the obtained video. To be specific, the motion feature extracting unit 102 further comprises a tracklet detection module 1021 to detect crowd tracklets (i.e., short trajectories) for each frame in the obtained video with crowd scenes. For example, the tracklet detection module 1021 may utilize the well-known KLT feature point tracker to detect several key points for each frame in the obtained video. To be specific, the detected key points are tracked with the matching algorithm predefined by the KLT, and the corresponding key points across consecutive frames are matched to extract the tracklets. In the non-limiting embodiments of the present application, a plurality of key points are detected on each person in the crowd in each frame. In a preferred embodiment, each of the motion features is computed on a certain number of (for example, 75) frames of the obtained video.
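As an illustration of the tracklet extraction step, the following sketch links key points across consecutive frames into short trajectories. It is a deliberately simplified, NumPy-only stand-in for the KLT matching described above (greedy nearest-neighbor linking with a hypothetical displacement threshold), not the KLT algorithm itself:

```python
import numpy as np

def link_tracklets(frames_points, max_disp=5.0):
    """Greedily link key points across consecutive frames into tracklets.

    A simplified stand-in for KLT matching: each active point in frame t is
    linked to its nearest point in frame t+1 if it moved less than max_disp
    pixels. frames_points: list of (N_t, 2) arrays of key-point coordinates.
    Returns a list of tracklets, each an array of (x, y) positions over time.
    """
    tracklets = [[p] for p in frames_points[0]]
    active = list(range(len(tracklets)))
    for pts in frames_points[1:]:
        used, still_active = set(), []
        for ti in active:
            last = tracklets[ti][-1]
            d = np.linalg.norm(pts - last, axis=1)
            j = int(np.argmin(d))
            if d[j] < max_disp and j not in used:
                tracklets[ti].append(pts[j])
                used.add(j)
                still_active.append(ti)
        active = still_active
    return [np.array(t) for t in tracklets]

# Two key points drifting right-down by 1 px per frame over 3 frames
frames = [np.array([[0.0, 0.0], [10.0, 10.0]]) + i for i in range(3)]
tracks = link_tracklets(frames)
```

In a real deployment the per-frame key points and their matches would come from a KLT implementation; the greedy linker above only demonstrates how matched points accumulate into short trajectories.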
The motion feature extracting unit 102 further comprises a motion distribution determination module 1022 to compute physical relationships between each tracklet and its neighbors to determine the motion distributions in each frame. The scene-independent properties for groups in a crowd exist in the whole scene space and can be quantified at the scene level.
According to one embodiment, three properties, namely collectiveness, stability and conflict, are computed for the frames. The collectiveness indicates the degree to which individuals in the whole scene act as a union in collective motion; the stability characterizes whether the whole scene can keep its topological structures; and the conflict measures the interaction/friction between each pair of nearest neighbors of interest points.
Examples shown in Fig. 4 illustrate each property intuitively. Referring to Fig. 4, for each channel, two examples are shown in the first and second rows.
Individuals in a crowd moving randomly indicate low collectiveness, while coherent crowd motion reveals high collectiveness. In Fig. 4-a, people in the crowded scene walk randomly toward different destinations, and thus exhibit low collectiveness. In Fig. 4-b, a marathon video has people running coherently towards the same destination, exhibiting high collectiveness.
Individuals have low stability if their topological structure changes a lot, and high stability if it changes little. In Fig. 4-c, skate dancers change their formation considerably from the first frame to the fiftieth frame, which means low stability; while in Fig. 4-d, the dancers in the bottom example keep their topological formation unchanged and exhibit high stability.
Conflict occurs when individuals move towards different directions. In Fig. 4-e, there is one group of horse-riders parading without any friction with other groups; while in Fig. 4-f, several groups of people cross paths while walking, generating conflict with each other.
The present application is not restricted to the proposed three properties; other properties can be generated if required.
In one example of the present application, the motion distribution determination module 1022 operates to define a K-NN graph G (V, E) for the whole point set of the tracklets detected by the tracklet detection module 1021, whose vertices V represent the tracklet points, and tracklet point pairs are connected by edges E. We denote the set of nearest neighbors of a tracklet point z∈V as N (z) at every frame of a given video clip.
The motion distribution module 1022 then extracts three motion maps namely collectiveness distribution, stability distribution, and conflict distribution for each frame.
The collectiveness distribution (or map) can be computed by integrating path similarities among crowds on the collective manifold. B. Zhou, X. Tang, H. Zhang, and X. Wang have proposed the Collective Merging algorithm to detect collective motions from random motions by modeling collective motions on the manifold in "Measuring crowd collectiveness" (TPAMI, 36 (8): 1586-1599, 2014).
The stability distribution is extracted by counting and averaging the number of invariant neighbors of each point in the K-NN graph.
For each member i, its K-NN set is N (i) in the first frame and N_τ (i) in the τ-th frame. It has high stability if its neighbor sets vary little across frames. Thus, the larger the variation between N (i) and N_τ (i) is, the lower stability the member has.
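A minimal sketch of such a per-point stability score, assuming stability is taken as the fraction of K nearest neighbors that remain unchanged between two frames (the embodiment's exact formulation may differ):

```python
import numpy as np

def knn_sets(points, k):
    """Index sets of the k nearest neighbors of each point (excluding itself)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return [set(np.argsort(row)[:k]) for row in d]

def stability(points_t0, points_t1, k=3):
    """Per-point stability: fraction of k-NN neighbors that are invariant
    between the first and the tau-th frame (1.0 = fully stable)."""
    n0, n1 = knn_sets(points_t0, k), knn_sets(points_t1, k)
    return np.array([len(a & b) / k for a, b in zip(n0, n1)])

# A rigid translation keeps every neighborhood intact -> stability 1 everywhere
pts = np.random.RandomState(0).rand(10, 2)
s = stability(pts, pts + 0.5, k=3)
```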
The conflict distribution is extracted by computing the velocity correlation between each pair of nearby tracklet points {z, z*} within the K-NN graph.
For each member i, if the velocity of each member in its K-NN set is similar to its own, it will have low conflict; that is, its neighbors move coherently with it without generating conflict.
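A minimal sketch of a conflict score along these lines, assuming conflict is measured as one minus the cosine similarity of neighboring velocities, rescaled to [0, 1] (the embodiment's exact velocity-correlation measure may differ):

```python
import numpy as np

def conflict(points, velocities, k=3):
    """Per-point conflict: average (1 - cosine similarity)/2 between a point's
    velocity and the velocities of its k nearest neighbors.
    0 = neighbors move coherently, 1 = neighbors move in opposite directions."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    v = velocities / np.linalg.norm(velocities, axis=1, keepdims=True)
    out = []
    for i, row in enumerate(d):
        nbrs = np.argsort(row)[:k]
        cos = v[nbrs] @ v[i]
        out.append(float(np.mean((1.0 - cos) / 2.0)))
    return np.array(out)

# Everyone moving the same way -> zero conflict everywhere
pts = np.random.RandomState(1).rand(8, 2)
vel = np.tile([1.0, 0.0], (8, 1))
c = conflict(pts, vel, k=3)
```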
Returning to Fig. 3, the motion feature extracting unit 102 further comprises a continuous motion channel generation module 1023 to average the per-frame motion maps (for example, the collectiveness maps, the stability maps and the conflict maps) across the temporal domain, and to interpolate the sparse tracklet points to output three complete and continuous motion channels. Although a single frame owns tens or hundreds of tracklets, the total tracklet points are still sparse. A Gaussian kernel can be utilized to interpolate the averaged motion maps to get continuous motion channels.
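The Gaussian-kernel interpolation can be sketched as a normalized weighted average of the sparse tracklet-point values over a dense pixel grid, with a hypothetical kernel width sigma:

```python
import numpy as np

def densify(points, values, shape, sigma=4.0):
    """Interpolate sparse per-tracklet-point values into a dense motion
    channel using a Gaussian kernel (normalized weighted average)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    num = np.zeros(shape)
    den = np.zeros(shape)
    for (px, py), val in zip(points, values):
        # Gaussian weight of every pixel with respect to this tracklet point
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        num += g * val
        den += g
    return num / np.maximum(den, 1e-12)

# Two sparse points with values 0 and 1 spread into a continuous 10x10 channel
channel = densify([(2, 2), (7, 7)], [0.0, 1.0], (10, 10))
```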
Returning to Fig. 1, the system 1000 further comprises a prediction device 200. The prediction device 200 is electronically communicated with the feature extracting device 100 and is configured to obtain the appearance features of the video, receive the extracted motion features from the feature extracting device 100, and predict attributes of the crowd in the video based on the received motion features and/or the obtained appearance features. With this function, it can effectively detect the attributes, including the roles of people, their activities and the locations, from crowd videos, so as to describe the content of the crowd videos. Therefore, crowd videos with the same set of attributes can be retrieved, and the similarity of different crowd videos can be measured by their attribute sets. Furthermore, there are a large number of possible interactions among these attributes: some attributes are likely to be detected simultaneously, while some are mutually exclusive. For example, the scenario attribute "street" is likely to co-occur with the subject "pedestrian" when the subject is "walking", and also likely to co-occur with the subject "mob" when the subject is "fighting", but is not related to the subject "swimmer" because the subject cannot "swim" on a "street".
From a model perspective, the prediction device 200 may be configured as a model with the convolutional neural network structure shown in Fig. 5. For purposes of illustration, Fig. 5 shows two branches included in the convolutional neural network structure. However, the number of branches is not limited to the proposed two, and it can be generalized to more branches. The number of each type of layers and the number of parameters can also be tuned according to different tasks and objectives.
As shown in Fig. 5, the network comprises: one or more data layers 501, one or more convolution layers 502, one or more max/sum pooling layers 503, one or more normalization layers 504 and a fully-connected layer 505.
In this exemplified embodiment as shown in Fig. 5, the data layer of the top appearance branch contains the RGB components (or channels) of the images and their labels (for example, of dimension 94), and the data layer of the bottom motion branch contains at least one motion feature (for example, the proposed three motion channels as discussed above: the collectiveness, the stability and the conflict) and their labels, which are the same as the labels of the top branch.
Specifically, this layer 501 provides images {xi} and their labels {yi}, where xij is the j-th bit value of the d-dimension feature vector of the i-th input image region, and yij is the j-th bit value of the n-dimension label vector of the i-th input image region.
The convolution layer 502 receives the output from the data layer 501 and performs convolution, padding, and non-linear transformation operations.
The operations in each convolutional layer may be expressed as

xi′ = padding (xi) 1)

yj′ = bj + Σi kij * xi′ 2)

yj = max (0, yj′) 3)

where,
xi and yj are the i-th input feature map and the j-th output feature map, respectively;
kij is the convolution kernel between the i-th input feature map and the j-th output feature map;
* denotes convolution;
bj is the bias of the j-th output feature map; and
ReLU nonlinearity y=max (0, x) is used for neurons.
The convolution operation can extract features from the input image, such as edges, curves, dots, etc. These features are not predefined manually but are learned from the training data.
When the convolution kernel kij operates on the marginal pixels of xi, it will exceed the border of xi. In this case, it sets the values that exceed the border of xi to be 0 so as to make the operation valid. This operation is also called “padding” .
The order of the above operations is: padding -> convolution -> non-linear transformation (ReLU). The input to "padding" is xi in equation 1). Each step uses the output of the previous step. The non-linear transformation produces yj in equation 3).
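The padding -> convolution -> ReLU pipeline can be sketched in NumPy as follows; this is a naive loop implementation for clarity, with hypothetical shapes, not an optimized convolution:

```python
import numpy as np

def conv_layer(x_maps, kernels, biases, pad=1):
    """Padding -> convolution -> ReLU for one convolutional layer.
    x_maps: (C_in, H, W) input feature maps.
    kernels: (C_out, C_in, K, K) convolution kernels.
    biases: (C_out,) one bias per output feature map.
    """
    c_in, h, w = x_maps.shape
    c_out, _, kk, _ = kernels.shape
    # padding: values beyond the border of x_i are set to 0
    xp = np.pad(x_maps, ((0, 0), (pad, pad), (pad, pad)))
    oh, ow = h + 2 * pad - kk + 1, w + 2 * pad - kk + 1
    y = np.zeros((c_out, oh, ow))
    for j in range(c_out):
        for r in range(oh):
            for c in range(ow):
                patch = xp[:, r:r + kk, c:c + kk]
                y[j, r, c] = biases[j] + np.sum(kernels[j] * patch)
    return np.maximum(y, 0.0)  # ReLU non-linearity y = max(0, x)

# One 4x4 input map of ones, two 3x3 all-ones kernels, zero bias
x = np.ones((1, 4, 4))
k = np.ones((2, 1, 3, 3))
b = np.zeros(2)
y = conv_layer(x, k, b, pad=1)
```

Interior outputs sum a full 3x3 window of ones (9), while corner outputs only overlap four input pixels (4), showing the effect of zero padding at the border.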
Max pooling layer 503:
This layer keeps the maximum value in a local window and discards the other values; the dimension of the output is thus smaller than that of the input. The operation may be formulated as

yi (m, n) = max { xi (m·s + u, n·s + v) : 0 ≤ u < M, 0 ≤ v < N } 4)
where each neuron in the i-th output feature map yi pools over an M×N local region in the i-th input feature map xi , with s as the step size.
In other words, it reduces the feature dimensions and provides spatial invariance. The spatial invariance means that if the input shifts by several pixels, the output of the layer won’t change much.
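A minimal NumPy sketch of the max pooling operation with an M×N window and step size s, as described above:

```python
import numpy as np

def max_pool(x, m=2, n=2, s=2):
    """Max pooling: each output neuron keeps the maximum over an MxN local
    window of the input feature map, moved with step size s."""
    h, w = x.shape
    oh, ow = (h - m) // s + 1, (w - n) // s + 1
    y = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            y[i, j] = x[i * s:i * s + m, j * s:j * s + n].max()
    return y

# A 4x4 map pooled with a 2x2 window and stride 2 becomes 2x2
x = np.arange(16.0).reshape(4, 4)
y = max_pool(x, 2, 2, 2)
```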
Normalization layer 504:
This layer normalizes the responses in local regions of input feature maps. The output dimensionality of this layer is equal to the input dimensionality.
Fully-connected layer 505:
This layer takes the feature vector from the previous layer as input and computes the inner product between the feature x and the weights w; one non-linear transformation is then applied to the product, which may be formulated as

y = max (0, w^T x) 5)
Where,
x denotes neural inputs (features) .
y denotes neural outputs (features) in the current fully-connected layer.
w denotes neural weights in current fully-connected layer. Neurons in the fully-connected layer linearly combine features in previous feature extraction module, followed by ReLU non-linearity.
The fully-connected layer is configured to extract global features (features extracted from the entire input feature maps) from the previous layer. The fully-connected layer also has the function of feature dimension reduction by restricting the number of neurons in it. In one embodiment of the present application, there are provided at least two fully-connected layers so as to increase the nonlinearity of the neural network, which in turn makes fitting the data easier.
The convolutional layer and the max pooling layer only provide local transformations, which means that they only operate on a local window of the input (a local region of the input image). However, the fully-connected layer provides a global transformation, which takes features from the whole space of the inputted image and conducts a transformation as discussed in equation 5) above.
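A one-line sketch of the fully-connected layer as an inner product followed by ReLU; note how restricting the rows of w (here 2 rows for a 3-dimensional input) also reduces the feature dimension, as discussed above:

```python
import numpy as np

def fc_layer(x, w):
    """Fully-connected layer: inner product of the feature vector with the
    weights, followed by ReLU non-linearity (y = max(0, w @ x))."""
    return np.maximum(w @ x, 0.0)

# 3-dimensional input reduced to 2 outputs; negative activation clipped by ReLU
x = np.array([1.0, -2.0, 3.0])
w = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y = fc_layer(x, w)
```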
In the end, the two branches fuse together into one fully-connected layer. If simple notations are used to represent parameters in the networks: (1) Conv (N, K, S) for convolutional layers with N outputs, kernel size K and stride size S; (2) Pool (T, K, S) for pooling layers with type T, kernel size K and stride size S; (3) Norm (K) for local response normalization layers with local size K; (4) FC (N) for fully-connected layers with N outputs; and (5) ReLU for the rectified linear unit and Sig for the sigmoid activation functions, then, taking Conv (96, 7, 2) (N=96, K=7, S=2) as an example of the notation, each of the two branches has parameters: Conv (96, 7, 2) -ReLU-Pool (3, 2) -Norm (5) -Conv (256, 5, 2) -ReLU-Pool (3, 2) -Norm (5) -Conv (384, 3, 1) -ReLU-Conv (384, 3, 1) -ReLU-Conv (256, 3, 1) -ReLU-Pool (3, 2) -FC (4096) .
The output fully-connected layers of the two branches are concatenated to form FC (8192) . Finally, we have FC (8192) -FC (94) -Sig, producing a plurality of (for example, 94) attribute probability predictions. In one embodiment of the present application, the output of the FC 505 may be 94 attributes, for example, {street, temple, ... } belonging to "where", {star, protester, ... } belonging to "who", and {walk, board, ... } belonging to "why". Accordingly, the 94 attributes outputted from the FC 505 may be of three types: "where" (e.g. street, temple, and classroom) ; "who" (e.g. star, protester, and skater) ; and "why" (e.g. walk, board, and ceremony) .
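The spatial size of the feature maps along one branch can be traced with the standard output-size formula. The input resolution and padding are not stated in this document, so a 224×224 input with no padding is assumed purely for illustration:

```python
def out_size(n, k, s, p=0):
    """Spatial size after a conv/pool layer with kernel k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1

# Trace one branch: Conv(96,7,2)-Pool(3,2)-Conv(256,5,2)-Pool(3,2)-
# Conv(384,3,1)-Conv(384,3,1)-Conv(256,3,1)-Pool(3,2), zero padding assumed
n = 224  # hypothetical input resolution
for k, s in [(7, 2), (3, 2), (5, 2), (3, 2), (3, 1), (3, 1), (3, 1), (3, 2)]:
    n = out_size(n, k, s)
```

Under these assumptions the branch ends with small spatial maps (2×2 here) of 256 channels, which are then flattened into the FC (4096) layer.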
Returning to Fig. 1, the system 1000 may further comprise a training device 300. The training device 300 is used to train the convolutional neural network by using the following two inputs to obtain a fine-tuned convolutional neural network which produces predictions of crowd attributes:
i. A pre-training set contains images with different objects and the corresponding ground truth object labels. The label set encompasses m object classes.
ii. A fine-tuning set contains crowd videos with appearance as well as motion channels, and the corresponding ground truth attribute labels. The label set encompasses n attribute classes.
Fig. 6 is a schematic diagram illustrating a flow chart for constructing a network with the appearance and motion branches according to one embodiment of the present application.
In this embodiment, two convolutional neural networks are provided with the same structure but different numbers of branches: the first one is used for pre-training with only one branch, and the second one is used for fine-tuning with two branches. The first convolutional neural network with one branch of convolutional neural layers may be constructed by conventional means. The second convolutional neural network with two branches of convolutional neural layers is constructed based on the first convolutional neural network.
As shown, at step S601, the device 300 operates to pre-train the first convolutional neural network with the ImageNet detection task, which can be done by conventional means or algorithms.
At step S602, the network parameters of the appearance branch are initialized using the pre-trained model stated in step S601.
At step S603, the input of the motion branch in the first convolutional neural network is replaced by the proposed motion distributions, i.e. , collectiveness distributions, stability distributions and conflict distributions.
At step S604, the network parameters of the motion branch of the first convolutional neural network with the proposed motion channels are randomly initialized without pre-training.
At step S605, the second convolutional neural network with two branches (i.e., the appearance channel and the motion channels) is constructed. In particular, the second network is constructed by combining the first convolutional neural network initialized with the appearance parameters at step S602 and the first convolutional neural network initialized with the motion parameters at step S604, as shown in Fig. 6.
Fig. 7 is a schematic diagram illustrating a flow chart for the training device 300 to fine-tune the second network using the appearance and motion channels of videos in the fine-tuning set.
At step S701, parameters, including the convolution filters, deformational layer weights, fully connected weights, and bias are initialized randomly by the training device 300. The training tries to minimize the loss function and can be divided into many updating steps. Therefore, at step S702, the loss is calculated, and then at step S703, the algorithm calculates the gradient with respect to all the neural network parameters based on the calculated loss, including the convolution filters, deformational layer weights, fully connected weights, and bias.
The gradient of any network parameters can be calculated with the chain rule. Suppose the network has n layers and they are denoted by Li, i=1, 2, ... , n. The output of a layer Lk in the network can be expressed by a general function
yk=fk (yk-1, wk) 6)
where yk is the output of the layer Lk, yk-1 is the output of the previous layer Lk-1, wk denotes the weights of Lk, and fk is the function for Lk. The derivatives of yk with respect to yk-1 and wk are both known. The loss function C of the network is defined on the output of the last layer Ln and the ground truth label t,
c=C (yn, t) 7)
The derivative of c with respect to yn is also known. To calculate the gradient of c with respect to the weights wn, the chain rule can be applied:

∂c/∂wn = (∂c/∂yn) · (∂yn/∂wn) 8)

To calculate the gradient of c with respect to yk, the chain rule can also be applied:

∂c/∂yk = (∂c/∂yk+1) · (∂yk+1/∂yk) 9)

which is recursive. To calculate the gradient of c with respect to an arbitrary weight wk, we can use

∂c/∂wk = (∂c/∂yk) · (∂yk/∂wk) 10)
In this procedure, the gradient of the cost c with respect to any weights in the network can be calculated.
At step S704, the algorithm updates the convolution filters, deformational layer weights, fully connected weights, and bias by the rule of

wk ← wk − η · ∂c/∂wk 11)

where η is the learning rate, a predefined value.
Updates of the parameters are performed using the product of a prefixed learning rate and the corresponding gradients.
At step S705, it is determined whether the stopping criterion is satisfied. For example, if the variation of the loss is less than a predetermined value, the process terminates; otherwise, the process returns to step S702.
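The training loop of steps S701-S705 can be sketched on a toy one-parameter model. The structure (random initialization, loss, gradient, update with a prefixed learning rate, loss-variation stopping criterion) mirrors the flow chart above, while the model itself is hypothetical:

```python
import numpy as np

def train(x, t, lr=0.1, tol=1e-8, max_steps=10000):
    """Minimal gradient-descent loop mirroring steps S701-S705 on a toy
    scalar linear model y = w * x with squared loss."""
    rng = np.random.RandomState(0)
    w = rng.randn()                      # S701: random initialization
    prev_loss = np.inf
    for _ in range(max_steps):
        y = w * x
        loss = np.mean((y - t) ** 2)     # S702: compute the loss
        grad = np.mean(2 * (y - t) * x)  # S703: gradient via the chain rule
        w -= lr * grad                   # S704: update = lr * gradient
        if abs(prev_loss - loss) < tol:  # S705: loss-variation stopping test
            break
        prev_loss = loss
    return w, loss

x = np.array([1.0, 2.0, 3.0])
t = 2.0 * x                              # ground truth corresponds to w = 2
w, loss = train(x, t)
```

The loop recovers w ≈ 2 and terminates once the loss change falls below the predetermined tolerance, exactly the criterion described at step S705.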
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, with hardware aspects that may all generally be referred to herein as a "unit", "circuit," "module" or "system." Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with or in integrated circuits (ICs), such as a digital signal processor and software therefor, or application-specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments.
In addition, the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium. Fig. 8 illustrates a system 3000 for predicting crowd attributes according to one embodiment of the present application, in which the functions of the present invention are carried out by software. Referring to Fig. 8, the system 3000 comprises a memory 3001 that stores executable components and a processor 3002, electrically coupled to the memory 3001 to execute the executable components to perform operations of the system 3000. The executable components may comprise: a feature extracting component 3003 obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction component 3004 predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features. The functions of the components 3003 and 3004 are similar to those of the devices 100 and 200, respectively, and thus the detailed descriptions thereof are omitted herein.
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon learning of the basic inventive concept. The appended claims are intended to be construed as comprising the preferred examples and all the variations or modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and equivalent techniques, they also fall within the scope of the present invention.
Claims (20)
- A system for predicting crowd attributes, comprising:
a feature extracting device obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and
a prediction device being electronically communicated with the feature extracting device and predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
- The system according to claim 1, wherein the feature extracting device further comprises a motion feature extracting unit comprising:
a tracklet detection module (1021) detecting short trajectories of the crowd in each frame of the video;
a motion maps determination module (1022) computing physical relationships between each of the short trajectories and its neighbors to determine one or more motion distributions for the crowd in each frame of the video; and
a continuous motion channel generation module (1023) averaging the determined motion distributions across the temporal domain, and interpolating one or more sparse short trajectory points into the averaged distributions to form one or more continuous motion channels forming the motion features.
- The system according to claim 2, wherein the motion distribution comprises at least one of:
a collectiveness distribution that indicates a degree of individuals in a whole scene acting as a union in a collective motion;
a stability distribution that indicates whether the whole scene can keep a topological structure for the crowd in the whole scene; and
a conflict distribution that indicates an interaction/friction between each pair of nearest neighbors of the short trajectories for the crowd in the scene.
- The system according to claim 1, wherein the predicted attributes at least indicate a role of the people in the crowd, a place of the crowd and a reason why people are in the crowd.
- The system according to any one of claims 1-4, wherein the prediction device is configured with a convolutional neural network having:
a first branch configured to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features; and
a second branch configured to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features,
wherein the predicted crowd attributes from the first branch and the predicted crowd attributes from the second branch are fused together to output a prediction of the attributes of the crowd in the video.
- The system according to claim 5, further comprising a training device for training the second neural network by:
randomly initializing parameters for the second neural network;
calculating a loss of the parameters in the second neural network;
calculating a gradient with respect to all said parameters based on the calculated loss;
updating the parameters by using a product of a prefixed learning rate and the corresponding gradients;
determining if a stopping criterion is satisfied; and
if not, returning to the step of calculating the loss.
- The system according to claim 6, wherein the training device trains the first neural network by:
initializing parameters for the first neural network with pre-trained data sets;
calculating a loss of the parameters in the first neural network;
calculating a gradient with respect to all said parameters based on the calculated loss;
updating the parameters by using a product of a prefixed learning rate and the corresponding gradients;
determining if a stopping criterion is satisfied; and
if not, returning to the step of calculating the loss.
- The system according to claim 7, wherein the trained first neural network and the trained second neural network are connected together, and the training device further inputs a fine-tuning set into the connected networks to fine-tune the connected networks.
- A method for understanding a crowd scene, comprising:
obtaining a video with crowd scenes;
extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and
predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
- The method according to claim 9, wherein the extracting further comprises:
detecting short trajectories of the crowd in the video frames;
computing physical relationships between each of the short trajectories and its neighbors to determine one or more motion distributions for the crowd in each frame; and
averaging the determined motion distributions across the temporal domain, and interpolating one or more sparse short trajectory points into the averaged distributions to form one or more continuous motion channels forming the motion features.
- The method according to claim 10, wherein the motion distribution comprises at least one of:
a collectiveness distribution that indicates a degree of individuals in a whole scene acting as a union in a collective motion;
a stability distribution that indicates whether the whole scene can keep a topological structure for the crowd in the whole scene; and
a conflict distribution that indicates an interaction/friction between each pair of nearest neighbors of the short trajectories for the crowd in the whole scene.
- The method according to claim 9, wherein the predicted attributes at least indicate a role of the people in the crowd, a place of the crowd and a reason why people are in the crowd.
- The method according to any one of claims 9-12, wherein the predicting further comprises:
configuring a first branch to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features;
configuring a second branch to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features; and
connecting the predicted crowd attributes from the first branch and the predicted crowd attributes from the second branch to output a prediction of the attributes of the crowd in the video.
- The method according to claim 13, further comprising: randomly initializing parameters for the second neural network; calculating a loss of the parameters in the second neural network; calculating a gradient with respect to all said parameters based on the calculated loss; updating the parameters by using a product of a prefixed learning rate and the corresponding gradients; determining if a stopping criterion is satisfied; and if not, returning to the step of calculating the loss.
- The method according to claim 14, further comprising: initializing parameters for the first neural network with pre-trained data sets; calculating a loss of the parameters in the first neural network; calculating a gradient with respect to all said parameters based on the calculated loss; updating the parameters by using a product of a prefixed learning rate and the corresponding gradients; determining if a stopping criterion is satisfied; and if not, returning to the step of calculating the loss.
- The method according to claim 15, further comprising: connecting the trained first neural network and the trained second neural network together; and fine-tuning the connected networks by inputting a fine-tuning set into the connected networks.
- A system for predicting crowd attributes, comprising: a memory that stores executable components; and a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein the executable components comprise: a feature extracting component obtaining a video with crowd scenes and extracting appearance features and motion features from the obtained video, wherein the motion features are scene-independent and indicate motion properties of the crowd in the video; and a prediction component predicting attributes of the crowd in the video based on the extracted motion features and the extracted appearance features.
- The system according to claim 17, wherein the feature extracting component is configured to: detect short trajectories of the crowd in each frame in the video; compute physical relationships between each of the trajectories and its neighbors to determine one or more motion distributions for each frame in the video; and average the determined motion distributions across the temporal domain, and interpolate one or more sparse tracklet points into the averaged distributions to form one or more continuous motion channels forming the motion features.
- The system according to claim 18, wherein the motion distribution comprises at least one of: a collectiveness distribution that indicates a degree to which individuals in a whole scene act as a union in a collective motion; a stability distribution that indicates whether the whole scene can keep a topological structure for the crowd in the whole scene; and a conflict distribution that indicates an interaction/friction between each pair of nearest-neighbor trajectories for the crowd in the whole scene.
- The system according to claim 19, wherein the prediction component is further configured to: configure a first branch to receive the motion features of the video with crowd scenes, wherein the first branch is configured with a first neural network to predict crowd attributes from the received motion features; configure a second branch to receive the appearance features of the video with crowd scenes, wherein the second branch is configured with a second neural network to predict crowd attributes from the received appearance features; and connect the predicted crowd attributes from the first branch and the predicted crowd attributes from the second branch to output a prediction of the attributes of the crowd in the video.
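Claims 10, 11, 18, and 19 describe deriving motion distributions from the physical relationships between a tracklet and its neighbors. As an illustrative sketch only (not the claimed implementation, which the claims leave unspecified), a per-frame conflict score can be approximated as one minus the cosine similarity between a tracklet's velocity and that of its nearest neighbor, so aligned motion scores near 0 and opposed motion near 2; the function names and toy geometry below are hypothetical:

```python
import math

def velocity_conflict(v1, v2):
    """Conflict between two neighboring tracklet velocities:
    1 minus cosine similarity (aligned -> 0.0, opposed -> 2.0)."""
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 0.0  # a stationary tracklet contributes no conflict
    return 1.0 - dot / (n1 * n2)

def conflict_map(points, velocities):
    """For each tracklet point in one frame, conflict with its
    nearest spatial neighbor (squared Euclidean distance)."""
    scores = []
    for i, p in enumerate(points):
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: (points[k][0] - p[0]) ** 2
                              + (points[k][1] - p[1]) ** 2)
        scores.append(velocity_conflict(velocities[i], velocities[j]))
    return scores

# Two tracklets moving together, one moving against them far away.
frame_scores = conflict_map([(0, 0), (1, 0), (10, 10)],
                            [(1, 0), (1, 0), (-1, 0)])
```

Per claims 10 and 18, such per-frame maps would then be averaged across the temporal domain and interpolated into continuous motion channels.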
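Claims 13 and 20 recite a two-branch architecture: one branch scores the motion features, the other scores the appearance features, and the two sets of predicted attributes are connected into a final prediction. The following toy sketch illustrates that wiring with single linear layers standing in for the two neural networks; the class name, layer sizes, and random initialization are our own illustrative choices, not the patent's:

```python
import random

def linear(x, W, b):
    """y = W @ x + b for plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

class TwoBranchFusion:
    """Toy stand-in for the claimed two-branch predictor: a motion
    branch and an appearance branch each emit attribute scores, which
    are concatenated and fused by a final linear layer."""
    def __init__(self, n_motion, n_appear, n_attr, seed=0):
        rng = random.Random(seed)
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]
        self.Wm, self.bm = mat(n_attr, n_motion), [0.0] * n_attr
        self.Wa, self.ba = mat(n_attr, n_appear), [0.0] * n_attr
        self.Wf, self.bf = mat(n_attr, 2 * n_attr), [0.0] * n_attr

    def predict(self, motion_feat, appear_feat):
        m = linear(motion_feat, self.Wm, self.bm)  # first-branch scores
        a = linear(appear_feat, self.Wa, self.ba)  # second-branch scores
        return linear(m + a, self.Wf, self.bf)     # fuse concatenation

model = TwoBranchFusion(n_motion=4, n_appear=6, n_attr=3)
scores = model.predict([1.0] * 4, [0.5] * 6)
```

The key design point the claims capture is late fusion: each branch is trained on its own modality before the concatenated outputs are combined.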
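Claims 14 and 15 recite a standard fixed-learning-rate gradient-descent loop: compute the loss, compute gradients, update parameters by the product of the learning rate and the gradients, and repeat until a stopping criterion holds. A minimal generic sketch, assuming an update-magnitude stopping criterion (the claims leave the criterion unspecified) and our own toy quadratic loss:

```python
def train(params, grad_fn, lr=0.1, tol=1e-6, max_steps=1000):
    """Fixed-learning-rate gradient descent following the claimed
    steps; stops when every parameter update falls below tol."""
    for step in range(max_steps):
        grads = grad_fn(params)
        # update = product of prefixed learning rate and gradient
        params = [p - lr * g for p, g in zip(params, grads)]
        if max(abs(lr * g) for g in grads) < tol:  # stopping criterion
            return params, step + 1
    return params, max_steps

# Toy loss L = sum((p - t)^2), whose gradient is 2 * (p - t).
target = [3.0, -1.0]
grad = lambda ps: [2 * (p - t) for p, t in zip(ps, target)]
final, steps = train([0.0, 0.0], grad)
```

Per the claims, the appearance branch would start from random initialization (claim 14) while the motion branch starts from pre-trained parameters (claim 15), before the connected networks are jointly fine-tuned (claim 16).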
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201580080179.9A CN107615272B (en) | 2015-05-18 | 2015-05-18 | System and method for predicting crowd attributes |
PCT/CN2015/079190 WO2016183770A1 (en) | 2015-05-18 | 2015-05-18 | A system and a method for predicting crowd attributes |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2015/079190 WO2016183770A1 (en) | 2015-05-18 | 2015-05-18 | A system and a method for predicting crowd attributes |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016183770A1 (en) | 2016-11-24 |
Family
ID=57319155
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/079190 WO2016183770A1 (en) | 2015-05-18 | 2015-05-18 | A system and a method for predicting crowd attributes |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107615272B (en) |
WO (1) | WO2016183770A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110415323B (en) * | 2019-07-30 | 2023-05-26 | 成都数字天空科技有限公司 | Fusion deformation coefficient obtaining method, fusion deformation coefficient obtaining device and storage medium |
CN111339364B (en) * | 2020-02-28 | 2023-09-29 | 网易(杭州)网络有限公司 | Video classification method, medium, device and computing equipment |
CN111429185B (en) * | 2020-03-27 | 2023-06-02 | 京东城市(北京)数字科技有限公司 | Crowd figure prediction method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040230452A1 (en) * | 2003-05-15 | 2004-11-18 | Yuichi Abe | Regional attribute determination method, regional attribute determination device, and regional attribute determination program |
CN101561928A (en) * | 2009-05-27 | 2009-10-21 | 湖南大学 | Multi-human body tracking method based on attribute relational graph appearance model |
CN103150375A (en) * | 2013-03-11 | 2013-06-12 | 浙江捷尚视觉科技有限公司 | Quick video retrieval system and quick video retrieval method for video detection |
CN104537685A (en) * | 2014-12-12 | 2015-04-22 | 浙江工商大学 | Method for conducting automatic passenger flow statistical analysis on basis of video images |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9202225B2 (en) * | 2010-05-28 | 2015-12-01 | Red Hat, Inc. | Aggregate monitoring of utilization data for vendor products in cloud networks |
CN102201065B (en) * | 2011-05-16 | 2012-11-21 | 天津大学 | Method for detecting monitored video abnormal event based on trace analysis |
CN102508923B (en) * | 2011-11-22 | 2014-06-11 | 北京大学 | Automatic video annotation method based on automatic classification and keyword marking |
CN105095908B (en) * | 2014-05-16 | 2018-12-14 | 华为技术有限公司 | Group behavior characteristic processing method and apparatus in video image |
CN104598890B (en) * | 2015-01-30 | 2017-07-28 | 南京邮电大学 | A kind of Human bodys' response method based on RGB D videos |
- 2015-05-18 WO PCT/CN2015/079190 patent/WO2016183770A1/en active Application Filing
- 2015-05-18 CN CN201580080179.9A patent/CN107615272B/en active Active
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109615140A (en) * | 2018-12-14 | 2019-04-12 | 中国科学技术大学 | A kind of method and device for predicting pedestrian movement |
CN109615140B (en) * | 2018-12-14 | 2024-01-09 | 中国科学技术大学 | Method and device for predicting pedestrian movement |
CN109977800A (en) * | 2019-03-08 | 2019-07-05 | 上海电力学院 | A kind of intensive scene crowd of combination multiple features divides group's detection method |
CN110210603A (en) * | 2019-06-10 | 2019-09-06 | 长沙理工大学 | Counter model construction method, method of counting and the device of crowd |
CN111933298A (en) * | 2020-08-14 | 2020-11-13 | 医渡云(北京)技术有限公司 | Crowd relation determination method, device, electronic equipment and medium |
CN111933298B (en) * | 2020-08-14 | 2024-02-13 | 医渡云(北京)技术有限公司 | Crowd relation determining method and device, electronic equipment and medium |
CN113792930A (en) * | 2021-04-26 | 2021-12-14 | 青岛大学 | Blind person walking track prediction method, electronic device and storage medium |
CN113792930B (en) * | 2021-04-26 | 2023-08-22 | 青岛大学 | Blind person walking track prediction method, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107615272B (en) | 2021-09-03 |
CN107615272A (en) | 2018-01-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2016183770A1 (en) | A system and a method for predicting crowd attributes | |
Zhang et al. | Attentional neural fields for crowd counting | |
Shen et al. | Multiobject tracking by submodular optimization | |
Xiong et al. | Spatiotemporal modeling for crowd counting in videos | |
CN108960184B (en) | Pedestrian re-identification method based on heterogeneous component deep neural network | |
Somasundaram et al. | Action recognition using global spatio-temporal features derived from sparse representations | |
CN106469299A (en) | A kind of vehicle search method and device | |
Karavasilis et al. | Visual tracking using the Earth Mover's Distance between Gaussian mixtures and Kalman filtering | |
Ma et al. | Counting people crossing a line using integer programming and local features | |
CN111178284A (en) | Pedestrian re-identification method and system based on spatio-temporal union model of map data | |
Banerjee et al. | Efficient pooling of image based CNN features for action recognition in videos | |
WO2020088763A1 (en) | Device and method for recognizing activity in videos | |
Barkoky et al. | Complex Network-based features extraction in RGB-D human action recognition | |
Xie et al. | Event-based stereo matching using semiglobal matching | |
Zhang et al. | Joint discriminative representation learning for end-to-end person search | |
Islam et al. | Representation for action recognition with motion vector termed as: SDQIO | |
Ji et al. | Semisupervised hyperspectral image classification using spatial-spectral information and landscape features | |
Bakour et al. | Soft-CSRNet: real-time dilated convolutional neural networks for crowd counting with drones | |
Behera et al. | Person re-identification: A taxonomic survey and the path ahead | |
Yadav et al. | DroneAttention: Sparse weighted temporal attention for drone-camera based activity recognition | |
Babu et al. | Subject independent human action recognition using spatio-depth information and meta-cognitive RBF network | |
Pehlivan et al. | Recognizing activities in multiple views with fusion of frame judgments | |
Zhu et al. | Correspondence-free dictionary learning for cross-view action recognition | |
WO2020192868A1 (en) | Event detection | |
Narayan et al. | Learning deep features for online person tracking using non-overlapping cameras: A survey |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15892149; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 15892149; Country of ref document: EP; Kind code of ref document: A1 |