CN106407946B - Cross-line counting method, deep neural network training method, device and electronic equipment - Google Patents


Info

Publication number
CN106407946B
CN106407946B (application number CN201610867834.1A)
Authority
CN
China
Prior art keywords
frame image, crowd, neural network, counting, LOI
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610867834.1A
Other languages
Chinese (zh)
Other versions
CN106407946A (en)
Inventor
王晓刚
赵倬毅
李鸿升
赵瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201610867834.1A priority Critical patent/CN106407946B/en
Publication of CN106407946A publication Critical patent/CN106407946A/en
Priority to PCT/CN2017/103530 priority patent/WO2018059408A1/en
Application granted granted Critical
Publication of CN106407946B publication Critical patent/CN106407946B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion

Abstract

Embodiments of the invention disclose a cross-line counting method, a deep neural network training method, corresponding devices, and electronic equipment. The cross-line counting method comprises: inputting a plurality of original frame images, corresponding to a time period T to be analyzed, from a video requiring cross-line counting into a deep neural network, which outputs a crowd counting map for each of the original frame images, the crowd counting map comprising a counting vector for each position; taking each of the plurality of original frame images in turn as the current frame image and, for a line of interest (LOI) in the video on which cross-line counting is to be performed, obtaining from the crowd counting map of the current frame image the number of people crossing the LOI in at least one direction; and accumulating, for each of the at least one direction, the numbers of people crossing the LOI over the plurality of original frame images, to obtain the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed. Embodiments of the invention can be applied to a wide variety of scenes, and their cross-line counting results are more objective and accurate.

Description

Cross-line counting method, deep neural network training method, device and electronic equipment
Technical Field
The invention relates to computer vision technology, and in particular to a cross-line counting method, a deep neural network training method, corresponding devices, and electronic equipment.
Background
Automatic crowd counting in video plays an increasingly important role in crowd flow monitoring, public safety, and related areas. In particular, cross-line counting methods can count the flow of people across key roads, entrances, and exits in real time, from which the total number of people in an area can be estimated.
Currently, the mainstream cross-line counting methods are based on temporal slice images. In these methods, the pixel vectors (three-channel vectors for color images) lying on the line of interest are extracted from each frame of the video and stacked along the time dimension to form a two-dimensional temporal slice image. A regression model is then learned on the temporal slice image, using manually annotated cross-line counts directly as the supervision signal, to estimate the number of people in the temporal slice image and hence the number of people crossing the line within a certain period.
Disclosure of Invention
Embodiments of the invention provide a technical scheme for cross-line counting.
According to an aspect of the embodiments of the present invention, there is provided a cross-line counting method, including:
inputting a plurality of original frame images, corresponding to a time period T to be analyzed, from a video requiring cross-line counting into a deep neural network, and outputting, by the deep neural network, a crowd counting map for each of the plurality of original frame images; the crowd counting map comprises a counting vector for each position in the frame image, each counting vector representing the number of people passing through that position, in the counting directions of the two-dimensional coordinate plane, between that frame image and the adjacent previous frame image among the plurality of original frame images;
taking each of the plurality of original frame images in turn as the current frame image and, for a line of interest (LOI) in the video on which cross-line counting is to be performed, obtaining the number of people of the current frame image crossing the LOI in at least one counting direction according to the crowd counting map of the current frame image;
and accumulating, for each of the at least one direction, the numbers of people crossing the LOI over the plurality of original frame images, to obtain the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed.
According to another aspect of the embodiments of the present invention, there is provided a deep neural network training method, including:
inputting a plurality of original frame images of a sample video into an initial deep neural network, and iteratively training the initial deep neural network, using pre-annotated crowd counting maps of the plurality of original frame images as the supervision signal, until the training result meets a preset condition, to obtain the final deep neural network; the initial deep neural network comprises an initial convolutional neural network and an initial element-wise product network.
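The training loop described above (iterate until a preset condition is met) can be sketched in miniature. The following is an illustrative skeleton only, not the patent's actual CNN: the function names, the toy one-parameter model, and the least-squares loss are all stand-ins, with the pre-annotated counting maps replaced by a simple supervision signal y.

```python
def train_until(model_weights, grad_fn, loss_fn, lr=0.1, tol=1e-4, max_iters=1000):
    """Iterative-training skeleton: update weights by gradient descent
    until the training result meets a preset condition (here, the loss
    dropping below `tol`) or an iteration budget is exhausted.  This is
    a hypothetical stand-in for the patent's deep neural network; only
    the control flow ("iterate until a preset condition") is from the text."""
    w = model_weights
    for _ in range(max_iters):
        if loss_fn(w) < tol:  # the "preset condition"
            break
        w = [wi - lr * gi for wi, gi in zip(w, grad_fn(w))]
    return w

# Toy supervised problem: fit w so that the prediction w[0]*x matches
# the "supervision signal" y = 2*x (a stand-in for labelled count maps).
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
loss = lambda w: sum((w[0] * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
grad = lambda w: [sum(2 * (w[0] * x - y) * x for x, y in zip(xs, ys)) / len(xs)]
w = train_until([0.0], grad, loss, lr=0.05)
# w[0] converges towards 2.0
```

In the patent's setting, the weights would belong to a CNN followed by an element-wise product network, and the loss would compare predicted counting maps against the pre-annotated ones.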
According to another aspect of the embodiments of the invention, there is provided a method for cross-line counting of crowds in video based on a neural network obtained by the above deep neural network training method.
According to another aspect of the embodiments of the present invention, there is provided a cross-line counting apparatus, including:
a first acquisition unit, serving as a deep neural network, configured to receive a plurality of original frame images corresponding to a time period T to be analyzed in a video requiring cross-line counting, and to output a crowd counting map for each of the original frame images; the crowd counting map comprises a counting vector for each position in the frame image, each counting vector representing the number of people passing through that position, in the counting directions, between that frame image and the adjacent previous frame image;
a second acquisition unit, configured to take each of the plurality of original frame images in turn as the current frame image and, for a line of interest (LOI) in the video on which cross-line counting is to be performed, obtain the number of people of the current frame image crossing the LOI in at least one direction according to the crowd counting map of the current frame image;
and a third acquisition unit, configured to accumulate, for each of the at least one direction, the numbers of people crossing the LOI over the plurality of original frame images, to obtain the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed.
According to another aspect of the embodiments of the present invention, there is provided a deep neural network training apparatus, including:
a network training unit, configured to input a plurality of original frame images of a sample video into an initial deep neural network and to iteratively train the initial deep neural network, using pre-annotated crowd counting maps of the plurality of original frame images as the supervision signal, until the training result meets a preset condition, to obtain the final deep neural network; the initial deep neural network comprises an initial convolutional neural network (CNN) and an initial element-wise product network.
According to another aspect of the embodiments of the present invention, there is provided a data processing apparatus including the line crossing counting apparatus or the deep neural network training apparatus according to the above embodiments.
According to still another aspect of the embodiments of the present invention, there is provided an electronic device including the data processing apparatus according to the above embodiments.
According to yet another aspect of the embodiments of the present invention, there is provided a computer storage medium for storing computer-readable instructions, the instructions comprising:
instructions for inputting a plurality of original frame images, corresponding to a time period T to be analyzed, from a video requiring cross-line counting into a deep neural network, and outputting, by the deep neural network, a crowd counting map for each of the plurality of original frame images; the crowd counting map comprises a counting vector for each position, each counting vector representing the number of people passing through, in the counting direction, between each frame image and the adjacent previous frame image among the plurality of original frame images;
instructions for taking each of the plurality of original frame images in turn as the current frame image and, for a line of interest (LOI) in the video on which cross-line counting is to be performed, obtaining the number of people of the current frame image crossing the LOI in at least one direction according to the crowd counting map of the current frame image;
and instructions for accumulating, for each of the at least one direction, the numbers of people crossing the LOI over the plurality of original frame images, to obtain the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed.
According to yet another aspect of embodiments of the present invention, there is provided another computer storage medium for storing computer-readable instructions, the instructions comprising:
instructions for inputting a plurality of original frame images of a sample video into an initial deep neural network, and iteratively training the initial deep neural network, using pre-annotated crowd counting maps of the plurality of original frame images as the supervision signal, until the training result meets a preset condition, to obtain the final deep neural network; the initial deep neural network comprises an initial convolutional neural network and an initial element-wise product network.
According to still another aspect of an embodiment of the present invention, there is provided a computer apparatus including:
a memory storing executable instructions;
one or more processors in communication with the memory to execute the executable instructions to perform operations corresponding to the cross-line counting method or the deep neural network training method of any of the above embodiments of the present invention.
Based on the cross-line counting method, the deep neural network training method, the devices, and the electronic equipment provided by the embodiments of the invention, a deep neural network training method is provided together with a technical scheme for crowd cross-line counting based on the trained deep neural network. A sample video is input into an initial deep neural network, pre-annotated crowd counting maps of a plurality of original frame images in the sample video are used as the supervision signal, and the initial deep neural network is trained iteratively until the training result meets a preset condition, yielding the deep neural network. By inputting into the deep neural network a plurality of original frame images corresponding to the time period T to be analyzed in a video requiring cross-line counting, a crowd counting map can be output for each of the original frame images, that is, the number of people passing through each position, in the counting direction (for example, at least one of the x-axis and y-axis directions of the two-dimensional coordinate plane), between the current frame image and the adjacent previous frame image. For each frame image, the number of people crossing the LOI in at least one direction is obtained from its crowd counting map, and these numbers are accumulated over the plurality of original frame images for each direction, yielding the unidirectional cross-line count of the LOI in each of the at least one direction within the time period T to be analyzed.
Because embodiments of the invention take the original frame images of the original video directly as input, without using temporal slice images, they are more robust and can be applied to a wide variety of scenes. They avoid the problems of temporal slice images, in which pedestrians become barely recognizable and the number of people cannot be estimated when the crowd density in the video is high, the crowd moves slowly or is still, or the surveillance camera has a low viewing angle; the method therefore remains applicable in these situations and can be applied across scenes. In addition, embodiments of the invention perform cross-line counting based on the crowd counting map, rather than using only the total crowd count, and thus also take the crowd's spatial distribution into account, making the cross-line counting result more objective and accurate.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
The invention will be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of an embodiment of a cross-line counting method according to the present invention.
FIG. 2 is a flowchart illustrating another embodiment of a cross-line counting method according to the present invention.
FIG. 3 is a flowchart of an embodiment of a deep neural network training method of the present invention.
FIG. 4 is a schematic diagram of an embodiment of the present invention in which an initial deep neural network is trained in two stages.
FIG. 5 is a schematic structural diagram of an embodiment of the cross-line counting apparatus according to the present invention.
FIG. 6 is a schematic structural diagram of another embodiment of the cross-line counting apparatus according to the present invention.
FIG. 7 is a schematic structural diagram of an embodiment of a deep neural network training device according to the present invention.
Fig. 8 is a schematic structural diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations, and with numerous other electronic devices, such as terminal devices, computer systems, servers, etc. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with computer systems, servers, and terminal devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
The computer system, server, and terminal device may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
In one cross-line counting method based on temporal slice images, Histogram of Oriented Gradients (HOG) features are extracted from the temporal slice image, a Gaussian process regression model is then trained to predict the number of people in the temporal slice image, and a dynamic texture method is used to distinguish crowds crossing the line in the two directions. This method is also referred to as an IP-based method.
In another cross-line counting method based on temporal slice images, temporal slice color images and the corresponding temporal slice optical-flow images are input into a convolutional neural network (CNN) to obtain the total number of people in the temporal slice and the proportions of people moving in the two directions, from which the cross-line counts in the two directions are obtained. This method is also known as the temporal slice convolutional neural network (TS-CNN).
In the process of implementing the invention, the inventors found through research that cross-line counting methods based on temporal slice images have at least the following problems:
Temporal slice images are not natural images. When the crowd density in the video is high, when the crowd moves slowly (especially when it is still), or when the surveillance camera has a low viewing angle, the images of pedestrians in the temporal slice are stretched into strips, so the pedestrians are barely recognizable and the number of people in the temporal slice image cannot be estimated, which limits the effectiveness of these methods. In addition, these methods use only the total cross-line count as the supervision signal; the supervision information is not rich enough, which hinders the learning of a complex CNN model.
In embodiments of the invention, a crowd counting map (Counting Map) is obtained for each frame image of the original video. The crowd counting map of each frame image is then accumulated along the line of interest (LOI) on which cross-line counting is to be performed, giving the instantaneous cross-line count values in the two directions on the LOI (i.e., the number of people crossing the LOI in that frame). Finally, the instantaneous cross-line count values within the time period T to be analyzed are accumulated for each of the two directions, giving the crowd cross-line count values (i.e., the number of people crossing the LOI) within the time period T to be analyzed.
FIG. 1 is a flowchart of an embodiment of a cross-line counting method according to the present invention. As shown in fig. 1, the line crossing counting method of this embodiment includes:
102, inputting a plurality of original frame images, corresponding to a time period T to be analyzed, from a video requiring cross-line counting into a deep neural network, and outputting, by the deep neural network, a crowd counting map for each of the plurality of original frame images.
Embodiments of the invention introduce a crowd counting map, which comprises a counting vector for each position in the frame image. That is, each position of the crowd counting map records a two-dimensional counting vector representing the number of people passing through that position, in the counting directions, between the current frame image and the adjacent previous frame image, for example in the two coordinate directions (the x-axis and y-axis directions) of the two-dimensional coordinate plane. The crowd counting map is a mathematical approximation; the value of the counting vector at each position is usually less than 1 and represents what fraction of a person passes through that position between the current frame image and the adjacent previous frame image.
As a specific example, after the plurality of original frame images corresponding to the time period T to be analyzed in the video requiring cross-line counting are input into the deep neural network in operation 102, at least two frame images may be extracted in sequence from the plurality of original frame images, and the crowd counting map of the current frame image may be generated with the later of the at least two frame images taken as the current frame image. The at least two frame images extracted in sequence may be consecutive original frame images, non-consecutive original frame images, or partly consecutive and partly non-consecutive. That is, in embodiments of the invention, crowd cross-line counting may be performed on all original frame images corresponding to the time period T to be analyzed, or on a subset of original frame images extracted from the video; it is not necessary for every original frame image corresponding to the time period T to participate in the counting.
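The frame-extraction scheme described above can be sketched as follows. This is a minimal illustration; the function name and the `stride` parameter are assumptions (the patent only requires that pairs of frames, consecutive or not, be taken in sequence, with the later frame of each pair treated as the current frame).

```python
def frame_pairs(frames, stride=1):
    """Yield (previous, current) frame pairs from a frame sequence.

    The later frame of each pair is treated as the "current" frame for
    which a crowd counting map is generated.  A stride > 1 models the
    case where only some of the original frames within the period T
    are extracted (the pairs need not be consecutive frames)."""
    for i in range(stride, len(frames), stride):
        yield frames[i - stride], frames[i]

# Example: 6 frames of a video segment, sampling every 2nd frame.
frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
pairs = list(frame_pairs(frames, stride=2))
# pairs == [("f0", "f2"), ("f2", "f4")]
```

With `stride=1` the scheme degenerates to the ordinary consecutive-frame case.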
104, taking each of the plurality of original frame images in turn as the current frame image and, for the line of interest (LOI) in the video on which cross-line counting is to be performed, obtaining the number of people of the current frame image crossing the LOI in at least one direction according to the crowd counting map of the current frame image.
The LOI can be set as needed according to the requirements of the crowd counting application, and may be any line between two positions in the video scene across which people are to be counted, for example a line across a subway entrance or a line across the doorway of a shopping mall.
106, accumulating, for each of the at least one direction, the numbers of people crossing the LOI over the plurality of original frame images, to obtain the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed.
The above embodiment of the invention provides a new technical scheme for crowd cross-line counting based on a CNN. A crowd counting map is obtained, via the deep neural network, for each frame image corresponding to the time period T to be analyzed in the video; for each frame image, the number of people crossing the LOI in at least one direction is obtained; and these numbers are accumulated per direction to obtain the unidirectional cross-line count of the LOI in each of the at least one direction within the time period T to be analyzed. Because embodiments of the invention take each frame image of the original video directly as input, without using temporal slice images, they are more robust, can be applied to a wide variety of scenes, remain applicable in extreme conditions where the crowd density is high or the crowd moves slowly or is still, and can be applied across scenes. In addition, embodiments of the invention perform cross-line counting based on the crowd counting map, rather than using only the total crowd count, and thus also take the crowd's spatial distribution into account, making the cross-line counting result more objective and accurate.
In one specific example of the embodiments of the cross-line counting method of the invention, in operation 104 the number of people of the current frame image crossing the LOI from one direction may be obtained; correspondingly, in operation 106 the numbers of people crossing the LOI in that direction are accumulated over the plurality of original frame images, giving the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed. Alternatively, in operation 104 the numbers of people of the current frame image crossing the LOI from each of the two directions may be obtained; correspondingly, in operation 106 the numbers of people crossing the LOI are accumulated over the plurality of original frame images for each of the two directions, giving the unidirectional cross-line counts of the LOI in both directions within the time period T to be analyzed, so that the bidirectional cross-line traffic of the LOI can be known comprehensively.
In another specific example of the embodiments of the cross-line counting method of the invention, the crowd counting map of the current frame image may be generated as follows:
the plurality of original frame images are input into the deep neural network, and a convolutional neural network within the deep neural network generates a crowd density map and a crowd velocity map of the current frame image. The crowd density map represents the crowd density at each position in the current frame image, and the crowd velocity map represents the velocity with which each pedestrian moves from the adjacent previous frame image to the current frame image;
the crowd density map and the crowd velocity map of the current frame image are input into an element-wise product network within the deep neural network, which multiplies the crowd density map and the crowd velocity map at corresponding positions to obtain the crowd counting map of the current frame image.
In embodiments of the invention, the crowd density map and the crowd velocity map of a frame image are obtained from at least two frame images of the video. Under the assumption that the density distribution and the walking speed of the pedestrians do not change between the two frames, the crowd density map and the crowd velocity map of the current frame image are multiplied element-wise at corresponding positions to obtain the crowd counting map of the frame image, so that the crowd counting map is obtained accurately.
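The element-wise combination of the density map and the velocity map into a counting map can be sketched as follows. This is an illustration only: the patent performs this step inside a learned element-wise product network, whereas here plain nested lists stand in for the two maps, and the function name is an assumption.

```python
def counting_map(density, velocity):
    """Combine a crowd density map and a crowd velocity map into a
    crowd counting map by element-wise multiplication: at each
    position p the count vector is C(p) = density(p) * (vx(p), vy(p)).

    density  : H x W grid of per-position crowd density values
    velocity : H x W grid of (vx, vy) displacement vectors between
               the adjacent previous frame and the current frame"""
    return [
        [(d * vx, d * vy) for d, (vx, vy) in zip(drow, vrow)]
        for drow, vrow in zip(density, velocity)
    ]

# Toy 1x2 example: 0.5 of a person at one position, moving one unit
# in the +x direction; the other position is empty.
density = [[0.5, 0.0]]
velocity = [[(1.0, 0.0), (0.0, 0.0)]]
cmap = counting_map(density, velocity)
# cmap == [[(0.5, 0.0), (0.0, 0.0)]]
```

The fractional value 0.5 illustrates the point made above: each counting vector usually has magnitude below 1, recording what fraction of a person passes through the position between the two frames.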
In a specific example of the embodiments of the cross-line counting method, after the crowd counting map of a frame image has been obtained as described above, for any LOI on which cross-line counting is to be performed, the numbers of people of the current frame image crossing the LOI from the two directions may be obtained as follows:
the counting vectors at the positions on the LOI in the crowd counting map are each projected onto the normal direction of the LOI, giving a scalar value at each position on the LOI, where the sign of the scalar value indicates which of the two crossing directions of the LOI it belongs to, for example entering or leaving a subway entrance;
the positive scalar values and the negative scalar values on the LOI are accumulated separately, giving the numbers of people of the current frame image crossing the LOI in each of the two directions.
For example, positive and negative scalar values on the LOI can be accumulated separately as follows:
$$C_{1,t}=\sum_{p\in \mathrm{LOI}}\max\big(\lVert \mathbf{c}_t(p)\rVert\cos\theta_p,\;0\big),\qquad C_{2,t}=\sum_{p\in \mathrm{LOI}}\max\big(-\lVert \mathbf{c}_t(p)\rVert\cos\theta_p,\;0\big)$$

wherein $C_{1,t}$ and $C_{2,t}$ respectively represent the instantaneous line-crossing count values at time $t$ in the two directions of the LOI in the current frame image, $\theta_p$ represents the angle between the count vector $\mathbf{c}_t(p)=(C_{t,x}(p),C_{t,y}(p))$ at the current position $p$ and the normal direction of the LOI, and $t$ is any time in the time period $T$ to be analyzed.
After the instantaneous line-crossing count values $C_{1,t}$ and $C_{2,t}$ in the two directions on the LOI are obtained for each frame image, the values $C_{1,t}$ and $C_{2,t}$ for every time $t$ within the time period $T$ to be analyzed can be accumulated by the formulas $C_1=\sum_{t\in T}C_{1,t}$ and $C_2=\sum_{t\in T}C_{2,t}$, where $C_1$ and $C_2$ are respectively the numbers of one-way line-crossing persons of the LOI in the two directions within the time period $T$ to be analyzed.
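The per-frame projection and accumulation steps above can be sketched numerically. This is a hedged illustration with assumed names and data (the network's real count map is learned, not hand-built): count vectors on the LOI are projected onto the unit normal, and the positive and negative projections are summed separately.

```python
import numpy as np

def instantaneous_counts(count_map, loi_points, normal):
    """count_map: (H, W, 2); loi_points: (row, col) pixels on the LOI;
    normal: normal vector of the LOI. Returns (c1_t, c2_t)."""
    n = np.asarray(normal, dtype=float)
    n /= np.linalg.norm(n)                     # unit normal
    c1_t = c2_t = 0.0
    for r, c in loi_points:
        s = float(count_map[r, c] @ n)         # signed projection = |c_t(p)| cos(theta_p)
        if s > 0:
            c1_t += s                          # crossings along +normal
        else:
            c2_t -= s                          # crossings along -normal
    return c1_t, c2_t

cm = np.zeros((4, 4, 2))
cm[1, 2] = [3.0, 0.0]                          # three people crossing along +normal
cm[2, 2] = [-1.0, 0.0]                         # one person crossing the other way
c1_t, c2_t = instantaneous_counts(cm, [(1, 2), (2, 2)], (1.0, 0.0))
```

Accumulating c1_t and c2_t over every frame time t within the period T then gives the one-way totals C1 and C2, and their sum gives the total crossing count.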
In a further embodiment of the line crossing counting method, after the numbers of people passing through the LOI in the two directions are obtained for the current frame image, the numbers for the two directions can be accumulated, so as to obtain the total number of line-crossing persons passing through the LOI within the time period T to be analyzed.
FIG. 2 is a flowchart illustrating another embodiment of a cross-line counting method according to the present invention. As shown in fig. 2, the line crossing counting method of this embodiment includes:
202, the deep neural network sequentially extracts at least two frame images from a plurality of original frame images corresponding to a time period T to be analyzed in a video requiring crowd line-crossing counting, and generates a crowd counting map of the current frame image, taking the later frame image of the at least two frame images as the current frame image.
The at least two frame images extracted sequentially may be continuous original frame images or discontinuous original frame images, and the at least two frame images may also be partially continuous original frame images and partially discontinuous original frame images. The group count map includes a count vector for each position in the frame image, that is: each position of the crowd counting graph records a two-dimensional counting vector, and the two-dimensional counting vector represents the number of people passing through between the current frame image and the adjacent previous frame image in the directions of the x axis and the y axis respectively.
And 204, taking each frame image in the plurality of original frame images as a current frame image, and projecting the counting vector of each position on the LOI in the crowd counting graph in the normal direction of the LOI aiming at the LOI to be subjected to line crossing counting in the video to obtain a scalar value of each position on the LOI, wherein the positive and negative of the scalar value represent two line crossing directions of the LOI.
And 206, accumulating the positive scalar value and the negative scalar value on the LOI respectively to obtain the number of people passing through the current frame image in the two directions on the LOI respectively, wherein the number is the instant line crossing count value of the current frame image in the two directions on the LOI at the moment t corresponding to the current frame image.
208, respectively accumulating the number of people of the plurality of original frame images passing through the LOI in two directions in the time period T to be analyzed, and obtaining the number of people of one-way line crossing of the LOI in two directions in the time period T to be analyzed.
And 210, accumulating the number of the unidirectional line crossing people of the LOI in two directions to obtain the total number of the line crossing people passing through the LOI in the time period T to be analyzed.
Before the cross-line counting method of each embodiment of the present invention is performed, an initial deep neural network may be trained in advance to obtain the deep neural network; the obtained deep neural network may be used in the cross-line counting scheme of the above embodiment, and may also be used in other applications requiring a crowd counting map. Specifically, an initial deep neural network may be preset, comprising an initial Convolutional Neural Network (CNN) and an initial element multiplication network. A plurality of original frame images of one or more sample videos are input to the initial deep neural network, pre-labeled crowd counting maps of the plurality of original frame images in the sample videos are used as supervision signals, and the initial deep neural network is iteratively trained until the training result meets a preset condition, obtaining the final deep neural network.
Based on the deep neural network training method provided by the above embodiment of the present invention, a plurality of original frame images of a sample video are input to an initial deep neural network, pre-labeled crowd counting maps of the plurality of original frame images in the sample video are used as supervision signals, and the initial deep neural network is iteratively trained until the training result meets a preset condition, so as to obtain the deep neural network used for crowd line-crossing counting. Because the deep neural network directly takes original frame images of the original video as input instead of using time-sequence slice images, it has better robustness and can be applied to various different scenes. It avoids the problems that pedestrians in time-sequence slice images have low identifiability, and that the number of people in a time-sequence slice image cannot be estimated when the crowd density in the video is high, the crowd moving speed is low or the crowd is still, or when the viewing angle of the monitoring camera is low; it is therefore also suitable for situations where the crowd density is high and the crowd moving speed is low or zero, and can be applied across scenes. In addition, the embodiment of the invention performs line-crossing counting based on the crowd counting map: when training the deep neural network, not only the total number of the crowd but also the distribution of the crowd is considered, so that the line-crossing counting result is more objective and accurate.
In a specific example of the embodiment of the present invention, the plurality of original frame images are labeled with a crowd density map, a crowd speed map, and a crowd count map, respectively. Correspondingly, in this embodiment, inputting a plurality of original frame images of the sample video into the initial deep neural network, taking a pre-labeled population count map of the plurality of original frame images as a supervision signal, and iteratively training the initial deep neural network until a training result meets a preset condition may include:
and respectively taking two adjacent frame images in a plurality of original frame images in the sample video as a training sample to be input into an initial convolutional neural network, taking a pre-labeled crowd density graph and a crowd velocity graph as supervision signals, and performing iterative training on the initial convolutional neural network until a training result meets a first preset convergence condition to obtain a final convolutional neural network. The two adjacent frame images can be two continuous frame original images in an original video, or discontinuous original frame images extracted from the original video according to a certain time interval or frame image interval;
and respectively taking two adjacent frame images in a plurality of original frame images in the sample video as a training sample to be input into the initial deep neural network, taking a pre-labeled population counting graph as a supervision signal, and carrying out iterative training on the initial deep neural network until a set index meets a second preset convergence condition to obtain a final deep neural network.
In order to learn a better deep neural network, in the above embodiment of the present invention, the initial deep neural network is trained in two stages. In the first stage, the crowd density map and the crowd speed map are estimated respectively; these are two relatively simple tasks with relatively obvious semantic information. In the second stage, a direct estimate of the crowd counting map is given.
It should be noted that after the training of the first stage is completed, the crowd counting map could in principle be obtained by multiplying the crowd density map and the crowd speed map; in practice, however, the obtained crowd density map and crowd speed map may be mismatched in spatial position, because no spatial position matching constraint is imposed on them during the first-stage training. Since the target of the second-stage training is obtained by multiplying the crowd density map and the crowd speed map output by the first stage element-wise at corresponding positions, after the first-stage training is finished, the embodiment of the invention corrects the mismatch in spatial position through the second-stage training, thereby effectively ensuring the match of the crowd density map and the crowd speed map in spatial position. In addition, the crowd counting map is used as a supervision signal in the second stage, which facilitates the learning of the complex initial deep neural network, so that the deep neural network obtained by training has a stronger and more accurate counting capability.
In another embodiment of the cross-line counting method of the present invention, before the iterative training of the initial deep neural network, the following operations may be performed:
respectively positioning pedestrians for each frame image in the plurality of original frame images in the sample video to obtain the positions of the pedestrians in each frame image in the sample video and respectively distributing pedestrian IDs to the pedestrians;
and calibrating the pedestrian information of each pedestrian in each frame image in the sample video according to the pedestrian position in each frame image in the sample video, wherein the pedestrian information comprises the pedestrian position and the pedestrian ID.
Since the geometric perspective view includes the correspondence between the number of pixels at different positions in the sample video and the real physical size of the scene, the pedestrian information of each pedestrian is annotated in each of the plurality of original frame images in the sample video according to the pedestrian position in each frame image and the geometric perspective view; that is, each pedestrian can be annotated in the sample video scene with a mark of the corresponding size, according to the pedestrian's position in the real scene and physical size. For example, in the frame image of the sample video corresponding to time t, the position information of the heads of the pedestrians can be labeled as:
$$\mathcal{P}_t=\{P_t^1,\ldots,P_t^n\}$$

where $t$ denotes the time, the superscripts $1,\ldots,n$ denote the pedestrian IDs of the pedestrians (the pedestrian ID here being specifically represented by a serial number), and $P_t^i$ is the annotated head position of pedestrian $i$ at time $t$.
In the specific training process, when the pedestrians in the sample video are calibrated and pedestrian IDs are assigned, every frame image in the sample video may be calibrated, or the calibration may be performed at preset intervals (for example, 1 second) according to the movement condition and moving speed of the pedestrians; the pedestrian positions and pedestrian IDs of the intermediate frame images can then be obtained approximately by interpolating between the two calibrated frame images before and after, which simplifies the labeling workload. In addition, all original frame images in the sample video may participate in the initial deep neural network training, or only a part of them may be extracted to participate, so that more sample videos can be trained for a given training task. The larger the total number of frame images participating in the initial deep neural network training, the better the training effect, and the better the robustness of the deep neural network obtained by training.
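The interpolation of intermediate-frame annotations can be sketched as below. The annotation format (a dict from pedestrian ID to an (x, y) head position) and the function name are assumptions for illustration only; the idea is simply per-ID linear interpolation between two calibrated frames.

```python
def interpolate_annotations(ann0, ann1, t0, t1, t):
    """ann0/ann1: {pedestrian_id: (x, y)} head positions at calibrated
    times t0 and t1; returns linearly interpolated positions at time t
    for the IDs present in both calibrated frames."""
    w = (t - t0) / (t1 - t0)
    out = {}
    for pid in ann0.keys() & ann1.keys():      # IDs visible in both frames
        x0, y0 = ann0[pid]
        x1, y1 = ann1[pid]
        out[pid] = (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
    return out

mid = interpolate_annotations({1: (0.0, 0.0)}, {1: (10.0, 4.0)}, 0, 2, 1)
# mid[1] == (5.0, 2.0): halfway between the two calibrated positions
```

Pedestrians that appear in only one of the two calibrated frames cannot be interpolated this way and would need to be handled separately, e.g. by dropping them for the intermediate frames.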
FIG. 3 is a flowchart of an embodiment of a deep neural network training method of the present invention. The preset initial deep neural network may specifically include an initial CNN and an initial element multiplication network. And the deep neural network obtained after training comprises a corresponding CNN and an element multiplication network. As shown in fig. 3, the deep neural network training method of the embodiment includes:
302, setting a geometric perspective view of a sample video aiming at a scene of the sample video in advance, wherein the geometric perspective view comprises a corresponding relation between the number of pixels at different positions in the sample video and the real physical size of the scene; and respectively carrying out pedestrian positioning on each frame image in a plurality of original frame images participating in network training in the sample video to obtain the pedestrian position in each frame image and respectively distributing pedestrian ID to each pedestrian.
Because different pedestrians have different body sizes while a pedestrian's head is not easily occluded, the position of the pedestrian's head can be used as the pedestrian position, so as to represent the position of the pedestrian accurately and objectively.
And 304, calibrating the pedestrian information of each pedestrian in each frame image in the plurality of original frame images of the sample video respectively according to the pedestrian position in each frame image in the plurality of original frame images of the sample video, wherein the pedestrian information comprises the pedestrian position and the pedestrian ID.
And 306, respectively taking two adjacent frame images in the original frame images in the sample video as a training sample to be input to the initial convolutional neural network, taking a pre-labeled crowd density graph and a crowd velocity graph as supervision signals, and performing iterative training on the initial convolutional neural network until a training result meets a first preset convergence condition to obtain a final convolutional neural network.
The crowd density graph is used for representing the crowd density of each position in the current frame image, and the crowd speed graph is used for representing the speed of each person in the current frame image moving from the previous frame image to the current frame image.
Specifically, after two adjacent frame images among the plurality of original frame images in the sample video are input as a training sample to the initial convolutional neural network, the initial convolutional neural network takes the later frame image in the current training sample as the current frame image, generates a crowd density map of the current frame image according to the pedestrian information calibrated in each frame image, and generates a crowd speed map of the current frame image according to the pedestrian information in the two frame images of the current training sample and the geometric perspective view. It is then compared whether the deviation between the crowd density map and crowd speed map generated by the initial convolutional neural network and the labeled crowd density map and crowd speed map is smaller than a preset condition, or whether the number of iterative training rounds of the initial convolutional neural network has reached a preset threshold. If the deviation is not smaller than the preset condition and the number of iterations has not reached the preset threshold, the network parameters of the initial convolutional neural network are adjusted and operation 306 is continued, until the deviation is smaller than the preset condition or the number of iterations reaches the preset threshold, at which point the training of the initial convolutional neural network ends and the convolutional neural network is obtained.
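The stopping logic described here (train until the deviation falls below a preset condition or the iteration count reaches a preset threshold) can be sketched abstractly. The helper names, the toy "network", and the tolerance values below are all assumptions; the sketch only shows the control flow, not the actual gradient-based training.

```python
def train_until_converged(step, deviation, tol, max_iters):
    """step(): one parameter-adjustment round; deviation(): current
    deviation from the labeled maps. Training stops when the deviation
    drops below tol (the preset condition) or after max_iters rounds
    (the preset iteration threshold)."""
    for i in range(max_iters):
        if deviation() < tol:
            return i                     # converged
        step()                           # adjust network parameters, retry
    return max_iters                     # stopped by the iteration threshold

# Toy stand-in for a network: drive the error toward 0 by halving it.
state = {"err": 1.0}
rounds = train_until_converged(
    step=lambda: state.update(err=state["err"] / 2),
    deviation=lambda: state["err"],
    tol=0.01,
    max_iters=100,
)
# 1 / 2**7 < 0.01, so convergence is detected on round 7
```

The same loop shape applies to both training stages; only the network being updated and the supervision signal change.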
Specifically, the two adjacent frame images among the plurality of original frame images in the sample video may be two consecutive original frame images, two frame images sequentially extracted from three or more consecutive original frame images, two non-consecutive original frame images, two frame images sequentially extracted from three or more non-consecutive frame images, or optical-flow images of the original images. When more than two frame images are extracted, the current frame image and the previous frame image are respectively the later and the earlier of two frame images in the original sample video, and the frame numbers of the two frame images are not required to be consecutive.
In one specific example, the initial convolutional neural network may specifically generate a crowd density map of the current frame image by:
respectively acquiring the crowd density value of each position in the current frame image according to the pedestrian information in the current frame image;
and generating a crowd density map of the current frame image according to the crowd density value and the geometric perspective of each position in the current frame image.
For example, according to the positions of the pedestrians in each frame image, the crowd density values of the positions in the frame image can be obtained after the positions of the pedestrians are respectively marked in each frame image; the crowd density map in the frame image can be calculated and obtained through the following formula:
$$D_t(p)=\sum_{P\in\mathcal{P}_t}\mathcal{N}(p;\,P,\,\sigma_P^2)$$

wherein $D_t(p)$ indicates the crowd density value at position $p$ in the frame image, and $\mathcal{N}(p;P,\sigma_P^2)$ represents a normalized two-dimensional Gaussian distribution centered on the head annotation $P$ (i.e., the position of a pedestrian's head is represented by a Gaussian kernel); $\sigma_P$, the variance of the Gaussian distribution, is determined from the geometric perspective view of each specific sample video scene, to ensure that each person has the same physical size.
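A minimal sketch of this density-map construction follows. The function name, grid size, and the fixed sigma are assumptions (in the method, sigma varies per head according to the geometric perspective view); the key property shown is that each head contributes a normalized Gaussian, so the map sums to the person count.

```python
import numpy as np

def density_map(shape, heads, sigma=1.5):
    """heads: annotated head positions (x, y). Each head contributes a
    normalized 2-D Gaussian, so the map integrates to the person count."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    D = np.zeros(shape)
    for hx, hy in heads:
        g = np.exp(-((xs - hx) ** 2 + (ys - hy) ** 2) / (2 * sigma ** 2))
        D += g / g.sum()                 # normalize: each head sums to exactly 1
    return D

D = density_map((32, 32), [(10.0, 12.0), (20.0, 8.0)])
# D.sum() is 2.0 (two annotated heads), whatever sigma is used
```

Normalizing each kernel over the discrete grid, rather than using the analytic Gaussian normalizer, keeps the sum exact even near image borders.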
In another specific example, the initial convolutional neural network may generate a crowd velocity map of the current frame image by:
acquiring the moving speed of each pedestrian in the current frame image according to the position difference of each pedestrian in the current frame image in the previous frame image and the current frame image and the corresponding time difference of the previous frame image and the current frame image;
acquiring the crowd speed of each position in the current frame image according to the moving speed and the position of each pedestrian in the current frame image;
and generating a crowd speed map of the current frame image according to the crowd speed and the geometric perspective of each position in the current frame image.
For example, the crowd velocity map in the frame image can be calculated and obtained through the following formula:
$$V_t(p)=\sum_{P\in\mathcal{P}_t}v_P\,K(p,P,r_P)$$

wherein $V_t(p)$ represents the crowd speed value at position $p$; $v_P$ is the moving speed of the annotated head $P$ in the current frame image, which can be obtained from the position difference of the head between the two adjacent frame images and the corresponding time difference; and $K(p,P,r_P)=\mathbf{1}(\lVert p-P\rVert_2\le r_P)$ is a disc-shaped kernel function whose center is the head annotation $P$ and whose radius is $r_P$. The radius $r_P$ can be selected by converting an empirically set actual physical size of a human head into the number of pixels at the corresponding position through the geometric perspective view; for example, the physical radius can be chosen empirically as 0.15 m.
And 308, respectively taking two adjacent frame images in the plurality of original frame images in the sample video as a training sample to be input into the initial deep neural network, taking a pre-labeled crowd counting graph as a supervision signal, and performing iterative training on the initial deep neural network until a set index meets a second preset convergence condition to obtain a final deep neural network.
Specifically, after two adjacent frame images among the plurality of original frame images in the sample video are input as a training sample to the initial deep neural network, the convolutional neural network in the initial deep neural network takes the later frame image in the current training sample as the current frame image, generates a crowd density map of the current frame image according to the pedestrian information calibrated in each frame image of the sample video and the geometric perspective view, generates a crowd speed map of the current frame image according to the pedestrian information in the two frame images of the current training sample and the geometric perspective view, and inputs both maps to the initial element multiplication network in the initial deep neural network; the initial element multiplication network multiplies the crowd density map and the crowd speed map of the current frame image input by the convolutional neural network element-wise at corresponding positions to obtain the crowd counting map of the current frame image.
It is then compared whether the deviation between the crowd counting map output by the element multiplication network and the pre-labeled crowd counting map is smaller than a preset condition, or whether the number of iterative training rounds of the initial deep neural network has reached a preset threshold. If the deviation is not smaller than the preset condition and the number of iterations has not reached the preset threshold, the network parameters of the initial element multiplication network are adjusted and operation 308 is continued, until the deviation is smaller than the preset condition or the number of iterations reaches the preset threshold, at which point the training of the initial deep neural network ends, the initial element multiplication network becomes the final element multiplication network, and the final deep neural network is thereby obtained.
In order to obtain the crowd counting graph, in the above embodiment of the present invention, the crowd density graph and the crowd speed graph of the frame image are obtained based on at least two frame images and the geometric perspective view in the plurality of original frame images in the sample video, and the crowd counting graph of the frame image is obtained by multiplying elements of the crowd density graph and the crowd speed graph of the current frame image at corresponding positions on the assumption that the density distribution and the walking speed of the pedestrian at the two frames are unchanged, so that the crowd counting graph is obtained conveniently.
In the embodiment of the invention shown in fig. 3, an initial deep neural network, which is a deep learning model, is introduced. An original video is directly used as the training sample video, frame images in the original video are used as the input of the initial convolutional neural network, and the pixel-level labeled crowd density map, crowd speed map and crowd counting map are used as supervision signals. During training, line-crossing counting is performed based on the crowd counting map instead of using only the total number of the crowd, so the distribution of the crowd is also considered. The deep neural network for line-crossing counting obtained by this training therefore has high robustness, is also applicable to extreme conditions of high crowd density and low or zero crowd moving speed, can be applied across scenes, and avoids the problems that pedestrians in time-sequence slice images have low identifiability and that the number of people in time-sequence slice images cannot be estimated; as a result, the line-crossing counting result is more objective and accurate.
In order to learn a better deep neural network, in the embodiment shown in fig. 3, the initial deep neural network is trained in two stages. The first stage corresponds to operation 306, which is two relatively simple tasks with relatively obvious semantic information, by giving estimates to the crowd density map and the crowd velocity map, respectively, through the initial convolutional neural network; the second stage corresponds to operation 308, which gives a direct estimate of the population count map by the initial element multiplication network.
Fig. 4 is a schematic diagram illustrating an initial deep neural network training in two stages according to an embodiment of the present invention. Inputting two adjacent frames of images in a sample video as a training sample into an initial convolutional neural network in an initial deep neural network, and outputting a crowd density graph and a crowd speed graph by the initial convolutional neural network at a first stage; inputting the crowd density graph and the crowd speed graph into an initial element multiplication network in an initial deep neural network, and outputting a crowd counting graph by the initial element multiplication network in a second stage.
It should be noted that after the training of the first stage is completed, the crowd counting map could in principle be obtained by multiplying the crowd density map and the crowd speed map; in practice, however, the obtained crowd density map and crowd speed map may be mismatched in spatial position, because no spatial position matching constraint is imposed on them during the first-stage training. Since the target of the second-stage training is obtained by multiplying the crowd density map and the crowd speed map output by the first stage element-wise at corresponding positions, the embodiment of the invention corrects the mismatch in spatial position through the second-stage training, effectively ensuring the match of the crowd density map and the crowd speed map in spatial position. In addition, the crowd counting map is used as a supervision signal in the second stage, which facilitates the learning of the complex deep neural network, so that the deep neural network obtained by training has a stronger and more accurate counting capability.
In a specific example of the embodiment shown in fig. 3, for example, the training result may be considered to satisfy the first preset convergence condition when any one or more of the following conditions are satisfied:
for the plurality of original frame images in each sample video, the ratio of the number of frame images for which both the crowd density map and the crowd speed map output by the initial convolutional neural network are consistent with the pre-labeled crowd density map and crowd speed map, to the number of the plurality of original frame images, reaches a first preset threshold; that is, the ratio of the number of frame images whose crowd density map output by the initial convolutional neural network is consistent with the pre-labeled crowd density map, to the number of frame images of the sample video input to the initial convolutional neural network, reaches the first preset threshold, and at the same time the corresponding ratio for the crowd speed map output by the initial convolutional neural network and the pre-labeled crowd speed map also reaches the first preset threshold;
for each frame image in the plurality of original frame images in each sample video, the similarity between the crowd density graph output by the initial convolutional neural network and the pre-labeled crowd density graph and the similarity between the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph are greater than a second preset threshold value;
aiming at the plurality of original frame images in each sample video, the average similarity between the crowd density graph output by the initial convolutional neural network and the pre-labeled crowd density graph and the average similarity between the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph are greater than a third preset threshold value;
and the iterative training times of the initial convolutional neural network reach a fourth preset threshold value.
In another specific example of the embodiment shown in fig. 3, the training result may be considered to satisfy the second preset convergence condition when any one or more of the following conditions are satisfied, for example:
aiming at the plurality of original frame images in each sample video, the ratio of the number of frames of an image, which is output by an initial element multiplication network and is consistent with a pre-marked crowd counting image, to the number of frames of the plurality of original frame images reaches a fifth preset threshold;
aiming at each frame of image in each sample video, the similarity between the crowd counting graph output by the initial element multiplication network and the pre-labeled crowd counting graph is greater than a sixth preset threshold;
aiming at all frame images in each sample video, the average similarity between the crowd counting graph output by the initial element multiplication network and the crowd counting graph obtained by artificial labeling is greater than a seventh preset threshold;
and the number of times of iterative training of the initial deep neural network in the second stage reaches an eighth preset threshold value.
Wherein, according to the actual requirement, when any one or more of the following conditions are satisfied, the crowd density map is considered to be consistent with the pre-labeled crowd density map (or the crowd speed map is consistent with the pre-labeled crowd speed map):
the image characteristics of the crowd density graph output by the initial convolutional neural network and the image characteristics of the pre-labeled crowd density graph (or the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph) are completely the same;
the image features of the pre-labeled crowd density map include but are more than those of the crowd density map output by the initial convolutional neural network (or the image features of the pre-labeled crowd velocity map include but are more than those of the crowd velocity map output by the initial convolutional neural network);
the same characteristics between the image characteristics of the crowd density graph output by the initial convolutional neural network and the image characteristics of the pre-labeled crowd density graph (or the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph) reach a certain quantity or a certain preset proportion;
the same characteristics between the image characteristics of the crowd density graph output by the initial convolutional neural network and the image characteristics of the pre-labeled crowd density graph (or the crowd velocity graph output by the initial convolutional neural network and the pre-labeled crowd velocity graph) meet other preset conditions.
In addition, according to actual requirements, when any one or more conditions including but not limited to the following conditions are met, the crowd counting graph output by the initial element multiplication network is considered to be consistent with the pre-labeled crowd counting graph:
the image features of the crowd counting graph output by the initial element multiplication network are identical to those of the pre-labeled crowd counting graph;
the image features of the pre-labeled crowd counting graph include, as a superset, the image features of the crowd counting graph output by the initial element multiplication network;
the image features shared by the crowd counting graph output by the initial element multiplication network and the pre-labeled crowd counting graph reach a preset quantity or a preset proportion;
the image features shared by the crowd counting graph output by the initial element multiplication network and the pre-labeled crowd counting graph satisfy other preset conditions.
In addition, the similarity between two graphs, for example, between the crowd density graph output by the initial convolutional neural network and the pre-labeled crowd density graph, between the crowd velocity graph output by the initial convolutional neural network and the pre-labeled crowd velocity graph, or between the crowd count graph output by the initial element multiplication network and the pre-labeled crowd count graph, can be measured by the Euclidean (L2) distance between the two graphs. The Euclidean distance between the two graphs is obtained first, and whether that distance is smaller than a preset threshold value determines whether the similarity between the two graphs is greater than the corresponding preset threshold value, since a smaller distance corresponds to a greater similarity.
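The L2-distance consistency check described above can be sketched as follows (the function name, threshold value, and toy maps are illustrative, not from the patent):

```python
import numpy as np

def maps_consistent(map_a, map_b, distance_threshold):
    """Treat two maps as consistent when their Euclidean (L2) distance
    is below the threshold; a smaller distance means greater similarity."""
    l2 = np.sqrt(np.sum((map_a - map_b) ** 2))
    return bool(l2 < distance_threshold)

# Example: a predicted density map versus its ground-truth annotation.
pred = np.array([[0.1, 0.2], [0.3, 0.4]])
gt   = np.array([[0.1, 0.2], [0.3, 0.5]])
print(maps_consistent(pred, gt, 0.2))  # distance is 0.1 -> True
```

The same comparison applies unchanged to velocity maps and count maps, since all are dense arrays of the same spatial size.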
In addition, an embodiment of the present invention further provides a method for crowd cross-line counting in a video using the deep neural network obtained by the above deep neural network training method.
After the deep neural network is obtained by the above training method, a crowd counting graph of a frame image in a video can be obtained based on the deep neural network so as to carry out crowd cross-line counting in the video. An original frame image of the video to be cross-line counted is input into the deep neural network, and the deep neural network can output the crowd counting graph of the frame image through, but not limited to, the operations described in any of the above embodiments of the present invention. In addition, the deep neural network used in the cross-line counting method of the above embodiments may be obtained by the deep neural network training method of any of the above embodiments, or by other training methods, as long as the trained deep neural network can output a crowd counting graph for an input original frame image.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
FIG. 5 is a schematic structural diagram of an embodiment of the cross-line counting apparatus according to the present invention. The cross-line counting apparatus of this embodiment can be used for implementing the above cross-line counting method embodiments of the invention. As shown in fig. 5, the cross-line counting apparatus of this embodiment includes: a first acquisition unit, a second acquisition unit and a third acquisition unit. Wherein:
the first acquisition unit is used as a deep neural network and used for receiving a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting and outputting a crowd counting graph of the original frame images.
The crowd counting map includes counting vectors of the positions in the frame images, and each counting vector indicates the number of people passing, in the counting direction (for example, the two coordinate directions of a two-dimensional coordinate plane), between each frame image and its adjacent previous frame image among the plurality of original frame images.
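To make the data layout concrete, a minimal sketch of such a crowd count map as an H x W x 2 array (the sizes, positions, and fractional values are hypothetical):

```python
import numpy as np

# A hypothetical 4x4-pixel crowd count map: each position stores a 2-D
# count vector (people passed along the x and y axes since the previous frame).
H, W = 4, 4
count_map = np.zeros((H, W, 2))

# Suppose one pedestrian moved one pixel to the right through position (1, 2),
# and half a person's mass crossed position (3, 0) moving downward.
count_map[1, 2] = [1.0, 0.0]   # x-direction crossing
count_map[3, 0] = [0.0, 0.5]   # y-direction crossing

# The vector at each position can later be projected onto any line of
# interest to count crossings in a chosen direction.
print(count_map[1, 2])  # [1. 0.]
```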
Illustratively, the first obtaining unit is specifically configured to sequentially extract at least two frame images from a plurality of original frame images in the video corresponding to the time period T to be analyzed, and generate a crowd count map of the current frame image by using a later frame image of the at least two frame images as the current frame image.
The second obtaining unit is configured to use each of the plurality of original frame images as a current frame image, and obtain, according to a people counting map of the current frame image, a number of people of the current frame image that pass through the LOI from at least one direction, for example, a number of people of the current frame image that pass through the LOI from one direction, or a number of people of the current frame image that pass through the LOI from two directions, for a line of interest LOI to be cross-line counted in the video.
Exemplarily, the second obtaining unit may be specifically configured to project the counting vectors at the positions on the LOI in the population counting graph in a normal direction of the LOI, respectively, to obtain scalar values at the positions on the LOI, where the positive and negative of the scalar values represent two cross-line directions of the LOI; and accumulating the positive scalar value and the negative scalar value on the LOI respectively to obtain the number of people passing through the current frame image in two directions on the LOI respectively.
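The projection-and-accumulation step above can be sketched as follows (the function name, the sample count vectors, and the vertical-LOI normal are illustrative assumptions):

```python
import numpy as np

def bidirectional_loi_counts(count_vectors, loi_normal):
    """Project the count vector at every LOI position onto the LOI's unit
    normal; the sign of each resulting scalar selects one of the two
    crossing directions, and each sign is accumulated separately."""
    n = np.asarray(loi_normal, dtype=float)
    n = n / np.linalg.norm(n)              # unit normal of the LOI
    scalars = count_vectors @ n            # signed crossing count per position
    forward = scalars[scalars > 0].sum()   # crossings along +normal
    backward = -scalars[scalars < 0].sum() # crossings along -normal
    return forward, backward

# Count vectors sampled at three positions on a vertical LOI (normal = +x).
vecs = np.array([[2.0, 0.0],     # two people crossing left-to-right
                 [-1.0, 0.0],    # one person crossing right-to-left
                 [0.5, 1.0]])    # oblique motion: only the x component counts
fwd, bwd = bidirectional_loi_counts(vecs, (1.0, 0.0))
print(fwd, bwd)  # 2.5 1.0
```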
And the third acquisition unit is used for respectively accumulating the number of people passing through the LOI by the plurality of original frame images in the at least one direction to acquire the number of unidirectional line crossing people of the LOI in the at least one direction in the time period T to be analyzed.
When the second acquiring unit acquires the number of people of the current frame image passing through the LOI from one direction, the third acquiring unit correspondingly accumulates the number of people of each frame image in the plurality of original frame images passing through the LOI in the direction to acquire the number of people of one-way line crossing of the LOI in the direction in the time period T to be analyzed. When the second acquiring unit acquires the number of people of the current frame image passing through the LOI from two directions, the third acquiring unit accumulates the number of people of each frame image in the original frame images passing through the LOI in the two directions respectively, and the number of people of one-way line crossing of the LOI in the two directions in the time period T to be analyzed is acquired.
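The per-period accumulation performed by the third acquisition unit can be sketched as follows (the per-frame tuples are hypothetical values):

```python
# Summing the per-frame LOI crossings over all frames of the period T
# yields the one-way cross-line totals in each direction.
per_frame = [(2.5, 1.0), (0.0, 0.5), (1.5, 0.0)]  # (forward, backward) per frame
total_forward = sum(f for f, _ in per_frame)
total_backward = sum(b for _, b in per_frame)
print(total_forward, total_backward)  # 4.0 1.5
```

The total across both directions, as computed by the calculating unit described below in fig. 6, is simply `total_forward + total_backward`.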
The crowd cross-line counting device based on the embodiment of the invention respectively obtains the crowd counting graph of each frame of image corresponding to the time period T to be analyzed in the video through the deep neural network, respectively obtains the number of people passing through the LOI from at least one direction according to the crowd counting graph aiming at each frame of image, respectively accumulates the number of people passing through the LOI from a plurality of original frame images in at least one direction, and obtains the unidirectional cross-line number of people in at least one direction of the LOI in the time period T to be analyzed. Because the embodiment of the invention directly takes each frame image in the original video as input without using a time sequence slice image, the robustness is better, the method can be applied to various different scenes, is also suitable for extreme conditions of large crowd density, low crowd moving speed or immobility, and can be applied across scenes; in addition, the embodiment of the invention carries out cross-line counting based on the crowd counting diagram instead of only using the total number of the crowd, and also considers the distribution condition of the crowd, so that the cross-line counting result is more objective and accurate.
FIG. 6 is a schematic structural diagram of another embodiment of the cross-line counting apparatus according to the present invention. As shown in fig. 6, compared with the embodiment shown in fig. 5, in the cross-line counting apparatus of this embodiment, the first obtaining unit specifically includes a convolutional neural network and an element multiplication network. Wherein:
and the convolutional neural network is used for receiving at least two input frame images, taking a later frame image in the at least two frame images as a current frame image, and generating a crowd density map and a crowd speed map of the current frame image. The crowd density map is used for representing the crowd density of each position in the current frame image, and the crowd speed map is used for representing the speed of each pedestrian in the current frame image moving from the adjacent previous frame image to the current frame image.
Based on the deep neural network training device provided by the above embodiment of the present invention, an original sample video is input to an initial deep neural network, crowd counting graphs pre-labeled for a plurality of original frame images in the sample video are used as supervision signals, and the initial deep neural network is iteratively trained until the training result meets a preset condition, so as to obtain the deep neural network for crowd cross-line counting. Because the deep neural network directly takes the original frame images of the original video as input without using time-series slice images, it is more robust and can be applied to various scenes. It avoids the problems that pedestrians in time-series slice images are hard to identify, or that the number of people in such images cannot be estimated, when the crowd density in the video is high, the crowd moves slowly or is still, or the monitoring camera has a low viewing angle; it is therefore also suitable for situations with high crowd density and low or zero crowd moving speed, and can be applied across scenes. In addition, when training the deep neural network, the embodiment of the invention performs cross-line counting based on the crowd counting graph rather than only the total number of people, thereby also taking the crowd distribution into account, so that the cross-line counting result is more objective and accurate.
Illustratively, when the initial convolutional neural network generates the crowd density map of the current frame image, the initial convolutional neural network can be specifically used for respectively obtaining the crowd density values of all positions in the current frame image according to the pedestrian information in the current frame image; generating a crowd density map of the current frame image according to the crowd density values of all positions in the current frame image; when the crowd speed map of the current frame image is generated, the crowd speed map is specifically used for acquiring the moving speed of each pedestrian in the current frame image according to the position difference of each pedestrian in the current frame image in the current training sample in the previous frame image and the current frame image and the corresponding time difference of the previous frame image and the current frame image; acquiring the crowd speed of each position in the current frame image according to the moving speed and the position of each pedestrian in the current frame image; and generating a crowd speed map of the current frame image according to the crowd speed of each position in the current frame image.
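As a toy illustration of how density and velocity labels could be derived from pedestrian positions in two adjacent frames (the delta-placement scheme and function name are assumptions; real annotations typically smooth each pedestrian with a Gaussian kernel):

```python
import numpy as np

def density_and_velocity_maps(prev_pos, cur_pos, dt, shape):
    """Place each pedestrian's unit mass at their current pixel (density)
    and their displacement divided by the frame time difference there
    (velocity), mirroring the per-position labels described above."""
    density = np.zeros(shape)
    velocity = np.zeros(shape + (2,))
    for (py, px), (cy, cx) in zip(prev_pos, cur_pos):
        density[cy, cx] += 1.0
        velocity[cy, cx] = [(cx - px) / dt, (cy - py) / dt]  # (vx, vy)
    return density, velocity

# One pedestrian moved from (2, 1) to (2, 3) between frames dt = 1 apart.
d, v = density_and_velocity_maps([(2, 1)], [(2, 3)], dt=1.0, shape=(5, 5))
print(d[2, 3], v[2, 3])  # 1.0 [2. 0.]
```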
And the element multiplication network is used for multiplying the crowd density graph and the crowd velocity graph of the current frame image at corresponding positions to obtain a crowd counting graph of the current frame image.
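The element multiplication network's position-wise product can be sketched as follows (the shapes and sample values are illustrative):

```python
import numpy as np

def element_multiply(density_map, velocity_map):
    """Multiply density (H x W) and velocity (H x W x 2) position-wise:
    count_map[y, x] = density[y, x] * velocity[y, x], giving at every
    position a vector of people crossed per frame along each axis."""
    return density_map[..., np.newaxis] * velocity_map

density = np.array([[0.0, 2.0],
                    [1.0, 0.0]])
velocity = np.array([[[0.0, 0.0], [0.5, 0.0]],
                     [[0.0, -1.0], [0.0, 0.0]]])
count_map = element_multiply(density, velocity)
print(count_map[0, 1], count_map[1, 0])  # [1. 0.] [ 0. -1.]
```

Intuitively, density carries "how many people are here" and velocity carries "how fast and in which direction they move", so their product is the per-frame people flux at each position.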
Further, referring to fig. 6, in another embodiment of the cross-line counting apparatus of the present invention, a calculating unit may be further included, configured to accumulate the number of unidirectional cross-line persons in two directions of the LOI, so as to obtain the total number of cross-line persons passing through the LOI in the time period T to be analyzed.
FIG. 7 is a schematic structural diagram of an embodiment of a deep neural network training device according to the present invention. As shown in fig. 7, the deep neural network training device of this embodiment includes a network training unit, configured to input a plurality of original frame images of a sample video to an initial deep neural network, and perform iterative training on the initial deep neural network with a population count map pre-labeled with the plurality of original frame images in the sample video as a supervision signal until a training result meets a preset condition to obtain a final deep neural network; the initial deep neural network includes an initial convolutional neural network CNN and an initial element multiplication network.
In a specific example of the above deep neural network training device embodiment, the plurality of original frame images are labeled with a crowd density map, a crowd speed map, and a crowd count map, respectively. Accordingly, in this embodiment, the network training unit may specifically be configured to train the initial deep neural network by:
respectively taking two adjacent frame images in the plurality of original frame images in the sample video as a training sample to be input into an initial convolutional neural network, taking a pre-labeled crowd density graph and a crowd speed graph as supervision signals, and performing iterative training on the initial convolutional neural network until a training result meets a first preset convergence condition to obtain a convolutional neural network; and
and respectively taking two adjacent frame images in the plurality of original frame images in the sample video as a training sample to be input into the initial deep neural network, taking a pre-labeled population counting graph as a supervision signal, and performing iterative training on the initial deep neural network until a set index meets a second preset convergence condition to obtain a final deep neural network.
For the training process of the initial deep neural network, the first preset convergence condition, and the second preset convergence condition, reference may be made to the above description of the embodiment shown in fig. 3; for further details, reference may be made to the description of the embodiments of the cross-line counting method of the present invention, which are not repeated herein.
The embodiment of the invention also provides a data processing device which comprises the overline counting device provided by any one of the above embodiments of the invention.
Specifically, the data processing apparatus of the embodiment of the present invention may be any apparatus having a data processing function, and may include, for example and without limitation: an ARM (Advanced RISC Machine) processor, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), etc.
The data processing device provided based on the above embodiment of the present invention includes the cross-line counting device provided in any of the above embodiments of the present invention, and each frame image in the original video is directly used as an input without using a time series slice image, so that the robustness is better, the data processing device can be applied to various different scenes, is also applicable to extreme situations with high crowd density, low crowd moving speed or stillness, and can be applied across scenes; in addition, the embodiment of the invention carries out cross-line counting based on the crowd counting diagram instead of only using the total number of the crowd, and also considers the distribution condition of the crowd, so that the cross-line counting result is more objective and accurate.
In addition, an embodiment of the present invention further provides an electronic device, which may be, for example, a mobile terminal, a Personal Computer (PC), a tablet computer, a server, and the like, and the electronic device is provided with the data processing apparatus according to any of the above embodiments of the present invention.
The electronic device provided based on the above embodiment of the present invention includes the above data processing device of the present invention, and thus includes the above cross-line counting device provided in any of the above embodiments of the present invention, and each frame image in the original video is directly used as an input without using a time series slice image, so that the robustness is better, the electronic device can be applied to various different scenes, is also applicable to extreme situations with large crowd density, low crowd moving speed or static, and can be applied across scenes; in addition, the embodiment of the invention carries out cross-line counting based on the crowd counting diagram instead of only using the total number of the crowd, and also considers the distribution condition of the crowd, so that the cross-line counting result is more objective and accurate.
Fig. 8 is a schematic structural diagram of an embodiment of an electronic device according to the present invention. As shown in fig. 8, an electronic device for implementing an embodiment of the present invention includes a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU) that can perform various appropriate actions and processes according to executable instructions stored in a Read Only Memory (ROM) or loaded from a storage section into a Random Access Memory (RAM). The CPU or the GPU may communicate with the ROM and/or the RAM to execute the executable instructions so as to perform operations corresponding to the cross-line counting method provided by the embodiments of the present invention, for example: inputting a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting to a deep neural network, and outputting a crowd counting graph of the plurality of original frame images by the deep neural network; the crowd counting graph comprises a counting vector of each position, and the counting vector is used for representing the number of people passing, in the counting direction, between each frame image and its adjacent previous frame image in the plurality of original frame images; respectively taking each frame image in the plurality of original frame images as a current frame image, and acquiring, for a line of interest (LOI) to be cross-line counted in the video, the number of people of the current frame image passing through the LOI from at least one direction according to the crowd counting graph of the current frame image; and respectively accumulating the numbers of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line-crossing people of the LOI in the at least one direction in the time period T to be analyzed.
In addition, the central processing unit or the graphics processing unit may communicate with the read-only memory and/or the random access memory to execute the executable instructions so as to perform operations corresponding to the deep neural network training method provided by the embodiment of the present invention, for example: inputting a plurality of original frame images of a sample video into an initial deep neural network, taking a population counting graph labeled in advance by the plurality of original frame images as a supervision signal, and performing iterative training on the initial deep neural network until a training result meets a preset condition to obtain a final deep neural network; the initial deep neural network includes an initial convolutional neural network and an initial element multiplication network.
In addition, in the RAM, various programs and data necessary for system operation may also be stored. The CPU, GPU, ROM, and RAM are connected to each other via a bus. An input/output (I/O) interface is also connected to the bus.
The following components are connected to the I/O interface: an input section including a keyboard, a mouse, and the like; an output section including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section including a hard disk and the like; and a communication section including a network interface card such as a LAN card or a modem. The communication section performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface as needed. A removable medium such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive as necessary, so that a computer program read out therefrom is installed into the storage section as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for executing the method shown in the flowchart. The program code may include instructions corresponding to the steps of the cross-line counting method provided by the embodiments of the present invention, for example: instructions for inputting a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting to a deep neural network, and outputting a crowd counting graph of the plurality of original frame images by the deep neural network, wherein the crowd counting graph comprises a counting vector of each position, and the counting vector is used for representing the number of people passing, in the counting direction, between each frame image and its adjacent previous frame image in the plurality of original frame images; instructions for respectively taking each frame image in the plurality of original frame images as a current frame image and acquiring, for a line of interest (LOI) to be cross-line counted in the video, the number of people of the current frame image passing through the LOI from at least one direction according to the crowd counting graph of the current frame image; and instructions for respectively accumulating the numbers of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line-crossing people of the LOI in the at least one direction in the time period T to be analyzed.
The program code may further include instructions corresponding to the execution of any one of the steps of the deep neural network training method provided by the embodiment of the present invention, for example, instructions for inputting a plurality of original frame images in a sample video to an initial deep neural network, taking a population count map pre-labeled with the plurality of original frame images as a supervision signal, iteratively training the initial deep neural network until a training result satisfies a preset condition, and obtaining a final deep neural network; the initial deep neural network includes an initial convolutional neural network and an initial element multiplication network. The computer program may be downloaded and installed from a network through the communication section, and/or installed from a removable medium. The computer program performs the above-mentioned functions defined in the method of the present invention when executed by a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU).
An embodiment of the present invention further provides a computer storage medium for storing computer-readable instructions, where the instructions include: instructions for inputting a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting to a deep neural network, and outputting a crowd counting graph of the plurality of original frame images by the deep neural network, wherein the crowd counting graph comprises a counting vector of each position, and the counting vector is used for representing the number of people passing, in the counting direction, between each frame image and its adjacent previous frame image in the plurality of original frame images; instructions for respectively taking each frame image in the plurality of original frame images as a current frame image and acquiring, for a line of interest (LOI) to be cross-line counted in the video, the number of people of the current frame image passing through the LOI from at least one direction according to the crowd counting graph of the current frame image; and instructions for respectively accumulating the numbers of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line-crossing people of the LOI in the at least one direction in the time period T to be analyzed. Alternatively, the instructions include: instructions for inputting a plurality of original frame images in a sample video into an initial deep neural network, taking crowd counting graphs pre-labeled for the plurality of original frame images as supervision signals, and iteratively training the initial deep neural network until the training result meets a preset condition to obtain a final deep neural network, wherein the initial deep neural network includes an initial convolutional neural network and an initial element multiplication network.
In addition, an embodiment of the present invention further provides a computer device, including:
a memory storing executable instructions;
one or more processors in communication with the memory to execute the executable instructions to perform operations corresponding to the cross-line counting method or the deep neural network training method of any of the above embodiments of the present invention.
The embodiment of the invention can be applied to all scenes needing crowd flow statistics, such as:
Scene 1: when the number of line-crossing people at subway entrances/exits during a time period T to be analyzed needs to be counted, videos of the subway entrances and exits are collected through monitoring cameras, each subway entrance/exit is taken as an LOI, the videos of the subway entrances and exits within the time period T to be analyzed are input into the deep neural network of the embodiment of the invention, and the cross-line counts are obtained through the cross-line counting method of the embodiment of the invention;
Scene 2: for a mass parade in a city, videos of the parade streets are collected through street monitoring cameras, LOIs are set across the width of the parade streets, the videos of the parade streets within the time period T to be analyzed are input into the deep neural network of the embodiment of the invention, and the number of people participating in the parade and the moving state of the crowd can be obtained through the cross-line counting method of the embodiment of the invention, which facilitates allocating police forces to ensure parade order and public safety;
Scene 3: for a scenic spot or a public stadium, videos of the scenic spot or public stadium can be collected through monitoring cameras, LOIs are set at the entrances and exits of the scenic spot or public stadium, the videos are input into the deep neural network of the embodiment of the invention, and the people entering and leaving the scenic spot or stadium can be counted through the cross-line counting method of the embodiment of the invention, so that the people flow is reasonably controlled and dangers such as trampling accidents caused by overcrowding are avoided.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the device and apparatus embodiments, since they correspond to the method embodiments basically, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.
The method, apparatus and device of the present invention may be implemented in a number of ways. For example, the methods, apparatus and devices of the present invention may be implemented by software, hardware, firmware or any combination of software, hardware and firmware. The above-described order for the steps of the method is for illustrative purposes only, and the steps of the method of the present invention are not limited to the order specifically described above unless specifically indicated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the invention in its various embodiments, with such modifications as are suited to the particular use contemplated.

Claims (34)

1. A method of line crossing counting, comprising:
inputting a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting to a deep neural network;
the deep neural network sequentially extracts at least two frame images from the plurality of original frame images corresponding to the time period T to be analyzed in the video, takes the later frame image in the at least two frame images as the current frame image, generates a crowd counting graph of the current frame image, and outputs the crowd counting graphs of the plurality of original frame images; the crowd counting graph comprises counting vectors of positions in the frame image, and the counting vectors are used for indicating the number of people passing through, in the counting direction, between each frame image and the adjacent previous frame image in the at least two frame images extracted sequentially; the at least two frame images extracted sequentially comprise any one of the following: continuous original frame images, discontinuous original frame images, or partially continuous and partially discontinuous original frame images;
respectively taking each frame image of the plurality of original frame images as the current frame image and, for a line of interest (LOI) to be cross-line counted in the video, acquiring the number of people in the current frame image passing through the LOI from at least one direction according to the crowd counting graph of the current frame image;
and respectively accumulating the number of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line crossing people of the LOI in the at least one direction in the time period T to be analyzed.
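As a rough illustration (not part of the claim language), the accumulation step of claim 1 sums each frame's per-direction LOI crossing counts over the period T. In the sketch below, the function names and the stand-in per-frame counting function are our own illustrative assumptions:

```python
def crossline_totals(frame_count_maps, per_frame_counts_fn):
    """Outer loop of the cross-line counting method: apply a per-frame
    LOI counting function to each frame's crowd counting graph and
    accumulate the two directional totals over the period T."""
    total_a = total_b = 0.0
    for count_map in frame_count_maps:
        a, b = per_frame_counts_fn(count_map)  # people crossing in each direction
        total_a += a
        total_b += b
    return total_a, total_b

# Toy stand-in: each "count map" is already reduced to an (a, b) pair.
print(crossline_totals([(2, 1), (3, 0), (1, 2)], lambda cm: cm))  # (6.0, 3.0)
```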
2. The method of claim 1, wherein the counting direction comprises two coordinate directions of a two-dimensional coordinate plane.
3. The method of claim 2, wherein the acquiring the number of people of the current frame image passing through the LOI from at least one direction respectively comprises: acquiring the number of people of the current frame image passing through the LOI from two directions respectively;
the respectively accumulating the number of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line-crossing people of the LOI in the at least one direction in the time period T to be analyzed comprises:
and respectively accumulating the number of people of the plurality of original frame images passing through the LOI in the two directions to obtain the number of unidirectional line-crossing people of the LOI in the two directions in the time period T to be analyzed.
4. The method of any one of claims 1 to 3, wherein the generating the people counting map of the current frame image comprises:
inputting the plurality of original frame images into the deep neural network, and generating a crowd density map and a crowd speed map of the current frame image by a convolutional neural network in the deep neural network; the crowd density graph is used for representing the crowd density of each position in the current frame image, and the crowd speed graph is used for representing the speed of each pedestrian in the current frame image moving from the adjacent previous frame image to the current frame image;
and inputting the crowd density map and the crowd speed map of the current frame image to an element multiplication network in the deep neural network, and multiplying the crowd density map and the crowd speed map of the current frame image at corresponding positions by the element multiplication network to obtain a crowd counting map of the current frame image.
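Claim 4's element multiplication network reduces, per pixel, to multiplying the scalar crowd density by the two-component crowd speed vector. A minimal numpy sketch under our own assumptions about array shapes and names:

```python
import numpy as np

def crowd_count_map(density, velocity):
    """Per-pixel product of a crowd density map of shape (H, W) and a
    crowd speed map of shape (H, W, 2), yielding a crowd counting graph
    of per-pixel count vectors (people crossing per frame interval)."""
    # Broadcast the scalar density over both vector components.
    return density[..., None] * velocity

# Toy 2x2 frame: density 0.5 person/pixel, uniform motion of 2 px/frame in x.
density = np.full((2, 2), 0.5)
velocity = np.zeros((2, 2, 2))
velocity[..., 0] = 2.0
c = crowd_count_map(density, velocity)
print(c[0, 0, 0], c[0, 0, 1])  # 1.0 0.0
```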
5. The method of claim 3, wherein the acquiring the number of people of the current frame image passing through the LOI from two directions respectively comprises:
respectively projecting the counting vectors of the positions on the LOI in the crowd counting graph onto the normal direction of the LOI to obtain scalar values of the positions on the LOI, wherein the signs of the scalar values represent the two cross-line directions of the LOI;
and accumulating the positive scalar values and the negative scalar values on the LOI respectively to obtain the numbers of people of the current frame image passing through the LOI in the two directions respectively.
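The projection in claims 5 and 6 — each LOI position's count vector dotted with the LOI's unit normal, positives and negatives accumulated separately — might look like the following sketch (the pixel coordinates and names are illustrative, not from the patent):

```python
import numpy as np

def loi_directional_counts(count_map, loi_pixels, normal):
    """Project each LOI pixel's count vector onto the LOI's unit normal;
    the sign of the resulting scalar encodes the crossing direction, and
    positive/negative scalars are accumulated separately."""
    n = np.asarray(normal, dtype=float)
    n /= np.linalg.norm(n)
    scalars = np.array([count_map[y, x] for (y, x) in loi_pixels]) @ n
    return scalars[scalars > 0].sum(), -scalars[scalars < 0].sum()

# Vertical LOI at column 2 of a 4x4 frame, normal pointing in +x.
cm = np.zeros((4, 4, 2))
cm[0, 2] = [0.5, 0.0]    # crossing in +x
cm[1, 2] = [-0.25, 0.0]  # crossing in -x
cm[2, 2] = [0.25, 0.1]   # the y-component is tangential and ignored
pos, neg = loi_directional_counts(cm, [(0, 2), (1, 2), (2, 2)], (1, 0))
print(pos, neg)  # 0.75 0.25
```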
6. The method of claim 4, wherein the acquiring the number of people of the current frame image passing through the LOI from two directions respectively comprises:
respectively projecting the counting vectors of the positions on the LOI in the crowd counting graph onto the normal direction of the LOI to obtain scalar values of the positions on the LOI, wherein the signs of the scalar values represent the two cross-line directions of the LOI;
and accumulating the positive scalar values and the negative scalar values on the LOI respectively to obtain the numbers of people of the current frame image passing through the LOI in the two directions respectively.
7. The method of claim 3, further comprising:
and accumulating the number of the unidirectional cross-line persons of the LOI in the two directions to obtain the total number of the cross-line persons passing through the LOI in the time period T to be analyzed.
8. The method of claim 4, further comprising:
and accumulating the number of the unidirectional cross-line persons of the LOI in two directions to obtain the total number of the cross-line persons passing through the LOI in the time period T to be analyzed.
9. A deep neural network training method, wherein the deep neural network is the deep neural network in the cross-line counting method according to any one of claims 1 to 8; the method comprises the following steps:
inputting a plurality of original frame images of a sample video into an initial deep neural network, taking a crowd counting graph pre-labeled for the plurality of original frame images as a supervision signal, and performing iterative training on the initial deep neural network until a training result meets a preset condition, to obtain a final deep neural network; the initial deep neural network comprises an initial convolutional neural network and an initial element multiplication network.
10. The method of claim 9, wherein the plurality of original frame images are respectively pre-labeled with a crowd density map, a crowd speed map, and a crowd counting graph;
the inputting the plurality of original frame images of the sample video into the initial deep neural network, taking the crowd counting graph pre-labeled for the plurality of original frame images as a supervision signal, and performing iterative training on the initial deep neural network until the training result meets the preset condition comprises:
respectively taking two adjacent frames of images in the plurality of original frame images as a training sample to be input into the initial convolutional neural network, taking a pre-labeled crowd density graph and a crowd speed graph as supervision signals, and performing iterative training on the initial convolutional neural network until a training result meets a first preset convergence condition to obtain the convolutional neural network; and
and respectively taking two adjacent frame images of the plurality of original frame images as a training sample to be input into the initial deep neural network, taking the pre-labeled crowd counting graph as a supervision signal, and performing iterative training on the initial deep neural network until a set metric meets a second preset convergence condition, to obtain the final deep neural network.
11. The method of claim 10, further comprising:
respectively performing pedestrian localization on each frame image of the plurality of original frame images to obtain the position of each pedestrian in each frame image, and respectively allocating a pedestrian identification (ID) to each pedestrian, wherein the pedestrian ID is used for uniquely identifying one pedestrian in the video;
and respectively calibrating the pedestrian information of each pedestrian in each frame image according to the pedestrian position in each frame image, wherein the pedestrian information comprises the pedestrian position and the pedestrian ID.
12. The method of claim 11, further comprising:
pre-setting a geometric perspective of the sample video for the scene of the sample video; the geometric perspective comprises a correspondence between the number of pixels at different positions in the sample video and the real physical size of the scene;
after two adjacent frame images in the plurality of original frame images are respectively used as a training sample to be input to the initial convolutional neural network, the method further includes:
the initial convolutional neural network takes the later frame image in the current training sample as the current frame image, generates a crowd density map of the current frame image according to the pedestrian information calibrated for each frame image and the geometric perspective, and generates a crowd speed map of the current frame image according to the pedestrian information in the two frame images of the current training sample and the geometric perspective.
13. The method of claim 12, wherein generating the crowd density map of the current frame image comprises:
respectively acquiring the crowd density value of each position in the current frame image according to the pedestrian information in the current frame image and the geometric perspective;
and generating a crowd density map of the current frame image according to the crowd density values of all positions in the current frame image.
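Claim 13 only requires per-position density values derived from the pedestrian positions and the geometric perspective. One common way to realize this (an illustrative assumption on our part, not mandated by the claim) is to place a normalized 2-D Gaussian at each pedestrian position, with a bandwidth scaled by the perspective value at that row, so that the map integrates to the number of people:

```python
import numpy as np

def density_map(shape, pedestrians, perspective, base_sigma=0.5):
    """Sketch of a crowd density map: each pedestrian contributes a
    normalized 2-D Gaussian centred at their (row, col) position, with a
    bandwidth scaled by the geometric perspective (pixels per metre at
    that row). The Gaussian kernel is an illustrative choice."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    d = np.zeros(shape)
    for (py, px) in pedestrians:
        sigma = base_sigma * perspective[int(py)]  # larger when closer to camera
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2))
        d += g / g.sum()  # each person contributes exactly 1 to the total
    return d

persp = np.linspace(2.0, 8.0, 40)  # assumed pixels-per-metre, per image row
dm = density_map((40, 40), [(10, 15), (30, 25)], persp)
print(round(dm.sum(), 6))  # 2.0 -- two pedestrians
```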
14. The method of claim 12 or 13, wherein the generating the crowd speed map of the current frame image comprises:
acquiring the moving speed of each pedestrian in the current frame image according to the position difference of each pedestrian in the current frame image in the current training sample in the previous frame image and the current frame image and the corresponding time difference of the previous frame image and the current frame image;
acquiring the crowd speed of each position in the current frame image according to the moving speed and the position of each pedestrian in the current frame image;
and generating a crowd speed map of the current frame image according to the crowd speed of each position in the current frame image and the geometric perspective.
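Claim 14's speed computation is a finite difference: the per-axis position change between the previous and current sampled frames divided by their time difference. A tiny sketch (function and variable names are assumed):

```python
def pedestrian_speed(prev_pos, cur_pos, dt):
    """Moving speed of one pedestrian between the previous and current
    frame: per-axis position difference over the time difference."""
    return tuple((c - p) / dt for p, c in zip(prev_pos, cur_pos))

# A pedestrian moves from (10, 20) to (13, 24) over 2 frame intervals.
print(pedestrian_speed((10.0, 20.0), (13.0, 24.0), 2.0))  # (1.5, 2.0)
```

The resulting per-pedestrian speeds would then be written into a map at each pedestrian's position to form the crowd speed map.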
15. The method according to any one of claims 10 to 13, wherein the training result satisfying a first preset convergence condition comprises:
for the plurality of original frame images, the ratio of the number of frames for which the crowd density map and crowd speed map output by the initial convolutional neural network are consistent with the pre-labeled crowd density map and crowd speed map, to the total number of the plurality of original frame images, reaches a first preset threshold; and/or
For each frame image in the plurality of original frame images, the similarity between the crowd density map output by the initial convolutional neural network and the pre-labeled crowd density map, and the similarity between the crowd speed map output by the initial convolutional neural network and the pre-labeled crowd speed map are greater than a second preset threshold; and/or
For the plurality of original frame images, the average similarity between the crowd density graph output by the initial convolutional neural network and the pre-labeled crowd density graph and the average similarity between the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph are greater than a third preset threshold; and/or
And the iterative training times of the initial convolutional neural network reach a fourth preset threshold value.
16. The method of claim 14, wherein the training result satisfying a first preset convergence condition comprises:
for the plurality of original frame images, the ratio of the number of frames for which the crowd density map and crowd speed map output by the initial convolutional neural network are consistent with the pre-labeled crowd density map and crowd speed map, to the total number of the plurality of original frame images, reaches a first preset threshold; and/or
For each frame image in the plurality of original frame images, the similarity between the crowd density map output by the initial convolutional neural network and the pre-labeled crowd density map, and the similarity between the crowd speed map output by the initial convolutional neural network and the pre-labeled crowd speed map are greater than a second preset threshold; and/or
For the plurality of original frame images, the average similarity between the crowd density graph output by the initial convolutional neural network and the pre-labeled crowd density graph and the average similarity between the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph are greater than a third preset threshold; and/or
And the iterative training times of the initial convolutional neural network reach a fourth preset threshold value.
17. The method according to any one of claims 10 to 13, wherein the training result satisfying a second preset convergence condition comprises:
for the plurality of original frame images, the ratio of the number of frames for which the crowd counting graph output by the initial element multiplication network is consistent with the pre-labeled crowd counting graph, to the total number of the plurality of original frame images, reaches a fifth preset threshold; and/or
For each frame image in the plurality of original frame images, the similarity between the crowd counting graph output by the initial element multiplication network and the pre-labeled crowd counting graph is greater than a sixth preset threshold; and/or
For the plurality of original frame images, the average similarity between the crowd counting graph output by the initial element multiplication network and the crowd counting graph obtained by pre-labeling is greater than a seventh preset threshold; and/or
And the number of times of iterative training of the second part of the deep neural network reaches an eighth preset threshold value.
18. The method of claim 14, wherein the training result satisfying a second predetermined convergence condition comprises:
for the plurality of original frame images, the ratio of the number of frames for which the crowd counting graph output by the initial element multiplication network is consistent with the pre-labeled crowd counting graph, to the total number of the plurality of original frame images, reaches a fifth preset threshold; and/or
For each frame image in the plurality of original frame images, the similarity between the crowd counting graph output by the initial element multiplication network and the pre-labeled crowd counting graph is greater than a sixth preset threshold; and/or
For the plurality of original frame images, the average similarity between the crowd counting graph output by the initial element multiplication network and the crowd counting graph obtained by pre-labeling is greater than a seventh preset threshold; and/or
And the number of times of iterative training of the second part of the deep neural network reaches an eighth preset threshold value.
19. A cross-line counting device, comprising:
a first acquisition unit, serving as a deep neural network and configured to receive a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting, sequentially extract at least two frame images from the plurality of original frame images corresponding to the time period T to be analyzed in the video, take the later frame image in the at least two frame images as the current frame image, generate a crowd counting graph of the current frame image, and output the crowd counting graphs of the plurality of original frame images; the crowd counting graph comprises counting vectors of positions in the frame image, and the counting vectors are used for representing the number of people passing through, in the counting direction, between each frame image and the adjacent previous frame image; the at least two frame images extracted sequentially comprise any one of the following: continuous original frame images, discontinuous original frame images, or partially continuous and partially discontinuous original frame images;
a second acquisition unit, configured to respectively take each frame image of the plurality of original frame images as the current frame image and, for a line of interest (LOI) to be cross-line counted in the video, acquire the number of people in the current frame image passing through the LOI from at least one direction according to the crowd counting graph of the current frame image; and a third acquisition unit, configured to respectively accumulate the number of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line-crossing people of the LOI in the at least one direction in the time period T to be analyzed.
20. The apparatus of claim 19, wherein the counting direction comprises two coordinate directions of a two-dimensional coordinate plane.
21. The apparatus according to claim 20, wherein the second obtaining unit is specifically configured to obtain the number of people passing through the LOI from two directions respectively for the current frame image;
the third obtaining unit is specifically configured to respectively accumulate the number of people of the plurality of original frame images passing through the LOI in the two directions to obtain the number of unidirectional line-crossing people of the LOI in the two directions in the time period T to be analyzed.
22. The apparatus according to any one of claims 19 to 21, wherein the first obtaining unit comprises:
the convolutional neural network is used for receiving at least two input frame images, taking a later frame image in the at least two frame images as a current frame image and generating a crowd density map and a crowd speed map of the current frame image; the crowd density graph is used for representing the crowd density of each position in the current frame image, and the crowd speed graph is used for representing the speed of each pedestrian in the current frame image moving from the adjacent previous frame image to the current frame image;
and the element multiplication network is used for multiplying the elements of the crowd density map and the crowd velocity map of the current frame image at the corresponding positions to obtain the crowd counting map of the current frame image.
23. The apparatus according to any one of claims 19 to 21, wherein the second obtaining unit is specifically configured to:
respectively project the counting vectors of the positions on the LOI in the crowd counting graph onto the normal direction of the LOI to obtain scalar values of the positions on the LOI, wherein the signs of the scalar values represent the two cross-line directions of the LOI; and
accumulate the positive scalar values and the negative scalar values on the LOI respectively to obtain the numbers of people of the current frame image passing through the LOI in the two directions respectively.
24. The apparatus of claim 22, wherein the second obtaining unit is specifically configured to:
respectively project the counting vectors of the positions on the LOI in the crowd counting graph onto the normal direction of the LOI to obtain scalar values of the positions on the LOI, wherein the signs of the scalar values represent the two cross-line directions of the LOI; and
accumulate the positive scalar values and the negative scalar values on the LOI respectively to obtain the numbers of people of the current frame image passing through the LOI in the two directions respectively.
25. The apparatus of claim 21, further comprising:
and the calculating unit is used for accumulating the number of unidirectional cross-line people in the two directions of the LOI to obtain the total number of cross-line people passing through the LOI in the time period T to be analyzed.
26. The apparatus of claim 22, further comprising:
and the calculating unit is used for accumulating the number of unidirectional cross-line people of the LOI in two directions to obtain the total number of cross-line people passing through the LOI in the time period T to be analyzed.
27. A deep neural network training device, wherein the deep neural network is the deep neural network in the cross-line counting method according to any one of claims 1 to 8; the device comprises:
a network training unit, configured to input a plurality of original frame images of a sample video into an initial deep neural network, take a crowd counting graph pre-labeled for the plurality of original frame images as a supervision signal, and perform iterative training on the initial deep neural network until a training result meets a preset condition, to obtain a final deep neural network; the initial deep neural network comprises an initial convolutional neural network (CNN) and an initial element multiplication network.
28. The apparatus of claim 27, wherein the plurality of original frame images are respectively pre-labeled with a crowd density map, a crowd speed map, and a crowd counting graph;
the network training unit is specifically configured to:
respectively inputting two adjacent frames of images in the plurality of original frame images as a training sample to the initial convolutional neural network, performing iterative training on the initial convolutional neural network by using a pre-labeled crowd density graph and a crowd speed graph as supervision signals until a training result meets a first preset convergence condition, and obtaining the convolutional neural network; and
and respectively taking two adjacent frame images of the plurality of original frame images as a training sample to be input into the initial deep neural network, taking the pre-labeled crowd counting graph as a supervision signal, and performing iterative training on the initial deep neural network until a set metric meets a second preset convergence condition, to obtain the final deep neural network.
29. The apparatus of claim 28, wherein the scene of the sample video is pre-labeled with geometric perspective, the geometric perspective comprising a correspondence between the number of pixels at different positions in the sample video and a real physical size of the scene; pedestrian information of each pedestrian is calibrated in advance in the plurality of original frame images, the pedestrian information comprises a pedestrian position and a pedestrian ID, and the pedestrian ID uniquely identifies one pedestrian;
the initial convolutional neural network is configured to take the later frame image in the current training sample as the current frame image, generate a crowd density map of the current frame image according to the pedestrian information calibrated for each frame image and the geometric perspective, and generate a crowd speed map of the current frame image according to the pedestrian information in the two frame images of the current training sample and the geometric perspective.
30. The apparatus according to claim 29, wherein, when generating the crowd density map of the current frame image, the initial convolutional neural network is specifically configured to acquire the crowd density values of the positions in the current frame image according to the pedestrian information in the current frame image and the geometric perspective; and generate the crowd density map of the current frame image according to the crowd density values of the positions in the current frame image.
31. The apparatus according to claim 29 or 30, wherein the initial convolutional neural network, when generating the crowd velocity map of the current frame image, is specifically configured to:
acquiring the moving speed of each pedestrian in the current frame image according to the position difference of each pedestrian in the current frame image in the current training sample in the previous frame image and the current frame image and the corresponding time difference of the previous frame image and the current frame image;
acquiring the crowd speed of each position in the current frame image according to the moving speed and the position of each pedestrian in the current frame image;
and generating a crowd speed map of the current frame image according to the crowd speed of each position in the current frame image and the geometric perspective.
32. A data processing apparatus, comprising: the cross-line counting device of any one of claims 19 to 26; or the deep neural network training device of any one of claims 27 to 31.
33. The apparatus of claim 32, wherein the data processing apparatus comprises an advanced RISC machine (ARM) processor, a central processing unit (CPU), or a graphics processing unit (GPU).
34. An electronic device, provided with the data processing apparatus of claim 32 or 33.
CN201610867834.1A 2016-09-29 2016-09-29 Cross-line counting method, deep neural network training method, device and electronic equipment Active CN106407946B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610867834.1A CN106407946B (en) 2016-09-29 2016-09-29 Cross-line counting method, deep neural network training method, device and electronic equipment
PCT/CN2017/103530 WO2018059408A1 (en) 2016-09-29 2017-09-26 Cross-line counting method, and neural network training method and apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610867834.1A CN106407946B (en) 2016-09-29 2016-09-29 Cross-line counting method, deep neural network training method, device and electronic equipment

Publications (2)

Publication Number Publication Date
CN106407946A CN106407946A (en) 2017-02-15
CN106407946B true CN106407946B (en) 2020-03-03

Family

ID=59228726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610867834.1A Active CN106407946B (en) 2016-09-29 2016-09-29 Cross-line counting method, deep neural network training method, device and electronic equipment

Country Status (2)

Country Link
CN (1) CN106407946B (en)
WO (1) WO2018059408A1 (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407946B (en) * 2016-09-29 2020-03-03 北京市商汤科技开发有限公司 Cross-line counting method, deep neural network training method, device and electronic equipment
CN108052984B (en) * 2017-10-30 2019-11-08 中国科学院计算技术研究所 Method of counting and device
CN109472291A (en) * 2018-10-11 2019-03-15 浙江工业大学 A kind of demographics classification method based on DNN algorithm
CN109615140B (en) * 2018-12-14 2024-01-09 中国科学技术大学 Method and device for predicting pedestrian movement
CN109726658B (en) * 2018-12-21 2022-10-04 上海科技大学 Crowd counting and positioning method and system, electronic terminal and storage medium
JP6703679B1 (en) * 2019-02-01 2020-06-03 株式会社計数技研 Counting device, learning device manufacturing device, counting method, learning device manufacturing method, and program
CN109948500B (en) * 2019-03-13 2022-12-27 西安科技大学 Method for accurately monitoring personnel entering and exiting of coal mine
CN110135325B (en) * 2019-05-10 2020-12-08 山东大学 Method and system for counting people of crowd based on scale adaptive network
CN110263643B (en) * 2019-05-20 2023-05-16 上海兑观信息科技技术有限公司 Quick video crowd counting method based on time sequence relation
CN110458114B (en) * 2019-08-13 2022-02-01 杜波 Method and device for determining number of people and storage medium
JP7383435B2 (en) 2019-09-17 2023-11-20 キヤノン株式会社 Image processing device, image processing method, and program
CN110674729A (en) * 2019-09-20 2020-01-10 澳门理工学院 Method for identifying number of people based on heat energy estimation, computer device and computer readable storage medium
CN110991225A (en) * 2019-10-22 2020-04-10 同济大学 Crowd counting and density estimation method and device based on multi-column convolutional neural network
CN110866453B (en) * 2019-10-22 2023-05-02 同济大学 Real-time crowd steady state identification method and device based on convolutional neural network
CN110941999B (en) * 2019-11-12 2023-02-17 通号通信信息集团有限公司 Method for adaptively calculating size of Gaussian kernel in crowd counting system
CN110909648B (en) * 2019-11-15 2023-08-25 华东师范大学 People flow monitoring method implemented on edge computing equipment by using neural network
CN111062275A (en) * 2019-12-02 2020-04-24 汇纳科技股份有限公司 Multi-level supervision crowd counting method, device, medium and electronic equipment
CN111178276B (en) * 2019-12-30 2024-04-02 上海商汤智能科技有限公司 Image processing method, image processing apparatus, and computer-readable storage medium
CN111428551B (en) * 2019-12-30 2023-06-16 杭州海康威视数字技术股份有限公司 Density detection method, density detection model training method and device
CN113378608B (en) * 2020-03-10 2024-04-19 顺丰科技有限公司 Crowd counting method, device, equipment and storage medium
CN112232257B (en) * 2020-10-26 2023-08-11 青岛海信网络科技股份有限公司 Traffic abnormality determination method, device, equipment and medium
CN112333431B (en) * 2020-10-30 2022-06-07 深圳市商汤科技有限公司 Scene monitoring method and device, electronic equipment and storage medium
CN112364788B (en) * 2020-11-13 2021-08-03 润联软件系统(深圳)有限公司 Monitoring video crowd quantity monitoring method based on deep learning and related components thereof
CN113297983A (en) * 2021-05-27 2021-08-24 上海商汤智能科技有限公司 Crowd positioning method and device, electronic equipment and storage medium
CN113239882B (en) * 2021-06-03 2022-06-03 成都鼎安华智慧物联网股份有限公司 Deep learning-based personnel counting method and system
CN113807274B (en) * 2021-09-23 2023-07-04 山东建筑大学 Crowd counting method and system based on image anti-perspective transformation

Citations (2)

Publication number Priority date Publication date Assignee Title
CN103778442A (en) * 2014-02-26 2014-05-07 哈尔滨工业大学深圳研究生院 Central air-conditioner control method based on video people counting statistic analysis
CN102148959B (en) * 2010-02-09 2016-01-20 北京中星微电子有限公司 The moving target detecting method of a kind of video monitoring system and image thereof

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US6944319B1 (en) * 1999-09-13 2005-09-13 Microsoft Corporation Pose-invariant face recognition system and process
CN102542289B (en) * 2011-12-16 2014-06-04 重庆邮电大学 Pedestrian volume statistical method based on plurality of Gaussian counting models
CN105160313A (en) * 2014-09-15 2015-12-16 中国科学院重庆绿色智能技术研究院 Method and apparatus for crowd behavior analysis in video monitoring
CN105590094B (en) * 2015-12-11 2019-03-01 小米科技有限责任公司 Method and device for determining the number of human bodies
CN105740894B (en) * 2016-01-28 2020-05-29 北京航空航天大学 Semantic annotation method for hyperspectral remote sensing image
CN106407946B (en) * 2016-09-29 2020-03-03 北京市商汤科技开发有限公司 Cross-line counting method, deep neural network training method, device and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102148959B (en) * 2010-02-09 2016-01-20 北京中星微电子有限公司 Moving target detection method for a video monitoring system and its images
CN103778442A (en) * 2014-02-26 2014-05-07 哈尔滨工业大学深圳研究生院 Central air-conditioner control method based on video people counting statistic analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Crossing-line Crowd Counting with Two-phase Deep Neural Networks";Zhuoyi Zhao 等;《European Conference on Computer Vision》;20160917;摘要、第3.1-3.3、4.2-4.3节,图1(c)、图2-4、6 *

Also Published As

Publication number Publication date
CN106407946A (en) 2017-02-15
WO2018059408A1 (en) 2018-04-05

Similar Documents

Publication Publication Date Title
CN106407946B (en) Cross-line counting method, deep neural network training method, device and electronic equipment
CN111539273B (en) Traffic video background modeling method and system
Zhao et al. Crossing-line crowd counting with two-phase deep neural networks
Seer et al. Kinects and human kinetics: A new approach for studying pedestrian behavior
US8582816B2 (en) Method and apparatus for video analytics based object counting
US20190138798A1 (en) Time domain action detecting methods and system, electronic devices, and computer storage medium
US20170161591A1 (en) System and method for deep-learning based object tracking
US10009579B2 (en) Method and system for counting people using depth sensor
CN107025658A (en) Method and system for detecting moving objects using a single camera
US9514363B2 (en) Eye gaze driven spatio-temporal action localization
JP7292492B2 (en) Object tracking method and device, storage medium and computer program
CN105809178A (en) Crowd analysis method and device based on facial attributes
Bour et al. Crowd behavior analysis from fixed and moving cameras
WO2009039350A1 (en) System and method for estimating characteristics of persons or things
Himeur et al. Deep visual social distancing monitoring to combat COVID-19: A comprehensive survey
CN109902550A (en) Pedestrian attribute recognition method and device
US9947107B2 (en) Method and system for tracking objects between cameras
KR101529620B1 (en) Method and apparatus for counting pedestrians by moving directions
US11348338B2 (en) Methods and systems for crowd motion summarization via tracklet based human localization
KR101467307B1 (en) Method and apparatus for counting pedestrians using artificial neural network model
Suganyadevi et al. OFGM-SMED: An efficient and robust foreground object detection in compressed video sequences
Lee et al. An intelligent image-based customer analysis service
CN101685538B (en) Method and device for tracking object
CN110414471B (en) Video identification method and system based on double models
KR101467360B1 (en) Method and apparatus for counting pedestrians by moving directions

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant