CN106407946B - Cross-line counting method, deep neural network training method, device and electronic equipment - Google Patents


Info

Publication number
CN106407946B
CN106407946B (application number CN201610867834.1A)
Authority
CN
China
Prior art keywords
frame image, crowd, neural network, counting, LOI
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610867834.1A
Other languages
Chinese (zh)
Other versions
CN106407946A (en)
Inventor
王晓刚
赵倬毅
李鸿升
赵瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201610867834.1A priority Critical patent/CN106407946B/en
Publication of CN106407946A publication Critical patent/CN106407946A/en
Priority to PCT/CN2017/103530 priority patent/WO2018059408A1/en
Application granted granted Critical
Publication of CN106407946B publication Critical patent/CN106407946B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion

Abstract

Embodiments of the invention disclose a cross-line counting method, a deep neural network training method, corresponding devices, and electronic equipment. The cross-line counting method comprises: inputting a plurality of original frame images, corresponding to a time period T to be analyzed, from a video requiring cross-line counting into a deep neural network, which outputs a crowd counting map for each of the original frame images, the crowd counting map comprising a counting vector for each position; taking each of the plurality of original frame images in turn as the current frame image and, for a line of interest (LOI) in the video on which cross-line counting is to be performed, obtaining from the crowd counting map of the current frame image the number of people crossing the LOI in at least one direction; and accumulating, for each of the at least one direction, the numbers of people crossing the LOI over the plurality of original frame images, to obtain the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed. Embodiments of the invention can be applied to a wide variety of scenes, and their cross-line counting results are more objective and accurate.

Description

Cross-line counting method, deep neural network training method, device and electronic equipment
Technical Field
The invention relates to computer vision technology, and in particular to a cross-line counting method, a deep neural network training method, corresponding devices, and electronic equipment.
Background
Automatic crowd counting in video plays an increasingly important role in crowd flow monitoring, public safety, and related areas. In particular, cross-line counting methods can count the flow of people across key roads, entrances, and exits in real time, from which the total number of people in an area can be estimated.
Currently, the mainstream cross-line counting methods are based on temporal slice images. In these methods, the pixel vectors (three-channel vectors for color images) lying on the line of interest are extracted from each frame of the video and stacked along the time dimension to form a two-dimensional temporal slice image. A regression model is then learned on the temporal slice image, using manually annotated cross-line counts directly as the supervision signal, to estimate the number of people in the temporal slice image and hence the number of people crossing the line within a certain period.
Disclosure of Invention
Embodiments of the invention provide a technical scheme for cross-line counting.
According to an aspect of the embodiments of the present invention, there is provided a cross-line counting method, including:
inputting a plurality of original frame images, corresponding to a time period T to be analyzed, from a video requiring cross-line counting into a deep neural network, and outputting, by the deep neural network, a crowd counting map for each of the plurality of original frame images; the crowd counting map comprises a counting vector for each position in the frame image, each counting vector representing the number of people passing through that position, in the counting directions of the two-dimensional coordinate plane, between that frame image and the adjacent previous frame image among the plurality of original frame images;
taking each of the plurality of original frame images in turn as the current frame image and, for a line of interest (LOI) in the video on which cross-line counting is to be performed, obtaining the number of people of the current frame image crossing the LOI in at least one counting direction according to the crowd counting map of the current frame image;
and accumulating, for each of the at least one direction, the numbers of people crossing the LOI over the plurality of original frame images, to obtain the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed.
According to another aspect of the embodiments of the present invention, there is provided a deep neural network training method, including:
inputting a plurality of original frame images of a sample video into an initial deep neural network, and iteratively training the initial deep neural network, using pre-annotated crowd counting maps of the plurality of original frame images as the supervision signal, until the training result meets a preset condition, to obtain the final deep neural network; the initial deep neural network comprises an initial convolutional neural network and an initial element-wise product network.
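The training loop described above (iterate until a preset condition is met) can be sketched in miniature. The following is an illustrative skeleton only, not the patent's actual CNN: the function names, the toy one-parameter model, and the least-squares loss are all stand-ins, with the pre-annotated counting maps replaced by a simple supervision signal y.

```python
def train_until(model_weights, grad_fn, loss_fn, lr=0.1, tol=1e-4, max_iters=1000):
    """Iterative-training skeleton: update weights by gradient descent
    until the training result meets a preset condition (here, the loss
    dropping below `tol`) or an iteration budget is exhausted.  This is
    a hypothetical stand-in for the patent's deep neural network; only
    the control flow ("iterate until a preset condition") is from the text."""
    w = model_weights
    for _ in range(max_iters):
        if loss_fn(w) < tol:  # the "preset condition"
            break
        w = [wi - lr * gi for wi, gi in zip(w, grad_fn(w))]
    return w

# Toy supervised problem: fit w so that the prediction w[0]*x matches
# the "supervision signal" y = 2*x (a stand-in for labelled count maps).
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
loss = lambda w: sum((w[0] * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
grad = lambda w: [sum(2 * (w[0] * x - y) * x for x, y in zip(xs, ys)) / len(xs)]
w = train_until([0.0], grad, loss, lr=0.05)
# w[0] converges towards 2.0
```

In the patent's setting, the weights would belong to a CNN followed by an element-wise product network, and the loss would compare predicted counting maps against the pre-annotated ones.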
According to another aspect of the embodiments of the invention, there is provided a method for cross-line counting of crowds in video based on a neural network obtained by the above deep neural network training method.
According to another aspect of the embodiments of the present invention, there is provided a cross-line counting apparatus, including:
a first acquisition unit, serving as a deep neural network, configured to receive a plurality of original frame images corresponding to a time period T to be analyzed in a video requiring cross-line counting, and to output a crowd counting map for each of the original frame images; the crowd counting map comprises a counting vector for each position in the frame image, each counting vector representing the number of people passing through that position, in the counting directions, between that frame image and the adjacent previous frame image;
a second acquisition unit, configured to take each of the plurality of original frame images in turn as the current frame image and, for a line of interest (LOI) in the video on which cross-line counting is to be performed, obtain the number of people of the current frame image crossing the LOI in at least one direction according to the crowd counting map of the current frame image;
and a third acquisition unit, configured to accumulate, for each of the at least one direction, the numbers of people crossing the LOI over the plurality of original frame images, to obtain the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed.
According to another aspect of the embodiments of the present invention, there is provided a deep neural network training apparatus, including:
a network training unit, configured to input a plurality of original frame images of a sample video into an initial deep neural network and to iteratively train the initial deep neural network, using pre-annotated crowd counting maps of the plurality of original frame images as the supervision signal, until the training result meets a preset condition, to obtain the final deep neural network; the initial deep neural network comprises an initial convolutional neural network (CNN) and an initial element-wise product network.
According to another aspect of the embodiments of the present invention, there is provided a data processing apparatus including the line crossing counting apparatus or the deep neural network training apparatus according to the above embodiments.
According to still another aspect of the embodiments of the present invention, there is provided an electronic device including the data processing apparatus according to the above embodiments.
According to yet another aspect of the embodiments of the present invention, there is provided a computer storage medium for storing computer-readable instructions, the instructions comprising:
instructions for inputting a plurality of original frame images, corresponding to a time period T to be analyzed, from a video requiring cross-line counting into a deep neural network, and outputting, by the deep neural network, a crowd counting map for each of the plurality of original frame images; the crowd counting map comprises a counting vector for each position, each counting vector representing the number of people passing through, in the counting direction, between each frame image and the adjacent previous frame image among the plurality of original frame images;
instructions for taking each of the plurality of original frame images in turn as the current frame image and, for a line of interest (LOI) in the video on which cross-line counting is to be performed, obtaining the number of people of the current frame image crossing the LOI in at least one direction according to the crowd counting map of the current frame image;
and instructions for accumulating, for each of the at least one direction, the numbers of people crossing the LOI over the plurality of original frame images, to obtain the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed.
According to yet another aspect of embodiments of the present invention, there is provided another computer storage medium for storing computer-readable instructions, the instructions comprising:
instructions for inputting a plurality of original frame images of a sample video into an initial deep neural network, and iteratively training the initial deep neural network, using pre-annotated crowd counting maps of the plurality of original frame images as the supervision signal, until the training result meets a preset condition, to obtain the final deep neural network; the initial deep neural network comprises an initial convolutional neural network and an initial element-wise product network.
According to still another aspect of an embodiment of the present invention, there is provided a computer apparatus including:
a memory storing executable instructions;
one or more processors in communication with the memory to execute the executable instructions to perform operations corresponding to the cross-line counting method or the deep neural network training method of any of the above embodiments of the present invention.
Based on the cross-line counting method, the deep neural network training method, the devices, and the electronic equipment provided by the embodiments of the invention, a deep neural network training method is provided together with a technical scheme for crowd cross-line counting based on the trained deep neural network. A sample video is input into an initial deep neural network, pre-annotated crowd counting maps of a plurality of original frame images in the sample video are used as the supervision signal, and the initial deep neural network is trained iteratively until the training result meets a preset condition, yielding the deep neural network. By inputting into the deep neural network a plurality of original frame images corresponding to the time period T to be analyzed in a video requiring cross-line counting, a crowd counting map can be output for each of the original frame images, that is, the number of people passing through each position, in the counting direction (for example, at least one of the x-axis and y-axis directions of the two-dimensional coordinate plane), between the current frame image and the adjacent previous frame image. For each frame image, the number of people crossing the LOI in at least one direction is obtained from its crowd counting map, and these numbers are accumulated over the plurality of original frame images for each direction, yielding the unidirectional cross-line count of the LOI in each of the at least one direction within the time period T to be analyzed.
Because embodiments of the invention take the original frame images of the original video directly as input, without using temporal slice images, they are more robust and can be applied to a wide variety of scenes. They avoid the problems of temporal slice images, in which pedestrians become barely recognizable and the number of people cannot be estimated when the crowd density in the video is high, the crowd moves slowly or is still, or the surveillance camera has a low viewing angle; the method therefore remains applicable in these situations and can be applied across scenes. In addition, embodiments of the invention perform cross-line counting based on the crowd counting map, rather than using only the total crowd count, and thus also take the crowd's spatial distribution into account, making the cross-line counting result more objective and accurate.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
The invention will be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of an embodiment of a cross-line counting method according to the present invention.
FIG. 2 is a flowchart illustrating another embodiment of a cross-line counting method according to the present invention.
FIG. 3 is a flowchart of an embodiment of a deep neural network training method of the present invention.
FIG. 4 is a schematic diagram of an embodiment of the present invention in which an initial deep neural network is trained in two stages.
FIG. 5 is a schematic structural diagram of an embodiment of the cross-line counting apparatus according to the present invention.
FIG. 6 is a schematic structural diagram of another embodiment of the cross-line counting apparatus according to the present invention.
FIG. 7 is a schematic structural diagram of an embodiment of a deep neural network training device according to the present invention.
Fig. 8 is a schematic structural diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations, and with numerous other electronic devices, such as terminal devices, computer systems, servers, etc. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with computer systems, servers, and terminal devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
The computer system, server, and terminal device may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
In one cross-line counting method based on temporal slice images, Histogram of Oriented Gradients (HOG) features are extracted from the temporal slice image, a Gaussian process regression model is then trained to predict the number of people in the temporal slice image, and a dynamic texture method is used to distinguish crowds crossing the line in the two directions. This method is also referred to as an IP-based method.
In another cross-line counting method based on temporal slice images, temporal slice color images and the corresponding temporal slice optical-flow images are input into a convolutional neural network (CNN) to obtain the total number of people in the temporal slice and the proportions of people moving in the two directions, from which the cross-line counts in the two directions are obtained. This method is also known as the temporal slice convolutional neural network (TS-CNN).
In the process of implementing the invention, the inventors found through research that cross-line counting methods based on temporal slice images have at least the following problems:
Temporal slice images are not natural images. When the crowd density in the video is high, when the crowd moves slowly (especially when it is still), or when the surveillance camera has a low viewing angle, the images of pedestrians in the temporal slice are stretched into strips, so the pedestrians are barely recognizable and the number of people in the temporal slice image cannot be estimated, which limits the effectiveness of these methods. In addition, these methods use only the total cross-line count as the supervision signal; the supervision information is not rich enough, which hinders the learning of a complex CNN model.
In embodiments of the invention, a crowd counting map (Counting Map) is obtained for each frame image of the original video. The crowd counting map of each frame image is then accumulated along the line of interest (LOI) on which cross-line counting is to be performed, giving the instantaneous cross-line count values in the two directions on the LOI (i.e., the number of people crossing the LOI in that frame). Finally, the instantaneous cross-line count values within the time period T to be analyzed are accumulated for each of the two directions, giving the crowd cross-line count values (i.e., the number of people crossing the LOI) within the time period T to be analyzed.
FIG. 1 is a flowchart of an embodiment of a cross-line counting method according to the present invention. As shown in fig. 1, the line crossing counting method of this embodiment includes:
102, inputting a plurality of original frame images, corresponding to a time period T to be analyzed, from a video requiring cross-line counting into a deep neural network, and outputting, by the deep neural network, a crowd counting map for each of the plurality of original frame images.
Embodiments of the invention introduce a crowd counting map, which comprises a counting vector for each position in the frame image. That is, each position of the crowd counting map records a two-dimensional counting vector representing the number of people passing through that position, in the counting directions, between the current frame image and the adjacent previous frame image, for example in the two coordinate directions (the x-axis and y-axis directions) of the two-dimensional coordinate plane. The crowd counting map is a mathematical approximation; the value of the counting vector at each position is usually less than 1 and represents what fraction of a person passes through that position between the current frame image and the adjacent previous frame image.
As a specific example, after the plurality of original frame images corresponding to the time period T to be analyzed in the video requiring cross-line counting are input into the deep neural network in operation 102, at least two frame images may be extracted in sequence from the plurality of original frame images, and the crowd counting map of the current frame image may be generated with the later of the at least two frame images taken as the current frame image. The at least two frame images extracted in sequence may be consecutive original frame images, non-consecutive original frame images, or partly consecutive and partly non-consecutive. That is, in embodiments of the invention, crowd cross-line counting may be performed on all original frame images corresponding to the time period T to be analyzed, or on a subset of original frame images extracted from the video; it is not necessary for every original frame image corresponding to the time period T to participate in the counting.
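The frame-extraction scheme described above can be sketched as follows. This is a minimal illustration; the function name and the `stride` parameter are assumptions (the patent only requires that pairs of frames, consecutive or not, be taken in sequence, with the later frame of each pair treated as the current frame).

```python
def frame_pairs(frames, stride=1):
    """Yield (previous, current) frame pairs from a frame sequence.

    The later frame of each pair is treated as the "current" frame for
    which a crowd counting map is generated.  A stride > 1 models the
    case where only some of the original frames within the period T
    are extracted (the pairs need not be consecutive frames)."""
    for i in range(stride, len(frames), stride):
        yield frames[i - stride], frames[i]

# Example: 6 frames of a video segment, sampling every 2nd frame.
frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
pairs = list(frame_pairs(frames, stride=2))
# pairs == [("f0", "f2"), ("f2", "f4")]
```

With `stride=1` the scheme degenerates to the ordinary consecutive-frame case.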
104, taking each of the plurality of original frame images in turn as the current frame image and, for the line of interest (LOI) in the video on which cross-line counting is to be performed, obtaining the number of people of the current frame image crossing the LOI in at least one direction according to the crowd counting map of the current frame image.
The LOI can be set as needed according to the requirements of the crowd counting application, and may be any line between two positions in the video scene across which people are to be counted, for example a line across a subway entrance or a line across the doorway of a shopping mall.
106, accumulating, for each of the at least one direction, the numbers of people crossing the LOI over the plurality of original frame images, to obtain the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed.
The above embodiment of the invention provides a new technical scheme for crowd cross-line counting based on a CNN. A crowd counting map is obtained, via the deep neural network, for each frame image corresponding to the time period T to be analyzed in the video; for each frame image, the number of people crossing the LOI in at least one direction is obtained; and these numbers are accumulated per direction to obtain the unidirectional cross-line count of the LOI in each of the at least one direction within the time period T to be analyzed. Because embodiments of the invention take each frame image of the original video directly as input, without using temporal slice images, they are more robust, can be applied to a wide variety of scenes, remain applicable in extreme conditions where the crowd density is high or the crowd moves slowly or is still, and can be applied across scenes. In addition, embodiments of the invention perform cross-line counting based on the crowd counting map, rather than using only the total crowd count, and thus also take the crowd's spatial distribution into account, making the cross-line counting result more objective and accurate.
In one specific example of the embodiments of the cross-line counting method of the invention, in operation 104 the number of people of the current frame image crossing the LOI from one direction may be obtained; correspondingly, in operation 106 the numbers of people crossing the LOI in that direction are accumulated over the plurality of original frame images, giving the unidirectional cross-line count of the LOI in that direction within the time period T to be analyzed. Alternatively, in operation 104 the numbers of people of the current frame image crossing the LOI from each of the two directions may be obtained; correspondingly, in operation 106 the numbers of people crossing the LOI are accumulated over the plurality of original frame images for each of the two directions, giving the unidirectional cross-line counts of the LOI in both directions within the time period T to be analyzed, so that the bidirectional cross-line traffic of the LOI can be known comprehensively.
In another specific example of the embodiments of the cross-line counting method of the invention, the crowd counting map of the current frame image may be generated as follows:
the plurality of original frame images are input into the deep neural network, and a convolutional neural network within the deep neural network generates a crowd density map and a crowd velocity map of the current frame image. The crowd density map represents the crowd density at each position in the current frame image, and the crowd velocity map represents the velocity with which each pedestrian moves from the adjacent previous frame image to the current frame image;
the crowd density map and the crowd velocity map of the current frame image are input into an element-wise product network within the deep neural network, which multiplies the crowd density map and the crowd velocity map at corresponding positions to obtain the crowd counting map of the current frame image.
In embodiments of the invention, the crowd density map and the crowd velocity map of a frame image are obtained from at least two frame images of the video. Under the assumption that the density distribution and the walking speed of the pedestrians do not change between the two frames, the crowd density map and the crowd velocity map of the current frame image are multiplied element-wise at corresponding positions to obtain the crowd counting map of the frame image, so that the crowd counting map is obtained accurately.
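The element-wise combination of the density map and the velocity map into a counting map can be sketched as follows. This is an illustration only: the patent performs this step inside a learned element-wise product network, whereas here plain nested lists stand in for the two maps, and the function name is an assumption.

```python
def counting_map(density, velocity):
    """Combine a crowd density map and a crowd velocity map into a
    crowd counting map by element-wise multiplication: at each
    position p the count vector is C(p) = density(p) * (vx(p), vy(p)).

    density  : H x W grid of per-position crowd density values
    velocity : H x W grid of (vx, vy) displacement vectors between
               the adjacent previous frame and the current frame"""
    return [
        [(d * vx, d * vy) for d, (vx, vy) in zip(drow, vrow)]
        for drow, vrow in zip(density, velocity)
    ]

# Toy 1x2 example: 0.5 of a person at one position, moving one unit
# in the +x direction; the other position is empty.
density = [[0.5, 0.0]]
velocity = [[(1.0, 0.0), (0.0, 0.0)]]
cmap = counting_map(density, velocity)
# cmap == [[(0.5, 0.0), (0.0, 0.0)]]
```

The fractional value 0.5 illustrates the point made above: each counting vector usually has magnitude below 1, recording what fraction of a person passes through the position between the two frames.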
In a specific example of the embodiments of the cross-line counting method, after the crowd counting map of a frame image has been obtained as described above, for any LOI on which cross-line counting is to be performed, the numbers of people of the current frame image crossing the LOI from the two directions may be obtained as follows:
the counting vectors at the positions on the LOI in the crowd counting map are each projected onto the normal direction of the LOI, giving a scalar value at each position on the LOI, where the sign of the scalar value indicates which of the two crossing directions of the LOI it belongs to, for example entering or leaving a subway entrance;
the positive scalar values and the negative scalar values on the LOI are accumulated separately, giving the numbers of people of the current frame image crossing the LOI in each of the two directions.
For example, positive and negative scalar values on the LOI can be accumulated separately as follows:
$$C_{1,t}=\sum_{p\in \mathrm{LOI}}\max\big(\lVert \mathbf{c}_t(p)\rVert\cos\theta_p,\;0\big),\qquad C_{2,t}=\sum_{p\in \mathrm{LOI}}\max\big(-\lVert \mathbf{c}_t(p)\rVert\cos\theta_p,\;0\big)$$

wherein $C_{1,t}$ and $C_{2,t}$ respectively represent the instantaneous line-crossing count values at time $t$ in the two directions of the LOI in the current frame image, $\theta_p$ represents the angle between the count vector $\mathbf{c}_t(p)=(C_{t,x}(p),C_{t,y}(p))$ at the current position $p$ and the normal direction of the LOI, and $t$ is any time in the time period $T$ to be analyzed.
After the instantaneous line-crossing count values $C_{1,t}$ and $C_{2,t}$ in the two directions on the LOI are obtained for each frame image, the values $C_{1,t}$ and $C_{2,t}$ for every time $t$ within the time period $T$ to be analyzed can be accumulated by the formulas $C_1=\sum_{t\in T}C_{1,t}$ and $C_2=\sum_{t\in T}C_{2,t}$, where $C_1$ and $C_2$ are respectively the numbers of one-way line-crossing persons of the LOI in the two directions within the time period $T$ to be analyzed.
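The per-frame projection and accumulation steps above can be sketched numerically. This is a hedged illustration with assumed names and data (the network's real count map is learned, not hand-built): count vectors on the LOI are projected onto the unit normal, and the positive and negative projections are summed separately.

```python
import numpy as np

def instantaneous_counts(count_map, loi_points, normal):
    """count_map: (H, W, 2); loi_points: (row, col) pixels on the LOI;
    normal: normal vector of the LOI. Returns (c1_t, c2_t)."""
    n = np.asarray(normal, dtype=float)
    n /= np.linalg.norm(n)                     # unit normal
    c1_t = c2_t = 0.0
    for r, c in loi_points:
        s = float(count_map[r, c] @ n)         # signed projection = |c_t(p)| cos(theta_p)
        if s > 0:
            c1_t += s                          # crossings along +normal
        else:
            c2_t -= s                          # crossings along -normal
    return c1_t, c2_t

cm = np.zeros((4, 4, 2))
cm[1, 2] = [3.0, 0.0]                          # three people crossing along +normal
cm[2, 2] = [-1.0, 0.0]                         # one person crossing the other way
c1_t, c2_t = instantaneous_counts(cm, [(1, 2), (2, 2)], (1.0, 0.0))
```

Accumulating c1_t and c2_t over every frame time t within the period T then gives the one-way totals C1 and C2, and their sum gives the total crossing count.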
In a further embodiment of the line crossing counting method, after the numbers of people passing through the LOI in the two directions are obtained for the current frame image, the numbers for the two directions can be accumulated, so as to obtain the total number of line-crossing persons passing through the LOI within the time period T to be analyzed.
FIG. 2 is a flowchart illustrating another embodiment of a cross-line counting method according to the present invention. As shown in fig. 2, the line crossing counting method of this embodiment includes:
202, the deep neural network sequentially extracts at least two frame images from a plurality of original frame images corresponding to a time period T to be analyzed in a video requiring crowd line-crossing counting, and generates a crowd counting map of the current frame image, taking the later frame image of the at least two frame images as the current frame image.
The at least two frame images extracted sequentially may be continuous original frame images or discontinuous original frame images, and the at least two frame images may also be partially continuous original frame images and partially discontinuous original frame images. The group count map includes a count vector for each position in the frame image, that is: each position of the crowd counting graph records a two-dimensional counting vector, and the two-dimensional counting vector represents the number of people passing through between the current frame image and the adjacent previous frame image in the directions of the x axis and the y axis respectively.
And 204, taking each frame image in the plurality of original frame images as a current frame image, and projecting the counting vector of each position on the LOI in the crowd counting graph in the normal direction of the LOI aiming at the LOI to be subjected to line crossing counting in the video to obtain a scalar value of each position on the LOI, wherein the positive and negative of the scalar value represent two line crossing directions of the LOI.
And 206, accumulating the positive scalar value and the negative scalar value on the LOI respectively to obtain the number of people passing through the current frame image in the two directions on the LOI respectively, wherein the number is the instant line crossing count value of the current frame image in the two directions on the LOI at the moment t corresponding to the current frame image.
208, respectively accumulating the number of people of the plurality of original frame images passing through the LOI in two directions in the time period T to be analyzed, and obtaining the number of people of one-way line crossing of the LOI in two directions in the time period T to be analyzed.
And 210, accumulating the number of the unidirectional line crossing people of the LOI in two directions to obtain the total number of the line crossing people passing through the LOI in the time period T to be analyzed.
Before the cross-line counting method of each embodiment of the present invention is performed, an initial deep neural network may be trained in advance to obtain the deep neural network; the obtained deep neural network may be used in the cross-line counting scheme of the above embodiment, and may also be used in other applications requiring a crowd counting map. Specifically, an initial deep neural network may be preset, comprising an initial Convolutional Neural Network (CNN) and an initial element multiplication network. A plurality of original frame images of one or more sample videos are input to the initial deep neural network, pre-labeled crowd counting maps of the plurality of original frame images in the sample videos are used as supervision signals, and the initial deep neural network is iteratively trained until the training result meets a preset condition, obtaining the final deep neural network.
Based on the deep neural network training method provided by the above embodiment of the present invention, a plurality of original frame images of a sample video are input to an initial deep neural network, pre-labeled crowd counting maps of the plurality of original frame images in the sample video are used as supervision signals, and the initial deep neural network is iteratively trained until the training result meets a preset condition, so as to obtain the deep neural network used for crowd line-crossing counting. Because the deep neural network directly takes original frame images of the original video as input instead of using time-sequence slice images, it has better robustness and can be applied to various different scenes. It avoids the problems that pedestrians in time-sequence slice images have low identifiability, and that the number of people in a time-sequence slice image cannot be estimated when the crowd density in the video is high, the crowd moving speed is low or the crowd is still, or when the viewing angle of the monitoring camera is low; it is therefore also suitable for situations where the crowd density is high and the crowd moving speed is low or zero, and can be applied across scenes. In addition, the embodiment of the invention performs line-crossing counting based on the crowd counting map: when training the deep neural network, not only the total number of the crowd but also the distribution of the crowd is considered, so that the line-crossing counting result is more objective and accurate.
In a specific example of the embodiment of the present invention, the plurality of original frame images are labeled with a crowd density map, a crowd speed map, and a crowd count map, respectively. Correspondingly, in this embodiment, inputting a plurality of original frame images of the sample video into the initial deep neural network, taking a pre-labeled population count map of the plurality of original frame images as a supervision signal, and iteratively training the initial deep neural network until a training result meets a preset condition may include:
and respectively taking two adjacent frame images in a plurality of original frame images in the sample video as a training sample to be input into an initial convolutional neural network, taking a pre-labeled crowd density graph and a crowd velocity graph as supervision signals, and performing iterative training on the initial convolutional neural network until a training result meets a first preset convergence condition to obtain a final convolutional neural network. The two adjacent frame images can be two continuous frame original images in an original video, or discontinuous original frame images extracted from the original video according to a certain time interval or frame image interval;
and respectively taking two adjacent frame images in a plurality of original frame images in the sample video as a training sample to be input into the initial deep neural network, taking a pre-labeled population counting graph as a supervision signal, and carrying out iterative training on the initial deep neural network until a set index meets a second preset convergence condition to obtain a final deep neural network.
In order to learn a better deep neural network, in the above embodiment of the present invention, the initial deep neural network is trained in two stages. In the first stage, the crowd density map and the crowd speed map are estimated respectively; these are two relatively simple tasks with relatively obvious semantic information. In the second stage, a direct estimate of the crowd counting map is given.
It should be noted that after the training of the first stage is completed, the crowd counting map could in principle be obtained by multiplying the crowd density map and the crowd speed map; in practice, however, the obtained crowd density map and crowd speed map may be mismatched in spatial position, because no spatial position matching constraint is imposed on them during the first-stage training. Since the target of the second-stage training is obtained by multiplying the crowd density map and the crowd speed map output by the first stage element-wise at corresponding positions, after the first-stage training is finished, the embodiment of the invention corrects the mismatch in spatial position through the second-stage training, thereby effectively ensuring the match of the crowd density map and the crowd speed map in spatial position. In addition, the crowd counting map is used as a supervision signal in the second stage, which facilitates the learning of the complex initial deep neural network, so that the deep neural network obtained by training has a stronger and more accurate counting capability.
In another embodiment of the cross-line counting method of the present invention, before the iterative training of the initial deep neural network, the following operations may be performed:
respectively positioning pedestrians for each frame image in the plurality of original frame images in the sample video to obtain the positions of the pedestrians in each frame image in the sample video and respectively distributing pedestrian IDs to the pedestrians;
and calibrating the pedestrian information of each pedestrian in each frame image in the sample video according to the pedestrian position in each frame image in the sample video, wherein the pedestrian information comprises the pedestrian position and the pedestrian ID.
Since the geometric perspective view includes the correspondence between the number of pixels at different positions in the sample video and the real physical size of the scene, the pedestrian information of each pedestrian is annotated in each of the plurality of original frame images in the sample video according to the pedestrian position in each frame image and the geometric perspective view; that is, each pedestrian can be annotated in the sample video scene with a mark of the corresponding size, according to the pedestrian's position in the real scene and physical size. For example, in the frame image of the sample video corresponding to time t, the position information of the heads of the pedestrians can be labeled as:
$$\mathcal{P}_t=\{P_t^1,\ldots,P_t^n\}$$

where $t$ denotes the time, the superscripts $1,\ldots,n$ denote the pedestrian IDs of the pedestrians (the pedestrian ID here being specifically represented by a serial number), and $P_t^i$ is the annotated head position of pedestrian $i$ at time $t$.
In the specific training process, when the pedestrians in the sample video are calibrated and pedestrian IDs are assigned, every frame image in the sample video may be calibrated, or the calibration may be performed at preset intervals (for example, 1 second) according to the movement condition and moving speed of the pedestrians; the pedestrian positions and pedestrian IDs of the intermediate frame images can then be obtained approximately by interpolating between the two calibrated frame images before and after, which simplifies the labeling workload. In addition, all original frame images in the sample video may participate in the initial deep neural network training, or only a part of them may be extracted to participate, so that more sample videos can be trained for a given training task. The larger the total number of frame images participating in the initial deep neural network training, the better the training effect, and the better the robustness of the deep neural network obtained by training.
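The interpolation of intermediate-frame annotations can be sketched as below. The annotation format (a dict from pedestrian ID to an (x, y) head position) and the function name are assumptions for illustration only; the idea is simply per-ID linear interpolation between two calibrated frames.

```python
def interpolate_annotations(ann0, ann1, t0, t1, t):
    """ann0/ann1: {pedestrian_id: (x, y)} head positions at calibrated
    times t0 and t1; returns linearly interpolated positions at time t
    for the IDs present in both calibrated frames."""
    w = (t - t0) / (t1 - t0)
    out = {}
    for pid in ann0.keys() & ann1.keys():      # IDs visible in both frames
        x0, y0 = ann0[pid]
        x1, y1 = ann1[pid]
        out[pid] = (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
    return out

mid = interpolate_annotations({1: (0.0, 0.0)}, {1: (10.0, 4.0)}, 0, 2, 1)
# mid[1] == (5.0, 2.0): halfway between the two calibrated positions
```

Pedestrians that appear in only one of the two calibrated frames cannot be interpolated this way and would need to be handled separately, e.g. by dropping them for the intermediate frames.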
FIG. 3 is a flowchart of an embodiment of a deep neural network training method of the present invention. The preset initial deep neural network may specifically include an initial CNN and an initial element multiplication network. And the deep neural network obtained after training comprises a corresponding CNN and an element multiplication network. As shown in fig. 3, the deep neural network training method of the embodiment includes:
302, setting a geometric perspective view of a sample video aiming at a scene of the sample video in advance, wherein the geometric perspective view comprises a corresponding relation between the number of pixels at different positions in the sample video and the real physical size of the scene; and respectively carrying out pedestrian positioning on each frame image in a plurality of original frame images participating in network training in the sample video to obtain the pedestrian position in each frame image and respectively distributing pedestrian ID to each pedestrian.
Because different pedestrians have different body sizes while a pedestrian's head is not easily occluded, the position of the pedestrian's head can be used as the pedestrian position, so as to represent the position of the pedestrian accurately and objectively.
And 304, calibrating the pedestrian information of each pedestrian in each frame image in the plurality of original frame images of the sample video respectively according to the pedestrian position in each frame image in the plurality of original frame images of the sample video, wherein the pedestrian information comprises the pedestrian position and the pedestrian ID.
And 306, respectively taking two adjacent frame images in the original frame images in the sample video as a training sample to be input to the initial convolutional neural network, taking a pre-labeled crowd density graph and a crowd velocity graph as supervision signals, and performing iterative training on the initial convolutional neural network until a training result meets a first preset convergence condition to obtain a final convolutional neural network.
The crowd density graph is used for representing the crowd density of each position in the current frame image, and the crowd speed graph is used for representing the speed of each person in the current frame image moving from the previous frame image to the current frame image.
Specifically, after two adjacent frame images among the plurality of original frame images in the sample video are input as a training sample to the initial convolutional neural network, the initial convolutional neural network takes the later frame image in the current training sample as the current frame image, generates a crowd density map of the current frame image according to the pedestrian information calibrated in each frame image, and generates a crowd speed map of the current frame image according to the pedestrian information in the two frame images of the current training sample and the geometric perspective view. It is then compared whether the deviation between the crowd density map and crowd speed map generated by the initial convolutional neural network and the labeled crowd density map and crowd speed map is smaller than a preset condition, or whether the number of iterative training rounds of the initial convolutional neural network has reached a preset threshold. If the deviation is not smaller than the preset condition and the number of iterations has not reached the preset threshold, the network parameters of the initial convolutional neural network are adjusted and operation 306 is continued, until the deviation is smaller than the preset condition or the number of iterations reaches the preset threshold, at which point the training of the initial convolutional neural network ends and the convolutional neural network is obtained.
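The stopping logic described here (train until the deviation falls below a preset condition or the iteration count reaches a preset threshold) can be sketched abstractly. The helper names, the toy "network", and the tolerance values below are all assumptions; the sketch only shows the control flow, not the actual gradient-based training.

```python
def train_until_converged(step, deviation, tol, max_iters):
    """step(): one parameter-adjustment round; deviation(): current
    deviation from the labeled maps. Training stops when the deviation
    drops below tol (the preset condition) or after max_iters rounds
    (the preset iteration threshold)."""
    for i in range(max_iters):
        if deviation() < tol:
            return i                     # converged
        step()                           # adjust network parameters, retry
    return max_iters                     # stopped by the iteration threshold

# Toy stand-in for a network: drive the error toward 0 by halving it.
state = {"err": 1.0}
rounds = train_until_converged(
    step=lambda: state.update(err=state["err"] / 2),
    deviation=lambda: state["err"],
    tol=0.01,
    max_iters=100,
)
# 1 / 2**7 < 0.01, so convergence is detected on round 7
```

The same loop shape applies to both training stages; only the network being updated and the supervision signal change.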
Specifically, the two adjacent frame images among the plurality of original frame images in the sample video may be two consecutive original frame images, two frame images sequentially extracted from three or more consecutive original frame images, two non-consecutive original frame images, two frame images sequentially extracted from three or more non-consecutive frame images, or optical-flow images of the original images. When more than two frame images are extracted, the current frame image and the previous frame image are respectively the later and the earlier of two frame images in the original sample video, and the frame numbers of the two frame images are not required to be consecutive.
In one specific example, the initial convolutional neural network may specifically generate a crowd density map of the current frame image by:
respectively acquiring the crowd density value of each position in the current frame image according to the pedestrian information in the current frame image;
and generating a crowd density map of the current frame image according to the crowd density value and the geometric perspective of each position in the current frame image.
For example, according to the positions of the pedestrians in each frame image, the crowd density values of the positions in the frame image can be obtained after the positions of the pedestrians are respectively marked in each frame image; the crowd density map in the frame image can be calculated and obtained through the following formula:
$$D_t(p)=\sum_{P\in\mathcal{P}_t}\mathcal{N}(p;\,P,\,\sigma_P^2)$$

wherein $D_t(p)$ indicates the crowd density value at position $p$ in the frame image, and $\mathcal{N}(p;P,\sigma_P^2)$ represents a normalized two-dimensional Gaussian distribution centered on the head annotation $P$ (i.e., the position of a pedestrian's head is represented by a Gaussian kernel); $\sigma_P$, the variance of the Gaussian distribution, is determined from the geometric perspective view of each specific sample video scene, to ensure that each person has the same physical size.
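A minimal sketch of this density-map construction follows. The function name, grid size, and the fixed sigma are assumptions (in the method, sigma varies per head according to the geometric perspective view); the key property shown is that each head contributes a normalized Gaussian, so the map sums to the person count.

```python
import numpy as np

def density_map(shape, heads, sigma=1.5):
    """heads: annotated head positions (x, y). Each head contributes a
    normalized 2-D Gaussian, so the map integrates to the person count."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    D = np.zeros(shape)
    for hx, hy in heads:
        g = np.exp(-((xs - hx) ** 2 + (ys - hy) ** 2) / (2 * sigma ** 2))
        D += g / g.sum()                 # normalize: each head sums to exactly 1
    return D

D = density_map((32, 32), [(10.0, 12.0), (20.0, 8.0)])
# D.sum() is 2.0 (two annotated heads), whatever sigma is used
```

Normalizing each kernel over the discrete grid, rather than using the analytic Gaussian normalizer, keeps the sum exact even near image borders.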
In another specific example, the initial convolutional neural network may generate a crowd velocity map of the current frame image by:
acquiring the moving speed of each pedestrian in the current frame image according to the position difference of each pedestrian in the current frame image in the previous frame image and the current frame image and the corresponding time difference of the previous frame image and the current frame image;
acquiring the crowd speed of each position in the current frame image according to the moving speed and the position of each pedestrian in the current frame image;
and generating a crowd speed map of the current frame image according to the crowd speed and the geometric perspective of each position in the current frame image.
For example, the crowd velocity map in the frame image can be calculated and obtained through the following formula:
$$V_t(p)=\sum_{P\in\mathcal{P}_t}v_P\,K(p,P,r_P)$$

wherein $V_t(p)$ represents the crowd speed value at position $p$; $v_P$ is the moving speed of the annotated head $P$ in the current frame image, which can be obtained from the position difference of the head between the two adjacent frame images and the corresponding time difference; and $K(p,P,r_P)=\mathbf{1}(\lVert p-P\rVert_2\le r_P)$ is a disc-shaped kernel function whose center is the head annotation $P$ and whose radius is $r_P$. The radius $r_P$ can be selected by converting an empirically set actual physical size of a human head into the number of pixels at the corresponding position through the geometric perspective view; for example, the physical radius can be chosen empirically as 0.15 m.
And 308, respectively taking two adjacent frame images in the plurality of original frame images in the sample video as a training sample to be input into the initial deep neural network, taking a pre-labeled crowd counting graph as a supervision signal, and performing iterative training on the initial deep neural network until a set index meets a second preset convergence condition to obtain a final deep neural network.
Specifically, after two adjacent frame images among the plurality of original frame images in the sample video are input as a training sample to the initial deep neural network, the convolutional neural network in the initial deep neural network takes the later frame image in the current training sample as the current frame image, generates a crowd density map of the current frame image according to the pedestrian information calibrated in each frame image of the sample video and the geometric perspective view, generates a crowd speed map of the current frame image according to the pedestrian information in the two frame images of the current training sample and the geometric perspective view, and inputs both maps to the initial element multiplication network in the initial deep neural network; the initial element multiplication network multiplies the crowd density map and the crowd speed map of the current frame image input by the convolutional neural network element-wise at corresponding positions to obtain the crowd counting map of the current frame image.
It is then compared whether the deviation between the crowd counting map output by the element multiplication network and the pre-labeled crowd counting map is smaller than a preset condition, or whether the number of iterative training rounds of the initial deep neural network has reached a preset threshold. If the deviation is not smaller than the preset condition and the number of iterations has not reached the preset threshold, the network parameters of the initial element multiplication network are adjusted and operation 308 is continued, until the deviation is smaller than the preset condition or the number of iterations reaches the preset threshold, at which point the training of the initial deep neural network ends, the initial element multiplication network becomes the final element multiplication network, and the final deep neural network is thereby obtained.
In order to obtain the crowd counting graph, in the above embodiment of the present invention, the crowd density graph and the crowd speed graph of the frame image are obtained based on at least two frame images and the geometric perspective view in the plurality of original frame images in the sample video, and the crowd counting graph of the frame image is obtained by multiplying elements of the crowd density graph and the crowd speed graph of the current frame image at corresponding positions on the assumption that the density distribution and the walking speed of the pedestrian at the two frames are unchanged, so that the crowd counting graph is obtained conveniently.
In the embodiment of the invention shown in fig. 3, an initial deep neural network, which is a deep learning model, is introduced. An original video is directly used as the training sample video, frame images in the original video are used as the input of the initial convolutional neural network, and the pixel-level labeled crowd density map, crowd speed map and crowd counting map are used as supervision signals. During training, line-crossing counting is performed based on the crowd counting map instead of using only the total number of the crowd, so the distribution of the crowd is also considered. The deep neural network for line-crossing counting obtained by this training therefore has high robustness, is also applicable to extreme conditions of high crowd density and low or zero crowd moving speed, can be applied across scenes, and avoids the problems that pedestrians in time-sequence slice images have low identifiability and that the number of people in time-sequence slice images cannot be estimated; as a result, the line-crossing counting result is more objective and accurate.
In order to learn a better deep neural network, in the embodiment shown in fig. 3, the initial deep neural network is trained in two stages. The first stage corresponds to operation 306, which is two relatively simple tasks with relatively obvious semantic information, by giving estimates to the crowd density map and the crowd velocity map, respectively, through the initial convolutional neural network; the second stage corresponds to operation 308, which gives a direct estimate of the population count map by the initial element multiplication network.
Fig. 4 is a schematic diagram illustrating an initial deep neural network training in two stages according to an embodiment of the present invention. Inputting two adjacent frames of images in a sample video as a training sample into an initial convolutional neural network in an initial deep neural network, and outputting a crowd density graph and a crowd speed graph by the initial convolutional neural network at a first stage; inputting the crowd density graph and the crowd speed graph into an initial element multiplication network in an initial deep neural network, and outputting a crowd counting graph by the initial element multiplication network in a second stage.
It should be noted that after the training of the first stage is completed, the crowd counting map could in principle be obtained by multiplying the crowd density map and the crowd speed map; in practice, however, the obtained crowd density map and crowd speed map may be mismatched in spatial position, because no spatial position matching constraint is imposed on them during the first-stage training. Since the target of the second-stage training is obtained by multiplying the crowd density map and the crowd speed map output by the first stage element-wise at corresponding positions, the embodiment of the invention corrects the mismatch in spatial position through the second-stage training, effectively ensuring the match of the crowd density map and the crowd speed map in spatial position. In addition, the crowd counting map is used as a supervision signal in the second stage, which facilitates the learning of the complex deep neural network, so that the deep neural network obtained by training has a stronger and more accurate counting capability.
In a specific example of the embodiment shown in fig. 3, for example, the training result may be considered to satisfy the first preset convergence condition when any one or more of the following conditions are satisfied:
for the plurality of original frame images in each sample video, the ratio of the number of frame images for which both the crowd density map and the crowd speed map output by the initial convolutional neural network are consistent with the pre-labeled crowd density map and crowd speed map, to the number of the plurality of original frame images, reaches a first preset threshold; that is, the ratio of the number of frame images whose crowd density map output by the initial convolutional neural network is consistent with the pre-labeled crowd density map, to the number of frame images of the sample video input to the initial convolutional neural network, reaches the first preset threshold, and at the same time the corresponding ratio for the crowd speed map output by the initial convolutional neural network and the pre-labeled crowd speed map also reaches the first preset threshold;
for each frame image in the plurality of original frame images in each sample video, the similarity between the crowd density graph output by the initial convolutional neural network and the pre-labeled crowd density graph and the similarity between the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph are greater than a second preset threshold value;
aiming at the plurality of original frame images in each sample video, the average similarity between the crowd density graph output by the initial convolutional neural network and the pre-labeled crowd density graph and the average similarity between the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph are greater than a third preset threshold value;
and the iterative training times of the initial convolutional neural network reach a fourth preset threshold value.
In another specific example of the embodiment shown in fig. 3, the training result may be considered to satisfy the second preset convergence condition when any one or more of the following conditions are satisfied, for example:
aiming at the plurality of original frame images in each sample video, the ratio of the number of frames of an image, which is output by an initial element multiplication network and is consistent with a pre-marked crowd counting image, to the number of frames of the plurality of original frame images reaches a fifth preset threshold;
aiming at each frame of image in each sample video, the similarity between the crowd counting graph output by the initial element multiplication network and the pre-labeled crowd counting graph is greater than a sixth preset threshold;
aiming at all frame images in each sample video, the average similarity between the crowd counting graph output by the initial element multiplication network and the crowd counting graph obtained by artificial labeling is greater than a seventh preset threshold;
and the number of times of iterative training of the initial deep neural network in the second stage reaches an eighth preset threshold value.
Wherein, according to the actual requirement, when any one or more of the following conditions are satisfied, the crowd density map is considered to be consistent with the pre-labeled crowd density map (or the crowd speed map is consistent with the pre-labeled crowd speed map):
the image characteristics of the crowd density graph output by the initial convolutional neural network and the image characteristics of the pre-labeled crowd density graph (or the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph) are completely the same;
the image features of the pre-labeled crowd density map include but are more than those of the crowd density map output by the initial convolutional neural network (or the image features of the pre-labeled crowd velocity map include but are more than those of the crowd velocity map output by the initial convolutional neural network);
the same characteristics between the image characteristics of the crowd density graph output by the initial convolutional neural network and the image characteristics of the pre-labeled crowd density graph (or the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph) reach a certain quantity or a certain preset proportion;
the same characteristics between the image characteristics of the crowd density graph output by the initial convolutional neural network and the image characteristics of the pre-labeled crowd density graph (or the crowd velocity graph output by the initial convolutional neural network and the pre-labeled crowd velocity graph) meet other preset conditions.
In addition, according to actual requirements, when any one or more conditions including but not limited to the following conditions are met, the crowd counting graph output by the initial element multiplication network is considered to be consistent with the pre-labeled crowd counting graph:
the image features of the crowd counting graph output by the initial element multiplication network are identical to those of the pre-labeled crowd counting graph;
the image features of the pre-labeled crowd counting graph include, as a superset, the image features of the crowd counting graph output by the initial element multiplication network;
the image features shared by the crowd counting graph output by the initial element multiplication network and the pre-labeled crowd counting graph reach a preset quantity or a preset proportion;
the image features shared by the crowd counting graph output by the initial element multiplication network and the pre-labeled crowd counting graph satisfy other preset conditions.
In addition, the similarity between two graphs, for example, between the crowd density graph output by the initial convolutional neural network and the pre-labeled crowd density graph, between the crowd velocity graph output by the initial convolutional neural network and the pre-labeled crowd velocity graph, or between the crowd count graph output by the initial element multiplication network and the pre-labeled crowd count graph, can be measured by the Euclidean (L2) distance between the two graphs. The Euclidean distance between the two graphs is obtained first, and whether that distance is smaller than a preset threshold value determines whether the similarity between the two graphs is greater than the corresponding preset threshold value, since a smaller distance corresponds to a greater similarity.
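The L2-distance consistency check described above can be sketched as follows (the function name, threshold value, and toy maps are illustrative, not from the patent):

```python
import numpy as np

def maps_consistent(map_a, map_b, distance_threshold):
    """Treat two maps as consistent when their Euclidean (L2) distance
    is below the threshold; a smaller distance means greater similarity."""
    l2 = np.sqrt(np.sum((map_a - map_b) ** 2))
    return bool(l2 < distance_threshold)

# Example: a predicted density map versus its ground-truth annotation.
pred = np.array([[0.1, 0.2], [0.3, 0.4]])
gt   = np.array([[0.1, 0.2], [0.3, 0.5]])
print(maps_consistent(pred, gt, 0.2))  # distance is 0.1 -> True
```

The same comparison applies unchanged to velocity maps and count maps, since all are dense arrays of the same spatial size.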
In addition, an embodiment of the present invention further provides a method for crowd cross-line counting in a video using the deep neural network obtained by the above deep neural network training method.
After the deep neural network is obtained by the above training method, a crowd counting graph of a frame image in a video can be obtained based on the deep neural network so as to carry out crowd cross-line counting in the video. An original frame image of the video to be cross-line counted is input into the deep neural network, and the deep neural network can output the crowd counting graph of the frame image through, but not limited to, the operations described in any of the above embodiments of the present invention. In addition, the deep neural network used in the cross-line counting method of the above embodiments may be obtained by the deep neural network training method of any of the above embodiments, or by other training methods, as long as the trained deep neural network can output a crowd counting graph for an input original frame image.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
FIG. 5 is a schematic structural diagram of an embodiment of the cross-line counting apparatus according to the present invention. The cross-line counting apparatus of this embodiment can be used for implementing the above cross-line counting method embodiments of the invention. As shown in fig. 5, the cross-line counting apparatus of this embodiment includes: a first acquisition unit, a second acquisition unit and a third acquisition unit. Wherein:
the first acquisition unit is used as a deep neural network and used for receiving a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting and outputting a crowd counting graph of the original frame images.
The crowd counting map includes counting vectors of the positions in the frame images, and each counting vector indicates the number of people passing, in the counting direction (for example, the two coordinate directions of a two-dimensional coordinate plane), between each frame image and its adjacent previous frame image among the plurality of original frame images.
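To make the data layout concrete, a minimal sketch of such a crowd count map as an H x W x 2 array (the sizes, positions, and fractional values are hypothetical):

```python
import numpy as np

# A hypothetical 4x4-pixel crowd count map: each position stores a 2-D
# count vector (people passed along the x and y axes since the previous frame).
H, W = 4, 4
count_map = np.zeros((H, W, 2))

# Suppose one pedestrian moved one pixel to the right through position (1, 2),
# and half a person's mass crossed position (3, 0) moving downward.
count_map[1, 2] = [1.0, 0.0]   # x-direction crossing
count_map[3, 0] = [0.0, 0.5]   # y-direction crossing

# The vector at each position can later be projected onto any line of
# interest to count crossings in a chosen direction.
print(count_map[1, 2])  # [1. 0.]
```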
Illustratively, the first obtaining unit is specifically configured to sequentially extract at least two frame images from a plurality of original frame images in the video corresponding to the time period T to be analyzed, and generate a crowd count map of the current frame image by using a later frame image of the at least two frame images as the current frame image.
The second obtaining unit is configured to use each of the plurality of original frame images as a current frame image, and obtain, according to a people counting map of the current frame image, a number of people of the current frame image that pass through the LOI from at least one direction, for example, a number of people of the current frame image that pass through the LOI from one direction, or a number of people of the current frame image that pass through the LOI from two directions, for a line of interest LOI to be cross-line counted in the video.
Exemplarily, the second obtaining unit may be specifically configured to project the counting vectors at the positions on the LOI in the population counting graph in a normal direction of the LOI, respectively, to obtain scalar values at the positions on the LOI, where the positive and negative of the scalar values represent two cross-line directions of the LOI; and accumulating the positive scalar value and the negative scalar value on the LOI respectively to obtain the number of people passing through the current frame image in two directions on the LOI respectively.
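The projection-and-accumulation step above can be sketched as follows (the function name, the sample count vectors, and the vertical-LOI normal are illustrative assumptions):

```python
import numpy as np

def bidirectional_loi_counts(count_vectors, loi_normal):
    """Project the count vector at every LOI position onto the LOI's unit
    normal; the sign of each resulting scalar selects one of the two
    crossing directions, and each sign is accumulated separately."""
    n = np.asarray(loi_normal, dtype=float)
    n = n / np.linalg.norm(n)              # unit normal of the LOI
    scalars = count_vectors @ n            # signed crossing count per position
    forward = scalars[scalars > 0].sum()   # crossings along +normal
    backward = -scalars[scalars < 0].sum() # crossings along -normal
    return forward, backward

# Count vectors sampled at three positions on a vertical LOI (normal = +x).
vecs = np.array([[2.0, 0.0],     # two people crossing left-to-right
                 [-1.0, 0.0],    # one person crossing right-to-left
                 [0.5, 1.0]])    # oblique motion: only the x component counts
fwd, bwd = bidirectional_loi_counts(vecs, (1.0, 0.0))
print(fwd, bwd)  # 2.5 1.0
```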
And the third acquisition unit is used for respectively accumulating the number of people passing through the LOI by the plurality of original frame images in the at least one direction to acquire the number of unidirectional line crossing people of the LOI in the at least one direction in the time period T to be analyzed.
When the second acquiring unit acquires the number of people of the current frame image passing through the LOI from one direction, the third acquiring unit correspondingly accumulates the number of people of each frame image in the plurality of original frame images passing through the LOI in the direction to acquire the number of people of one-way line crossing of the LOI in the direction in the time period T to be analyzed. When the second acquiring unit acquires the number of people of the current frame image passing through the LOI from two directions, the third acquiring unit accumulates the number of people of each frame image in the original frame images passing through the LOI in the two directions respectively, and the number of people of one-way line crossing of the LOI in the two directions in the time period T to be analyzed is acquired.
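The per-period accumulation performed by the third acquisition unit can be sketched as follows (the per-frame tuples are hypothetical values):

```python
# Summing the per-frame LOI crossings over all frames of the period T
# yields the one-way cross-line totals in each direction.
per_frame = [(2.5, 1.0), (0.0, 0.5), (1.5, 0.0)]  # (forward, backward) per frame
total_forward = sum(f for f, _ in per_frame)
total_backward = sum(b for _, b in per_frame)
print(total_forward, total_backward)  # 4.0 1.5
```

The total across both directions, as computed by the calculating unit described below in fig. 6, is simply `total_forward + total_backward`.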
The crowd cross-line counting device based on the embodiment of the invention respectively obtains the crowd counting graph of each frame of image corresponding to the time period T to be analyzed in the video through the deep neural network, respectively obtains the number of people passing through the LOI from at least one direction according to the crowd counting graph aiming at each frame of image, respectively accumulates the number of people passing through the LOI from a plurality of original frame images in at least one direction, and obtains the unidirectional cross-line number of people in at least one direction of the LOI in the time period T to be analyzed. Because the embodiment of the invention directly takes each frame image in the original video as input without using a time sequence slice image, the robustness is better, the method can be applied to various different scenes, is also suitable for extreme conditions of large crowd density, low crowd moving speed or immobility, and can be applied across scenes; in addition, the embodiment of the invention carries out cross-line counting based on the crowd counting diagram instead of only using the total number of the crowd, and also considers the distribution condition of the crowd, so that the cross-line counting result is more objective and accurate.
FIG. 6 is a schematic structural diagram of another embodiment of the cross-line counting apparatus according to the present invention. As shown in fig. 6, compared with the embodiment shown in fig. 5, in the cross-line counting apparatus of this embodiment, the first obtaining unit specifically includes a convolutional neural network and an element multiplication network. Wherein:
and the convolutional neural network is used for receiving at least two input frame images, taking a later frame image in the at least two frame images as a current frame image, and generating a crowd density map and a crowd speed map of the current frame image. The crowd density map is used for representing the crowd density of each position in the current frame image, and the crowd speed map is used for representing the speed of each pedestrian in the current frame image moving from the adjacent previous frame image to the current frame image.
Based on the deep neural network training device provided by the above embodiment of the present invention, an original sample video is input to an initial deep neural network, crowd counting graphs pre-labeled for a plurality of original frame images in the sample video are used as supervision signals, and the initial deep neural network is iteratively trained until the training result meets a preset condition, so as to obtain the deep neural network for crowd cross-line counting. Because the deep neural network directly takes the original frame images of the original video as input without using time-series slice images, it is more robust and can be applied to various scenes. It avoids the problems that pedestrians in time-series slice images are hard to identify, or that the number of people in such images cannot be estimated, when the crowd density in the video is high, the crowd moves slowly or is still, or the monitoring camera has a low viewing angle; it is therefore also suitable for situations with high crowd density and low or zero crowd moving speed, and can be applied across scenes. In addition, when training the deep neural network, the embodiment of the invention performs cross-line counting based on the crowd counting graph rather than only the total number of people, thereby also taking the crowd distribution into account, so that the cross-line counting result is more objective and accurate.
Illustratively, when the initial convolutional neural network generates the crowd density map of the current frame image, the initial convolutional neural network can be specifically used for respectively obtaining the crowd density values of all positions in the current frame image according to the pedestrian information in the current frame image; generating a crowd density map of the current frame image according to the crowd density values of all positions in the current frame image; when the crowd speed map of the current frame image is generated, the crowd speed map is specifically used for acquiring the moving speed of each pedestrian in the current frame image according to the position difference of each pedestrian in the current frame image in the current training sample in the previous frame image and the current frame image and the corresponding time difference of the previous frame image and the current frame image; acquiring the crowd speed of each position in the current frame image according to the moving speed and the position of each pedestrian in the current frame image; and generating a crowd speed map of the current frame image according to the crowd speed of each position in the current frame image.
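As a toy illustration of how density and velocity labels could be derived from pedestrian positions in two adjacent frames (the delta-placement scheme and function name are assumptions; real annotations typically smooth each pedestrian with a Gaussian kernel):

```python
import numpy as np

def density_and_velocity_maps(prev_pos, cur_pos, dt, shape):
    """Place each pedestrian's unit mass at their current pixel (density)
    and their displacement divided by the frame time difference there
    (velocity), mirroring the per-position labels described above."""
    density = np.zeros(shape)
    velocity = np.zeros(shape + (2,))
    for (py, px), (cy, cx) in zip(prev_pos, cur_pos):
        density[cy, cx] += 1.0
        velocity[cy, cx] = [(cx - px) / dt, (cy - py) / dt]  # (vx, vy)
    return density, velocity

# One pedestrian moved from (2, 1) to (2, 3) between frames dt = 1 apart.
d, v = density_and_velocity_maps([(2, 1)], [(2, 3)], dt=1.0, shape=(5, 5))
print(d[2, 3], v[2, 3])  # 1.0 [2. 0.]
```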
And the element multiplication network is used for multiplying the crowd density graph and the crowd velocity graph of the current frame image at corresponding positions to obtain a crowd counting graph of the current frame image.
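The element multiplication network's position-wise product can be sketched as follows (the shapes and sample values are illustrative):

```python
import numpy as np

def element_multiply(density_map, velocity_map):
    """Multiply density (H x W) and velocity (H x W x 2) position-wise:
    count_map[y, x] = density[y, x] * velocity[y, x], giving at every
    position a vector of people crossed per frame along each axis."""
    return density_map[..., np.newaxis] * velocity_map

density = np.array([[0.0, 2.0],
                    [1.0, 0.0]])
velocity = np.array([[[0.0, 0.0], [0.5, 0.0]],
                     [[0.0, -1.0], [0.0, 0.0]]])
count_map = element_multiply(density, velocity)
print(count_map[0, 1], count_map[1, 0])  # [1. 0.] [ 0. -1.]
```

Intuitively, density carries "how many people are here" and velocity carries "how fast and in which direction they move", so their product is the per-frame people flux at each position.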
Further, referring to fig. 6, in another embodiment of the cross-line counting apparatus of the present invention, a calculating unit may be further included, configured to accumulate the number of unidirectional cross-line persons in two directions of the LOI, so as to obtain the total number of cross-line persons passing through the LOI in the time period T to be analyzed.
FIG. 7 is a schematic structural diagram of an embodiment of a deep neural network training device according to the present invention. As shown in fig. 7, the deep neural network training device of this embodiment includes a network training unit, configured to input a plurality of original frame images of a sample video to an initial deep neural network, and perform iterative training on the initial deep neural network with a population count map pre-labeled with the plurality of original frame images in the sample video as a supervision signal until a training result meets a preset condition to obtain a final deep neural network; the initial deep neural network includes an initial convolutional neural network CNN and an initial element multiplication network.
In a specific example of the above deep neural network training device embodiment, the plurality of original frame images are labeled with a crowd density map, a crowd speed map, and a crowd count map, respectively. Accordingly, in this embodiment, the network training unit may specifically be configured to train the initial deep neural network by:
respectively taking two adjacent frame images in the plurality of original frame images in the sample video as a training sample to be input into an initial convolutional neural network, taking a pre-labeled crowd density graph and a crowd speed graph as supervision signals, and performing iterative training on the initial convolutional neural network until a training result meets a first preset convergence condition to obtain a convolutional neural network; and
and respectively taking two adjacent frame images in the plurality of original frame images in the sample video as a training sample to be input into the initial deep neural network, taking a pre-labeled population counting graph as a supervision signal, and performing iterative training on the initial deep neural network until a set index meets a second preset convergence condition to obtain a final deep neural network.
For the training process of the initial deep neural network, the first preset convergence condition, and the second preset convergence condition, reference may be made to the above description of the embodiment shown in fig. 3; for further details, reference may be made to the description of the embodiments of the cross-line counting method of the present invention, which are not repeated herein.
The embodiment of the invention also provides a data processing device which comprises the overline counting device provided by any one of the above embodiments of the invention.
Specifically, the data processing apparatus of the embodiment of the present invention may be any apparatus having a data processing function, and may include, for example and without limitation: an ARM (Advanced RISC Machine) processor, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), etc.
The data processing device provided based on the above embodiment of the present invention includes the cross-line counting device provided in any of the above embodiments of the present invention, and each frame image in the original video is directly used as an input without using a time series slice image, so that the robustness is better, the data processing device can be applied to various different scenes, is also applicable to extreme situations with high crowd density, low crowd moving speed or stillness, and can be applied across scenes; in addition, the embodiment of the invention carries out cross-line counting based on the crowd counting diagram instead of only using the total number of the crowd, and also considers the distribution condition of the crowd, so that the cross-line counting result is more objective and accurate.
In addition, an embodiment of the present invention further provides an electronic device, which may be, for example, a mobile terminal, a Personal Computer (PC), a tablet computer, a server, and the like, and the electronic device is provided with the data processing apparatus according to any of the above embodiments of the present invention.
The electronic device provided based on the above embodiment of the present invention includes the above data processing device of the present invention, and thus includes the above cross-line counting device provided in any of the above embodiments of the present invention, and each frame image in the original video is directly used as an input without using a time series slice image, so that the robustness is better, the electronic device can be applied to various different scenes, is also applicable to extreme situations with large crowd density, low crowd moving speed or static, and can be applied across scenes; in addition, the embodiment of the invention carries out cross-line counting based on the crowd counting diagram instead of only using the total number of the crowd, and also considers the distribution condition of the crowd, so that the cross-line counting result is more objective and accurate.
Fig. 8 is a schematic structural diagram of an embodiment of an electronic device according to the present invention. As shown in fig. 8, an electronic device for implementing an embodiment of the present invention includes a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU) that can perform various appropriate actions and processes according to executable instructions stored in a Read Only Memory (ROM) or loaded from a storage section into a Random Access Memory (RAM). The CPU or the GPU may communicate with the ROM and/or the RAM to execute the executable instructions so as to perform operations corresponding to the cross-line counting method provided by the embodiments of the present invention, for example: inputting a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting to a deep neural network, and outputting a crowd counting graph of the plurality of original frame images by the deep neural network; the crowd counting graph comprises a counting vector of each position, and the counting vector is used for representing the number of people passing, in the counting direction, between each frame image and its adjacent previous frame image in the plurality of original frame images; respectively taking each frame image in the plurality of original frame images as a current frame image, and acquiring, for a line of interest (LOI) to be cross-line counted in the video, the number of people of the current frame image passing through the LOI from at least one direction according to the crowd counting graph of the current frame image; and respectively accumulating the numbers of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line-crossing people of the LOI in the at least one direction in the time period T to be analyzed.
In addition, the central processing unit or the graphics processing unit may communicate with the read-only memory and/or the random access memory to execute the executable instructions so as to perform operations corresponding to the deep neural network training method provided by the embodiment of the present invention, for example: inputting a plurality of original frame images of a sample video into an initial deep neural network, taking a population counting graph labeled in advance by the plurality of original frame images as a supervision signal, and performing iterative training on the initial deep neural network until a training result meets a preset condition to obtain a final deep neural network; the initial deep neural network includes an initial convolutional neural network and an initial element multiplication network.
In addition, in the RAM, various programs and data necessary for system operation may also be stored. The CPU, GPU, ROM, and RAM are connected to each other via a bus. An input/output (I/O) interface is also connected to the bus.
The following components are connected to the I/O interface: an input section including a keyboard, a mouse, and the like; an output section including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section including a hard disk and the like; and a communication section including a network interface card such as a LAN card or a modem. The communication section performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface as needed. A removable medium such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive as necessary, so that a computer program read out therefrom is installed into the storage section as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for executing the method shown in the flowchart. The program code may include instructions corresponding to the steps of the cross-line counting method provided by the embodiments of the present invention, for example: instructions for inputting a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting to a deep neural network, and outputting a crowd counting graph of the plurality of original frame images by the deep neural network, wherein the crowd counting graph comprises a counting vector of each position, and the counting vector is used for representing the number of people passing, in the counting direction, between each frame image and its adjacent previous frame image in the plurality of original frame images; instructions for respectively taking each frame image in the plurality of original frame images as a current frame image and acquiring, for a line of interest (LOI) to be cross-line counted in the video, the number of people of the current frame image passing through the LOI from at least one direction according to the crowd counting graph of the current frame image; and instructions for respectively accumulating the numbers of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line-crossing people of the LOI in the at least one direction in the time period T to be analyzed.
The program code may further include instructions corresponding to the execution of any one of the steps of the deep neural network training method provided by the embodiment of the present invention, for example, instructions for inputting a plurality of original frame images in a sample video to an initial deep neural network, taking a population count map pre-labeled with the plurality of original frame images as a supervision signal, iteratively training the initial deep neural network until a training result satisfies a preset condition, and obtaining a final deep neural network; the initial deep neural network includes an initial convolutional neural network and an initial element multiplication network. The computer program may be downloaded and installed from a network through the communication section, and/or installed from a removable medium. The computer program performs the above-mentioned functions defined in the method of the present invention when executed by a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU).
An embodiment of the present invention further provides a computer storage medium for storing computer-readable instructions, where the instructions include: instructions for inputting a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting to a deep neural network, and outputting a crowd counting graph of the plurality of original frame images by the deep neural network, wherein the crowd counting graph comprises a counting vector of each position, and the counting vector is used for representing the number of people passing, in the counting direction, between each frame image and its adjacent previous frame image in the plurality of original frame images; instructions for respectively taking each frame image in the plurality of original frame images as a current frame image and acquiring, for a line of interest (LOI) to be cross-line counted in the video, the number of people of the current frame image passing through the LOI from at least one direction according to the crowd counting graph of the current frame image; and instructions for respectively accumulating the numbers of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line-crossing people of the LOI in the at least one direction in the time period T to be analyzed. Alternatively, the instructions include: instructions for inputting a plurality of original frame images in a sample video into an initial deep neural network, taking crowd counting graphs pre-labeled for the plurality of original frame images as supervision signals, and iteratively training the initial deep neural network until the training result meets a preset condition to obtain a final deep neural network, wherein the initial deep neural network includes an initial convolutional neural network and an initial element multiplication network.
In addition, an embodiment of the present invention further provides a computer device, including:
a memory storing executable instructions;
one or more processors in communication with the memory to execute the executable instructions to perform operations corresponding to the cross-line counting method or the deep neural network training method of any of the above embodiments of the present invention.
The embodiment of the invention can be applied to all scenes needing crowd flow statistics, such as:
Scene 1: when the number of line-crossing people at subway entrances/exits during a time period T to be analyzed needs to be counted, videos of the subway entrances and exits are collected through monitoring cameras, each subway entrance/exit is taken as an LOI, the videos of the subway entrances and exits within the time period T to be analyzed are input into the deep neural network of the embodiment of the invention, and the cross-line counts are obtained through the cross-line counting method of the embodiment of the invention;
Scene 2: for a mass parade in a city, videos of the parade streets are collected through street monitoring cameras, LOIs are set across the width of the parade streets, the videos of the parade streets within the time period T to be analyzed are input into the deep neural network of the embodiment of the invention, and the number of people participating in the parade and the moving state of the crowd can be obtained through the cross-line counting method of the embodiment of the invention, which facilitates allocating police forces to ensure parade order and public safety;
Scene 3: for a scenic spot or a public stadium, videos of the scenic spot or public stadium can be collected through monitoring cameras, LOIs are set at the entrances and exits of the scenic spot or public stadium, the videos are input into the deep neural network of the embodiment of the invention, and the people entering and leaving the scenic spot or stadium can be counted through the cross-line counting method of the embodiment of the invention, so that the people flow is reasonably controlled and dangers such as trampling accidents caused by overcrowding are avoided.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the device and apparatus embodiments, since they correspond to the method embodiments basically, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.
The method, apparatus and device of the present invention may be implemented in a number of ways. For example, the methods, apparatus and devices of the present invention may be implemented by software, hardware, firmware or any combination of software, hardware and firmware. The above-described order for the steps of the method is for illustrative purposes only, and the steps of the method of the present invention are not limited to the order specifically described above unless specifically indicated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the invention in its various embodiments, with such modifications as are suited to the particular use contemplated.

Claims (34)

1. A method of line crossing counting, comprising:
inputting a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting to a deep neural network;
the deep neural network sequentially extracts at least two frame images from the plurality of original frame images corresponding to the time period T to be analyzed in the video, takes the later frame image in the at least two frame images as the current frame image, generates a crowd counting graph of the current frame image, and outputs the crowd counting graphs of the plurality of original frame images; the crowd counting graph comprises counting vectors of positions in the frame image, and the counting vectors are used for indicating the number of people passing through, in the counting direction, between each frame image and the adjacent previous frame image in the at least two frame images extracted sequentially; the at least two frame images extracted sequentially comprise any one of the following: continuous original frame images, discontinuous original frame images, or partially continuous and partially discontinuous original frame images;
respectively taking each frame image of the plurality of original frame images as the current frame image and, for a line of interest (LOI) to be cross-line counted in the video, acquiring the number of people in the current frame image passing through the LOI from at least one direction according to the crowd counting graph of the current frame image;
and respectively accumulating the number of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line crossing people of the LOI in the at least one direction in the time period T to be analyzed.
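As a rough illustration (not part of the claim language), the accumulation step of claim 1 sums each frame's per-direction LOI crossing counts over the period T. In the sketch below, the function names and the stand-in per-frame counting function are our own illustrative assumptions:

```python
def crossline_totals(frame_count_maps, per_frame_counts_fn):
    """Outer loop of the cross-line counting method: apply a per-frame
    LOI counting function to each frame's crowd counting graph and
    accumulate the two directional totals over the period T."""
    total_a = total_b = 0.0
    for count_map in frame_count_maps:
        a, b = per_frame_counts_fn(count_map)  # people crossing in each direction
        total_a += a
        total_b += b
    return total_a, total_b

# Toy stand-in: each "count map" is already reduced to an (a, b) pair.
print(crossline_totals([(2, 1), (3, 0), (1, 2)], lambda cm: cm))  # (6.0, 3.0)
```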
2. The method of claim 1, wherein the counting direction comprises two coordinate directions of a two-dimensional coordinate plane.
3. The method of claim 2, wherein the acquiring the number of people of the current frame image passing through the LOI from at least one direction respectively comprises: acquiring the number of people of the current frame image passing through the LOI from two directions respectively;
the respectively accumulating the number of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line-crossing people of the LOI in the at least one direction in the time period T to be analyzed comprises:
and respectively accumulating the number of people of the plurality of original frame images passing through the LOI in the two directions to obtain the number of unidirectional line-crossing people of the LOI in the two directions in the time period T to be analyzed.
4. The method of any one of claims 1 to 3, wherein the generating the people counting map of the current frame image comprises:
inputting the plurality of original frame images into the deep neural network, and generating a crowd density map and a crowd speed map of the current frame image by a convolutional neural network in the deep neural network; the crowd density graph is used for representing the crowd density of each position in the current frame image, and the crowd speed graph is used for representing the speed of each pedestrian in the current frame image moving from the adjacent previous frame image to the current frame image;
and inputting the crowd density map and the crowd speed map of the current frame image to an element multiplication network in the deep neural network, and multiplying the crowd density map and the crowd speed map of the current frame image at corresponding positions by the element multiplication network to obtain a crowd counting map of the current frame image.
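Claim 4's element multiplication network reduces, per pixel, to multiplying the scalar crowd density by the two-component crowd speed vector. A minimal numpy sketch under our own assumptions about array shapes and names:

```python
import numpy as np

def crowd_count_map(density, velocity):
    """Per-pixel product of a crowd density map of shape (H, W) and a
    crowd speed map of shape (H, W, 2), yielding a crowd counting graph
    of per-pixel count vectors (people crossing per frame interval)."""
    # Broadcast the scalar density over both vector components.
    return density[..., None] * velocity

# Toy 2x2 frame: density 0.5 person/pixel, uniform motion of 2 px/frame in x.
density = np.full((2, 2), 0.5)
velocity = np.zeros((2, 2, 2))
velocity[..., 0] = 2.0
c = crowd_count_map(density, velocity)
print(c[0, 0, 0], c[0, 0, 1])  # 1.0 0.0
```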
5. The method of claim 3, wherein the acquiring the number of people of the current frame image passing through the LOI from two directions respectively comprises:
respectively projecting the counting vectors of the positions on the LOI in the crowd counting graph onto the normal direction of the LOI to obtain scalar values of the positions on the LOI, wherein the signs of the scalar values represent the two cross-line directions of the LOI;
and accumulating the positive scalar values and the negative scalar values on the LOI respectively to obtain the numbers of people of the current frame image passing through the LOI in the two directions respectively.
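The projection in claims 5 and 6 — each LOI position's count vector dotted with the LOI's unit normal, positives and negatives accumulated separately — might look like the following sketch (the pixel coordinates and names are illustrative, not from the patent):

```python
import numpy as np

def loi_directional_counts(count_map, loi_pixels, normal):
    """Project each LOI pixel's count vector onto the LOI's unit normal;
    the sign of the resulting scalar encodes the crossing direction, and
    positive/negative scalars are accumulated separately."""
    n = np.asarray(normal, dtype=float)
    n /= np.linalg.norm(n)
    scalars = np.array([count_map[y, x] for (y, x) in loi_pixels]) @ n
    return scalars[scalars > 0].sum(), -scalars[scalars < 0].sum()

# Vertical LOI at column 2 of a 4x4 frame, normal pointing in +x.
cm = np.zeros((4, 4, 2))
cm[0, 2] = [0.5, 0.0]    # crossing in +x
cm[1, 2] = [-0.25, 0.0]  # crossing in -x
cm[2, 2] = [0.25, 0.1]   # the y-component is tangential and ignored
pos, neg = loi_directional_counts(cm, [(0, 2), (1, 2), (2, 2)], (1, 0))
print(pos, neg)  # 0.75 0.25
```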
6. The method of claim 4, wherein the acquiring the number of people of the current frame image passing through the LOI from two directions respectively comprises:
respectively projecting the counting vectors of the positions on the LOI in the crowd counting graph onto the normal direction of the LOI to obtain scalar values of the positions on the LOI, wherein the signs of the scalar values represent the two cross-line directions of the LOI;
and accumulating the positive scalar values and the negative scalar values on the LOI respectively to obtain the numbers of people of the current frame image passing through the LOI in the two directions respectively.
7. The method of claim 3, further comprising:
and accumulating the number of the unidirectional cross-line persons of the LOI in the two directions to obtain the total number of the cross-line persons passing through the LOI in the time period T to be analyzed.
8. The method of claim 4, further comprising:
and accumulating the number of the unidirectional cross-line persons of the LOI in two directions to obtain the total number of the cross-line persons passing through the LOI in the time period T to be analyzed.
9. A deep neural network training method, wherein the deep neural network is the deep neural network in the cross-line counting method according to any one of claims 1 to 8; the method comprises the following steps:
inputting a plurality of original frame images of a sample video into an initial deep neural network, taking a crowd counting graph pre-labeled for the plurality of original frame images as a supervision signal, and performing iterative training on the initial deep neural network until a training result meets a preset condition, to obtain a final deep neural network; the initial deep neural network comprises an initial convolutional neural network and an initial element multiplication network.
10. The method of claim 9, wherein the plurality of original frame images are respectively pre-labeled with a crowd density map, a crowd speed map, and a crowd counting graph;
the inputting the plurality of original frame images of the sample video into the initial deep neural network, taking the crowd counting graph pre-labeled for the plurality of original frame images as a supervision signal, and performing iterative training on the initial deep neural network until the training result meets the preset condition comprises:
respectively taking two adjacent frames of images in the plurality of original frame images as a training sample to be input into the initial convolutional neural network, taking a pre-labeled crowd density graph and a crowd speed graph as supervision signals, and performing iterative training on the initial convolutional neural network until a training result meets a first preset convergence condition to obtain the convolutional neural network; and
and respectively taking two adjacent frame images of the plurality of original frame images as a training sample to be input into the initial deep neural network, taking the pre-labeled crowd counting graph as a supervision signal, and performing iterative training on the initial deep neural network until a set metric meets a second preset convergence condition, to obtain the final deep neural network.
11. The method of claim 10, further comprising:
respectively performing pedestrian localization on each frame image of the plurality of original frame images to obtain the position of each pedestrian in each frame image, and respectively allocating a pedestrian identification (ID) to each pedestrian, wherein the pedestrian ID is used for uniquely identifying one pedestrian in the video;
and respectively calibrating the pedestrian information of each pedestrian in each frame image according to the pedestrian position in each frame image, wherein the pedestrian information comprises the pedestrian position and the pedestrian ID.
12. The method of claim 11, further comprising:
pre-setting a geometric perspective of the sample video for the scene of the sample video; the geometric perspective comprises a correspondence between the number of pixels at different positions in the sample video and the real physical size of the scene;
after two adjacent frame images in the plurality of original frame images are respectively used as a training sample to be input to the initial convolutional neural network, the method further includes:
the initial convolutional neural network takes the later frame image in the current training sample as the current frame image, generates a crowd density map of the current frame image according to the pedestrian information calibrated for each frame image and the geometric perspective, and generates a crowd speed map of the current frame image according to the pedestrian information in the two frame images of the current training sample and the geometric perspective.
13. The method of claim 12, wherein generating the crowd density map of the current frame image comprises:
respectively acquiring the crowd density value of each position in the current frame image according to the pedestrian information in the current frame image and the geometric perspective;
and generating a crowd density map of the current frame image according to the crowd density values of all positions in the current frame image.
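Claim 13 only requires per-position density values derived from the pedestrian positions and the geometric perspective. One common way to realize this (an illustrative assumption on our part, not mandated by the claim) is to place a normalized 2-D Gaussian at each pedestrian position, with a bandwidth scaled by the perspective value at that row, so that the map integrates to the number of people:

```python
import numpy as np

def density_map(shape, pedestrians, perspective, base_sigma=0.5):
    """Sketch of a crowd density map: each pedestrian contributes a
    normalized 2-D Gaussian centred at their (row, col) position, with a
    bandwidth scaled by the geometric perspective (pixels per metre at
    that row). The Gaussian kernel is an illustrative choice."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    d = np.zeros(shape)
    for (py, px) in pedestrians:
        sigma = base_sigma * perspective[int(py)]  # larger when closer to camera
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2))
        d += g / g.sum()  # each person contributes exactly 1 to the total
    return d

persp = np.linspace(2.0, 8.0, 40)  # assumed pixels-per-metre, per image row
dm = density_map((40, 40), [(10, 15), (30, 25)], persp)
print(round(dm.sum(), 6))  # 2.0 -- two pedestrians
```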
14. The method of claim 12 or 13, wherein the generating the crowd speed map of the current frame image comprises:
acquiring the moving speed of each pedestrian in the current frame image according to the position difference of each pedestrian in the current frame image in the current training sample in the previous frame image and the current frame image and the corresponding time difference of the previous frame image and the current frame image;
acquiring the crowd speed of each position in the current frame image according to the moving speed and the position of each pedestrian in the current frame image;
and generating a crowd speed map of the current frame image according to the crowd speed of each position in the current frame image and the geometric perspective.
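Claim 14's speed computation is a finite difference: the per-axis position change between the previous and current sampled frames divided by their time difference. A tiny sketch (function and variable names are assumed):

```python
def pedestrian_speed(prev_pos, cur_pos, dt):
    """Moving speed of one pedestrian between the previous and current
    frame: per-axis position difference over the time difference."""
    return tuple((c - p) / dt for p, c in zip(prev_pos, cur_pos))

# A pedestrian moves from (10, 20) to (13, 24) over 2 frame intervals.
print(pedestrian_speed((10.0, 20.0), (13.0, 24.0), 2.0))  # (1.5, 2.0)
```

The resulting per-pedestrian speeds would then be written into a map at each pedestrian's position to form the crowd speed map.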
15. The method according to any one of claims 10 to 13, wherein the training result satisfying a first preset convergence condition comprises:
for the plurality of original frame images, the ratio of the number of frames for which the crowd density map and crowd speed map output by the initial convolutional neural network are consistent with the pre-labeled crowd density map and crowd speed map, to the total number of the plurality of original frame images, reaches a first preset threshold; and/or
For each frame image in the plurality of original frame images, the similarity between the crowd density map output by the initial convolutional neural network and the pre-labeled crowd density map, and the similarity between the crowd speed map output by the initial convolutional neural network and the pre-labeled crowd speed map are greater than a second preset threshold; and/or
For the plurality of original frame images, the average similarity between the crowd density graph output by the initial convolutional neural network and the pre-labeled crowd density graph and the average similarity between the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph are greater than a third preset threshold; and/or
And the iterative training times of the initial convolutional neural network reach a fourth preset threshold value.
16. The method of claim 14, wherein the training result satisfying a first preset convergence condition comprises:
for the plurality of original frame images, the ratio of the number of frames for which the crowd density map and crowd speed map output by the initial convolutional neural network are consistent with the pre-labeled crowd density map and crowd speed map, to the total number of the plurality of original frame images, reaches a first preset threshold; and/or
For each frame image in the plurality of original frame images, the similarity between the crowd density map output by the initial convolutional neural network and the pre-labeled crowd density map, and the similarity between the crowd speed map output by the initial convolutional neural network and the pre-labeled crowd speed map are greater than a second preset threshold; and/or
For the plurality of original frame images, the average similarity between the crowd density graph output by the initial convolutional neural network and the pre-labeled crowd density graph and the average similarity between the crowd speed graph output by the initial convolutional neural network and the pre-labeled crowd speed graph are greater than a third preset threshold; and/or
And the iterative training times of the initial convolutional neural network reach a fourth preset threshold value.
17. The method according to any one of claims 10 to 13, wherein the training result satisfying a second preset convergence condition comprises:
for the plurality of original frame images, the ratio of the number of frames for which the crowd counting graph output by the initial element multiplication network is consistent with the pre-labeled crowd counting graph, to the total number of the plurality of original frame images, reaches a fifth preset threshold; and/or
For each frame image in the plurality of original frame images, the similarity between the crowd counting graph output by the initial element multiplication network and the pre-labeled crowd counting graph is greater than a sixth preset threshold; and/or
For the plurality of original frame images, the average similarity between the crowd counting graph output by the initial element multiplication network and the crowd counting graph obtained by pre-labeling is greater than a seventh preset threshold; and/or
And the number of times of iterative training of the second part of the deep neural network reaches an eighth preset threshold value.
18. The method of claim 14, wherein the training result satisfying a second predetermined convergence condition comprises:
for the plurality of original frame images, the ratio of the number of frames for which the crowd counting graph output by the initial element multiplication network is consistent with the pre-labeled crowd counting graph, to the total number of the plurality of original frame images, reaches a fifth preset threshold; and/or
For each frame image in the plurality of original frame images, the similarity between the crowd counting graph output by the initial element multiplication network and the pre-labeled crowd counting graph is greater than a sixth preset threshold; and/or
For the plurality of original frame images, the average similarity between the crowd counting graph output by the initial element multiplication network and the crowd counting graph obtained by pre-labeling is greater than a seventh preset threshold; and/or
And the number of times of iterative training of the second part of the deep neural network reaches an eighth preset threshold value.
19. A cross-line counting device, comprising:
a first acquisition unit, serving as a deep neural network and configured to receive a plurality of original frame images corresponding to a time period T to be analyzed in a video needing cross-line counting, sequentially extract at least two frame images from the plurality of original frame images corresponding to the time period T to be analyzed in the video, take the later frame image in the at least two frame images as the current frame image, generate a crowd counting graph of the current frame image, and output the crowd counting graphs of the plurality of original frame images; the crowd counting graph comprises counting vectors of positions in the frame image, and the counting vectors are used for representing the number of people passing through, in the counting direction, between each frame image and the adjacent previous frame image; the at least two frame images extracted sequentially comprise any one of the following: continuous original frame images, discontinuous original frame images, or partially continuous and partially discontinuous original frame images;
a second acquisition unit, configured to respectively take each frame image of the plurality of original frame images as the current frame image and, for a line of interest (LOI) to be cross-line counted in the video, acquire the number of people in the current frame image passing through the LOI from at least one direction according to the crowd counting graph of the current frame image; and a third acquisition unit, configured to respectively accumulate the number of people of the plurality of original frame images passing through the LOI in the at least one direction to obtain the number of unidirectional line-crossing people of the LOI in the at least one direction in the time period T to be analyzed.
20. The apparatus of claim 19, wherein the counting direction comprises two coordinate directions of a two-dimensional coordinate plane.
21. The apparatus according to claim 20, wherein the second obtaining unit is specifically configured to obtain the number of people passing through the LOI from two directions respectively for the current frame image;
the third obtaining unit is specifically configured to respectively accumulate the number of people of the plurality of original frame images passing through the LOI in the two directions to obtain the number of unidirectional line-crossing people of the LOI in the two directions in the time period T to be analyzed.
22. The apparatus according to any one of claims 19 to 21, wherein the first obtaining unit comprises:
the convolutional neural network is used for receiving at least two input frame images, taking a later frame image in the at least two frame images as a current frame image and generating a crowd density map and a crowd speed map of the current frame image; the crowd density graph is used for representing the crowd density of each position in the current frame image, and the crowd speed graph is used for representing the speed of each pedestrian in the current frame image moving from the adjacent previous frame image to the current frame image;
and the element multiplication network is used for multiplying the elements of the crowd density map and the crowd velocity map of the current frame image at the corresponding positions to obtain the crowd counting map of the current frame image.
23. The apparatus according to any one of claims 19 to 21, wherein the second obtaining unit is specifically configured to:
respectively project the counting vectors of the positions on the LOI in the crowd counting graph onto the normal direction of the LOI to obtain scalar values of the positions on the LOI, wherein the signs of the scalar values represent the two cross-line directions of the LOI; and
accumulate the positive scalar values and the negative scalar values on the LOI respectively to obtain the numbers of people of the current frame image passing through the LOI in the two directions respectively.
24. The apparatus of claim 22, wherein the second obtaining unit is specifically configured to:
respectively project the counting vectors of the positions on the LOI in the crowd counting graph onto the normal direction of the LOI to obtain scalar values of the positions on the LOI, wherein the signs of the scalar values represent the two cross-line directions of the LOI; and
accumulate the positive scalar values and the negative scalar values on the LOI respectively to obtain the numbers of people of the current frame image passing through the LOI in the two directions respectively.
25. The apparatus of claim 21, further comprising:
and the calculating unit is used for accumulating the number of unidirectional cross-line people in the two directions of the LOI to obtain the total number of cross-line people passing through the LOI in the time period T to be analyzed.
26. The apparatus of claim 22, further comprising:
and the calculating unit is used for accumulating the number of unidirectional cross-line people of the LOI in two directions to obtain the total number of cross-line people passing through the LOI in the time period T to be analyzed.
27. A deep neural network training device, wherein the deep neural network is the deep neural network in the cross-line counting method according to any one of claims 1 to 8; the device comprises:
a network training unit, configured to input a plurality of original frame images of a sample video into an initial deep neural network, take a crowd counting graph pre-labeled for the plurality of original frame images as a supervision signal, and perform iterative training on the initial deep neural network until a training result meets a preset condition, to obtain a final deep neural network; the initial deep neural network comprises an initial convolutional neural network (CNN) and an initial element multiplication network.
28. The apparatus of claim 27, wherein the plurality of original frame images are respectively pre-labeled with a crowd density map, a crowd speed map, and a crowd counting graph;
the network training unit is specifically configured to:
respectively inputting two adjacent frames of images in the plurality of original frame images as a training sample to the initial convolutional neural network, performing iterative training on the initial convolutional neural network by using a pre-labeled crowd density graph and a crowd speed graph as supervision signals until a training result meets a first preset convergence condition, and obtaining the convolutional neural network; and
and respectively taking two adjacent frame images of the plurality of original frame images as a training sample to be input into the initial deep neural network, taking the pre-labeled crowd counting graph as a supervision signal, and performing iterative training on the initial deep neural network until a set metric meets a second preset convergence condition, to obtain the final deep neural network.
29. The apparatus of claim 28, wherein the scene of the sample video is pre-labeled with geometric perspective, the geometric perspective comprising a correspondence between the number of pixels at different positions in the sample video and a real physical size of the scene; pedestrian information of each pedestrian is calibrated in advance in the plurality of original frame images, the pedestrian information comprises a pedestrian position and a pedestrian ID, and the pedestrian ID uniquely identifies one pedestrian;
the initial convolutional neural network is configured to take the later frame image in the current training sample as the current frame image, generate a crowd density map of the current frame image according to the pedestrian information calibrated for each frame image and the geometric perspective, and generate a crowd speed map of the current frame image according to the pedestrian information in the two frame images of the current training sample and the geometric perspective.
30. The apparatus according to claim 29, wherein, when generating the crowd density map of the current frame image, the initial convolutional neural network is specifically configured to acquire the crowd density values of the positions in the current frame image according to the pedestrian information in the current frame image and the geometric perspective; and generate the crowd density map of the current frame image according to the crowd density values of the positions in the current frame image.
31. The apparatus according to claim 29 or 30, wherein the initial convolutional neural network, when generating the crowd velocity map of the current frame image, is specifically configured to:
acquiring the moving speed of each pedestrian in the current frame image according to the position difference of each pedestrian in the current frame image in the current training sample in the previous frame image and the current frame image and the corresponding time difference of the previous frame image and the current frame image;
acquiring the crowd speed of each position in the current frame image according to the moving speed and the position of each pedestrian in the current frame image;
and generating a crowd speed map of the current frame image according to the crowd speed of each position in the current frame image and the geometric perspective.
32. A data processing apparatus, comprising: the cross-line counting device of any one of claims 19 to 26; or the deep neural network training device of any one of claims 27 to 31.
33. The apparatus of claim 32, wherein the data processing apparatus comprises an advanced RISC machine (ARM) processor, a central processing unit (CPU), or a graphics processing unit (GPU).
34. An electronic device, provided with the data processing apparatus of claim 32 or 33.
CN201610867834.1A 2016-09-29 2016-09-29 Cross-line counting method, deep neural network training method, device and electronic equipment Active CN106407946B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610867834.1A CN106407946B (en) 2016-09-29 2016-09-29 Cross-line counting method, deep neural network training method, device and electronic equipment
PCT/CN2017/103530 WO2018059408A1 (en) 2016-09-29 2017-09-26 Cross-line counting method, and neural network training method and apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610867834.1A CN106407946B (en) 2016-09-29 2016-09-29 Cross-line counting method, deep neural network training method, device and electronic equipment

Publications (2)

Publication Number Publication Date
CN106407946A CN106407946A (en) 2017-02-15
CN106407946B true CN106407946B (en) 2020-03-03

Family

ID=59228726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610867834.1A Active CN106407946B (en) 2016-09-29 2016-09-29 Cross-line counting method, deep neural network training method, device and electronic equipment

Country Status (2)

Country Link
CN (1) CN106407946B (en)
WO (1) WO2018059408A1 (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407946B (en) * 2016-09-29 2020-03-03 北京市商汤科技开发有限公司 Cross-line counting method, deep neural network training method, device and electronic equipment
CN108052984B (en) * 2017-10-30 2019-11-08 中国科学院计算技术研究所 Method of counting and device
CN109472291A (en) * 2018-10-11 2019-03-15 浙江工业大学 A kind of demographics classification method based on DNN algorithm
CN109615140B (en) * 2018-12-14 2024-01-09 中国科学技术大学 Method and device for predicting pedestrian movement
CN109726658B (en) * 2018-12-21 2022-10-04 上海科技大学 Crowd counting and positioning method and system, electronic terminal and storage medium
JP6703679B1 (en) * 2019-02-01 2020-06-03 株式会社計数技研 Counting device, learning device manufacturing device, counting method, learning device manufacturing method, and program
CN109948500B (en) * 2019-03-13 2022-12-27 西安科技大学 Method for accurately monitoring personnel entering and exiting of coal mine
CN110135325B (en) * 2019-05-10 2020-12-08 山东大学 Method and system for counting people of crowd based on scale adaptive network
CN110263643B (en) * 2019-05-20 2023-05-16 上海兑观信息科技技术有限公司 Quick video crowd counting method based on time sequence relation
CN110458114B (en) * 2019-08-13 2022-02-01 杜波 Method and device for determining number of people and storage medium
JP7383435B2 (en) 2019-09-17 2023-11-20 キヤノン株式会社 Image processing device, image processing method, and program
CN110674729A (en) * 2019-09-20 2020-01-10 澳门理工学院 Method for identifying number of people based on heat energy estimation, computer device and computer readable storage medium
CN110991225A (en) * 2019-10-22 2020-04-10 同济大学 Crowd counting and density estimation method and device based on multi-column convolutional neural network
CN110866453B (en) * 2019-10-22 2023-05-02 同济大学 Real-time crowd steady state identification method and device based on convolutional neural network
CN110941999B (en) * 2019-11-12 2023-02-17 通号通信信息集团有限公司 Method for adaptively calculating size of Gaussian kernel in crowd counting system
CN110909648B (en) * 2019-11-15 2023-08-25 华东师范大学 People flow monitoring method implemented on edge computing equipment by using neural network
CN111062275A (en) * 2019-12-02 2020-04-24 汇纳科技股份有限公司 Multi-level supervision crowd counting method, device, medium and electronic equipment
CN111178276B (en) * 2019-12-30 2024-04-02 上海商汤智能科技有限公司 Image processing method, image processing apparatus, and computer-readable storage medium
CN111428551B (en) * 2019-12-30 2023-06-16 杭州海康威视数字技术股份有限公司 Density detection method, density detection model training method and device
CN113378608B (en) * 2020-03-10 2024-04-19 顺丰科技有限公司 Crowd counting method, device, equipment and storage medium
CN112232257B (en) * 2020-10-26 2023-08-11 青岛海信网络科技股份有限公司 Traffic abnormality determination method, device, equipment and medium
CN112333431B (en) * 2020-10-30 2022-06-07 深圳市商汤科技有限公司 Scene monitoring method and device, electronic equipment and storage medium
CN112364788B (en) * 2020-11-13 2021-08-03 润联软件系统(深圳)有限公司 Monitoring video crowd quantity monitoring method based on deep learning and related components thereof
CN113297983A (en) * 2021-05-27 2021-08-24 上海商汤智能科技有限公司 Crowd positioning method and device, electronic equipment and storage medium
CN113239882B (en) * 2021-06-03 2022-06-03 成都鼎安华智慧物联网股份有限公司 Deep learning-based personnel counting method and system
CN113807274B (en) * 2021-09-23 2023-07-04 山东建筑大学 Crowd counting method and system based on image anti-perspective transformation

Citations (2)

Publication number Priority date Publication date Assignee Title
CN103778442A (en) * 2014-02-26 2014-05-07 哈尔滨工业大学深圳研究生院 Central air-conditioner control method based on video people counting statistic analysis
CN102148959B (en) * 2010-02-09 2016-01-20 北京中星微电子有限公司 The moving target detecting method of a kind of video monitoring system and image thereof

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US6944319B1 (en) * 1999-09-13 2005-09-13 Microsoft Corporation Pose-invariant face recognition system and process
CN102542289B (en) * 2011-12-16 2014-06-04 重庆邮电大学 Pedestrian volume statistical method based on plurality of Gaussian counting models
CN105160313A (en) * 2014-09-15 2015-12-16 中国科学院重庆绿色智能技术研究院 Method and apparatus for crowd behavior analysis in video monitoring
CN105590094B (en) * 2015-12-11 2019-03-01 小米科技有限责任公司 Method and device for determining the number of human bodies
CN105740894B (en) * 2016-01-28 2020-05-29 北京航空航天大学 Semantic annotation method for hyperspectral remote sensing image
CN106407946B (en) * 2016-09-29 2020-03-03 北京市商汤科技开发有限公司 Cross-line counting method, deep neural network training method, device and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102148959B (en) * 2010-02-09 2016-01-20 北京中星微电子有限公司 Moving target detection method for a video monitoring system and its images
CN103778442A (en) * 2014-02-26 2014-05-07 哈尔滨工业大学深圳研究生院 Central air-conditioner control method based on video people counting statistic analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Crossing-line Crowd Counting with Two-phase Deep Neural Networks";Zhuoyi Zhao 等;《European Conference on Computer Vision》;20160917;摘要、第3.1-3.3、4.2-4.3节,图1(c)、图2-4、6 *

Also Published As

Publication number Publication date
CN106407946A (en) 2017-02-15
WO2018059408A1 (en) 2018-04-05

Similar Documents

Publication Publication Date Title
CN106407946B (en) Cross-line counting method, deep neural network training method, device and electronic equipment
CN111539273B (en) Traffic video background modeling method and system
Zhao et al. Crossing-line crowd counting with two-phase deep neural networks
Seer et al. Kinects and human kinetics: A new approach for studying pedestrian behavior
US8582816B2 (en) Method and apparatus for video analytics based object counting
US20190138798A1 (en) Time domain action detecting methods and system, electronic devices, and computer storage medium
US20170161591A1 (en) System and method for deep-learning based object tracking
US10009579B2 (en) Method and system for counting people using depth sensor
CN107025658A (en) Method and system for detecting moving objects using a single camera
US9514363B2 (en) Eye gaze driven spatio-temporal action localization
JP7292492B2 (en) Object tracking method and device, storage medium and computer program
CN105809178A (en) Crowd analysis method and device based on facial attributes
Bour et al. Crowd behavior analysis from fixed and moving cameras
WO2009039350A1 (en) System and method for estimating characteristics of persons or things
Himeur et al. Deep visual social distancing monitoring to combat COVID-19: A comprehensive survey
CN109902550A (en) Pedestrian attribute recognition method and device
US9947107B2 (en) Method and system for tracking objects between cameras
KR101529620B1 (en) Method and apparatus for counting pedestrians by moving directions
US11348338B2 (en) Methods and systems for crowd motion summarization via tracklet based human localization
KR101467307B1 (en) Method and apparatus for counting pedestrians using artificial neural network model
Suganyadevi et al. OFGM-SMED: An efficient and robust foreground object detection in compressed video sequences
Lee et al. An intelligent image-based customer analysis service
CN101685538B (en) Method and device for tracking object
CN110414471B (en) Video identification method and system based on double models
KR101467360B1 (en) Method and apparatus for counting pedestrians by moving directions

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant