CN111291646A - People flow statistical method, device, equipment and storage medium


Info

Publication number
CN111291646A
Authority
CN
China
Prior art keywords
head
video frame
boundary
frame
image
Prior art date
Legal status
Pending
Application number
CN202010068164.3A
Other languages
Chinese (zh)
Inventor
蔡晓聪
侯军
伊帅
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202010068164.3A
Publication of CN111291646A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Abstract

The embodiments of the present application provide a people flow statistical method, device, equipment and storage medium. The method comprises the following steps: acquiring a video frame sequence of a target area within a preset time length; identifying the video frame sequence to obtain the position, within each video frame, of every head image appearing in the frames of the sequence; performing target tracking on each head image in each video frame of the sequence to obtain a target tracking result; and counting, according to the target tracking result, the pedestrian volume moving from a first side of the people flow statistics boundary to a second side of the boundary. By implementing the embodiments of the present application, every person in the video frame sequence can be tracked continuously and effectively, the people flow can be counted accurately, and the user experience is improved.

Description

People flow statistical method, device, equipment and storage medium
Technical Field
The present application relates to the field of image processing, and in particular to a people flow statistical method, device, equipment and storage medium.
Background
In venues with high-density crowds, accidents caused by overcrowding occur frequently. To prevent such accidents, dynamic information about the crowd needs to be acquired in real time so that corresponding measures can be taken promptly. For example, a scenic-spot manager acquires the visitor flow of the scenic spot in real time and controls the number of people admitted according to that real-time data.
Existing people flow statistics schemes are generally designed for sparse crowd scenes. Most of them identify and track pedestrians based on whole-body features (such as clothing color, hair color and body shape) and count the flow from the tracking result. In a high-density crowd scene, however, the crowd is packed, people occlude one another severely and body features cannot be recognized reliably, so the tracking effect is poor and the people flow statistics are inaccurate.
Disclosure of Invention
The embodiment of the application provides a people flow rate statistical method, a device, equipment and a storage medium, which can continuously and effectively track people, so that the people flow rate of a target area is accurately counted.
In a first aspect, an embodiment of the present application provides a people flow statistical method, where the method includes: acquiring a video frame sequence of a target area within a preset time length; identifying the video frame sequence to obtain the position, within each video frame, of every head image appearing in the frames of the sequence; performing target tracking on each head image in each video frame of the sequence to obtain a target tracking result; and counting, according to the target tracking result, the pedestrian volume moving from a first side of the people flow statistics boundary to a second side of the boundary.
It can be seen that, in the embodiments of the present application, a video frame sequence of the target area within a preset duration is first obtained; the sequence is then identified to obtain the position of each head image within each video frame; the head image of every person is then tracked to obtain a tracking result; and finally, according to the tracking result, the pedestrian volume moving from the first side of the statistics boundary to the second side is counted, giving the people flow of the target area within the preset duration. Because the positions of the head images of all persons in all video frames are obtained through identification and every head image is tracked, the poor tracking and inaccurate statistics caused by people occluding one another are avoided. The embodiments of the present application are therefore particularly suitable for people flow statistics in high-density crowds: the head images of all people in the video frame sequence can be tracked continuously and effectively, the people flow of the target area within the preset duration can be counted accurately, and the user experience is improved.
Based on the first aspect, in a possible implementation, identifying the video frame sequence to obtain the position of each head image within each video frame includes: inputting the video frame sequence into a neural network model for identification to obtain, for each video frame, the coordinate values of the central pixel point of every head image.
It can be understood that an image is composed of pixels, that the head image of each person occupies a certain number of pixel points in each video frame, and that each pixel point has its own coordinate value within the image or video frame. Therefore, by inputting the video frame sequence into the neural network for identification, the coordinate values of the central pixel point of every person's head image in each video frame are obtained; representing the position of a head image by the coordinate value of its central pixel point gives higher accuracy.
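The identification step can be illustrated with the following minimal sketch, which assumes a trained head-detection model is available as a callable; the name head_detector and the per-frame output format are assumptions made for illustration rather than details taken from the embodiments.

```python
# Minimal sketch of the identification step: apply a trained head-detection model
# (hypothetical callable `head_detector`) to every frame of the video frame
# sequence and collect the center-pixel coordinates of each detected head image.
from typing import Callable, List, Tuple
import numpy as np

Frame = np.ndarray                      # H x W x 3 video frame
HeadCenters = List[Tuple[int, int]]     # one (x, y) center pixel per detected head

def identify_head_centers(frames: List[Frame],
                          head_detector: Callable[[Frame], HeadCenters]) -> List[HeadCenters]:
    """Return, for every video frame, the center-pixel coordinates of all head images."""
    return [head_detector(frame) for frame in frames]
```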
Based on the first aspect, in a possible implementation, performing target tracking on each head image in each video frame of the sequence to obtain the target tracking result specifically includes: matching the head images across the video frames to obtain an identity identification (ID) number for each head image, where the same ID number in different video frames indicates the head image of the same person; constructing a head box according to the coordinate values of the central pixel point of each head image in each video frame, where the head box indicates the region occupied by a person's head image within a video frame; and tracking each head box according to the ID number of each head image to obtain the tracking result.
It can be seen that, by matching the head images across different video frames, an ID number is obtained for each head image, where the same ID number indicates the head image of the same person in different video frames. The head box is then used to indicate the region of a person's head image, so that every head box corresponds to one ID number. By tracking each head box, a tracking result is obtained; the tracking result may be the position information of each person's head image in each video frame (the same ID number corresponding to the same person), the movement route of each person, and so on.
In this method, the region of a person's head image within a video frame is indicated by the head box, and each head box is then tracked according to the ID number of its head image. By implementing this method, the unsatisfactory tracking caused by people occluding one another is avoided and the statistical accuracy is improved.
Based on the first aspect, in a possible implementation, constructing the head box according to the coordinate values of the central pixel point of each head image in each video frame includes: constructing the head box centered on the coordinate value of the central pixel point of each head image in each video frame.
It can be understood that, because the head box is built around the coordinate value of the central pixel point of the person's head image in the video frame, the head box represents the person's head more accurately, so that each person's head image can be tracked more accurately.
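As a non-authoritative illustration, the construction can be sketched as below; the fixed square side length is an assumption for illustration, since the embodiments only state that the box is centered on the head image's central pixel point.

```python
# Minimal sketch: build an axis-aligned head box (x1, y1, x2, y2) centered on the
# center pixel (cx, cy); the side length is an assumed, illustrative value.
def make_head_box(cx: int, cy: int, side: int = 40) -> tuple:
    half = side // 2
    return (cx - half, cy - half, cx + half, cy + half)
```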
Based on the first aspect, in a possible implementation, counting, according to the target tracking result, the pedestrian volume moving from the first side of the people flow statistics boundary to the second side of the boundary includes: determining, according to the target tracking result, whether the head boxes with the same ID number in the different video frames move from the first side of the boundary to the second side of the boundary; and counting the number of head boxes moving from the first side of the boundary to the second side of the boundary within the preset duration to obtain the people flow statistics result.
It can be seen that the tracking result determines the positions, across the video frame sequence, of the head boxes corresponding to the same ID number (the same person). The positions of each person's head boxes are then compared with the position of the boundary to determine whether that person moved from the first side to the second side of the boundary, that is, whether that person crossed the boundary. Counting the number of persons who crossed the boundary gives the pedestrian volume of the target area within the preset duration. The terms first side and second side are used here only to distinguish the two sides of the boundary and do not describe a particular order.
Based on the first aspect, in a possible implementation, determining, according to the target tracking result, whether the head boxes with the same ID number in the different video frames move from the first side of the boundary to the second side of the boundary includes: determining, according to the target tracking result, the center point positions of the head boxes with the same ID number in the different video frames; and determining, according to the positional relation between those center point positions and the boundary, whether the head boxes with the same ID number move from the first side of the boundary to the second side of the boundary.
It can be understood that, when determining whether a person crosses the boundary, the positional relation between that person's head boxes and the boundary is compared, and this can be done using the center points of the head boxes. If the center points of the head boxes corresponding to the same person appear on both the first side and the second side of the boundary, it is determined that the person moved from the first side to the second side; if those center points appear only on the first side or only on the second side, it is determined that the person did not move from the first side to the second side.
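A minimal sketch of this center-point judgement is given below, assuming the boundary is a straight line through two points a and b (the embodiments also allow curves, polygons and other shapes):

```python
# Minimal sketch of the center-point judgement against a straight boundary line a->b.
def side_of_line(p, a, b) -> int:
    """Return +1 / -1 / 0 for the side of line a->b on which point p lies."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return (cross > 0) - (cross < 0)

def crossed_boundary(center_points, a, b) -> bool:
    """True if the head-box centers of one ID number appear on both sides of the line."""
    sides = {side_of_line(p, a, b) for p in center_points}
    return 1 in sides and -1 in sides
```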
Based on the first aspect, in a possible implementation, determining, according to the target tracking result, whether the head boxes with the same ID number in the different video frames move from the first side of the boundary to the second side of the boundary includes: determining, according to the target tracking result, the border positions of the head boxes with the same ID number in the different video frames; and determining, according to the positional relation between those box borders and the boundary, whether the head boxes with the same ID number move from the first side of the boundary to the second side of the boundary.
It can be understood that, when determining whether a person crosses the boundary, the positional relation between that person's head boxes and the boundary is compared, and this can also be done using the entire border of each head box. If the borders of the head boxes corresponding to the same person lie on both the first side and the second side of the boundary, it is determined that the person moved from the first side to the second side; if those borders lie only on the first side or only on the second side, it is determined that the person did not move from the first side to the second side.
In a second aspect, an embodiment of the present application provides a people flow rate statistics apparatus, including:
the acquisition module is used for acquiring a video frame sequence of a target area within a preset time length;
the identification module is used for identifying the video frame sequence to obtain the position, within each video frame, of every head image appearing in the frames of the sequence;
the tracking module is used for performing target tracking on each head image in each video frame of the sequence to obtain a target tracking result;
and the counting module is used for counting, according to the target tracking result, the pedestrian volume moving from a first side of the people flow statistics boundary to a second side of the boundary.
Based on the second aspect, in a possible implementation, the identification module is specifically configured to: input the video frame sequence into a neural network model for identification to obtain, for each video frame, the coordinate values of the central pixel point of every head image.
Based on the second aspect, in a possible implementation, the tracking module is specifically configured to: match the head images across the video frames to obtain an ID number for each head image, where the same ID number in different video frames indicates the head image of the same person; construct a head box according to the coordinate values of the central pixel point of each head image in each video frame, where the head box indicates the region occupied by a person's head image within a video frame; and track each head box according to the ID number of each head image to obtain the tracking result.
Based on the second aspect, in a possible implementation, constructing the head box according to the coordinate values of the central pixel point of each head image in each video frame includes: constructing the head box centered on the coordinate value of the central pixel point of each head image in each video frame.
Based on the second aspect, in a possible implementation, the statistics module is specifically configured to: determine, according to the target tracking result, whether the head boxes with the same ID number in the different video frames move from the first side of the boundary to the second side of the boundary; and count the number of head boxes moving from the first side of the boundary to the second side of the boundary within the preset duration to obtain the people flow statistics result.
Based on the second aspect, in a possible implementation, the statistics module is further configured to: determine, according to the target tracking result, the center point positions of the head boxes with the same ID number in the different video frames; and determine, according to the positional relation between those center point positions and the boundary, whether the head boxes with the same ID number move from the first side of the boundary to the second side of the boundary.
Based on the second aspect, in a possible implementation, the statistics module is further configured to: determine, according to the target tracking result, the border positions of the head boxes with the same ID number in the different video frames; and determine, according to the positional relation between those box borders and the boundary, whether the head boxes with the same ID number move from the first side of the boundary to the second side of the boundary.
Each functional module in the apparatus provided in the embodiment of the present application is specifically configured to implement the method described in the first aspect.
In a third aspect, an embodiment of the present application provides a people flow rate statistics apparatus, including a processor, a communication interface, and a memory; the memory is configured to store instructions, the processor is configured to execute the instructions, and the communication interface is configured to receive or transmit data; wherein the processor executes the instructions to perform the method as described in the first aspect or any specific implementation manner of the first aspect.
In a fourth aspect, embodiments of the present application provide a non-volatile storage medium for storing program instructions, which, when applied to a device for people flow statistics of a high-density population, can be used to implement the method described in the first aspect.
In a fifth aspect, the present application provides a computer program product comprising program instructions; when the computer program product is executed by a device for people flow statistics of high-density crowds, the device performs the method of the first aspect. The computer program product may be a software installation package which, when the method provided by any possible design of the first aspect is needed, can be downloaded and executed on such a device to carry out the method of the first aspect.
It can be seen that the present application provides a people flow statistical method that is particularly suitable for people flow statistics in high-density crowds. In the method, the acquired video frame sequence is first input into a neural network model to identify head images and obtain the coordinate values of the central pixel point of each person's head image in each video frame. A head box is then constructed around those coordinate values, the head box indicating the region of the person's head image within the video frame; by matching the head images of each person across different video frames, the head boxes corresponding to the same person are determined and marked with the same ID number. Finally, according to the positional relation between each person's head boxes and the boundary, the number of people crossing the boundary, that is, the people flow of the target area within the preset duration, is determined. In high-density crowd scenes the crowd is packed and the number of people in a video frame is large, so each person appears small; using a neural network model to identify the central pixel point of each person's head image allows the people in the video frame to be determined accurately and prevents statistical errors caused by missed or erroneous identification. Building head boxes and tracking each person through the head image in the region of the head box keeps the tracking continuous and effective, making the people flow statistics more accurate.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of another system architecture according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a neural network according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a people flow rate statistical method according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram of another people flow rate statistical method provided in the embodiment of the present application;
fig. 6 is a schematic view of an application scenario of pedestrian volume statistics according to an embodiment of the present application;
FIG. 7 is a diagram illustrating two video frames in a sequence of video frames according to an embodiment of the present application;
FIG. 8 is a schematic diagram of two video frames in a sequence of video frames including a center pixel point of a head image of each person according to an embodiment of the present application;
FIG. 9 is a schematic diagram of two video frames in a video frame sequence including a human head box of each person according to an embodiment of the present application;
FIG. 10 is a schematic diagram of two video frames in a sequence of video frames including a tracking ID according to an embodiment of the present application;
FIG. 11 is a diagram illustrating two video frames in a video frame sequence including a boundary of people flow statistics according to an embodiment of the present application;
fig. 12 is a schematic view of a pedestrian flow rate statistic apparatus according to an embodiment of the present disclosure;
fig. 13 is a schematic diagram of another human traffic statistic device provided in the embodiment of the present application;
fig. 14 is a schematic diagram of still another human traffic statistic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
It is to be understood that the terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only, and is not intended to be limiting of the application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is noted that, as used in this specification and the appended claims, the term "comprises" and any variations thereof are intended to cover non-exclusive inclusions. For example, a system, article, or apparatus that comprises a list of elements/components is not limited to only those elements/components but may alternatively include other elements/components not expressly listed or inherent to such system, article, or apparatus.
It is also understood that the term "if" may be interpreted as "when", "upon" or "in response to" determining "or" in response to detecting "or" in the case of … "depending on the context.
It should also be noted that the terms "first," "second," "third," "fourth," and the like in the description and in the claims, are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture provided in the embodiment of the present application. The system architecture shown in FIG. 1 may include a camera and a server, which can be connected directly or indirectly. The server may be an application server or a server in a cloud server cluster, and the camera may be a home camera, a commercial camera, a network camera, or even a monitoring camera in a monitoring room.
Wherein the camera is used for acquiring a sequence of video frames of the target region. The functional parameters of the different types of cameras are different and the positions at which the sequence of video frames is captured can be chosen according to the functional parameters of the particular camera so that the persons of the target area are all included within the picture of the camera. The server is used for processing the acquired video frame sequence and comprises the following steps: images in respective video frames of a sequence of video frames are identified, tracked, and the like.
In a specific embodiment, the camera is a network camera, and the network camera and the server establish a direct connection through a wireless network. The network camera collects a video frame sequence of the target area within a preset time length and transmits it to the server through the wireless network; the server performs a series of processing on the received sequence and finally counts the pedestrian volume of the target area within the preset time length, so that the people flow of the target area is monitored in real time and measures can be taken promptly according to it.
In yet another embodiment, the camera is a surveillance camera of a monitoring room, and the surveillance camera is not directly connected to the server. The surveillance camera first acquires a surveillance video of the target area; a video frame sequence within the preset time length is cut from the surveillance video and stored on a USB flash drive or other device; the USB flash drive or other device is then connected to the server and the video frame sequence is input into the server; and the server performs a series of processing on the sequence and finally counts the pedestrian volume of the target area within the preset time length.
As shown in fig. 2, the present embodiment provides yet another system architecture 100. Referring to fig. 2, data acquisition device 160 is configured to acquire training data, which in the embodiment of the present application includes a sequence of video frames containing head images of persons and tags including coordinate information of the head images of the respective persons in the respective video frames of the sequence of video frames. The data acquisition device 160 here may be a camera.
After the training data is collected, the data collection device 160 stores the training data in the database 130, and the training device 120 trains the recognition model 113 based on the training data maintained in the database 130.
The following describes how the training device 120 obtains the recognition model 113 based on the training data. The input data of the training device 120 include a video frame sequence containing head images of persons together with the tags. The training device 120 processes the input video frame sequence and compares the output coordinate information with the coordinate information in the tags; when the difference between the coordinate information output by the training device 120 and the coordinate information in the tags is smaller than a preset threshold, the output coordinate information is considered able to replace the coordinate information in the tags, and training of the recognition model 113 is complete.
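As a non-authoritative illustration, such a training loop could be sketched as follows in PyTorch; the model, data loader and regression loss are assumptions made for illustration and are not taken from the embodiments.

```python
# Minimal training-loop sketch: regress head-center coordinates against the tag
# (label) coordinates until the error falls below a preset threshold.
import torch
from torch import nn

def train(model: nn.Module, loader, epochs: int = 10, threshold: float = 1e-3) -> nn.Module:
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()                       # difference between output and tag coordinates
    for _ in range(epochs):
        for frames, tag_coords in loader:        # loader yields (frames, tag coordinates)
            pred_coords = model(frames)          # predicted head-center coordinates
            loss = loss_fn(pred_coords, tag_coords)
            optimizer.zero_grad()
            loss.backward()                      # back propagation updates the weights
            optimizer.step()
            if loss.item() < threshold:          # stop once the difference is small enough
                return model
    return model
```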
The recognition model 113 can be used to implement the people flow statistics method for high-density crowds provided in the embodiments of the present application. A video frame sequence to be processed is input into the recognition model 113, which outputs the coordinate information of the head images of the people in the sequence. The tracking module 114 matches the head images of people across the different video frames of the sequence and marks the head boxes in different video frames that correspond to the same person with the same tracking ID, thereby tracking the people in the sequence. The statistics module 115 counts the people flow according to the coordinate information of the head images output by the recognition model 113 and the tracking IDs of the head boxes output by the tracking module 114, and finally outputs the people flow statistics result to the user equipment 140 through the I/O interface 112. It should be noted that, in practical applications, the training data maintained in the database 130 need not all come from the data acquisition device 160 and may also be obtained from other devices. It should also be noted that the training device 120 does not necessarily train the recognition model 113 on the training data maintained by the database 130; it may obtain training data from other devices for model training. The training device 120 may exist separately from the execution device 110 or may be integrated within the execution device 110.
The recognition model 113 trained by the training device 120 may be applied to different systems or devices, for example the execution device 110 shown in fig. 2, which may be an application server or a server in a cloud server cluster. In fig. 2, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices. A user may input data to the I/O interface 112 through a user device 140; in the embodiments of the present application, the input data may include a video frame sequence containing head images of persons. The user device 140 may include a mobile phone, a tablet computer, a camera, a notebook computer, or other devices capable of taking pictures.
The tracking module 114 is configured to track a head image of each person in each video frame of the sequence of video frames according to an output of the recognition model 113, and the counting module 115 is configured to count the person flow according to output results of the recognition model 113 and the tracking module 114. In the embodiment of the present application, the calculation module 111 is used for processing input/output data.
In the process that the execution device 110 processes the input data, or in the process that the calculation module 111 of the execution device 110 performs the calculation and other related processes, the execution device 110 may call the data, the code and the like in the data storage system 150 for corresponding processes, or store the data, the instruction and the like obtained by corresponding processes in the data storage system 150.
It should be noted that the training device 120 may generate corresponding recognition models 113 for different targets or different tasks based on different training data, and the corresponding recognition models 113 may be used to achieve the targets or complete the tasks, so as to provide the user with the desired results.
The recognition model described in the embodiments of the present application is configured based on a Convolutional Neural Network (CNN), which is described below.
The convolutional neural network is a deep neural network with a convolutional structure, and may be a deep learning (deep learning) architecture, where the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions in an image input thereto.
The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter and the convolution process may be viewed as convolving an input image or convolved feature plane (feature map) with a trainable filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The underlying principle is: the statistics of a certain part of the image are the same as the other parts. Meaning that image information learned in one part can also be used in another part. We can use the same learned image information for all locations on the image. In the same convolution layer, a plurality of convolution kernels can be used to extract different image information, and generally, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation. The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
During training, the convolutional neural network may use the back propagation (BP) algorithm to correct the values of the parameters in the initial model, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, an error loss arises as the input signal is propagated forward until it is output, and the parameters in the initial model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward pass dominated by the error loss, and aims to obtain the optimal parameters of the model, such as its weight matrices.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a Convolutional Neural Network (CNN) 300 according to an embodiment of the present disclosure. As shown in fig. 3, the Convolutional Neural Network (CNN) 300 may include an input layer 310, a convolutional/pooling layer 320, and a neural network layer 330.
The input layer 310 may process multi-dimensional data, e.g., the input layer may acquire and process a sequence of video frames captured by different types of cameras; commonly, the input layer of a one-dimensional convolutional neural network receives a one-dimensional or two-dimensional array, where the one-dimensional array is typically a time or spectral sample; the two-dimensional array may include a plurality of channels; an input layer of the two-dimensional convolutional neural network receives a two-dimensional or three-dimensional array; the input layer of the three-dimensional convolutional neural network receives a four-dimensional array.
Because learning uses gradient descent, the input features of the convolutional neural network can be normalized. Specifically, before the learning data are input into the convolutional neural network, the input data are normalized along the channel or time/frequency dimension. Standardizing the input features helps improve the operating efficiency and learning performance of the algorithm.
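A minimal sketch of such per-channel normalization is given below; the mean and standard-deviation values are placeholder assumptions and would in practice be computed from the training data.

```python
# Minimal sketch of per-channel normalization of an H x W x 3 input frame.
import numpy as np

def normalize(frame: np.ndarray,
              mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)) -> np.ndarray:
    x = frame.astype(np.float32) / 255.0          # scale pixel values to [0, 1]
    return (x - np.asarray(mean)) / np.asarray(std)
```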
The convolutional/pooling layer 320 may include layers 321-326. In one implementation, layer 321 is a convolutional layer, 322 a pooling layer, 323 a convolutional layer, 324 a pooling layer, 325 a convolutional layer, and 326 a pooling layer; in another implementation, 321 and 322 are convolutional layers, 323 is a pooling layer, 324 and 325 are convolutional layers, and 326 is a pooling layer. That is, the output of a convolutional layer may serve as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
Taking convolutional layer 321 as an example, it may include many convolution operators, also called convolution kernels, which act in image processing as filters that extract specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually slid over the input image pixel by pixel (or two pixels by two pixels, and so on, depending on the value of the stride) in the horizontal direction, so as to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image; note that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends over the entire depth of the input image. Convolving with a single weight matrix therefore produces a convolved output with a single depth dimension, but in most cases a plurality of weight matrices of the same dimensions are applied instead of a single one. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix extracts image edge information, another extracts a specific color of the image, and yet another blurs unwanted noise points. The weight matrices have the same dimensions, the feature maps they extract also have the same dimensions, and the extracted feature maps of the same dimensions are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical applications, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 300 make correct predictions. In the present application, training uses a video frame sequence containing human head images together with labels, where the labels comprise the coordinate information of the central pixel point of each person's head image; the convolutional neural network model then outputs the coordinate information of the central pixel point of each person's head image in each video frame of the sequence.
It should be noted that 321-326 layers are merely examples, and more convolution layers and/or more pooling layers may be provided. When convolutional neural network 300 has multiple convolutional layers, the initial convolutional layer (e.g., 321) tends to extract more general features, which may also be referred to as low-level features; as the depth of convolutional neural network 300 increases, the features extracted by the convolutional layers (e.g., 326) further back become more complex, such as features of high-level semantics. The embodiment of the application utilizes the characteristics of different scales to assist in solving the related technical problems.
Since it is often desirable to reduce the number of training parameters, pooling layers are usually introduced periodically after convolutional layers. In layers 321-326 illustrated by 320 in fig. 3, this may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. During image processing, a pooling layer is used to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to a smaller size. The average pooling operator computes the average of the pixel values within a particular range; the max pooling operator takes the pixel with the largest value within a particular range as the result. In addition, just as the size of the weight matrix used in a convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The image output by the pooling layer may be smaller than the image input to it, and each pixel point in the output image represents the average or maximum value of a corresponding sub-region of the input image.
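A minimal sketch of 2x2 max and average pooling on a single-channel feature map is shown below (an even height and width are assumed for simplicity):

```python
# Minimal sketch of 2x2 pooling: group the feature map into 2x2 blocks and take
# the maximum or the average of each block.
import numpy as np

def pool2x2(feature_map: np.ndarray, mode: str = "max") -> np.ndarray:
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))
```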
Generally speaking, the convolution kernels in the convolutional layers contain weight coefficients (weight matrix), while the pooling layers do not contain weight coefficients, so in some scenarios, the pooling layers may also not be considered as independent layers.
After processing by the convolutional/pooling layer 320, the convolutional neural network 300 is not yet able to output the required information, because, as described above, the convolutional/pooling layer 320 only extracts features and reduces the number of parameters brought by the input image. To generate the final output, the convolutional neural network 300 needs the neural network layer 330 to generate one output or a set of outputs of the required number of classes. The neural network layer 330 may therefore include a plurality of hidden layers (331, 332 to 33n shown in fig. 3) and an output layer 340; the parameters of the hidden layers may be obtained by pre-training on the training data of a specific task type. In the present application, for example, the task is to recognize the head image of each person in the video frame sequence and obtain the coordinate information of each head image in the video frame, and the task type may include image recognition and the like.
The hidden layers in a convolutional neural network include, for example, fully-connected (FC) layers, which typically pass signals only to other fully-connected layers. The feature map loses its 3-dimensional structure in the fully-connected layer: it is flattened into a vector and passed to the next layer through the activation function. In some possible convolutional neural networks, the function of the fully-connected layer may be partially replaced by global average pooling, which averages all the values of each channel of the feature map.
After the hidden layers in the neural network layer 330, the last layer of the whole convolutional neural network 300 is the output layer 340. The output layer 340 has a loss function similar to the categorical cross-entropy and is used to calculate the prediction error. Once the forward propagation of the whole convolutional neural network 300 (the propagation from 310 to 340 in fig. 3) is completed, the backward propagation (from 340 to 310 in fig. 3) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 300, that is, the error between the result output by the network through the output layer and the ideal result.
The output layer 340 may output the category labels using a logistic function or a normalized exponential function (softmax function). For example, in the present application, the head image of each person is recognized to obtain the coordinate information of the head image of each person in the video frame, so the output layer may be designed to output the coordinate values of all the pixel points in the region where the head image of each person is located, or the output layer may be designed to output the coordinate values of the central pixel point of the head image of each person.
It should be noted that the convolutional neural network 300 shown in fig. 3 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the neural network layer 330 for processing.
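As a non-authoritative illustration, a network laid out like CNN 300 in fig. 3 could be sketched in PyTorch as follows; all layer sizes and the 160 x 160 input resolution are illustrative assumptions rather than values taken from the embodiments.

```python
# Minimal sketch of a CNN with an input, alternating convolution/pooling layers
# (321-326), hidden fully-connected layers (331..33n) and an output layer (340).
import torch
from torch import nn

class SimpleCNN(nn.Module):
    def __init__(self, num_outputs: int = 2):        # e.g. one (x, y) coordinate pair
        super().__init__()
        self.features = nn.Sequential(                # convolutional/pooling layer 320
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 321, 322
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 323, 324
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 325, 326
        )
        self.head = nn.Sequential(                    # neural network layer 330
            nn.Flatten(),
            nn.Linear(64 * 20 * 20, 128), nn.ReLU(),  # hidden layers
            nn.Linear(128, num_outputs),              # output layer 340
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: N x 3 x 160 x 160
        return self.head(self.features(x))
```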
Referring to fig. 4, based on the above system architecture, the present application provides a people flow rate statistical method, which is applied to a server, and includes, but is not limited to, the following steps:
S101, obtaining a video frame sequence of a target area within a preset time length.
In one embodiment, a video camera may be used to capture the video frame sequence of the target area within the preset time length. For example, if the pedestrian volume on a pedestrian crossing is to be counted, a camera can be set up at a suitable place near the crossing to shoot the crossing; alternatively, the camera may photograph the crosswalk over any period of time, and the video frame sequence within the preset time length is then cut from the captured video.
In yet another embodiment, the sequence of video frames of the target area within a preset time period may be obtained from the monitoring device. For example, if the pedestrian traffic condition on the pedestrian crossing is still to be counted, the video frame sequence within the preset time duration may be intercepted from the monitoring device where the pedestrian crossing is located, and then copied to be used for counting and analyzing the pedestrian traffic condition within the preset time duration on the pedestrian crossing.
It should be noted that, during acquisition of the video frame sequence of the target area, the same camera, monitoring device or other photographing device is used throughout, and the shooting position and shooting angle are fixed, so the video frames in the obtained sequence all have the same size, while the positions and number of the people in the video frames change.
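For illustration only, acquiring such a sequence from a fixed camera could be sketched with OpenCV as below; the video source and the frame limit are assumptions.

```python
# Minimal sketch: read a fixed number of frames from a fixed camera or video file.
import cv2

def capture_frames(source=0, max_frames: int = 500):
    cap = cv2.VideoCapture(source)       # fixed device, fixed position and angle
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames
```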
S102, identifying the video frame sequence to obtain the position of each head image in each video frame.
The video frame sequence is identified to obtain the position information of each person's head image in each video frame. The position information may be coordinate information: the coordinate value of a fixed point of the head image, for example the coordinate value of the central pixel point of the head image, or the coordinate values of all the pixel points of the region where the head image is located (including the coordinate value of the central pixel point together with the coordinate values of a number of pixel points in each of the up, down, left and right directions). In the latter case, the coordinate values of all those pixel points together represent the position of the person's head image in the video frame.
In the present application, the recognition result may be coordinate information of a head image of each person in each video frame, and in a specific implementation, the recognition result may also be output in the form of a video frame sequence, that is: and recognizing the video frame sequence to obtain the video frame sequence comprising an identification symbol, wherein the identification symbol is used for identifying the position of the head image of each person in the video frame. The identification symbol may be a fixed point or may be a fixed shape, such as a rectangle, square, circle, diamond, etc.
S103, performing target tracking on each head image in each video frame in the video frame sequence to obtain a target tracking result.
The head images of each person in a sequence of video frames are tracked in order to determine the position of each person's head image in different video frames. Firstly, matching the head images of people in different video frames to determine the head images of different video frames corresponding to each person, then representing the head images of different video frames corresponding to the same person by using the same identifier, and representing the head images of different people by using different identifiers, namely completing the tracking process of the head images of all people in the video frame sequence.
When matching the head images of people across different video frames to determine which head images correspond to the same person, a conventional feature-extraction approach may be adopted, for example one based on features of the head images: extract features from each head image, match the features of the head images across the different video frames, and determine the positions of the head images corresponding to the same person. Other algorithms may also be used, such as a bipartite-graph matching algorithm combined with any tracking algorithm, as sketched below.
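The sketch below shows one such option: bipartite matching of head centers between consecutive frames with the Hungarian algorithm, using Euclidean distance as the cost; the distance threshold and the ID-assignment policy are assumptions made for illustration.

```python
# Minimal sketch: match head centers of the previous frame to those of the current
# frame; unmatched detections receive a new ID number.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_ids(prev_centers, prev_ids, curr_centers, next_id, max_dist=50.0):
    ids = [-1] * len(curr_centers)
    if prev_centers and curr_centers:
        cost = np.linalg.norm(np.asarray(prev_centers)[:, None, :] -
                              np.asarray(curr_centers)[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)
        for r, c in zip(rows, cols):
            if cost[r, c] <= max_dist:           # accept only sufficiently close matches
                ids[c] = prev_ids[r]
    for i, v in enumerate(ids):                  # a new person gets a new ID number
        if v == -1:
            ids[i], next_id = next_id, next_id + 1
    return ids, next_id
```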
By tracking each person in the video frame sequence, the position of the head image of each person in different video frames can be determined, and therefore the change situation of the position of each person can be obtained. This change in position can be reflected by a change in the coordinate information of the head image of each person in different video frames, a change in the position of the head image of each person in different video frames, or the tracking effect map (movement route) of 404 in fig. 4.
And S104, counting the pedestrian volume moving from the first side of the boundary of the pedestrian volume statistics to the second side of the boundary according to the target tracking result.
In the embodiment of the present application, the boundary of the people flow statistics is preset. The boundary line may be a straight line, a curve, a line segment, a rectangle, a circle, an ellipse, a polygon, etc., and the shape of the boundary line is not particularly limited.
As described above, each set of head images marked with the same ID corresponds to one person, so it is only necessary to determine, from the coordinate information of the head images in each video frame corresponding to each ID, whether that person moved from the first side of the boundary to the second side. Each time a person crosses the boundary, the people flow count is increased by 1. Specifically, for one person: if coordinate information of that person's head image exists on both the first side and the second side of the boundary, the count is increased by 1; if it exists on the first side but not on the second side, the count is unchanged; and if it exists on the second side but not on the first side, the count is likewise unchanged. Finally, every person in the video frame sequence is traversed and the number of persons crossing the boundary is counted, giving the pedestrian volume of the target area within the preset time length.
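A minimal sketch of this traversal is given below, assuming for simplicity a vertical boundary at x = boundary_x, the center-point criterion, and a tracking result organized as a mapping from ID number to the list of that person's head-center coordinates:

```python
# Minimal sketch: count the persons whose head positions appear on both sides of a
# vertical boundary line at x = boundary_x.
def count_crossings(tracks: dict, boundary_x: float) -> int:
    flow = 0
    for _, centers in tracks.items():            # one entry per person (per ID number)
        on_first = any(x < boundary_x for x, _ in centers)
        on_second = any(x > boundary_x for x, _ in centers)
        if on_first and on_second:               # positions on both sides -> crossed
            flow += 1
    return flow
```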
In a specific embodiment, when determining whether the coordinate information of a head image is on the first side or the second side of the boundary, that is, whether a person has crossed the boundary, the determination may be made according to the positional relation between the coordinate value of the central pixel point of the head image and the boundary, or according to the positional relation between the boundary and the border formed by the coordinate values of all the pixel points of the region where the head image is located.
It should be noted that, in the present application, the first side and the second side are used only to distinguish the two regions separated by the boundary, for convenience of description, and not to describe a particular order. For example, if the people flow statistics boundary divides a store into an in-store area and an out-of-store area, then counting the traffic moving from the first side of the boundary to the second side covers both people moving from the in-store area to the out-of-store area and people moving from the out-of-store area to the in-store area. Of course, the present application may also count only the traffic moving from the in-store area to the out-of-store area, or only the traffic moving from the out-of-store area to the in-store area. For example, a movement route marked with a direction of movement can be obtained from the positions of each person's head images in the different video frames and the time order in which the frames were collected, and the traffic moving in one particular direction can then be counted from that directed movement route.
It can be seen that, when the technical solution of this embodiment is implemented, a video frame sequence of the target area within the preset duration is first acquired; the sequence is then identified to obtain the positions of the head images in the video frames; the head images of the same person are determined by matching the head images across different video frames; the head images are then tracked to obtain a tracking result that indicates the positions of each person's head image in the different video frames; and finally, according to the tracking result, it is determined whether each person moves from the first side of the boundary to the second side, and the traffic of the target area is counted. The embodiment therefore represents each person's position by the position of the head image and tracks the head image of each person, avoiding the poor tracking caused by mutual occlusion and achieving continuous and effective tracking of each person; the people crossing the statistics boundary are determined from the tracking result, so the pedestrian volume of the target area within the preset duration is counted accurately.
Referring to fig. 5, fig. 5 is a schematic diagram of another people flow rate statistical method provided by an embodiment of the present application, which is applied to a server, and includes, but is not limited to, the following steps:
S201, acquiring a video frame sequence of a target area within a preset duration.
For brevity, refer to the description of S101 for this step; it is not repeated here.
S202, inputting the video frame sequence into the neural network model to obtain coordinate values of central pixel points of the head images of the individuals.
The neural network model in the present application may be the convolutional neural network described above, or another network model. This step is briefly described below using a convolutional neural network model as an example. Before the video frame sequence is input into the convolutional neural network model, the model must be trained: the training samples are video frame sequences containing head images, the labels are the coordinate values of the central pixel point of each head image in each video frame, and the parameters of the network are adjusted iteratively until the output reaches the expected values, yielding the trained convolutional neural network model. The acquired video frame sequence is then input into the trained model, and the output result is the coordinate value of the central pixel point of each person's head image.
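Purely as an illustration of such a model, the sketch below uses a tiny fully convolutional network that predicts a per-pixel head-center heatmap whose peaks are read out as the coordinate values of the central pixel points. The heatmap formulation, the PyTorch framework, and all layer sizes are assumptions of this sketch; the application does not fix a particular network architecture.

```python
import torch
import torch.nn as nn

class HeadCenterNet(nn.Module):
    """Sketch of a head-center detector: video frame in, per-pixel heatmap out."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),                      # 1-channel head-center heatmap logits
        )

    def forward(self, x):                             # x: (N, 3, H, W) batch of frames
        return torch.sigmoid(self.layers(x))          # (N, 1, H, W), values in [0, 1]

def heatmap_to_centers(heatmap, thresh=0.5):
    """Read head-center coordinates (x, y) out of one (H, W) heatmap; a real
    implementation would add non-maximum suppression so that each head
    contributes exactly one center point."""
    ys, xs = torch.nonzero(heatmap > thresh, as_tuple=True)
    return list(zip(xs.tolist(), ys.tolist()))
```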
It should be noted that, if the video frames associated with the output result are small, each of them may be enlarged to facilitate the subsequent tracking of each person's head image. For example, if the video frames are 160 pixels by 160 pixels, each head image is small; since the head frame must be constructed around the central pixel point of the head image, a small frame makes it inconvenient to construct head frames and to track the head images. The video frame sequence can therefore be enlarged, for example by a factor of 8 into a sequence of 1280-pixel by 1280-pixel frames, so that the head images can be tracked accurately and the traffic of the target area counted accurately.
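A short sketch of this enlargement step (the factor of 8 follows the example above; OpenCV and the function name are assumptions of the sketch):

```python
import cv2  # assumed here only for the image resize

def enlarge_frame_and_centers(frame, centers, scale=8):
    """Enlarge a small output frame (e.g. 160x160 -> 1280x1280) and map the
    head-center coordinates into the enlarged frame by the same factor."""
    h, w = frame.shape[:2]
    big_frame = cv2.resize(frame, (w * scale, h * scale))
    big_centers = [(x * scale, y * scale) for (x, y) in centers]
    return big_frame, big_centers
```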
S203, a human head frame is constructed centering on the coordinate value of the central pixel point of the head image of each person.
Each head image in the video frame sequence corresponds to one central pixel point. For convenience of subsequent tracking and statistics, a head frame of fixed width and height is constructed centered on the coordinate value of the central pixel point of each person's head image. The head frame indicates the region of the person's head image in the video frame; from it the coordinate values of all pixel points of the head image in the video frame are obtained, and the position of the head frame is the position of the head image.
It should be noted that, in this application, the head frame may be the coordinate values of all pixel points of the head image, or a frame drawn in the video frame that marks the real position of the head image. Alternatively, the central pixel point of each person's head image obtained in S202 is taken as the real position point of that head image in the video frame, and a head frame of fixed length and width is constructed around it, giving a video frame sequence containing head frames in which each head frame marks the position of a head image. In the embodiment of the present application, the head frame may be a rectangle, a diamond, a square, a circle, a polygon, and so on; the shape of the head frame is not particularly limited here.
For example, a square frame can be obtained by extending a number of pixels upwards, downwards, leftwards and rightwards from the coordinate value of the central pixel point of the head image; this square frame is called the head frame and is used to mark the head image in the video frame, so that the head images in all video frames of the sequence are marked by head frames.
For another example, a circular frame, also called a head frame, may be generated by taking the coordinate value of the central pixel point of the head image as the center and a fixed pixel length as the radius, and used to mark the head image in the video frame; similarly, a rectangle, polygon or other shape of fixed pixel size may be generated around the central pixel point to mark the head images in all video frames of the sequence.
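The constructions just described can be sketched as follows; the half-width, the radius and the function names are illustrative values chosen for the sketch, not values fixed by the application.

```python
def square_head_frame(center, half_size=16):
    """Square head frame obtained by extending half_size pixels up, down, left and
    right from the head-image center pixel; returned as (x1, y1, x2, y2)."""
    cx, cy = center
    return (cx - half_size, cy - half_size, cx + half_size, cy + half_size)

def circular_head_frame(center, radius=16):
    """Circular head frame kept as (center, radius); a pixel (x, y) belongs to it
    when (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2."""
    return (center, radius)
```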
S204, tracking the head image of each person to obtain a tracking result.
By matching each person's head images across different video frames, the head images in different frames that correspond to the same person are determined, and all head images corresponding to the same person are marked with the same identity number (hereinafter, tracking ID). Each head image corresponds to one person's head frame, so each head frame corresponds to one tracking ID: the head frames of the same person share the same tracking ID, and the head frames of different persons have different tracking IDs. Thus every head frame in the video frame sequence carries a tracking ID, the head frames of the same person in the different video frames carry the same tracking ID, and the head frames of different persons carry different tracking IDs. By tracking the head frames associated with each tracking ID, the position information of each person's head image in the different video frames is obtained; this position information may be expressed as coordinate information or as a moving route.
In one embodiment, the head images of individuals in different video frames may be matched based on head image feature information, which includes feature information of each person's head, face, hair, and so on; matching can be implemented by extracting features from the head image information with one or more algorithms and then matching the features. First, the feature information of the head images in all video frames of the sequence is extracted, that is, the feature information of the regions marked by the head frames, such as eye features, nose features, mouth features, facial features and hair features. Then, across different video frames, the extracted feature information is matched: head images with the same feature information in different frames are determined and marked with one same tracking ID, and head images with different feature information are marked with different tracking IDs.
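One possible realization of this feature-based matching is sketched below; the cosine-distance cost, the bipartite (Hungarian) assignment from SciPy, and the assumption that feature vectors have already been extracted are choices of the sketch rather than requirements of the application.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_heads(prev_feats, curr_feats, max_cost=0.5):
    """Match head-image feature vectors between two frames.
    prev_feats, curr_feats: arrays of shape (n, d) and (m, d).
    Returns (i, j) index pairs meaning head i in the previous frame and head j in
    the current frame are the same person and should share one tracking ID."""
    a = prev_feats / np.linalg.norm(prev_feats, axis=1, keepdims=True)
    b = curr_feats / np.linalg.norm(curr_feats, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                       # cosine-distance cost matrix (n, m)
    rows, cols = linear_sum_assignment(cost)   # minimum-cost bipartite matching
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]
```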
In yet another embodiment, the head images of individuals in the video frame sequence may be tracked with a tracking algorithm, such as a Kalman filtering algorithm. First, the head frames representing the same person in each video frame are determined by a bipartite graph matching algorithm or another matching algorithm, and then tracking is performed with the Kalman filter. Specifically, the velocity, acceleration and similar quantities computed from the head information in historical video frames (the direction of the person's head, the position of the head in the frame, and so on) are combined with the head image information in the current video frame to update the region position of the head frame; the updated region position indicates the position of each person's head image in each video frame more accurately. Head frames representing the same person are marked with the same tracking ID, and head frames representing different persons are marked with different tracking IDs. Finally, by tracking the head frames with the same tracking ID, the head image of each person is tracked, and the positions of each person's head image in the different video frames, or each person's moving route, are obtained.
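A minimal constant-velocity Kalman filter over the head-frame center is sketched below as one concrete form of this step; the state layout and the noise values are assumptions of the sketch.

```python
import numpy as np

class HeadKalmanTracker:
    """Constant-velocity Kalman filter for one tracking ID.
    State s = [x, y, vx, vy]; one predict/update pair per video frame."""
    def __init__(self, x, y):
        self.s = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                       # state covariance
        self.F = np.array([[1., 0., 1., 0.],
                           [0., 1., 0., 1.],
                           [0., 0., 1., 0.],
                           [0., 0., 0., 1.]])           # constant-velocity motion model
        self.H = np.array([[1., 0., 0., 0.],
                           [0., 1., 0., 0.]])           # only the center (x, y) is observed
        self.Q = np.eye(4) * 0.01                       # process noise (assumed)
        self.R = np.eye(2) * 1.0                        # measurement noise (assumed)

    def predict(self):
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2]                               # predicted head-frame center

    def update(self, measured_center):
        z = np.asarray(measured_center, dtype=float)
        y = z - self.H @ self.s                         # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)        # Kalman gain
        self.s = self.s + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.s[:2]                               # corrected head-frame center
```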
S205, counting the pedestrian volume in the target area within the preset duration according to the tracking result.
Since the head frames of the same person carry the same tracking ID, whether the person corresponding to each tracking ID moves from the first side of the boundary to the second side of the boundary can be determined from the coordinate positions of that person's head frames in the video frames; it can therefore be determined whether each person crosses the boundary, and the number of persons crossing the boundary can be counted. The boundary is preset and may be a straight line, a curve, a line segment, a rectangle, a circle, an ellipse, a polygon, and so on; the shape of the boundary is not particularly limited.
In one embodiment, the center point positions of the head frames carrying the same tracking ID in different video frames are determined, the positional relationship between those center points and the boundary is determined, and the pedestrian volume is counted accordingly. For one person, the head frames in the video frames correspond to the same tracking ID: if the center points of the head frames with that tracking ID lie on both the first side and the second side of the boundary, the person corresponding to the tracking ID has crossed the boundary; if they lie only on the first side, or only on the second side, the person has not crossed the boundary. Finally, the number of persons crossing the boundary, that is, the pedestrian volume, is counted. In practice, this statistic is computed over the whole video frame sequence: all persons in the sequence are traversed to obtain the pedestrian volume of the target area within the preset duration.
In another embodiment, the boundary line positions of the head frames carrying the same tracking ID in different video frames are determined, the positional relationship between those boundary lines and the statistics boundary is determined, and the pedestrian volume is counted accordingly. For one person, the head frames in the video frames correspond to the same tracking ID: if the boundary lines of the head frames with that tracking ID lie on both the first side and the second side of the statistics boundary, the person corresponding to the tracking ID has crossed the boundary; if they lie only on the first side, or only on the second side, the person has not crossed the boundary. Finally, the number of persons crossing the boundary, that is, the pedestrian volume, is counted; as before, all persons in the whole video frame sequence are traversed to obtain the pedestrian volume of the target area within the preset duration.
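A sketch of this head-frame-boundary variant, assuming a rectangular head frame stored as (x1, y1, x2, y2) whose boundary is approximated by its four corners, and a straight statistics boundary through points a and b:

```python
def _side(p, a, b):
    """Sign of the 2-D cross product: the two non-zero signs are the two sides of the
    straight statistics boundary through points a and b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def head_frame_crosses(track_boxes, a, b):
    """track_boxes: the head frames (x1, y1, x2, y2) of one tracking ID across the
    video frames; the person is counted as crossing when frame corners fall on
    both sides of the statistics boundary."""
    sides = set()
    for (x1, y1, x2, y2) in track_boxes:
        for p in ((x1, y1), (x1, y2), (x2, y1), (x2, y2)):
            s = _side(p, a, b)
            if s != 0:
                sides.add(s > 0)
    return len(sides) == 2
```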
It should be noted that, in the embodiment of the present application, the boundary line of the human head frame includes all straight lines, curved lines, or line segments, etc. that constitute the whole human head frame, for example, the boundary line of the rectangular human head frame includes four line segments that constitute a rectangle, the boundary line of the circular human head frame includes the whole circumference, and the boundary line of the square human head frame includes four line segments that constitute a square, etc.
It should be noted that, in the present application, for convenience of description, the first side and the second side are used to distinguish different regions of the boundary line, and are not used to describe a specific order. For example, the boundary of traffic statistics divides the store into an in-store area and an out-of-store area, and when the traffic is counted moving from a first side of the boundary to a second side of the boundary of the traffic statistics, the counted population includes both people moving from the in-store area to the out-of-store area and people moving from the out-of-store area to the in-store area.
It can be seen that, in this embodiment, the video frame sequence of the target area within the preset duration is first acquired; the sequence is then input into the neural network model for recognition, yielding the coordinate values of the central pixel points of the head images in the different video frames; head frames indicating the regions of the head images in the video frames are constructed around those coordinate values; the head images in different video frames are matched to determine the head images corresponding to the same person, and the head images (head frames) of the same person are marked with the same tracking ID, so that tracking the tracking IDs tracks the head images of the persons and produces the tracking result; finally, the pedestrian volume in the target area within the preset duration is counted according to the tracking result. This embodiment is particularly suitable for high-density crowds; it overcomes the defects of the prior art, tracks the head images of all persons continuously and effectively, and accurately counts the pedestrian volume in the target area within the preset duration according to the tracking result.
Referring to fig. 6, for ease of understanding, fig. 6 is a schematic diagram illustrating a specific application scenario. The people flow statistics in this scenario mainly involves processes 401 to 405, which are described below with reference to fig. 6:
401 is the video frame sequence. It may be captured by a camera within the preset duration and sent to the server through a wireless network, and the server then acquires the sequence; alternatively, the camera captures video of the target area, the server obtains the video from the camera via a mobile device, and the video frame sequence within the preset duration is then obtained by clipping, and so on.
402 is the convolutional neural network. Before a video frame or image is processed with the convolutional neural network, the convolutional neural network model must be trained. The training samples are video frame sequences, captured by the camera, that contain people, and the labels are the coordinate values of the central pixel point of each person's head image in each video frame; based on these samples and labels, the convolutional neural network model is obtained through continuous training (in effect, training is the process of continuously adjusting the network parameters until the output reaches the expected result). The acquired video frame sequence can then be input into the convolutional neural network model to obtain the coordinate values, in the video frames, of the central pixel points of the head images of the people in the sequence.
403 is the video frame sequence including head frames. A head frame is constructed centered on the coordinate value of the central pixel point of each person's head image, and the position of the head frame represents the position of that person's head image. In practice, the head frame may be drawn as an explicit frame in the video frame or image, or expressed in the form of coordinates, that is, the coordinate value of the central pixel point of each person's head image as the center together with a certain number of coordinate values in the up, down, left and right directions (the coordinate values of the frame's position). The head frame may be rectangular, diamond-shaped, square, and so on; the shape of the head frame is not particularly limited.
404 is a multi-target tracking effect diagram. The head image at the position of each head frame is tracked in order to match each person across the different video frames of the sequence and determine each person's position at different times, that is, in different video frames. To illustrate the tracking effect, the changes in the positions of all persons are shown on one drawing: the trajectory of the line corresponding to each person's head in the diagram represents the change in that person's position.
405 is the people flow statistics. Whether each person crosses the boundary is determined according to the relationship between that person's positions in the video frame sequence and the position of the boundary, and the number of persons crossing the boundary, that is, the pedestrian volume of the target area within the preset duration, is counted. For details, refer to the description of the method embodiments above, which is not repeated here.
To make the embodiment of the present application more intuitive and clear, a people flow statistics method for a high-density crowd is illustrated below by way of example. In a specific implementation of the present application, the central pixel point of each person's head image, the head frame, and the boundary used for the people flow statistics are all expressed in the form of coordinate information; in this example they are shown directly on the video frame images for intuitive presentation, which does not limit the present application. Referring to fig. 7, (1) and (2) in fig. 7 are any two video frames in the acquired video frame sequence of the target area, and the method of this embodiment is explained below using these two video frames as an example.
First, the target area is photographed by a camera to obtain a video frame sequence of the target area, see (1) in fig. 7 and (2) in fig. 7. The video frame sequence is then input into the trained convolutional neural network model, and the coordinate values of the central pixel point of each person's head image are output, see (1) in fig. 8 and (2) in fig. 8; for easy visual understanding, the points in (1) and (2) of fig. 8 are the central pixel points representing the head images. Next, a rectangular head frame of fixed height and width is constructed centered on the coordinate value of the central pixel point of each person's head image; see (1) in fig. 9 and (2) in fig. 9, where, for intuitive understanding, the frames represent the head frames (the head frames may also be kept in the form of coordinate values). The head images of the individuals in the video frame sequence are then tracked with a Kalman filtering tracking algorithm, and a specific tracking ID is set for each head frame: the same tracking ID marks the head frames of the same person in different video frames, and the head frames of different persons are marked with different tracking IDs, see (1) in fig. 10 and (2) in fig. 10; each head frame there corresponds to one tracking ID, and the tracking ID of a given person's head frame in (1) of fig. 10 is the same as the tracking ID of that person's head frame in (2) of fig. 10. Finally, the positional relationship between the center point position of each person's head frames and a preset boundary is compared, and the pedestrian volume of the target area within the preset duration is counted, see (1) in fig. 11 and (2) in fig. 11; the lines in (1) and (2) of fig. 11 represent the preset boundary of the people flow statistics (it should be noted that "the first side" and "the second side" in (1) and (2) of fig. 11 are labeled only for convenience of describing the embodiment of the present application and do not limit the present application, and the boundary is not drawn in figs. 7, 8, 9 and 10). Judging the positional relationship between the center point positions of the head frames having the same tracking ID in (1) and (2) of fig. 11 and the boundary, it can be seen that the center points of the head frames corresponding to tracking ID1, tracking ID2 and tracking ID3 lie on both the first side and the second side of the boundary, so the persons corresponding to tracking ID1, tracking ID2 and tracking ID3 cross the boundary; similarly, the persons corresponding to tracking ID5 and tracking ID6 cross the boundary; the center point positions of the head frames corresponding to tracking ID4 lie only on the first side of the boundary, so the person corresponding to tracking ID4 does not cross the boundary. Finally, the number of persons crossing the boundary within the preset duration is counted, that is, the pedestrian volume of the target area within the preset duration is obtained.
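Expressed in terms of the counting rule above, the situation of fig. 11 can be reproduced with toy data: tracking IDs 1, 2, 3, 5 and 6 have head-frame center points on both sides of the boundary, ID4 only on the first side, so the counted pedestrian volume is 5. The coordinates below are invented purely for illustration; only the side pattern matters.

```python
# Assumed horizontal statistics boundary y = 100: first side y < 100, second side y > 100.
def crossed(center_points, y_line=100):
    sides = {y > y_line for (_, y) in center_points}
    return len(sides) == 2

track_centers = {                         # toy head-frame center points per tracking ID
    1: [(20, 80), (22, 120)],
    2: [(40, 90), (41, 130)],
    3: [(60, 70), (62, 110)],
    4: [(80, 60), (81, 90)],              # stays on the first side only
    5: [(95, 85), (96, 125)],
    6: [(120, 95), (118, 140)],
}

flow = sum(crossed(points) for points in track_centers.values())
print(flow)                               # -> 5: IDs 1, 2, 3, 5, 6 crossed, ID4 did not
```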
It can be seen that, in the embodiment of the present application, the video frame sequence of the target area is first acquired and then input into the neural network model to obtain the coordinate values of the central pixel points of the head images. To facilitate tracking of the head images in the sequence, a head frame is constructed for each person from those coordinate values, the head images in different video frames are matched, and the head frames corresponding to the same person are marked with the same tracking ID. The number of persons crossing the statistics boundary, that is, the pedestrian volume of the target area, is then determined from the positional relationship between the center point position of the head frames of each tracking ID and the boundary, or from the positional relationship between the boundary line position of those head frames and the boundary. In this way, the embodiment identifies each person's head by the central pixel point of the head image, so every person in the video frame sequence can be identified more accurately; a head frame is constructed around the central pixel point and the head image in the region of the head frame is tracked, ensuring continuous and effective tracking; and the number of people moving from the first side of the statistics boundary to the second side is determined from the positional relationship between the head frame and the boundary, so the pedestrian volume of the target area is counted accurately and the user experience is improved.
Referring to fig. 12, fig. 12 is a schematic diagram of a people flow rate statistics apparatus 70 provided in an embodiment of the present application, where the apparatus 70 is used for implementing statistics on people flow rate, and may include:
an obtaining module 701, configured to obtain a video frame sequence of a target region within a preset time duration;
an identifying module 702, configured to identify a video frame sequence to obtain a position of each head image in each video frame of the video frame sequence in each video frame;
a tracking module 703, configured to perform target tracking on each head image in each video frame in the video frame sequence to obtain a target tracking result;
and the counting module 704 is used for counting the pedestrian volume moving from the first side of the boundary of the pedestrian volume statistics to the second side of the boundary according to the target tracking result.
In a possible implementation, the identification module 702 is specifically configured to: and inputting the video frame sequence into a neural network model, and identifying to obtain the coordinate value of the central pixel point of each head image in each video frame in the video frame.
In a possible embodiment, the tracking module 703 is specifically configured to: matching each head image in each video frame to obtain an identity identification number of each head image; wherein the same ID number in different video frames is used for indicating the head image of the same person in different video frames; constructing a human head frame according to the coordinate values of the central pixel points of the head images in the video frames, wherein the human head frame is used for indicating the areas of the head images of the human in the video frames; and tracking each head frame according to the identification number of each head image to obtain a tracking result.
In a possible implementation manner, constructing a human head frame according to coordinate values of central pixel points of each head image in each video frame includes: and taking the coordinate value of the central pixel point of each head image in each video frame as a center to construct a human head frame.
In a possible implementation, the statistical module 704 is specifically configured to: determining whether the head box with the same identification number in the different video frames moves from a first side of the boundary to a second side of the boundary according to the target tracking result; and counting the number of the head frames moving from the first side of the boundary to the second side of the boundary within the preset duration to obtain a statistical result of the pedestrian volume.
In a possible embodiment, the statistics module 704 is further configured to: determining the central point position of the human head frame with the same identification number in the different video frames according to the target tracking result; and determining whether the head frame with the same identification number moves from the first side of the boundary to the second side of the boundary according to the position relation between the central point position and the boundary.
In a possible embodiment, the statistics module 704 is further configured to: determining the boundary line position of the head frame with the same identification number in the different video frames according to the target tracking result; and determining whether the head frame with the same identification number moves from the first side of the boundary to the second side of the boundary according to the positional relationship between the boundary line position of the head frame and the boundary.
The functional modules of the apparatus 70 are used to implement the method described in the embodiment of fig. 4 or fig. 5, and specific contents may refer to descriptions in relevant contents of the embodiment of fig. 4 or fig. 5, and for brevity of description, no further description is given here.
Referring to fig. 13, fig. 13 is a schematic diagram of a people flow rate statistics device according to an embodiment of the present application. The device may be implemented in an application server 800 and at least includes: a processor 810, a communication interface 820, and a memory 830, where the processor 810, the communication interface 820, and the memory 830 are connected by a bus 840.
the processor 810 is used to execute the obtaining module 701, the identifying module 702, the tracking module 703 and the counting module 704 in fig. 12 by calling the program code in the memory 830. In practical applications, processor 810 may include one or more general-purpose processors, wherein a general-purpose processor may be any type of device capable of Processing electronic instructions, including a Central Processing Unit (CPU), a microprocessor, a microcontroller, a main processor, a controller, and an ASIC (Application Specific Integrated Circuit), among others. The processor 810 reads the program code stored in the memory 830 and cooperates with the communication interface 820 to perform some or all of the steps of the method performed by the apparatus 70 for people traffic statistics of high density populations of the above-described embodiments of the present application.
The communication interface 820 may be a wired interface (e.g., an ethernet interface) for communicating with other computing nodes or devices. When communication interface 820 is a wired interface, communication interface 820 may employ a Protocol family over TCP/IP, such as RAAS Protocol, Remote Function Call (RFC) Protocol, Simple Object Access Protocol (SOAP) Protocol, Simple Network Management Protocol (SNMP) Protocol, Common Object Request Broker Architecture (CORBA) Protocol, and distributed Protocol, among others.
Memory 830 may store program codes as well as program data. The program code includes code of the acquiring module 701, code of the identifying module 702, code of the tracking module 703, and code of the counting module 704. The program data includes: coordinate information of the head image, identification of the head image, tracking ID, and traffic volume, etc. In practical applications, the Memory 830 may include a Volatile Memory (Volatile Memory), such as a Random Access Memory (RAM); the Memory may also include a Non-Volatile Memory (Non-Volatile Memory), such as a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk Drive (HDD), or a Solid-State Drive (SSD) Memory, which may also include a combination of the above types of memories.
Referring to fig. 14, the present application provides another schematic structural diagram of a people flow rate statistics device. The device for statistics of people flow rate according to this embodiment may be implemented in a cloud server 900 of a cloud service cluster and at least includes: at least one computing node 910 and at least one storage node 920.
the computing node 910 includes one or more processors 911, a communication interface 912, and a memory 913, which may be coupled via a bus 914 between the processors 911, the communication interface 912, and the memory 913.
The processor 911 includes one or more general-purpose processors for executing the acquiring module 701, the identifying module 702, the tracking module 703 and the counting module 704 in fig. 12 by calling the program code in the memory 913. A general-purpose processor may be any type of device capable of processing electronic instructions, including a Central Processing Unit (CPU), a microprocessor, a microcontroller, a main processor, a controller, an Application Specific Integrated Circuit (ASIC), and the like. It can be a dedicated processor for the compute node 910 only or can be shared with other compute nodes 910. The processor 911 reads the program code stored in the memory 913 to cooperate with the communication interface 912 to perform part or all of the steps of the method performed by the apparatus 70 for people traffic statistics of high-density people group in the above-mentioned embodiment of the present application.
The communication interface 912 may be a wired interface (e.g., an ethernet interface) for communicating with other computing nodes or users. When communication interface 912 is a wired interface, communication interface 912 may employ a Protocol family over TCP/IP, such as RAAS Protocol, Remote Function Call (RFC) Protocol, Simple Object Access Protocol (SOAP) Protocol, Simple Network Management Protocol (SNMP) Protocol, Common Object Request Broker Architecture (CORBA) Protocol, and distributed Protocol, among others.
The Memory 913 may include a Volatile Memory (Volatile Memory), such as a Random Access Memory (RAM); the Memory may also include a Non-volatile Memory (Non-volatile Memory), such as a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk Drive (HDD), or a Solid-State Drive (SSD) Memory, which may also include a combination of the above types of memories.
The storage node 920 includes one or more storage controllers 921, storage arrays 922. The memory controller 921 and the memory array 922 may be connected by a bus 923.
Storage controller 921 includes one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a CPU, microprocessor, microcontroller, host processor, controller, ASIC, and the like. It can be a dedicated processor for only a single storage node 920 or can be shared with the computing node 900 or other storage nodes 920. It is understood that in this embodiment, each storage node includes one storage controller, and in other embodiments, a plurality of storage nodes may share one storage controller, which is not limited herein.
Memory array 922 may include multiple memories. The memory may be a non-volatile memory, such as a ROM, flash memory, HDD or SSD memory, and may also include a combination of the above kinds of memory. For example, the storage array may be composed of a plurality of HDDs or a plurality of SSDs, or of a mixture of HDDs and SSDs. With the aid of the storage controller 921, the multiple memories are combined in various ways to form a memory group, thereby providing higher storage performance than a single memory as well as a data backup capability. Optionally, memory array 922 may include one or more data centers. The plurality of data centers may be located at the same site or at different sites, which is not limited here. Memory array 922 may store program code and program data. The program code includes the code of the acquiring module 701, the code of the identifying module 702, the code of the tracking module 703, and the code of the counting module 704. The program data includes: coordinate information of the head images, identifications of the head images, tracking IDs, pedestrian volume, and the like.
The present application also provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program is executed by hardware (for example, a processor, etc.) to implement part or all of the steps of any one of the methods performed by the people flow rate statistics apparatus in the present application.
The embodiments of the present application also provide a computer program product, which, when being read and executed by a computer, causes a people flow rate statistic device to perform part or all of the steps of the method for counting people flow rate in the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, digital subscriber line) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, memory Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state Disk, SSD)), among others. In the embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially or partially contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A people flow statistical method is characterized by comprising the following steps:
acquiring a video frame sequence of a target area within a preset time length;
identifying the video frame sequence to obtain the position of each head image in each video frame of the video frame sequence in each video frame;
performing target tracking on each head image in each video frame in the video frame sequence to obtain a target tracking result;
and counting the pedestrian volume moving from the first side of the boundary of the pedestrian volume statistics to the second side of the boundary according to the target tracking result.
2. The method of claim 1, wherein the identifying the sequence of video frames to obtain the position of each head image in each video frame of the sequence of video frames in each video frame comprises:
and inputting the video frame sequence into a neural network model for identification to obtain coordinate values of central pixel points of all head images in all the video frames in the video frames.
3. The method according to claim 2, wherein the performing target tracking on each head image in each video frame in the sequence of video frames to obtain a target tracking result specifically comprises:
matching each head image in each video frame to obtain an identity identification number of each head image; wherein the same ID number in different video frames is used for indicating the head image of the same person in the different video frames;
constructing a human head frame according to coordinate values of central pixel points of all head images in all video frames, wherein the human head frame is used for indicating areas of the head images of people in the video frames;
and tracking each human head frame according to the identification number of each head image to obtain a tracking result.
4. The method according to claim 3, wherein constructing a head frame according to the coordinate values of the central pixel point of each head image in each video frame comprises:
and constructing the human head frame by taking the coordinate value of the central pixel point of each head image in each video frame as a center.
5. The method according to claim 3 or 4, wherein the counting the pedestrian volume moving from a first side of a boundary of a pedestrian volume statistic to a second side of the boundary according to the target tracking result comprises:
determining whether the head box with the same identification number in the different video frames moves from a first side of the boundary to a second side of the boundary according to the target tracking result;
and counting the number of the head frames moving from the first side of the boundary to the second side of the boundary within the preset duration to obtain a statistical result of the pedestrian volume.
6. The method of claim 5, wherein determining whether the head box with the same id number in the different video frame moves from a first side of the boundary to a second side of the boundary according to the target tracking result comprises:
determining the central point position of the human head frame with the same identification number in the different video frames according to the target tracking result;
and determining whether the head frame with the same identification number moves from the first side of the boundary to the second side of the boundary according to the position relation between the central point position and the boundary.
7. The method of claim 5, wherein determining whether the head boxes in the different video frames corresponding to the same ID number move from a first side of the boundary to a second side of the boundary according to the tracking result comprises:
determining the boundary line position of the human head frame with the same identification number in the different video frames according to the target tracking result;
and determining whether the head frame with the same identification number moves from the first side of the boundary to the second side of the boundary according to the position relation between the boundary line position of the head frame and the boundary.
8. A people flow statistic apparatus, comprising:
the acquisition module is used for acquiring a video frame sequence of a target area within a preset time length;
the identification module is used for identifying the video frame sequence to obtain the position of each head image in each video frame of the video frame sequence in each video frame;
the tracking module is used for tracking the target of each head image in each video frame in the video frame sequence to obtain a target tracking result;
and the counting module is used for counting the pedestrian volume moved from the first side of the boundary of the pedestrian volume statistics to the second side of the boundary according to the target tracking result.
9. A computer-readable storage medium comprising program instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-7.
10. A people flow statistics device, characterized in that the device comprises a memory, a processor and a communication interface; the memory is used for storing information and data, the communication interface is used for receiving or sending information and data, and the processor is used for calling the information and data stored in the memory and executing the method according to any one of claims 1-7.
CN202010068164.3A 2020-01-20 2020-01-20 People flow statistical method, device, equipment and storage medium Pending CN111291646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010068164.3A CN111291646A (en) 2020-01-20 2020-01-20 People flow statistical method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010068164.3A CN111291646A (en) 2020-01-20 2020-01-20 People flow statistical method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111291646A true CN111291646A (en) 2020-06-16

Family

ID=71023418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010068164.3A Pending CN111291646A (en) 2020-01-20 2020-01-20 People flow statistical method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111291646A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103871082A (en) * 2014-03-31 2014-06-18 百年金海科技有限公司 Method for counting people stream based on security and protection video image
CN108986064A (en) * 2017-05-31 2018-12-11 杭州海康威视数字技术股份有限公司 A kind of people flow rate statistical method, equipment and system
CN107423708A (en) * 2017-07-25 2017-12-01 成都通甲优博科技有限责任公司 The method and its device of pedestrian's flow of the people in a kind of determination video
CN207993028U (en) * 2018-02-09 2018-10-19 深圳市基鸿运科技有限公司 A kind of passenger flow device based on infrared binocular camera ranging
CN109063534A (en) * 2018-05-25 2018-12-21 隆正信息科技有限公司 A kind of shopping identification and method of expressing the meaning based on image
CN110096979A (en) * 2019-04-19 2019-08-06 佳都新太科技股份有限公司 Construction method, crowd density estimation method, device, equipment and the medium of model

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560765A (en) * 2020-12-24 2021-03-26 上海明略人工智能(集团)有限公司 Pedestrian flow statistical method, system, equipment and storage medium based on pedestrian re-identification
CN112926481A (en) * 2021-03-05 2021-06-08 浙江大华技术股份有限公司 Abnormal behavior detection method and device
CN112926481B (en) * 2021-03-05 2024-04-19 浙江大华技术股份有限公司 Abnormal behavior detection method and device
CN113326830A (en) * 2021-08-04 2021-08-31 北京文安智能技术股份有限公司 Passenger flow statistical model training method and passenger flow statistical method based on overlook images
CN116311084A (en) * 2023-05-22 2023-06-23 青岛海信网络科技股份有限公司 Crowd gathering detection method and video monitoring equipment
CN116311084B (en) * 2023-05-22 2024-02-23 青岛海信网络科技股份有限公司 Crowd gathering detection method and video monitoring equipment

Similar Documents

Publication Publication Date Title
US20240028904A1 (en) Maintaining fixed sizes for target objects in frames
KR102507941B1 (en) Target Acquisition Method and Device
US20190304102A1 (en) Memory efficient blob based object classification in video analytics
US20190130188A1 (en) Object classification in a video analytics system
US9323991B2 (en) Method and system for video-based vehicle tracking adaptable to traffic conditions
CN111291646A (en) People flow statistical method, device, equipment and storage medium
US9405974B2 (en) System and method for using apparent size and orientation of an object to improve video-based tracking in regularized environments
AU2014240213B2 (en) System and Method for object re-identification
Ashraf et al. Dogfight: Detecting drones from drones videos
CN113449606A (en) Target object identification method and device, computer equipment and storage medium
Nagrath et al. Understanding new age of intelligent video surveillance and deeper analysis on deep learning techniques for object tracking
CN114399711A (en) Logistics sorting form identification method and device and storage medium
CN114169425A (en) Training target tracking model and target tracking method and device
CN113936175A (en) Method and system for identifying events in video
CN111753775A (en) Fish growth assessment method, device, equipment and storage medium
Gonçalves et al. Using a convolutional neural network for fingerling counting: A multi-task learning approach
EP4332910A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN112257666B (en) Target image content aggregation method, device, equipment and readable storage medium
CN114913470A (en) Event detection method and device
CN113515978B (en) Data processing method, device and storage medium
CN113469135A (en) Method and device for determining object identity information, storage medium and electronic device
WO2020261378A1 (en) Trajectory linking device, trajectory linking method, and non-temporary computer-readable medium with program stored therein
CN113469130A (en) Shielded target detection method and device, storage medium and electronic device
CN110909887B (en) Model optimization method and device
CN115953650B (en) Training method and device for feature fusion model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination