CN115661586A - Model training and people flow statistical method, device and equipment - Google Patents

Model training and people flow statistical method, device and equipment

Info

Publication number
CN115661586A
Authority
CN
China
Prior art keywords
sample
target
recognition
information
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211575979.6A
Other languages
Chinese (zh)
Other versions
CN115661586B (en)
Inventor
白帆 (Bai Fan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunli Intelligent Technology Co ltd
Original Assignee
Yunli Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunli Intelligent Technology Co ltd filed Critical Yunli Intelligent Technology Co ltd
Priority to CN202211575979.6A priority Critical patent/CN115661586B/en
Publication of CN115661586A publication Critical patent/CN115661586A/en
Application granted granted Critical
Publication of CN115661586B publication Critical patent/CN115661586B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a model training and people flow statistics method, device and equipment, relating to the technical field of artificial intelligence. The model training method includes: acquiring a training sample, where the training sample includes a plurality of sample images and annotation information of the sample images, the annotation information includes annotation positions and annotation identifiers of target objects, and the plurality of sample images are consecutive images in a sample video; inputting the plurality of sample images into an object recognition model to obtain recognition information output by the object recognition model, where the recognition information includes the recognition position and recognition identifier of a target object and the corresponding sample image; and adjusting the parameters of the object recognition model according to the annotation information and the recognition information. The adverse effect of environmental changes on target object recognition is reduced, thereby improving the accuracy and effect of people flow statistics.

Description

Model training and people flow statistical method, device and equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a model training and people flow statistical method, device and equipment.
Background
People flow statistics refers to the process of acquiring real-time people flow data through technical means, and can provide important support for the fine-grained management of public places such as scenic spots and shopping malls.

At present, real-time people flow is mainly estimated through video monitoring. Specifically, existing monitoring equipment is reused, and people in the surveillance video are identified by matching based on spatial and appearance similarity, so as to produce people flow statistics. Because this statistical method is realized through spatial and appearance similarity matching, the current people flow statistics perform poorly when the environment changes (for example, when occlusion occurs).
Disclosure of Invention
The application relates to a method, a device and equipment for model training and people flow statistics, and aims to improve the effect of current people flow statistics.
In a first aspect, the present application provides a model training method, including:
acquiring a training sample, wherein the training sample comprises a plurality of sample images and annotation information of the sample images, the annotation information comprises annotation positions and annotation identifications of target objects, and the plurality of sample images are a plurality of continuous images in a sample video;
inputting the multiple sample images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding sample image of a target object;
and adjusting the parameters of the object recognition model according to the labeling information and the recognition information.
In one possible embodiment, the object recognition model comprises a backbone network module, a target detection module and a global tracking module; the inputting the plurality of sample images into an object recognition model to obtain recognition information output by the object recognition model includes:
performing feature extraction processing on the plurality of sample images based on the backbone network module to obtain feature maps of the plurality of sample images;
performing target detection processing on the feature map according to the target detection module to obtain detection data corresponding to the plurality of sample images;
and processing the detection data and the feature map based on the global tracking module to obtain the identification information.
In one possible embodiment, the object detection module comprises a first detection head, a second detection head and a third detection head; the performing, according to the target detection module, target detection processing on the feature map to obtain detection data corresponding to the plurality of sample images includes:
performing central point detection processing on the feature map according to the first detection head to obtain the central point position of a target object in each sample image;
performing central point offset detection processing on the feature map according to the second detection head to obtain central point offsets of target objects in the sample images;
carrying out anchor point detection processing on the feature map according to the third detection head to obtain the size of an anchor point boundary frame corresponding to the central point position of the target object in each sample image;
the detection data includes the center point location, the center point offset, and a size of the anchor bounding box.
In a possible implementation manner, the processing the detection data and the feature map based on the global tracking module to obtain the identification information includes:
according to the detection data and the feature map, obtaining the characteristic feature of each target object in the multiple sample images;
and carrying out global tracking processing on the characterization features according to the global tracking module to obtain the identification information.
In a possible implementation manner, the characterization feature includes a characterization feature map corresponding to each target object; the performing global tracking processing on the characterization feature according to the global tracking module to obtain the identification information includes:
acquiring tracking track information corresponding to each target object according to a characterization feature map corresponding to each target object, wherein the tracking track information comprises an index of a target sample image, time of the target sample image and a central point position of the target object on the target sample image, and the target sample image is a sample image including the target object in the multiple sample images;
and obtaining the identification information according to the tracking track information corresponding to each target object.
In a possible implementation manner, the adjusting the parameters of the object recognition model according to the labeling information and the recognition information includes:
acquiring a detection loss value corresponding to the target detection module according to the detection data and the labeling information;
acquiring a global tracking loss value corresponding to the global tracking module according to the identification information and the labeling information;
and adjusting parameters of the object identification model according to the detection loss value and the global tracking loss value.
In a second aspect, the present application provides a people flow rate statistical method, including:
acquiring a first video, wherein the first video comprises a plurality of images;
inputting the multiple images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding image of a target object in the first video; the object recognition model is a model obtained by training according to the model training method of any one of the first aspect;
and determining the flow of people corresponding to the first video according to the identification information.
In a third aspect, the present application provides a model training apparatus, comprising:
a first acquisition unit, configured to acquire a training sample, where the training sample includes a plurality of sample images and annotation information of the sample images, the annotation information includes annotation positions and annotation identifiers of target objects, and the plurality of sample images are consecutive images in a sample video;
the first processing unit is used for inputting the plurality of sample images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding sample image of a target object;
and the training unit is used for adjusting the parameters of the object recognition model according to the labeling information and the recognition information.
In one possible embodiment, the object recognition model comprises a backbone network module, a target detection module and a global tracking module; the first processing unit is specifically configured to:
performing feature extraction processing on the plurality of sample images based on the backbone network module to obtain feature maps of the plurality of sample images;
performing target detection processing on the feature map according to the target detection module to obtain detection data corresponding to the multiple sample images;
and processing the detection data and the feature map based on the global tracking module to obtain the identification information.
In one possible embodiment, the object detection module comprises a first detection head, a second detection head and a third detection head; the first processing unit is specifically configured to:
performing central point detection processing on the feature map according to the first detection head to obtain the central point position of a target object in each sample image;
performing central point offset detection processing on the feature map according to the second detection head to obtain central point offsets of target objects in the sample images;
carrying out anchor point detection processing on the feature map according to the third detection head to obtain the size of an anchor point boundary frame corresponding to the central point position of the target object in each sample image;
the detection data includes the center point location, the center point offset, and a size of the anchor bounding box.
In a possible implementation manner, the first processing unit is specifically configured to:
according to the detection data and the feature map, obtaining the characteristic feature of each target object in the multiple sample images;
and carrying out global tracking processing on the characterization features according to the global tracking module to obtain the identification information.
In a possible implementation manner, the characterization feature includes a characterization feature map corresponding to each target object; the first processing unit is specifically configured to:
acquiring tracking track information corresponding to each target object according to a characterization feature map corresponding to each target object, wherein the tracking track information comprises an index of a target sample image, time of the target sample image and a central point position of the target object on the target sample image, and the target sample image is a sample image including the target object in the multiple sample images;
and obtaining the identification information according to the tracking track information corresponding to each target object.
In a possible implementation, the training unit is specifically configured to:
acquiring a detection loss value corresponding to the target detection module according to the detection data and the labeling information;
acquiring a global tracking loss value corresponding to the global tracking module according to the identification information and the labeling information;
and adjusting parameters of the object identification model according to the detection loss value and the global tracking loss value.
In a fourth aspect, the present application provides a people flow rate statistics apparatus, comprising:
the second acquisition unit is used for acquiring a first video, and the first video comprises a plurality of images;
the second processing unit is used for inputting the plurality of images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding image of a target object in the first video; the object recognition model is a model obtained by training according to the model training method of any one of the first aspect;
and the determining unit is used for determining the flow of people corresponding to the first video according to the identification information.
In a fifth aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the model training method according to any one of the first aspect when executing the program, or implements the people flow statistical method according to the second aspect when executing the program.
In a sixth aspect, the present application provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the model training method according to any one of the first aspect, or which, when executed by a processor, implements the people flow statistics method according to the second aspect.
In a seventh aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the model training method according to any one of the first aspect, or which, when executed by a processor, implements the people flow statistics method according to the second aspect.
According to the model training and people flow statistics method, device and equipment of the present application, a training sample is first acquired, where the training sample includes a plurality of sample images and annotation information of the sample images, the annotation information includes annotation positions and annotation identifiers of target objects, and the plurality of sample images are consecutive images in a sample video; the plurality of sample images are then input into an object recognition model to obtain recognition information output by the object recognition model, where the recognition information includes the recognition position and recognition identifier of a target object and the corresponding sample image; and finally, the parameters of the object recognition model are adjusted according to the annotation information and the recognition information. Because the object recognition model is trained on a plurality of sample images from the same sample video, and these sample images are correlated with one another, the association between the relevant features and the target object can be learned from this correlation even if the environment of the region corresponding to the sample video changes. Trajectory tracking of the target object is then realized based on the identifier and position of the target object, which reduces the adverse effect of environmental changes on target object recognition and thereby improves the accuracy and effect of people flow statistics.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating a model training method according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an object recognition model according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a process of processing a sample image by an object recognition model according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a backbone network module according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a global tracking module according to an embodiment of the present application;
FIG. 7 is a schematic processing diagram of a global tracking module according to an embodiment of the present application;
fig. 8 is a schematic flow chart of a people flow rate statistical method according to an embodiment of the present application;
FIG. 9 is a comparison diagram of people flow recognition results provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a people flow rate statistic device according to an embodiment of the present application;
fig. 12 is a schematic physical structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
People flow statistics refers to the process of acquiring real-time people flow data through technical means, and can provide important support for the fine-grained management of public places such as scenic spots and shopping malls. With reasonably accurate people flow statistics, real-time people flow data can be grasped, service personnel can be dynamically increased or decreased, and service quality can be improved; administrators can more easily deal with the potential safety hazards caused by unexpected increases in people flow, and the actual number of people currently in the region can be monitored in real time, achieving early warning and control while reducing labor costs.

At present, real-time people flow is mainly estimated through video monitoring. Taking the Tracking-By-Detection (TBD) approach as an example, existing monitoring equipment is reused, and people in the surveillance video are identified by matching based on spatial and appearance similarity, so as to produce people flow statistics. Because this statistical approach is realized through spatial and appearance similarity matching, it lacks the ability to model time, and the people flow statistics perform poorly when the environment changes (for example, when occlusion occurs).

Based on this, the embodiments of the present application provide a model training and people flow statistics method that implicitly learns to model the long-range temporal changes of targets through temporal correlation, effectively alleviating the poor people flow statistics caused by occlusion arising from environmental changes. An application scenario applicable to the present application is first described below with reference to fig. 1.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application, and as shown in fig. 1, the application scenario includes a camera 11 and a server 12, where the camera 11 and the server 12 are connected by a wire or wirelessly.
The camera 11 is used for shooting an area a, and obtaining a corresponding shot video, wherein the range shot by the camera 11 (i.e. the area a) is not changed, and the environment or pedestrians in the area a may change.
The camera 11 sends the shot video to the server 12, and the server 12 processes the video data after receiving the shot video, so as to obtain the pedestrian volume of the area a in the corresponding time period.
In connection with the application scenario of fig. 1, a method according to an exemplary embodiment of the present application is described below with reference to fig. 2. It should be noted that the above application scenarios are only presented to facilitate understanding of the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
Although the execution subject is the server 12 in the scenario illustrated in fig. 1, the execution subject of each embodiment in the present application may be, for example, any device having a data processing function, such as a server, a processor, a microprocessor or a chip, or it may be a client. The specific execution subject of each embodiment in the present application is not limited and may be selected and set according to actual needs; any device having a data processing function may serve as the execution subject of each embodiment in the present application. Furthermore, the execution subject that performs the model training method and the execution subject that performs the people flow statistics method may be the same or different.
Fig. 2 is a schematic flowchart of a model training method provided in an embodiment of the present application, and as shown in fig. 2, the method may include:
s21, a training sample is obtained, the training sample comprises a plurality of sample images and labeling information of the sample images, the labeling information comprises labeling positions and labeling marks of the target object, and the plurality of sample images are continuous images in the sample video.
The training samples are used for training the object recognition model, in the embodiment of the present application, the number of the training samples is one or more groups, and in the following embodiments, the process of using any one group of the training samples for training the object recognition model is described.
Any group of training samples comprises a plurality of sample images, the plurality of sample images are continuous images in the sample video, namely the plurality of sample images are arranged according to the time sequence of the sample images in the sample video, and the sample images comprise corresponding time information. Because the multiple sample images all belong to the same sample video, the multiple sample images have a certain association relationship.
The training sample includes, in addition to the plurality of sample images, the annotation information of each sample image; the annotation information can be obtained by the server after being labeled by annotators. The annotation information mainly includes the annotation position and annotation identifier of a target object on the sample image, where the target object is the object to be recognized by the object recognition model. The annotation position is the position of the target object on the sample image, which can be annotated as a rectangular frame covering the target object or in other possible forms; the annotation identifiers correspond one-to-one with the target objects and are used to uniquely mark each target object.
Taking the case where the object recognition model is used to recognize people as an example, the target objects are all the people on the sample image, the annotation position is a rectangular frame covering a target object on the sample image, and the annotation identifier is an identifier used to mark that target object on the sample image.
It should be noted that, because the training sample includes multiple sample images, each sample image needs to be annotated; the same target object may appear on different sample images, and different sample images may contain different target objects, so the annotation identifiers need to be assigned uniformly across the multiple sample images. That is, for any target object, no matter which sample image it appears on, its annotation identifier is the same; for different target objects, no matter which sample images they appear on, their annotation identifiers are different. In summary, target objects and annotation identifiers correspond one to one: the same target object corresponds to the same annotation identifier, and different target objects correspond to different annotation identifiers. For a group of training samples, multiple sample images from the same sample video are used so that the object recognition model can learn the association among the different sample images of the sample video, realizing modeling over time and improving the accuracy of target object recognition.
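Purely as an illustration (the source does not prescribe a storage format), a group of training samples with per-frame annotations might be organized as below; every field name here is hypothetical.

```python
# Hypothetical layout of one group of training samples (format not specified by the source).
# Each sample image carries the annotation positions (boxes) and annotation identifiers (ids)
# of the target objects it contains; the same person keeps the same id across frames.
training_sample = {
    "video_id": "sample_video_001",
    "frames": [
        {
            "index": 0,                      # position of the image in the sample video
            "time": 0.00,                    # timestamp in seconds
            "boxes": [[120, 80, 210, 340],   # annotation position: [x1, y1, x2, y2] per target
                      [400, 95, 470, 330]],
            "ids": [7, 12],                  # annotation identifiers, unique per target object
        },
        {
            "index": 1,
            "time": 0.04,
            "boxes": [[126, 82, 215, 342]],  # only the target with id 7 is visible here
            "ids": [7],
        },
    ],
}
```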
And S22, inputting the plurality of sample images into the object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding sample image of the target object.
After a plurality of sample images are input to the object recognition model, the object recognition model processes the sample images, a target object on each sample image is recognized, and finally recognition information is output, wherein the recognition information comprises a recognition position, a recognition mark and a corresponding sample image of the target object.
The sample image corresponding to the target object refers to the sample image on which the target object identified by the object identification model appears, the identification position of the target object refers to the position of the target object identified by the object identification model on the sample image, and the identification mark of the target object refers to the mark of the target object identified by the object identification model.
And S23, adjusting parameters of the object identification model according to the marking information and the identification information.
After the identification information is obtained, the server calculates a model loss value according to the labeling information and the identification information, and then adjusts parameters of the object identification model according to the model loss value.
Before the model training termination condition is reached, aiming at any group of training samples, the parameters of the object recognition model can be adjusted by adopting the scheme. And when the model training termination condition is reached, stopping the model training, and obtaining the trained object recognition model. The model training termination condition may be set according to actual needs, for example, a maximum number of times of model training may be set, for example, a convergence condition may be set, and the present embodiment does not limit this.
After the training of the object recognition model is completed, the ability of recognizing the position and the identification of the target object in each image in the video is achieved, the image where the target object appears can be obtained according to the identification of the target object, the position where the target object is located can be obtained according to the position of the target object, and the target object can be subjected to track tracking based on the identification and the position of the target object, so that the people flow statistics is achieved according to the track tracking result.
According to the model training method provided by the embodiment of the present application, a training sample is first acquired, where the training sample includes a plurality of sample images and annotation information of the sample images, the annotation information includes annotation positions and annotation identifiers of target objects, and the plurality of sample images are consecutive images in a sample video; the plurality of sample images are then input into an object recognition model to obtain recognition information output by the object recognition model, where the recognition information includes the recognition position and recognition identifier of a target object and the corresponding sample image; and finally, the parameters of the object recognition model are adjusted according to the annotation information and the recognition information. Because the object recognition model is trained on a plurality of sample images from the same sample video, and these sample images are correlated with one another, the association between the relevant features and the target object can be learned from this correlation even if the environment of the region corresponding to the sample video changes. Trajectory tracking of the target object is then realized based on the identifier and position of the target object, which reduces the adverse effect of environmental changes on target object recognition and thereby improves the accuracy and effect of people flow statistics.
On the basis of any of the embodiments described above, the solution of the present application is further described below with reference to the specific drawings.
Fig. 3 is a schematic structural diagram of an object identification model provided in an embodiment of the present application, and as shown in fig. 3, the object identification model includes a backbone network module, a target detection module, and a global tracking module, where a sample image is first input to the backbone network module, and then passes through the target detection module and the global tracking module in sequence, and finally identification information is obtained.
The following describes the processing procedure of the object recognition model with reference to fig. 4 on the basis of the structure of the object recognition model illustrated in fig. 3.
Fig. 4 is a schematic flowchart of a process of processing a sample image by an object recognition model according to an embodiment of the present application, and as shown in fig. 4, the process includes:
and S41, performing feature extraction processing on the multiple sample images based on the backbone network module to obtain feature maps of the multiple sample images.
As shown in fig. 3, a plurality of sample images are input to the backbone network module as input images. Optionally, the plurality of sample images are input to the backbone network module after being preprocessed, and the preprocessing may include image enhancement processing and/or size processing of the sample images.
The image enhancement processing refers to processing the sample image by adopting an image enhancement algorithm. The image enhancement algorithm may for example comprise one or more of the following: smoothing and filtering the image mean value; motion blur; median filtering; image sharpening enhancement; enhancing the image; gaussian blur; gaussian noise; self-adaptive Gaussian noise; camera sensor noise; simulating image fog enhancement randomly; self-adaptive histogram equalization; random hue, saturation modification; random brightness and contrast modification; random channel rearrangement; optical distortion; random gamma noise; random color dithering; vertically overturning; horizontally turning; image transposition; rotating at random angles; affine transformation; grid distortion; random grid arrangement; grid erasing; elastic transformations, etc. One or more of the image enhancement algorithms can be selected, or the corresponding image enhancement algorithm is selected according to a randomly generated probability value in the model training process to perform image enhancement processing on the sample image, so that the rapid convergence of the model is guaranteed, and the generalization capability of the model is improved.
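As a sketch of the probability-gated selection just described, a few of the listed operations could be applied as follows; the specific operations chosen and their probability thresholds are illustrative assumptions, not values from the source.

```python
import random
import numpy as np
import cv2

def augment(image: np.ndarray) -> np.ndarray:
    """Randomly apply some of the listed enhancement operations, each gated by a
    randomly generated probability value (operations and thresholds are illustrative)."""
    if random.random() < 0.5:                       # horizontal flip
        image = cv2.flip(image, 1)
    if random.random() < 0.3:                       # Gaussian blur
        image = cv2.GaussianBlur(image, (5, 5), 0)
    if random.random() < 0.3:                       # random brightness / contrast
        alpha, beta = random.uniform(0.8, 1.2), random.uniform(-20, 20)
        image = cv2.convertScaleAbs(image, alpha=alpha, beta=beta)
    if random.random() < 0.2:                       # Gaussian noise
        noise = np.random.normal(0, 8, image.shape)
        image = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    return image
```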
The size processing mainly adjusts the size of the sample image. For example, if the image size that the object recognition model can handle is square while the sample image is rectangular, the size of the sample image needs to be adjusted. Directly scaling the rectangular sample image to the size handled by the object recognition model would distort it. Therefore, the embodiment of the application can use letterbox padding (a method for adjusting image size in target detection, as used in the yolov5 detection network) to fill the borders, which maintains the aspect ratio of the original sample image while satisfying the object recognition model's requirement for square image input.
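A minimal letterbox sketch in the spirit of the yolov5-style padding mentioned above; the target size of 640 and the padding value of 114 are assumptions.

```python
import cv2
import numpy as np

def letterbox(image: np.ndarray, dst_size: int = 640, pad_value: int = 114) -> np.ndarray:
    """Scale the longer side to dst_size while keeping the aspect ratio, then pad the
    borders so the result is the square input expected by the object recognition model."""
    h, w = image.shape[:2]
    scale = dst_size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(image, (new_w, new_h))
    top = (dst_size - new_h) // 2
    left = (dst_size - new_w) // 2
    return cv2.copyMakeBorder(resized, top, dst_size - new_h - top,
                              left, dst_size - new_w - left,
                              cv2.BORDER_CONSTANT, value=(pad_value,) * 3)
```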
After the sample images are input to the backbone network module, feature extraction is performed on the plurality of sample images based on the backbone network module to obtain feature maps of the plurality of sample images. In the embodiment of the application, the backbone network module may be, for example, a DLA-34 network, which balances running speed and accuracy; Deep Layer Aggregation (DLA) fuses features between different stages through iterative deep aggregation, improving the detection of targets at different scales.
Fig. 5 is a schematic structural diagram of a backbone network module provided in an embodiment of the present application, and as shown in fig. 5, the backbone network module may include a convolution sub-module, a hierarchical depth aggregation sub-module, and an aggregation point sub-module, where the convolution sub-module may be configured to perform convolution processing on an input, the hierarchical depth aggregation sub-module may be configured to perform hierarchical depth aggregation processing on the input, and the aggregation point sub-module may be configured to perform aggregation processing on the input. At different levels, up-sampling or down-sampling can be performed according to actual needs.
Fig. 5 illustrates a structure of a backbone network module, wherein different sub-modules are represented by different blocks and different processes are represented by different arrows. It should be noted that the structure in fig. 5 is only one possible structure of the backbone network module, and the backbone network module may also be any other possible structure.
And S42, carrying out target detection processing on the characteristic graph according to the target detection module to obtain detection data corresponding to the multiple sample images.
After the main network module outputs the feature maps of the multiple sample images, the feature maps are input to the target detection module, and the target detection module performs target detection processing on the feature maps to obtain detection data corresponding to the multiple sample images.
As shown in fig. 3, the object detection module includes a first detection head, a second detection head, and a third detection head, to which feature maps are input, respectively.
For the first detection head, center point detection is performed on the feature map according to the first detection head to obtain the center point position of the target object in each sample image. The first detection head may be, for example, a heatmap detection head; the heat map of each channel predicts the center point positions of the target objects of the corresponding category that may exist in the feature map, thereby predicting the number of target objects of that category in the current feature map and their center point positions. For example, if the sample image contains two persons a and b, and persons a and b are different target objects, the first detection head is used to detect the center point position of person a and the center point position of person b from the feature map. Optionally, the center point position may be the center of the rectangular frame covering the target object; the center point detected by the first detection head is the center of the rectangular frame it predicts, which may differ from the actual center point, and the loss value corresponding to the first detection head can be calculated from this difference. For example, after obtaining the predicted center point position and the actual center point position, the loss value of the first detection head is obtained using a focal-loss variant of the center loss function based on the predicted and actual center point positions.
For example, let the sample image be $I \in R^{H \times W \times 3}$, where H is the height of the sample image (i.e., the number of pixels along its height), W is the width of the sample image (i.e., the number of pixels along its width), and R denotes the real-number space, so each sample image has dimension $H \times W \times 3$. The heat map generated by the heat map detection head (a feature map with a Gaussian distribution centered on the center point position of the target object in the sample image) is $\hat{Y} \in [0, 1]^{\frac{H}{M} \times \frac{W}{M} \times C}$, where M is the scaling scale and C is the number of target categories. In the predicted heat map $\hat{Y}$, the Gaussian distribution corresponding to a target center point at a certain position (x, y) has the value 1 at its center: if $\hat{Y}_{xy} = 1$, the corresponding pixel is the center point position of a target object; if $\hat{Y}_{xy} = 0$, the corresponding pixel is background. Each real coordinate p on the sample image corresponds to $\tilde{p} = \lfloor p / M \rfloor$, which is the corresponding correct label (ground truth) on the heat map. This is because the size of the heat map is gradually reduced by the convolution operations, for example to 1/4 of the original image, so the corresponding heat-map coordinate can be obtained from the coordinate p on the original sample image as $\tilde{p} = \lfloor p / 4 \rfloor$.
The heat map response at coordinates (x, y) on the sample image is then formulated as

$$Y_{xy} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$$

where $\sigma_p$ is an object-size-adaptive standard deviation, (x, y) are the actual coordinates on the sample image, and $\tilde{p}$ is the center point coordinate mapped onto the heat map.
During training, a pixel-wise logistic-regression focal loss is used as the center loss:

$$L_{k} = -\frac{1}{N}\sum_{xy}\begin{cases}\left(1-\hat{Y}_{xy}\right)^{\alpha}\log\left(\hat{Y}_{xy}\right) & \text{if } Y_{xy}=1 \\ \left(1-Y_{xy}\right)^{\beta}\left(\hat{Y}_{xy}\right)^{\alpha}\log\left(1-\hat{Y}_{xy}\right) & \text{otherwise}\end{cases}$$

where $L_k$ is the center loss (i.e., the loss value of the first detection head), the subscript xy denotes the x-th pixel in the horizontal direction and the y-th pixel in the vertical direction of the sample image, $Y_{xy}$ is the labeled target center point (i.e., the annotated center point position of the target object), $\hat{Y}_{xy}$ is the predicted target center point (i.e., the predicted center point position of the target object), N is the number of center points contained in the sample image, and $\alpha$ and $\beta$ are hyper-parameters.
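A sketch of the center (focal) loss above in PyTorch, assuming the heat maps are tensors of shape (B, C, H', W'); the default values of alpha and beta follow common practice and are not fixed by the source.

```python
import torch

def center_focal_loss(pred_heatmap, gt_heatmap, alpha=2.0, beta=4.0, eps=1e-6):
    """Pixel-wise logistic-regression focal loss between the predicted heat map
    Y_hat and the Gaussian ground-truth heat map Y (shapes: B x C x H' x W')."""
    pred = pred_heatmap.clamp(eps, 1.0 - eps)
    pos_mask = gt_heatmap.eq(1.0).float()                 # annotated center points
    neg_mask = 1.0 - pos_mask                             # background pixels
    pos_loss = ((1.0 - pred) ** alpha) * torch.log(pred) * pos_mask
    neg_loss = ((1.0 - gt_heatmap) ** beta) * (pred ** alpha) * torch.log(1.0 - pred) * neg_mask
    num_centers = pos_mask.sum().clamp(min=1.0)           # N: number of center points
    return -(pos_loss.sum() + neg_loss.sum()) / num_centers
```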
And for the second detection head, performing central point offset detection processing on the feature map according to the second detection head to obtain central point offsets of the target objects in the sample images. The second detection head may be, for example, a center offset (center offset) detection head for detecting an offset amount of the target object from the center point position.
The second detection head is used for more accurately positioning the target object, wherein the central point offset is an offset of each part of the target object relative to the central point position, and taking the target object as an example, distances of parts of the person, such as limbs and the head, on the sample image relative to the central point position of the person can be used as the central point offset.
And the second detection head predicts the central point offset of the target object according to the input feature map, and then calculates the loss value of the second detection head according to the actual central point offset of the target object.
Let the center point offset predicted by the second detection head be $\hat{O} \in R^{\frac{H}{M} \times \frac{W}{M} \times 2}$. Using the L1 loss, one can obtain

$$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}} - \left(\frac{p}{M} - \tilde{p}\right)\right|$$

where $L_{off}$ is the loss value of the second detection head, N is the number of center points contained in the sample image, p is a center point annotated on the sample image, $\tilde{p}$ is the corresponding coordinate on the heat map, $\hat{O}_{\tilde{p}}$ is the predicted center point offset corresponding to pixel p, and M is the scaling scale; for example, M = 4 means the original image is downscaled by a factor of 4.
For the third detection head, anchor point detection is performed on the feature map according to the third detection head to obtain the size of the anchor bounding box corresponding to the center point position of the target object in each sample image. The third detection head may be, for example, a box size detection head, which predicts the width and height of the target object, i.e., the bounding box size. The box size detection head is responsible for estimating the height and width of the target bounding box at each anchor point from the feature map; the predicted height and width are compared with the height and width in the real label, and the loss value of the target bounding box size is calculated with an L1 loss function as the loss value of the box size detection head.
Let the predicted target size be $\hat{S} \in R^{\frac{H}{M} \times \frac{W}{M} \times 2}$, where $\hat{s}_k$ is the predicted bounding box size of target k. Suppose the bounding box of target k is $\left(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)}\right)$, where $\left(x_1^{(k)}, y_1^{(k)}\right)$ is the coordinate of the upper-left point of the target bounding box and $\left(x_2^{(k)}, y_2^{(k)}\right)$ is the coordinate of the lower-right point; the center point position is $p_k = \left(\frac{x_1^{(k)}+x_2^{(k)}}{2}, \frac{y_1^{(k)}+y_2^{(k)}}{2}\right)$. The regression target for the frame size of target object k is $s_k = \left(x_2^{(k)}-x_1^{(k)},\, y_2^{(k)}-y_1^{(k)}\right)$, i.e., the width w and height h of the target bounding box. Using the L1 loss, one can obtain

$$L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{s}_{p_k} - s_k\right|$$

where $L_{size}$ is the loss value of the third detection head, N is the number of center points contained in the sample image, $\hat{s}_{p_k}$ is the predicted width and height of the target bounding box, and $s_k$ is the actual size of the target bounding box, i.e., the width and height of the annotated target bounding box.
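A sketch of the two L1 losses above, gathering the outputs of the second and third detection heads at the labeled center points; the tensor layouts are assumptions.

```python
import torch
import torch.nn.functional as F

def offset_and_size_loss(pred_offset, pred_size, gt_offset, gt_size, center_mask):
    """pred_offset / pred_size: (B, 2, H', W') maps from the second / third detection heads.
    gt_offset / gt_size: ground-truth values placed at each labeled center point;
    center_mask is 1 at the N labeled center points and 0 elsewhere (shape (B, 1, H', W'))."""
    num_centers = center_mask.sum().clamp(min=1.0)
    l_off = (F.l1_loss(pred_offset, gt_offset, reduction="none") * center_mask).sum() / num_centers
    l_size = (F.l1_loss(pred_size, gt_size, reduction="none") * center_mask).sum() / num_centers
    return l_off, l_size
```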
A detection loss value corresponding to the target detection module is then obtained according to the detection data and the annotation information. The calculation of the loss values of the three detection heads was introduced above; the detection loss value corresponding to the target detection module can be obtained from these three loss values as

$$L_{det} = L_{k} + \lambda_{off}L_{off} + \lambda_{size}L_{size}$$

where $L_{det}$ is the detection loss value and $\lambda_{off}$ and $\lambda_{size}$ are loss weights.
And S43, processing the detection data and the feature map based on the global tracking module to obtain identification information.
The global tracking module may be a transformer-based global tracking module. First, according to the center point position of a target object detected by the target detection module, the target object is mapped to the corresponding region of the feature map produced by the backbone network module; a Region of Interest pooling (ROI pooling) layer is used to extract the characterization feature of the target object (the ROI pooling layer accepts inputs of arbitrary size and outputs features of fixed dimension), and the characterization feature is input to the global tracking module, together with the track queries, to predict the probability of each tracking track.
The transformer-based global tracking module may use a DETR (DEtection TRansformer, an end-to-end learning system for object detection) structure, but with only one encoding layer and one decoding layer. Fig. 6 is a schematic structural diagram of a global tracking module according to an embodiment of the present application; as shown in fig. 6, the global tracking module includes a self-attention sub-module, a linear rectification sub-module, and a cross-attention sub-module. The detection data and the feature map are input into the self-attention sub-module, and the identification information is then obtained through the processing of the self-attention sub-module, the linear rectification sub-module, the cross-attention sub-module and the other sub-modules. Fig. 7 is a schematic processing diagram of the global tracking module according to the embodiment of the present disclosure; as shown in fig. 7, the input of the global tracking module includes the detection data and the feature map, where the detection data includes the center point position of the target object, the center point offset, and the size of the anchor bounding box, and the feature map reflects the features of the sample image. After the detection data and the feature map are input into the global tracking module, the global tracking module processes them and outputs the identification information.
Specifically, the characterization feature of each target object in the plurality of sample images is first obtained according to the detection data and the feature map. As shown in fig. 3, after the detection data are obtained, the detection data and the feature maps are both input to the region of interest pooling layer, which aligns them to obtain a plurality of feature region maps; the feature region maps all have the same size, and their number is equal to the number of all target objects contained in the plurality of sample images. These feature region maps are then combined to obtain the characterization feature of each target object.
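A sketch of extracting fixed-size characterization features from the backbone feature map at the detected boxes, using torchvision's roi_align as a stand-in for the region of interest pooling layer described above; the output size and stride are assumptions.

```python
import torch
from torchvision.ops import roi_align

def extract_characterization_features(feature_map, boxes_per_image, out_size=7, stride=4):
    """feature_map: (B, C, H/stride, W/stride) output of the backbone network module.
    boxes_per_image: list of (N_i, 4) tensors of detected boxes in image coordinates.
    Returns one fixed-size feature region map per detected target object."""
    regions = roi_align(feature_map, boxes_per_image,
                        output_size=(out_size, out_size),
                        spatial_scale=1.0 / stride, aligned=True)
    return regions   # shape: (sum_i N_i, C, out_size, out_size)
```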
After the characteristic features of the target objects are obtained, the characteristic features of the target objects are input into the global tracking module, global tracking processing is carried out on the characteristic features according to the global tracking module, and then the identification information can be obtained.
The characterization features include characterization feature maps corresponding to the target objects, specifically, tracking track information corresponding to the target objects is obtained according to the characterization feature maps corresponding to the target objects, the tracking track information includes indexes of the target sample images, time of the target sample images, and a center point position of the target object on the target sample images, and the target sample images are sample images including the target objects in the multiple sample images. Then, according to the tracking track information corresponding to each target object, identification information is obtained.
Then, according to the identification information and the labeling information, a global tracking loss value corresponding to the global tracking module can be obtained.
The main training process of the transformer-based global tracking module is as follows. T frame images are used (T = 8 during training and T = 16 during prediction); each sample image I contains a series of target objects $\{o_1, o_2, \ldots\}$ whose center point positions are located by the target detection module, and target object $o_i$ has corresponding position information $p_i = (x_i, y_i)$, where $p_i$ is its center point position on the sample image. The consecutive frames $\{I_1, I_2, \ldots, I_T\}$ are input. Suppose that on the t-th frame image $I_t$, the target objects whose detection confidence meets the requirement are $\{o_1^t, o_2^t, \ldots, o_{N_t}^t\}$, with corresponding target center point position information $\{p_1^t, p_2^t, \ldots, p_{N_t}^t\}$; features are extracted according to this position information, and $f_i^t$ represents the features corresponding to $o_i^t$, so the feature set corresponding to $I_t$ is $F_t = \{f_1^t, f_2^t, \ldots, f_{N_t}^t\}$. Each $f_i^t$ is extracted from the position and size information of the target object on the t-th frame sample image given by the detection heads and is normalized to the same size through the ROI (region of interest) pooling layer, and $N_t$ represents the number of target objects on the t-th frame sample image. The total feature set over the T frames is $F = \{F_1, F_2, \ldots, F_T\}$. All targets correspond to a series of tracking tracks $\tau = \{\tau_1, \tau_2, \ldots, \tau_K\}$; for an arbitrary trajectory $\tau_k = \left(\tau_k^1, \tau_k^2, \ldots, \tau_k^T\right)$, $\tau_k^t$ represents its state at frame t, and $\tau_k^t = \emptyset$ represents that object k is not present in frame t.
The input of the transformer-based global tracking module is divided into two parts: the detected target feature maps are used as the encoder input, and a query matrix is used as the decoder input, where the query matrix is the feature matrix of known targets (namely, the feature values corresponding to the M persons appearing in previous frames). The output is an association matrix between the queries and the current targets (namely, the correspondence between the M known targets and the N currently detected targets), where N is the total number of detected targets across all frames; in other words, each query generates a score vector with respect to the target features F extracted from all frames.
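Purely as an illustrative sketch (not the patent's exact architecture), a single encoding layer and a single decoding layer of this kind could be written as follows; the embedding dimension, number of attention heads, and the dot-product scoring are assumptions.

```python
import torch
import torch.nn as nn

class GlobalAssociation(nn.Module):
    """Self-attention encodes the N detected target features; cross-attention lets the
    M known-target queries attend to them; the output is an M x N association matrix."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.relu = nn.ReLU()

    def forward(self, detections, queries):
        # detections: (1, N, dim) features of targets detected in all frames (encoder input)
        # queries:    (1, M, dim) features of known targets (decoder input)
        enc, _ = self.self_attn(detections, detections, detections)
        enc = self.relu(enc)
        dec, _ = self.cross_attn(queries, enc, enc)
        scores = dec @ enc.transpose(1, 2)        # (1, M, N) association matrix
        return scores
```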
The server predicts the discrete matching values of all targets to each track at time t, and therefore constructs an independent softmax activation function for normalization:

$$P_A\!\left(\alpha_k^t = i \mid q_k, F_t\right) = \frac{\exp\!\left(s_{ik}^t\right)}{1 + \sum_{j=1}^{N_t}\exp\!\left(s_{jk}^t\right)}$$

where $P_A$ is the activation function (the normalized association probability), $f_i^t$ represents the i-th target object in the t-th frame sample image, k denotes the k-th target object, $q_k$ denotes the request (query) for a known target object (for example, target object A appearing on the first frame sample image is used as a request to find which of all the targets in the T frame sample images is target object A), and $s_{ik}^t$ is the score of request k with respect to the i-th target object in frame t.
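A minimal sketch of this normalization: for one frame, the scores of the M requests against that frame's detections are extended with a zero logit for the "object absent" case and passed through a softmax; the column ordering is an assumption.

```python
import torch

def frame_association_prob(scores_t):
    """scores_t: (M, N_t) scores of the M requests against the N_t targets detected in
    frame t. Returns (M, N_t + 1) probabilities; column 0 is the empty/background slot."""
    background = scores_t.new_zeros(scores_t.size(0), 1)   # logit 0 for "object absent"
    return torch.softmax(torch.cat([background, scores_t], dim=1), dim=1)
```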
The goal of training is to learn a transformer-based tracker to estimate the association probabilities $P_A$. For each track $\tau_k$, the following log-likelihood is optimized:

$$\log P\!\left(\tau_k \mid q_k, F\right) = \sum_{s=1}^{T}\log P_A\!\left(\alpha_k^s = \tau_k^s \mid q_k, F_s\right)$$

where $\tau_k$ is the trajectory corresponding to the request of target object k, i.e., the detections in the feature set F extracted from all T frame sample images that can be matched to it (which feature sets and trajectories can be matched); s indexes the times at which features can be matched with the trajectory of requesting target object k, T is the number of sample images, $\tau_k^s$ denotes the k-th target object in the s-th frame sample image (i.e., at the s-th instant), and $\tau_k^s = \emptyset$ indicates the empty set.
For all unmatched features, an empty trajectory is constructed:

$$\tau_{\emptyset} = \left\{ f_j^s \mid f_j^s \notin \tau_k \;\; \forall k \right\}$$

where $f_j^s$ represents the features of the j-th target object at time s. The final tracking loss function is then

$$L_{track} = -\sum_{k}\log P\!\left(\tau_k \mid q_k, F\right) - \log P\!\left(\tau_{\emptyset} \mid F\right)$$
Training of the entire network automatically balances the detection and tracking tasks using an uncertainty loss function, defined as follows:

$$L_{total} = \frac{1}{2}\left(\frac{1}{e^{w_1}}L_{det} + \frac{1}{e^{w_2}}L_{track} + w_1 + w_2\right)$$

where $w_1$ and $w_2$ are hyper-parameters used to balance the detection and global tracking loss values.
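A sketch of uncertainty-based balancing with two learnable weights, following the common homoscedastic-uncertainty formulation; the source describes the balancing only qualitatively, so this exact form is an assumption.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Automatically balances the detection loss and global tracking loss with two
    learnable log-variance parameters w1, w2 (homoscedastic-uncertainty style)."""
    def __init__(self):
        super().__init__()
        self.w1 = nn.Parameter(torch.zeros(1))   # weight for the detection loss
        self.w2 = nn.Parameter(torch.zeros(1))   # weight for the global tracking loss

    def forward(self, l_det, l_track):
        total = 0.5 * (torch.exp(-self.w1) * l_det + torch.exp(-self.w2) * l_track
                       + self.w1 + self.w2)
        return total.squeeze()
```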
And adjusting parameters of the object recognition model according to the detection loss value and the global tracking loss value. In the training process, the optimization function used can be detailed Stochastic Gradient Descent (SGD) + Momentum (Momentum).
Gradient reduction: the direction of the gradient is the direction in which the function rises fastest at a given point, and then the opposite direction of the gradient is the direction in which the function falls fastest at the given point, so when the gradient is reduced, the weight is updated along the opposite direction of the gradient, and the global optimal solution can be effectively found.
The SGD algorithm randomly draws one group of samples, computes the gradient on that group and updates the parameters once, then draws another group and updates again. "Random" here means that the samples are randomly shuffled during each iteration.
SGD with momentum is used, where the momentum value lies in the range [0,1] and is set to 0.9 in the embodiment of the application. The significance of the momentum is that if the gradient signs at the current and previous steps are the same, the descent is accelerated (the update amplitude is larger), which alleviates the problem of originally slow descent; if the gradient signs at the current and previous steps are opposite, the two updates suppress each other and the oscillation is damped. Due to the momentum, when a local optimum is reached, the momentum helps the optimization jump out of it, so the training is not easily trapped in a local optimum.
The learning rate updating strategy of OneCycle (a learning rate scheduler) does not monotonically reduce the learning rate during training; instead, the learning rate varies back and forth between a set maximum value and a set minimum value. The phase in which the learning rate increases helps the loss value escape from saddle points, and since the optimal learning rate lies between the set maximum and minimum values, values near the optimal learning rate are used throughout the training process.
By combining SGD + momentum (0.9) + OneCycle, the convergence rate and generalization capability of the model can be effectively improved on the training set (public data set + self-labeled data) used in the embodiment of the application.
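As an illustration of this training setup, the following PyTorch snippet configures SGD with momentum 0.9 and a OneCycle learning rate schedule; the model, learning rate values, and step counts are placeholders, not values from the present application:

```python
import torch

# Hypothetical model and schedule sizes, for illustration only.
model = torch.nn.Linear(256, 2)
steps_per_epoch, epochs = 1000, 30

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.05,                       # the set maximum learning rate
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
)

for step in range(steps_per_epoch * epochs):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 256)).sum()  # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()                   # update the learning rate every step
```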
Fig. 8 is a schematic flow chart of a people flow rate statistical method according to an embodiment of the present application, and as shown in fig. 8, the method may include:
S81, acquiring a first video, wherein the first video comprises a plurality of images.
The first video is a video obtained by shooting a certain area and includes a plurality of continuous images; the images carry corresponding time information and also contain target objects, which may be pedestrians, passersby, and the like.
And S82, inputting the plurality of images into the object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises the recognition position, the recognition identifier and the corresponding image of the target object in the first video.
The object recognition model is a model obtained by training according to the model training method of the embodiment, and after the plurality of images are input to the object recognition model, the plurality of images are processed by the object recognition model, so that the recognition information can be output, wherein the recognition information comprises the recognition position, the recognition identifier and the corresponding image of the target object in the first video.
The image corresponding to the target object refers to the image on which the target object recognized by the object recognition model appears, the recognition position of the target object refers to the position of the recognized target object on that image, and the recognition identifier of the target object refers to the identifier assigned to the recognized target object.
And S83, determining the flow of people corresponding to the first video according to the identification information.
After the identification information is obtained, the trajectory of each target object in the video can be determined according to the identification information, so that each target object can be tracked and the target objects passing through or entering and exiting an area within a certain time period can be determined, thereby obtaining the people flow rate corresponding to the first video.
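A hedged sketch of turning the identification information into a people flow count by grouping records by identifier within a time window; the record fields (identifier, position, timestamp) are assumed for illustration:

```python
from collections import defaultdict

def count_people_flow(identification_info, start_time, end_time):
    """Count distinct target objects appearing in a video within a time window.

    identification_info: iterable of records produced by the object recognition
        model; each record is assumed (for illustration) to carry the fields
        "identifier", "position", and "timestamp" of the image it belongs to.
    Returns the number of distinct identifiers, i.e. the people flow, plus the
    per-identifier trajectories for later analysis.
    """
    trajectories = defaultdict(list)
    for record in identification_info:
        if start_time <= record["timestamp"] <= end_time:
            trajectories[record["identifier"]].append(
                (record["timestamp"], record["position"])
            )
    return len(trajectories), trajectories
```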
Fig. 9 is a schematic diagram comparing pedestrian volume identification provided in the embodiment of the present application. As shown in Fig. 9, the upper half is a schematic diagram of detecting pedestrian volume with the TBD method: because people in the video are identified by matching spatial and appearance similarities, this statistical method lacks temporal modeling capability (as shown in the upper half of Fig. 9, at most people in two adjacent images can be associated to determine whether they are the same person) and cannot effectively track the trajectory of each person in the video, so the people flow statistics result is poor. The lower half of Fig. 9 is a schematic diagram of people flow statistics using the scheme of the embodiment of the present application: by learning to model the long-range temporal variation of targets and implicitly capturing temporal correlation, people in all images of the same video can be associated to determine whether they are the same person. This provides temporal modeling capability, allows the trajectory of each person in the video to be tracked effectively, and improves the people flow statistics effect.
In summary, since the object recognition model is trained on a plurality of sample images from a sample video, and the plurality of sample images have a certain correlation, even if the environment of the region corresponding to the sample video changes, the correlation between the relevant features and the target object can still be learned from the correlation between the plurality of sample images, and trajectory tracking of the target object is realized based on the identifier and position of the target object, so that the adverse effect of environmental change on target object recognition is reduced.
Fig. 10 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application, and as shown in fig. 10, the model training apparatus 100 includes:
the first obtaining unit 101 is configured to obtain a training sample, where the training sample includes multiple sample images and annotation information of the sample images, the annotation information includes annotation positions and annotation identifiers of target objects, and the multiple sample images are multiple continuous images in a sample video;
the first processing unit 102 is configured to input the multiple sample images to an object recognition model, and obtain recognition information output by the object recognition model, where the recognition information includes a recognition position, a recognition identifier, and a corresponding sample image of a target object;
a training unit 103, configured to adjust parameters of the object recognition model according to the labeling information and the recognition information.
In one possible embodiment, the object recognition model comprises a backbone network module, a target detection module and a global tracking module; the first processing unit 102 is specifically configured to:
performing feature extraction processing on the plurality of sample images based on the backbone network module to obtain feature maps of the plurality of sample images;
performing target detection processing on the feature map according to the target detection module to obtain detection data corresponding to the multiple sample images;
and processing the detection data and the feature map based on the global tracking module to obtain the identification information.
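A minimal sketch of the processing order described above (backbone, then target detection, then global tracking); the callables and their signatures are assumptions for illustration:

```python
def recognize(images, backbone, detector, tracker):
    """Hypothetical end-to-end flow of the object recognition model:
    backbone -> target detection -> global tracking."""
    feature_maps = backbone(images)                  # per-image feature maps
    detections = detector(feature_maps)              # centers, offsets, box sizes
    identification_info = tracker(detections, feature_maps)
    return identification_info
```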
In one possible embodiment, the object detection module comprises a first detection head, a second detection head and a third detection head; the first processing unit 102 is specifically configured to:
performing central point detection processing on the feature map according to the first detection head to obtain the central point position of a target object in each sample image;
performing central point offset detection processing on the feature map according to the second detection head to obtain central point offsets of target objects in the sample images;
performing anchor point detection processing on the feature map according to the third detection head to obtain the size of an anchor point bounding box corresponding to the central point position of the target object in each sample image;
the detection data includes the center point location, the center point offset, and a size of the anchor bounding box.
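A hedged sketch of three convolutional detection heads producing a center-point heatmap, a center-point offset, and an anchor bounding-box size; the layer sizes and structure are assumptions, not the patented design:

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Hypothetical sketch of the three detection heads applied to a feature map:
    center-point heatmap, center-point offset, and anchor bounding-box size."""
    def __init__(self, in_channels: int = 256):
        super().__init__()
        def head(out_channels: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(in_channels, out_channels, 1),
            )
        self.center_head = head(1)   # first head: center-point heatmap
        self.offset_head = head(2)   # second head: (dx, dy) center-point offset
        self.size_head = head(2)     # third head: (w, h) anchor bounding-box size

    def forward(self, feature_map: torch.Tensor) -> dict:
        return {
            "center": torch.sigmoid(self.center_head(feature_map)),
            "offset": self.offset_head(feature_map),
            "size": self.size_head(feature_map),
        }
```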
In a possible implementation manner, the first processing unit 102 is specifically configured to:
according to the detection data and the feature map, obtaining the characteristic feature of each target object in the multiple sample images;
and carrying out global tracking processing on the characterization features according to the global tracking module to obtain the identification information.
In a possible implementation manner, the characterization feature includes a characterization feature map corresponding to each target object; the first processing unit 102 is specifically configured to:
acquiring tracking track information corresponding to each target object according to a characterization feature map corresponding to each target object, wherein the tracking track information comprises an index of a target sample image, time of the target sample image and a central point position of the target object on the target sample image, and the target sample image is a sample image including the target object in the multiple sample images;
and obtaining the identification information according to the tracking track information corresponding to each target object.
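A minimal sketch of a container for the tracking trajectory information described above (index of the target sample image, its time, and the target's center point); the field names are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TrackingTrajectory:
    """Hypothetical record of the tracking trajectory information of one target."""
    target_id: int
    image_indices: List[int] = field(default_factory=list)
    times: List[float] = field(default_factory=list)
    center_points: List[Tuple[float, float]] = field(default_factory=list)

    def add(self, image_index: int, time: float, center: Tuple[float, float]) -> None:
        self.image_indices.append(image_index)
        self.times.append(time)
        self.center_points.append(center)
```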
In a possible implementation, the training unit 103 is specifically configured to:
acquiring a detection loss value corresponding to the target detection module according to the detection data and the labeling information;
acquiring a global tracking loss value corresponding to the global tracking module according to the identification information and the labeling information;
and adjusting parameters of the object identification model according to the detection loss value and the global tracking loss value.
The model training device provided by the embodiment of the application can be used for executing the technical scheme of the embodiment of the model training method, the implementation principle and the technical effect are similar, and the details are not repeated here.
Fig. 11 is a schematic structural diagram of a people flow rate statistic device according to an embodiment of the present application, and as shown in fig. 11, the people flow rate statistic device 110 includes:
a second obtaining unit 111, configured to obtain a first video, where the first video includes multiple images;
a second processing unit 112, configured to input the multiple images into an object recognition model, so as to obtain recognition information output by the object recognition model, where the recognition information includes a recognition position, a recognition identifier, and a corresponding image of a target object in the first video; the object recognition model is a model obtained by training according to the model training method in the embodiment;
a determining unit 113, configured to determine, according to the identification information, a people flow rate corresponding to the first video.
The people flow rate statistical device provided by the embodiment of the application can be used for executing the technical scheme of the people flow rate statistical method embodiment, the implementation principle and the technical effect are similar, and the details are not repeated here.
Fig. 12 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 12: a processor (processor) 1210, a communication Interface (Communications Interface) 1220, a memory (memory) 1230, and a communication bus 1240, wherein the processor 1210, the communication Interface 1220, and the memory 1230 communicate with each other via the communication bus 1240. Processor 1210 may invoke logic instructions in memory 1230 to perform a method of model training comprising: acquiring a training sample, wherein the training sample comprises a plurality of sample images and annotation information of the sample images, the annotation information comprises annotation positions and annotation identifications of target objects, and the plurality of sample images are a plurality of continuous images in a sample video; inputting the multiple sample images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding sample image of a target object; and adjusting the parameters of the object recognition model according to the labeling information and the recognition information. Alternatively, processor 1210 may invoke logic instructions in memory 1230 to perform a people traffic statistical method comprising: acquiring a first video, wherein the first video comprises a plurality of images; inputting the multiple images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding image of a target object in the first video; and determining the flow of people corresponding to the first video according to the identification information.
In addition, the logic instructions in the memory 1230 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present application further provides a computer program product, the computer program product including a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing the model training method provided by the above embodiments, the method including: acquiring a training sample, wherein the training sample comprises a plurality of sample images and labeling information of the sample images, the labeling information comprises labeling positions and labeling marks of target objects, and the plurality of sample images are a plurality of continuous images in a sample video; inputting the multiple sample images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding sample image of a target object; and adjusting parameters of the object recognition model according to the labeling information and the recognition information. Alternatively, when the computer program is executed by a processor, a computer can execute the people flow rate statistical method provided by the above embodiments, and the method includes: acquiring a first video, wherein the first video comprises a plurality of images; inputting the multiple images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding image of a target object in the first video; according to the identification information, determining the flow of people corresponding to the first video
In yet another aspect, the present application further provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform the model training method provided in the above embodiments, the method including: acquiring a training sample, wherein the training sample comprises a plurality of sample images and labeling information of the sample images, the labeling information comprises labeling positions and labeling marks of target objects, and the plurality of sample images are a plurality of continuous images in a sample video; inputting the multiple sample images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding sample image of a target object; and adjusting the parameters of the object recognition model according to the labeling information and the recognition information. Alternatively, the computer program is implemented to perform the people flow rate statistical method provided by the above embodiments when executed by a processor, and the method includes: acquiring a first video, wherein the first video comprises a plurality of images; inputting the plurality of images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding image of a target object in the first video; and determining the flow of people corresponding to the first video according to the identification information.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of model training, comprising:
acquiring a training sample, wherein the training sample comprises a plurality of sample images and labeling information of the sample images, the labeling information comprises labeling positions and labeling marks of target objects, and the plurality of sample images are a plurality of continuous images in a sample video;
inputting the multiple sample images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding sample image of a target object;
and adjusting the parameters of the object recognition model according to the labeling information and the recognition information.
2. The method of claim 1, wherein the object recognition model comprises a backbone network module, a target detection module, and a global tracking module; the inputting the plurality of sample images into an object recognition model to obtain recognition information output by the object recognition model includes:
performing feature extraction processing on the plurality of sample images based on the backbone network module to obtain feature maps of the plurality of sample images;
performing target detection processing on the feature map according to the target detection module to obtain detection data corresponding to the multiple sample images;
and processing the detection data and the feature map based on the global tracking module to obtain the identification information.
3. The method of claim 2, wherein the object detection module comprises a first detection head, a second detection head, and a third detection head; the performing, according to the target detection module, target detection processing on the feature map to obtain detection data corresponding to the plurality of sample images includes:
performing central point detection processing on the feature map according to the first detection head to obtain the central point position of a target object in each sample image;
performing central point offset detection processing on the feature map according to the second detection head to obtain central point offsets of target objects in the sample images;
carrying out anchor point detection processing on the feature map according to the third detection head to obtain the size of an anchor point boundary frame corresponding to the central point position of the target object in each sample image;
the detection data includes the center point location, the center point offset, and a size of the anchor bounding box.
4. The method of claim 2, wherein the processing the detection data and the feature map based on the global tracking module to obtain the identification information comprises:
according to the detection data and the feature map, obtaining the characteristic feature of each target object in the multiple sample images;
and carrying out global tracking processing on the characterization features according to the global tracking module to obtain the identification information.
5. The method of claim 4, wherein the characterization feature comprises a characterization feature map corresponding to each of the target objects; the performing global tracking processing on the characterization feature according to the global tracking module to obtain the identification information includes:
acquiring tracking track information corresponding to each target object according to a characterization feature map corresponding to each target object, wherein the tracking track information comprises an index of a target sample image, time of the target sample image and a central point position of the target object on the target sample image, and the target sample image is a sample image including the target object in the multiple sample images;
and obtaining the identification information according to the tracking track information corresponding to each target object.
6. The method according to any one of claims 3 to 5, wherein the adjusting the parameters of the object recognition model according to the labeling information and the recognition information comprises:
acquiring a detection loss value corresponding to the target detection module according to the detection data and the labeling information;
acquiring a global tracking loss value corresponding to the global tracking module according to the identification information and the labeling information;
and adjusting parameters of the object recognition model according to the detection loss value and the global tracking loss value.
7. A people flow statistical method is characterized by comprising the following steps:
acquiring a first video, wherein the first video comprises a plurality of images;
inputting the multiple images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding image of a target object in the first video; the object recognition model is a model obtained by training according to the model training method of any one of claims 1 to 6;
and determining the flow of people corresponding to the first video according to the identification information.
8. A model training apparatus, comprising:
the system comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring a training sample, the training sample comprises a plurality of sample images and annotation information of the sample images, the annotation information comprises annotation positions and annotation marks of target objects, and the sample images are continuous images in a sample video;
the first processing unit is used for inputting the plurality of sample images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding sample image of a target object;
and the training unit is used for adjusting the parameters of the object recognition model according to the labeling information and the recognition information.
9. A people flow statistic apparatus, comprising:
the second acquisition unit is used for acquiring a first video, and the first video comprises a plurality of images;
the second processing unit is used for inputting the multiple images into an object recognition model to obtain recognition information output by the object recognition model, wherein the recognition information comprises a recognition position, a recognition identifier and a corresponding image of a target object in the first video; the object recognition model is a model obtained by training according to the model training method of any one of claims 1 to 6;
and the determining unit is used for determining the flow of people corresponding to the first video according to the identification information.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the model training method according to any one of claims 1-6 when executing the program, or the processor implements the traffic statistics method according to claim 7 when executing the program.
CN202211575979.6A 2022-12-09 2022-12-09 Model training and people flow statistical method, device and equipment Active CN115661586B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211575979.6A CN115661586B (en) 2022-12-09 2022-12-09 Model training and people flow statistical method, device and equipment

Publications (2)

Publication Number Publication Date
CN115661586A true CN115661586A (en) 2023-01-31
CN115661586B CN115661586B (en) 2023-04-18

Family

ID=85017729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211575979.6A Active CN115661586B (en) 2022-12-09 2022-12-09 Model training and people flow statistical method, device and equipment

Country Status (1)

Country Link
CN (1) CN115661586B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414432A (en) * 2019-07-29 2019-11-05 腾讯科技(深圳)有限公司 Training method, object identifying method and the corresponding device of Object identifying model
CN112016531A (en) * 2020-10-22 2020-12-01 成都睿沿科技有限公司 Model training method, object recognition method, device, equipment and storage medium
WO2022205937A1 (en) * 2021-04-01 2022-10-06 深圳市优必选科技股份有限公司 Feature information extraction method and apparatus, model training method and apparatus, and electronic device
CN115423735A (en) * 2021-05-12 2022-12-02 中移雄安信息通信科技有限公司 Passenger flow volume statistical method and system
CN113469118A (en) * 2021-07-20 2021-10-01 京东科技控股股份有限公司 Multi-target pedestrian tracking method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543603A (en) * 2023-07-07 2023-08-04 四川大学 Flight path completion prediction method and device considering airspace situation and local optimization
CN116543603B (en) * 2023-07-07 2023-09-29 四川大学 Flight path completion prediction method and device considering airspace situation and local optimization

Also Published As

Publication number Publication date
CN115661586B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US20210065384A1 (en) Target tracking method, device, system and non-transitory computer readable storage medium
CN104700099B (en) The method and apparatus for recognizing traffic sign
CN108805016B (en) Head and shoulder area detection method and device
CN108446634B (en) Aircraft continuous tracking method based on combination of video analysis and positioning information
US11145076B1 (en) Incorporation of semantic information in simultaneous localization and mapping
CN104680559B (en) The indoor pedestrian tracting method of various visual angles based on motor behavior pattern
CN106971401B (en) Multi-target tracking device and method
CN109029363A (en) A kind of target ranging method based on deep learning
CN111914642B (en) Pedestrian re-identification method, device, equipment and medium
CN111461213B (en) Training method of target detection model and target rapid detection method
CN112488057A (en) Single-camera multi-target tracking method utilizing human head point positioning and joint point information
CN112200884B (en) Lane line generation method and device
CN111914653B (en) Personnel marking method and device
CN110287907A (en) A kind of method for checking object and device
CN115661586B (en) Model training and people flow statistical method, device and equipment
CN112801051A (en) Method for re-identifying blocked pedestrians based on multitask learning
CN104778699A (en) Adaptive object feature tracking method
CN111950507B (en) Data processing and model training method, device, equipment and medium
CN113724293A (en) Vision-based intelligent internet public transport scene target tracking method and system
CN114708645A (en) Object identification device and object identification method
CN113792660B (en) Pedestrian detection method, system, medium and equipment based on improved YOLOv3 network
CN111695404A (en) Pedestrian falling detection method and device, electronic equipment and storage medium
CN107563327B (en) Pedestrian re-identification method and system based on self-walking feedback
CN114022509B (en) Target tracking method based on monitoring video of multiple animals and related equipment
CN113192106B (en) Livestock tracking method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant