CN113269065B - Method for counting people flow in front of screen based on target detection algorithm

Info

Publication number
CN113269065B
Authority
CN
China
Prior art keywords: target, frame, frames, historical, matching
Prior art date
Legal status
Active
Application number
CN202110530344.3A
Other languages
Chinese (zh)
Other versions
CN113269065A (en)
Inventor
雷李义
Current Assignee
Shenzhen Image Data Technology Co ltd
Original Assignee
Shenzhen Image Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Image Data Technology Co ltd
Priority to CN202110530344.3A
Publication of CN113269065A
Application granted
Publication of CN113269065B
Legal status: Active

Classifications

    • G06V40/161 Human faces: detection; localisation; normalisation
    • G06F18/22 Pattern recognition: matching criteria, e.g. proximity measures
    • G06N3/04 Neural networks: architecture, e.g. interconnection topology
    • G06N3/08 Neural networks: learning methods
    • G06V10/25 Image preprocessing: determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06V40/172 Human faces: classification, e.g. identification
    • G06V2201/07 Indexing scheme: target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method for counting people flow in front of a screen based on a target detection algorithm. The method comprises: a historical information base in which a plurality of historical head frames are stored; receiving a video, the video being divided into a plurality of single-frame images in time order; performing feature extraction on each single-frame image with a target detection neural network model to obtain a plurality of target frames containing category information and position information; filtering the target frames according to the category information and the position information to obtain a plurality of target face frames and a plurality of target head frames; and comparing each target head frame one by one with the historical head frames and with the target face frames, thereby counting the people-flow number and the screen-attention number.

Description

Method for counting people flow in front of screen based on target detection algorithm
Technical Field
The application relates to the technical field of face recognition, and in particular to a method for counting people flow in front of a screen based on a target detection algorithm.
Background
With the development of artificial intelligence technology, people-flow statistics based on video streams has advanced rapidly and is applied in many public scenes such as scenic spots, residential communities and shopping malls. However, existing applications usually focus on tracking and counting pedestrians; they lack statistics on how many people actually pay attention to a screen and therefore cannot measure how attractive the displayed content is to users, even though such data is very important for a commercial screen.
Summary of the application
(I) Technical problem to be solved
The application provides a method for counting people flow in front of a screen based on a target detection algorithm, which solves the technical problem that the prior art can only count the people flow in front of a screen and cannot count the screen-attention number, i.e., the number of people who pay attention to the screen content.
(II) Technical solution
In order to achieve the above purpose, the present application provides the following technical solutions:
a method for counting the flow of people in front of a screen based on a target detection algorithm comprises the following steps:
step S1: establishing a historical information base in which a plurality of historical head frames are stored;
step S2: receiving a video, wherein the video is divided into a plurality of single-frame images according to the time sequence;
step S3: performing feature extraction on the single-frame image through a target detection neural network model to obtain a plurality of target frames containing category information and position information;
step S4: filtering the target frames according to the category information and the position information to obtain a plurality of target face frames and a plurality of target head frames;
step S5: comparing the target head frame with the plurality of historical head frames one by one, outputting a first matching value for each comparison, and judging whether the first matching value is greater than a first threshold: if so, the target head frame and the historical head frame are judged to belong to the same pedestrian and the people-flow number is not updated; otherwise they are judged to belong to different pedestrians and the people-flow number is incremented by 1; this repeats until all target head frames have been compared;
step S6: comparing the target head frame with the plurality of target face frames one by one, outputting a second matching value after each comparison, and judging whether the second matching value is greater than a second threshold: if so, the target head frame is judged to have paid attention to the screen content and the screen-attention number is incremented by 1; otherwise the target head frame is judged not to have paid attention to the screen content and the screen-attention number is not updated; this repeats until all target head frames have been compared;
step S7: updating the historical information base: when the first matching value is greater than the first threshold, replacing the corresponding historical head frame in the historical information base with the target head frame; when the first matching value is smaller than the first threshold, adding the target head frame to the historical information base and marking it as a historical head frame.
Preferably, the target detection neural network model is established based on an SSD target detection algorithm and is trained on real head images and angle-limited face images, the limited angle range being a horizontal rotation of -45° to +45°.
Preferably, the category information includes head images, face images and background images, and the position information is the relative coordinates [x0, y0, x1, y1] of the target frame within the single-frame image.
Preferably, step S4 comprises:
step S41: filtering distant target frames: calculating the width-height product of each target frame from its relative coordinates, and filtering out target frames whose width-height product is smaller than 0.03;
step S42: filtering static target frames: obtaining the center-coordinate offsets dx, dy and the width-height offsets dw, dh between corresponding target frames from their relative coordinates; a target frame whose dx, dy, dw and dh are all smaller than 0.02 is judged to be static in the single-frame image; the number of such single-frame images is counted, and when the accumulated static time of a target frame exceeds 1 minute, the object corresponding to that target frame is judged to be a static target and is filtered out;
step S43: obtaining the target head frames and target face frames, according to the category information, from the target frames remaining after the filtering of step S41 and step S42.
Preferably, the video is a real-time video or a historical video, and the single-frame image is a video frame captured every 0.2 seconds from the video.
Preferably, the first matching value is the intersection-over-union (IoU) of the target head frame and the historical head frame, its value range is 0 to 1, and the first threshold is 0; when the first matching value is greater than the first threshold 0, the pedestrians corresponding to the target head frame and the historical head frame are judged to be the same pedestrian; otherwise they are different pedestrians, and the people-flow number is incremented by 1.
Preferably, step S5 further comprises a verification of the people-flow number: counting lines are arranged on both sides of the edge of the picture to detect pedestrians entering or leaving the picture and to record the entering count and the leaving count; for a target head frame and a historical head frame judged to be the same pedestrian, the direction of motion of the corresponding pedestrian is judged from the positions of the two frames relative to the two counting lines, and the entering and leaving data are recorded only once for the same tracked pedestrian;
the people-flow number, entering count and leaving count over a period of time are combined to obtain the verified people-flow number: verified people-flow number = [(entering count + leaving count)/2 + people-flow number]/2.
Preferably, the second matching value includes a similarity value and a matching count: the similarity value is the intersection-over-union of the target head frame and the target face frame, with a value range of 0 to 1, and the matching count is the number of times the target head frame and the target face frame have matched successfully; the second threshold includes a similarity threshold of 0.3 and a matching threshold of 15, i.e., when the similarity value is greater than the similarity threshold, the target head frame and the target face frame are judged to belong to the same pedestrian, and when the matching count is greater than the matching threshold, the screen-attention number is incremented by 1.
Preferably, updating the historical information base further comprises counting the number of times a historical head frame fails to match; when this number exceeds 5, the historical head frame is deleted.
(III) Advantageous effects
Compared with the prior art, the beneficial effects of the present application are:
the application provides a method for counting the flow of people before a screen based on a target detection algorithm, wherein a target detection neural network model of image data with heads and image data with limited angles is established to output a target head frame of the pedestrian passing in a period of time before the screen, and the target head frame is tracked and synchronized with historical head frame information to complete the flow of people before the screen in a period of time; and counting the number of pedestrians who pass through the screen and pay attention to the screen content within a period of time, namely the screen attention number, by adopting the mutual matching of the target human face frame and the target human face frame with the angle limit.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application and not to limit the application, in which:
FIG. 1 shows an overall flow diagram of an embodiment of the present application;
FIG. 2 illustrates a target detection neural network model logic diagram of an embodiment of the present application;
FIG. 3 shows a flowchart of step S4 of an embodiment of the present application;
FIG. 4 shows an overall logic diagram of an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1 and fig. 4, the embodiment of the present application discloses a method for counting people flow in front of a screen based on a target detection algorithm. It is mainly intended for advertisement screens in shopping malls and scenic spots with heavy people flow, and counts the people flow and the screen-attention number in front of the screen within a period of time. The method comprises the following steps:
step S1: establishing a historical information base, wherein a plurality of historical head frames are stored in the historical information base;
step S2: receiving a video, wherein the video is divided into a plurality of single-frame images in time order; specifically, the video is a real-time video or a historical video, and each single-frame image is a video frame captured every 0.2 seconds. In this embodiment, the video can be acquired with a camera installed on the screen.
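As an illustrative sketch only (the patent does not name a video-decoding library), sampling one single-frame image every 0.2 seconds could look as follows in Python with OpenCV; the function name and the 25 fps fallback are assumptions:

    import cv2  # assumption: OpenCV for decoding; the patent does not specify a library

    def sample_frames(source, interval_s=0.2):
        """Yield one frame every interval_s seconds from a video file or camera stream."""
        cap = cv2.VideoCapture(source)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back to 25 fps if metadata is missing
        step = max(1, round(fps * interval_s))    # decoded frames per sampled frame
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:
                yield frame
            index += 1
        cap.release()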
step S3: performing feature extraction on the single-frame image through the target detection neural network model to obtain a plurality of target frames containing category information and position information;
specifically, the target detection neural network model is established based on an SSD target detection algorithm and is obtained by inputting a real human head image and a human face image with a limited angle for training, wherein the limited angle range is a horizontal rotation angle of-45 degrees to +45 degrees, so that the human face image which looks at the screen in front of the screen is screened subsequently, and the number of people paying attention to the screen is counted; referring to fig. 2, the image firstly extracts the bottom visual features for the small-scale target through 15 layers of directly connected convolutional neural networks, then extracts the middle visual features for the medium-scale target through 6 layers of directly connected convolutional neural networks, and then further extracts the high-level visual features for the large-scale target through 6 layers of directly connected convolutional neural networks. Performing regression on the visual features of the three layers through two independent two-layer convolutional neural networks respectively to obtain category information and position information of the target frame; because the target detection neural network model can output a large number of overlapped target frames, and because the non-maximum suppression algorithm can screen the target frames with high overlapping degree, only the target frames with high confidence coefficient are reserved, and the overlapped target frames with low confidence coefficient are removed, the target frames with category information and position information output by the target neural network model are filtered by the non-maximum suppression algorithm, and finally the target frames with category information and position information are output.
The category information comprises head images, face images and background images, and the position information is the relative coordinates [x0, y0, x1, y1] of the target frame within the single-frame image. The face images are faces within the limited angle of -45° to +45°: a face at 0° corresponds to a pedestrian standing in front of the screen and looking straight at it; a face between 0° and +45° corresponds to a pedestrian turning left toward the screen while walking; and a face between 0° and -45° corresponds to a pedestrian turning right toward the screen while walking.
step S4: filtering the target frames according to the category information and the position information to obtain a plurality of target face frames and a plurality of target head frames;
referring to fig. 3, step S4 includes:
step S41: filtering distant target frames: calculating the width-height product of each target frame from its relative coordinates, and filtering out target frames whose width-height product is smaller than 0.03 (see the code sketch after step S43). Specifically, the people flow in front of the screen mainly concerns pedestrians close enough to see the screen content, so pedestrians far from the screen are filtered out first, and only pedestrians within a certain distance of the screen are counted. The average sizes of head frames and face frames at different distances are obtained by field measurement, and the target frames produced by the target detection neural network model are traversed during counting to remove undersized frames. In this embodiment, testing the screen's camera showed that the width-height product of a target frame about 3 meters from the screen is roughly 0.03, so target frames farther than that are not counted; accordingly, target frames with a width-height product smaller than 0.03 are filtered out.
Step S42: filtering static target frames: obtaining central coordinate offsets dx and dy and width and height offsets dw and dh between different target frames according to the relative coordinates of the target frames, judging that the target frames with the dx, dy, dw and dh smaller than 0.02 are static in the single-frame images, counting the frame number of the different single-frame images, and judging that the object corresponding to the target frame is a static target and filtering when the static accumulated time of the target frame exceeds 1 minute; specifically, because the environment in a market is complex, a billboard with a portrait may exist in the background of the picture, and therefore filtering of the static target frame is added. Comparing the positions and sizes of a target frame in the current single-frame image and a historical human head frame in a historical information base, when the difference between the positions and the sizes is smaller than a certain threshold value, the target of the current frame is considered to be in a static state, namely the central coordinate offsets dx and dy and the width and height offsets dw and dh between different target frames, and setting the target frames with dx, dy, dw and dh smaller than 0.02 to judge that the target frames are static in the single-frame image; and counting the number of frames of the single-frame image with the target frame, considering the target as a static background target when the static time or the number of times of the target frame exceeds a certain threshold, namely the static accumulated time of the target frame exceeds 1 minute or the number of times of the target frame at the position exceeds 300 times, determining that the object corresponding to the target frame at the position is static, and filtering the target frame at the position detected later.
Step S43: and obtaining a target human head frame and a target human face frame for the residual target frames filtered in the step S41 and the step S42 according to the category information.
Step S5: comparing the target person head frame with a plurality of historical person head frames one by one, outputting a first matching value each time of comparison, and judging that the first matching value is greater than a first threshold value? If the first matching value is larger than the first threshold value, the target pedestrian head frame and the historical pedestrian head frame are judged to be the same pedestrian, the pedestrian flow number is not updated, otherwise, the target pedestrian head frame and the historical pedestrian head frame are judged to be different pedestrians, and the pedestrian flow number is added by 1; until the comparison of all target person head frames is completed; specifically, the first matching value is an intersection ratio of the target human head frame and the historical human head frame, the value range of the first matching value is 0 to 1, the first threshold value is 0, when the first matching value is larger than the first threshold value 0, it is determined that the pedestrians corresponding to the target human head frame and the historical human head frame are the same pedestrian, otherwise, the pedestrians are different pedestrians, and the pedestrian flow number is increased by 1. And traversing and comparing the target head frame of the current frame with the historical head frames in the historical record library, and judging whether the two target frames are the same pedestrian or not by calculating the intersection and parallel ratio between the two frames.
If the target head frame of the current frame fails to match any historical head frame in the historical information base, it is judged to belong to a different pedestrian, and the people-flow number is incremented by 1;
If the target head frame of the current frame matches a historical head frame in the historical information base successfully, they are judged to be the same pedestrian; the people-flow number is then further verified, using the position changes over time in the picture of the target frames judged to be the same pedestrian, to obtain the verified people-flow number: counting lines are arranged on both sides of the edge of the picture to detect pedestrians entering or leaving the picture and to record the entering count and the leaving count; for a target head frame and a historical head frame judged to be the same pedestrian, the direction of motion of the corresponding pedestrian is judged from the positions of the two frames relative to the two counting lines, and the entering and leaving data are recorded only once for the same tracked pedestrian;
The people-flow number, entering count and leaving count over a period of time are combined to obtain the verified people-flow number: verified people-flow number = [(entering count + leaving count)/2 + people-flow number]/2. The final verified people-flow number is the number of people passing in front of the screen within the period; the period can be adjusted to the actual use scene and is typically set to 1 day.
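The verification formula is direct arithmetic; a minimal sketch with an illustrative worked example:

    def verified_flow(flow_count, entered, left):
        """Verified people-flow number = [(entered + left)/2 + flow_count] / 2."""
        return ((entered + left) / 2 + flow_count) / 2

    # e.g. flow_count=100, entered=52, left=48 -> ((52 + 48)/2 + 100)/2 = 75.0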
Because the face angle is limited in the target detection neural network model, a pedestrian corresponding to a detected face frame can be considered to be facing the screen and able to see the content it displays. The number of pedestrians paying attention to the screen, i.e., the screen-attention number, is therefore counted by matching the target head frames with the target face frames, as in step S6:
step S6: comparing the target head frame with the plurality of target face frames one by one, outputting a second matching value after each comparison, and judging whether the second matching value is greater than the second threshold: if so, the target head frame is judged to have paid attention to the screen content and the screen-attention number is incremented by 1; otherwise the target head frame is judged not to have paid attention to the screen content and the screen-attention number is not updated; this repeats until all target head frames have been compared;
Specifically, the second matching value includes a similarity value and a matching count: the similarity value is the intersection-over-union of the target head frame and the target face frame, with a value range of 0 to 1, and the matching count is the number of times the target head frame and the target face frame have matched successfully; the second threshold includes a similarity threshold of 0.3 and a matching threshold of 15. The target head frames and target face frames of the current single-frame image are compared by traversal, computing the IoU between each pair, i.e., the similarity value; when the similarity value is greater than the similarity threshold 0.3, the target head frame and the target face frame are considered the same pedestrian, and the matching count of that pair is recorded; when the matching count exceeds the matching threshold of 15, the pedestrian is judged to be paying attention to the screen, and the screen-attention number is incremented by 1.
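A sketch of this attention bookkeeping under the stated thresholds of 0.3 and 15; the per-track plumbing (head_id, match_count, counted) is an assumption, since the patent specifies only the thresholds, and iou() is the overlap function sketched under step S5:

    SIM_THRESHOLD = 0.3    # head/face IoU above this: same pedestrian
    MATCH_THRESHOLD = 15   # matches above this: the pedestrian attended to the screen

    def update_attention(head_id, head_box, face_boxes, match_count, counted, attention_total):
        """Update one head frame's match count against the current frame's face frames."""
        if any(iou(head_box, f) > SIM_THRESHOLD for f in face_boxes):
            match_count[head_id] = match_count.get(head_id, 0) + 1
            if match_count[head_id] > MATCH_THRESHOLD and head_id not in counted:
                counted.add(head_id)       # count each pedestrian at most once
                attention_total += 1
        return attention_total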
Because the camera is mounted on the screen, it can only capture pictures at a horizontal angle, and in crowded settings single-frame images with overlapping pedestrians are hard to avoid. The historical head frames recorded in the historical information base therefore cannot simply be overwritten with the target-frame information of the current single-frame image; instead the historical information base must be updated with the strategy of step S7, so that when a pedestrian is briefly occluded, disappears from the picture and then reappears, the corresponding historical head frame in the historical information base is not lost.
Step S7: and updating the historical information base, replacing the corresponding historical person head frame in the historical information base with the target person head frame when the first matching value is larger than the first threshold value, and adding the target person head frame into the historical information base and marking as the historical person head frame when the first matching value is smaller than the first threshold value. Further, updating the historical information base also comprises counting the times of no matching success of the historical person head frame, and when the times of no matching of the historical person head frame exceeds 5 times, deleting the historical person head frame.
It should be noted that although embodiments of the present application have been shown and described, it would be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the application, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A method for counting people flow in front of a screen based on a target detection algorithm, characterized by comprising the following steps:
step S1: establishing a historical information base storing a plurality of historical head frames;
step S2: receiving a video, wherein the video is divided into a plurality of single-frame images according to the time sequence;
step S3: performing feature extraction on the single-frame image through a target detection neural network model to obtain a plurality of target frames containing category information and position information; the target detection neural network model is established based on an SSD target detection algorithm and is trained on real head images and angle-limited face images, the limited angle range being a horizontal rotation of -45° to +45°;
step S4: filtering the target frames according to the category information and the position information to obtain a plurality of target face frames and a plurality of target head frames;
step S5: comparing the target head frame with the plurality of historical head frames one by one and outputting a first matching value for each comparison, the first matching value being the intersection-over-union of the target head frame and the historical head frame, with a value range of 0 to 1 and a first threshold of 0; judging whether the first matching value is greater than the first threshold 0: if so, the target head frame and the historical head frame are judged to belong to the same pedestrian and the people-flow number is not updated; otherwise they are judged to belong to different pedestrians and the people-flow number is incremented by 1; until all target head frames have been compared;
step S6: comparing the target head frame with the plurality of target face frames one by one, outputting a second matching value after each comparison, and judging whether the second matching value is greater than a second threshold: if so, the target head frame is judged to have paid attention to the screen content and the screen-attention number is incremented by 1; otherwise the target head frame is judged not to have paid attention to the screen content and the screen-attention number is not updated; until all target head frames have been compared; specifically, the second matching value includes a similarity value and a matching count, the similarity value being the intersection-over-union of the target head frame and the target face frame with a value range of 0 to 1, and the matching count being the number of times the target head frame and the target face frame have matched successfully; the second threshold includes a similarity threshold of 0.3 and a matching threshold of 15, i.e., when the similarity value is greater than the similarity threshold, the target head frame and the target face frame are judged to belong to the same pedestrian, and when the matching count is greater than the matching threshold, the screen-attention number is incremented by 1;
step S7: updating the historical information base: when the first matching value is greater than the first threshold, replacing the corresponding historical head frame in the historical information base with the target head frame; when the first matching value is smaller than the first threshold, adding the target head frame to the historical information base and marking it as a historical head frame.
2. The method according to claim 1, wherein the category information comprises head images, face images and background images, and the position information is the relative coordinates [x0, y0, x1, y1] of the target frame within the single-frame image.
3. The method for counting people flow in front of a screen based on a target detection algorithm according to claim 2, wherein step S4 comprises:
step S41: filtering distant target frames: calculating the width-height product of each target frame from its relative coordinates, and filtering out target frames whose width-height product is smaller than 0.03;
step S42: filtering static target frames: obtaining the center-coordinate offsets dx, dy and the width-height offsets dw, dh between corresponding target frames from their relative coordinates; a target frame whose dx, dy, dw and dh are all smaller than 0.02 is judged to be static in the single-frame image; the number of such single-frame images is counted, and when the accumulated static time of a target frame exceeds 1 minute, the object corresponding to that target frame is judged to be a static target and is filtered out;
step S43: obtaining the target head frames and target face frames, according to the category information, from the target frames remaining after the filtering of step S41 and step S42.
4. The method according to claim 1, wherein the video is a real-time video or a historical video, and the single-frame image is a video frame captured every 0.2 seconds from the video.
5. The method for counting people flow in front of a screen based on a target detection algorithm according to claim 4, wherein step S5 further comprises a verification of the people-flow number: counting lines are arranged on both sides of the edge of the picture to detect pedestrians entering or leaving the picture and to record the entering count and the leaving count; for a target head frame and a historical head frame judged to be the same pedestrian, the direction of motion of the corresponding pedestrian is judged from the positions of the two frames relative to the two counting lines, and the entering and leaving data are recorded only once for the same tracked pedestrian;
the people-flow number, entering count and leaving count over a period of time are combined to obtain the verified people-flow number: verified people-flow number = [(entering count + leaving count)/2 + people-flow number]/2.
6. The method of claim 1, wherein updating the historical information base further comprises counting the number of times a historical head frame fails to match; when this number exceeds 5, the historical head frame is deleted.
CN202110530344.3A 2021-05-14 2021-05-14 Method for counting people flow in front of screen based on target detection algorithm Active CN113269065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110530344.3A CN113269065B (en) 2021-05-14 2021-05-14 Method for counting people flow in front of screen based on target detection algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110530344.3A CN113269065B (en) 2021-05-14 2021-05-14 Method for counting people flow in front of screen based on target detection algorithm

Publications (2)

Publication Number Publication Date
CN113269065A CN113269065A (en) 2021-08-17
CN113269065B (en) 2023-02-28

Family

ID=77230919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110530344.3A Active CN113269065B (en) 2021-05-14 2021-05-14 Method for counting people flow in front of screen based on target detection algorithm

Country Status (1)

Country Link
CN (1) CN113269065B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105844234A (en) * 2016-03-21 2016-08-10 商汤集团有限公司 People counting method and device based on head shoulder detection
CN108647612A (en) * 2018-04-28 2018-10-12 成都睿码科技有限责任公司 Billboard watches flow of the people analysis system
CN108805619A (en) * 2018-06-07 2018-11-13 肇庆高新区徒瓦科技有限公司 A kind of stream of people's statistical system for billboard
CN111353461A (en) * 2020-03-11 2020-06-30 京东数字科技控股有限公司 Method, device and system for detecting attention of advertising screen and storage medium
CN111832465A (en) * 2020-07-08 2020-10-27 星宏集群有限公司 Real-time head classification detection method based on MobileNet V3
CN112036345A (en) * 2020-09-04 2020-12-04 京东方科技集团股份有限公司 Method for detecting number of people in target place, recommendation method, detection system and medium
WO2020252924A1 (en) * 2019-06-19 2020-12-24 平安科技(深圳)有限公司 Method and apparatus for detecting pedestrian in video, and server and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10867391B2 (en) * 2018-09-28 2020-12-15 Adobe Inc. Tracking viewer engagement with non-interactive displays
CN110166830A (en) * 2019-05-27 2019-08-23 航美传媒集团有限公司 The monitoring system of advertisement machine electronic curtain

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Outdoor advertising effect evaluation system based on intelligent video analysis; Li Yao (李尧); China Master's Theses Full-text Database; 2019-02-15 (No. 02); pp. 1-64 *

Also Published As

Publication number Publication date
CN113269065A (en) 2021-08-17

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant