CN112258571B - Indoor pedestrian positioning method based on monocular vision - Google Patents
- Publication number
- CN112258571B (application CN202011023002.4A)
- Authority
- CN
- China
- Prior art keywords
- human
- frame
- camera
- pedestrian
- humanoid
- Legal status: Active (assumed; Google has not performed a legal analysis)
Classifications
- G06T7/70: image analysis; determining position or orientation of objects or cameras
- G01C21/206: instruments for performing navigational calculations specially adapted for indoor navigation
- G06T2207/10016: image acquisition modality; video; image sequence
- G06T2207/30196: subject of image; human being; person
- G06T2207/30244: subject of image; camera pose
Abstract
The invention discloses an indoor pedestrian positioning method based on monocular vision, comprising a pedestrian positioning structure composed of a high-definition monitoring camera, a humanoid detector, and a coordinate calculator. Calibration of the camera pose is completed automatically online: although the pose of a camera continues to drift gradually over time after installation under the action of gravity, the method updates its measurement of the camera pose automatically without manual intervention, saving a great deal of on-site calibration labor and time. The invention does not require the person being positioned to carry a positioning tag or other electronic device; positioning is completed without the person's awareness, uses only the coordinates of the humanoid detection frames, and involves no private data. Based on an ordinary monocular monitoring camera, the invention achieves indoor positioning accuracy of about 50 cm, giving it clear advantages in both implementation cost and positioning accuracy.
Description
Technical Field
The invention relates to the technical field of indoor positioning, in particular to an indoor pedestrian positioning method based on monocular vision.
Background
With the continuing digital and intelligent-marketing transformation of offline retail stores, effectively locating the instantaneous position of a customer (pedestrian) in an indoor commercial scene has become a key prerequisite for providing personalized, intelligent service and interaction. Existing indoor positioning methods mainly include the following:
WiFi positioning based on a mobile device (mainly a mobile phone): the distance between the pedestrian holding the device and each WiFi access point is estimated from the signal strength between the device and multiple access points; because the position of each access point has been accurately measured in advance, the pedestrian's position coordinates can be determined by triangulation, fingerprinting, or similar methods.
Ultra-wideband (UWB) positioning: WiFi positioning accuracy is strongly affected by indoor environmental occlusion and multipath; UWB reduces the influence of the environment on ranging accuracy by transmitting extremely narrow pulses, but the person to be positioned must carry a UWB tag device.
Binocular vision positioning: a conventional monocular camera cannot obtain the depth of a pedestrian from the camera, which makes direct indoor positioning difficult; binocular vision performs visual feature matching between images from two cameras with a known optical-center baseline, and calculates the target's position relative to the camera system from the disparity between the images and pre-calibrated camera parameters.
Among the prior art, WiFi positioning is widely deployed with typical accuracy of 5-10 meters, but it is sensitive to environmental occlusion, its accuracy fluctuates widely, and its error range is large (sometimes exceeding the height of an ordinary floor), so WiFi positioning can hardly determine the positional relationship and interaction behavior between indoor pedestrians (customers) and commercial facilities.
Although UWB positioning can achieve accuracy within 1 meter (decimeter-level under ideal, unobstructed conditions), it requires the positioned person to carry a UWB tag, so it is mainly used for personnel management in commercial and industrial settings and is hard to apply broadly to positioning pedestrians (mainly customers) in offline store scenes.
Binocular vision positioning is less affected by environmental occlusion and its accuracy is stable, but it depends on the disparity between the two cameras, and disparity shrinks as the target moves away from the camera system. If the person to be positioned is more than about 5 meters away, the optical-center baseline must be increased accordingly to preserve usable disparity, which makes the camera bulky or difficult to calibrate and thus unsuitable for installation in commercial venues.
Based on these problems in the prior art, the invention provides a monocular-vision method that positions pedestrians within 10 meters in real time in an indoor commercial store scene, with accuracy within 1 meter, meeting the intelligent marketing/interaction needs of offline stores and overcoming the defects of the prior art.
Disclosure of Invention
The invention aims to provide an indoor pedestrian positioning method based on monocular vision, which has the advantage of high positioning accuracy and solves the problems in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions: an indoor pedestrian positioning method based on monocular vision comprises the following steps:
S101: a high-definition monitoring camera captures video of the offline store scene, and decoded image frames are sent to the humanoid detector at 40 ms intervals, i.e., 25 frames per second; without loss of generality, if the original frame rate is 30 fps, 30 frames per second may be sent; when computing resources are insufficient, frame skipping is applied to the image frames, keeping no fewer than 1 frame per second;
S102: after receiving a real-time image frame, the humanoid detection module detects the humanoid-frame coordinates in it, and after collecting the humanoid-frame coordinate set over a period of history, sends the set to the coordinate calculation module for estimating the pose of the camera;
S103: the received historical humanoid-frame coordinate set can be written as O = {O_1, O_2, …, O_n}, i.e., the set consists of n humanoid frames, each O_i = (t_i, p_i), where t_i is the timestamp of the humanoid frame and p_i its image coordinates; the camera pose estimation module automatically estimates the camera's vertical downtilt angle θ from the humanoid-frame coordinate data of set O, without any manual calibration assistance; the optical imaging formulas of the head vertex P1 and the sole point P2 are combined as follows,
where f is the camera focal length measured in advance, the world coordinate of the pedestrian head vertex P1 is (X_1, Y_1, Z_1), and the world coordinate of the contact point P2 between the pedestrian's foot and the floor is (X_1, Y_2, Z_1); the upper ordinate of humanoid frame O_i is y_1 and the lower ordinate is y_2; subtracting the two gives
where Y_2 − Y_1 is the height of the pedestrian;
S201: from the historical humanoid-frame set, select the humanoid frames in which both head and feet are visible, deleting frames in which the head or feet are occluded or invisible; this step is completed with the human skeleton-point detection algorithm OpenPose;
S202: traverse the filtered head-and-feet-visible humanoid-frame set, taking out one humanoid frame O_i at a time; its upper ordinate is y_1 and its lower ordinate is y_2; use this as one observation to estimate the camera downtilt θ:
take 10 degrees as the initial estimate θ^(0) of the camera downtilt θ; omitting the quadratic term of Z_1 in formula (3) gives
since θ^(0) = 10 degrees, the initial estimate Z_1^(0) of Z_1 can be calculated from formula (4), where f is the camera focal length measured in advance, the world coordinate of the pedestrian head vertex P1 is (X_1, Y_1, Z_1), and that of the contact point P2 between the pedestrian's foot and the floor is (X_1, Y_2, Z_1);
substitute Z_1^(0) into formula (5) below to calculate the first iteration value Z_1^(1), where the hyper-parameter α is a fraction between 0 and 1, taken as 0.5;
substitute the first iteration value Z_1^(1) into formula (6) to back-calculate the first iteration value θ^(1) of the downtilt θ;
substitute the first estimate θ^(1) for θ^(0) in formula (5) and Z_1^(1) for Z_1^(0) to obtain the second estimate Z_1^(2) of Z_1; substituting Z_1^(2) into formula (6) yields the second iteration value θ^(2) of the downtilt θ;
Z_1 is obtained from the two-point difference formula (5), while formula (6) is a single-point first-degree equation; thus, as Z_1 and θ are updated alternately, the iterates of successive rounds converge as they approach the true values; the typical downtilt of an offline store camera lies between 15 and 40 degrees, and convergence takes about 3 to 6 iterations; the iteration count is set to 5, i.e., θ^(5) is taken as the posterior estimate θ̂_i of the downtilt θ from the observation of humanoid frame O_i;
S203: after the humanoid-frame set is traversed, each humanoid frame O_i yields a corresponding posterior estimate θ̂_i of the downtilt θ; build an angle histogram from 10 to 45 degrees with one bin per 0.5 degree, initialize every bin count to 0, drop the posterior estimate θ̂_i of each humanoid frame in the set into its histogram bin, and take the bin with the largest count as the final estimate θ̂ of the downtilt;
S104: the humanoid detection model sends the humanoid-frame coordinate information on the real-time image frame to the coordinate calculator, where the image coordinates of the pedestrian head vertex are (x_1, y_1) and those of the sole point are (x_1, y_2);
S105: using the final estimate θ̂ of the downtilt θ: since the camera focal length f and the camera mounting height Y_2 are measured in advance, and the pedestrian height is taken as the statistical mean of 165 cm, Y_1 = Y_2 − 165 is known; then substitute the pedestrian head-vertex coordinates (x_1, y_1) and θ̂ into the following formula,
to calculate the physical-world coordinates (X_1, Z_1) of the pedestrian's standing position relative to the camera, thereby determining the mutual positional relationship between the pedestrian and the camera.
The indoor pedestrian positioning method based on monocular vision further comprises a pedestrian positioning structure, wherein the pedestrian positioning structure consists of a high-definition monitoring camera, a humanoid detector and a coordinate calculator.
Preferably, the high-definition monitoring camera is responsible for collecting real-time video in an off-line store scene, and the real-time video is decoded into a real-time image frame sequence and then transmitted to the human-shaped detector.
Preferably, the human-shaped detector comprises a human-shaped detection module which is responsible for extracting human-shaped frames in the image frames and maintaining a historical human-shaped frame set which comprises coordinate information of human-shaped frames which appear in a past period of time.
Preferably, the coordinate calculator comprises a camera pose estimation module and a coordinate positioning calculation module; the former is responsible for calculating the camera pose from the historical humanoid frames, and the latter is responsible for converting the camera image coordinates of real-time humanoid frames into physical-world coordinates, namely, completing the positioning of indoor pedestrians.
Compared with the prior art, the invention has the following beneficial effects:
1. Camera pose calibration is completed automatically online. Although the pose of a camera continues to drift gradually over time after installation under the action of gravity, the method updates its measurement of the camera pose automatically, without manual intervention, saving a great deal of on-site calibration labor and time.
2. The invention does not require the person being positioned to carry a positioning tag or any other electronic device; positioning is completed without the person's awareness, uses only the coordinates of the humanoid detection frames, and involves no private data.
3. Based on an ordinary monocular monitoring camera, the invention achieves indoor positioning accuracy of about 50 cm, giving it clear advantages in both implementation cost and positioning accuracy.
Drawings
FIG. 1 is a schematic diagram of a system architecture of the present invention;
FIG. 2 is a flow chart of an embodiment of the present invention;
FIG. 3 is a schematic diagram of a camera pose estimation principle according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the implementation of the automatic pose estimation algorithm according to the present invention.
In the figure: 1. a pedestrian positioning structure; 2. a high definition monitoring camera; 3. a humanoid detector; 4. and a coordinate calculator.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-4, the present invention provides a technical solution: as shown in fig. 1, the implementation method involves a high definition monitoring camera 2, a human shape detector 3, and a coordinate calculator 4.
The high-definition monitoring camera 2 is responsible for collecting real-time video in an off-line store scene, and the real-time video is decoded into a real-time image frame sequence and then transmitted to the human-shaped detector 3.
The human detector 3 comprises a human detection module responsible for extracting human frames in the image frames and maintaining a set of historical human frames containing coordinate information of human frames occurring over a period of time.
The coordinate calculator 4 comprises a camera pose estimation module and a coordinate positioning calculation module; the former is responsible for calculating the camera pose from the historical humanoid frames, and the latter is responsible for converting the camera image coordinates of real-time humanoid frames into physical-world coordinates, namely, completing the positioning of indoor pedestrians.
As shown in fig. 2, the indoor pedestrian positioning method according to the embodiment of the invention includes the following steps:
S101: the high-definition monitoring camera 2 captures video of the offline store scene, and decoded image frames are sent to the humanoid detector 3; in this embodiment the image frame interval is 40 ms, that is, 25 frames per second are sent to the humanoid detector 3; when computing resources are insufficient, frame skipping can be applied to the image frames, generally keeping no fewer than 1 frame per second.
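The frame pacing in S101 reduces to choosing a sampling stride over the decoded stream; a minimal sketch (the function names and stride arithmetic are illustrative, not from the patent):

```python
def frame_stride(source_fps: float, target_fps: float) -> int:
    """Number of decoded frames per one frame forwarded to the detector.

    target_fps is clamped to [1, source_fps]: the text keeps at least
    1 frame per second, and never more than the source provides.
    """
    target = max(1.0, min(target_fps, source_fps))
    return max(1, round(source_fps / target))


def frames_for_detector(frame_indices, source_fps, target_fps):
    """Return the subset of decoded frame indices forwarded to the detector."""
    stride = frame_stride(source_fps, target_fps)
    return [i for i in frame_indices if i % stride == 0]
```

At the embodiment's 25 fps with no resource pressure the stride is 1 (every 40 ms frame is forwarded); under load, raising the stride trades detector throughput for latency while respecting the 1 fps floor.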
S102: after receiving a real-time image frame, the humanoid detection module detects the humanoid-frame coordinates in it, gathers the humanoid-frame coordinates over a period of history (this embodiment uses the humanoid-frame coordinates collected between 10 and 12 a.m. of the previous day), and sends them to the coordinate calculation module for estimating the camera pose.
S103: the received historical humanoid-frame coordinate set can be written as O = {O_1, O_2, …, O_n}, i.e., the set consists of n humanoid frames, each O_i = (t_i, p_i), where t_i is the timestamp of the humanoid frame and p_i its image coordinates; the camera pose estimation module automatically estimates the camera's vertical downtilt angle θ from the humanoid-frame coordinate data of set O, without any manual calibration assistance.
As a key part of the present invention, the following describes the algorithm process of the attitude estimation module in detail with reference to fig. 3:
As shown in fig. 3, taking the camera optical center as the common origin of the camera image coordinate system and the physical world coordinate system, when a pedestrian (customer) in the offline store appears in the field of view of the high-definition monitoring camera 2, the world coordinate of the pedestrian head vertex P1 is (X_1, Y_1, Z_1) and the world coordinate of the contact point P2 between the pedestrian's foot and the floor is (X_1, Y_2, Z_1); according to the basic law of optical imaging, the following formulas can be given:
where (x_1, y_2) are the image coordinates of the contact point P2 between the pedestrian's foot and the floor, f is the camera focal length that has been measured in advance, and Y_2, the camera mounting height, has also been measured in advance; the unknowns in the formulas are Z_1, X_1, and the downtilt θ. Because there are two equations but three unknowns, vision-based positioning methods conventionally reduce one unknown through some other measurement approach; binocular vision, for example, computes Z_1 from binocular disparity.
The approach adopted in this embodiment is instead to estimate the camera downtilt θ automatically from the data of the historical humanoid-frame set.
As a key part of the present invention, the following describes the flow of automatic estimation in detail:
first, the optical imaging formulas of the head apex P1 and the sole point P2 are combined as follows,
subtracting to obtain
where Y_2 − Y_1 is the height of the pedestrian; in this embodiment the statistical mean height is assumed to be 165 cm, that is, Y_1 = Y_2 − 165, so the only unknowns remaining in the formula are Z_1 and θ.
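The imaging formulas referenced here appear only as images in the source text. Under a standard pinhole model, with the image ordinate y measured from the principal point (positive downward) and heights Y measured downward from the optical center, a plausible reconstruction consistent with the surrounding definitions is:

```latex
% Head vertex P1=(X_1,Y_1,Z_1), sole point P2=(X_1,Y_2,Z_1),
% downtilt \theta, focal length f (a reconstruction, not the patent's images):
y_1 = f\,\frac{Y_1\cos\theta - Z_1\sin\theta}{Z_1\cos\theta + Y_1\sin\theta},
\qquad
y_2 = f\,\frac{Y_2\cos\theta - Z_1\sin\theta}{Z_1\cos\theta + Y_2\sin\theta},
% subtracting, the cross terms cancel:
y_2 - y_1 = \frac{f\,Z_1\,(Y_2 - Y_1)}
                 {(Z_1\cos\theta + Y_1\sin\theta)\,(Z_1\cos\theta + Y_2\sin\theta)}
% whose denominator, expanded in Z_1, contains the quadratic term
% Z_1^2\cos^2\theta that S202 drops when forming the initial estimate.
```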
As shown in fig. 4, the automatic estimation flow of the camera downtilt angle θ is as follows:
S201: select from the historical humanoid-frame set the humanoid frames in which both head and feet are visible, deleting those in which the head or feet are occluded; in this embodiment the step is completed with the human skeleton-point detection algorithm OpenPose, which is not an innovation of the invention and is not discussed further.
S202: traverse the filtered head-and-feet-visible humanoid-frame set, taking out one humanoid frame O_i at a time; its upper ordinate is y_1 and its lower ordinate is y_2; use this as one observation to estimate the camera downtilt θ:
take 10 degrees as the initial estimate θ^(0) of the camera downtilt θ; omitting the quadratic term of Z_1 in formula (3) gives
since θ^(0) = 10 degrees, the initial estimate Z_1^(0) of Z_1 can be calculated from formula (4);
substitute Z_1^(0) into formula (5) below to calculate the first iteration value Z_1^(1), where the hyper-parameter α is a fraction between 0 and 1, 0.5 in this example;
substitute the first iteration value Z_1^(1) into formula (6) to back-calculate the first iteration value θ^(1) of the downtilt θ;
substitute the first estimate θ^(1) for θ^(0) in formula (5) and Z_1^(1) for Z_1^(0) to obtain the second estimate Z_1^(2) of Z_1; substituting Z_1^(2) into formula (6) yields the second iteration value θ^(2) of the downtilt θ.
Z_1 is obtained from the two-point difference formula (5), while formula (6) is a single-point first-degree equation; thus, as Z_1 and θ are updated alternately, the iterates of successive rounds converge as they approach the true values. The typical downtilt of an offline store camera lies between 15 and 40 degrees, and convergence takes about 3 to 6 iterations; in this embodiment the iteration count is 5, i.e., θ^(5) is taken as the posterior estimate θ̂_i of the downtilt θ from the observation of humanoid frame O_i.
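Formulas (4)-(6) appear only as images in the source, so the following sketch of the alternating iteration is a reconstruction under a standard pinhole model (image ordinates measured from the principal point, positive downward); θ^(0) = 10° and α = 0.5 follow the text, while the per-step solve expressions and function names are assumptions. The two-point step here solves the exact quadratic in Z_1 rather than a dropped-term approximation:

```python
import math


def estimate_downtilt(y1, y2, f, cam_height, person_height=1.65,
                      theta0_deg=10.0, alpha=0.5, iters=12):
    """Alternately refine depth Z1 and downtilt theta from one humanoid frame.

    y1, y2: head-vertex and sole ordinates on the image, measured from the
    principal point (positive downward). The closed forms below come from a
    pinhole model with downtilt theta; they are a reconstruction, since the
    patent's formulas (4)-(6) are not reproduced in the text.
    """
    Y2 = cam_height                # sole point, height below the optical center
    Y1 = Y2 - person_height        # head vertex, height below the optical center
    theta = math.radians(theta0_deg)

    def solve_z(theta):
        # Two-point difference y2 - y1 = f*Z1*(Y2-Y1)/((Z1*c+Y1*s)*(Z1*c+Y2*s)),
        # rearranged into a quadratic a*Z1^2 + b*Z1 + c0 = 0; keep the larger root.
        s, c = math.sin(theta), math.cos(theta)
        d = y2 - y1
        a = d * c * c
        b = d * (Y1 + Y2) * s * c - f * (Y2 - Y1)
        c0 = d * Y1 * Y2 * s * s
        disc = math.sqrt(b * b - 4 * a * c0)
        return (-b + disc) / (2 * a)

    z = solve_z(theta)                                  # initial depth, ~ formula (4)
    for _ in range(iters):
        z = alpha * solve_z(theta) + (1 - alpha) * z    # damped update, ~ formula (5)
        # Single-point sole equation y2 = f*(Y2*c - Z1*s)/(Z1*c + Y2*s),
        # solved for theta, ~ formula (6).
        theta = math.atan((f * Y2 - y2 * z) / (y2 * Y2 + f * z))
    return math.degrees(theta), z
```

On synthetic observations the alternation converges to the generating downtilt and depth within a handful of rounds, matching the 3-6 iteration figure quoted in the text.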
S203: after the humanoid-frame set is traversed, each humanoid frame O_i yields a corresponding posterior estimate θ̂_i of the downtilt θ; build an angle histogram from 10 to 45 degrees with one bin per 0.5 degree, initialize every bin count to 0, drop the posterior estimate θ̂_i of each humanoid frame in the set into its histogram bin, and take the bin with the largest count as the final estimate θ̂ of the downtilt.
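The histogram vote of S203 can be sketched directly (bin layout per the text; the function name and the choice of the bin center as the returned value are illustrative):

```python
def vote_downtilt(posterior_estimates, lo=10.0, hi=45.0, step=0.5):
    """Return the center of the most-voted 0.5-degree bin in [lo, hi)."""
    n_bins = int(round((hi - lo) / step))
    counts = [0] * n_bins
    for theta in posterior_estimates:
        if lo <= theta < hi:                    # estimates outside the range are discarded
            counts[int((theta - lo) / step)] += 1
    best = max(range(n_bins), key=lambda i: counts[i])
    return lo + (best + 0.5) * step             # bin center as the final estimate
```

Voting over many per-frame posteriors makes the final downtilt robust to individual pedestrians whose height deviates from the assumed 165 cm mean.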
After the estimation of the downtilt θ is complete, the image coordinates of a real-time humanoid frame can be converted into coordinates in the physical world coordinate system using the optical imaging law, as follows:
s104, the humanoid detection model sends humanoid frame coordinate information on the real-time image frame to the coordinate calculator 4, wherein the image coordinates of the row of humanoid head vertexes are (x) 1 ,y 1 ) The sole point image coordinates are (x 1 ,y 2 )。
S105: using the final estimate θ̂ of the downtilt θ: since the camera focal length f and the camera mounting height Y_2 are measured in advance, and the pedestrian height is taken as the statistical mean of 165 cm, Y_1 = Y_2 − 165 is known; substituting the pedestrian head-vertex coordinates (x_1, y_1) and θ̂ into the following formula yields the physical-world coordinates (X_1, Z_1) of the pedestrian's standing position relative to the camera, thereby determining the mutual positional relationship between the pedestrian and the camera.
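The S105 conversion formula is likewise an image in the source; inverting the head-vertex projection of a pinhole camera with downtilt θ gives one plausible closed form (a reconstruction, with image coordinates (x_1, y_1) measured from the principal point, positive downward):

```python
import math


def locate_from_head(x1, y1, f, theta_deg, cam_height, person_height=1.65):
    """Recover (X1, Z1), the pedestrian's standing position relative to the
    camera, from the head-vertex image coordinates (x1, y1).

    Reconstruction of the patent's (image-only) formula: from
    y1 = f*(Y1*cos t - Z1*sin t)/(Z1*cos t + Y1*sin t), solve for Z1,
    then X1 from x1 = f*X1/(Z1*cos t + Y1*sin t).
    """
    t = math.radians(theta_deg)
    s, c = math.sin(t), math.cos(t)
    Y1 = cam_height - person_height            # head height below the optical center
    Z1 = Y1 * (f * c - y1 * s) / (y1 * c + f * s)
    X1 = x1 * (Z1 * c + Y1 * s) / f
    return X1, Z1
```

The inversion is exact for the model: projecting a synthetic standing position and feeding the resulting head-vertex pixel back through `locate_from_head` recovers the original (X_1, Z_1).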
The key point of step S105 is that, once the camera downtilt is determined, the invention computes physical-world coordinates from the pedestrian's head-vertex coordinate y_1, whereas conventional methods generally use the sole point y_2. Conventional methods do so because the height of any specific pedestrian is unknown; in practice, however, a pedestrian's feet are occluded with high probability, so using the lower edge y_2 of the humanoid frame as the sole point makes the computed physical-world coordinates fluctuate widely and degrades the accuracy of the positioning estimate.
This embodiment avoids the unknown-height problem by assuming the statistical mean height of 165 cm, which makes physical-coordinate positioning from the pedestrian's head vertex possible; since the head is occluded far less often than the feet, the stability of positioning accuracy improves greatly. Practical measurements show that the horizontal distance between camera and pedestrian in an offline store is mostly within 10 m; at that distance, approximating a specific pedestrian's height by 165 cm produces a positioning error exceeding 50 cm only when the pedestrian is shorter than 140 cm or taller than 190 cm, which still satisfies the designed positioning accuracy of within 1 m.
Further, by measuring in advance the coordinates (Cx, Cy) of the camera on the top-view floor plan of the offline store, the projected unit vector e_z of the camera Z-axis on the plan, and the projected unit vector e_x of the camera X-axis on the plan, the real-time coordinates of a pedestrian on the store floor plan can be calculated as (Cx, Cy) + Z_1·e_z + X_1·e_x, thereby realizing real-time indoor pedestrian positioning under monocular vision; when the pedestrian is within 10 meters of the camera and the camera downtilt is within 15-45 degrees, the positioning error is < 50 cm.
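The floor-plan mapping amounts to a 2-D change of basis from camera-relative coordinates (X_1, Z_1) to plan coordinates; a minimal sketch (the vector names e_z, e_x and tuple layout are assumptions):

```python
def to_floor_plan(camera_xy, e_z, e_x, X1, Z1):
    """Map camera-relative (X1, Z1) to store floor-plan coordinates.

    camera_xy: camera position (Cx, Cy) on the top-view plan, measured in advance.
    e_z, e_x:  unit vectors of the camera Z- and X-axis projections on the plan.
    """
    cx, cy = camera_xy
    return (cx + Z1 * e_z[0] + X1 * e_x[0],
            cy + Z1 * e_z[1] + X1 * e_x[1])
```

With several cameras whose plan positions and axis projections are surveyed once, every camera's detections land in the same store-wide coordinate frame.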
To sum up: the indoor pedestrian positioning method based on monocular vision designs a dedicated cross-iteration algorithm that, without the aid of depth information, automatically estimates the camera downtilt from automatically collected historical humanoid-frame data, thereby achieving indoor pedestrian positioning based on monocular vision; computing from the pedestrian's head vertex under a statistical-mean height assumption further improves the accuracy and stability of the positioning.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (5)
1. An indoor pedestrian positioning method based on monocular vision, characterized in that the indoor pedestrian positioning method comprises the following steps:
S101: a high-definition monitoring camera captures video of the offline store scene, and decoded image frames are sent to the humanoid detector at 40 ms intervals, i.e., 25 frames per second; when computing resources are insufficient, frame skipping is applied to the image frames, keeping no fewer than 1 frame per second;
S102: after receiving a real-time image frame, the humanoid detection module detects the humanoid-frame coordinates in it, and after collecting the humanoid-frame coordinate set over a period of history, sends the set to the coordinate calculation module for estimating the pose of the camera;
S103: the received historical humanoid-frame coordinate set can be written as O = {O_1, O_2, …, O_n}, i.e., the set consists of n humanoid frames, each O_i = (t_i, p_i), where t_i is the timestamp of the humanoid frame and p_i its image coordinates; the camera pose estimation module automatically estimates the camera's vertical downtilt angle θ from the humanoid-frame coordinate data of set O, without any manual calibration assistance;
the optical imaging formulas of the head vertex P1 and the sole point P2 are combined as follows,
where f is the focal length of the camera, measured in advance, the world coordinates of the pedestrian head vertex P1 are (X1, Y1, Z1), and the world coordinates of the contact point P2 between the pedestrian's foot and the floor are (X1, Y2, Z1); the upper ordinate of the humanoid frame Oi is y1 and the lower ordinate is y2; subtracting the two gives
where Y2 - Y1 is exactly the height of the pedestrian;
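Formulas (1)-(3) appear only as images in the patent source. As an illustration, the sketch below implements one common pinhole tilt model consistent with the surrounding description (world Y measured downward from the camera, image ordinate y measured downward from the principal point); the model and function names are assumptions, not the patent's exact formulas:

```python
from math import atan, tan

def project_y(f, theta, Y, Z):
    """Image ordinate (pixels, downward from the principal point) of a point
    at vertical drop Y below the camera and horizontal distance Z, for a
    camera with focal length f (pixels) pitched down by theta (radians)."""
    return f * tan(atan(Y / Z) - theta)

def box_ordinates(f, theta, height, cam_height, Z):
    """Upper/lower ordinates y1, y2 of a pedestrian's humanoid frame:
    head vertex at Y1 = cam_height - height, sole point at Y2 = cam_height."""
    Y1, Y2 = cam_height - height, cam_height
    return project_y(f, theta, Y1, Z), project_y(f, theta, Y2, Z)
```

Subtracting the two ordinates then relates the pixel height of the frame to the physical height Y2 - Y1, which is what formulas (1)-(3) exploit.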
s201: selecting a human-shaped frame with visible heads and feet from a historical human-shaped frame set, wherein the heads or the feet are shielded or invisible to be deleted, and the step is finished by utilizing a human skeleton point detection algorithm OpenPose;
s202: traversing the filtered human-shaped frame set visible to the head and foot of the pedestrian, and sequentially taking out one human-shaped frame O i If the upper edge is y 1 The lower ordinate is y 2 Taking this as an observation to estimate the camera downtilt θ:
initial estimated value theta of camera downward inclination angle theta with 10 degrees (0) Z in the formula (3) is omitted 1 The second term of (1) is
Due to taking theta (0) =10 degrees, Z can be calculated from equation (4) 1 Is the initial estimate Z1 of (1) (0) F is the focal length of the camera measured in advance, and the world coordinate of the pedestrian head vertex P1 is (X 1 ,Y 1 ,Z 1 ) The world coordinate of the contact point P2 between the pedestrian foot and the floor is (X) 1 ,Y 2 ,Z 1 );
Will Z1 (0) Substituting the following formula (5) to calculate a first iteration value Z1 (1) Wherein the hyper-parameter alpha takes the decimal fraction between 0 and 1 and takes 0.5;
at a first iteration value Z1 (1) Substituting formula (6) to reversely calculate the first iteration value theta of the declination angle theta (1)
Will estimate the value theta for the first time (1) Substitution of θ in equation (5) (0) ToSubstitution of +.>Obtaining Z 1 Second estimate of +.>Replace it by +.>The first iteration value theta of the downward inclination angle theta can be obtained (2) ;
The Z is obtained by the two-point difference of the formula (5) 1 Equation (6) is a single point one-time equation, thus when Z 1 Alternating with theta, when the iteration value approaches to a true value, the iteration values of the front and rear wheels tend to converge, the common downtilt angle of the offline store camera is between 15 and 40 degrees, the convergence iteration number is about 3 to 6, and the iteration number is 5, namely theta (5) As a humanoid frame O i Posterior estimate for observed downtilt angle θ
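Formulas (4)-(6) are images in the patent source, so the alternating scheme of S202 is sketched below under an assumed pinhole tilt model (rays through the head vertex and the sole point, world Y measured downward from the camera). The damping factor `alpha` plays the role of the hyper-parameter in formula (5); the exact update equations are assumptions, not the patent's:

```python
from math import atan, tan, radians, degrees

def estimate_downtilt(y1, y2, f, cam_height, person_height=1.65,
                      theta0_deg=10.0, alpha=0.5, iters=5):
    """Alternately refine the depth Z1 and the downtilt theta from one
    humanoid frame (upper ordinate y1, lower ordinate y2, in pixels below
    the principal point). Returns the posterior estimate of theta (degrees)."""
    Y1 = cam_height - person_height        # vertical drop to the head vertex
    Y2 = cam_height                        # vertical drop to the sole point
    a1, a2 = atan(y1 / f), atan(y2 / f)    # ray angles below the optical axis
    theta = radians(theta0_deg)            # initial estimate theta^(0) = 10 deg
    Z = Y2 / tan(theta + a2)               # initial depth Z1^(0), sole-point ray
    for _ in range(iters):
        Z_new = Y2 / tan(theta + a2)       # re-estimate depth given current theta
        Z = alpha * Z_new + (1 - alpha) * Z  # damped update (hyper-parameter alpha)
        theta = atan(Y1 / Z) - a1          # back-calculate theta from the head ray
    return degrees(theta)
```

On synthetic observations generated from the same model, the alternation contracts toward the true angle round by round, mirroring the convergence behaviour the claim describes.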
S203: after the humanoid frame set has been traversed, each humanoid frame Oi yields a corresponding posterior estimate of the downtilt angle θ; an angle histogram is established from 10 degrees to 45 degrees with one bin per 0.5 degrees, and every bin count is initialized to 0; the posterior estimate of each humanoid frame in the set falls into its corresponding histogram bin, and the bin with the largest count is taken as the final estimate θ̂ of the downtilt angle;
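The histogram vote of S203 can be sketched directly. Returning the centre of the winning 0.5-degree bin is one reasonable reading of "the bin with the largest count"; the patent does not specify which point of the bin is reported:

```python
def vote_downtilt(posterior_estimates, lo=10.0, hi=45.0, bin_width=0.5):
    """Build an angle histogram from lo to hi degrees with one bin per
    bin_width degrees, drop each per-frame posterior estimate into its bin,
    and return the centre of the most-voted bin as the final estimate."""
    n_bins = int((hi - lo) / bin_width)      # 70 bins for 10..45 deg at 0.5 deg
    counts = [0] * n_bins
    for est in posterior_estimates:
        if lo <= est < hi:                   # estimates outside the range are ignored
            counts[int((est - lo) / bin_width)] += 1
    best = max(range(n_bins), key=lambda i: counts[i])
    return lo + (best + 0.5) * bin_width     # bin centre, in degrees
```

Voting over many frames makes the final θ̂ robust to individual noisy observations, e.g. from mis-detected humanoid frames.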
S104: the humanoid detection model sends the humanoid frame coordinate information on the real-time image frame to the coordinate calculator, where the image coordinates of the pedestrian's head vertex are (x1, y1) and the image coordinates of the sole point are (x1, y2);
S105: using the final estimate θ̂ of the downtilt angle, and since the camera focal length f and the camera mounting height Y2 are measured in advance and the pedestrian height is taken as the statistical average of 165 cm, Y1 = Y2 - 165 is known; the pedestrian head-vertex coordinates (x1, y1) and θ̂ are then substituted into the following formula,
calculating the physical-world coordinates (X1, Z1) of the pedestrian's standing position relative to the camera, thereby determining the mutual positional relationship between the pedestrian and the camera.
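Steps S104-S105 reduce to a closed-form back-projection once θ̂ is known. The patent's formula is an image in the source, so this sketch assumes a simple pinhole tilt model (θ in radians, image ordinates downward from the principal point, world Y measured downward from the camera); it is an illustration rather than the patent's exact expression:

```python
from math import atan, tan, sin, cos

def locate_pedestrian(x1, y1, f, theta, cam_height, person_height=1.65):
    """Recover the pedestrian's standing position (X1, Z1) relative to the
    camera from the head-vertex image coordinates (x1, y1), the focal length
    f (pixels), the downtilt theta (radians) and the mounting height."""
    Y1 = cam_height - person_height            # Y1 = Y2 - 1.65 (average height)
    Z1 = Y1 / tan(theta + atan(y1 / f))        # horizontal distance, head ray
    depth = Z1 * cos(theta) + Y1 * sin(theta)  # depth along the optical axis
    X1 = x1 * depth / f                        # lateral offset from x1
    return X1, Z1
```

Projecting a known position through the same model and running it back through `locate_pedestrian` recovers (X1, Z1), which is how the sketch can be sanity-checked.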
2. An indoor pedestrian positioning method based on monocular vision as claimed in claim 1, further comprising a pedestrian positioning structure (1), characterized in that: the pedestrian positioning structure (1) consists of a high-definition monitoring camera (2), a humanoid detector (3) and a coordinate calculator (4).
3. The monocular vision-based indoor pedestrian positioning method of claim 2, wherein: the high-definition monitoring camera (2) is responsible for collecting real-time videos in off-line store scenes, and the real-time videos are decoded into real-time image frame sequences and then transmitted to the human-shaped detector (3).
4. The monocular vision-based indoor pedestrian positioning method of claim 2, wherein: the human-shaped detector (3) comprises a human-shaped detection module which is responsible for extracting human-shaped frames in image frames and maintaining a historical human-shaped frame set which comprises coordinate information of human-shaped frames appearing in a period of time.
5. The monocular vision-based indoor pedestrian positioning method of claim 2, wherein: the coordinate calculator (4) comprises a camera pose estimation module and a coordinate positioning calculation module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011023002.4A CN112258571B (en) | 2020-09-25 | 2020-09-25 | Indoor pedestrian positioning method based on monocular vision |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112258571A CN112258571A (en) | 2021-01-22 |
CN112258571B true CN112258571B (en) | 2023-05-30 |
Family
ID=74234137
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011023002.4A Active CN112258571B (en) | 2020-09-25 | 2020-09-25 | Indoor pedestrian positioning method based on monocular vision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112258571B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115223442B (en) * | 2021-07-22 | 2024-04-09 | 上海数川数据科技有限公司 | Automatic generation method of indoor pedestrian map |
CN114758457B (en) * | 2022-04-19 | 2024-02-02 | 南京奥拓电子科技有限公司 | Intelligent monitoring method and device for illegal operation among banknote adding |
CN114937060A (en) * | 2022-04-26 | 2022-08-23 | 南京北斗创新应用科技研究院有限公司 | Monocular pedestrian indoor positioning prediction method guided by map meaning |
CN117523009B (en) * | 2024-01-04 | 2024-04-16 | 北京友友天宇系统技术有限公司 | Binocular camera calibration method, system, device and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109949361A (en) * | 2018-12-16 | 2019-06-28 | 内蒙古工业大学 | A kind of rotor wing unmanned aerial vehicle Attitude estimation method based on monocular vision positioning |
CN110619662A (en) * | 2019-05-23 | 2019-12-27 | 深圳大学 | Monocular vision-based multi-pedestrian target space continuous positioning method and system |
CN110793526A (en) * | 2019-11-18 | 2020-02-14 | 山东建筑大学 | Pedestrian navigation method and system based on fusion of wearable monocular vision and inertial sensor |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9664789B2 (en) * | 2014-02-20 | 2017-05-30 | Mobileye Vision Technologies Ltd. | Navigation based on radar-cued visual imaging |
EP3371671B1 (en) * | 2015-11-02 | 2020-10-21 | Starship Technologies OÜ | Method, device and assembly for map generation |
2020-09-25: Application CN202011023002.4A filed; patent CN112258571B granted, status Active.
Non-Patent Citations (6)
Title |
---|
"A Vehicle Localization System Using Visual Road Features from Monocular Camera";Ching Yu Lin.et al;《IEEE》;20191205;全文 * |
"Monocular vision pose measurement algorithm based on points feature";Wang Zhongyu.et al;《INFRARED AND LASER ENGINEERING》;20190531;全文 * |
"Solving Monocular vision Odometry Scale Factor with Adaptive Step Length Estimates for Pedestrians Using Handheld Devices";Nicolas Antigny.et al;《sensors》;20191231;第19卷(第4期);全文 * |
"一种相机标定辅助的单目视觉室内定位方法";王勇等;《测绘通报》;20180225(第02期);全文 * |
"单目视觉的室内多行人目标连续定位方法";孙龙培等;《测绘科学》;20191231;第44卷(第12期);全文 * |
"基于双目视觉的目标定位研究";鞠冠秋等;《科技创新与应用》;20150428(第12期);全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN112258571A (en) | 2021-01-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112258571B (en) | Indoor pedestrian positioning method based on monocular vision | |
CN111462200B (en) | Cross-video pedestrian positioning and tracking method, system and equipment | |
CN107255476B (en) | Indoor positioning method and device based on inertial data and visual features | |
CN107924461B (en) | Method, circuit, equipment, system and the correlation computer executable code for being registrated and tracking for multifactor characteristics of image | |
US9189859B2 (en) | 3D image generation | |
Liu et al. | Surveillance camera autocalibration based on pedestrian height distributions | |
US7929017B2 (en) | Method and apparatus for stereo, multi-camera tracking and RF and video track fusion | |
CN110807809B (en) | Light-weight monocular vision positioning method based on point-line characteristics and depth filter | |
CN112902953A (en) | Autonomous pose measurement method based on SLAM technology | |
CN103735269B (en) | A kind of height measurement method followed the tracks of based on video multi-target | |
Li et al. | Multi-scale 3D scene flow from binocular stereo sequences | |
Taketomi et al. | Real-time and accurate extrinsic camera parameter estimation using feature landmark database for augmented reality | |
CN111915723A (en) | Indoor three-dimensional panorama construction method and system | |
CN112541938A (en) | Pedestrian speed measuring method, system, medium and computing device | |
CN110349257B (en) | Phase pseudo mapping-based binocular measurement missing point cloud interpolation method | |
JP5027758B2 (en) | Image monitoring device | |
CN114494629A (en) | Three-dimensional map construction method, device, equipment and storage medium | |
CN115222884A (en) | Space object analysis and modeling optimization method based on artificial intelligence | |
CN114569114A (en) | Height measuring method and device | |
CN113920254A (en) | Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof | |
CN109740458B (en) | Method and system for measuring physical characteristics based on video processing | |
CN112288792A (en) | Vision-based instant measurement method for guest queuing length and waiting time | |
CN112414444A (en) | Data calibration method, computer equipment and storage medium | |
CN115773759A (en) | Indoor positioning method, device and equipment of autonomous mobile robot and storage medium | |
CN114913224A (en) | Composition method for mobile robot based on visual SLAM |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||