WO2020258820A1

WO2020258820A1 - Mobile side vision fusion positioning method and system, and electronic device

Info

Publication number: WO2020258820A1
Application number: PCT/CN2019/130553
Authority: WO
Inventors: 赵希敏; 胡金星
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2019-06-26
Filing date: 2019-12-31
Publication date: 2020-12-30
Also published as: CN110375739A; CN110375739B

Abstract

A mobile side vision fusion positioning method and system, and an electronic device. The method comprises: step a: system initialization (100), obtaining an initial location of a mobile terminal as the current location of a positioning target on the basis of a calibrated starting location and sensor information; step b, obtaining a video frame by using the mobile terminal (200); and step c, detecting a static object in the video frame, obtaining geographic coordinate information of the static object by means of a BIM space database, and bringing the coordinate information of the static object into a multi-target object positioning model, iteratively solving the positioning model by means of a Gauss-Newton method, obtaining the current location of the mobile terminal, and combining the current location of the mobile terminal and the coordinate information of the static object to obtain a positioning result of the positioning target. The more convenient, precise and cheaper positioning method can be implemented.

Description

Mobile terminal vision fusion positioning method, system and electronic equipment

Technical field

This application belongs to the cross-technology field of artificial intelligence and geographic information technology, and in particular relates to a mobile terminal vision fusion positioning method, system and electronic equipment.

Background technique

The Global Navigation Satellite System (GNSS) can realize navigation and positioning outdoors. Currently, radio positioning technologies represented by GNSS, cellular networks, and WIFI can achieve sub-meter precise positioning in open outdoor locations. The principle is to rely on detecting the characteristic parameters of the propagated signal to achieve positioning. Common methods include proximity detection, observation time difference of arrival (OTDOA) and so on.

Indoor positioning technology mainly realizes the positioning and tracking of people and objects in various indoor spaces. The demand for safety and monitoring of people and objects based on indoor positioning is also increasing. People's demand for location services in indoor environments has become increasingly significant. , Domestic and foreign scholars have conducted a lot of exploration and research. At present, most indoor positioning systems are implemented based on proximity detection, triangular, multilateral positioning, and fingerprint positioning methods, or combined positioning methods are adopted to improve accuracy. However, due to the influence of multi-path effects, the indoor environment is changeable and complex, and there is no universal solution. How to improve accuracy, real-time, security, scalability, low cost, convenience, and specialization are current research hotspots. .

At present, indoor radio positioning technologies (such as WIFI, Bluetooth, etc.) mostly use Received Signal Strength Indication (RSSI) as the basis of positioning algorithms, and realize positioning by using signal attenuation and distance change relationships. WIFI indoor positioning technology generally includes: the positioning method based on the received signal strength indicating the RSSI distance rendezvous and the RSSI location fingerprint method. The matching of the signal is the main part of its research, and the positioning accuracy lies in the calibration point density. This technology has the advantages of easy expansion, automatic data update, and low cost, so it is the first to realize large-scale applications.

Bluetooth positioning is based on a short-distance low-power communication protocol. The implementation methods can be centroid positioning, fingerprint positioning, and proximity detection; Bluetooth positioning has the advantages of low power consumption, close range, and extensive use, but at the same time it has poor stability and is affected by environmental interference. Big. The precise micro-positioning technology iBeacon developed by Apple based on Bluetooth low energy works in a similar way to the previous Bluetooth technology. The signal is transmitted by the Beacon, and the Bluetooth device receives and feedbacks the signal. When the user enters, exits or wanders in the area, the Beacon The broadcast has the ability to spread, and can calculate the distance between the user and the Beacon (calculated by RSSI). It can be seen that as long as there are three iBeacon devices, it can be located.

The above-mentioned indoor positioning technologies are all based on radio frequency signals. Radio signals are easily affected by indoor environments such as obstacles. Environmental changes cause positioning accuracy to decrease. At the same time, the early stage requires construction operations, a large amount of equipment, and high maintenance costs. .

The vision sensor positioning technology uses triangulation technology to obtain the current relative position to achieve positioning. Compared with other sensor positioning methods, the vision sensor has a higher positioning accuracy and low cost. The EasyLiving system is a positioning system based on computer vision, using high-performance mobile terminals with high accuracy; but when the indoor environment is complex, it is difficult to maintain high accuracy all the time. Through the principle of simultaneous positioning and mapping (Simultaneous Location And Mapping, SLAM) of mobile robots, visual sensors can be introduced. The EV-Loc indoor positioning system proposed in 2012 is a positioning system that uses visual signals as auxiliary positioning to improve accuracy. Google's Visual Positioning Service (VPS) technology based on the principle of visual positioning has theoretical accuracy up to centimeter level.

However, the existing visual positioning systems, such as EasyLiving, Google VPS, etc., are mostly based on the SLAM principle by extracting feature points captured by visual sensors, using triangulation ranging method, combined with acceleration, gyroscope and other sensors to calculate the current position movement offset. Just relative positioning. In order to achieve accurate indoor geographic positioning, a large number of manual signs must be deployed at fixed points in advance, and the preliminary preparations are tedious. At the same time, these existing visual positioning technologies only consider the data sensed by the sensors, and do not make use of the semantic information carried by these data.

Summary of the invention

This application provides a mobile terminal vision fusion positioning method, system and electronic device, which aim to solve one of the above technical problems in the prior art at least to a certain extent.

In order to solve the above-mentioned problems, this application provides the following technical solutions:

A mobile terminal vision fusion positioning method includes the following steps:

Step a: Acquire the initial position of the mobile terminal based on the calibrated starting position and sensor information, and set the initial position as the current position of the positioning target;

Step b: Use a mobile terminal to obtain video frames;

Step c: Detect the static object in the video frame, obtain the coordinate information of the static object through the BIM spatial database, bring the coordinate information of the static object into the multi-target object positioning model, and solve the iteratively by Gauss-Newton method The positioning model obtains the current position of the mobile terminal, and combines the current position of the mobile terminal with the coordinate information of the static object to obtain the positioning result of the positioning target.

The technical solution adopted in the embodiment of the application further includes: in the step b, after the use of the mobile terminal to obtain the video frame, the method further includes: the visual odometer calculates the current pose information of the mobile terminal according to the obtained video frame.

The technical solution adopted in the embodiment of the present application further includes: the visual odometer calculates the current pose information of the mobile terminal according to the acquired video frames specifically including:

Step b1: The visual odometer scales the acquired video frame to the set size, saves it in the image sliding window, and judges whether the current video frame is the first frame, if the current video frame is the first frame, only the key point extraction operation is performed ; Otherwise, extract the key points and calculate the residual difference between the current key point and the key point of the previous video frame; the single key point residual e is the error of the key point pixel brightness, and the calculation formula is:

e=I ₁ (x ₁ )-I ₂ (x ₂ )=I ₁ (Kp ₁ )-I ₂ (K(Rp ₁ +t))

In the above formula, I ₂ is obtained by I ₁ after a certain movement, R and t are the motion trajectories of the mobile terminal, x ₁ is the pixel position of the key point in the image I ₁ , and x ₂ is the pixel of the key point in the image I ₂ Position, p ₁ is the coordinate of the key point in real space, K is the internal parameter matrix of the mobile terminal;

Step b2: Use Gauss Newton's method to solve the residual Jacobian, obtain the motion pose of the current video frame and the previous video frame, and record them in the pose storage sliding window;

Step b3: Obtain the mobile terminal pose of the current video frame, extract the spatial offset of the pose, and convert the spatial offset into a relative coordinate offset value, which is the motion offset of the mobile terminal.

The technical solution adopted in the embodiment of the application further includes: the step b also includes: judging the monitoring positioning state, and if it is not in the positioning state, adding the current pose information of the mobile terminal to the current position acquired last time , Update the current position of the positioning target; if it is in the positioning state, perform step c.

The technical solution adopted in the embodiment of the application further includes: in the step c, the detecting a static object in the video frame, obtaining the coordinate information of the static object through the BIM spatial database, and bringing the coordinate information of the static object Entering a multi-target object positioning model, and iteratively solving the positioning model through the Gauss Newton method, and obtaining the current position of the mobile terminal specifically includes:

Step c1: Take out the video frame and input it into the target detection neural network to obtain the type of static object contained in the video frame, and set the center pixel position of the static object as a calibration point, and then take out the next video frame And the mobile terminal pose information of the next video frame, using triangulation to calculate the calibration point and the depth information of the mobile terminal; wherein the triangulation formula is as follows:

In the above formula, s ₁ and s ₂ are the depth information of key points;

Step c2: Load the BIM spatial information database with the coordinate information of the current position through the identified static object category, and obtain the coordinate information of the static object according to the BIM spatial information database;

Step c3: Bring the coordinate information of the static object into the positioning model, and use Gauss-Newton method to iteratively solve to obtain the current position of the mobile terminal; the Gauss-Newton method to solve the equations is:

In the above formula, (x, y, z) is the current position of the mobile _{_{terminal, (x n, y n,}} z n) are BIM coordinate information, ρ _n is the static object depth to the mobile terminal, σ _n is a measure of the depth of the noise;

Step c4: Combine the current position of the mobile terminal with the static object coordinate information to obtain a positioning result of the current position.

Another technical solution adopted by the embodiment of this application is: a mobile terminal vision fusion positioning system, including:

Initial positioning unit: Based on the calibrated starting position and sensor information, obtain the initial position of the mobile terminal and set the initial position as the current position of the positioning target;

Video frame acquisition module: used to acquire video frames using mobile terminals;

Target positioning module: used to detect static objects in the video frame, obtain the coordinate information of the static objects through the BIM spatial database, bring the coordinate information of the static objects into the multi-target object positioning model, and iterate through the Gauss-Newton method Solve the positioning model, obtain the new current position of the mobile terminal, and combine the new current position of the mobile terminal with the coordinate information of the static object to obtain the positioning result of the positioning target.

The technical solution adopted in the embodiment of the present application further includes a pose calculation module, which is used to calculate the current pose information of the mobile terminal according to the acquired video frame through the visual odometer.

The technical solution adopted in the embodiment of the present application further includes: the pose calculation module includes:

Key point extraction unit: used to zoom the acquired video frame to a set size through the visual odometer, save it in the image sliding window, and determine whether the current video frame is the first frame, if the current video frame is the first frame, only Carry out the operation of extracting key points; otherwise, extract the key points and calculate the residual difference between the current key point and the key point of the previous video frame; the single key point residual e is the error of the key point pixel brightness, and the calculation formula is:

e=I ₁ (x ₁ )-I ₂ (x ₂ )=I ₁ (Kp ₁ )-I ₂ (K(Rp ₁ +t))

Motion pose solving unit: used to use Gauss-Newton method to solve the residual Jacobian, get the motion pose of the current video frame and the previous video frame, and record it to the pose storage sliding window;

Motion offset calculation unit: used to obtain the mobile terminal pose of the current video frame, extract the spatial offset of the pose, and convert the spatial offset into a relative coordinate offset value, which is the motion offset of the mobile terminal shift.

The technical solution adopted in the embodiment of this application also includes a positioning judgment module and a position update module. The positioning judgment module is used to monitor the positioning status through judgment. If it is not in the positioning state, the current position of the mobile terminal is updated by the position update module. After the pose information is added to the current position obtained last time, the current position of the positioning target is updated; if it is in the positioning state, the positioning result of the positioning target is obtained through the target positioning module.

The technical solution adopted in the embodiment of the present application further includes: the target positioning module specifically includes:

Object recognition and depth calculation unit: used to take out the video frame, input it into the target detection neural network, obtain the type of static object contained in the video frame, and set the center pixel position of the static object as a calibration point, Then take out the next video frame and the mobile terminal pose information of the next video frame, and use the triangulation method to calculate the calibration point and the depth information of the mobile terminal; wherein the triangulation method formula is as follows:

In the above formula, s ₁ and s ₂ are the depth information of key points;

Coarse positioning unit: used to load the BIM spatial information database with the coordinate information of the current position through the identified static object category, and obtain the coordinate information of the static object according to the BIM spatial information database;

Fine positioning unit: used to bring the coordinate information of the static object into the positioning model, and iteratively solve it by using the Gauss Newton method to obtain the current position of the mobile terminal; the Gauss Newton method to solve the equations is:

In the above formula, (x, y, z) is the current position of the mobile terminal, (x _n , y _n , z _n ) is the static object coordinate information stored in the BIM spatial database, and ρ _n is the depth of the static object to the mobile terminal , Σ _n is the depth measurement noise;

Positioning result generating unit: used to combine the current position of the mobile terminal with the static object coordinate information to obtain the positioning result.

Another technical solution adopted by the embodiments of the present application is: an electronic device, including:

At least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the one processor, and the instructions are executed by the at least one processor to enable The at least one processor can perform the following operations of the aforementioned mobile terminal visual fusion positioning method:

Step b: Use a mobile terminal to obtain video frames;

Compared with the prior art, the beneficial effects produced by the embodiments of the present application are: the mobile terminal visual fusion positioning method, system and electronic equipment of the embodiments of the present application use visual sensors to detect and identify static objects in the real world to obtain the object spatial relationship, and The object space relationship provided by the BIM model is matched with geographical topological space, and then a nonlinear equation set is established according to the distance measurement of the object, and the equation set is solved iteratively, and the precise position is obtained by convergence, thereby realizing a more accurate, convenient and cheaper positioning method.

Description of the drawings

FIG. 1 is a flowchart of a method for visual fusion positioning on a mobile terminal according to an embodiment of the present application;

Figure 2 is a schematic diagram of the target detection neural network structure;

Figure 3 is a schematic diagram of key points selection;

4 is a schematic structural diagram of a mobile terminal visual fusion positioning system according to an embodiment of the present application;

FIG. 5 is a schematic diagram of the hardware device structure of the mobile terminal vision fusion positioning method provided by an embodiment of the present application.

Detailed ways

In order to make the purpose, technical solutions, and advantages of this application clearer, the following further describes this application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the application, and not used to limit the application.

Please refer to FIG. 1, which is a flowchart of a mobile terminal vision fusion positioning method according to an embodiment of the present application. The mobile terminal vision fusion positioning method of the embodiment of the present application includes the following steps:

Step 100: System initialization;

In step 100, the system initialization includes the following steps:

Step 110: Initialize the visual odometer;

In step 110, the initialization of the visual odometer includes operations such as memory allocation of the pose manager and initial value of variables; the pose manager includes a main data structure such as a sliding window for pose storage, an image sliding window, and key points. , The sliding window of the pose storage is used to store the calculated pose information; the image sliding window is used to cache the shooting information of the mobile terminal, waiting for the extraction of key points and the estimation of the depth of the calibration point; the key points are used to identify a frame of image The pixel gradient change in a certain area in the image is used for the similarity comparison of subsequent images; in addition to this main data structure, the pose manager also includes key point extraction function, key point similarity estimation function, key point update function, key Point drop function, key point index function, key point depth estimation function, frame pose estimation function, and pose addition, deletion, and query functions.

Step 120: initialization of semantic positioning calibration;

In step 120: the initialization of the semantic positioning calibration includes the training of the target detection neural network, the loading of the model, and the generation and loading of the BIM spatial database. Among them, the target detection neural network structure is shown in Figure 2. The target detection neural network structure described in this application refers to the existing target detection method, the network structure has not been changed, and only the proprietary static object data set is used when training the network After training and optimization, it is deployed to mobile terminals. The BIM spatial database is constructed and indexed using the R-tree method. The BIM spatial data structure includes the electronic map of the area where the current location is located and the category of static objects contained in the area, the coordinate information of the static object, other static objects near the static object, and the building where the current location is located The spatial layout information of each floor of the building.

Step 130: Acquire the initial position of the mobile terminal based on the calibrated starting position and sensor information, and set the initial position as the current position of the positioning target;

In step 130, the mobile terminal of the embodiment of the present application is an Android-equipped terminal device with a 9-axis IMU sensor. The initial positioning can be set through the indoor flat map to set the starting point location, or set through the automatic identification of unique signs, and then accumulate based on acceleration, gyroscope and other sensors.

Step 200: Obtain video frames using a mobile terminal;

Step 300: The visual odometer calculates the current pose information of the mobile terminal according to the acquired video frames;

In step 300, the pose is a motion matrix, which contains information such as rotation and translation of the mobile terminal. The visual odometer uses the pixel residuals of key points to estimate the pose changes of the mobile terminal in two adjacent frames, and then obtains the movement offset of the mobile terminal. Specifically, the visual odometer pose calculation includes:

Step 310: The visual odometer scales the acquired video frame to 300px*300px (specifically can be set according to the actual application), saves it in the image sliding window, and judges whether the current video frame is the first frame, if the current video frame is Only the key points are extracted in the first frame, otherwise, the key points are extracted and the residuals between the key points and the key points of the previous video frame are calculated;

In step 310, the key point calculation steps are as follows: first, select the pixel p on the video frame, and take 30% of the gray value G of the pixel P as the threshold T. Then, taking pixel p as the center, select 16 pixels on a circle with a radius of 3. If there are N consecutive points on the selected circle whose brightness is greater than G+T or less than GT, then the pixel p can be considered as a video The key point of the frame. Repeat the above steps until all the pixels on the video frame are traversed to complete the key point calculation; finally, non-maximum suppression is used, and within 50px*50px, only the key points that respond to the maximum value are retained. As shown in Figure 3, a schematic diagram of key points selection. For example: the sixteen pixel values of point p with a radius of 3 are in the gray part of the table below. The gray part is the selected N points, calculate the size of these N points and P, and determine whether they are key points according to the above rules. P is a value from 0 to 255. The residual e of a single key point is the error of the pixel brightness of the key point, and the calculation formula is as follows:

e=I ₁ (x ₁ )-I ₂ (x ₂ )=I ₁ (Kp ₁ )-I ₂ (K(Rp ₁ +t)) (1)

In formula (1), I ₂ is obtained by I ₁ after a certain movement, R and t are the trajectories of the mobile terminal, x ₁ is the pixel position of the key point in the image I ₁ , and x ₂ is the key point in the image I ₂ P ₁ is the coordinate of the key point in real space, and K is the internal parameter matrix of the mobile terminal. The Lie algebraic form of the residual is:

e I ₁ (Kp ₁ )-I ₂ (K exp(ξ ^∧ )p ₁ )) (2)

In formula (2),

Step 320: Use the Gauss-Newton method to solve the residual Jacobian, obtain the motion pose of the current video frame and the previous video frame, and record them in the pose storage sliding window;

In step 320, using the Gauss-Newton method to solve the residual Jacobian specifically includes:

Mobile terminal pose optimization objective function:

In formula (3), ξ is the pose of the mobile terminal, and J is the gradient of the residual with respect to the Lie algebra; that is,

The Gauss-Newton iteration increment solution equation is:

In formula (4), Δξ ^* is the iteration increment.

The Gauss-Newton method is used to solve the optimization problem, which is solved iteratively along the gradient by a step of the objective function.

The objective function is: (x _i -x) ² +(y _i -y) ² +(z _i -z) ² =f(X) where X=[x,y,z]

The iteration increment ΔX ^k is: J(X) ^T J(X)ΔX=-J(X) ^T f(X)

Where J(X) is the Jacobian matrix of the equation;

In formula (5),

The solution process of Gauss-Newton method is as follows:

Step 321: Given the initial point p ₀ and the number of iterations k, the allowable error ε>0. When the number of iterations and the error do not meet the conditions, perform step 330;

Step 322: If the objective function f((X) ^k+1 )-f((X) ^k ) is less than the threshold ε, exit, otherwise go to step 323;

Step 323: Calculate the iteration increment, bring in the objective function, and return to step 321.

Step 330: Obtain the pose of the mobile terminal of the current video frame, which is a row vector with six degrees of freedom, extract the spatial offset of the pose, and convert the spatial offset into a relative coordinate offset value, which is The movement of the mobile terminal is offset.

Step 400: By judging the positioning monitoring state, if it is not in the positioning state, perform step 500; otherwise, perform step 600;

In step 400, the location monitoring is a callback type, and may be executed at any time after the initialization is completed. This application starts the visual odometry pose calculation after acquiring the video frame. If it is not in the positioning state, the calculated pose information is added to the current position obtained last time to obtain the current position update result. This step has been executed repeatedly and will not stop due to changes in positioning status. When the positioning monitor is installed in the positioning state, the semantic positioning calibration is called, the current position is calculated according to the recognized object, the movement offset calculated by the visual odometer is replaced, and the user's current position is updated in combination with the static object coordinates in the BIM spatial database. Then use that position as the final position.

Step 500: After adding the current pose information to the current position acquired last time, update the current position of the positioning target;

Step 600: Invoke semantic positioning calibration to detect static objects in the video frame, and use the calibration point to use the triangulation method (triangulation refers to observing the angle of the same point in two places to determine the distance of the point) to estimate the static object The depth information of the static object and the semantic information of the spatial data are used to propose a multi-target object positioning model, and the positioning model is solved iteratively through the Gauss-Newton method to obtain the static object coordinate information and the current position of the mobile terminal, and the static object coordinate information is moved The current position of the terminal is combined to obtain the positioning result of the positioning target;

In step 600, the positioning method of semantic positioning calibration specifically includes the following steps:

Step 610: Take out the video frame in the image sliding window and input it into the target detection neural network to obtain the type of static object contained in the video frame, and set the center pixel position of the identified static object as the calibration point, and then take it out The pose information of the next video frame and the mobile terminal of the next video frame are calculated using the triangulation method to calculate the calibration point and the depth information of the mobile terminal;

In step 610, the triangulation formula is as follows:

In formula (6), s ₁ and s ₂ are the depth information of key points.

Step 620: Obtain the coordinate information of the static object in the video frame through the coarse positioning; the position data of the coarse positioning comes from the coordinate information of the static object stored in the BIM spatial database, and retrieve the coordinate information near the current position through the recognized object category Load the BIM spatial information database to find out the coordinate information carried by the recognized object category.

Step 630: fine positioning; in order to further optimize the positioning accuracy, the coordinate information obtained from the coarse positioning is brought into the positioning model, and the Gauss-Newton method is used to iteratively solve the problem to obtain the current position of the mobile terminal.

In step 630, the positions of the static object and the mobile terminal should satisfy the following relationship:

(x _i -x) ² +(y _i -y) ² +(z _i -z) ² =δρ _i +σ _i (7)

Therefore, the following nonlinear equations can be established and solved:

In formula (8), (x, y, z) is the current position of the mobile terminal, (x _n , y _n , z _n ) is the static object coordinate information stored in the BIM spatial database, representing the distance of the static object from the fixed coordinate (such as The relative position of the coordinate information of the building center), ρ _n is the depth of the current static object to the mobile terminal, and σ _n is the depth measurement noise.

Step 640: Combine the current position of the mobile terminal iteratively calculated by fine positioning (the position is the offset relative to the specific target coordinates of the building) with the static object coordinate information obtained by coarse positioning to obtain the positioning result of the current position, and generate Indoor electronic map and BIM spatial database of positioning results;

In step 640, an electronic map of the area is obtained according to the positioning coordinates, and BIM information is superimposed to produce a nearby indoor electronic map of the current positioning result.

Please refer to FIG. 4, which is a schematic structural diagram of a mobile terminal visual fusion positioning system according to an embodiment of the present application. The mobile terminal vision fusion positioning system in the embodiment of the application includes an initialization module, a video frame acquisition module, a pose calculation module, a positioning judgment module, a position update module, and a target positioning module.

Initialization module: used for system initialization; specifically, the initialization module includes:

Visual odometer initialization unit: used for the initialization of visual odometer; including: the memory allocation of the pose manager, the initial value of variables and other operations; the pose manager includes a sliding window for pose storage, image sliding window, key The main data structure such as points, the sliding window of the pose storage is used to store the calculated pose information; the image sliding window is used to cache the shooting information of the mobile terminal, waiting for the extraction of key points and the estimation of the depth of the calibration point; the key points are used It is used to identify the pixel gradient change of a certain area in a frame of image for the similarity comparison of subsequent images; besides this main data structure, the pose manager also includes key point extraction function, key point similarity estimation function, key Point update function, key point discard function, key point index function, key point depth estimation function, frame pose estimation function, and pose addition, deletion, and query functions.

Semantic positioning calibration initialization unit: used for the initialization of semantic positioning calibration; including target detection neural network training, model loading, and BIM spatial database generation and loading. The target detection neural network structure described in this application refers to existing target detection methods, and uses a proprietary static object data set to train and optimize the network when training the network, and then deploy it to the mobile terminal. The BIM spatial database is constructed and indexed using the R-tree method. The BIM spatial data structure includes the electronic map of the area where the current location is located and the category of static objects contained in the area, the coordinate information of the static object, other static objects near the static object, and the building where the current location is located The spatial layout information of each floor of the building.

Initial positioning unit: used to obtain the initial position of the mobile terminal, and set the initial position as the current position; wherein, the mobile terminal in the embodiment of the present application is a terminal device equipped with Android and has a 9-axis IMU sensor. The initial positioning can be set through the indoor flat map to set the starting point position, or set through automatic identification of unique signs, and then accumulate based on acceleration, gyroscope and other sensors.

Pose calculation module: used to calculate the current pose information of the mobile terminal according to the acquired video frames through the visual odometer; among them, the pose is the motion matrix, which contains information such as the rotation and translation of the mobile terminal. The visual odometer uses the pixel residuals of key points to estimate the pose changes of the mobile terminal in two adjacent frames, and then obtains the movement offset of the mobile terminal. Specifically, the pose calculation module includes:

Key point extraction unit: used to scale the acquired video frame to 300px*300px, save it in the image sliding window, and determine whether the current video frame is the first frame, if the current video frame is the first frame, only the key points are extracted. Otherwise, extract the key point and calculate the residual difference between the key point and the key point of the previous video frame; among them, the key point calculation steps are as follows: first select the pixel p on the video frame, and take 30% of the gray value G of the pixel P as Threshold T. Then, taking the pixel p as the center, select 16 pixels on a circle with a radius of 3. If there are N consecutive points on the selected circle whose brightness is greater than G+T or less than GT, the pixel p can be considered as a video The key point of the frame. Repeat the above steps until all the pixels on the video frame are traversed to complete the key point calculation; finally, non-maximum suppression is used, and within 50px*50px, only the key points that respond to the maximum value are retained. As shown in Figure 3, a schematic diagram of key points selection. For example: the sixteen pixel values of point p with a radius of 3 are in the gray part of the table below. The gray part is the selected N points, calculate the size of these N points and P, and determine whether they are key points according to the above rules. P is a value from 0 to 255. The residual e of a single key point is the error of the pixel brightness of the key point, and the calculation formula is as follows:

e=I ₁ (x ₁ )-I ₂ (x ₂ )=I ₁ (Kp ₁ )-I ₂ (K(Rp ₁ +t)) (1)

e=I ₁ (Kp ₁ )-I ₂ (K(exp(ξ ^∧ )p ₁ )) (2)

In formula (2),

Motion pose solving unit: used to use Gauss-Newton method to solve the residual Jacobian, obtain the motion pose of the current video frame and the previous video frame, and record it in the pose storage sliding window; among them, use Gauss-Newton method to solve The residual Jacobian specifically includes:

Mobile terminal pose optimization objective function:

The Gauss-Newton iteration increment solution equation is:

In formula (4), Δξ ^* is the iteration increment.

The Gauss-Newton method is used to solve the optimization problem, through a step of the objective function, and iteratively descending along the gradient.

The iteration increment ΔX ^k is: J(X) ^T J(X)ΔX=-J(X) ^T f(X)

Where J(X) is the Jacobian matrix of the equation;

In formula (5),

The solution process of Gauss-Newton method is as follows:

1: Given the initial point p ₀ and the number of iterations k, the allowable error ε>0.

2: If the objective function f((X) ^k+1 )-f((X) ^k ) is less than the threshold ε, exit, otherwise proceed to the next step;

3: Calculate the iteration increment, bring in the objective function, and return to step 1.

Motion offset calculation unit: used to obtain the mobile terminal pose of the current video frame, which is a row vector with six degrees of freedom, extracts the spatial offset of the pose, and converts the spatial offset into a relative coordinate offset The shift value is the movement offset of the mobile terminal.

Positioning judgment module: judge the positioning monitoring terminal, if it is not in the positioning state, update the current position of the positioning target through the position update module; otherwise, obtain the positioning result of the positioning target through the target positioning module; among them, this application starts after obtaining the video frame Visual odometer pose calculation, if it is not in the positioning state, add the calculated pose information to the current position obtained last time to get the current position update result. This step has been executed repeatedly and will not stop due to changes in positioning status. When it is in the positioning state, the semantic positioning calibration is called, the current position is calculated according to the recognized object, the movement offset calculated by the visual odometer is replaced, and the current position of the user is updated with the static object coordinates in the spatial database, and then the position It is drawn on the map platform as a new location.

Position update module: used to add the current pose information to the current position obtained last time, update the current position of the positioning target, and draw an indoor electronic map according to the updated current position;

Target positioning module: used to call semantic positioning calibration to detect static objects in the video frame, and use the calibration point to adopt the triangulation method (triangulation refers to observing the angle of the same point in two places to determine the distance of the point) Estimate the depth information of the static object, propose a multi-target object positioning model through the depth information of the static object and the semantic information of the spatial data, use the Gauss-Newton method to iteratively solve the positioning model, obtain the static object coordinate information and the current position of the mobile terminal, and convert the static object coordinate The information is combined with the current position of the mobile terminal to obtain the positioning result of the current position.

Specifically, the target positioning module includes:

Object recognition and depth calculation unit: used to take out the video frame in the image sliding window, input it into the target detection neural network, get the type of static object contained in the video frame, and set the center pixel position of the identified static object To calibrate the point, take out the next video frame and the pose information of the mobile terminal in the next video frame, and use the triangulation method to calculate the depth information between the calibration point and the mobile terminal; the formula of the triangulation method is as follows:

In formula (6), s ₁ and s ₂ are the depth information of key points.

Coarse positioning unit: used to obtain the coordinate information of the static object in the video frame; the position data of the coarse positioning comes from the coordinate information of the static object stored in the BIM spatial database, and retrieved near the coordinate information range of the current position through the recognized object category Load the BIM spatial information database to find out the coordinate information carried by the recognized object category.

Fine positioning unit: It is used to bring the coordinate information obtained by coarse positioning into the positioning model, and iteratively solve by Gauss Newton method to obtain the current position of the mobile terminal. Among them, the relationship between the static object and the terminal position should satisfy the following relationship:

(x _i -x) ² +(y _i -y) ² +(z _i -z) ² =δρ _i +σ _i (7)

Therefore, the following nonlinear equations can be established and solved:

Positioning result generating unit: used to combine the current position of the mobile terminal calculated by the fine positioning iteration (the position is the offset relative to the specific target coordinates of the building) with the static object coordinate information obtained from the coarse positioning to obtain the current position positioning result. Obtain the electronic map of the area according to the positioning result, superimpose the BIM information, and generate the nearby indoor electronic map of the current positioning result.

FIG. 5 is a schematic diagram of the hardware device structure of the mobile terminal vision fusion positioning method provided by an embodiment of the present application. As shown in Figure 5, the device includes one or more processors and memory. Taking a processor as an example, the device may also include: an input system and an output system.

The processor, the memory, the input system, and the output system may be connected by a bus or other methods. In FIG. 5, the connection by a bus is taken as an example.

As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs, non-transitory computer executable programs and modules. The processor executes various functional applications and data processing of the electronic device by running non-transitory software programs, instructions, and modules stored in the memory, that is, realizing the processing methods of the foregoing method embodiments.

The memory may include a program storage area and a data storage area, where the program storage area can store an operating system and an application program required by at least one function; the data storage area can store data and the like. In addition, the memory may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory may optionally include a memory remotely arranged with respect to the processor, and these remote memories may be connected to the processing system through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

The input system can receive input digital or character information, and generate signal input. The output system may include display devices such as a display screen.

The one or more modules are stored in the memory, and when executed by the one or more processors, the following operations of any of the foregoing method embodiments are performed:

According to the image sample data of the transmission line components to construct the mobile terminal visual fusion positioning model, it specifically includes the following steps:

Step a: Obtain the initial position of the mobile terminal based on the calibrated starting position and sensor information, and set the initial position as the current position of the positioning target;

Step b: Use a mobile terminal to obtain video frames;

The above-mentioned products can execute the methods provided in the embodiments of the present application, and have corresponding functional modules and beneficial effects for the execution methods. For technical details not described in detail in this embodiment, please refer to the method provided in the embodiment of this application.

The embodiments of the present application provide a non-transitory (non-volatile) computer storage medium, the computer storage medium stores computer executable instructions, and the computer executable instructions can perform the following operations:

Step b: Use a mobile terminal to obtain video frames;

The embodiment of the present application provides a computer program product, the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, when the program instructions are executed by a computer To make the computer do the following:

Step b: Use a mobile terminal to obtain video frames;

The mobile terminal vision fusion positioning method, system and electronic device of the embodiments of the application use visual sensors to detect and recognize static objects in the real world to obtain object spatial relationships, and perform geographic topological spatial matching with the object spatial relationships provided by the BIM model, and then according to the objects The distance measurement establishes a nonlinear equation set, iteratively solves the equation set, and converges to obtain a precise position, thereby realizing a more accurate, convenient and cheaper positioning method.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use this application. Various modifications to these embodiments will be obvious to those skilled in the art, and the general principles defined in this application can be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application will not be limited to the embodiments shown in this application, but should conform to the widest scope consistent with the principles and novel features disclosed in this application.

Claims

A mobile terminal vision fusion positioning method is characterized in that it comprises the following steps:

Step a: Obtain the initial position of the mobile terminal based on the calibrated starting position and sensor information, and set the initial position as the current position of the positioning target;

Step b: Use a mobile terminal to obtain video frames;

Step c: Detect the static object in the video frame, obtain the geographic coordinate information of the static object through the BIM spatial database, bring the coordinate information of the static object into the multi-target object positioning model, and use the Gauss Newton method to iteratively solve the problem. The positioning model obtains the current position of the mobile terminal, and combines the current position of the mobile terminal with the coordinate information of the static object to obtain the positioning result of the positioning target.
The mobile terminal visual fusion positioning method according to claim 1, characterized in that, in said step b, after said using the mobile terminal to obtain the video frame, it further comprises: the visual odometer calculates the current mobile terminal according to the obtained video frame Posture information.
The mobile terminal visual fusion positioning method according to claim 2, wherein the visual odometer calculating the current pose information of the mobile terminal according to the acquired video frames specifically comprises:

Step b1: The visual odometer scales the acquired video frame to the set size, saves it in the image sliding window, and judges whether the current video frame is the first frame, if the current video frame is the first frame, only the key point extraction operation is performed ; Otherwise, extract the key points and calculate the residual difference between the current key point and the key point of the previous video frame; the single key point residual e is the error of the key point pixel brightness, and the calculation formula is:

e=I 1 (x 1 )-I 2 (x 2 )=I 1 (Kp 1 )-I 2 (K(Rp 1 +t))

In the above formula, I 2 is obtained by I 1 after a certain movement, R and t are the motion trajectories of the mobile terminal, x 1 is the pixel position of the key point in the image I 1 , and x 2 is the pixel of the key point in the image I 2 Position, p 1 is the coordinate of the key point in real space, K is the internal parameter matrix of the mobile terminal;

Step b2: Use Gauss Newton's method to solve the residual Jacobian, obtain the motion pose of the current video frame and the previous video frame, and record them in the pose storage sliding window;

Step b3: Obtain the mobile terminal pose of the current video frame, extract the spatial offset of the pose, and convert the spatial offset into a relative coordinate offset value, which is the motion offset of the mobile terminal.
The mobile terminal visual fusion positioning method according to claim 3, wherein the step b further comprises: judging and monitoring the positioning state, and if it is not the positioning state, then comparing the current pose information of the mobile terminal with the last acquired After the current position of is added, the current position of the positioning target is updated; if it is in the positioning state, perform step c.
The mobile terminal visual fusion positioning method according to claim 4, characterized in that, in the step c, the static object in the video frame is detected, the coordinate information of the static object is obtained through the BIM spatial database, and the The coordinate information of the static object is brought into the multi-target object positioning model, and the positioning model is solved iteratively by the Gauss-Newton method, and obtaining the current position of the mobile terminal specifically includes:

Step c1: Take out the video frame and input it into the target detection neural network to obtain the type of static object contained in the video frame, and set the center pixel position of the static object as a calibration point, and then take out the next video frame And the mobile terminal pose information of the next video frame, using triangulation to calculate the calibration point and the depth information of the mobile terminal; wherein the triangulation formula is as follows:

In the above formula, s 1 and s 2 are the depth information of key points;

Step c2: Load the BIM spatial information database with the coordinate information of the current position through the identified static object category, and obtain the coordinate information of the static object according to the BIM spatial information database;

Step c3: Bring the coordinate information of the static object into the positioning model, and use Gauss-Newton method to iteratively solve to obtain the current position of the mobile terminal; the Gauss-Newton method to solve the equations is:

In the above formula, (x, y, z) is the current position of the mobile terminal, (x n , y n , z n ) is the static object coordinate information stored in the BIM spatial database, and ρ n is the depth of the static object to the mobile terminal , Σ n is the depth measurement noise;

Step c4: Combine the current position of the mobile terminal with the static object coordinate information to obtain a positioning result of the current position, and generate an indoor electronic map and a BIM spatial database of the positioning result.
A mobile terminal vision fusion positioning system, characterized in that it comprises:

Initial positioning unit: Based on the calibrated starting position and sensor information, obtain the initial position of the mobile terminal and set the initial position as the current position of the positioning target;

Video frame acquisition module: used to acquire video frames using mobile terminals;

Target positioning module: used to detect static objects in the video frame, obtain the coordinate information of the static objects through the BIM spatial database, bring the coordinate information of the static objects into the multi-target object positioning model, and iterate through the Gauss-Newton method Solve the positioning model, obtain the new current position of the mobile terminal, and combine the new current position of the mobile terminal with the coordinate information of the static object to obtain the positioning result of the positioning target.
The mobile terminal visual fusion positioning system according to claim 6, further comprising a pose calculation module, the pose calculation module is used to calculate the current pose information of the mobile terminal according to the acquired video frame through the visual odometer .
The mobile terminal visual fusion positioning system according to claim 7, wherein the pose calculation module comprises:

Key point extraction unit: used to zoom the acquired video frame to a set size through the visual odometer, save it in the image sliding window, and determine whether the current video frame is the first frame, if the current video frame is the first frame, only Carry out the operation of extracting key points; otherwise, extract the key points and calculate the residual difference between the current key point and the key point of the previous video frame; the single key point residual e is the error of the key point pixel brightness, and the calculation formula is:

e=I 1 (x 1 )-I 2 (x 2 )=I 1 (Kp 1 )-I 2 (K(Rp 1 +t))

In the above formula, I 2 is obtained by I 1 after a certain movement, R and t are the motion trajectories of the mobile terminal, x 1 is the pixel position of the key point in the image I 1 , and x 2 is the pixel of the key point in the image I 2 Position, p 1 is the coordinate of the key point in real space, K is the internal parameter matrix of the mobile terminal;

Motion pose solving unit: used to use Gauss-Newton method to solve the residual Jacobian, get the motion pose of the current video frame and the previous video frame, and record it to the pose storage sliding window;

Motion offset calculation unit: used to obtain the mobile terminal pose of the current video frame, extract the spatial offset of the pose, and convert the spatial offset into a relative coordinate offset value, which is the motion offset of the mobile terminal shift.
The mobile terminal vision fusion positioning system according to claim 8, further comprising a positioning judgment module and a position update module, the positioning judgment module monitors the positioning status by judging, and if it is not the positioning status, the location update module will After the current pose information of the mobile terminal is added to the current position acquired last time, the current position of the positioning target is updated; if it is in the positioning state, the positioning result of the positioning target is obtained through the target positioning module.
The mobile terminal visual fusion positioning system according to claim 9, wherein the target positioning module specifically comprises:

Object recognition and depth calculation unit: used to take out the video frame, input it into the target detection neural network, obtain the type of static object contained in the video frame, and set the center pixel position of the static object as a calibration point, Then take out the next video frame and the mobile terminal pose information of the next video frame, and use the triangulation method to calculate the calibration point and the depth information of the mobile terminal; wherein the triangulation method formula is as follows:

In the above formula, s 1 and s 2 are the depth information of key points;

Coarse positioning unit: used to load the BIM spatial information database with the coordinate information of the current position through the identified static object category, and obtain the coordinate information of the static object according to the BIM spatial information database;

Fine positioning unit: used to bring the coordinate information of the static object into the positioning model, and iteratively solve it by using the Gauss Newton method to obtain the current position of the mobile terminal; the Gauss Newton method to solve the equations is:

In the above formula, (x, y, z) is the current position of the mobile terminal, (x n , y n , z n ) is the static object coordinate information stored in the BIM spatial database, and ρ n is the depth of the static object to the mobile terminal , Σ n is the depth measurement noise;

Positioning result generating unit: used to combine the current position of the mobile terminal with the static object coordinate information to obtain the positioning result of the current position, and generate a nearby indoor electronic map.
An electronic device including:

At least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the one processor, and the instructions are executed by the at least one processor to enable The at least one processor can perform the following operations of the mobile terminal visual fusion positioning method described in any one of 1 to 5 above:

Step a: Obtain the initial position of the mobile terminal based on the calibrated starting position and sensor information, and set the initial position as the current position of the positioning target;

Step b: Use a mobile terminal to obtain video frames;

Step c: Detect the static object in the video frame, obtain the coordinate information of the static object through the BIM spatial database, bring the coordinate information of the static object into the multi-target object positioning model, and solve the iteratively by Gauss-Newton method The positioning model obtains the current position of the mobile terminal, and combines the current position of the mobile terminal with the coordinate information of the static object to obtain the positioning result of the positioning target.