CN114387351A - Monocular vision calibration method and computer readable storage medium - Google Patents

Monocular vision calibration method and computer readable storage medium

Info

Publication number
CN114387351A
CN114387351A (application CN202111573714.8A)
Authority
CN
China
Prior art keywords
information
fitted
pose
semantic
monocular
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111573714.8A
Other languages
Chinese (zh)
Inventor
张博
陈飞
石晓栊
李俊
黄韶丹
范高
杜霈
宗子轩
方炯
甘宇
黄抢林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Pipeline Network Group Sichuan to East Natural Gas Pipeline Co Ltd
Original Assignee
National Pipeline Network Group Sichuan to East Natural Gas Pipeline Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Pipeline Network Group Sichuan to East Natural Gas Pipeline Co Ltd filed Critical National Pipeline Network Group Sichuan to East Natural Gas Pipeline Co Ltd
Priority to CN202111573714.8A priority Critical patent/CN114387351A/en
Publication of CN114387351A publication Critical patent/CN114387351A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/80: Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/74: Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of navigation, and provides a monocular vision calibration method and a computer readable storage medium. The method combines traditional monocular visual SLAM with text recognition, which improves the accuracy of the monocular camera's pose estimation and yields a semantic map that is convenient for users.

Description

Monocular vision calibration method and computer readable storage medium
Technical Field
The invention relates to the technical field of navigation, in particular to a monocular vision calibration method and a computer readable storage medium.
Background
Visual SLAM (Simultaneous Localization and Mapping) is a technique for estimating a camera's own motion while simultaneously modeling the scene, and has been widely applied in fields such as autonomous driving, augmented reality, virtual reality, and robot navigation. Visual SLAM attempts to solve the following problem: as an agent moves through an unknown environment, how can its trajectory be determined, and a map of the surroundings be constructed, from the pictures taken by its camera? Conventional visual SLAM uses little semantic information during localization and mapping and is therefore limited in many application scenarios. Combining traditional SLAM with semantic information improves the practicality and robustness of the system, and is more consistent with how humans explore unknown environments.
Obtaining an accurate feature-matching relationship is a crucial component of monocular SLAM. Traditional monocular SLAM relies only on the matching relationships among a limited number of feature points, so it cannot always obtain an accurate camera pose estimate, and the map it generates is not highly accurate. Moreover, the generated map is a sparse point-cloud map, which has low practical value from the perspective of user interaction. It is therefore desirable to provide a monocular vision calibration method and a computer readable storage medium that solve at least the above problems.
Disclosure of Invention
It is an object of the present invention to provide a monocular vision calibration method and computer readable storage medium that at least partially overcome the deficiencies in the prior art.
According to an aspect of the present invention, there is provided a monocular vision calibration method, comprising the steps of:
acquiring an original image through a monocular camera, and extracting semantic information in the original image;
obtaining initial pose information of the monocular camera and initial coordinate information of the original image based on the semantic information and the original image;
continuously acquiring a plurality of images to be fitted according to a time sequence through the monocular camera, and obtaining pose information to be fitted of the monocular camera and coordinate information to be fitted of the images to be fitted based on a uniform velocity model, the initial pose information and the initial coordinate information;
judging the number of the images to be fitted containing the semantic information, fitting the initial pose information and the pose information to be fitted to obtain output pose information when the number of the images to be fitted containing the semantic information is not less than six, and fitting the initial coordinate information and the coordinate information to be fitted to obtain output three-dimensional scene information;
and calibrating the pose and the coordinates of the monocular camera based on the relative relationship between the output pose information and the output three-dimensional scene information.
Preferably, before the acquiring the original image by the monocular camera, the method further comprises: and calibrating the internal reference matrix and the distortion parameter of the monocular camera.
Preferably, the semantic information is a set of pixel points with text traits in the original image.
Preferably, the fitting the initial coordinate information and the coordinate information to be fitted to obtain the output three-dimensional scene information includes: fitting based on the initial coordinate information and the coordinate information to be fitted to obtain a plurality of semantic planes; and fitting based on the semantic plane to obtain the three-dimensional scene information.
Preferably, the fitting based on the initial coordinate information and the coordinate information to be fitted to obtain a plurality of semantic planes includes: and regarding the initial coordinate information containing the semantic information and the coordinate information to be fitted as being on the same plane, and obtaining the semantic plane.
Preferably, the calibrating the pose and the coordinates of the monocular camera based on the relative relationship between the output pose information and the output three-dimensional scene information includes: obtaining a reprojection error factor based on the output three-dimensional scene information; and calibrating the coordinates of the monocular camera through the reprojection error factor.
Preferably, a distance factor is obtained based on the output three-dimensional scene information, and the pose of the monocular camera is calibrated based on the distance factor and the reprojection error factor.
Preferably, based on the distance factor and the reprojection error factor, the pose of the monocular camera is calibrated through a factor graph optimization algorithm.
Preferably, the factor graph optimization algorithm is constructed based on the G2O library.
According to another aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a monocular vision calibration method as described in any one of the above.
The invention provides a monocular vision calibration method that collects an original image with a monocular camera, extracts the semantic information in the original image, continuously collects a plurality of images to be fitted in time sequence based on the semantic information and the original image, fits them to obtain output pose information and output three-dimensional scene information, and calibrates the pose and coordinates of the monocular camera based on the relative relationship between the output pose information and the output three-dimensional scene information. By combining traditional monocular visual SLAM with text recognition, the method improves the accuracy of the monocular camera's pose estimation and generates a semantic map that is convenient for users.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of a monocular vision calibration method according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. For convenience of description, only portions related to the invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The monocular camera in the embodiments of the present application is primarily a monocular digital camera, which may take various forms such as a pinhole camera or a wide-angle camera. The application targets monocular cameras because much existing equipment, such as vehicle-mounted cameras and cameras carried by unmanned aerial vehicles, is monocular, and retrofitting it with binocular cameras to improve positioning accuracy would be costly. One purpose of this application is to rely on the existing monocular camera and improve the positioning performance of the whole unmanned aerial vehicle or vehicle through software alone, achieving a substantial improvement in positioning accuracy while providing a semantic map that the unmanned aerial vehicle or vehicle can use directly. The method can also be applied in settings such as smart glasses equipped with a monocular camera.
As shown in fig. 1, the present invention provides a monocular vision calibration method, comprising the following steps:
s101: acquiring an original image through a monocular camera, and extracting semantic information in the original image;
s102: obtaining initial pose information of the monocular camera and initial coordinate information of the original image based on the semantic information and the original image;
s103: continuously acquiring a plurality of images to be fitted according to a time sequence through a monocular camera, and obtaining pose information to be fitted of the monocular camera and coordinate information to be fitted of the images to be fitted based on a uniform velocity model, initial pose information and initial coordinate information;
s104: judging the number of images to be fitted containing semantic information, fitting the initial pose information and the pose information to be fitted to obtain output pose information when the number of the images to be fitted containing the semantic information is not less than six, and fitting the initial coordinate information and the coordinate information to be fitted to obtain output three-dimensional scene information;
s105: and calibrating the pose and the coordinates of the monocular camera based on the relative relationship between the output pose information and the output three-dimensional scene information.
In process S101, the original image may be a color image or a black-and-white image, but its definition should be sufficient to distinguish the semantic information, in particular which pixels belong to semantic information and to which piece of semantic information each pixel belongs; the original image in the present application therefore has a certain requirement on definition.
In process S101, the task of detecting and extracting the two-dimensional text information in the picture taken by the camera at the current time may be completed using software such as Mask TextSpotter. Much software capable of semantic recognition is available; the details are well known to those skilled in the art and are not repeated here. After determining whether each pixel in the current picture belongs to a text feature, and to which text feature each such pixel corresponds, as a preferred implementation the two-dimensional text information set L in a single picture can be defined as:
L = {l_t = (w_t, p_t) | t = 0, 1, …, T}  (2.1)
where l_t denotes the t-th piece of two-dimensional text information, consisting of the word w_t contained in the text information and the vertex pixel coordinates p_t of the polygonal bounding box of the text information in the picture; T denotes the total number of pieces of two-dimensional text information detected in the current picture, and T = -1 indicates that no text information was detected in the current picture.
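The set L defined above can be modeled as a small data structure. The Python sketch below is illustrative only; the names (`TextRegion`, `text_info_set`) are ours, not the patent's:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextRegion:
    """One piece of two-dimensional text information l_t = (w_t, p_t)."""
    word: str                           # w_t: the word contained in the text
    polygon: List[Tuple[float, float]]  # p_t: vertex pixel coordinates of the bounding polygon

def text_info_set(regions: List[TextRegion]) -> dict:
    """Package the detections of one picture; T = len(regions) - 1,
    so an empty picture yields T == -1, matching the convention above."""
    return {"regions": regions, "T": len(regions) - 1}

frame = text_info_set([TextRegion("EXIT", [(10, 10), (60, 10), (60, 30), (10, 30)])])
empty = text_info_set([])
```

The T = -1 convention lets downstream steps skip pictures without text cheaply.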
In process S102, the initial coordinate information of the original image may be obtained by retrieval from an existing map based on the semantic information identified in process S101. This may be a general navigation map or a special map of a smaller scene, such as a three-dimensional navigation map of the interior of an office building; retrieving by semantic information thus quickly yields rough initial coordinate information, i.e. the point on the map to which the original image corresponds, although there remains some deviation from the accurate position of the monocular camera owing to its imprecise positioning. The initial pose information may be determined similarly: the pixels belonging to the same semantic information are regarded as lying in the same plane, and the semantic-information image formed after fitting is compared with a pre-stored semantic-information image to obtain a rough initial pose of the current monocular camera. This is well known to those skilled in the art and is not described in detail.
In process S103, the uniform velocity model is a model that, starting from the initial coordinate information and initial pose, continues moving at a uniform velocity in the monocular camera's direction of motion according to an empirically set parameter or the average velocity over a preceding interval. In practice the acquisition and processing frequency of the monocular camera is very high: in process S104, output pose information may begin to be output as soon as six images to be fitted containing the semantic information have been acquired, so the elapsed time can be very short and there is no need to track velocity changes precisely in between. Using the uniform velocity model therefore effectively reduces the amount of computation.
In process S104, the preceding processes have obtained the camera's initial pose and initial coordinate information by initializing the matching relationship of feature points carrying text semantic labels, and have obtained, via the uniform velocity model, the monocular camera's pose information to be fitted at each moment, generating new coordinate information to be fitted. It is continuously checked whether the number of images to be fitted that contain the semantic information is not less than six; once it is, in a preferred implementation a random sample consensus (RANSAC) algorithm may be used to perform plane fitting on the spatial point set. Here, as a preferred implementation, semantic planes may first be obtained by fitting, and the semantic planes then fitted to obtain the output three-dimensional scene information.
The specific implementation manner may be that the geometric expression of the semantic plane obtained by fitting is as follows:
π = (n_x, n_y, n_z, d)^T  (2.2)
where π denotes the semantic plane obtained by fitting, (n_x, n_y, n_z)^T is the unit normal vector of the plane, and d is the distance from the origin to the plane. After the semantic plane is obtained by fitting, the spatial plane is combined with the semantic information again to construct the output three-dimensional scene information, defined as:
Π_k = {W_k, π_k, Q_k, Y_k}  (2.3)
where Π_k denotes the k-th semantic plane, W_k the words contained in the semantic information corresponding to that plane, π_k the geometric expression of the spatial plane corresponding to the k-th semantic plane, Q_k the position coordinates of the center point of the k-th semantic plane, and Y_k the set of map points corresponding to the k-th semantic plane.
In this way, output three-dimensional scene information tightly coupled with the semantic information is obtained, which, in addition to assisting positioning, can support composite functions such as guidance by semantic information.
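The RANSAC plane-fitting step mentioned above can be sketched as follows. This is a minimal sketch under assumed defaults; the iteration count and inlier tolerance are our own illustrative choices, which the patent does not fix:

```python
import numpy as np

def fit_plane_ransac(points, iters=200, tol=0.01, seed=0):
    """Fit a plane pi = (n_x, n_y, n_z, d), with n . X + d = 0 and unit
    normal n, to an (N, 3) point set by random sample consensus."""
    rng = np.random.default_rng(seed)
    best_plane, best_count = None, -1
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                      # degenerate (collinear) sample
        n /= norm
        d = -n @ sample[0]
        inliers = int((np.abs(points @ n + d) < tol).sum())
        if inliers > best_count:
            best_plane, best_count = np.append(n, d), inliers
    return best_plane, best_count
```

On map points that genuinely share a text region, the consensus count approaches the full point set, which is what justifies treating those points as one semantic plane.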
In process S104, the fitting of the initial pose information and the pose information to be fitted to obtain the output pose information may be based on a factor graph optimization algorithm; the semantic information in the scene used by the invention provides supplementary cues for camera pose estimation. On one hand, the semantic information contains a large amount of feature-point information, which helps estimate the camera pose. On the other hand, the semantic information can be regarded as a plane with a specific label attribute, and by treating the spatial map points belonging to the same text region as lying in the same plane, the camera pose estimate can be further optimized. To exploit both effects, the planar semantic information is tightly coupled into the factor graph optimization model, and an accurate camera pose estimate is computed by jointly optimizing the reprojection error factor and the point-to-plane distance error factor. These processes are well known to those skilled in the art and are not described in detail here.
The invention provides a monocular vision calibration method that collects an original image with a monocular camera, extracts the semantic information in the original image, continuously collects a plurality of images to be fitted in time sequence based on the semantic information and the original image, fits them to obtain output pose information and output three-dimensional scene information, and calibrates the pose and coordinates of the monocular camera based on the relative relationship between the output pose information and the output three-dimensional scene information. By combining traditional monocular visual SLAM with text recognition, the method improves the accuracy of the monocular camera's pose estimation and generates a semantic map that is convenient for users.
As a preferred implementation, before process S101, the internal reference (intrinsic) matrix and distortion parameters of the monocular camera are calibrated. The calibration process may be implemented as follows:
1) carry out intrinsic calibration of the camera to obtain its distortion parameters and intrinsic matrix:
x_distorted = x (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2 x^2)
y_distorted = y (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y^2) + 2 p_2 x y
where [x, y] are the coordinates of a point on the normalized plane, [x_distorted, y_distorted] are the distorted coordinates, k_1, k_2, k_3, p_1, p_2 are the distortion terms, and r is the distance from the point to the origin of the coordinate system;
P = | f  0  O_x |
    | 0  f  O_y |
    | 0  0   1  |
where P is the camera intrinsic matrix, f is the camera focal length, and [O_x, O_y] is the principal point on the optical axis.
Completing the calibration of the intrinsic matrix and the distortion parameters thus facilitates the subsequent calibration process.
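The distortion model and intrinsic projection above can be sketched directly. This is a minimal sketch of the standard radial-plus-tangential model that the patent's symbols describe; the function names are ours:

```python
def distort(x, y, k1, k2, k3, p1, p2):
    """Apply the radial + tangential distortion model to a point [x, y]
    on the normalized image plane, returning [x_distorted, y_distorted]."""
    r2 = x * x + y * y                                   # r^2
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d

def to_pixel(x_d, y_d, f, ox, oy):
    """Map a (distorted) normalized point through the intrinsic matrix P."""
    return f * x_d + ox, f * y_d + oy
```

Note that the principal point maps to itself: a point at the origin of the normalized plane is unaffected by every distortion term.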
As a preferred implementation manner, in the processing S101, the semantic information is a set of pixel points having text traits in the original image.
As a preferred implementation manner, in the processing S104, the fitting the initial coordinate information and the coordinate information to be fitted to obtain the output three-dimensional scene information includes: fitting based on the initial coordinate information and the coordinate information to be fitted to obtain a plurality of semantic planes; and fitting based on the semantic plane to obtain three-dimensional scene information. The fitting based on the initial coordinate information and the coordinate information to be fitted to obtain a plurality of semantic planes comprises the following steps: and regarding the initial coordinate information containing the semantic information and the coordinate information to be fitted as being on the same plane to obtain a semantic plane.
As a preferred implementation manner, in the process S105, calibrating the pose and the coordinates of the monocular camera based on the relative relationship between the output pose information and the output three-dimensional scene information includes: obtaining a reprojection error factor based on the output three-dimensional scene information; and calibrating the coordinates of the monocular camera through the reprojection error factor.
The specific implementation process can be as follows: the reprojection error factor is defined as:
e_{i,j} = p_{i,j} - Proj(T_c P_{i,j})  (3.1)
where e_{i,j} denotes the reprojection error factor, P_{i,j} the coordinates of the j-th spatial point at time i, p_{i,j} the pixel coordinates of the feature point in the image corresponding to P_{i,j}, Proj the projection function of the camera, and T_c the camera pose estimate at the current time.
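The reprojection error factor can be sketched as follows. The patent leaves Proj generic, so this sketch assumes a normalized pinhole camera (focal length 1, principal point at the origin) purely for illustration:

```python
import numpy as np

def reprojection_error(p_ij, P_ij, T_c):
    """e_{i,j} = p_{i,j} - Proj(T_c P_{i,j}) for a normalized pinhole camera.
    T_c is a 4x4 world-to-camera pose, P_ij a 3D point, p_ij its observation."""
    Pc = (T_c @ np.append(P_ij, 1.0))[:3]   # transform the point into the camera frame
    uv = Pc[:2] / Pc[2]                     # pinhole projection onto the normalized plane
    return np.asarray(p_ij) - uv

# A correctly posed camera yields zero error on a perfect observation
err = reprojection_error([0.5, 0.0], [1.0, 0.0, 2.0], np.eye(4))
```

A pose error shows up directly as a nonzero residual, which is what the factor graph penalizes.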
As a preferred implementation mode, a distance factor is obtained based on the output three-dimensional scene information, and the pose of the monocular camera is calibrated based on the distance factor and the reprojection error factor.
The specific implementation mode can be as follows: point to plane distance factor (e)k,m) Is defined as:
ek,m=ΠkPk,m (3.2)
wherein P isk,mThe representation is the homogeneous representation of the m-th map point space coordinate belonging to the k-th semantic planekAnd the represented k-th three-dimensional semantic plane corresponds to a geometric expression form.
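Since π_k carries a unit normal, this factor is the signed distance of the map point from the plane, which a short sketch makes concrete (the function name is ours):

```python
import numpy as np

def point_plane_factor(pi_k, P_km):
    """e_{k,m} = pi_k^T P_{k,m}: signed distance of map point P_km (a 3-vector)
    from the plane pi_k = (n_x, n_y, n_z, d), whose normal is assumed unit."""
    return float(np.append(P_km, 1.0) @ np.asarray(pi_k))

plane = np.array([0.0, 0.0, 1.0, -2.0])   # the plane z = 2
```

Points on the plane give a residual of exactly zero, so pulling this factor toward zero pulls the reconstructed text points onto their shared plane.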
As a preferred implementation mode, the pose of the monocular camera is calibrated through a factor graph optimization algorithm based on the distance factor and the reprojection error factor. Wherein a factor graph optimization algorithm can be constructed based on the G2O library.
The specific implementation mode can be as follows:
The final camera pose information is obtained by jointly optimizing the reprojection error factor and the point-to-plane distance factor, where the loss function C of the least-squares model in the factor graph is defined as:
C = λ_1 Σ_{i,j} ρ_h( e_{i,j}^T Ω_{i,j}^{-1} e_{i,j} ) + λ_2 Σ_{k,m} ρ_h( e_{k,m}^T Ω_{k,m}^{-1} e_{k,m} )  (3.3)
where λ_1 is the weight of the reprojection error factor, λ_2 the weight of the point-to-plane distance factor, ρ_h the robust Huber function, e_{i,j} the reprojection error factor, e_{k,m} the spatial point-to-plane distance factor, Ω_{i,j} the covariance matrix of the reprojection error factor, and Ω_{k,m} the covariance matrix of the point-to-plane distance factor.
The optimization problem for solving the camera pose parameters is defined as:
T_c* = argmin over T_c of C  (3.4)
where T_c* denotes the camera pose estimate that finally needs to be solved for, and T_c is the initial value of the camera pose estimate in the factor graph optimization. The invention constructs the factor graph optimization problem with the G2O library and solves it with the Levenberg-Marquardt (LM) algorithm, finally obtaining a camera pose estimate with low error.
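G2O has no single canonical Python interface, so the joint optimization above can be illustrated instead with SciPy's `least_squares`. The sketch below is a deliberately simplified toy: translation-only pose (identity rotation), a normalized camera, unit weights, and no Huber kernel; the synthetic data and all names are ours, not the patent's:

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(t, pts2d, pts3d, plane, plane_pts, lam1=1.0, lam2=1.0):
    """Stack weighted reprojection and point-to-plane residuals
    (a translation-only toy version of the joint loss)."""
    res = []
    for uv, P in zip(pts2d, pts3d):
        Pc = P + t                                   # camera-frame point (R = I)
        res.extend(lam1 * (uv - Pc[:2] / Pc[2]))     # reprojection residual e_{i,j}
    for P in plane_pts:
        res.append(lam2 * (plane @ np.append(P + t, 1.0)))  # plane residual e_{k,m}
    return np.array(res)

# Recover a known translation from synthetic, noise-free data
t_true = np.array([0.1, -0.2, 0.3])
pts3d = np.array([[0., 0., 2.], [1., 0., 3.], [0., 1., 2.5], [1., 1., 4.]])
pts2d = np.array([(P + t_true)[:2] / (P + t_true)[2] for P in pts3d])
plane = np.array([0., 0., 1., -2.3])                 # plane z = 2.3 in the camera frame
plane_pts = np.array([[0., 0., 2.], [1., 1., 2.], [0., 1., 2.]])
sol = least_squares(residuals, np.zeros(3), args=(pts2d, pts3d, plane, plane_pts))
```

Because the data are noise-free, both residual groups vanish at the true translation, and the solver recovers it; in the full method, the same joint structure is what lets the text planes tighten the pose estimate.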
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The present application also provides a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a monocular vision calibration method as described above. The computer readable media may include both permanent and non-permanent, removable and non-removable media implemented in any method or technology for storage of information. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A monocular vision calibration method is characterized by comprising the following steps:
acquiring an original image through a monocular camera, and extracting semantic information in the original image;
obtaining initial pose information of the monocular camera and initial coordinate information of the original image based on the semantic information and the original image;
continuously acquiring a plurality of images to be fitted according to a time sequence through the monocular camera, and obtaining pose information to be fitted of the monocular camera and coordinate information to be fitted of the images to be fitted based on a uniform velocity model, the initial pose information and the initial coordinate information;
judging the number of the images to be fitted containing the semantic information, fitting the initial pose information and the pose information to be fitted to obtain output pose information when the number of the images to be fitted containing the semantic information is not less than six, and fitting the initial coordinate information and the coordinate information to be fitted to obtain output three-dimensional scene information;
and calibrating the pose and the coordinates of the monocular camera based on the relative relationship between the output pose information and the output three-dimensional scene information.
2. A monocular vision calibration method according to claim 1, further comprising, before said capturing of the original image by the monocular camera: and calibrating the internal reference matrix and the distortion parameter of the monocular camera.
3. The monocular vision calibration method of claim 1, wherein the semantic information is a set of pixel points in the original image having text traits.
4. The monocular vision calibration method of claim 1, wherein the fitting the initial coordinate information and the coordinate information to be fitted to obtain output three-dimensional scene information comprises: fitting based on the initial coordinate information and the coordinate information to be fitted to obtain a plurality of semantic planes; and fitting based on the semantic plane to obtain the three-dimensional scene information.
5. The monocular vision calibration method of claim 4, wherein the fitting based on the initial coordinate information and the coordinate information to be fitted to obtain a plurality of semantic planes comprises: and regarding the initial coordinate information containing the semantic information and the coordinate information to be fitted as being on the same plane, and obtaining the semantic plane.
6. The monocular vision calibration method of claim 1, wherein said calibrating the pose and coordinates of the monocular camera based on the relative relationship between the output pose information and the output three-dimensional scene information comprises: obtaining a reprojection error factor based on the output three-dimensional scene information; and calibrating the coordinates of the monocular camera through the reprojection error factor.
7. The monocular vision calibration method of claim 6, wherein a distance factor is obtained based on the output three-dimensional scene information, and the pose of the monocular camera is calibrated based on the distance factor and the reprojection error factor.
8. The monocular vision calibration method of claim 7, wherein the pose of the monocular camera is calibrated by a factor graph optimization algorithm based on the distance factor and the reprojection error factor.
9. The monocular vision calibration method of claim 8, wherein the factor graph optimization algorithm is constructed based on a G2O library.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the monocular vision calibration method according to any one of claims 1 to 9.
CN202111573714.8A 2021-12-21 2021-12-21 Monocular vision calibration method and computer readable storage medium Pending CN114387351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111573714.8A CN114387351A (en) 2021-12-21 2021-12-21 Monocular vision calibration method and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114387351A true CN114387351A (en) 2022-04-22

Family

ID=81197358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111573714.8A Pending CN114387351A (en) 2021-12-21 2021-12-21 Monocular vision calibration method and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114387351A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815847A (en) * 2018-12-30 2019-05-28 Information Science Academy of China Electronics Technology Group Corporation A visual SLAM method based on semantic constraints
CN110298921A (en) * 2019-07-05 2019-10-01 Qingdao Zhongke Zhibao Technology Co., Ltd. Construction method and processing device for a three-dimensional map with person semantic information
WO2020156923A2 (en) * 2019-01-30 2020-08-06 Harman Becker Automotive Systems Gmbh Map and method for creating a map
CN111968129A (en) * 2020-07-15 2020-11-20 Shanghai Jiao Tong University Simultaneous localization and mapping system and method with semantic perception

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QIYI TONG et al.: "TXSLAM: A Monocular Semantic SLAM Tightly Coupled with Planar Text Features", PROCEEDINGS OF THE 2022 IEEE 25TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN, 6 May 2022 (2022-05-06) *
LI XIAOHAN et al.: "Mono-SemSLAM: A Monocular Visual SLAM Method Based on Object Semantic Information", Proceedings of the 22nd Chinese Conference on System Simulation Technology and its Applications (CCSSTA22ND 2021), 10 October 2021 (2021-10-10) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100643A (en) * 2022-08-26 2022-09-23 潍坊现代农业与生态环境研究院 Monocular vision positioning enhancement method and equipment fusing three-dimensional scene semantics
CN115100643B (en) * 2022-08-26 2022-11-11 潍坊现代农业与生态环境研究院 Monocular vision positioning enhancement method and equipment fusing three-dimensional scene semantics

Similar Documents

Publication Publication Date Title
CN110070615B (en) Multi-camera cooperation-based panoramic vision SLAM method
CN109345588B (en) Tag-based six-degree-of-freedom attitude estimation method
CN111325796B (en) Method and apparatus for determining pose of vision equipment
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
EP2833322B1 (en) Stereo-motion method of three-dimensional (3-D) structure information extraction from a video for fusion with 3-D point cloud data
US8467596B2 (en) Method and apparatus for object pose estimation
CN109993793B (en) Visual positioning method and device
US20220319146A1 (en) Object detection method, object detection device, terminal device, and medium
CN114140527A (en) Dynamic environment binocular vision SLAM method based on semantic segmentation
CN114137564A (en) Automatic indoor object identification and positioning method and device
CN114140539A (en) Method and device for acquiring position of indoor object
CN113160315B (en) Semantic environment map representation method based on dual quadric surface mathematical model
CN114387351A (en) Monocular vision calibration method and computer readable storage medium
WO2021114775A1 (en) Object detection method, object detection device, terminal device, and medium
KR102249381B1 (en) System for generating spatial information of mobile device using 3D image information and method therefor
CN112509110A (en) Automatic image data set acquisition and labeling framework for land confrontation intelligent agent
CN115908564A (en) Storage line inspection method of automatic transportation equipment and automatic transportation equipment
CN115656991A (en) Vehicle external parameter calibration method, device, equipment and storage medium
KR102624644B1 (en) Method of estimating the location of a moving object using vector map
CN112507776A (en) Rapid large-range semantic map construction method
CN112598736A (en) Map construction based visual positioning method and device
Su Vanishing points in road recognition: A review
Garcia et al. A photogrammetric approach for real‐time visual SLAM applied to an omnidirectional system
Xu et al. Feature selection and pose estimation from known planar objects using monocular vision
Zhu et al. Toward the ghosting phenomenon in a stereo-based map with a collaborative RGB-D repair

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination