CN110889873A - Target positioning method and device, electronic equipment and storage medium - Google Patents

Target positioning method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN110889873A
Authority
CN
China
Prior art keywords
target
left view
calculating
view
disparity map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911175503.1A
Other languages
Chinese (zh)
Inventor
李子申
潘军道
吴海涛
李瑞东
刘振耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Academy of Opto Electronics of CAS
Original Assignee
Academy of Opto Electronics of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Academy of Opto Electronics of CAS filed Critical Academy of Opto Electronics of CAS
Priority to CN201911175503.1A priority Critical patent/CN110889873A/en
Publication of CN110889873A publication Critical patent/CN110889873A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85Stereo camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target positioning method and device, an electronic device, and a storage medium. The method comprises the following steps: calculating a disparity map based on a left view and a right view which are captured by a binocular camera and contain a target to be positioned; inputting the left view into a trained deep learning network and outputting a target mask in the left view; and calculating the three-dimensional space coordinates of the target to be positioned by a three-dimensional reconstruction projection method based on the disparity map and the target mask in the left view. The invention combines binocular stereo vision with deep learning: the binocular camera is used to calculate the positional deviation between corresponding points of the left and right views according to the triangulation principle, a deep learning method performs target identification on the image, and on the basis of target identification the scene targets are positioned in real time by combining three-dimensional reconstruction information. This simplifies the target positioning process, makes no distinction between primary and secondary targets, and calculates all target positions in the field of view simultaneously; the deep learning network can be trained for specific targets as well as general targets, expanding the scope of target positioning applications.

Description

Target positioning method and device, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of target positioning, and particularly relates to a target positioning method and device, electronic equipment and a storage medium.
Background
The airborne photoelectric imaging platform is all-weather photoelectric reconnaissance equipment which integrates high-precision measurement equipment such as a visible light camera, a thermal infrared imager, a television tracker, a laser range finder, an angle sensor and the like and is used for realizing functions such as aerial reconnaissance, target aiming, tracking, positioning and the like.
An airborne photoelectric platform generally adopts a single-point positioning method, in which the target pointed at by the cross-hair at the image center is positioned through an attitude-measurement/laser-ranging positioning model. Positioning multiple targets therefore requires frequently changing the spatial orientation of the airborne photoelectric platform and positioning repeatedly, which is time-consuming and makes real-time or quasi-real-time positioning of multiple targets difficult to achieve.
In the prior art, a multi-target autonomous positioning model based on pixel sight-line vectors has been proposed to position multiple targets simultaneously in real time or quasi-real time, establishing a multi-target autonomous positioning system for an airborne photoelectric imaging platform. The method obtains the pixel coordinates of each target in the field of view through a target detection algorithm, constructs a sight-line vector for each target according to the imaging principle of a single area-array Charge Coupled Device (CCD) sensor, calculates the pixel sight-line angle between each target and the main target at the image center, calculates the angle and distance relation between each target and the airborne photoelectric platform by combining the measured azimuth angle, elevation angle, and distance of the main target relative to the photoelectric platform, obtains the position and attitude information of the carrier aircraft using a Global Positioning System (GPS) and attitude measurement technology, and calculates the geodetic coordinates of multiple targets in a single image through homogeneous coordinate transformation.
After the photoelectric platform searches out a ground target, the main target is locked at the center of the field of view; the azimuth and elevation angles of the visual axis relative to the navigation attitude measurement system and the distance between the main target and the photoelectric platform are output, and at the same time the positioning data output by the GPS positioning system and the platform attitude data output by the navigation attitude measurement system are collected for coordinate conversion, from which the geodetic coordinates of the main target are calculated. For the other targets in the field of view (called secondary targets herein), the target detection module outputs their pixel coordinates, constructs a sight-line vector for each target and calculates its pixel sight-line angle relative to the primary target, calculates the distance and angle relation between each target and the photoelectric platform by combining the azimuth angle, elevation angle, and distance of the primary target relative to the photoelectric platform, and outputs the geodetic coordinates of the secondary targets through homogeneous coordinate transformation.
The target detection module simultaneously detects the pixel coordinates of a plurality of static or moving targets by adopting an image segmentation method, a frame difference method or an optical flow method.
Disclosure of Invention
To overcome the above problems, or at least partially solve them, embodiments of the present invention provide a target positioning method and apparatus, an electronic device, and a storage medium.
According to a first aspect of the embodiments of the present invention, there is provided a target positioning method, including:
calculating a disparity map based on a left view and a right view which are captured by a binocular camera and contain a target to be positioned;
inputting the left view into the trained deep learning network, and outputting a target mask in the left view;
and calculating the three-dimensional space coordinates of the target to be positioned by using a three-dimensional reconstruction projection method based on the disparity map and the target mask in the left view.
On the basis of the technical scheme, the invention can be improved as follows.
Preferably, the calculating the disparity map based on the left view and the right view captured by the binocular camera and containing the target to be positioned includes:
calibrating the binocular camera to obtain internal and external parameters of the binocular camera;
performing stereo rectification on the left view and the right view based on the internal and external parameters of the binocular camera, so that the left view and the right view keep line alignment;
and matching the left view and the right view by adopting a stereo matching method based on the corrected left view and the right view to obtain a disparity map.
Preferably, the stereo matching method is an efficient large-scale stereo matching method.
Preferably, the deep learning network is trained by:
training the deep learning network based on a left view training set, wherein the left view training set comprises a plurality of left views and pixel point positions of targets in each left view, the pixel point positions of the targets form a target mask, and the left view is captured by the binocular camera.
Preferably, the calculating three-dimensional space coordinates of the object to be positioned by using a three-dimensional reconstruction projection method based on the disparity map and the object mask in the left view includes:
calculating a reprojection matrix according to a stereoscopic vision principle and internal and external parameters of the binocular camera;
and calculating to obtain the three-dimensional space coordinate of the target to be positioned based on the reprojection matrix, the parallax map and the target mask in the left view.
Preferably, the calculating the three-dimensional space coordinate of the object to be positioned based on the reprojection matrix, the disparity map, and the object mask in the left view includes:
calculating the three-dimensional space coordinates of the object by the following formula:
$$ Q \begin{bmatrix} x \\ y \\ d \\ 1 \end{bmatrix} = \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} $$
where Q is the reprojection matrix, (x, y) are the pixel coordinates of the target to be positioned in the left view, d is the disparity of the disparity map at position (x, y), and (X/W, Y/W, Z/W) are the corresponding three-dimensional space coordinates of the target to be positioned in the scene.
Preferably, the deep learning network is a Mask RCNN deep neural network.
Preferably, the left view and the right view contain one or more targets to be positioned.
According to a second aspect of the embodiments of the present invention, there is provided an object locating apparatus, including:
the first calculation module is used for calculating a disparity map based on a left view and a right view which are captured by a binocular camera and contain a target to be positioned;
the output module is used for inputting the left view into the trained deep learning network and outputting a target mask in the left view;
and the second calculation module is used for calculating the three-dimensional space coordinates of the target to be positioned by using a three-dimensional reconstruction projection method based on the disparity map and the target mask in the left view.
According to a third aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, is able to perform the target positioning method provided in any one of the various possible implementations of the first aspect.
According to a fourth aspect of embodiments of the present invention, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the target location method provided in any one of the various possible implementations of the first aspect.
The embodiments of the present invention provide a target positioning method and device, an electronic device, and a storage medium, which combine binocular stereo vision with deep learning: a binocular camera is used to calculate the positional deviation between corresponding points of the left and right views according to the triangulation principle, a deep learning method performs target identification on the image, and on the basis of target identification the scene targets are positioned in real time by combining three-dimensional reconstruction information. This simplifies the target positioning process, makes no distinction between primary and secondary targets, and calculates all target positions in the field of view simultaneously; the deep learning network can be trained for specific targets as well as general targets, expanding the scope of target positioning applications.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic overall flow chart of a target positioning method according to an embodiment of the present invention;
fig. 2 is a flowchart of acquiring a disparity map of left and right views according to an embodiment of the present invention;
FIG. 3-1 is a schematic diagram of a stereo rectification model;
FIG. 3-2 is a schematic view of a binocular optical axis parallel model;
FIG. 4 is a flow chart of a three-dimensional reconstruction projection provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of an overall structure of a target positioning apparatus according to an embodiment of the present invention;
fig. 6 is a schematic view of an overall structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In an embodiment of the present invention, a target positioning method is provided, and fig. 1 is a schematic overall flow chart of the target positioning method provided in the embodiment of the present invention, where the method includes:
calculating a disparity map based on a left view and a right view which are captured by a binocular camera and contain a target to be positioned;
inputting the left view into the trained deep learning network, and outputting a target mask in the left view;
and calculating the three-dimensional space coordinates of the target to be positioned by using a three-dimensional reconstruction projection method based on the disparity map and the target mask in the left view.
It can be understood that in the embodiment of the present invention a binocular camera is adopted to capture images containing the target to be positioned: a left view and a right view containing the target are captured by the binocular camera, a disparity map is calculated from the left and right views, and the target mask corresponding to the target to be positioned in the left view is extracted by the trained deep learning network. Finally, based on the disparity map and the target mask in the left view, the three-dimensional space coordinates of the target to be positioned are calculated by a three-dimensional reconstruction projection method.
The embodiment of the invention combines binocular stereo vision with deep learning: a binocular camera is used to calculate the positional deviation between corresponding points of the left and right views according to the triangulation principle, a deep learning method performs target identification on the image, and on the basis of target identification the scene targets are positioned in real time by combining three-dimensional reconstruction information. This simplifies the target positioning process, makes no distinction between primary and secondary targets, and calculates all target positions in the field of view simultaneously; the deep learning network can be trained for specific targets as well as general targets, expanding the scope of target positioning applications.
Referring to fig. 2, on the basis of the above embodiment, in the embodiment of the present invention, the calculating a disparity map based on the left view and the right view captured by the binocular camera and including the target to be positioned includes:
calibrating the binocular camera to obtain internal and external parameters of the binocular camera;
performing stereo rectification on the left view and the right view based on the internal and external parameters of the binocular camera, so that the left view and the right view keep line alignment;
and matching the left view and the right view by adopting a stereo matching method based on the corrected left view and the right view to obtain a disparity map.
It can be understood that, in the embodiment of the present invention, the method for calculating the disparity map from the left and right views containing the target comprises a first step of calibrating the binocular camera (the binocular optical-axis-parallel model) to obtain its internal and external parameters. In the embodiment of the invention, the binocular camera is calibrated directly with the MATLAB calibration toolbox, yielding the internal parameters of the left and right cameras and the pose of the right camera relative to the left camera.
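While the embodiment uses the MATLAB calibration toolbox, equivalent internal and external parameters can also be obtained with OpenCV. The following is a minimal sketch for illustration only, not part of the disclosure; the chessboard geometry and image file names are assumptions.

import glob
import cv2
import numpy as np

# Hedged sketch: an OpenCV stand-in for the MATLAB calibration toolbox.
# Board size and file names below are illustrative assumptions.
PATTERN = (9, 6)  # inner corners of an assumed chessboard calibration board
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_pts, left_pts, right_pts = [], [], []
for lf, rf in zip(sorted(glob.glob("left_*.png")), sorted(glob.glob("right_*.png"))):
    gray_l = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    gray_r = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    ok_l, corners_l = cv2.findChessboardCorners(gray_l, PATTERN)
    ok_r, corners_r = cv2.findChessboardCorners(gray_r, PATTERN)
    if ok_l and ok_r:
        obj_pts.append(objp)
        left_pts.append(corners_l)
        right_pts.append(corners_r)

size = gray_l.shape[::-1]
# Internal parameters of each camera, then the pose (R, T) of the right
# camera relative to the left camera -- the "internal and external parameters".
_, K_l, D_l, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
_, K_r, D_r, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
_, K_l, D_l, K_r, D_r, R, T, _, _ = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K_l, D_l, K_r, D_r, size,
    flags=cv2.CALIB_FIX_INTRINSIC)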
The second step is to perform stereo rectification on the left view and the right view based on the internal and external parameters of the binocular camera, so that the two views remain row-aligned. Specifically, in practical applications binocular stereo vision requires image distortion correction; the left and right views are stereo-rectified into a standard optical-axis-parallel model in which the two imaging planes are coplanar and row-aligned, so that matching points need only be searched along image rows, laying the foundation for stereo matching.
The third step is to match the rectified left and right views with a stereo matching method to obtain the disparity map. To ensure real-time performance and reliability, the embodiment of the invention adopts the Efficient Large-Scale Stereo Matching (ELAS) method, a Bayesian approach that can compute accurate disparity maps for high-resolution images at near-real-time frame rates.
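As a concrete illustration of steps two and three, the sketch below rectifies the views and computes a disparity map. ELAS itself is not bundled with OpenCV, so semi-global block matching (StereoSGBM) is used here purely as a stand-in matcher; K_l, D_l, K_r, D_r, R, T, and size come from the calibration sketch above, and left_img / right_img denote the captured views.

import cv2

# Hedged sketch of stereo rectification followed by stereo matching.
# StereoSGBM stands in for ELAS, which OpenCV does not provide.
R_l, R_r, P_l, P_r, Q, _, _ = cv2.stereoRectify(K_l, D_l, K_r, D_r, size, R, T)
map_lx, map_ly = cv2.initUndistortRectifyMap(K_l, D_l, R_l, P_l, size, cv2.CV_32FC1)
map_rx, map_ry = cv2.initUndistortRectifyMap(K_r, D_r, R_r, P_r, size, cv2.CV_32FC1)

# After remapping, the two views are row-aligned, so correspondence search
# reduces to a one-dimensional search along each image row.
left_rect = cv2.remap(left_img, map_lx, map_ly, cv2.INTER_LINEAR)
right_rect = cv2.remap(right_img, map_rx, map_ry, cv2.INTER_LINEAR)

matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
# StereoSGBM returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left_rect, right_rect).astype("float32") / 16.0

Note that cv2.stereoRectify also returns the reprojection matrix Q that is used in the three-dimensional reconstruction step described later.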
On the basis of the above embodiments, in the embodiments of the present invention, the deep learning network is trained in the following manner:
training the deep learning network based on a left view training set, wherein the left view training set comprises a plurality of left views and pixel point positions of targets in each left view, the pixel point positions of the targets form a target mask, and the left view is captured by the binocular camera.
It is understood that the binocular optical-axis-parallel model is the simplest stereoscopic vision model, and it is modeled here in order to obtain the three-dimensional coordinates of a point in space. In practice, it is difficult to make the imaging planes of the left and right cameras strictly coplanar through camera placement alone, so stereo rectification is necessary. Schematic diagrams of binocular stereo imaging with the two camera optical axes parallel are shown in FIG. 3-1 and FIG. 3-2: FIG. 3-1 is a schematic diagram of the stereo rectification model for the left and right views, and FIG. 3-2 is a schematic diagram of the binocular optical-axis-parallel model.
The cameras conform to the pinhole imaging model, and the baseline distance T between the left and right cameras is constant. Assume the two cameras are identical, with focal lengths $f_1 = f_2 = f$, and that the principal points $c_l$ and $c_r$ (the intersections of the optical axes with the image planes) have been calibrated to have the same pixel coordinates in the left and right images. The optical centers of the left and right cameras serve as the origins $O_l$ and $O_r$ of the left-eye and right-eye camera coordinate systems; the line connecting them is their common x-axis, their optical axes are the respective z-axes, and the y-axes are perpendicular to the xz-plane (not shown in the schematic). $S_l$ and $S_r$ in FIG. 3-2 are the projections of the left and right imaging plane coordinate systems on the x-axis; each imaging plane coordinate system takes the top-left vertex of the image as its origin. A point $P(X_w, Y_w, Z_w)$ in the physical world projects into the left-eye and right-eye image plane coordinate systems at $(x_l, y_l)$ and $(x_r, y_r)$, respectively. From FIG. 3-2:

$$ dx_l = x_l - c_l $$
$$ dx_r = x_r - c_r $$

Let $d = dx_l - dx_r$; since $c_l$ and $c_r$ share the same pixel coordinates, $d = x_l - x_r$ is the disparity.

From similar triangles:

$$ \frac{T - (x_l - x_r)}{Z - f} = \frac{T}{Z} $$

from which it follows that:

$$ Z = \frac{fT}{x_l - x_r} = \frac{fT}{d} $$

When the left-eye camera coordinate system is taken as the world coordinate system WCS (world coordinate system):

$$ X = \frac{Z \cdot x_l}{f} = \frac{T \cdot x_l}{d} $$

and in the same way:

$$ Y = \frac{Z \cdot y_l}{f} = \frac{T \cdot y_l}{d} $$

where $x_l$ and $x_r$ are in millimeters. In practical applications, pixel coordinates are used instead:

$$ x_l = (x_{pl} - c_{pl}) \cdot S_x $$
$$ x_r = (x_{pr} - c_{pr}) \cdot S_x $$
$$ Z = \frac{fT}{(x_{pl} - x_{pr}) \cdot S_x} $$

where $x_{pl}$ and $x_{pr}$ are the coordinate positions in pixels, $c_{pl}$ and $c_{pr}$ are the pixel coordinates of the left and right view centers (in pixels), and $S_x$ is the pixel size in millimeters.

From the above analysis, the three-dimensional space coordinates of any spatial point can be obtained once the pixel coordinates of that point in the left and right views are known.
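As a quick numerical check of the last formula (the numbers are illustrative assumptions, not values from the disclosure): with focal length $f = 4$ mm, baseline $T = 120$ mm, pixel size $S_x = 0.006$ mm, and a measured pixel disparity of $x_{pl} - x_{pr} = 40$ pixels,

$$ Z = \frac{fT}{(x_{pl} - x_{pr}) \cdot S_x} = \frac{4 \times 120}{40 \times 0.006} = \frac{480}{0.24} = 2000 \ \text{mm} = 2 \ \text{m}. $$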
Therefore, to locate the target in the scene, that is, to obtain the three-dimensional space coordinates of the target in the scene, it is necessary to first obtain the position coordinates of the pixel points of each point of the target in the left and right views.
In the embodiment of the invention, the target in the view and its pixel coordinates are extracted by a deep learning network. The deep learning network is trained on left views: the target in each left view is extracted and its pixel coordinates are annotated, the annotated pixel coordinates of the target form a target mask, the left views together with the target mask in each left view form a left view training set, and the deep learning network is trained on this set to obtain the trained deep learning network.
For the target to be positioned, a left view and a right view of the target are captured with the binocular camera, the left view is input into the trained deep learning network, and the target mask of the target in the left view is output.
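A minimal inference sketch follows, in which a pretrained torchvision Mask R-CNN stands in for the network trained on the left-view training set described above; the confidence and mask thresholds are illustrative assumptions.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Hedged sketch: a pretrained torchvision Mask R-CNN stands in for the
# network trained on the left-view training set.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

with torch.no_grad():
    # left_rect is the rectified left view (H x W x 3, uint8) from earlier.
    pred = model([to_tensor(left_rect)])[0]

keep = pred["scores"] > 0.7                       # assumed confidence threshold
# Each boolean mask marks the pixel positions of one detected target.
masks = (pred["masks"][keep, 0] > 0.5).cpu().numpy()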
Referring to fig. 4, on the basis of the foregoing embodiments, in the embodiment of the present invention, the calculating the three-dimensional space coordinate of the target to be positioned by using a three-dimensional reconstruction projection method based on the disparity map and the target mask in the left view includes:
calculating a reprojection matrix according to a stereoscopic vision principle and internal and external parameters of the binocular camera;
and calculating to obtain the three-dimensional space coordinate of the target to be positioned based on the reprojection matrix, the parallax map and the target mask in the left view.
It can be understood that the above embodiment obtains the target mask of the target in the left view, and the three-dimensional space coordinates of the target to be positioned are then obtained by a three-dimensional reconstruction projection method. Three-dimensional reconstruction of the environment is usually completed in a non-contact manner, and non-contact three-dimensional reconstruction methods are divided into two types according to how the depth information of the target object is acquired: active and passive. Active three-dimensional reconstruction directly acquires the depth information of target objects in the environment by emitting light or energy sources such as laser and infrared toward them, and mainly includes the moire fringe method, the time-of-flight (TOF) method, and the structured light method. In contrast, passive three-dimensional reconstruction uses no specific light source: it relies on ambient illumination such as sunlight reflected from the scene, acquires image information of the object with cameras, and then realizes three-dimensional modeling of the object through a specific algorithm. The embodiment of the invention adopts a passive three-dimensional modeling method. The specific process is to calculate a reprojection matrix according to the stereoscopic vision principle and the internal and external parameters of the binocular camera, and then to calculate the three-dimensional space coordinates of the target based on the reprojection matrix, the disparity map, and the target mask in the left view.
On the basis of the foregoing embodiments, in the embodiments of the present invention, calculating the three-dimensional space coordinate of the target to be positioned based on the reprojection matrix, the disparity map, and the target mask in the left view includes:
calculating the three-dimensional space coordinates of the object by the following formula:
$$ Q \begin{bmatrix} x \\ y \\ d \\ 1 \end{bmatrix} = \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} $$
where Q is the reprojection matrix, (x, y) are the pixel coordinates of the target to be positioned in the left view, d is the disparity of the disparity map at position (x, y), and (X/W, Y/W, Z/W) are the corresponding three-dimensional space coordinates of the target to be positioned in the scene.
The target mask of the target to be positioned in the left view is extracted by the deep learning network (the coordinates of every pixel point of the target can be read from the target mask), and the three-dimensional space coordinates corresponding to each pixel point of the target are then calculated from those coordinates according to the above formula.
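A minimal sketch of this step, assuming OpenCV: cv2.reprojectImageTo3D applies the Q-matrix formula above to every pixel of the disparity map, and the target mask then selects the target's pixels. disparity, Q, and masks come from the earlier sketches; the centroid is just one simple way to summarize the per-pixel coordinates.

import numpy as np
import cv2

# reprojectImageTo3D applies Q to [x, y, d(x, y), 1]^T for every pixel and
# returns the homogeneous-normalized (X/W, Y/W, Z/W) coordinates.
points_3d = cv2.reprojectImageTo3D(disparity, Q)   # shape (H, W, 3)

target_mask = masks[0]                  # boolean mask of one target
valid = target_mask & (disparity > 0)   # discard pixels with no valid match
target_points = points_3d[valid]        # (N, 3) coordinates of target pixels
centroid = target_points.mean(axis=0)   # one simple single-point estimate
print("target centroid (X, Y, Z):", centroid)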
On the basis of the above embodiments, in the embodiment of the present invention, the left view and the right view contain one or more targets to be positioned. When there are multiple targets to be positioned in a scene, the disparity map is calculated based on the left and right views captured by the binocular camera and containing the targets; the left view is input into the trained deep learning network, and a target mask is output for each target in the left view; and based on the disparity map and the target mask of each target in the left view, the three-dimensional space coordinates of each target are calculated by the three-dimensional reconstruction projection method. In this way every target in the scene is positioned, realizing the positioning of multiple targets in the scene, as shown in the sketch below.
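Extending the earlier sketch, looping the same reprojection over every detected mask localizes all targets in the field of view at once, with no primary/secondary distinction:

# Hedged continuation of the sketch above: one 3-D position per detected mask.
for i, m in enumerate(masks):
    pts = points_3d[m & (disparity > 0)]
    if len(pts) > 0:
        print(f"target {i}: centroid {pts.mean(axis=0)}")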
In another embodiment of the invention, an object localization apparatus is provided for implementing the methods of the preceding embodiments. Therefore, the descriptions and definitions in the embodiments of the target positioning method described above can be used for understanding the execution modules in the embodiments of the present invention. Fig. 5 is a schematic diagram of an overall structure of an object locating apparatus according to an embodiment of the present invention, which includes a first calculating module 51, an output module 52, and a second calculating module 53.
The first calculation module 51 is used for calculating a disparity map based on a left view and a right view which are captured by the binocular camera and contain a target to be positioned;
an output module 52, configured to input the left view into the trained deep learning network, and output a target mask in the left view;
and the second calculating module 53 is configured to calculate three-dimensional space coordinates of the object to be positioned by using a three-dimensional reconstruction projection method based on the disparity map and the object mask in the left view.
The target positioning device provided in the embodiments of the present invention corresponds to the target positioning methods provided in the embodiments described above, and the relevant technical features of the provided target positioning device may refer to the relevant technical features of the target positioning method, which is not described herein again.
Fig. 6 illustrates a physical structure diagram of an electronic device, which, as shown in fig. 6, may include: a processor (processor) 01, a communication interface (Communications Interface) 02, a memory (memory) 03, and a communication bus 04, where the processor 01, the communication interface 02, and the memory 03 communicate with one another through the communication bus 04. The processor 01 may call logic instructions in the memory 03 to execute the following method: calculating a disparity map based on a left view and a right view which are captured by a binocular camera and contain a target to be positioned; inputting the left view into the trained deep learning network, and outputting a target mask in the left view; and calculating the three-dimensional space coordinates of the target to be positioned by using a three-dimensional reconstruction projection method based on the disparity map and the target mask in the left view.
In addition, when sold or used as an independent product, the logic instructions in the memory 03 may be implemented in the form of a software functional unit and stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example, including: calculating a disparity map based on a left view and a right view which are captured by a binocular camera and contain a target to be positioned; inputting the left view into the trained deep learning network, and outputting a target mask in the left view; and calculating the three-dimensional space coordinates of the target to be positioned by using a three-dimensional reconstruction projection method based on the target mask in the disparity map and the left view.
According to the target positioning method and device, electronic equipment, and storage medium described above, a binocular camera (binocular optical-axis-parallel model) is used to acquire the left and right views in real time and perform stereo rectification; the positional deviation between corresponding points of the two views is calculated according to the triangulation principle, specific target identification is performed on the image after scene information is acquired, and on the basis of target identification the scene targets are positioned in real time by combining three-dimensional reconstruction information. The method has the following advantages:
the binocular vision simulates the process of human eyes for perceiving the target object information in the space, and three-dimensional information of a space point is obtained through coordinates of one point in the space on left and right imaging planes on the basis of parallax and a triangular geometrical relationship by utilizing two cameras; compared with other devices, the binocular vision three-dimensional reconstruction does not need to add complex light source equipment, and has the advantages of reliability, convenience, appropriate precision, low cost, accordance with popular requirements and the like.
Detecting the pixel position of a target only requires acquiring image information of the target in advance and annotating it (marking the pixel coordinates of the target in the image). After a large number of acquired target images have been annotated, they are used to train the deep learning network until a parameter model meeting the requirements is obtained; the trained parameter model is then applied to newly input images to produce target detection results, the output being the pixel coordinates of the target in the left view.
Target detection based on deep learning is the mainstream detection approach in the current computer vision field; it relies on the hierarchical feature representations of images learned by a multilayer neural network and achieves higher accuracy than traditional detection methods. Binocular vision is combined with a deep learning network: the network identifies the target in the image and outputs its pixel position, and real-time positioning of the target is finally completed by combining the binocular three-dimensional reconstruction information. Mask RCNN extracts the target and its pixel positions, and the target position is solved by three-dimensional reprojection combined with binocular stereo vision, with a small amount of computation.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of locating an object, comprising:
calculating a disparity map based on a left view and a right view which are captured by a binocular camera and contain a target to be positioned;
inputting the left view into the trained deep learning network, and outputting a target mask in the left view;
and calculating the three-dimensional space coordinates of the target to be positioned by using a three-dimensional reconstruction projection method based on the disparity map and the target mask in the left view.
2. The method for locating the target according to claim 1, wherein the calculating the disparity map based on the left view and the right view captured by the binocular camera and containing the target to be located comprises:
calibrating the binocular camera to obtain internal and external parameters of the binocular camera;
performing stereo rectification on the left view and the right view based on the internal and external parameters of the binocular camera, so that the left view and the right view keep line alignment;
and matching the left view and the right view by adopting a stereo matching method based on the corrected left view and the right view to obtain a disparity map.
3. The method of claim 2, wherein the stereo matching method is an efficient large-scale stereo matching method.
4. The method of claim 1, wherein the deep learning network is trained by:
training the deep learning network based on a left view training set, wherein the left view training set comprises a plurality of left views and pixel point positions of targets in each left view, the pixel point positions of the targets form a target mask, and the left view is captured by the binocular camera.
5. The method for locating an object according to claim 1, wherein the calculating three-dimensional space coordinates of the object to be located by using a three-dimensional reconstruction projection method based on the disparity map and the object mask in the left view comprises:
calculating a reprojection matrix according to a stereoscopic vision principle and internal and external parameters of the binocular camera;
and calculating to obtain the three-dimensional space coordinate of the target to be positioned based on the reprojection matrix, the parallax map and the target mask in the left view.
6. The method of claim 5, wherein the calculating three-dimensional space coordinates of the object to be positioned based on the reprojection matrix, the disparity map, and the object mask in the left view comprises:
calculating the three-dimensional space coordinates of the object by the following formula:
$$ Q \begin{bmatrix} x \\ y \\ d \\ 1 \end{bmatrix} = \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} $$
where Q is the reprojection matrix, (x, y) are the pixel coordinates of the target to be positioned in the left view, d is the disparity of the disparity map at position (x, y), and (X/W, Y/W, Z/W) are the corresponding three-dimensional space coordinates of the target to be positioned in the scene.
7. The method of claim 1 or 4, wherein the deep learning network is a Mask RCNN deep neural network.
8. The method as claimed in claim 1, wherein the left view and the right view contain one or more targets to be located.
9. An object positioning device, comprising:
the first calculation module is used for calculating a disparity map based on a left view and a right view which are captured by a binocular camera and contain a target to be positioned;
the output module is used for inputting the left view into the trained deep learning network and outputting a target mask code in the left view;
and the second calculation module is used for calculating the three-dimensional space coordinates of the target to be positioned by using a three-dimensional reconstruction projection method based on the disparity map and the target mask in the left view.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the object localization method as claimed in any one of claims 1 to 8 are implemented by the processor when executing the program.
CN201911175503.1A 2019-11-26 2019-11-26 Target positioning method and device, electronic equipment and storage medium Pending CN110889873A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911175503.1A CN110889873A (en) 2019-11-26 2019-11-26 Target positioning method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911175503.1A CN110889873A (en) 2019-11-26 2019-11-26 Target positioning method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110889873A true CN110889873A (en) 2020-03-17

Family

ID=69748906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911175503.1A Pending CN110889873A (en) 2019-11-26 2019-11-26 Target positioning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110889873A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111413597A (en) * 2020-03-31 2020-07-14 北方夜视技术股份有限公司 Ultraviolet, infrared and visible light integrated high-voltage power transformation equipment detection method
CN111862758A (en) * 2020-09-02 2020-10-30 思迈(青岛)防护科技有限公司 Cardio-pulmonary resuscitation training and checking system and method based on artificial intelligence
CN111951332A (en) * 2020-07-20 2020-11-17 燕山大学 Glasses design method based on sight estimation and binocular depth estimation and glasses thereof
CN113658274A (en) * 2021-08-23 2021-11-16 海南大学 Individual spacing automatic calculation method for primate species behavior analysis
CN113870647A (en) * 2021-11-19 2021-12-31 山西宁志科技有限公司 Teaching training platform of visual identification system
CN115144879A (en) * 2022-07-01 2022-10-04 燕山大学 Multi-machine multi-target dynamic positioning system and method
CN115950436A (en) * 2023-03-13 2023-04-11 南京汽车人信息技术有限公司 Method and system for positioning moving object in given space and storage medium
CN116309849A (en) * 2023-05-17 2023-06-23 新乡学院 Crane positioning method based on visual radar

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103337094A (en) * 2013-06-14 2013-10-02 西安工业大学 Method for realizing three-dimensional reconstruction of movement by using binocular camera
CN103868460A (en) * 2014-03-13 2014-06-18 桂林电子科技大学 Parallax optimization algorithm-based binocular stereo vision automatic measurement method
CN106910222A (en) * 2017-02-15 2017-06-30 中国科学院半导体研究所 Face three-dimensional rebuilding method based on binocular stereo vision
CN108491810A (en) * 2018-03-28 2018-09-04 武汉大学 Vehicle limit for height method and system based on background modeling and binocular vision

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103337094A (en) * 2013-06-14 2013-10-02 西安工业大学 Method for realizing three-dimensional reconstruction of movement by using binocular camera
CN103868460A (en) * 2014-03-13 2014-06-18 桂林电子科技大学 Parallax optimization algorithm-based binocular stereo vision automatic measurement method
CN106910222A (en) * 2017-02-15 2017-06-30 中国科学院半导体研究所 Face three-dimensional rebuilding method based on binocular stereo vision
CN108491810A (en) * 2018-03-28 2018-09-04 武汉大学 Vehicle limit for height method and system based on background modeling and binocular vision

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
田萱等 (Tian Xuan et al.): 《基于深度学习的图像语义分割技术》 [Image Semantic Segmentation Technology Based on Deep Learning], 31 May 2019 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111413597A (en) * 2020-03-31 2020-07-14 北方夜视技术股份有限公司 Ultraviolet, infrared and visible light integrated high-voltage power transformation equipment detection method
CN111413597B (en) * 2020-03-31 2022-02-15 北方夜视技术股份有限公司 Ultraviolet, infrared and visible light integrated high-voltage power transformation equipment detection method
CN111951332A (en) * 2020-07-20 2020-11-17 燕山大学 Glasses design method based on sight estimation and binocular depth estimation and glasses thereof
CN111951332B (en) * 2020-07-20 2022-07-19 燕山大学 Glasses design method based on sight estimation and binocular depth estimation and glasses thereof
CN111862758A (en) * 2020-09-02 2020-10-30 思迈(青岛)防护科技有限公司 Cardio-pulmonary resuscitation training and checking system and method based on artificial intelligence
CN113658274A (en) * 2021-08-23 2021-11-16 海南大学 Individual spacing automatic calculation method for primate species behavior analysis
CN113658274B (en) * 2021-08-23 2023-11-28 海南大学 Automatic individual spacing calculation method for primate population behavior analysis
CN113870647A (en) * 2021-11-19 2021-12-31 山西宁志科技有限公司 Teaching training platform of visual identification system
CN115144879A (en) * 2022-07-01 2022-10-04 燕山大学 Multi-machine multi-target dynamic positioning system and method
CN115950436A (en) * 2023-03-13 2023-04-11 南京汽车人信息技术有限公司 Method and system for positioning moving object in given space and storage medium
CN116309849A (en) * 2023-05-17 2023-06-23 新乡学院 Crane positioning method based on visual radar
CN116309849B (en) * 2023-05-17 2023-08-25 新乡学院 Crane positioning method based on visual radar

Similar Documents

Publication Publication Date Title
CN110889873A (en) Target positioning method and device, electronic equipment and storage medium
CN110296691B (en) IMU calibration-fused binocular stereo vision measurement method and system
CN110070615B (en) Multi-camera cooperation-based panoramic vision SLAM method
CN109360240B (en) Small unmanned aerial vehicle positioning method based on binocular vision
CN109993793B (en) Visual positioning method and device
CN110044300B (en) Amphibious three-dimensional vision detection device and detection method based on laser
CN110176032B (en) Three-dimensional reconstruction method and device
CN107886477A (en) Unmanned neutral body vision merges antidote with low line beam laser radar
CN106408601B (en) A kind of binocular fusion localization method and device based on GPS
CN104376552A (en) Virtual-real registering algorithm of 3D model and two-dimensional image
CN202362833U (en) Binocular stereo vision-based three-dimensional reconstruction device of moving vehicle
CN111862180B (en) Camera set pose acquisition method and device, storage medium and electronic equipment
CN112837207B (en) Panoramic depth measurement method, four-eye fisheye camera and binocular fisheye camera
CN102072706A (en) Multi-camera positioning and tracking method and system
CN109425348A (en) A kind of while positioning and the method and apparatus for building figure
CN108764080B (en) Unmanned aerial vehicle visual obstacle avoidance method based on point cloud space binarization
CN111127540B (en) Automatic distance measurement method and system for three-dimensional virtual space
Jin et al. An Indoor Location‐Based Positioning System Using Stereo Vision with the Drone Camera
CN116128966A (en) Semantic positioning method based on environmental object
CN112330747B (en) Multi-sensor combined detection and display method based on unmanned aerial vehicle platform
CN210986289U (en) Four-eye fisheye camera and binocular fisheye camera
Ye et al. A calibration trilogy of monocular-vision-based aircraft boresight system
CN117115271A (en) Binocular camera external parameter self-calibration method and system in unmanned aerial vehicle flight process
Wu et al. Passive ranging based on planar homography in a monocular vision system
CN113034615B (en) Equipment calibration method and related device for multi-source data fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200317

RJ01 Rejection of invention patent application after publication