WO2022206255A1 - 视觉定位方法、视觉定位装置、存储介质与电子设备 - Google Patents

视觉定位方法、视觉定位装置、存储介质与电子设备 Download PDF

Info

Publication number
WO2022206255A1
WO2022206255A1 (PCT/CN2022/078435)
Authority
WO
WIPO (PCT)
Prior art keywords
frame image
coordinate system
current frame
transformation parameter
matching
Prior art date
Application number
PCT/CN2022/078435
Other languages
English (en)
French (fr)
Inventor
周宇豪
李姬俊男
郭彦东
Original Assignee
Oppo广东移动通信有限公司
Priority date
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司 filed Critical Oppo广东移动通信有限公司
Publication of WO2022206255A1 publication Critical patent/WO2022206255A1/zh
Priority to US18/372,477 priority Critical patent/US20240029297A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes
    • G06T7/66Analysis of geometric attributes of image moments or centre of gravity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757Matching configurations of points or features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Definitions

  • the present disclosure relates to the technical field of computer vision, and in particular, to a visual positioning method, a visual positioning device, a computer-readable storage medium, and an electronic device.
  • Visual positioning is a new type of positioning technology. It collects environmental images through image acquisition devices (such as mobile phones and RGB cameras) and, in cooperation with image algorithms and mathematical reasoning, calculates and updates the current pose in real time. With the advantages of high speed, high precision, and ease of use, it has been widely used in AR (Augmented Reality), indoor navigation, and other scenarios.
  • the present disclosure provides a visual positioning method, a visual positioning device, a computer-readable storage medium, and an electronic device.
  • a visual positioning method, comprising: acquiring a surface normal vector of a current frame image; determining a first transformation parameter between the current frame image and a reference frame image by projecting the surface normal vector to a Manhattan coordinate system; matching feature points of the current frame image and feature points of the reference frame image, and determining a second transformation parameter between the current frame image and the reference frame image according to the matching result; determining a target transformation parameter based on the first transformation parameter and the second transformation parameter; and outputting a visual positioning result corresponding to the current frame image according to the target transformation parameter.
  • a visual positioning device, comprising: a surface normal vector acquisition module configured to acquire a surface normal vector of a current frame image; a first transformation parameter determination module configured to determine a first transformation parameter between the current frame image and a reference frame image by projecting the surface normal vector to a Manhattan coordinate system; a second transformation parameter determination module configured to match feature points of the current frame image and feature points of the reference frame image and determine a second transformation parameter between the current frame image and the reference frame image according to the matching result; a target transformation parameter determination module configured to determine a target transformation parameter based on the first transformation parameter and the second transformation parameter; and a visual positioning result output module configured to output a visual positioning result corresponding to the current frame image according to the target transformation parameter.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, implements the visual positioning method of the first aspect and possible implementations thereof.
  • an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the executable instructions to The visual positioning method of the above-mentioned first aspect and possible implementations thereof are performed.
  • FIG. 1 shows a schematic structural diagram of an electronic device in this exemplary embodiment
  • FIG. 2 shows a flowchart of a visual positioning method in this exemplary embodiment
  • FIG. 3 shows a schematic diagram of the current frame image and the surface normal vector in this exemplary embodiment
  • FIG. 4 shows a schematic diagram of processing by a surface normal vector estimation network in this exemplary embodiment
  • FIG. 5 shows a flow chart of obtaining a surface normal vector in this exemplary embodiment
  • FIG. 6 shows a flow chart of determining the first transformation parameter in this exemplary embodiment
  • FIG. 7 shows a schematic diagram of determining the camera coordinate system-Manhattan coordinate system transformation parameters corresponding to the current frame image in this exemplary embodiment
  • FIG. 8 shows a flowchart of feature point matching in this exemplary embodiment
  • FIG. 9 shows a schematic diagram of feature point matching in this exemplary embodiment
  • Fig. 10 shows a flow chart of outputting the visual positioning result in this exemplary embodiment
  • FIG. 11 shows a schematic diagram of projecting a three-dimensional point cloud in this exemplary embodiment
  • FIG. 12 shows a flowchart of matching feature points and projection points in this exemplary embodiment
  • FIG. 13 shows a schematic diagram of a three-dimensional point cloud of the target scene in this exemplary embodiment
  • FIG. 14 shows a flowchart of a visual positioning initialization in this exemplary embodiment
  • Fig. 15 shows the flow chart of determining the target transformation parameter of the ith frame in this exemplary embodiment
  • FIG. 16 shows a schematic diagram of determining the target transformation parameter of the i-th frame in this exemplary embodiment
  • FIG. 17 shows a schematic structural diagram of a visual positioning device in this exemplary embodiment
  • FIG. 18 shows a schematic structural diagram of another visual positioning device in this exemplary embodiment.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
  • numerous specific details are provided in order to give a thorough understanding of the embodiments of the present disclosure.
  • those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced without one or more of the specific details, or other methods, components, devices, steps, etc. may be employed.
  • well-known solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
  • In order to improve the accuracy of visual positioning in a weak-texture environment, schemes using auxiliary sensors have appeared in the related art. For example, laser sensors (such as lidar) or depth sensors (such as RGB-D cameras) are used to directly obtain depth information at the corresponding positions of image pixels, so as to recover three-dimensional point cloud information for visual positioning. However, this increases the hardware cost of implementing the solution.
  • Exemplary embodiments of the present disclosure first provide a visual positioning method. Its application scenarios include but are not limited to the following: when a user is inside a shopping mall and needs indoor navigation, the environment can be photographed through a terminal with a photographing function; the terminal extracts feature points from the environment image and uploads them to the cloud, the cloud executes the visual positioning method of this exemplary embodiment to determine the positioning result of the terminal, and indoor navigation services are provided accordingly.
  • Exemplary embodiments of the present disclosure also provide an electronic device for performing the above-mentioned visual positioning method.
  • the electronic device may be the above-mentioned terminal or a server in the cloud, including but not limited to a computer, a smart phone, a wearable device (such as augmented reality glasses), a robot, a drone, and the like.
  • an electronic device includes a processor and a memory.
  • the memory is used to store executable instructions of the processor, and may also store application data, such as image data, video data, etc.; the processor is configured to execute the visual positioning method in this exemplary embodiment by executing the executable instructions.
  • The following takes the mobile terminal 100 in FIG. 1 as an example to illustrate the structure of the above electronic device. It will be understood by those skilled in the art that, apart from components specifically intended for mobile purposes, the configuration in FIG. 1 can also be applied to stationary devices.
  • the mobile terminal 100 may specifically include: a processor 110, an internal memory 121, an external memory interface 122, a USB (Universal Serial Bus) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 171, a receiver 172, a microphone 173, a headphone jack 174, a sensor module 180, a display screen 190, a camera module 191, an indicator 192, a motor 193, a button 194, a SIM (Subscriber Identification Module) card interface 195, and the like.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an AP (Application Processor, application processor), a modem processor, a GPU (Graphics Processing Unit, graphics processor), an ISP (Image Signal Processor, image signal processor), controller, encoder, decoder, DSP (Digital Signal Processor, digital signal processor), baseband processor and/or NPU (Neural-Network Processing Unit, neural network processor), etc.
  • the encoder can encode (i.e., compress) image or video data, for example, encode the captured scene image to form corresponding code stream data so as to reduce the bandwidth occupied by data transmission; the decoder can decode (i.e., decompress) the image or video code stream data to restore the image or video data, for example, decode the code stream data of the scene image to obtain complete image data, which facilitates the execution of the positioning method of this exemplary embodiment.
  • the mobile terminal 100 may support one or more encoders and decoders.
  • the mobile terminal 100 can process images or videos in various encoding formats, for example, image formats such as JPEG (Joint Photographic Experts Group), PNG (Portable Network Graphics), and BMP (Bitmap), and video formats such as MPEG (Moving Picture Experts Group) 1, MPEG2, H.263, H.264, and HEVC (High Efficiency Video Coding).
  • the processor 110 may include one or more interfaces through which connections are formed with other components of the mobile terminal 100.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include volatile memory and nonvolatile memory.
  • the processor 110 executes various functional applications and data processing of the mobile terminal 100 by executing the instructions stored in the internal memory 121 .
  • the external memory interface 122 can be used to connect an external memory, such as a Micro SD card, so as to expand the storage capacity of the mobile terminal 100.
  • the external memory communicates with the processor 110 through the external memory interface 122 to implement data storage functions, such as storing images, videos and other files.
  • the USB interface 130 is an interface conforming to the USB standard specification, and can be used to connect a charger to charge the mobile terminal 100, and can also be connected to an earphone or other electronic devices.
  • the charging management module 140 is used to receive charging input from the charger. While charging the battery 142, the charging management module 140 can also supply power to the device through the power management module 141; the power management module 141 can also monitor the state of the battery.
  • the wireless communication function of the mobile terminal 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • the mobile communication module 150 may provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the mobile terminal 100 .
  • the wireless communication module 160 can provide wireless communication solutions applied on the mobile terminal 100, including WLAN (Wireless Local Area Network) (such as a Wi-Fi (Wireless Fidelity) network), BT (Bluetooth), GNSS (Global Navigation Satellite System), FM (Frequency Modulation), NFC (Near Field Communication), IR (Infrared), and the like.
  • the mobile terminal 100 may implement a display function through the GPU, the display screen 190 and the AP, and display a user interface. For example, when the user turns on the shooting function, the mobile terminal 100 may display a shooting interface, a preview image, and the like on the display screen 190 .
  • the mobile terminal 100 can realize the shooting function through the ISP, the camera module 191, the encoder, the decoder, the GPU, the display screen 190, the AP, and the like.
  • a user can start a service related to visual positioning, and trigger a shooting function.
  • the camera module 191 can collect an image of the current scene and perform positioning.
  • the mobile terminal 100 may implement audio functions through an audio module 170, a speaker 171, a receiver 172, a microphone 173, an earphone interface 174, an AP, and the like.
  • the sensor module 180 may include a depth sensor 1801 , a pressure sensor 1802 , a gyro sensor 1803 , an air pressure sensor 1804 , etc., to implement corresponding sensing detection functions.
  • the indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.
  • the motor 193 can generate vibration prompts, and can also be used for touch vibration feedback and the like.
  • the keys 194 include a power-on key, a volume key, and the like.
  • the mobile terminal 100 may support one or more SIM card interfaces 195 for connecting the SIM cards to realize functions such as calling and mobile communication.
  • FIG. 2 shows an exemplary flow of the visual positioning method, which may include:
  • Step S210 obtaining the surface normal vector of the current frame image
  • Step S220 by projecting the surface normal vector to the Manhattan coordinate system, determining the first transformation parameter between the current frame image and the reference frame image;
  • Step S230 matching the feature points of the current frame image and the feature points of the reference frame image, and determining the second transformation parameter between the current frame image and the reference frame image according to the matching result;
  • Step S240 using the first transformation parameter to optimize the second transformation parameter to obtain the target transformation parameter
  • Step S250 output the visual positioning result corresponding to the current frame image according to the target transformation parameter.
  • the projection of the surface normal vector in the current frame image in the Manhattan coordinate system is used to determine the first transformation parameter between the current frame image and the reference frame image, and thus the second transformation parameter obtained through feature point matching is determined. Perform optimization, and output the visual positioning result according to the target transformation parameters obtained after optimization.
  • the present exemplary embodiment determines the transformation parameters through the projection of the surface normal vector and the matching of the feature points, and combines the results of the two aspects for visual positioning, which reduces the dependence on one aspect and improves the solution's performance. robustness. In particular, the dependence on the feature quality in the image is reduced.
  • the second transformation parameter can be optimized to improve the accuracy of the target transformation parameter and the final visual positioning result, so as to solve the visual positioning in the weak texture environment.
  • the present exemplary embodiment adopts the method of feature point matching to determine the second transformation parameter, and the amount of computation required for feature point extraction and matching processing is low, which is beneficial to improve the response speed and the real-time performance of visual positioning.
  • the present exemplary embodiment can be implemented based on a common monocular RGB camera, without adding auxiliary sensors, and has a low implementation cost.
  • step S210 the surface normal vector of the current frame image is acquired.
  • the current frame image is an image currently shot for a target scene, and the target scene is an environmental scene where the user is currently located, such as a room, a shopping mall, and the like.
  • the terminal usually needs to continuously shoot multiple frames of scene images, and the current frame image is the latest frame image.
  • the surface normal vector of the current frame image includes the surface normal vector of at least a part of the pixels in the current frame image.
  • Assuming the height and width of the current frame image are H and W respectively, the current frame image has H*W pixels; the surface normal vector of each pixel consists of 3 axis coordinates, so the surface normal vector of the current frame image includes H*W*3 values.
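  • As a concrete illustration (a minimal numpy sketch, not taken from the patent; the image size here is an arbitrary assumption), the surface normal vectors can be stored as an H*W*3 array of per-pixel unit vectors:

```python
import numpy as np

# Hypothetical surface-normal map for an H x W current frame image.
H, W = 480, 640
normals = np.random.randn(H, W, 3)                         # stand-in for a network output
normals /= np.linalg.norm(normals, axis=2, keepdims=True)  # each pixel's normal has unit length

assert normals.shape == (H, W, 3)     # i.e. H*W*3 values in total
print(normals[0, 0])                  # the 3D surface normal of pixel (0, 0)
```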
  • each pixel in the current frame image I0 includes a pixel value (such as an RGB value).
  • a surface normal vector estimation network may be pre-trained, which may be a deep learning network such as a CNN (Convolutional Neural Network); the surface normal vector estimation network is used to process the current frame image to obtain the surface normal vector of the current frame image.
  • Figure 4 shows a schematic diagram of the structure of the surface normal vector estimation network, which can be obtained by modifying the traditional U-Net (a U-shaped network) structure.
  • the network mainly includes three parts: encoding sub-network (Encoder), decoding sub-network (Decoder), convolution sub-network (Convolution).
  • the process of obtaining the surface normal vector may include:
  • Step S510 using the coding sub-network to downsample the current frame image to obtain the downsampled intermediate image and the downsampled target image;
  • Step S520 using the decoding sub-network to perform an upsampling operation on the downsampled target image and a splicing operation with the downsampled intermediate image to obtain the upsampled target image;
  • Step S530 using a convolution sub-network to perform a convolution operation on the up-sampled target image to obtain a surface normal vector.
  • After the current frame image is input into the surface normal vector estimation network, it first enters the encoding sub-network, where multiple downsampling operations are performed in sequence. For example, when g downsampling operations (g ≥ 2) are performed, the images obtained from the 1st to the (g-1)-th downsampling operations are downsampled intermediate images, and the image obtained by the g-th downsampling operation is the downsampled target image. Both the downsampled intermediate images and the downsampled target image can be regarded as images output by the encoding sub-network.
  • the downsampling operation can capture the semantic information in the image.
  • the down-sampled target image enters the decoding sub-network, and performs multiple up-sampling and splicing operations in sequence. After each up-sampling operation, the obtained image is spliced with the corresponding down-sampling intermediate image to obtain an up-sampling intermediate image, and the up-sampling and splicing operations are performed on the up-sampling intermediate image for the next time.
  • the up-sampling operations of the decoding sub-network correspond to the down-sampling operations of the encoding sub-network, which helps to localize the semantic information in the image. After completing the last upsampling and stitching operations, the decoding sub-network outputs the upsampled target image.
  • the splicing operation can also be regarded as a link in the upsampling operation, and the above-mentioned upsampling and splicing operations are collectively referred to as the upsampling operation, which is not limited in the present disclosure.
  • the convolutional sub-network can be composed of multiple scale-invariant convolutional layers.
  • the training process of the surface normal vector estimation network may include the following steps:
  • the loss function of the surface normal vector estimation network is established as follows:
  • where p_l(x)(x) is the AP (Average Precision) probability function, l(x) ∈ {1, ..., K} indexes the pixel point x, and each pixel point has a weight value; pixel points closer to boundary positions in the image usually have higher weight values;
  • Input a dataset containing RGB images and surface normal vector images, such as Taskonomy and NYUv2, to start training;
  • the structure and parameters of the surface normal vector estimation network are solidified and saved as a corresponding file for recall.
  • each downsampling operation can include two 3*3 convolution operations and a 2*2 max pooling operation, and the number of convolution kernels used in the 5 downsampling operations doubles successively (64, 128, 256, 512, and 1024); after each downsampling operation a corresponding downsampled intermediate image is obtained, with sizes decreasing successively, and the downsampled target image is obtained after the last downsampling operation.
  • The Decoder then performs, on the downsampled target image, multiple upsampling operations corresponding to the downsampling operations; each upsampling operation can include a 2*2 transposed convolution (or deconvolution) operation, a stitching operation with the downsampled intermediate image of the same size, and two 3*3 convolution operations. After each upsampling operation a corresponding upsampled intermediate image is obtained, with sizes increasing successively, and the upsampled target image is obtained after the last upsampling operation. The Convolution sub-network then performs a full convolution operation on the upsampled target image and finally outputs the surface normal vector.
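  • The sketch below illustrates this encoder-decoder-convolution structure in PyTorch; it is a reduced example with only two downsampling stages and small channel widths chosen for brevity, not the patent's exact architecture, and the layer choices are assumptions:

```python
import torch
import torch.nn as nn

class NormalEstimationNet(nn.Module):
    """Reduced U-Net-style sketch: encoder (downsampling), decoder (upsampling plus
    splicing with the downsampled intermediate image), and a convolution head that
    outputs a 3-channel surface normal map."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                        # 2*2 max pooling
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)  # 2*2 transposed convolution
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, 3, 1)                    # scale-preserving convolution head

    def forward(self, x):
        d1 = self.enc1(x)                          # intermediate features before pooling
        d2 = self.enc2(self.pool(d1))              # downsampled target image
        u = torch.cat([self.up(d2), d1], dim=1)    # upsample and splice (skip connection)
        out = self.head(self.dec(u))               # per-pixel 3D vector
        return nn.functional.normalize(out, dim=1) # unit-length surface normals

net = NormalEstimationNet()
normal_map = net(torch.rand(1, 3, 480, 640))       # shape (1, 3, H, W)
```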
  • the first transformation parameter between the current frame image and the reference frame image is determined by projecting the surface normal vector to the Manhattan coordinate system.
  • the Manhattan World hypothesis refers to the assumption that there is vertical or orthogonal information in the environment.
  • For example, the indoor floor is perpendicular to the walls, the walls are perpendicular to the ceiling, and the front wall is perpendicular to the walls on both sides, so that a coordinate system containing such perpendicularity information can be constructed, namely the Manhattan coordinate system.
  • However, the above perpendicular pairs, such as ground and wall or wall and ceiling, generally do not appear perpendicular in the camera coordinate system or image coordinate system corresponding to the current frame image. It can be seen that there is a definite transformation relationship between the Manhattan coordinate system and the camera coordinate system or image coordinate system.
  • the transformation parameters between the camera coordinate systems corresponding to the different frame images can be determined by using the transformation parameters between the Manhattan coordinate system and the camera coordinate systems corresponding to the different frame images.
  • the present exemplary embodiment can determine the transformation parameters between the camera coordinate systems corresponding to the current frame image and the reference frame image.
  • the above-mentioned first transformation parameter between the current frame image and the reference frame image represents the transformation parameter between the camera coordinate systems corresponding to the two images determined by the Manhattan coordinate system.
  • the reference frame image may be any frame or multiple frames of images of known poses captured for the target scene. For example, in the case of continuously capturing scene images for visual positioning, the previous frame image may be used as the reference frame image.
  • the transformation parameters described in this exemplary embodiment may include a rotation parameter (eg, a rotation matrix) and a translation parameter (eg, a translation vector).
  • step S210 the surface normal vector of each pixel in the current frame image may be acquired.
  • step S220 the surface normal vector of each pixel can be projected to the Manhattan coordinate system, and the first transformation parameter can be obtained through subsequent processing.
  • step S220 may be implemented by the following steps S610 to S630:
  • step S610 the above-mentioned surface normal vector is mapped to the Manhattan coordinate system by using the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image.
  • the camera coordinate system generally refers to the Cartesian coordinate system with the optical center of the camera as the origin, which can be expressed as SO(3);
  • the Manhattan coordinate system generally refers to the vector coordinate system formed by the unit sphere, which can be expressed as so(3), as shown by the normal vector sphere in Figure 7; each point on the normal vector sphere represents the position of the end point of a surface normal vector after its starting point in the camera coordinate system is moved to the center of the sphere.
  • the surface normal vector obtained in step S210 is the information in the camera coordinate system, which can be mapped to the Manhattan coordinate system by using the camera coordinate system-Manhattan coordinate system transformation parameter.
  • the directions of the x, y, and z axes in the camera coordinate system are related to the pose of the camera.
  • the directions of the x, y, and z axes in the Manhattan coordinate system are determined when the Manhattan coordinate system is established for the target scene and are related to real-world information of the target scene; no matter how the camera moves, the directions of the x, y, and z axes in the Manhattan coordinate system are fixed. Therefore, the transformation relationship between the camera coordinate system and the Manhattan coordinate system includes the rotation transformation relationship between the two coordinate systems.
  • the above camera coordinate system-Manhattan coordinate system transformation parameter may be a camera coordinate system-Manhattan coordinate system relative rotation matrix.
  • step S610 may include:
  • the three-dimensional axis coordinates of the above-mentioned surface normal vector in the camera coordinate system corresponding to the reference frame image are mapped to the three-dimensional axis coordinates in the Manhattan coordinate system;
  • Assume R_cM = [r1 r2 r3] ∈ SO(3), where r1, r2, and r3 represent the three axes of the Manhattan coordinate system, respectively.
  • Denote the surface normal vector obtained in step S210 as n_k. Mapping n_k to the Manhattan coordinate system yields n_k', the three-dimensional axis coordinate of the surface normal vector n_k in the Manhattan coordinate system; n_k' is then projected onto the plane tangential to the corresponding axis of the Manhattan coordinate system, yielding m_k', the two-dimensional coordinate of the surface normal vector n_k on that tangential plane.
  • Representing the surface normal vector by its two-dimensional coordinates on the tangential plane makes it easier to calculate its offset.
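  • A minimal numpy sketch of this mapping and tangential-plane representation is shown below; the convention that R_cM rotates Manhattan-frame vectors into the camera frame, and the axis-angle style tangent-plane coordinates, are assumptions chosen for illustration rather than the patent's exact formulas:

```python
import numpy as np

def to_tangent_plane(n_k, R_cM):
    """Map a camera-frame surface normal n_k into the Manhattan frame (n_k') and
    express it as 2D coordinates m_k' on the tangential plane of its nearest axis."""
    n_prime = R_cM.T @ n_k                          # n_k' in Manhattan coordinates (assumed convention)
    axis_idx = int(np.argmax(np.abs(n_prime)))      # nearest Manhattan axis (x, y or z)
    e = np.zeros(3)
    e[axis_idx] = np.sign(n_prime[axis_idx])        # unit vector along that axis
    residual = n_prime - np.dot(n_prime, e) * e     # component orthogonal to the axis
    norm = np.linalg.norm(residual)
    if norm < 1e-9:
        return axis_idx, np.zeros(2)                # already aligned with the axis
    theta = np.arccos(np.clip(np.dot(n_prime, e), -1.0, 1.0))
    basis = np.delete(np.eye(3), axis_idx, axis=0)  # 2D basis spanning the tangential plane
    m_prime = theta * (basis @ (residual / norm))   # m_k': 2D coordinate on the tangential plane
    return axis_idx, m_prime

R_cM = np.eye(3)                                    # illustrative C-M rotation of the reference frame
axis_idx, m_k = to_tangent_plane(np.array([0.05, 0.02, 0.999]), R_cM)
```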
  • step S620 based on the offset of the surface normal vector in the Manhattan coordinate system, the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the current frame image is determined.
  • In the ideal case, after mapping, the surface normal vector should be aligned with an axis direction of the Manhattan coordinate system. In practice, the camera coordinate system-Manhattan coordinate system transformation parameter R_cM of the reference frame image is used to map the surface normal vector of the current frame image, and after mapping the surface normal vector is offset from the axial direction of the Manhattan coordinate system, meaning that the mapping point of the surface normal vector on the unit sphere is not at the axial position. This offset is caused by the inconsistency between the camera coordinate system-Manhattan coordinate system transformation parameter R_cM corresponding to the reference frame image and the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the current frame image.
  • the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the current frame image can be calculated by using the offset of the surface normal vector.
  • step S620 may include:
  • Cluster the two-dimensional coordinates of the surface normal vector on the tangential plane, and determine the offset of the surface normal vector on the tangential plane according to the cluster center;
  • the camera coordinate system-Manhattan coordinate system transformation parameters corresponding to the current frame image are determined according to the camera coordinate system-Manhattan coordinate system transformation parameters corresponding to the reference frame image and the above-mentioned offset three-dimensional axis coordinates in the Manhattan coordinate system.
  • the projection points of the surface normal vector on the tangential plane are actually clustered, and the coordinates of each dimension can be clustered separately.
  • the specific method is not limited.
  • In the clustering, c is the width of the Gaussian kernel, and s_j' is the offset of m_k', which represents the two-dimensional coordinate of the offset on the tangential plane.
  • s_j includes the coordinates of the three axes x, y, and z in the Manhattan coordinate system and can represent the vector by which each axis of the Manhattan coordinate system needs to be updated. Therefore, on the basis of the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image, the three-dimensional axis coordinates included in the offset s_j are used to update the transformation parameter, and the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the current frame image is obtained.
  • The SVD algorithm (Singular Value Decomposition) can be used to make the result satisfy the orthogonality constraint of a rotation matrix, so as to improve the accuracy of the relative rotation matrix.
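  • For example, the updated (possibly non-orthogonal) matrix can be projected back onto the set of valid rotation matrices with an SVD, as in this short numpy sketch:

```python
import numpy as np

def project_to_rotation(M):
    """Return the rotation matrix closest to M in the Frobenius norm,
    enforcing orthogonality and det = +1 via SVD."""
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])   # guard against a reflection
    return U @ D @ Vt

# Illustrative use: re-orthogonalize an updated camera-Manhattan rotation estimate.
R_updated = np.eye(3) + 0.05 * np.random.randn(3, 3)
R_cM_current = project_to_rotation(R_updated)
```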
  • step S630 the first transformation parameter between the current frame image and the reference frame image is determined according to the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image and the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the current frame image.
  • the transformation parameter between the current frame image and the reference frame image, that is, the relative pose relationship between c1 and c2, can then be calculated, where c1 and c2 represent the reference frame image and the current frame image, respectively. The relative transformation parameter calculated through the mapping of the surface normal vectors is called the first transformation parameter.
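  • Written out under one common convention (assumed here for illustration: R_cM rotates a Manhattan-frame vector into the corresponding camera frame), the rotation part of the first transformation parameter follows from the two camera-Manhattan rotations:

```latex
% A Manhattan-frame vector v_M satisfies v_{c_1} = R_{c_1 M} v_M and v_{c_2} = R_{c_2 M} v_M,
% so the rotation taking camera frame c_2 to camera frame c_1 is
R_{c_1 c_2} = R_{c_1 M}\, R_{c_2 M}^{\top}
```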
  • step S230 the feature points of the current frame image and the feature points of the reference frame image are matched, and the second transformation parameter between the current frame image and the reference frame image is determined according to the matching result.
  • Feature points refer to the locally representative points in the image, which can reflect the local features of the image.
  • Generally, feature points are extracted from regions with rich texture, such as edges and corners, and the feature points are described in a certain way to obtain feature point descriptors.
  • This exemplary embodiment can extract and describe feature points using algorithms such as FAST (Features from Accelerated Segment Test), BRIEF (Binary Robust Independent Elementary Features), ORB (Oriented FAST and Rotated BRIEF), SIFT (Scale-Invariant Feature Transform), SURF (Speeded Up Robust Features), SuperPoint (feature point detection and descriptor extraction based on self-supervised learning), and R2D2 (Reliable and Repeatable Detector and Descriptor).
  • the SIFT feature is taken as an example for description.
  • the SIFT feature describes the feature points detected in the image with a 128-dimensional feature vector, which is invariant to image scaling, translation, and rotation, and also has certain invariance to illumination, affine and projection transformations.
  • Feature points are extracted from the reference frame image and the current frame image and represented by SIFT features, and the SIFT features in the two images are matched.
  • the similarity of the SIFT feature vectors of two feature points can be calculated, for example, the similarity is measured by Euclidean distance, cosine similarity, etc. If the similarity is high, it means that the two feature points are matched and a matching point pair is formed.
  • a set of matching point pairs in the reference frame image and the current frame image is formed to obtain the above matching result.
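  • A hedged OpenCV sketch of this extraction-and-matching step is given below; the brute-force matcher with Lowe's ratio test is one common implementation choice, not necessarily the patent's, and the image file names are placeholders:

```python
import cv2

ref_img = cv2.imread("reference_frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
cur_img = cv2.imread("current_frame.png", cv2.IMREAD_GRAYSCALE)    # placeholder path

sift = cv2.SIFT_create()
kp_ref, des_ref = sift.detectAndCompute(ref_img, None)   # 128-dimensional SIFT descriptors
kp_cur, des_cur = sift.detectAndCompute(cur_img, None)

# Brute-force matching on Euclidean descriptor distance, filtered by Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des_ref, des_cur, k=2)
matches = [p[0] for p in knn if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

pts_ref = [kp_ref[m.queryIdx].pt for m in matches]  # matched point pair coordinates
pts_cur = [kp_cur[m.trainIdx].pt for m in matches]
```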
  • step S230 may include:
  • Step S810 matching the feature points of the reference frame image to the feature points of the current frame image to obtain first matching information
  • Step S820 matching the feature points of the current frame image to the feature points of the reference frame image to obtain second matching information
  • Step S830 obtaining a matching result according to the first matching information and the second matching information.
  • Assume that the reference frame image contains M feature points and the current frame image contains N feature points, each described by a d-dimensional (such as 128-dimensional) descriptor; then the local descriptors of the reference frame image form a matrix D_M*d and the local descriptors of the current frame image form a matrix D_N*d. Matching D_M*d to D_N*d yields the first matching information.
  • In step S810 and step S820 the matching directions are different, so the obtained matching results also differ; they are expressed as the first matching information and the second matching information, respectively, which can be M*N and N*M matrices indicating the matching probabilities between feature points obtained through the different matching directions.
  • first matching information and the second matching information may be integrated to obtain a final matching result.
  • the first matching information and the second matching information may be intersected to obtain a matching result.
  • The first matching probability of each feature point pair is determined from the first matching information, and the second matching probability of each feature point pair is determined from the second matching information; for the same feature point pair, the smaller of the first matching probability and the second matching probability is taken as the comprehensive matching probability; feature point pairs whose comprehensive matching probability is higher than a preset matching threshold are then screened out to obtain the matching result.
  • Alternatively, feature point pairs whose matching probability is higher than the matching threshold are filtered out of the first matching information to obtain a first matching point pair set, and feature point pairs whose matching probability is higher than the matching threshold are filtered out of the second matching information to obtain a second matching point pair set; the intersection of the first matching point pair set and the second matching point pair set is then taken to obtain the matching result.
  • In this way, cross-check matching is realized, which ensures the quality of the matching point pairs.
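  • A small numpy sketch of this cross-check idea, taking the element-wise minimum of the two matching-probability matrices as the comprehensive probability (the matrix shapes follow the M*N and N*M description above; the threshold value and mutual-best-match check are assumptions):

```python
import numpy as np

def cross_check(prob_ref_to_cur, prob_cur_to_ref, threshold=0.8):
    """prob_ref_to_cur: (M, N) matching probabilities, reference -> current.
    prob_cur_to_ref: (N, M) matching probabilities, current -> reference.
    Returns (i, j) pairs whose comprehensive probability exceeds the threshold
    and that are each other's best match in both directions."""
    comprehensive = np.minimum(prob_ref_to_cur, prob_cur_to_ref.T)  # smaller value per pair
    pairs = []
    for i in range(comprehensive.shape[0]):
        j = int(np.argmax(comprehensive[i]))
        if comprehensive[i, j] >= threshold and int(np.argmax(comprehensive[:, j])) == i:
            pairs.append((i, j))
    return pairs

M, N = 100, 120
pairs = cross_check(np.random.rand(M, N), np.random.rand(N, M))
```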
  • a union of the first matching information and the second matching information may also be obtained to obtain a matching result.
  • the difference from the above method of taking the intersection is that the larger value of the first matching probability and the second matching probability is taken as the comprehensive matching probability, or the union of the first matching point pair set and the second matching point pair set is taken.
  • The geometric constraint relationship between the images, such as the epipolar constraint, can also be used to filter the matching point pairs; algorithms such as RANSAC (Random Sample Consensus) can be used for this purpose.
  • FIG. 9 shows the matching relationship of feature points between the reference frame image and the current frame image. Based on the matching relationship of the feature points, the SVD algorithm can be used to calculate the second transformation parameter between the two images, which can include a rotation matrix and a translation vector.
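  • When only the 2D-2D matches and the camera intrinsics K are available, one standard way to obtain such a relative rotation and translation is to estimate and decompose the essential matrix (OpenCV's recoverPose uses an SVD-based decomposition internally); this is a substitute illustration rather than the patent's exact computation:

```python
import cv2
import numpy as np

def relative_pose_from_matches(pts_ref, pts_cur, K):
    """pts_ref, pts_cur: (k, 2) float arrays of matched pixel coordinates.
    K: 3x3 camera intrinsic matrix. Returns a rotation matrix and a
    unit-scale translation vector (the second transformation parameter)."""
    E, inlier_mask = cv2.findEssentialMat(pts_ref, pts_cur, K,
                                          method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_ref, pts_cur, K, mask=inlier_mask)
    return R, t
```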
  • step S240 the target transformation parameter is determined based on the first transformation parameter and the second transformation parameter.
  • the first transformation parameter is determined by the projection of the surface normal vector
  • the second transformation parameter is determined through the feature point matching. Both algorithms may have certain limitations.
  • This exemplary embodiment obtains a more accurate target transformation parameter by combining the first transformation parameter and the second transformation parameter. For example, one of the first transformation parameter and the second transformation parameter is used to optimize the other, and the target transformation parameter is obtained after the optimization.
  • the first transformation parameter includes a first rotation matrix
  • the second transformation parameter includes a second rotation matrix. Either of them can be optimized using the BA algorithm (Bundle Adjustment).
  • step S240 may include:
  • the loss function is iteratively minimized, and the adjusted second rotation matrix is used as the rotation matrix in the target transformation parameter.
  • An exemplary loss function for optimizing the rotation matrix can be as follows:
  • The second rotation matrix is the quantity to be optimized; it is iteratively adjusted to decrease the value of the loss function until convergence, and the adjusted second rotation matrix is taken as the rotation matrix in the target transformation parameter. In addition, the translation vector in the second transformation parameter can be used as the translation vector in the target transformation parameter. From this, the target transformation parameter, including the rotation matrix and the translation vector, is obtained.
  • step S250 the visual positioning result corresponding to the current frame image is output according to the target transformation parameter.
  • the target transformation parameter is used to represent the relative pose relationship between the current frame image and the reference frame image.
  • the pose of the reference frame image has been determined.
  • affine transformation is performed through the target transformation parameters to obtain the visual positioning result corresponding to the current frame image.
  • The visual positioning result can be a 6DoF (Degree of Freedom) pose.
  • step S250 may include:
  • Step S1010 determining the first pose corresponding to the current frame image according to the pose corresponding to the target transformation parameter and the reference frame image;
  • Step S1020 using the first pose to project the three-dimensional point cloud of the target scene to the plane of the current frame image to obtain the corresponding projection point;
  • Step S1030 matching the feature point and the projection point of the current frame image, and determining the second pose of the current frame image according to the matching point pair of the feature point and the projection point of the current frame image;
  • Step S1040 outputting the second pose as the visual positioning result corresponding to the current frame image.
  • the first pose and the second pose respectively refer to the poses of the current frame image determined in different ways.
  • affine transformation is performed based on the pose corresponding to the reference frame image, and the obtained pose is the first pose.
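  • A minimal sketch of composing the first pose from the reference pose and the target transformation parameter, using 4x4 homogeneous matrices (the left/right composition order depends on the pose convention and is an assumption here):

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.asarray(t).ravel()
    return T

def first_pose(T_reference, R_rel, t_rel):
    """Apply the target transformation parameter to the known reference-frame pose
    to obtain the first pose of the current frame."""
    return T_reference @ to_homogeneous(R_rel, t_rel)
```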
  • This exemplary embodiment further optimizes the first pose to obtain a more accurate second pose, which is output as the final visual positioning result.
  • the target scene is the scene captured by the current frame image and the reference frame image, and is also the scene where the device to be positioned is currently located.
  • Figure 11 shows a schematic diagram of projecting a 3D point cloud onto the image plane of the current frame. After the 3D point cloud is projected from the world coordinate system to the camera coordinate system or image coordinate system corresponding to the current frame image, the corresponding projection points are obtained. The feature points and projection points of the current frame image are then matched to obtain matching point pairs.
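  • The projection itself can be sketched with the standard pinhole camera model (the world-to-camera convention of the first pose below is an assumption):

```python
import numpy as np

def project_point_cloud(points_world, R_wc, t_wc, K):
    """Project an (N, 3) world-frame point cloud into the current frame's image plane.
    R_wc, t_wc: world -> camera transform taken from the first pose; K: 3x3 intrinsics."""
    pts_cam = points_world @ R_wc.T + t_wc     # world coordinates -> camera coordinates
    in_front = pts_cam[:, 2] > 1e-6            # keep only points in front of the camera
    pts_cam = pts_cam[in_front]
    uv = pts_cam @ K.T                         # apply the intrinsic matrix
    uv = uv[:, :2] / uv[:, 2:3]                # perspective division -> pixel coordinates
    return uv, in_front
```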
  • This exemplary embodiment can describe the projection points in the same way as the feature points, for example, SIFT feature description is performed on the feature points, and SIFT feature description is also performed on the projection points. The similarity of the feature vector can be used to match the feature point and the projection point.
  • the above-mentioned matching of feature points and projection points of the current frame image may include:
  • Step S1210 matching the projection point to the feature point of the current frame image to obtain third matching information
  • Step S1220 matching the feature points of the current frame image to the projection points to obtain fourth matching information
  • Step S1230 Obtain a matching point pair between the feature point of the current frame image and the projection point according to the third matching information and the fourth matching information.
  • the second pose is obtained by solving with algorithms such as PnP (Perspective-n-Point, an algorithm for solving a pose based on 2D-3D matching relationships).
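  • A hedged OpenCV sketch of this 2D-3D solving step, using solvePnPRansac as one common implementation:

```python
import cv2
import numpy as np

def solve_second_pose(object_pts, image_pts, K, dist_coeffs=None):
    """object_pts: (k, 3) 3D points from the point cloud matched to current-frame features.
    image_pts: (k, 2) pixel coordinates of the corresponding feature points.
    Returns the rotation matrix and translation vector of the second pose."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(object_pts, dtype=np.float64),
        np.asarray(image_pts, dtype=np.float64),
        K, dist_coeffs)
    if not ok:
        raise RuntimeError("PnP solving failed")
    R, _ = cv2.Rodrigues(rvec)      # rotation vector -> rotation matrix
    return R, tvec
```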
  • The visual positioning method shown in Figure 2 can be applied to SLAM (Simultaneous Localization and Mapping) scenarios, and can also be applied to visual positioning scenarios in which a map has already been constructed.
  • the following takes the scene of visual positioning when a map has been constructed as an example to further illustrate the process of visual positioning.
  • In the offline stage, staff use mobile phones, panoramic cameras, and other image acquisition equipment to collect images of the target scene, and construct a map formed by a three-dimensional point cloud through the SFM (Structure from Motion) process, as shown in Figure 13.
  • Figure 14 shows a flowchart for visual positioning initialization.
  • On the one hand, the first frame image is input, its surface normal vector is obtained, and the Manhattan coordinate system is constructed.
  • The origin and axis directions of the Manhattan coordinate system can be set to be the same as those of the camera coordinate system corresponding to the first frame image, thereby obtaining the initial camera coordinate system-Manhattan coordinate system transformation parameters, in which the rotation matrix R_c1M can be an identity matrix and the translation vector T_c1M can be a zero vector.
  • On the other hand, feature points are extracted from the first frame image and described by SIFT features, and the positions and descriptors of the feature points are saved for the processing of subsequent image frames. This completes the visual positioning initialization.
  • FIG. 15 shows a flowchart of processing the image of the i-th frame
  • FIG. 16 is a schematic diagram of processing the image of the i-th frame.
  • The i-th frame image (i ≥ 2) is input and its surface normal vector is obtained; the (i-1)-th frame image is taken as the reference frame image, and the camera coordinate system-Manhattan coordinate system transformation parameter (C-M transformation parameter) of the (i-1)-th frame is obtained; this transformation parameter is used to map the surface normal vector of the i-th frame to the Manhattan coordinate system M, and the offset is calculated by clustering to obtain the C-M transformation parameter of the i-th frame; the first transformation parameter between the i-th frame and the (i-1)-th frame is then calculated.
  • Feature points are extracted from the i-th frame image, described by SIFT features, and matched with the feature points of the i-1-th frame image to obtain the second transformation parameter between the i-th frame and the i-1-th frame.
  • the second transformation parameter is optimized by using the first transformation parameter to obtain the target transformation parameter.
  • Based on the target transformation parameter and the pose of the (i-1)-th frame, the first pose of the i-th frame is output, and the 3D point cloud of the target scene is re-projected through the first pose to obtain the corresponding projection points; matching point pairs of projection points and feature points are obtained based on the SIFT features of the projection points and the SIFT features of the i-th frame image; finally, the PnP problem is solved according to the matching point pairs, and the second pose of the i-th frame is output as the final visual positioning result. Based on the visual positioning result of each frame, the motion trajectory of the device to be positioned can be obtained, thereby realizing real-time indoor navigation or other functions.
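  • Putting the per-frame flow together, the loop can be summarized by the following high-level sketch; every helper function named here is a hypothetical stand-in for a step described above, not an actual API:

```python
def process_frame(i, frame, state, point_cloud, K):
    # Each helper below is hypothetical and corresponds to a step in the text.
    normals = estimate_surface_normals(frame)                        # step S210
    R_cM_i = update_manhattan_transform(normals, state.R_cM_prev)    # offset clustering
    T_first = relative_from_manhattan(state.R_cM_prev, R_cM_i)       # first transformation parameter
    T_second = relative_from_feature_matching(frame, state.prev_frame)
    T_target = fuse_transforms(T_first, T_second)                    # step S240 optimization
    pose_first = compose(state.prev_pose, T_target)                  # first pose
    projections = project_point_cloud_to_image(point_cloud, pose_first, K)
    pose_second = solve_pnp(match_features_to_projections(frame, projections))
    state.update(frame, R_cM_i, pose_second)                         # keep for the next frame
    return pose_second                                               # final visual positioning result
```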
  • the visual positioning device 1700 may include:
  • the surface normal vector acquisition module 1710 is configured to acquire the surface normal vector of the current frame image
  • the first transformation parameter determination module 1720 is configured to determine the first transformation parameter between the current frame image and the reference frame image by projecting the surface normal vector to the Manhattan coordinate system;
  • the second transformation parameter determination module 1730 is configured to match the feature point of the current frame image and the feature point of the reference frame image, and determines the second transformation parameter between the current frame image and the reference frame image according to the matching result;
  • the target transformation parameter determination module 1740 is configured to determine the target transformation parameter based on the first transformation parameter and the second transformation parameter;
  • the visual positioning result output module 1750 is configured to output the visual positioning result corresponding to the current frame image according to the target transformation parameter.
  • the surface normal vector acquisition module 1710 is configured to: process the current frame image by using a pre-trained surface normal vector estimation network to obtain the surface normal vector of the current frame image.
  • the surface normal vector estimation network includes an encoding sub-network, a decoding sub-network and a convolutional sub-network.
  • the surface normal vector acquisition module 1710 is configured to:
  • the encoding sub-network is used to downsample the current frame image to obtain the downsampled intermediate image and the downsampled target image;
  • the decoding sub-network is used to upsample the downsampled target image and concatenate it with the downsampled intermediate image to obtain the up-sampled target image;
  • the convolution sub-network is used to perform the convolution operation on the up-sampled target image to obtain the surface normal vector.
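  • A minimal sketch of such an encoding/decoding/convolution structure is shown below in PyTorch; the channel counts, the single skip connection and the final normalization are illustrative choices and do not reproduce the network described in the embodiments:

```python
import torch
import torch.nn as nn

class TinyNormalNet(nn.Module):
    """Sketch: an encoder downsamples, a decoder upsamples and concatenates skip
    features, and a convolutional head predicts a per-pixel 3-channel normal map."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, x):
        d1 = self.enc1(x)                          # downsampled intermediate features
        d2 = self.enc2(self.pool(d1))              # downsampled target features
        u = self.up(d2)                            # upsample
        u = self.dec(torch.cat([u, d1], dim=1))    # concatenate skip connection
        n = self.head(u)                           # 3-channel normal map
        return nn.functional.normalize(n, dim=1)   # unit-length surface normals
```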
  • the first transformation parameter determination module 1720 is configured to:
  • the surface normal vector is mapped to the Manhattan coordinate system by using the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image;
  • the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the current frame image is determined based on the offset of the surface normal vector in the Manhattan coordinate system;
  • the first transformation parameter between the current frame image and the reference frame image is determined according to the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image and the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the current frame image.
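  • Assuming the camera coordinate system-Manhattan coordinate system transformation parameter is a rotation that maps Manhattan-frame vectors into the camera frame (the exact composition formula in the original description is given as an image, so this convention is an assumption), the rotation part of the first transformation parameter can be composed roughly as follows:

```python
import numpy as np

def relative_rotation(R_c1M, R_c2M):
    """First transformation parameter (rotation) between the reference frame c1 and
    the current frame c2, using the Manhattan frame M as the common reference.
    Assumes R_cM rotates Manhattan-frame vectors into the respective camera frame."""
    # chain: c2 -> M -> c1
    return R_c1M @ R_c2M.T
```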
  • the first transformation parameter determination module 1720 is configured to:
  • the three-dimensional axis coordinates of the surface normal vector in the camera coordinate system corresponding to the reference frame image are mapped to three-dimensional axis coordinates in the Manhattan coordinate system by using the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image;
  • the three-dimensional axis coordinates of the surface normal vector in the Manhattan coordinate system are mapped to two-dimensional coordinates on the tangential plane of an axis of the Manhattan coordinate system;
  • the first transformation parameter determination module 1720 is configured to:
  • the two-dimensional coordinates of the surface normal vector on the tangential plane are clustered, and the offset of the surface normal vector on the tangential plane is determined according to the cluster center;
  • the two-dimensional coordinates of the offset on the tangential plane are mapped to three-dimensional axis coordinates in the Manhattan coordinate system;
  • the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the current frame image is determined according to the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image and the three-dimensional axis coordinates of the offset in the Manhattan coordinate system.
  • the above-mentioned mapping of the two-dimensional coordinates of the offset on the tangential plane to three-dimensional axis coordinates in the Manhattan coordinate system includes:
  • the two-dimensional coordinates of the offset on the tangential plane are mapped to the unit sphere of the Manhattan coordinate system through exponential mapping, to obtain the three-dimensional axis coordinates of the offset in the Manhattan coordinate system.
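  • The tangential-plane mapping, clustering and exponential mapping can be sketched as follows with NumPy; the logarithm/exponential maps and the single Gaussian-kernel mean-shift step are standard textbook forms used here as stand-ins (and only the z-axis tangential plane is shown), since the exact formulas in the original description are given as images:

```python
import numpy as np

def log_map(n, axis=np.array([0.0, 0.0, 1.0])):
    """Map a unit normal n (already in Manhattan coordinates) to 2D coordinates
    on the tangential plane of the given axis (sphere logarithm map)."""
    n = n / np.linalg.norm(n)
    cos_t = np.clip(n @ axis, -1.0, 1.0)
    theta = np.arccos(cos_t)
    v = n - cos_t * axis                 # component orthogonal to the axis
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-9:
        return np.zeros(2)
    v = v / norm_v * theta               # scale by geodesic distance
    return v[:2]                         # 2D coordinates in the z-axis tangential plane

def exp_map(m, axis=np.array([0.0, 0.0, 1.0])):
    """Map 2D tangential-plane coordinates back onto the unit sphere (exp map)."""
    v = np.array([m[0], m[1], 0.0])
    theta = np.linalg.norm(v)
    if theta < 1e-9:
        return axis
    return np.cos(theta) * axis + np.sin(theta) * v / theta

def mean_shift_offset(m_all, bandwidth=0.3):
    """One Gaussian-kernel mean-shift step around the origin: the weighted mean of
    the tangential-plane points is taken as the offset of the cluster center."""
    w = np.exp(-np.sum(m_all ** 2, axis=1) / (2 * bandwidth ** 2))
    return (w[:, None] * m_all).sum(axis=0) / w.sum()
```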
  • the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image includes: a relative rotation matrix between the camera coordinate system corresponding to the reference frame image and the Manhattan coordinate system.
  • the second transformation parameter determination module 1730 is configured to:
  • the feature points of the reference frame image are matched to the feature points of the current frame image to obtain first matching information;
  • the feature points of the current frame image are matched to the feature points of the reference frame image to obtain second matching information;
  • a matching result is obtained according to the first matching information and the second matching information.
  • the second transformation parameter determination module 1730 is configured to:
  • an intersection or a union of the first matching information and the second matching information is taken to obtain the matching result.
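  • A minimal NumPy sketch of such cross-check (mutual nearest neighbour) matching is given below; taking the intersection keeps only pairs that are nearest neighbours in both matching directions, whereas a union would keep pairs found in either direction. The function name mutual_matches is hypothetical:

```python
import numpy as np

def mutual_matches(desc_ref, desc_cur):
    """Cross-check matching: keep only feature pairs that are nearest neighbours
    in both directions (the intersection of the two matching directions)."""
    # pairwise L2 distances between the two descriptor sets
    d = np.linalg.norm(desc_ref[:, None, :] - desc_cur[None, :, :], axis=2)
    fwd = d.argmin(axis=1)   # for each reference feature, its best current feature
    bwd = d.argmin(axis=0)   # for each current feature, its best reference feature
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]
```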
  • the above-mentioned matching of the feature points of the current frame image and the feature points of the reference frame image further includes:
  • mismatched point pairs are removed from the matching result by using the geometric constraint relationship between the current frame image and the reference frame image.
  • the first transformation parameter includes a first rotation matrix
  • the second transformation parameter includes a second rotation matrix.
  • the target transformation parameter determination module 1740 is configured to:
  • a loss function is established based on the error between the first rotation matrix and the second rotation matrix;
  • the second rotation matrix is iteratively adjusted to optimize the minimum value of the loss function, and the adjusted second rotation matrix is used as the rotation matrix in the target transformation parameter.
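  • Read literally, this optimization can be sketched as follows with SciPy, parameterizing the second rotation by an axis-angle vector; the actual loss function of the embodiments is given as an image and would typically also include feature-based terms, so this is only an illustrative stand-in:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def refine_rotation(R_first, R_second):
    """Iteratively adjust the second rotation so that a loss measuring its
    disagreement with the first rotation is minimized (initialized at R_second)."""
    def loss(rotvec):
        R_adj = Rotation.from_rotvec(rotvec).as_matrix()
        # Frobenius-norm rotation error between the adjusted rotation and R_first;
        # a practical system would add reprojection terms to this loss.
        return np.linalg.norm(R_adj @ R_first.T - np.eye(3)) ** 2

    x0 = Rotation.from_matrix(R_second).as_rotvec()
    res = minimize(loss, x0, method="BFGS")
    return Rotation.from_rotvec(res.x).as_matrix()
```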
  • the visual positioning result output module 1750 is configured to:
  • the first pose corresponding to the current frame image is determined according to the target transformation parameter and the pose corresponding to the reference frame image;
  • the three-dimensional point cloud of the target scene is projected to the plane of the current frame image by using the first pose to obtain corresponding projection points, where the target scene is the scene captured by the current frame image and the reference frame image;
  • the feature points of the current frame image are matched with the projection points, and the second pose of the current frame image is determined according to the matching point pairs of the feature points of the current frame image and the projection points;
  • the second pose is output as the visual positioning result corresponding to the current frame image.
  • the visual positioning result output module 1750 is configured to:
  • a matching point pair between the feature point and the projection point of the current frame image is obtained.
  • the above-mentioned determining the second pose of the current frame image according to the matching point pair of the feature point of the current frame image and the projection point includes:
  • the projected points in the matching point pair are replaced with the three-dimensional points in the three-dimensional point cloud, the matching relationship between the feature points of the current frame image and the three-dimensional points is obtained, and the second pose is obtained based on the matching relationship.
  • the above-mentioned acquiring the surface normal vector of the current frame image includes: acquiring the surface normal vector of each pixel point in the current frame image.
  • the above-mentioned determining the first transformation parameter between the current frame image and the reference frame image by projecting the surface normal vector to the Manhattan coordinate system includes:
  • the first transformation parameter between the current frame image and the reference frame image is determined by projecting the surface normal vector of each pixel point to the Manhattan coordinate system.
  • the visual positioning device 1800 may include a processor 1810 and a memory 1820 .
  • the memory 1820 stores the following program modules:
  • the surface normal vector acquisition module 1821 is configured to acquire the surface normal vector of the current frame image
  • the first transformation parameter determination module 1822 is configured to determine the first transformation parameter between the current frame image and the reference frame image by projecting the surface normal vector to the Manhattan coordinate system;
  • the second transformation parameter determination module 1823 is configured to match the feature points of the current frame image and the feature points of the reference frame image, and determine the second transformation parameter between the current frame image and the reference frame image according to the matching result;
  • the target transformation parameter determination module 1824 is configured to determine the target transformation parameter based on the first transformation parameter and the second transformation parameter;
  • the visual positioning result output module 1825 is configured to output the visual positioning result corresponding to the current frame image according to the target transformation parameter.
  • the processor 1810 is used to execute the above-mentioned program modules.
  • the surface normal vector acquisition module 1821 is configured to: process the current frame image by using a pre-trained surface normal vector estimation network to obtain the surface normal vector of the current frame image.
  • the surface normal vector estimation network includes an encoding sub-network, a decoding sub-network and a convolutional sub-network.
  • the surface normal vector acquisition module 1821 is configured to:
  • the encoding sub-network is used to downsample the current frame image to obtain the downsampled intermediate image and the downsampled target image;
  • the decoding sub-network is used to upsample the downsampled target image and concatenate it with the downsampled intermediate image to obtain the up-sampled target image;
  • the convolution sub-network is used to perform the convolution operation on the up-sampled target image to obtain the surface normal vector.
  • the first transformation parameter determination module 1822 is configured to:
  • the surface normal vector is mapped to the Manhattan coordinate system by using the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image;
  • the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the current frame image is determined based on the offset of the surface normal vector in the Manhattan coordinate system;
  • the first transformation parameter between the current frame image and the reference frame image is determined according to the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image and the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the current frame image.
  • the first transformation parameter determination module 1822 is configured to:
  • the three-dimensional axis coordinates of the surface normal vector in the camera coordinate system corresponding to the reference frame image are mapped to three-dimensional axis coordinates in the Manhattan coordinate system by using the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image;
  • the three-dimensional axis coordinates of the surface normal vector in the Manhattan coordinate system are mapped to two-dimensional coordinates on the tangential plane of an axis of the Manhattan coordinate system;
  • the first transformation parameter determination module 1822 is configured to:
  • the two-dimensional coordinates of the surface normal vector on the tangential plane are clustered, and the offset of the surface normal vector on the tangential plane is determined according to the cluster center;
  • the two-dimensional coordinates of the offset on the tangential plane are mapped to three-dimensional axis coordinates in the Manhattan coordinate system;
  • the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the current frame image is determined according to the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image and the three-dimensional axis coordinates of the offset in the Manhattan coordinate system.
  • the above-mentioned mapping of the two-dimensional coordinates of the offset on the tangential plane to three-dimensional axis coordinates in the Manhattan coordinate system includes:
  • the two-dimensional coordinates of the offset on the tangential plane are mapped to the unit sphere of the Manhattan coordinate system through exponential mapping, to obtain the three-dimensional axis coordinates of the offset in the Manhattan coordinate system.
  • the camera coordinate system-Manhattan coordinate system transformation parameter corresponding to the reference frame image includes: a relative rotation matrix between the camera coordinate system corresponding to the reference frame image and the Manhattan coordinate system.
  • the second transformation parameter determination module 1823 is configured to:
  • the feature points of the reference frame image are matched to the feature points of the current frame image to obtain first matching information;
  • the feature points of the current frame image are matched to the feature points of the reference frame image to obtain second matching information;
  • a matching result is obtained according to the first matching information and the second matching information.
  • the second transformation parameter determination module 1823 is configured to:
  • an intersection or a union of the first matching information and the second matching information is taken to obtain the matching result.
  • the above-mentioned matching of the feature points of the current frame image and the feature points of the reference frame image further includes:
  • mismatched point pairs are removed from the matching result by using the geometric constraint relationship between the current frame image and the reference frame image.
  • the first transformation parameter includes a first rotation matrix
  • the second transformation parameter includes a second rotation matrix.
  • the target transformation parameter determination module 1824 is configured to:
  • a loss function is established based on the error between the first rotation matrix and the second rotation matrix;
  • the second rotation matrix is iteratively adjusted to optimize the minimum value of the loss function, and the adjusted second rotation matrix is used as the rotation matrix in the target transformation parameter.
  • the visual positioning result output module 1825 is configured to:
  • the first pose corresponding to the current frame image is determined according to the target transformation parameter and the pose corresponding to the reference frame image;
  • the three-dimensional point cloud of the target scene is projected to the plane of the current frame image by using the first pose to obtain corresponding projection points, where the target scene is the scene captured by the current frame image and the reference frame image;
  • the feature points of the current frame image are matched with the projection points, and the second pose of the current frame image is determined according to the matching point pairs of the feature points of the current frame image and the projection points;
  • the second pose is output as the visual positioning result corresponding to the current frame image.
  • the visual positioning result output module 1825 is configured to:
  • a matching point pair between the feature point and the projection point of the current frame image is obtained.
  • the above-mentioned determining the second pose of the current frame image according to the matching point pair of the feature point of the current frame image and the projection point includes:
  • the projected points in the matching point pair are replaced with the three-dimensional points in the three-dimensional point cloud, the matching relationship between the feature points of the current frame image and the three-dimensional points is obtained, and the second pose is obtained based on the matching relationship.
  • the above-mentioned acquiring the surface normal vector of the current frame image includes: acquiring the surface normal vector of each pixel point in the current frame image.
  • the above-mentioned determining the first transformation parameter between the current frame image and the reference frame image by projecting the surface normal vector to the Manhattan coordinate system includes:
  • the first transformation parameter between the current frame image and the reference frame image is determined by projecting the surface normal vector of each pixel point to the Manhattan coordinate system.
  • Exemplary embodiments of the present disclosure also provide a computer-readable storage medium, which can be implemented in the form of a program product including program code; when the program product is run on an electronic device, the program code is used to cause the electronic device to perform the steps of the visual positioning method described above.
  • the program product may be implemented as a portable compact disk read-only memory (CD-ROM) including program code, and may be run on an electronic device, such as a personal computer.
  • the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • the program product may employ any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer readable signal medium may include a propagated data signal in baseband or as part of a carrier wave with readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a readable signal medium can also be any readable medium, other than a readable storage medium, that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a readable medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet by using an Internet service provider).
  • although several modules or units of the apparatus for action performance are mentioned in the above detailed description, this division is not mandatory. Indeed, according to exemplary embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into multiple modules or units to be embodied.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

一种视觉定位方法、装置、计算机可读存储介质与电子设备,该方法包括:获取当前帧图像的表面法向量(S210);通过将表面法向量投影至曼哈顿坐标系,确定当前帧图像与参考帧图像间的第一变换参数(S220);匹配当前帧图像的特征点与参考帧图像的特征点,根据匹配结果确定当前帧图像与参考帧图像间的第二变换参数(S230);基于第一变换参数与第二变换参数,确定目标变换参数(S240);根据目标变换参数输出当前帧图像对应的视觉定位结果(S250)。降低了视觉定位对图像特征质量的依赖,解决了弱纹理环境下的视觉定位问题。 (图2)

Description

视觉定位方法、视觉定位装置、存储介质与电子设备
本申请要求申请日为2021年03月29日,申请号为202110336267.8,名称为“视觉定位方法、视觉定位装置、存储介质与电子设备”的中国专利申请的优先权,该中国专利申请的全部内容通过引用结合在本文中。
技术领域
本公开涉及计算机视觉技术领域,尤其涉及一种视觉定位方法、视觉定位装置、计算机可读存储介质与电子设备。
背景技术
视觉定位是一种新型的定位技术,通过图像采集设备(如手机、RGB相机等)采集环境图像,配合图像算法和数学推理来实时计算与更新当前位姿,具有高速、高精度、易使用等优点,已广泛应用于AR(Augmented Reality,增强现实)、室内导航等场景下。
相关技术中,在进行视觉定位时,通常需要对图像进行特征提取与特征匹配,以计算位姿。可见,其定位结果极大地依赖于图像中的特征质量。然而,当视觉环境出现弱纹理的情况时,将影响所提取的图像特征质量,导致定位漂移甚至定位失败。
发明内容
本公开提供一种视觉定位方法、视觉定位装置、计算机可读存储介质与电子设备。
根据本公开的第一方面,提供一种视觉定位方法,包括:获取当前帧图像的表面法向量;通过将所述表面法向量投影至曼哈顿坐标系,确定所述当前帧图像与参考帧图像间的第一变换参数;匹配所述当前帧图像的特征点与所述参考帧图像的特征点,根据匹配结果确定所述当前帧图像与所述参考帧图像间的第二变换参数;基于所述第一变换参数与所述第二变换参数,确定目标变换参数;根据所述目标变换参数输出所述当前帧图像对应的视觉定位结果。
根据本公开的第二方面,提供一种视觉定位装置,包括:表面法向量获取模块,被配置为获取当前帧图像的表面法向量;第一变换参数确定模块,被配置为通过将所述表面法向量投影至曼哈顿坐标系,确定所述当前帧图像与参考帧图像间的第一变换参数;第二变换参数确定模块,被配置为匹配所述当前帧图像的特征点与所述参考帧图像的特征点,根据匹配结果确定所述当前帧图像与所述参考帧图像间的第二变换参数;目标变换参数确定模块,被配置为基于所述第一变换参数与所述第二变换参数,确定目标变换参数;视觉定位结果输出模块,被配置为根据所述目标变换参数输出所述当前帧图像对应的视觉定位结果。
根据本公开的第三方面,提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现上述第一方面的视觉定位方法及其可能的实现方式。
根据本公开的第四方面,提供一种电子设备,包括:处理器;以及存储器,用于存储所述处理器的可执行指令;其中,所述处理器配置为经由执行所述可执行指令来执行上述第一方面的视觉定位方法及其可能的实现方式。
附图说明
图1示出本示例性实施方式中一种电子设备的结构示意图;
图2示出本示例性实施方式中一种视觉定位方法的流程图;
图3示出本示例性实施方式中当前帧图像与表面法向量的示意图;
图4示出本示例性实施方式中通过表面法向量估计网络进行处理的示意图;
图5示出本示例性实施方式中一种获取表面法向量的流程图;
图6示出本示例性实施方式中一种确定第一变换参数的流程图;
图7示出本示例性实施方式中一种确定当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数的示意图;
图8示出本示例性实施方式中一种特征点匹配的流程图;
图9示出本示例性实施方式中一种特征点匹配的示意图;
图10示出本示例性实施方式中一种输出视觉定位结果的流程图;
图11示出本示例性实施方式中对三维点云进行投影的示意图;
图12示出本示例性实施方式中一种特征点与投影点匹配的流程图;
图13示出本示例性实施方式中目标场景的三维点云示意图;
图14示出本示例性实施方式中一种视觉定位初始化的流程图;
图15示出本示例性实施方式中确定第i帧目标变换参数的流程图;
图16示出本示例性实施方式中确定第i帧目标变换参数的示意图;
图17示出本示例性实施方式中一种视觉定位装置的结构示意图;
图18示出本示例性实施方式中另一种视觉定位装置的结构示意图。
具体实施方式
现在将参考附图更全面地描述示例实施方式。然而,示例实施方式能够以多种形式实施,且不应被理解为限于在此阐述的范例;相反,提供这些实施方式使得本公开将更加全面和完整,并将示例实施方式的构思全面地传达给本领域的技术人员。所描述的特征、结构或特性可以以任何合适的方式结合在一个或更多实施方式中。在下面的描述中,提供许多具体细节从而给出对本公开的实施方式的充分理解。然而,本领域技术人员将意识到,可以实践本公开的技术方案而省略所述特定细节中的一个或更多,或者可以采用其它的方法、组元、装置、步骤等。在其它情况下,不详细示出或描述公知技术方案以避免喧宾夺主而使得本公开的各方面变得模糊。
此外,附图仅为本公开的示意性图解,并非一定是按比例绘制。图中相同的附图标记表示相同或类似的部分,因而将省略对它们的重复描述。附图中所示的一些方框图是功能实体,不一定必须与物理或逻辑上独立的实体相对应。可以采用软件形式来实现这些功能实体,或在一个或多个硬件模块或集成电路中实现这些功能实体,或在不同网络和/或处理器装置和/或微控制器装置中实现这些功能实体。
附图中所示的流程图仅是示例性说明,不是必须包括所有的步骤。例如,有的步骤还可以分解,而有的步骤可以合并或部分合并,因此实际执行的顺序有可能根据实际情况改变。
为了提高在弱纹理环境下视觉定位的准确性,相关技术中出现了采用辅助传感器的方案。例如,采用激光传感器(如激光雷达)、深度传感器(如RGB-D相机)等直接获取图像像素对应位置处的深度信息,从而恢复出三维点云信息来进行视觉定位。但是这样增加了方案实现的硬件成本。
鉴于上述问题,本公开的示例性实施方式首先提供一种视觉定位方法,其应用场景包括但不限于:用户处于商场内部,需要室内导航时,可以通过具有拍摄功能的终端对环境进行拍摄,终端对环境图像进行特征点提取,并上传至云端,由云端执行本示例性实施方式的视觉定位方法,确定终端的定位结果,提供室内导航服务。
本公开的示例性实施方式还提供一种电子设备,用于执行上述视觉定位方法。该电子设备可以是上述终端或云端的服务器,包括但不限于计算机、智能手机、可穿戴设备(如增强现实眼镜)、机器人、无人机等。一般的,电子设备包括处理器和存储器。存储器用于存储处理器的可执行指令,也可以存储应用数据,如图像数据、视频数据等;处理器配置为经由执行可执行指令来执行本示例性实施方式中的视觉定位方法。
下面以图1中的移动终端100为例,对上述电子设备的构造进行示例性说明。本领域技术人员应当理解,除了特别用于移动目的的部件之外,图1中的构造也能够应用于固定类型的设备。
如图1所示,移动终端100具体可以包括:处理器110、内部存储器121、外部存储器接口122、USB(Universal Serial Bus,通用串行总线)接口130、充电管理模块140、电源管理模块141、电池142、天线1、天线2、移动通信模块150、无线通信模块160、音频模块170、扬声器171、受话器172、麦克风173、耳机接口174、传感器模块180、显示屏190、摄像模组191、指示器192、马达193、按键194以及SIM(Subscriber Identification Module,用户标识模块)卡接口195等。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括AP(Application Processor,应用处理器)、调制解调处理器、GPU(Graphics Processing Unit,图形处理器)、ISP(Image Signal Processor,图像信号处理器)、控制器、编码器、解码器、DSP(Digital Signal Processor,数字信号处理器)、基带处理器和/或NPU(Neural-Network Processing Unit,神经网络处理器)等。
编码器可以对图像或视频数据进行编码(即压缩),例如对拍摄的场景图像进行编码,形成对应的码流数据,以减少数据传输所占的带宽;解码器可以对图像或视频的码流数据进行解码(即解压缩),以还原出图像或视频数据,例如对场景图像的码流数据进行解码,得到完整的图像数据,便于执行本示例性实施方式的定位方法。移动终端100可以支持一种或多种编码器和解码器。这样,移动终端100可以处理多种编码格式的图像或视频,例如:JPEG(Joint Photographic Experts Group,联合图像专家组)、PNG(Portable Network Graphics,便携式网络图形)、BMP(Bitmap,位图)等图像格式,MPEG(Moving Picture Experts Group,动态图像专家组)1、MPEG2、H.263、H.264、HEVC(High Efficiency Video Coding,高效率视频编码)等视频格式。
在一种实施方式中,处理器110可以包括一个或多个接口,通过不同的接口和移动终端100的其他 部件形成连接。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。内部存储器121可以包括易失性存储器与非易失性存储器。处理器110通过运行存储在内部存储器121的指令,执行移动终端100的各种功能应用以及数据处理。
外部存储器接口122可以用于连接外部存储器,例如Micro SD卡,实现扩展移动终端100的存储能力。外部存储器通过外部存储器接口122与处理器110通信,实现数据存储功能,例如存储图像,视频等文件。
USB接口130是符合USB标准规范的接口,可以用于连接充电器为移动终端100充电,也可以连接耳机或其他电子设备。
充电管理模块140用于从充电器接收充电输入。充电管理模块140为电池142充电的同时,还可以通过电源管理模块141为设备供电;电源管理模块141还可以监测电池的状态。
移动终端100的无线通信功能可以通过天线1、天线2、移动通信模块150、无线通信模块160、调制解调处理器以及基带处理器等实现。天线1和天线2用于发射和接收电磁波信号。移动通信模块150可以提供应用在移动终端100上的包括2G/3G/4G/5G等无线通信的解决方案。无线通信模块160可以提供应用在移动终端100上的包括WLAN(Wireless Local Area Networks,无线局域网)(如Wi-Fi(Wireless Fidelity,无线保真)网络)、BT(Bluetooth,蓝牙)、GNSS(Global Navigation Satellite System,全球导航卫星系统)、FM(Frequency Modulation,调频)、NFC(Near Field Communication,近距离无线通信技术)、IR(Infrared,红外技术)等无线通信解决方案。
移动终端100可以通过GPU、显示屏190及AP等实现显示功能,显示用户界面。例如,当用户开启拍摄功能时,移动终端100可以在显示屏190中显示拍摄界面和预览图像等。
移动终端100可以通过ISP、摄像模组191、编码器、解码器、GPU、显示屏190及AP等实现拍摄功能。例如,用户可以启动视觉定位的相关服务,触发开启拍摄功能,此时可以通过摄像模组191采集当前场景的图像,并进行定位。
移动终端100可以通过音频模块170、扬声器171、受话器172、麦克风173、耳机接口174及AP等实现音频功能。
传感器模块180可以包括深度传感器1801、压力传感器1802、陀螺仪传感器1803、气压传感器1804等,以实现相应的感应检测功能。
指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。马达193可以产生振动提示,也可以用于触摸振动反馈等。按键194包括开机键,音量键等。
移动终端100可以支持一个或多个SIM卡接口195,用于连接SIM卡,以实现通话与移动通信等功能。
下面结合图2对本示例性实施方式的视觉定位方法进行说明,图2示出了视觉定位方法的示例性流程,可以包括:
步骤S210,获取当前帧图像的表面法向量;
步骤S220,通过将表面法向量投影至曼哈顿坐标系,确定当前帧图像与参考帧图像间的第一变换参数;
步骤S230,匹配当前帧图像的特征点与参考帧图像的特征点,根据匹配结果确定当前帧图像与参考帧图像间的第二变换参数;
步骤S240,利用第一变换参数优化第二变换参数,得到目标变换参数;
步骤S250,根据目标变换参数输出当前帧图像对应的视觉定位结果。
基于上述方法,利用当前帧图像中的表面法向量在曼哈顿坐标系中的投影,确定当前帧图像与参考帧图像间的第一变换参数,由此对通过特征点匹配所得到的第二变换参数进行优化,根据优化后得到的目标变换参数输出视觉定位结果。一方面,本示例性实施方式通过表面法向量的投影与特征点的匹配两方面来确定变换参数,并结合两方面的结果以进行视觉定位,降低了对单方面的依赖性,提高了方案的鲁棒性。特别是降低了对图像中特征质量的依赖,即使在弱纹理环境下,也能够通过优化第二变换参数,提高目标变换参数与最终视觉定位结果的准确性,从而解决弱纹理环境下的视觉定位问题。另一方面,本示例性实施方式采用特征点匹配的方式确定第二变换参数,特征点的提取与匹配处理所需的计算量较低,有利于提高响应速度与视觉定位的实时性。再一方面,本示例性实施方式基于普通的单目RGB相机即可实现,无需增加辅助传感器,具有较低的实现成本。
下面对图2中的每个步骤进行具体说明。
参考图2,步骤S210中,获取当前帧图像的表面法向量。
当前帧图像是当前针对目标场景所拍摄的图像,目标场景即用户当前所处的环境场景,例如房间、 商场等。在视觉定位的场景中,通常需要终端连续拍摄多帧场景图像,当前帧图像是其中最新一帧图像。当前帧图像的表面法向量包括当前帧图像中至少一部分像素点的表面法向量。举例来说,当前帧图像的高度与宽度分别为H、W,则当前帧图像的像素数为H*W,获取每个像素点的表面法向量,每个像素点的表面法向量包括3个维度的轴坐标,则当前帧图像的表面法向量包括H*W*3个数值。图3示出了表面法向量的可视化示意图,其中图像I 0为当前帧图像,I 0中的每个像素点包括像素值(如RGB值),获取每个像素点的表面法向量后,将表面法向量的坐标值表示为颜色,则得到可视化的表面法向量图像I normal。I 0中处于同一平面上的像素点的表面法向量相同,因此在I normal中的颜色也相同。
在一种实施方式中,可以预先训练表面法向量估计网络,其可以是CNN(Convolutional Neural Network,卷积神经网络)等深度学习网络,利用该表面法向量估计网络处理当前帧图像,得到当前帧图像的表面法向量。
图4示出了表面法向量估计网络的结构示意图,可以基于传统的U-Net(一种U型网络)结构进行变化。该网络主要包括三部分:编码子网络(Encoder),解码子网络(Decoder),卷积子网络(Convolution)。参考图5所示,获取表面法向量的过程可以包括:
步骤S510,利用编码子网络对当前帧图像进行下采样操作,得到下采样中间图像与下采样目标图像;
步骤S520,利用解码子网络对下采样目标图像进行上采样操作以及与下采样中间图像的拼接操作,得到上采样目标图像;
步骤S530,利用卷积子网络对上采样目标图像进行卷积操作,得到表面法向量。
一般的,将当前帧图像输入表面法向量估计网络后,首先进入编码子网络,在编码子网络中需要依次进行多次下采样操作。例如进行g次下采样操作(g≥2),第1次至第g-1次下采样操作所得到的图像为下采样中间图像,第g次下采样操作所得到的图像为下采样目标图像。下采样中间图像与下采样目标图像均可视为编码子网络输出的图像。下采样操作可以捕捉图像中的语义信息。
接下来,下采样目标图像进入解码子网络,依次进行多次上采样与拼接操作。每次进行上采样操作后,将得到的图像与对应的下采样中间图像进行拼接,得到上采样中间图像,再对该上采样中间图像进行下一次上采样与拼接操作。解码子网络的上采样操作与编码子网络的下采样操作相对应,可以与图像中的语义信息进行定位。在完成最后一次上采样与拼接操作后,解码子网络输出上采样目标图像。需要说明的是,在解码子网络的处理中,也可以将拼接操作视为上采样操作中的一个环节,将上述上采样与拼接操作统称为上采样操作,本公开对此不做限定。
最后,上采样目标图像进入卷积子网络。卷积子网络可以由多个尺度不变的卷积层组成,通过对上采样目标图像进行卷积操作,进一步学习图像特征,提高图像解码能力,并最终输出表面法向量。
在一种实施方式中,表面法向量估计网络的训练过程可以包括以下步骤:
构建初始的表面法向量估计网络,其结构可以如图4所示;
建立表面法向量估计网络的损失函数,可以如下所示:
L = ∑_{x∈Ω} ω(x)·log(p_{l(x)}(x))      (1)
其中，p_{l(x)}(x)是AP（Average Precision，平均正确率）概率函数，l:Ω→{1,...,K}是像素点，ω(x)是像素点的权重值，通常图像中越靠近边界位置的像素点的权重值越高；
设置Mini-batch的训练方式与Adam算法,并设置训练学习率和训练步数,如分别为0.001和100万步数;
输入包含RGB图像与表面法向量图像的数据集,如Taskonomy和NYUv2,开始训练;
完成训练并测试通过后,对表面法向量估计网络的结构与参数进行固化,并保存为相应的文件,以供调用。
下面结合图4,对表面法向量估计网络的处理过程进一步举例说明:将当前帧图像输入网络,由Encoder进行5次下采样操作;每次下采样操作可以包括两次3*3的卷积操作与一次2*2的最大池化操作,且5次下采样操作所用的卷积核数目依次翻倍,分别为1、64、128、256、512和1024;每次下采样操作后得到对应的下采样中间图像,其尺寸依次递减,最后一次下采样操作后得到下采样目标图像。由Decoder对下采样目标图像进行与下采样操作相对应的多次上采样操作;每次上采样操作可以包括一次2*2的转置卷积(或称反卷积)操作,一次与对应相同尺寸的下采样中间图像的拼接操作,以及两次3*3的卷积操作;每次上采样操作后得到对应的上采样中间图像,其尺寸依次递增,最后一次上采样操作后得到上采样目标图像。再由Convolution对上采样目标图像进行全卷积操作,最终输出表面法向量。
继续参考图2,步骤S220中,通过将表面法向量投影至曼哈顿坐标系,确定当前帧图像与参考帧图像间的第一变换参数。
曼哈顿世界(Manhattan World)假设是指假设环境中存在垂直或正交的信息,例如在图3所示的当 前帧图像I 0中,室内的地面与墙面垂直,墙面与天花板垂直,正前方的墙面与两侧的墙面垂直,由此可以构建包含垂直信息的坐标系,即曼哈顿坐标系。上述垂直的地面与墙面、墙面与天花板等,在当前帧图像对应的相机坐标系或图像坐标系中并非垂直关系,可见,曼哈顿坐标系与相机坐标系或图像坐标系之间存在一定的变换关系。因此,可以利用曼哈顿坐标系与不同帧图像对应的相机坐标系间的变换参数,确定不同帧图像对应的相机坐标系间的变换参数。本示例性实施方式可以通过这种方式,确定当前帧图像与参考帧图像对应的相机坐标系间的变换参数。上述当前帧图像与参考帧图像间的第一变换参数,即表示通过曼哈顿坐标系所确定的两图像对应的相机坐标系间的变换参数。参考帧图像可以是针对目标场景所拍摄的已知位姿的任意一帧或多帧图像,例如在连续拍摄场景图像以进行视觉定位的情况下,可以以上一帧图像作为参考帧图像。
需要说明的是,本示例性实施方式所述的变换参数,可以包括旋转参数(如旋转矩阵)与平移参数(如平移向量)。
在步骤S210中,可以获取当前帧图像中每个像素点的表面法向量。在步骤S220中,可以将每个像素点的表面法向量均投影至曼哈顿坐标系,经过后续处理得到第一变换参数。或者,也可以将一部分像素点的表面法向量投影至曼哈顿坐标系,例如选取当前帧图像中低平坦度区域(通常是纹理变化较大的区域)的像素点,将其表面法向量投影至曼哈顿坐标系。
在一种实施方式中,参考图6所示,步骤S220可以通过以下步骤S610至S630实现:
步骤S610中,利用参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数,将上述表面法向量映射至曼哈顿坐标系。
其中,相机坐标系一般指以相机的光心为原点的笛卡尔坐标系,可以表示为SO(3);曼哈顿坐标系一般指以单位球构成的向量坐标系,可以表示为so(3),如图7所示的法向量球,法向量球上的每一个点表示相机坐标系下表面法向量的起点移至球心后,表面法向量的终点所在的位置。步骤S210中所获取的表面法向量是在相机坐标系下的信息,利用相机坐标系-曼哈顿坐标系变换参数,可以将其映射至曼哈顿坐标系中。
相机坐标系中x、y、z轴的方向与相机的位姿相关,曼哈顿坐标系中x、y、z轴的方向是对目标场景建立曼哈顿坐标系时所确定的,与目标场景的真实世界信息相关,无论相机如何运动,曼哈顿坐标系中x、y、z轴的方向是固定的。因此,相机坐标系与曼哈顿坐标系间的变换关系包括两坐标系间的旋转变换关系。由于表面法向量仅表示方向,与像素点的位置无关,例如图3所示的当前帧图像I 0中,对于同一墙面上的不同像素点,其表面法向量相同,因此可以忽略相机坐标系与曼哈顿坐标系间的平移关系。上述相机坐标系-曼哈顿坐标系变换参数可以是相机坐标系-曼哈顿坐标系相对旋转矩阵。
在一种实施方式中,步骤S610可以包括:
利用参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数,将上述表面法向量在参考帧图像对应的相机坐标系下的三维轴坐标映射为在曼哈顿坐标系下的三维轴坐标;
将表面法向量在曼哈顿坐标系下的三维轴坐标映射为在曼哈顿坐标系的轴的切向平面上的二维坐标。
举例来说,将参考帧图像对应的相机坐标系-曼哈顿坐标系旋转矩阵表示为R cM=[r 1 r 2 r 3]∈SO(3),r 1、r 2、r 3分别表示曼哈顿坐标系的三个轴。用n k表示步骤S210所获取的表面法向量,将其映射到曼哈顿坐标系中,得到:
Figure PCTCN2022078435-appb-000002
其中,n k'表示表面法向量n k在曼哈顿坐标系下的三维轴坐标。
然后通过对数映射(logarithm map)将其映射至曼哈顿坐标系的轴的切向平面上,例如映射到z轴的切向平面上,得到:
Figure PCTCN2022078435-appb-000003
其中,
Figure PCTCN2022078435-appb-000004
m k'表示表面法向量n k在曼哈顿坐标系的轴的切向平面上的二维坐标。相比于在单位球面上的三维轴坐标,通过切向平面上的二维坐标表示表面法向量,更易于计算其偏移。
步骤S620中,基于表面法向量在曼哈顿坐标系下的偏移,确定当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数。
基于曼哈顿世界假设,如果相机坐标系-曼哈顿坐标系变换参数准确,则将表面法向量映射到曼哈顿坐标系后,表面法向量应当与曼哈顿坐标系的轴方向相同。参考图7所示,由于在步骤S610进行采用参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数R cM(更准确的说,是
Figure PCTCN2022078435-appb-000005
)对当前帧图像的表面法向量进行映射,映射后表面法向量与曼哈顿坐标系的轴方向存在偏移,表现为表面法向量在单位球 上的映射点并不在轴向位置。可见,该偏移是由参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数R cM与当前帧图像的对应的相机坐标系-曼哈顿坐标系变换参数(可以以
Figure PCTCN2022078435-appb-000006
表示)不一致所导致的。由此,可以利用表面法向量的偏移计算当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数。
在一种实施方式中,步骤S620可以包括:
对表面法向量在切向平面上的二维坐标进行聚类,根据聚类中心确定表面法向量在切向平面上的偏移;
将上述偏移在切向平面上的二维坐标映射为在曼哈顿坐标系下的三维轴坐标;
根据参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数与上述偏移在曼哈顿坐标系下的三维轴坐标,确定当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数。
其中,对表面法向量的二维坐标聚类时,实际上是对表面法向量在切向平面上的投影点进行聚类,可以分别对每个维度的坐标进行聚类,本公开对于聚类的具体方式不做限定。
举例来说,使用传统Mean-Shift算法(均值偏移,是一种聚类算法)计算m k'的聚类中心,得到聚类中心相对于切向平面原点的偏移(实际上就是聚类中心本身的二维坐标),具体可表示为:
Figure PCTCN2022078435-appb-000007
其中,c为高斯核的宽度,s j'为m k'的偏移,表示该偏移在切向平面上的二维坐标。
然后通过指数映射将s j'从切向平面映射回曼哈顿坐标系的单位球,得到上述偏移的三维轴坐标,即:
Figure PCTCN2022078435-appb-000008
此时,s j包括在曼哈顿坐标系的x、y、z三个轴的坐标,可以分别表示在曼哈顿坐标系下各轴需要更新的向量。由此,在参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数基础上,利用上述偏移s j包括的三维轴坐标进行变换参数的更新,得到当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数,如下所示:
Figure PCTCN2022078435-appb-000009
其中,
Figure PCTCN2022078435-appb-000010
表示当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数,主要是当前帧图像对应的相机坐标系与曼哈顿坐标系间的相对旋转矩阵。
在一种实施方式中,对于上述相对旋转矩阵
Figure PCTCN2022078435-appb-000011
可以采用SVD算法(Singular Value Decomposition,奇异值分解)使其满足旋转矩阵应用的正交性约束,以提高相对旋转矩阵的准确性。
步骤S630中,根据参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数与当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数,确定当前帧图像与参考帧图像间的第一变换参数。
参考图7所示,参考帧图像对应的相机坐标系c 1与曼哈顿坐标系之间存在变换关系,当前帧图像对应的相机坐标系c 2与曼哈顿坐标系之间存在变换关系,因此以曼哈顿坐标系作为基准,可以计算当前帧图像与参考帧图像间的变换参数,即c 1与c 2间的相对位姿关系,如下所示:
Figure PCTCN2022078435-appb-000012
其中,c 1与c 2分别表示参考帧图像与当前帧图像,
Figure PCTCN2022078435-appb-000013
是通过表面法向量的映射所计算得到的相对变换参数,称为第一变换参数。
继续参考图2,步骤S230中,匹配当前帧图像的特征点与参考帧图像的特征点,根据匹配结果确定当前帧图像与参考帧图像间的第二变换参数。
特征点是指图像中具有局部代表性的点,能够反映图像的局部特征,通常从纹理较为丰富的边界区域提取特征点,并通过一定的方式对特征点加以描述,得到特征点描述子。本示例性实施方式可以采用FAST(Features From Accelerated Segment Test,基于加速分割检测的特征)、BRIEF(Binary Robust Independent Elementary Features,二进制鲁棒独立基本特征)、ORB(Oriented FAST and Rotated BRIEF,面向FAST和旋转BRIEF)、SIFT(Scale-Invariant Feature Transform,尺度不变特征变换)、SURF(Speeded Up Robust Features,加速鲁棒特征)、SuperPoint(基于自监督学习的特征点检测和描述符提取)、R2D2(Reliable and Repeatable Detector and Descriptor,可靠可重复的特征点与描述符)等算法提取特征点并对特征点进行描述。
以SIFT特征为例进行说明。SIFT特征是对图像中检测到的特征点用128维的特征向量进行描述,具有对图像缩放、平移、旋转不变的特性,对于光照、仿射和投影变换也有一定的不变性。从参考帧图像与当前帧图像提取特征点并通过SIFT特征进行表示,对两图像中的SIFT特征进行匹配。一般可以计算两特征点的SIFT特征向量的相似度,如通过欧式距离、余弦相似度等对相似度进行度量,如果相 似度较高,则说明两特征点匹配,形成匹配点对。将参考帧图像与当前帧图像中的匹配点对形成集合,得到上述匹配结果。
在一种实施方式中,参考图8所示,步骤S230可以包括:
步骤S810,将参考帧图像的特征点向当前帧图像的特征点进行匹配,得到第一匹配信息;
步骤S820,将当前帧图像的特征点向参考帧图像的特征点进行匹配,得到第二匹配信息;
步骤S830,根据第一匹配信息与第二匹配信息得到匹配结果。
举例来说,从参考帧图像c 1中提取M个特征点,每个特征点通过d维(如128维)的描述子进行描述,则参考帧图像的局部描述子为D M*d;从当前帧图像c 2中提取N个特征点,每个特征点通过d维的描述子进行描述,则参考帧图像的局部描述子为D N*d;将D M*d向D N*d匹配,得到第一匹配信息:
Figure PCTCN2022078435-appb-000014
将D N*d向D M*d匹配,得到第二匹配信息:
Figure PCTCN2022078435-appb-000015
在步骤S810与步骤S820中,匹配方向不同,所得到的匹配结果也不同,分别表示为第一匹配信息
Figure PCTCN2022078435-appb-000016
与第二匹配信息
Figure PCTCN2022078435-appb-000017
Figure PCTCN2022078435-appb-000018
可以分别是M*N与N*M的矩阵,表示通过不同的匹配方向所得到的特征点之间的匹配概率。
进一步的,可以综合第一匹配信息与第二匹配信息,得到最终的匹配结果。
在一种实施方式中,可以对第一匹配信息与第二匹配信息取交集,得到匹配结果。例如,在第一匹配信息中确定不同特征点对的第一匹配概率;在第二匹配信息中确定不同特征点对的第二匹配概率;对同一特征点对,在第一匹配概率与第二匹配概率中取较小值作为综合匹配概率;然后筛选出综合匹配概率高于预设的匹配阈值的特征点对,得到匹配结果。或者,在第一匹配信息中筛选出匹配概率高于匹配阈值的特征点对,得到第一匹配点对集合;在第二匹配信息中筛选出匹配概率高于匹配阈值的特征点对,得到第二匹配点对集合;然后对第一匹配点对集合与第二匹配点对集合取交集,得到匹配结果。通过取交集的方式,实现了Cross-Check惩罚的匹配,保证了匹配点对的质量。
在另一种实施方式中,也可以对第一匹配信息与第二匹配信息取并集,得到匹配结果。与上述取交集的方式区别在于,在上述第一匹配概率与第二匹配概率中取较大值作为综合匹配概率,或者对上述第一匹配点对集合与第二匹配点对集合取并集。
在一种实施方式中,确定参考帧图像与当前帧图像的匹配点对后,可以利用图像间的几何约束关系,如对极约束等,通过RANSAC(Random Sample Consensus,随机采样一致性)等算法剔除参考帧图像与当前帧图像中的误匹配点对,以提高特征点匹配与后续处理结果的准确性。
图9示出了参考帧图像与当前帧图像之间的特征点的匹配关系。基于特征点的匹配关系,可以采用SVD算法计算两图像间的第二变换参数,可以包括旋转矩阵
Figure PCTCN2022078435-appb-000019
与平移向量
Figure PCTCN2022078435-appb-000020
继续参考图2,步骤S240中,基于第一变换参数与第二变换参数,确定目标变换参数。
由上可知,第一变换参数是通过表面法向量的投影而确定的,第二变换参数是通过特征点匹配而确定的,这两方面的算法均可能存在一定局限性。本示例性实施方式通过综合第一变换参数与第二变换参数,得到更加准确的目标变换参数。例如,基于第一变换参数与第二变换参数中的一个去优化另一个,优化后得到目标变换参数。
在一种实施方式中,第一变换参数包括第一旋转矩阵,第二变换参数包括第二旋转矩阵。可以采用BA算法(Bundle Adjustment,光束平差法)优化其中任一个。具体地,步骤S240可以包括:
基于第一旋转矩阵与第二旋转矩阵间的误差,建立损失函数;
通过迭代调整第二旋转矩阵,优化损失函数的最小值,将调整后的第二旋转矩阵作为目标变换参数中的旋转矩阵。
示例性的,用于优化旋转矩阵的损失函数可以如下所示:
Figure PCTCN2022078435-appb-000021
其中,第二旋转矩阵
Figure PCTCN2022078435-appb-000022
为待优化量,通过迭代调整
Figure PCTCN2022078435-appb-000023
使损失函数α的值减小,直到收敛。将调整后的
Figure PCTCN2022078435-appb-000024
记为
Figure PCTCN2022078435-appb-000025
为目标变换参数中的旋转矩阵。此外,目标变换参数中的平移向量
Figure PCTCN2022078435-appb-000026
可以采用第二变换参数中的平移向量
Figure PCTCN2022078435-appb-000027
由此,得到目标变换参数,包括
Figure PCTCN2022078435-appb-000028
继续参考图2,步骤S250中,根据目标变换参数输出当前帧图像对应的视觉定位结果。
目标变换参数用于表示当前帧图像与参考帧图像间的相对位姿关系。一般的,参考帧图像的位姿已被确定,在参考帧图像的位姿基础上,通过目标变换参数进行仿射变换,得到当前帧图像对应的视觉定位结果,如可以是6DoF(Degree of Freedom,自由度)位姿。
在一种实施方式中,参考图10所示,步骤S250可以包括:
步骤S1010,根据目标变换参数与参考帧图像对应的位姿,确定当前帧图像对应的第一位姿;
步骤S1020,利用第一位姿将目标场景的三维点云投影至当前帧图像的平面,得到对应的投影点;
步骤S1030,匹配当前帧图像的特征点与投影点,根据当前帧图像的特征点与投影点的匹配点对确定当前帧图像的第二位姿;
步骤S1040,将第二位姿作为当前帧图像对应的视觉定位结果进行输出。
其中,第一位姿与第二位姿分别指通过不同方式所确定的当前帧图像的位姿。在确定目标变换参数后,基于参考帧图像对应的位姿进行仿射变换,所得到的位姿为第一位姿。本示例性实施方式在第一位姿的基础上做进一步优化,得到更加准确的第二位姿,以作为最终的视觉定位结果加以输出。
目标场景为当前帧图像与参考帧图像所拍摄的场景,也是待定位设备当前所处的场景。图11示出了三维点云投影至当前帧图像平面的示意图,将三维点云从世界坐标系投影至当前帧图像对应的相机坐标系或图像坐标系后,得到对应的投影点。匹配当前帧图像的特征点与投影点,得到匹配点对。本示例性实施方式可以对投影点进行与特征点相同方式的描述,例如对特征点进行SIFT特征描述,对投影点也进行SIFT特征描述,这样通过计算特征点的SIFT特征向量与投影点的SIFT特征向量的相似度,可以进行特征点与投影点的匹配。
在一种实施方式中,参考图12所示,上述匹配当前帧图像的特征点与投影点,可以包括:
步骤S1210,将投影点向当前帧图像的特征点进行匹配,得到第三匹配信息;
步骤S1220,将当前帧图像的特征点向投影点进行匹配,得到第四匹配信息;
步骤S1230,根据第三匹配信息与第四匹配信息得到当前帧图像的特征点与投影点的匹配点对。
在得到投影点与当前帧图像的特征点的匹配点对后,将其中的投影点替换为三维点云中的点,得到当前帧图像的特征点(二维点)与三维点的匹配关系,进而通过PnP(Perspective-n-Point,一种基于2D-3D匹配关系求解位姿的算法)等算法求解得到第二位姿。
需要说明的是,图2所示的视觉定位方法可以应用于SLAM(Simultaneous Localization And Mapping,同时定位与建图)的场景,也可以应用于在已构建地图的情况下进行视觉定位的场景。下面以已构建地图的情况下进行视觉定位的场景为例,对视觉定位的流程做进一步说明。
在构建地图的环节,由工作人员使用手机、全景相机等图像采集设备采集目标场景的图像,并通过SFM(Structure From Motion,运动恢复结构)流程构建由三维点云形成的地图,可以参考图13所示。
假设某一用户位于目标场景中的某个位置,拿出手机对周围环境连续拍摄多帧图像,与此同时进行视觉定位。流程如下:
(一)视觉定位初始化
图14示出了视觉定位初始化的流程图。输入第1帧图像,一方面获取表面法向量,构建曼哈顿坐标系。一般的,可以设置曼哈顿坐标系与第1帧图像对应的相机坐标系的原点与轴方向均相同,由此得到初始的相机坐标系-曼哈顿坐标系变换参数,其中旋转矩阵R c1M可以是单位矩阵,平移向量T c1M可以是零向量。另一方面从第1帧图像中提取特征点并通过SIFT特征进行描述,保存特征点的位置与描述子,用于后续图像帧的处理。完成视觉定位初始化。
(二)确定目标变换参数
图15示出了对第i帧图像进行处理的流程图,图16示出了对第i帧图像进行处理的示意图。参考图15与图16所示,输入第i帧图像,当i大于或等于2时,获取表面法向量,以第i-1帧图像为参考帧图像,获取第i-1帧的相机坐标系-曼哈顿坐标系变换参数(C-M变换参数),利用该变换参数将第i帧的表面法向量映射至曼哈顿坐标系M,并通过聚类计算偏移,以得到第i帧的C-M变换参数;进而在计算第i帧与第i-1帧间的第一变换参数。从第i帧图像中提取特征点并通过SIFT特征进行描述,与第i-1帧图像的特征点进行匹配,以得到第i帧与第i-1帧间的第二变换参数。利用第一变换参数优化第二变换参数,得到目标变换参数。
(三)输出视觉定位结果
根据第i-1帧的位姿,输出第i帧的第一位姿,通过第一位姿对目标场景的三维点云进行重投影,得到对应的投影点;基于投影点的SIFT特征与第i帧图像的SIFT特征,得到投影点-特征点的匹配点对;最后根据匹配点对求解PnP算法,输出第i帧的第二位姿,即最终的视觉定位结果。基于每一帧的视觉定位结果,可以得到待定位设备的运动轨迹,由此实现实时的室内导航或其他功能。
本公开的示例性实施方式还提供一种视觉定位装置。参考图17所示,该视觉定位装置1700可以包括:
表面法向量获取模块1710,被配置为获取当前帧图像的表面法向量;
第一变换参数确定模块1720,被配置为通过将表面法向量投影至曼哈顿坐标系,确定当前帧图像与参考帧图像间的第一变换参数;
第二变换参数确定模块1730,被配置为匹配当前帧图像的特征点与参考帧图像的特征点,根据匹 配结果确定当前帧图像与参考帧图像间的第二变换参数;
目标变换参数确定模块1740,被配置为基于第一变换参数与第二变换参数,确定目标变换参数;
视觉定位结果输出模块1750,被配置为根据目标变换参数输出当前帧图像对应的视觉定位结果。
在一种实施方式中,表面法向量获取模块1710,被配置为:
利用预先训练的表面法向量估计网络对当前帧图像进行处理,得到当前帧图像的表面法向量。
在一种实施方式中,表面法向量估计网络包括编码子网络、解码子网络与卷积子网络。表面法向量获取模块1710,被配置为:
利用编码子网络对当前帧图像进行下采样操作,得到下采样中间图像与下采样目标图像;
利用解码子网络对下采样目标图像进行上采样操作以及与下采样中间图像的拼接操作,得到上采样目标图像;
利用卷积子网络对上采样目标图像进行卷积操作,得到表面法向量。
在一种实施方式中,第一变换参数确定模块1720,被配置为:
利用参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数,将表面法向量映射至曼哈顿坐标系;
基于表面法向量在曼哈顿坐标系下的偏移,确定当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数;
根据参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数与当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数,确定当前帧图像与参考帧图像间的第一变换参数。
在一种实施方式中,第一变换参数确定模块1720,被配置为:
利用参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数,将表面法向量在参考帧图像对应的相机坐标系下的三维轴坐标映射为在曼哈顿坐标系下的三维轴坐标;
将表面法向量在曼哈顿坐标系下的三维轴坐标映射为在曼哈顿坐标系的轴的切向平面上的二维坐标。
在一种实施方式中,第一变换参数确定模块1720,被配置为:
对表面法向量在切向平面上的二维坐标进行聚类,根据聚类中心确定表面法向量在切向平面上的偏移;
将偏移在切向平面上的二维坐标映射为在曼哈顿坐标系下的三维轴坐标;
根据参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数与偏移在曼哈顿坐标系下的三维轴坐标,确定当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数。
在一种实施方式中,上述将偏移在切向平面上的二维坐标映射为在曼哈顿坐标系下的三维轴坐标,包括:
通过指数映射将偏移在切向平面上的二维坐标映射至曼哈顿坐标系的单位球,得到偏移在曼哈顿坐标系下的三维轴坐标。
在一种实施方式中,上述参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数包括:参考帧图像对应的相机坐标系与曼哈顿坐标系间的相对旋转矩阵。
在一种实施方式中,第二变换参数确定模块1730,被配置为:
将参考帧图像的特征点向当前帧图像的特征点进行匹配,得到第一匹配信息;
将当前帧图像的特征点向参考帧图像的特征点进行匹配,得到第二匹配信息;
根据第一匹配信息与第二匹配信息得到匹配结果。
在一种实施方式中,第二变换参数确定模块1730,被配置为:
对第一匹配信息与第二匹配信息取交集或取并集,得到匹配结果。
在一种实施方式中,上述匹配当前帧图像的特征点与参考帧图像的特征点,还包括:
利用当前帧图像与参考帧图像的几何约束关系,从匹配结果中剔除误匹配点对。
在一种实施方式中,第一变换参数包括第一旋转矩阵,第二变换参数包括第二旋转矩阵。目标变换参数确定模块1740,被配置为:
基于第一旋转矩阵与第二旋转矩阵间的误差,建立损失函数;
通过迭代调整第二旋转矩阵,优化损失函数的最小值,将调整后的第二旋转矩阵作为目标变换参数中的旋转矩阵。
在一种实施方式中,视觉定位结果输出模块1750,被配置为:
根据目标变换参数与参考帧图像对应的位姿,确定当前帧图像对应的第一位姿;
利用第一位姿将目标场景的三维点云投影至当前帧图像的平面,得到对应的投影点,目标场景为当前帧图像与参考帧图像所拍摄的场景;
匹配当前帧图像的特征点与投影点,根据当前帧图像的特征点与投影点的匹配点对确定当前帧图像 的第二位姿;
将第二位姿作为当前帧图像对应的视觉定位结果进行输出。
在一种实施方式中,视觉定位结果输出模块1750,被配置为:
将投影点向当前帧图像的特征点进行匹配,得到第三匹配信息;
将当前帧图像的特征点向投影点进行匹配,得到第四匹配信息;
根据第三匹配信息与第四匹配信息得到当前帧图像的特征点与投影点的匹配点对。
在一种实施方式中,上述根据当前帧图像的特征点与投影点的匹配点对确定当前帧图像的第二位姿,包括:
将匹配点对中的投影点替换为三维点云中的三维点,得到当前帧图像的特征点与三维点的匹配关系,并基于匹配关系求解得到第二位姿。
在一种实施方式中,上述获取当前帧图像的表面法向量,包括:
获取当前帧图像中每个像素点的表面法向量。
在一种实施方式中,上述通过将表面法向量投影至曼哈顿坐标系,确定当前帧图像与参考帧图像间的第一变换参数,包括:
通过将每个像素点的表面法向量投影至曼哈顿坐标系,确定当前帧图像与参考帧图像间的第一变换参数。
本公开的示例性实施方式还提供另一种视觉定位装置。参考图18所示,该视觉定位装置1800可以包括处理器1810和存储器1820。其中,存储器1820存储有以下程序模块:
表面法向量获取模块1821,被配置为获取当前帧图像的表面法向量;
第一变换参数确定模块1822,被配置为通过将表面法向量投影至曼哈顿坐标系,确定当前帧图像与参考帧图像间的第一变换参数;
第二变换参数确定模块1823,被配置为匹配当前帧图像的特征点与参考帧图像的特征点,根据匹配结果确定当前帧图像与参考帧图像间的第二变换参数;
目标变换参数确定模块1824,被配置为基于第一变换参数与第二变换参数,确定目标变换参数;
视觉定位结果输出模块1825,被配置为根据目标变换参数输出当前帧图像对应的视觉定位结果。
处理器1810用于执行上述程序模块。
在一种实施方式中,表面法向量获取模块1821,被配置为:
利用预先训练的表面法向量估计网络对当前帧图像进行处理,得到当前帧图像的表面法向量。
在一种实施方式中,表面法向量估计网络包括编码子网络、解码子网络与卷积子网络。表面法向量获取模块1821,被配置为:
利用编码子网络对当前帧图像进行下采样操作,得到下采样中间图像与下采样目标图像;
利用解码子网络对下采样目标图像进行上采样操作以及与下采样中间图像的拼接操作,得到上采样目标图像;
利用卷积子网络对上采样目标图像进行卷积操作,得到表面法向量。
在一种实施方式中,第一变换参数确定模块1822,被配置为:
利用参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数,将表面法向量映射至曼哈顿坐标系;
基于表面法向量在曼哈顿坐标系下的偏移,确定当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数;
根据参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数与当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数,确定当前帧图像与参考帧图像间的第一变换参数。
在一种实施方式中,第一变换参数确定模块1822,被配置为:
利用参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数,将表面法向量在参考帧图像对应的相机坐标系下的三维轴坐标映射为在曼哈顿坐标系下的三维轴坐标;
将表面法向量在曼哈顿坐标系下的三维轴坐标映射为在曼哈顿坐标系的轴的切向平面上的二维坐标。
在一种实施方式中,第一变换参数确定模块1822,被配置为:
对表面法向量在切向平面上的二维坐标进行聚类,根据聚类中心确定表面法向量在切向平面上的偏移;
将偏移在切向平面上的二维坐标映射为在曼哈顿坐标系下的三维轴坐标;
根据参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数与偏移在曼哈顿坐标系下的三维轴坐标,确定当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数。
在一种实施方式中,上述将偏移在切向平面上的二维坐标映射为在曼哈顿坐标系下的三维轴坐标, 包括:
通过指数映射将偏移在切向平面上的二维坐标映射至曼哈顿坐标系的单位球,得到偏移在曼哈顿坐标系下的三维轴坐标。
在一种实施方式中,上述参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数包括:参考帧图像对应的相机坐标系与曼哈顿坐标系间的相对旋转矩阵。
在一种实施方式中,第二变换参数确定模块1823,被配置为:
将参考帧图像的特征点向当前帧图像的特征点进行匹配,得到第一匹配信息;
将当前帧图像的特征点向参考帧图像的特征点进行匹配,得到第二匹配信息;
根据第一匹配信息与第二匹配信息得到匹配结果。
在一种实施方式中,第二变换参数确定模块1823,被配置为:
对第一匹配信息与第二匹配信息取交集或取并集,得到匹配结果。
在一种实施方式中,上述匹配当前帧图像的特征点与参考帧图像的特征点,还包括:
利用当前帧图像与参考帧图像的几何约束关系,从匹配结果中剔除误匹配点对。
在一种实施方式中,第一变换参数包括第一旋转矩阵,第二变换参数包括第二旋转矩阵。目标变换参数确定模块1824,被配置为:
基于第一旋转矩阵与第二旋转矩阵间的误差,建立损失函数;
通过迭代调整第二旋转矩阵,优化损失函数的最小值,将调整后的第二旋转矩阵作为目标变换参数中的旋转矩阵。
在一种实施方式中,视觉定位结果输出模块1825,被配置为:
根据目标变换参数与参考帧图像对应的位姿,确定当前帧图像对应的第一位姿;
利用第一位姿将目标场景的三维点云投影至当前帧图像的平面,得到对应的投影点,目标场景为当前帧图像与参考帧图像所拍摄的场景;
匹配当前帧图像的特征点与投影点,根据当前帧图像的特征点与投影点的匹配点对确定当前帧图像的第二位姿;
将第二位姿作为当前帧图像对应的视觉定位结果进行输出。
在一种实施方式中,视觉定位结果输出模块1825,被配置为:
将投影点向当前帧图像的特征点进行匹配,得到第三匹配信息;
将当前帧图像的特征点向投影点进行匹配,得到第四匹配信息;
根据第三匹配信息与第四匹配信息得到当前帧图像的特征点与投影点的匹配点对。
在一种实施方式中,上述根据当前帧图像的特征点与投影点的匹配点对确定当前帧图像的第二位姿,包括:
将匹配点对中的投影点替换为三维点云中的三维点,得到当前帧图像的特征点与三维点的匹配关系,并基于匹配关系求解得到第二位姿。
在一种实施方式中,上述获取当前帧图像的表面法向量,包括:
获取当前帧图像中每个像素点的表面法向量。
在一种实施方式中,上述通过将表面法向量投影至曼哈顿坐标系,确定当前帧图像与参考帧图像间的第一变换参数,包括:
通过将每个像素点的表面法向量投影至曼哈顿坐标系,确定当前帧图像与参考帧图像间的第一变换参数。
上述装置中各部分的细节在方法部分实施方式中已经详细说明,因而不再赘述。
本公开的示例性实施方式还提供了一种计算机可读存储介质,可以实现为一种程序产品的形式,其包括程序代码,当程序产品在电子设备上运行时,程序代码用于使电子设备执行本说明书上述“示例性方法”部分中描述的根据本公开各种示例性实施方式的步骤。在一种实施方式中,该程序产品可以实现为便携式紧凑盘只读存储器(CD-ROM)并包括程序代码,并可以在电子设备,例如个人电脑上运行。然而,本公开的程序产品不限于此,在本文件中,可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。
程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。
计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。
可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。
可以以一种或多种程序设计语言的任意组合来编写用于执行本公开操作的程序代码,程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本公开的示例性实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。
所属技术领域的技术人员能够理解,本公开的各个方面可以实现为系统、方法或程序产品。因此,本公开的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本公开的其他实施方式。本申请旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施方式仅被视为示例性的,本公开的真正范围和精神由权利要求指出。
应当理解的是,本公开并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求来限定。

Claims (20)

  1. 一种视觉定位方法,其特征在于,包括:
    获取当前帧图像的表面法向量;
    通过将所述表面法向量投影至曼哈顿坐标系,确定所述当前帧图像与参考帧图像间的第一变换参数;
    匹配所述当前帧图像的特征点与所述参考帧图像的特征点,根据匹配结果确定所述当前帧图像与所述参考帧图像间的第二变换参数;
    基于所述第一变换参数与所述第二变换参数,确定目标变换参数;
    根据所述目标变换参数输出所述当前帧图像对应的视觉定位结果。
  2. 根据权利要求1所述的方法,其特征在于,所述获取当前帧图像的表面法向量,包括:
    利用预先训练的表面法向量估计网络对所述当前帧图像进行处理,得到所述当前帧图像的表面法向量。
  3. 根据权利要求2所述的方法,其特征在于,所述表面法向量估计网络包括编码子网络、解码子网络与卷积子网络;所述利用预先训练的表面法向量估计网络对所述当前帧图像进行处理,得到所述当前帧图像的表面法向量,包括:
    利用所述编码子网络对所述当前帧图像进行下采样操作,得到下采样中间图像与下采样目标图像;
    利用所述解码子网络对所述下采样目标图像进行上采样操作以及与所述下采样中间图像的拼接操作,得到上采样目标图像;
    利用所述卷积子网络对所述上采样目标图像进行卷积操作,得到所述表面法向量。
  4. 根据权利要求1所述的方法,其特征在于,所述通过将所述表面法向量投影至曼哈顿坐标系,确定所述当前帧图像与参考帧图像间的第一变换参数,包括:
    利用所述参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数,将所述表面法向量映射至曼哈顿坐标系;
    基于所述表面法向量在所述曼哈顿坐标系下的偏移,确定所述当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数;
    根据所述参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数与所述当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数,确定所述当前帧图像与所述参考帧图像间的第一变换参数。
  5. 根据权利要求4所述的方法,其特征在于,所述利用所述参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数,将所述表面法向量映射至曼哈顿坐标系,包括:
    利用所述参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数,将所述表面法向量在所述参考帧图像对应的相机坐标系下的三维轴坐标映射为在所述曼哈顿坐标系下的三维轴坐标;
    将所述表面法向量在所述曼哈顿坐标系下的三维轴坐标映射为在所述曼哈顿坐标系的轴的切向平面上的二维坐标。
  6. 根据权利要求5所述的方法,其特征在于,所述基于所述表面法向量在所述曼哈顿坐标系下的偏移,确定所述当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数,包括:
    对所述表面法向量在所述切向平面上的二维坐标进行聚类,根据聚类中心确定所述表面法向量在所述切向平面上的偏移;
    将所述偏移在所述切向平面上的二维坐标映射为在所述曼哈顿坐标系下的三维轴坐标;
    根据所述参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数与所述偏移在所述曼哈顿坐标系下的三维轴坐标,确定所述当前帧图像对应的相机坐标系-曼哈顿坐标系变换参数。
  7. 根据权利要求6所述的方法,其特征在于,所述将所述偏移在所述切向平面上的二维坐标映射为在所述曼哈顿坐标系下的三维轴坐标,包括:
    通过指数映射将所述偏移在所述切向平面上的二维坐标映射至所述曼哈顿坐标系的单位球,得到所述偏移在所述曼哈顿坐标系下的三维轴坐标。
  8. 根据权利要求4所述的方法,其特征在于,所述参考帧图像对应的相机坐标系-曼哈顿坐标系变换参数包括:所述参考帧图像对应的相机坐标系与所述曼哈顿坐标系间的相对旋转矩阵。
  9. 根据权利要求1所述的方法,其特征在于,所述匹配所述当前帧图像的特征点与所述参考帧图像的特征点,包括:
    将所述参考帧图像的特征点向所述当前帧图像的特征点进行匹配,得到第一匹配信息;
    将所述当前帧图像的特征点向所述参考帧图像的特征点进行匹配,得到第二匹配信息;
    根据所述第一匹配信息与所述第二匹配信息得到所述匹配结果。
  10. 根据权利要求9所述的方法,其特征在于,所述根据所述第一匹配信息与所述第二匹配信息得到所述匹配结果,包括:
    对所述第一匹配信息与所述第二匹配信息取交集或取并集,得到所述匹配结果。
  11. 根据权利要求9所述的方法,其特征在于,所述匹配所述当前帧图像的特征点与所述参考帧图像的特征点,还包括:
    利用所述当前帧图像与所述参考帧图像的几何约束关系,从所述匹配结果中剔除误匹配点对。
  12. 根据权利要求1所述的方法,其特征在于,所述第一变换参数包括第一旋转矩阵,所述第二变换参数包括第二旋转矩阵;所述基于所述第一变换参数与所述第二变换参数,确定目标变换参数,包括:
    基于所述第一旋转矩阵与所述第二旋转矩阵间的误差,建立损失函数;
    通过迭代调整所述第二旋转矩阵,优化所述损失函数的最小值,将调整后的所述第二旋转矩阵作为所述目标变换参数中的旋转矩阵。
  13. 根据权利要求1所述的方法,其特征在于,所述根据所述目标变换参数输出所述当前帧图像对应的视觉定位结果,包括:
    根据所述目标变换参数与所述参考帧图像对应的位姿,确定所述当前帧图像对应的第一位姿;
    利用所述第一位姿将目标场景的三维点云投影至所述当前帧图像的平面,得到对应的投影点,所述目标场景为所述当前帧图像与所述参考帧图像所拍摄的场景;
    匹配所述当前帧图像的特征点与所述投影点,根据所述当前帧图像的特征点与所述投影点的匹配点对确定所述当前帧图像的第二位姿;
    将所述第二位姿作为所述当前帧图像对应的视觉定位结果进行输出。
  14. 根据权利要求13所述的方法,其特征在于,所述匹配所述当前帧图像的特征点与所述投影点,包括:
    将所述投影点向所述当前帧图像的特征点进行匹配,得到第三匹配信息;
    将所述当前帧图像的特征点向所述投影点进行匹配,得到第四匹配信息;
    根据所述第三匹配信息与所述第四匹配信息得到所述当前帧图像的特征点与所述投影点的匹配点对。
  15. 根据权利要求14所述的方法,其特征在于,所述根据所述当前帧图像的特征点与所述投影点的匹配点对确定所述当前帧图像的第二位姿,包括:
    将所述匹配点对中的所述投影点替换为所述三维点云中的三维点,得到所述当前帧图像的特征点与所述三维点的匹配关系,并基于所述匹配关系求解得到所述第二位姿。
  16. 根据权利要求1所述的方法,其特征在于,所述获取当前帧图像的表面法向量,包括:
    获取所述当前帧图像中每个像素点的表面法向量。
  17. 根据权利要求16所述的方法,其特征在于,所述通过将所述表面法向量投影至曼哈顿坐标系,确定所述当前帧图像与参考帧图像间的第一变换参数,包括:
    通过将所述每个像素点的表面法向量投影至所述曼哈顿坐标系,确定所述当前帧图像与所述参考帧图像间的第一变换参数。
  18. 一种视觉定位装置,其特征在于,包括处理器与存储器,所述处理器用于执行所述存储器中存储的以下程序模块:
    表面法向量获取模块,被配置为获取当前帧图像的表面法向量;
    第一变换参数确定模块,被配置为通过将所述表面法向量投影至曼哈顿坐标系,确定所述当前帧图像与参考帧图像间的第一变换参数;
    第二变换参数确定模块,被配置为匹配所述当前帧图像的特征点与所述参考帧图像的特征点,根据匹配结果确定所述当前帧图像与所述参考帧图像间的第二变换参数;
    目标变换参数确定模块,被配置为基于所述第一变换参数与所述第二变换参数,确定目标变换参数;
    视觉定位结果输出模块,被配置为根据所述目标变换参数输出所述当前帧图像对应的视觉定位结果。
  19. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现权利要求1至17任一项所述的方法。
  20. 一种电子设备,其特征在于,包括:
    处理器;以及
    存储器,用于存储所述处理器的可执行指令;
    其中,所述处理器配置为经由执行所述可执行指令来执行权利要求1至17任一项所述的方法。
PCT/CN2022/078435 2021-03-29 2022-02-28 视觉定位方法、视觉定位装置、存储介质与电子设备 WO2022206255A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/372,477 US20240029297A1 (en) 2021-03-29 2023-09-25 Visual positioning method, storage medium and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110336267.8 2021-03-29
CN202110336267.8A CN113096185B (zh) 2021-03-29 2021-03-29 视觉定位方法、视觉定位装置、存储介质与电子设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/372,477 Continuation US20240029297A1 (en) 2021-03-29 2023-09-25 Visual positioning method, storage medium and electronic device

Publications (1)

Publication Number Publication Date
WO2022206255A1 true WO2022206255A1 (zh) 2022-10-06

Family

ID=76670676

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078435 WO2022206255A1 (zh) 2021-03-29 2022-02-28 视觉定位方法、视觉定位装置、存储介质与电子设备

Country Status (3)

Country Link
US (1) US20240029297A1 (zh)
CN (1) CN113096185B (zh)
WO (1) WO2022206255A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630598A (zh) * 2023-07-19 2023-08-22 齐鲁空天信息研究院 大场景下的视觉定位方法、装置、电子设备及存储介质
CN116958271A (zh) * 2023-06-06 2023-10-27 阿里巴巴(中国)有限公司 标定参数确定方法以及装置
WO2024083010A1 (zh) * 2022-10-20 2024-04-25 腾讯科技(深圳)有限公司 一种视觉定位方法及相关装置

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096185B (zh) * 2021-03-29 2023-06-06 Oppo广东移动通信有限公司 视觉定位方法、视觉定位装置、存储介质与电子设备
CN115578539B (zh) * 2022-12-07 2023-09-19 深圳大学 室内空间高精度视觉位置定位方法、终端及存储介质
CN117893610B (zh) * 2024-03-14 2024-05-28 四川大学 基于变焦单目视觉的航空装配机器人姿态测量系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292956A (zh) * 2017-07-12 2017-10-24 杭州电子科技大学 一种基于曼哈顿假设的场景重建方法
US20170343356A1 (en) * 2016-05-25 2017-11-30 Regents Of The University Of Minnesota Resource-aware large-scale cooperative 3d mapping using multiple mobile devices
CN111784776A (zh) * 2020-08-03 2020-10-16 Oppo广东移动通信有限公司 视觉定位方法及装置、计算机可读介质和电子设备
CN111967481A (zh) * 2020-09-18 2020-11-20 北京百度网讯科技有限公司 视觉定位方法、装置、电子设备及存储介质
CN113096185A (zh) * 2021-03-29 2021-07-09 Oppo广东移动通信有限公司 视觉定位方法、视觉定位装置、存储介质与电子设备

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106910210B (zh) * 2017-03-03 2018-09-11 百度在线网络技术(北京)有限公司 用于生成图像信息的方法和装置
CN107292965B (zh) * 2017-08-03 2020-10-13 北京航空航天大学青岛研究院 一种基于深度图像数据流的虚实遮挡处理方法
CN108717712B (zh) * 2018-05-29 2021-09-03 东北大学 一种基于地平面假设的视觉惯导slam方法
CN109544677B (zh) * 2018-10-30 2020-12-25 山东大学 基于深度图像关键帧的室内场景主结构重建方法及系统
CN109974693B (zh) * 2019-01-31 2020-12-11 中国科学院深圳先进技术研究院 无人机定位方法、装置、计算机设备及存储介质
CN110335316B (zh) * 2019-06-28 2023-04-18 Oppo广东移动通信有限公司 基于深度信息的位姿确定方法、装置、介质与电子设备
CN110322500B (zh) * 2019-06-28 2023-08-15 Oppo广东移动通信有限公司 即时定位与地图构建的优化方法及装置、介质和电子设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170343356A1 (en) * 2016-05-25 2017-11-30 Regents Of The University Of Minnesota Resource-aware large-scale cooperative 3d mapping using multiple mobile devices
CN107292956A (zh) * 2017-07-12 2017-10-24 杭州电子科技大学 一种基于曼哈顿假设的场景重建方法
CN111784776A (zh) * 2020-08-03 2020-10-16 Oppo广东移动通信有限公司 视觉定位方法及装置、计算机可读介质和电子设备
CN111967481A (zh) * 2020-09-18 2020-11-20 北京百度网讯科技有限公司 视觉定位方法、装置、电子设备及存储介质
CN113096185A (zh) * 2021-03-29 2021-07-09 Oppo广东移动通信有限公司 视觉定位方法、视觉定位装置、存储介质与电子设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NING RUIXIN: "Indoor Robot Visual Localization and Mapping", MASTER THESIS, TIANJIN POLYTECHNIC UNIVERSITY, CN, 15 February 2021 (2021-02-15), CN , XP055973147, ISSN: 1674-0246 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024083010A1 (zh) * 2022-10-20 2024-04-25 腾讯科技(深圳)有限公司 一种视觉定位方法及相关装置
CN116958271A (zh) * 2023-06-06 2023-10-27 阿里巴巴(中国)有限公司 标定参数确定方法以及装置
CN116958271B (zh) * 2023-06-06 2024-07-16 阿里巴巴(中国)有限公司 标定参数确定方法以及装置
CN116630598A (zh) * 2023-07-19 2023-08-22 齐鲁空天信息研究院 大场景下的视觉定位方法、装置、电子设备及存储介质
CN116630598B (zh) * 2023-07-19 2023-09-29 齐鲁空天信息研究院 大场景下的视觉定位方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN113096185B (zh) 2023-06-06
US20240029297A1 (en) 2024-01-25
CN113096185A (zh) 2021-07-09

Similar Documents

Publication Publication Date Title
WO2022206255A1 (zh) 视觉定位方法、视觉定位装置、存储介质与电子设备
CN112269851B (zh) 地图数据更新方法、装置、存储介质与电子设备
EP3786890B1 (en) Method and apparatus for determining pose of image capture device, and storage medium therefor
CN112927362B (zh) 地图重建方法及装置、计算机可读介质和电子设备
CN109788189B (zh) 将相机与陀螺仪融合在一起的五维视频稳定化装置及方法
CN105283905B (zh) 使用点和线特征的稳健跟踪
CN112270710B (zh) 位姿确定方法、位姿确定装置、存储介质与电子设备
CN112381828B (zh) 基于语义和深度信息的定位方法、装置、介质与设备
JP7150917B2 (ja) 地図作成のためのコンピュータ実施方法及び装置、電子機器、記憶媒体並びにコンピュータプログラム
CN112270755B (zh) 三维场景构建方法、装置、存储介质与电子设备
CN112598780B (zh) 实例对象模型构建方法及装置、可读介质和电子设备
CN112927271B (zh) 图像处理方法、图像处理装置、存储介质与电子设备
CN112288816B (zh) 位姿优化方法、位姿优化装置、存储介质与电子设备
WO2022033111A1 (zh) 图像信息提取方法、训练方法及装置、介质和电子设备
WO2022160857A1 (zh) 图像处理方法及装置、计算机可读存储介质和电子设备
CN113793370B (zh) 三维点云配准方法、装置、电子设备及可读介质
CN113409203A (zh) 图像模糊程度确定方法、数据集构建方法与去模糊方法
CN115222974A (zh) 特征点匹配方法及装置、存储介质及电子设备
CN112598732B (zh) 目标设备定位方法、地图构建方法及装置、介质、设备
CN114241039A (zh) 地图数据处理方法、装置、存储介质与电子设备
CN114419189A (zh) 地图构建方法及装置、电子设备、存储介质
CN116137025A (zh) 视频图像矫正方法及装置、计算机可读介质和电子设备
CN115423853A (zh) 一种图像配准方法和设备
CN114257748B (zh) 视频防抖方法及装置、计算机可读介质和电子设备
CN118053009A (zh) 视差处理方法、视差处理装置、介质与电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22778440

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22778440

Country of ref document: EP

Kind code of ref document: A1