CN116503474A - Pose acquisition method, pose acquisition device, electronic equipment, storage medium and program product


Info

Publication number
CN116503474A
Authority
CN
China
Prior art keywords
target
image
voxel
pose
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310146798.XA
Other languages
Chinese (zh)
Inventor
王友辰
李欣
张鑫
刘畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310146798.XA
Publication of CN116503474A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/05 Geographic models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a pose acquisition method, a pose acquisition device, electronic equipment, a storage medium and a program product, relating to the fields of Internet of Vehicles and intelligent transportation. The method comprises: acquiring a three-dimensional map image of a target space region and a plurality of two-dimensional live-action images of the target space region; acquiring the pose corresponding to each two-dimensional live-action image, and selecting, based on the pose, a target voxel corresponding to each two-dimensional live-action image from a plurality of voxels of the three-dimensional map image; performing feature assignment on each target voxel based on the image features of each two-dimensional live-action image to obtain a voxel feature map of the target space region; acquiring the live-action image features of a live-action shot image captured by an image acquisition device, and matching the live-action image features with the voxel features of the voxel feature map to obtain a matching result; and determining the target pose of the image acquisition device in the target space region based on the matching result. With the method and the device, the efficiency and accuracy of pose acquisition can be effectively improved.

Description

Pose acquisition method, pose acquisition device, electronic equipment, storage medium and program product
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a pose acquisition method, a pose acquisition device, an electronic device, a storage medium, and a program product.
Background
With the rapid development of augmented reality navigation technology, efficiently and accurately acquiring the pose of a device in a spatial region has a direct influence on navigation accuracy.
In the related art, the position and posture of a user are usually determined by repeatedly observing map features with an image acquisition device while moving from an unknown starting point in an unknown environment. Because scene depth is lost during capture, the real-world scale cannot be recovered and only relative coordinates can be obtained, so the image acquisition device cannot be used for real-world positioning; the accuracy of the determined pose (position and posture) is therefore low, and the repeated observation of map features is time-consuming, so the pose acquisition efficiency is also low.
Disclosure of Invention
The embodiment of the application provides a pose acquisition method, a pose acquisition device, electronic equipment, a computer readable storage medium and a computer program product, which can effectively improve the efficiency and accuracy of acquiring the pose.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides a pose acquisition method, which comprises the following steps:
acquiring a three-dimensional map image of a target space region and a plurality of two-dimensional live-action images of the target space region;
acquiring the pose corresponding to each two-dimensional live-action image respectively, and selecting a target voxel corresponding to each two-dimensional live-action image from a plurality of voxels of the three-dimensional map image based on the pose;
acquiring image features of each two-dimensional live-action image, and carrying out feature assignment on each target voxel based on the image features to obtain a voxel feature map of the target space region;
acquiring the real scene image characteristics of the real scene shooting image acquired by the image acquisition equipment, and matching the real scene image characteristics with the voxel characteristics of the voxel characteristic map to obtain a matching result;
and determining the target pose of the image acquisition equipment in the target space area based on the matching result.
The embodiment of the application provides a pose acquisition device, which comprises:
the acquisition module is used for acquiring a three-dimensional map image of a target space area and a plurality of two-dimensional live-action images of the target space area;
The selection module is used for acquiring the pose corresponding to each two-dimensional live-action image respectively, and selecting a target voxel corresponding to each two-dimensional live-action image from a plurality of voxels of the three-dimensional map image based on the pose;
the assignment module is used for acquiring the image characteristics of each two-dimensional live-action image, and carrying out characteristic assignment on each target voxel based on the image characteristics to obtain a voxel characteristic map of the target space region;
the matching module is used for acquiring the real-scene image characteristics of the real-scene shooting image acquired by the image acquisition equipment, and matching the real-scene image characteristics with the voxel characteristics of the voxel characteristic map to obtain a matching result;
and the determining module is used for determining the target pose of the image acquisition equipment in the target space area based on the matching result.
In some embodiments, the selecting module is further configured to generate, in the three-dimensional map image, at least one ray corresponding to each of the two-dimensional live-action images, based on a pose corresponding to each of the two-dimensional live-action images; respectively acquiring at least one intersection voxel of each ray and the three-dimensional map image; for each of the rays, a distance between each of the intersecting voxels of the ray and a start point of the ray is determined, and an intersecting voxel at which the distance is smallest is determined as the target voxel.
In some embodiments, the pose is used for indicating a target position and a pose angle of a live-action acquisition device for acquiring the two-dimensional live-action image in the target space region; the selection module is further configured to perform the following processing for each two-dimensional live-action image: determining target map coordinates corresponding to the target position in the three-dimensional map image based on the target position; determining a ray angle range of the ray in the three-dimensional map image based on the attitude angle and the view angle of the live-action acquisition equipment; and in the three-dimensional map image, the target map coordinate is taken as the starting point of the ray, and the at least one ray is generated within the ray angle range.
In some embodiments, the selecting module is further configured to obtain a position-coordinate mapping file, where the position-coordinate mapping file is used to record a mapping relationship between each position in the target space area and a corresponding map coordinate in the three-dimensional map image, and the positions in the target space area are in one-to-one correspondence with the map coordinates in the three-dimensional map image; and inquiring a target mapping relation comprising the target position in the position-coordinate mapping file, and determining the map coordinates in the target mapping relation as the target map coordinates.
In some embodiments, the selecting module is further configured to obtain a reference view angle, where twice the size of the reference view angle is equal to the size of the view angle of the live-action collection device; subtract the reference view angle from the attitude angle to obtain a minimum angle value of the ray angle range, and add the reference view angle to the attitude angle to obtain a maximum angle value of the ray angle range; and determine the angle range between the minimum angle value and the maximum angle value as the ray angle range.
In some embodiments, the image features include pixel features of pixels in the two-dimensional live-action image; the assignment module is further configured to obtain at least one associated pixel point associated with a corresponding target voxel from a plurality of pixel points included in each two-dimensional live-action image; combining the pixel characteristics of each associated pixel point to determine the target voxel characteristics of the target voxels corresponding to each two-dimensional live-action image; and in the three-dimensional map image, respectively carrying out feature assignment on each target voxel based on the target voxel feature of each target voxel to obtain a voxel feature map of the target space region.
In some embodiments, the assignment module is further configured to perform the following processing for each target voxel corresponding to each two-dimensional live-action image: when the number of the associated pixel points is one, determining the pixel characteristics of the associated pixel points as target voxel characteristics of the target voxels; and when the number of the associated pixel points is a plurality of, carrying out weighted summation on the pixel characteristics of each associated pixel point to obtain the target voxel characteristics of the target voxels.
In some embodiments, the voxel features comprise a target voxel feature of each of the target voxels in the voxel feature map; the matching module is further configured to perform feature matching on the live-action image features and the target voxel features to obtain feature matching results corresponding to the target voxel features; when a feature matching result indicates that the live-action image features and a target voxel feature are successfully matched, combine the live-action shot image and the target voxel corresponding to the target voxel feature into a shot image-voxel pair; and determine each combined shot image-voxel pair as the matching result.
In some embodiments, the matching module is further configured to perform the following processing for each of the target voxel features: determining feature distances between the target voxel features and the live-action image features; when the feature distance is greater than or equal to a distance threshold, determining a feature matching result corresponding to the target voxel feature as a first matching result, wherein the first matching result is used for indicating that the real-scene image feature and the target voxel feature are successfully matched; and when the feature distance is smaller than the distance threshold, determining a feature matching result corresponding to the target voxel feature as a second matching result, wherein the second matching result is used for indicating that the matching of the live-action image feature and the target voxel feature fails.
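As a non-limiting illustration of the feature matching described above, the following Python sketch compares a live-action image feature against a set of target voxel features; the use of cosine similarity as the "feature distance", the threshold value and all identifiers are assumptions of this sketch rather than details fixed by the embodiment.

```python
import numpy as np

def match_voxel_features(live_feature, voxel_features, threshold=0.8):
    """Compare the live-action image feature against each target voxel feature.

    Following the convention described above, a pair is treated as successfully
    matched when the feature distance is greater than or equal to the threshold;
    cosine similarity is used here as that measure, an assumption of this sketch.
    """
    matches = []
    q = live_feature / (np.linalg.norm(live_feature) + 1e-12)
    for voxel_id, feat in voxel_features.items():
        v = feat / (np.linalg.norm(feat) + 1e-12)
        score = float(np.dot(q, v))
        if score >= threshold:          # first matching result: success
            matches.append((voxel_id, score))
        # otherwise: second matching result, the pair is discarded
    return matches

# usage: voxel_features maps voxel ids to feature vectors of the same length
pairs = match_voxel_features(np.random.rand(128),
                             {0: np.random.rand(128), 1: np.random.rand(128)})
```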
In some embodiments, the matching result includes at least one shot image-voxel pair indicating a mapping relationship between the live-action shot image and a target voxel; the determining module is further configured to select a target shot image-voxel pair from the at least one shot image-voxel pair, the target shot image-voxel pair being the shot image-voxel pair with the highest matching degree among the at least one shot image-voxel pair, where the matching degree indicates how well the live-action shot image matches the target voxel; and determine the target pose of the image acquisition device in the target space region based on the target shot image-voxel pair.
In some embodiments, the determining module is further configured to obtain, for the target voxel in the target shot image-voxel pair, the three-dimensional position coordinates of the target voxel in the three-dimensional map image, and obtain, from the live-action shot image, at least one associated pixel point associated with the target voxel; determine the two-dimensional position coordinates of each associated pixel point in the live-action shot image; and perform pose prediction on the image acquisition device based on the three-dimensional position coordinates and the two-dimensional position coordinates to obtain the target pose of the image acquisition device in the target space region.
In some embodiments, the pose prediction is implemented by a pose prediction model comprising a parameter transformation layer and a pose estimation layer; the determining module is further configured to invoke the parameter transformation layer to perform parameter transformation on the three-dimensional position coordinate and the two-dimensional position coordinate to obtain a parameter transformation matrix; and calling the pose estimation layer, and carrying out pose estimation on the image acquisition equipment based on the parameter transformation matrix to obtain the target pose of the image acquisition equipment in the target space region.
In some embodiments, the pose acquisition device further includes: the map module is used for receiving a pose acquisition request sent by the image acquisition equipment in the navigation process; responding to the pose acquisition request, and sending the target pose to the image acquisition equipment; the target pose is used for rendering the real scene navigation map of the target space area by combining the target pose and the real scene shooting map through the image acquisition equipment.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions or computer programs;
and the processor is used for realizing the pose acquisition method provided by the embodiment of the application when executing the computer executable instructions or the computer programs stored in the memory.
The embodiment of the application provides a computer readable storage medium, which stores computer executable instructions for implementing the pose acquisition method provided by the embodiment of the application when the computer executable instructions cause a processor to execute.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the pose acquisition method according to the embodiment of the application.
The embodiment of the application has the following beneficial effects:
By performing feature assignment on each target voxel in the three-dimensional map image to obtain a voxel feature map, matching the live-action image features against the voxel features of the voxel feature map, and determining the target pose based on the matching result, the dependence on the three-dimensional map image during pose determination is effectively reduced (it is converted into a dependence on the voxel feature map). Since the accuracy of the three-dimensional map image often depends on the acquisition equipment used to collect it, a low-accuracy acquisition device may produce a three-dimensional map image that does not accurately reflect the features of the target space region; by determining the voxel feature map and acquiring the pose from it, the target pose can still be acquired accurately even when the accuracy of the three-dimensional map image is low, which effectively guarantees the accuracy of the determined target pose. In addition, because feature assignment is performed only on part of the voxels (the target voxels) of the three-dimensional map image rather than on all voxels, the resulting voxel feature map is smaller, and the amount of matching computation is effectively reduced when the voxel feature map is used for matching, so the pose acquisition efficiency is effectively improved.
Drawings
Fig. 1 is a schematic architecture diagram of a pose acquisition system provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device for acquiring a pose according to an embodiment of the present application;
fig. 3 to 7 are schematic flow diagrams of a pose acquisition method according to an embodiment of the present application;
FIG. 8 is a schematic view of the effect of a target spatial region provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a three-dimensional map acquisition device and a live-action acquisition device provided in an embodiment of the present application;
fig. 10 is a schematic diagram of a pose acquisition method according to an embodiment of the present application;
fig. 11 is a schematic view of a display interface of an image capturing device according to an embodiment of the present application;
FIG. 12 is an interface schematic of an image capture device provided in an embodiment of the present application;
fig. 13 is a flow chart of a pose acquisition method according to an embodiment of the present application;
fig. 14 is an effect schematic diagram of a point cloud map provided in an embodiment of the present application;
fig. 15 is a schematic diagram of a pose acquisition method according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used in the examples of this application have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the embodiments of the application is for the purpose of describing the embodiments of the application only and is not intended to be limiting of the application.
Before further describing embodiments of the present application in detail, the terms and expressions that are referred to in the embodiments of the present application are described, and are suitable for the following explanation.
1) Pose: a description of the position and attitude of an object in a given coordinate system. The pose describes the position and the attitude of the object in a spatial coordinate system: the position indicates the position coordinates of the object in the spatial coordinate system, and the attitude describes the orientation of the object in the spatial coordinate system.
2) Voxel: short for volume element (volume pixel), the smallest unit of digital data in the partitioning of three-dimensional space. A volume containing voxels can be represented by volume rendering or by extracting a polygonal isosurface for a given threshold contour. Voxels are used in three-dimensional imaging, scientific data, medical imaging and other fields; conceptually, a voxel is the three-dimensional counterpart of a pixel, the smallest unit of two-dimensional computer image data.
3) Target spatial region: refers to a specific spatial region in the real world, for example, a spatial region corresponding to a certain market in a certain city, a spatial region corresponding to a certain scenic spot, and the like.
4) Augmented reality (Augmented Reality, AR): a technology that skillfully fuses virtual information with the real world, making wide use of multimedia, three-dimensional modelling, real-time registration, intelligent interaction, sensing and other technical means. Computer-generated virtual information such as text, images, three-dimensional models, music and video is simulated and then applied to the real world, and the two kinds of information complement each other, thereby achieving an enhancement of the real world. AR promotes the integration of real-world information and virtual-world information: entity information that would otherwise be difficult to experience within the spatial range of the real world is simulated on the basis of computer and other technology, and the virtual information is effectively superimposed onto the real world, where it can be perceived by the human senses, achieving a sensory experience beyond reality. After the real environment and the virtual object are superimposed, they can exist simultaneously in the same picture and space.
5) Live-action map (Real Map): a map in which real street views can be seen. On top of the electronic base-map architecture, secondary development is carried out according to customer needs, so that the positions of a customer's branches and outlets can be displayed and navigated to, geographic locations can be quickly located and bus routes queried, and massive multimedia information can be integrated on map markers; while retrieving the specific geographic position of a branch or outlet, customers can also view the information integrated on the marker. By viewing 360-degree real scenes at a marker, customers can understand the internal and external environment without leaving home. The live-action map innovatively combines three-dimensional real scenes with the electronic map, providing live-action map search services for internet users.
6) PnP (Perspective-n-Point) problem: the problem of solving 3D-to-2D point-pair motion, whose goal is to solve the pose of the camera coordinate system relative to the world coordinate system. It describes how, when the coordinates of n 3D points (relative to the world coordinate system) and the pixel coordinates of these points are known, the corresponding perspective projection relationship is computed so as to obtain the camera pose or the object pose. The solvePnP function provided by OpenCV can be used to solve the PnP problem, to calculate the camera pose or the object pose, and also to implement spatial positioning.
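As an illustration of the PnP relationship, the following Python sketch recovers a camera pose from known 3D-2D point correspondences using OpenCV's solvePnP; the point coordinates, intrinsic parameters, ground-truth pose and the absence of lens distortion are invented for the example.

```python
import numpy as np
import cv2

# 3D points in the world (map) coordinate system; values are illustrative only.
object_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                          [1.0, 1.0, 0.0], [0.5, 0.5, 1.0], [0.2, 0.8, 0.4]], dtype=np.float64)

# Assumed camera intrinsics (as obtained from camera calibration) and no distortion.
camera_matrix = np.array([[800.0, 0.0, 320.0],
                          [0.0, 800.0, 240.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)

# Project the 3D points with a known ground-truth pose to obtain consistent 2D pixel points.
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.3, -0.1, 4.0])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, camera_matrix, dist_coeffs)

# Solve the PnP problem: the recovered rvec/tvec should approximate the ground truth.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs)
if ok:
    rotation, _ = cv2.Rodrigues(rvec)   # attitude: rotation from the world frame to the camera frame
    print("rotation:\n", rotation, "\ntranslation:", tvec.ravel())
```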
7) Calibrating a camera: in image measurement processes and machine vision applications, in order to determine the correlation between the three-dimensional geometrical position of a point on the surface of a spatial object and its corresponding point in the image, a geometrical model of the camera imaging, i.e. a camera model, has to be established, and these camera model parameters are camera parameters. The process of solving the camera parameters is called camera calibration. The camera parameters include camera internal parameters and camera external parameters, wherein the camera internal parameters are determined by the camera and cannot be changed due to external environment.
8) Camera model: the process of mapping coordinate points from the three-dimensional world coordinate system onto the two-dimensional image plane; it is the bridge that connects three-dimensional space points with two-dimensional plane points. Camera models include at least the pinhole camera model and the fisheye camera model. Taking the pinhole camera model as an example, there are four coordinate systems involved: the three-dimensional world coordinate system, the three-dimensional camera coordinate system, the two-dimensional image physical coordinate system and the two-dimensional image pixel coordinate system.
9) Camera coordinate system: a three-dimensional rectangular coordinate system, also called the three-dimensional camera coordinate system. The optical center of the camera is taken as the origin O of the coordinate system; the x axis and y axis are parallel to the two perpendicular sides of the camera's imaging plane (that is, parallel to the x and y directions of the image); the optical axis of the camera is taken as the z axis (or the z axis is parallel to the optical axis); the x, y and z axes are mutually perpendicular, and the unit is a length unit.
10) Three-dimensional camera coordinate system: points in the three-dimensional camera coordinate system are transformed by perspective projection to obtain corresponding points in the two-dimensional image coordinate system. Perspective projection is a single-plane projection (similar to a shadow play), obtained by projecting a shape onto a projection surface using the central projection method, so as to achieve a visual effect closer to real vision. Perspective projection matches human visual habits: objects close to the viewpoint appear large, objects far from the viewpoint appear small, and parallel lines that are not parallel to the imaging plane intersect at a vanishing point.
11) Image physical coordinate system: the image physical coordinate system (also called the two-dimensional image coordinate system) takes the intersection point of the camera's optical axis with the physical imaging plane (also called the image plane) as the coordinate origin O'; the x' axis and y' axis are parallel to the two perpendicular sides of the image plane, and the unit is a length unit.
12) Image pixel coordinate system: the image pixel coordinate system (also called the pixel coordinate system) takes a vertex of the image as the coordinate origin Opixel; the u and v directions are parallel to the x' axis and y' axis, and the unit is the pixel. In practical applications, the image acquired by the camera is first formed as a standard electrical signal and then converted into a digital image by analog-to-digital conversion. Each image is stored as an M x N array, and the value of each element in the M-row, N-column image represents the gray level of that image point. Each such element is called a pixel, and the pixel coordinate system is the image coordinate system measured in pixels.
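To relate the coordinate systems in items 8) to 12), the following Python sketch (with assumed pinhole intrinsic parameters) projects a point from the camera coordinate system into the image pixel coordinate system and, given the depth, maps the pixel coordinate back; it is a minimal illustration and not part of the original disclosure.

```python
import numpy as np

# Assumed pinhole intrinsics: focal lengths fx, fy (in pixels) and principal point (cx, cy).
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Camera-frame point -> pixel coordinates (perspective projection).
p_cam = np.array([0.2, -0.1, 2.0])            # (x, y, z) in the camera coordinate system
u, v, w = K @ p_cam
pixel = np.array([u / w, v / w])              # (u, v) in the image pixel coordinate system

# Pixel coordinates plus known depth -> back to the camera coordinate system.
z = p_cam[2]
p_back = z * np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
print(pixel, p_back)                          # p_back recovers p_cam
```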
In the implementation of the embodiments of the present application, the applicant found that the related art has the following problems:
in the related art, the position and posture of a user are usually determined by repeatedly observing map features with an image acquisition device while moving from an unknown starting point in an unknown environment. Because scene depth is lost during capture, the real-world scale cannot be recovered and only relative coordinates can be obtained, so the image acquisition device cannot be used for real-world positioning, and the determined pose (position and posture) is inaccurate.
The embodiment of the application provides a pose acquisition method, a pose acquisition device, electronic equipment, a computer readable storage medium and a computer program product, which can effectively improve the efficiency and accuracy of acquiring the pose, and the following describes an exemplary application of the pose acquisition system.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of a pose acquisition system 100 provided in an embodiment of the present application, where a terminal (a terminal 400 is shown in an exemplary manner) is connected to a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal 400 is used by a user to run the client 410, which displays live-action images on a graphical interface 410-1 (the graphical interface 410-1 is shown as an example). The terminal 400 and the server 200 are connected to each other through a wired or wireless network.
In some embodiments, the server 200 may be a stand-alone physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart television, a smart watch, a car terminal, etc. The electronic device provided in the embodiment of the application may be implemented as a terminal or as a server. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application.
In some embodiments, the server 200 acquires a three-dimensional map image and a plurality of two-dimensional live-action images of the target space region, acquires the pose corresponding to each two-dimensional live-action image, selects the target voxel corresponding to each two-dimensional live-action image based on the pose, acquires the image features of each two-dimensional live-action image, and performs feature assignment on the target voxels based on the image features to obtain a voxel feature map; it then acquires the live-action image features of the live-action shot image captured by the image acquisition device, matches the live-action image features with the voxel feature map to obtain a matching result, determines the target pose of the image acquisition device in the target space region based on the matching result, and sends the target pose to the terminal 400 corresponding to the image acquisition device.
In other embodiments, the terminal 400 corresponding to the image acquisition device acquires a three-dimensional map image and a plurality of two-dimensional live-action images of the target space region, acquires the pose corresponding to each two-dimensional live-action image, selects the target voxel corresponding to each two-dimensional live-action image based on the pose, acquires the image features of each two-dimensional live-action image, and performs feature assignment on the target voxels based on the image features to obtain a voxel feature map; it then acquires the live-action image features of the live-action shot image captured by the image acquisition device, matches the live-action image features with the voxel feature map to obtain a matching result, and determines the target pose of the image acquisition device in the target space region based on the matching result.
In other embodiments, the embodiments of the present application may be implemented by means of Cloud Technology (Cloud Technology), which refers to a hosting Technology that unifies serial resources such as hardware, software, networks, etc. in a wide area network or a local area network, so as to implement calculation, storage, processing, and sharing of data.
Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology and the like based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support, because the background services of technical network systems require a large amount of computing and storage resources.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device 500 for acquiring a pose according to an embodiment of the present application, where the electronic device 500 shown in fig. 2 may be the server 200 or the terminal 400 in fig. 1, and the electronic device 500 shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420. The various components in electronic device 500 are coupled together by bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 440.
The processor 410 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (RAM, random Access Memory). The memory 450 described in the embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451, including system programs such as a framework layer, a core library layer and a driver layer, for handling various basic system services and performing hardware-related tasks;
a network communication module 452 for accessing other electronic devices via one or more (wired or wireless) network interfaces 420, the exemplary network interface 420 comprising: bluetooth, wireless compatibility authentication (WiFi, wireless Fidelity), and universal serial bus (USB, universal Serial Bus), etc.
In some embodiments, the pose acquisition device provided in the embodiments of the present application may be implemented in a software manner, and fig. 2 shows the pose acquisition device 455 stored in the memory 450, which may be software in the form of a program and a plug-in, and includes the following software modules: the acquisition module 4551, the selection module 4552, the assignment module 4553, the matching module 4554, the determination module 4555, which are logical, may be any combination or further split depending on the functions implemented. The functions of the respective modules will be described hereinafter.
In other embodiments, the pose acquisition device provided in the embodiments of the present application may be implemented in hardware, and as an example, the pose acquisition device provided in the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the pose acquisition method provided in the embodiments of the present application, for example, the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASIC, application Specific Integrated Circuit), DSP, programmable logic device (PLD, programmable Logic Device), complex programmable logic device (CPLD, complex Programmable Logic Device), field programmable gate array (FPGA, field-Programmable Gate Array), or other electronic components.
In some embodiments, the terminal or the server may implement the pose acquisition method provided in the embodiments of the present application by running a computer program or computer executable instructions. For example, the computer program may be a native program (e.g., a dedicated pose acquisition program) or a software module in an operating system, e.g., a pose acquisition module that may be embedded in any program (e.g., an instant messaging client, an album program, an electronic map client, a navigation client); for example, a Native Application (APP) may be used, i.e. a program that needs to be installed in an operating system to be run. In general, the computer programs described above may be any form of application, module or plug-in.
The pose acquisition method provided by the embodiment of the application will be described with reference to exemplary applications and implementations of the server or the terminal provided by the embodiment of the application.
Referring to fig. 3, fig. 3 is a schematic flow chart of a pose acquisition method provided in the embodiment of the present application, which will be described with reference to steps 101 to 107 shown in fig. 3, where the pose acquisition method provided in the embodiment of the present application may be implemented by a server or a terminal alone or implemented by the server and the terminal cooperatively, and will be described with reference to a server alone embodiment.
In step 101, a three-dimensional map image of a target space region and a plurality of two-dimensional live-action images of the target space region are acquired.
In some embodiments, the target spatial region: refers to a specific spatial region in the real world, for example, a spatial region corresponding to a certain market in a certain city, a spatial region corresponding to a certain scenic spot, and the like.
As an example, referring to fig. 8, fig. 8 is an effect schematic diagram of a target space area provided in the embodiment of the present application, where the target space area may be a space area in the real world such as "8000 service shops", "221 training classrooms F2 layer", "F2 eastern elevator halls F2 layer", "F2 western elevator halls F2 layer", "F2 southern elevator halls F2 layer", "233 conference rooms F2 layer", and the like in a certain market.
In some embodiments, a three-dimensional map image, also known as a three-dimensional electronic map, is a three-dimensional, abstract description of one or more aspects of the real world or a portion thereof, to a scale, based on a three-dimensional electronic map database. The network three-dimensional electronic map not only provides map searching functions such as map inquiry, travel navigation and the like for users through an intuitive geographical live-action simulation expression mode, but also integrates a series of services such as living information, electronic government affairs, electronic commerce, virtual communities, travel navigation and the like.
In some embodiments, a two-dimensional live-action image is used to reflect the live-action of the target spatial region at an angle from a two-dimensional perspective.
In some embodiments, the step 101 may be implemented as follows: three-dimensional map acquisition is carried out on the target space region through three-dimensional map acquisition equipment, so that a three-dimensional map image of the target space region is obtained; and carrying out multiple two-dimensional image acquisition on the target space region through the live-action acquisition equipment to obtain multiple two-dimensional live-action images of the target space region.
As an example, referring to fig. 9, fig. 9 is a schematic structural diagram of a three-dimensional map acquisition device and a live-action acquisition device provided in an embodiment of the present application, where three-dimensional map acquisition is performed on a target space region by using the three-dimensional map acquisition device 1, so as to obtain a three-dimensional map image of the target space region; and acquiring a plurality of two-dimensional images of the target space region through the live-action acquisition equipment 2 to obtain a plurality of two-dimensional live-action images of the target space region.
Therefore, the three-dimensional map image of the target space region and the plurality of two-dimensional live-action images of the target space region are acquired, so that the subsequent construction of the voxel characteristic map of the target space region based on the three-dimensional map image and the two-dimensional live-action images is facilitated, and reliable data guarantee is provided for the subsequent determination of the target pose.
In step 102, pose corresponding to each two-dimensional live-action image is acquired.
In some embodiments, the pose corresponding to the two-dimensional live-action image is used for indicating the target position and the pose angle in the target space area when the live-action acquisition device for acquiring the two-dimensional live-action image acquires the two-dimensional live-action image.
In some embodiments, the attitude angle includes a pitch angle, a yaw angle and a roll angle, and when the live-action acquisition device acquires the two-dimensional live-action image, different attitudes are adopted, the image content of the acquired two-dimensional live-action image is different, and the attitudes corresponding to the two-dimensional live-action images with different image contents are different.
In step 103, a target voxel corresponding to each two-dimensional live-action image is selected from a plurality of voxels of the three-dimensional map image based on the pose.
In some embodiments, a voxel is an abbreviation for Volume element (voxel Pixel), and a Volume containing the voxel may be represented by a Volume rendering or extraction of a polygonal isosurface for a given threshold contour. As one of the names, the minimum unit of digital data on three-dimensional space segmentation, and voxels are used in the fields of three-dimensional imaging, scientific data, medical images and the like. Conceptually, like the smallest unit in two-dimensional space, a pixel is used on the image data of a two-dimensional computer image.
In some embodiments, the number of pixels in the two-dimensional live-action image depends on the device parameters of the live-action acquisition device that captured the two-dimensional live-action image. The number of voxels in the three-dimensional map image depends on the device parameters of the three-dimensional map acquisition device that acquired the three-dimensional map image. The number of voxels in the three-dimensional map image is greater than the number of pixel points in the two-dimensional live-action image, each pixel point in the two-dimensional live-action image corresponds to a target voxel, and one target voxel corresponds to one pixel point in at least one two-dimensional live-action image.
In some embodiments, the content indicated by the target voxel in the target spatial region includes: and the content indicated by each pixel point corresponding to the target voxel in the target space region. The content indicated by the target voxel in the target spatial region may be a real object or the like that is actually present in the target spatial region.
In some embodiments, referring to fig. 4, fig. 4 is a flowchart of a pose acquisition method provided in an embodiment of the present application, and step 103 shown in fig. 3 may be implemented by executing steps 1031 to 1033 shown in fig. 4.
In step 1031, at least one ray corresponding to each two-dimensional live-action image is generated in the three-dimensional map image based on the pose corresponding to each two-dimensional live-action image.
In some embodiments, the pose is used to indicate a target position and a pose angle of a live-action acquisition device acquiring a two-dimensional live-action image in a target space region.
In some embodiments, the origin of the ray may be a target map coordinate in the three-dimensional map image corresponding to the target location.
As an example, referring to fig. 10, fig. 10 is a schematic diagram of a pose acquisition method provided in an embodiment of the present application. Based on the pose T corresponding to the two-dimensional live-action image, rays R1, R2 and R3 corresponding to the two-dimensional live-action image are generated in the three-dimensional map image.
In some embodiments, step 1031 may be implemented as follows: the following processing is respectively executed for each two-dimensional live-action image: determining target map coordinates corresponding to the target position in the three-dimensional map image based on the target position; determining a ray angle range of rays in the three-dimensional map image based on the attitude angle and the view angle of the live-action acquisition equipment; in the three-dimensional map image, at least one ray is generated within a ray angle range by taking the coordinates of the target map as the starting point of the ray.
In some embodiments, the perspective of the live-action acquisition device refers to the maximum angular range that the live-action acquisition device can acquire.
In some embodiments, the determining, based on the target position, the target map coordinates corresponding to the target position in the three-dimensional map image may be implemented as follows: and determining target map coordinates corresponding to the target positions in the three-dimensional map image based on the target positions of the live-action acquisition equipment for acquiring the two-dimensional live-action images in the target space region.
In some embodiments, the target position of the live-action acquisition device uses the image physical coordinate system corresponding to the target space area as a reference coordinate system, the target map coordinate uses the three-dimensional camera coordinate system corresponding to the three-dimensional map image as a reference coordinate system, and the coordinate corresponding to the target position (i.e. the target map coordinate) under the three-dimensional camera coordinate system can be obtained by performing coordinate system conversion on the image physical coordinate system corresponding to the target position.
In some embodiments, the determining, based on the target position, the target map coordinates corresponding to the target position in the three-dimensional map image may be implemented as follows: acquiring a position-coordinate mapping file, wherein the position-coordinate mapping file is used for recording the mapping relation between each position in a target space area and corresponding map coordinates in a three-dimensional map image, and the positions in the target space area correspond to the map coordinates in the three-dimensional map image one by one; and querying a target mapping relation comprising the target position in the position-coordinate mapping file, and determining map coordinates in the target mapping relation as target map coordinates.
In some embodiments, the location-coordinate mapping file is configured to record a mapping relationship between each location in the target space area and a corresponding map coordinate in the three-dimensional map image, where it is understood that the location in the target space area and the reference coordinate system corresponding to the map coordinate in the three-dimensional map image are different, the reference coordinate system corresponding to the location in the target space area is an image physical coordinate system of the target space area, the reference coordinate system corresponding to the map coordinate in the three-dimensional map image is a three-dimensional camera coordinate system of the three-dimensional map image, and the mapping relationship between each location in the target space area and the corresponding map coordinate in the three-dimensional map image recorded by the location-coordinate mapping file may indicate a conversion relationship between the image physical coordinate system and the three-dimensional camera coordinate system from a coordinate system perspective.
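As a minimal illustration of such a position-coordinate mapping file, the following Python sketch assumes a simple JSON layout (one record per position) and looks up the target map coordinates for a target position; the file format and all names are hypothetical and not specified by the embodiment.

```python
import json

# Hypothetical layout: each record maps a position in the target space region
# to its map coordinate in the three-dimensional map image.
mapping_file_text = json.dumps([
    {"position": [12.50, 3.20, 1.60], "map_coordinate": [125, 32, 16]},
    {"position": [14.00, 3.20, 1.60], "map_coordinate": [140, 32, 16]},
])

def lookup_target_map_coordinate(mapping_text, target_position, tol=1e-6):
    """Return the map coordinate recorded for the target position, if any."""
    for record in json.loads(mapping_text):
        if all(abs(a - b) <= tol for a, b in zip(record["position"], target_position)):
            return record["map_coordinate"]
    return None

print(lookup_target_map_coordinate(mapping_file_text, [12.50, 3.20, 1.60]))  # [125, 32, 16]
```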
In some embodiments, the determination of the ray angle range of the rays in the three-dimensional map image based on the attitude angle and the view angle of the live-action acquisition device may be implemented as follows: acquiring a reference view angle, where twice the size of the reference view angle is equal to the size of the view angle of the live-action acquisition device; subtracting the reference view angle from the attitude angle to obtain the minimum angle value of the ray angle range, and adding the reference view angle to the attitude angle to obtain the maximum angle value of the ray angle range; and determining the angle range between the minimum angle value and the maximum angle value as the ray angle range.
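A minimal Python sketch of this ray angle range computation follows, assuming angles expressed in degrees and a range centred on the attitude angle; the function name and parameters are illustrative only.

```python
def ray_angle_range(attitude_angle_deg, field_of_view_deg):
    """Ray angle range for one two-dimensional live-action image.

    The reference view angle is half the live-action acquisition device's field
    of view; the range is centred on the attitude angle (an assumption consistent
    with the description above).
    """
    reference_view_angle = field_of_view_deg / 2.0
    min_angle = attitude_angle_deg - reference_view_angle
    max_angle = attitude_angle_deg + reference_view_angle
    return min_angle, max_angle

print(ray_angle_range(attitude_angle_deg=30.0, field_of_view_deg=90.0))  # (-15.0, 75.0)
```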
In step 1032, each ray is acquired separately from at least one intersecting voxel of the three-dimensional map image.
As an example, referring to fig. 10, an intersecting voxel L1 of a ray R1 and a three-dimensional map image, intersecting voxels L2, L4 of a ray R2 and a three-dimensional map image, and an intersecting voxel L3 of a ray R3 and a three-dimensional map image are acquired.
In step 1033, for each ray, a distance between each intersecting voxel of the ray and the origin of the ray is determined, and the intersecting voxel with the smallest distance is determined as the target voxel.
As an example, referring to fig. 10, when the number of intersecting voxels of the ray is one, the intersecting voxel L1 of the ray R1 is determined as a target voxel for the ray R1. When the number of intersecting voxels of the ray is plural, distances between the intersecting voxels L2, L4 of the ray R2 and the starting point T of the ray are determined, respectively, and the intersecting voxel having the smallest distance is determined as the target voxel.
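The following Python sketch illustrates, under simplified assumptions, how the intersecting voxel nearest to the ray start point can be selected; treating a voxel as intersecting when the ray passes within half a voxel of its centre is a choice of this sketch, and a real implementation would typically use a voxel-grid traversal such as 3D DDA.

```python
import numpy as np

def nearest_intersecting_voxel(ray_origin, ray_direction, voxel_centers, voxel_size=1.0):
    """Pick, among the voxels a ray passes through, the one closest to the ray start."""
    d = ray_direction / np.linalg.norm(ray_direction)
    best_voxel, best_dist = None, np.inf
    for center in voxel_centers:
        to_center = center - ray_origin
        t = float(np.dot(to_center, d))           # distance along the ray to the closest approach
        if t < 0:
            continue                              # voxel lies behind the ray start point
        closest = ray_origin + t * d
        if np.linalg.norm(center - closest) <= voxel_size / 2 and t < best_dist:
            best_voxel, best_dist = center, t
    return best_voxel, best_dist

origin = np.zeros(3)
direction = np.array([1.0, 0.0, 0.0])
voxels = [np.array([3.0, 0.2, 0.0]), np.array([6.0, 0.0, 0.1])]
print(nearest_intersecting_voxel(origin, direction, voxels))  # the voxel at x = 3 is the target voxel
```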
In this way, at least one ray corresponding to each two-dimensional live-action image is generated in the three-dimensional map image based on the pose corresponding to each two-dimensional live-action image, and the intersecting voxel with the smallest distance between each intersecting voxel of the ray and the starting point of the ray is determined as the target voxel, so that subsequent assignment for the target voxel is facilitated, and the obtained voxel feature map is more accurate. By pointedly selecting part of voxels in the three-dimensional map image as target voxels, the subsequent assignment process does not need to assign all voxels in the three-dimensional map image, so that the calculation amount of an algorithm is effectively reduced, the operation efficiency is improved, the pose acquisition efficiency is effectively improved, and the pose is efficiently acquired.
In step 104, image features of each two-dimensional live-action image are acquired.
In some embodiments, the step 104 may be implemented as follows: and extracting image features of each two-dimensional live-action image to obtain the image features of each two-dimensional live-action image.
In some embodiments, the above image feature extraction may be implemented by an image coding model.
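As a stand-in for such an image coding model, the following PyTorch sketch shows a small convolutional encoder that maps a two-dimensional live-action image to a per-pixel feature map; the architecture and feature dimension are purely illustrative and are not specified by the embodiment.

```python
import torch
import torch.nn as nn

class TinyImageEncoder(nn.Module):
    """Minimal illustrative "image coding model": RGB image -> per-pixel features."""
    def __init__(self, feature_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, feature_dim, kernel_size=3, padding=1),
        )

    def forward(self, image):           # image: (B, 3, H, W)
        return self.net(image)          # features: (B, feature_dim, H, W)

encoder = TinyImageEncoder()
live_action_image = torch.rand(1, 3, 120, 160)       # dummy two-dimensional live-action image
pixel_features = encoder(live_action_image)
print(pixel_features.shape)                           # torch.Size([1, 32, 120, 160])
```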
In step 105, feature assignment is performed on each target voxel based on the image features, and a voxel feature map of the target spatial region is obtained.
In some embodiments, the above feature assignment refers to a process of assigning target voxels to corresponding features.
In some embodiments, the image features include pixel features of pixels in a two-dimensional live-action image.
In some embodiments, referring to fig. 5, fig. 5 is a flowchart illustrating a pose acquisition method provided in an embodiment of the present application, and step 105 illustrated in fig. 3 may be implemented by performing steps 1051 to 1053 illustrated in fig. 5.
In step 1051, at least one associated pixel associated with a corresponding target voxel is obtained from a plurality of pixels included in each two-dimensional live-action image.
In some embodiments, the number of pixels in the two-dimensional live-action image depends on the device parameters of the live-action acquisition device that captured the two-dimensional live-action image. The number of voxels in the three-dimensional map image depends on the device parameters of the three-dimensional map acquisition device that acquired the three-dimensional map image. The number of voxels in the three-dimensional map image is greater than the number of pixel points in the two-dimensional live-action image, each pixel point in the two-dimensional live-action image corresponds to one target voxel respectively, one target voxel corresponds to at least one associated pixel point, and when one target voxel corresponds to a plurality of associated pixel points, the plurality of associated pixel points corresponding to the target voxel can be from the same two-dimensional live-action image or from different two-dimensional live-action images.
In some embodiments, the content indicated by the target voxel in the target space region includes the content indicated by each associated pixel point corresponding to the target voxel in the target space region; both may be real scene content actually existing in the target space region.
As an example, the real content actually existing in the target space region may be an object, a person, or the like in the target space region.
In step 1052, the pixel characteristics of each associated pixel point are combined to determine the target voxel characteristics of the target voxels corresponding to each two-dimensional live-action image.
In some embodiments, the two-dimensional real-scene image includes a plurality of pixels, the plurality of pixels includes associated pixels, and the image feature of the two-dimensional real-scene image includes a pixel feature of the associated pixels, where the pixel feature of the associated pixels may be extracted from the image feature of the two-dimensional real-scene image.
In some embodiments, the target voxel characteristic of the target voxel may be determined by the pixel characteristics of each associated pixel point associated with the target voxel.
In some embodiments, step 1052 described above may be implemented as follows: the following processing is respectively executed for the target voxels corresponding to the two-dimensional live-action images: when the number of the associated pixel points is one, determining the pixel characteristics of the associated pixel points as target voxel characteristics of target voxels; and when the number of the associated pixel points is a plurality of, carrying out weighted summation on the pixel characteristics of each associated pixel point to obtain the target voxel characteristics of the target voxels.
In some embodiments, the above-mentioned weighted summation of the pixel characteristics of each associated pixel point, to obtain the target voxel characteristic of the target voxel, may be determined by: and obtaining the weight of the pixel characteristic of each associated pixel point, and carrying out weighted summation on the pixel characteristic of each associated pixel point according to the corresponding weight to obtain the target voxel characteristic of the target voxel.
In some embodiments, the weights of the pixel features of the associated pixel points may be determined as follows: the following processing is performed for each associated pixel point: acquiring a target distance between the ray corresponding to the associated pixel point and the center point of the corresponding target voxel, and determining the weight of the pixel feature of the associated pixel point based on the target distance, wherein the target distance is inversely proportional to the value of the weight, that is, the larger the target distance, the smaller the weight, and the smaller the target distance, the larger the weight.
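A minimal sketch (in Python, not part of the original disclosure) of the weighted summation in step 1052 follows; treating the weights as normalized inverse distances is an assumption for illustration, since the text only requires the weight to be inversely proportional to the target distance:

```python
import numpy as np

# Hedged sketch of step 1052: fuse the pixel features of the associated pixel
# points into one target voxel feature using inverse-distance weights.
def fuse_pixel_features(pixel_features: np.ndarray, target_distances: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # pixel_features: (K, D) pixel features of the K associated pixel points
    # target_distances: (K,) distance between each pixel's ray and the voxel center
    if len(pixel_features) == 1:
        return pixel_features[0]              # a single associated pixel: use its feature directly
    weights = 1.0 / (target_distances + eps)  # larger distance, smaller weight
    weights /= weights.sum()                  # normalize so the weights sum to 1
    return (weights[:, None] * pixel_features).sum(axis=0)
```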
In step 1053, in the three-dimensional map image, feature assignment is performed on each target voxel based on the target voxel feature of each target voxel, so as to obtain a voxel feature map of the target spatial region.
In some embodiments, step 1053 described above may be implemented as follows: in the three-dimensional map image, the following processing is performed for each target voxel: and acquiring initial voxel characteristics of the target voxels, and replacing the initial voxel characteristics of the target voxels with target voxel characteristics of the target voxels to obtain a voxel characteristic map of the target space region.
In some embodiments, the initial voxel characteristic of the target voxel may be determined during an image acquisition stage of the three-dimensional map image, and the initial voxel characteristic of the target voxel may be absent, i.e., the initial voxel characteristic of the target voxel may be null.
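As a minimal sketch (in Python, not part of the original disclosure) of step 1053, the voxel feature map is modeled below as a dictionary keyed by voxel index; this container choice and the function names are illustrative assumptions:

```python
# Hedged sketch of step 1053: replace the (possibly null) initial feature of
# each target voxel with its computed target voxel feature.
def assign_voxel_features(voxel_feature_map: dict, target_voxel_features: dict) -> dict:
    for voxel_index, feature in target_voxel_features.items():
        voxel_feature_map[voxel_index] = feature  # overwrite the initial voxel feature
    # Voxels that are not target voxels keep their initial (possibly null) features.
    return voxel_feature_map
```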
In this way, feature assignment is performed on each target voxel based on the image features to obtain the voxel feature map of the target space region, so that in the process of determining the voxel feature map, only part of the voxels of the three-dimensional map image (namely the target voxels) are assigned, and all voxels of the three-dimensional map image do not need to be assigned, which effectively reduces the calculation amount of feature assignment, improves the operation efficiency, and effectively improves the pose acquisition efficiency.
In step 106, live-action image features of the live-action photographic image acquired by the image acquisition device are acquired.
In some embodiments, the image capturing device may be a mobile terminal with image capturing capabilities (e.g., a mobile phone with a camera, etc.), or another dedicated image capturing device with display and communication capabilities, etc.
In some embodiments, in step 106, the image acquisition device may collect a live-action shooting image of the target space region and send the live-action shooting image to the server, so that the server obtains the live-action shooting image collected by the image acquisition device and performs image feature extraction on it to obtain the live-action image feature of the live-action shooting image.
As an example, referring to fig. 11, fig. 11 is a schematic view of a display interface of an image capturing apparatus provided in an embodiment of the present application. As shown in fig. 11, the image capturing device sends the live-action captured image to the server, so that the server obtains the live-action captured image captured by the image capturing device, and performs image feature extraction on the live-action captured image to obtain the live-action image feature of the live-action captured image, so as to determine the target pose of the image capturing device subsequently, and send the target pose to the image capturing device.
In step 107, the real-scene image feature and the voxel feature of the voxel feature map are matched to obtain a matching result.
In some embodiments, the matching refers to a process of matching the live-action image feature with the target voxel feature of each target voxel in the voxel feature map.
In some embodiments, the matching result is used to indicate a mapping relationship between the live-action image and the target voxel.
In some embodiments, the voxel features include target voxel features of each target voxel in the voxel feature map, referring to fig. 6, fig. 6 is a schematic flow chart of the pose acquisition method provided in the embodiments of the present application, and step 107 shown in fig. 3 may be implemented by executing steps 1071 to 1073 shown in fig. 6.
In step 1071, the features of the live-action image are respectively matched with the features of each target voxel, so as to obtain feature matching results corresponding to the features of each target voxel.
In some embodiments, the feature matching may be implemented by determining a feature distance between the live-action image feature and the target voxel feature, where the feature matching result is used to indicate whether the live-action image feature and the target voxel feature are successfully matched.
In some embodiments, step 1071 described above may be implemented by performing the following processing for each target voxel feature: determining feature distances between the target voxel features and the live-action image features; when the feature distance is greater than or equal to the distance threshold, determining a feature matching result corresponding to the target voxel feature as a first matching result, wherein the first matching result is used for indicating that the matching of the live-action image feature and the target voxel feature is successful; and when the feature distance is smaller than the distance threshold, determining a feature matching result corresponding to the target voxel feature as a second matching result, wherein the second matching result is used for indicating that the matching of the live-action image feature and the target voxel feature fails.
In some embodiments, the feature distance between the target voxel feature and the live-action image feature may be a Manhattan distance, a Euclidean distance, a Chebyshev distance, or the like; the specific form of the feature distance is not limited in the embodiments of the present application.
In some embodiments, the feature distance is used to indicate the similarity between the target voxel feature and the live-action image feature, and the similarity is proportional to the feature distance; that is, the greater the similarity between the target voxel feature and the live-action image feature, the greater the feature distance, and the lesser the similarity, the smaller the feature distance.
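A minimal sketch (in Python, not part of the original disclosure) of the matching in step 1071 follows; treating the feature distance as a cosine-style similarity score (larger means more similar, consistent with the convention above) is an assumption, since the embodiments do not fix the metric:

```python
import numpy as np

# Hedged sketch of step 1071: match the live-action image feature against each
# target voxel feature and keep the voxels whose score reaches the threshold.
def match_features(live_action_feature: np.ndarray, voxel_features: np.ndarray, distance_threshold: float) -> np.ndarray:
    # live_action_feature: (D,) feature of the live-action shooting image
    # voxel_features: (M, D) target voxel features from the voxel feature map
    q = live_action_feature / (np.linalg.norm(live_action_feature) + 1e-8)
    v = voxel_features / (np.linalg.norm(voxel_features, axis=1, keepdims=True) + 1e-8)
    scores = v @ q                                       # similarity-style "feature distance"
    return np.nonzero(scores >= distance_threshold)[0]   # indices of successfully matched target voxels
```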
In step 1072, when the feature matching result indicates that the matching between the live-action image feature and the target voxel feature is successful, the target voxels corresponding to the live-action image and the target voxel feature are combined into an image-voxel pair.
In some embodiments, a map-voxel pair is used to indicate the mapping between a live-action map and a target voxel. By determining the shooting map-voxel pair, the target pose of the image acquisition device in the target space area is conveniently determined based on the target voxels and the live-action shooting map in the shooting map-voxel pair.
In step 1073, each of the combined map-voxel pairs is determined as a matching result.
In some embodiments, the matching result includes at least one map-voxel pair, the matching result being indicative of a mapping relationship between the live-action map and the target voxel.
In this way, feature matching is carried out between the live-action image feature and each target voxel feature to obtain the feature matching result corresponding to each target voxel feature, so that the feature matching process does not need to match all voxels of the whole three-dimensional map image, which effectively reduces the calculation amount of feature matching, improves the operation efficiency, and effectively improves the pose acquisition efficiency.
In step 108, a target pose of the image acquisition device in the target spatial region is determined based on the matching result.
In some embodiments, the matching result includes at least one shot-voxel pair, the shot-voxel pair being used to indicate a mapping relationship between the live-action shot and the target voxel.
In some embodiments, a target pose of the image acquisition device in the target spatial region is used to indicate a position and pose of the image acquisition device in the target spatial region.
In some embodiments, referring to fig. 7, fig. 7 is a flowchart of a pose acquisition method provided in an embodiment of the present application, and step 108 shown in fig. 3 may be implemented by executing steps 1081 to 1082 shown in fig. 7.
In step 1081, a target map-voxel pair is selected from the at least one map-voxel pair.
In some embodiments, step 1081 described above may be implemented by: when the number of shot map-voxel pairs is one, determining the shot map-voxel pair as a target shot map-voxel pair; when the number of shot map-voxel pairs is plural, a target shot map-voxel pair is selected from the plural shot map-voxel pairs.
In some embodiments, selecting the target shot map-voxel pair from the plurality of shot map-voxel pairs may be achieved as follows: selecting candidate shot map-voxel pairs from the plurality of shot map-voxel pairs by a random sample consensus algorithm (RANdom SAmple Consensus, RANSAC); and selecting the target shot map-voxel pair from the candidate shot map-voxel pairs by a voting matching algorithm (Voting-Based Pose Estimation for Robotic Assembly Using a 3D Sensor).
In this way, the target shot map-voxel pair is selected from the at least one shot map-voxel pair, and when there are a plurality of shot map-voxel pairs, they are screened to obtain the target shot map-voxel pair. This effectively reduces the number of shot map-voxel pairs used to subsequently determine the target pose, thereby reducing the time for determining the target pose, improving the operation efficiency, and effectively improving the pose acquisition efficiency. Moreover, because the obtained target shot map-voxel pair has high validity, the accuracy of the target pose determined based on the target shot map-voxel pair is effectively improved.
In step 1082, a target pose of the image acquisition device in the target spatial region is determined based on the target shot map-voxel pair.
In some embodiments, the target shot map-voxel pair is the shot map-voxel pair with the highest matching degree among the at least one shot map-voxel pair, the matching degree being used to indicate how well the live-action shooting image matches the target voxel.
In some embodiments, the target pose of the image acquisition device in the target space region may be determined from the target voxel in the target shot map-voxel pair and the live-action shooting image in the target shot map-voxel pair.
In some embodiments, step 1082 described above may be implemented by: aiming at a target voxel in a target shooting image-voxel pair, acquiring a three-dimensional position coordinate of the target voxel in a three-dimensional map image, and acquiring at least one associated pixel point associated with the target voxel from a live-action shooting image; determining two-dimensional position coordinates of each associated pixel point in the live-action shooting picture respectively; and carrying out gesture prediction on the image acquisition equipment based on the three-dimensional position coordinates and the two-dimensional position coordinates to obtain the target pose of the image acquisition equipment in the target space region.
In some embodiments, the three-dimensional position coordinate of the target voxel in the three-dimensional map image is a coordinate position relative to the coordinate origin of the three-dimensional coordinate system corresponding to the three-dimensional map image; the two-dimensional position coordinates of the associated pixel points in the live-action shooting image are coordinate positions relative to the coordinate origin of the two-dimensional coordinate system corresponding to the live-action shooting image.
In some embodiments, the content indicated by the target voxel in the target spatial region includes: and the content indicated by each associated pixel point corresponding to the target voxel in the target space region.
In some embodiments, the at least one associated pixel point associated with the target voxel may come from different live-action shooting images. For example, the at least one associated pixel point associated with the target voxel includes associated pixel point 1 and associated pixel point 2, where associated pixel point 1 is a pixel point in live-action shooting image 1 and associated pixel point 2 is a pixel point in live-action shooting image 2.
In some embodiments, the above-mentioned pose prediction performed on the image acquisition device based on the three-dimensional position coordinates and the two-dimensional position coordinates, to obtain the target pose of the image acquisition device in the target space region, may be implemented in the following manner: taking the three-dimensional position coordinates and the two-dimensional position coordinates as inputs to a corresponding algorithm for three-dimensional pose estimation, such as a SolvePnP algorithm, to obtain the target pose of the image acquisition device in the target space region.
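A minimal sketch (in Python, not part of the original disclosure) using OpenCV's solvePnP as one possible SolvePnP algorithm is given below; the camera intrinsic matrix and distortion coefficients are assumed to come from the device parameters of the image acquisition device:

```python
import numpy as np
import cv2

# Hedged sketch of the pose prediction: recover the camera pose from the
# 3D position coordinates of target voxels and the 2D position coordinates
# of their associated pixel points.
def estimate_pose(points_3d: np.ndarray, points_2d: np.ndarray, camera_matrix: np.ndarray, dist_coeffs: np.ndarray):
    # points_3d: (N, 3) coordinates of target voxels in the three-dimensional map image
    # points_2d: (N, 2) coordinates of the associated pixel points in the live-action shooting image
    ok, rvec, tvec = cv2.solvePnP(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        camera_matrix.astype(np.float64),
        dist_coeffs,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    if not ok:
        return None
    rotation, _ = cv2.Rodrigues(rvec)  # rotation matrix of the target pose
    return rotation, tvec              # rotation and translation of the image acquisition device
```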
In practical implementation, the server may perform pose prediction on the image acquisition device by combining the three-dimensional position coordinates, the two-dimensional position coordinates, and the device parameters of the image acquisition device, so as to obtain the target pose of the image acquisition device in the target space region, namely the pose of the image acquisition device in the target space region. Here, the pose indicates the positions of the respective device points of the image acquisition device in the target space region. It should be noted that the pose may be represented by a rotation matrix and a translation vector, for all the device points, from the target space region to the three-dimensional map image.
In some embodiments, the above-described pose prediction is implemented by a pose prediction model that includes a parameter transformation layer and a pose estimation layer.
In some embodiments, the above-mentioned gesture prediction is performed on the image capturing device based on the three-dimensional position coordinates and the two-dimensional position coordinates, so as to obtain the target pose of the image capturing device in the target space region, which may be implemented in the following manner: calling a parameter transformation layer, and performing parameter transformation on the three-dimensional position coordinates and the two-dimensional position coordinates to obtain a parameter transformation matrix; and calling a pose estimation layer, and carrying out pose estimation on the image acquisition equipment based on the parameter transformation matrix to obtain the target pose of the image acquisition equipment in the target space region.
In some embodiments, the parameter transformation layer is configured to perform parameter transformation (matrix transformation) on the three-dimensional position coordinate and the two-dimensional position coordinate to obtain a parameter transformation matrix, where the parameter transformation matrix is used for pose estimation on the image acquisition device.
In some embodiments, the pose estimation layer may perform pose estimation on the image acquisition device by using a least square method to find an approximately optimal solution, so as to obtain a target pose of the image acquisition device in a target space area.
In this way, aiming at the target voxels in the target shooting image-voxel pair, the gesture prediction is carried out on the image acquisition device through the three-dimensional position coordinates of the target voxels in the three-dimensional map image and the two-dimensional position coordinates of each associated pixel point in the live-action image, so as to obtain the target gesture of the image acquisition device in the target space region.
In some embodiments, following step 108 shown in fig. 3, live-action navigation may be performed as follows: receiving a pose acquisition request sent by image acquisition equipment in the navigation process; responding to a pose acquisition request, and transmitting the pose of the target to image acquisition equipment; the image acquisition device is used for combining the target pose with the live-action shooting picture, and rendering to obtain a live-action navigation map of the target space region.
In some embodiments, referring to fig. 12, fig. 12 is an interface schematic diagram of an image capturing device provided in an embodiment of the present application. In a display interface of the image acquisition equipment, displaying a live-action shooting picture 21, wherein the content displayed in the live-action shooting picture 21 is the live-action content of a target space area (current floor: F2 target floor: F2) acquired by the image acquisition equipment; in response to a navigation triggering operation for a live-action shooting picture in a display interface of the image acquisition device, the image acquisition device sends a pose acquisition request to a server, the server receives the pose acquisition request sent by the image acquisition device in a navigation process and then sends a calculated target pose to the image acquisition device, the image acquisition device combines the target pose and the live-action shooting picture after receiving the target pose, renders a live-action navigation map 22 of a target space area, wherein the live-action navigation map 22 is obtained by rendering a corresponding navigation icon 221 at a corresponding position of the live-action shooting picture, and the navigation icon 221 is used for guiding a user using the image acquisition device to navigate.
In some embodiments, following step 108 shown in fig. 3, the rendering of the augmented reality picture may be performed as follows: receiving a pose acquisition request sent by image acquisition equipment in the navigation process; responding to a pose acquisition request, and transmitting the pose of the target to image acquisition equipment; the target pose is used for the image acquisition equipment to render and obtain an augmented reality picture of the target space region by combining the target pose and the live-action shooting picture.
In this way, feature assignment is performed on each target voxel in the three-dimensional map image to obtain a voxel feature map, the live-action image feature is matched with the voxel features of the voxel feature map, and the target pose is determined based on the matching result, so that dependence on the three-dimensional map image in the determination of the target pose is effectively reduced (dependence on the three-dimensional map image is converted into dependence on the voxel feature map). The accuracy of the three-dimensional map image often depends on the acquisition device that acquires it, and when the accuracy of the acquisition device is low, the three-dimensional map image cannot accurately reflect the features of the target space region; by determining the voxel feature map and using it for pose acquisition, the target pose can still be accurately acquired even when the accuracy of the three-dimensional map image is low, and the accuracy of the determined target pose is effectively ensured. In addition, by performing feature assignment on part of the voxels (the target voxels) of the three-dimensional map image rather than on all voxels based on the image features, the obtained voxel feature map is smaller, and the amount of matching computation in the process of matching with the voxel feature map is effectively reduced, so that the pose acquisition efficiency is effectively improved.
In the following, an exemplary application of the embodiments of the present application in an application scenario of actual live-action navigation will be described.
A real-scene map is a map in which real street scenes can be viewed. Secondary development is completed on an electronic map base-map framework according to customer requirements to realize position display and navigation of customer branches and outlets; the geographical position of a point can be quickly located and bus routes can be queried; and massive multimedia information can be integrated on map punctuations, so that a user can view the information integrated on a punctuation while searching for the specific geographical position of a branch institution or outlet.
The embodiment of the application provides a pose acquisition method that adds features to a laser point cloud map by using images to construct a laser voxel feature map, and performs visual positioning with this map. The method is mainly applied to mobile terminal camera positioning and is divided into two parts, laser point cloud feature assignment and visual positioning, and specifically includes the following steps. Image feature extraction: a deep neural network model is used to extract a descriptor feature map for each pixel of the image. Laser point cloud feature assignment: the pose of the image is used to bind the feature corresponding to each pixel of the image to the corresponding point cloud voxel. Visual and map matching positioning: each pixel of the query map is matched to a point cloud voxel, incorrect matches are filtered out by using geometric constraint relations, and finally the position and pose of the camera are calculated.
The embodiment of the application provides a novel algorithm for completing visual positioning by using a laser point cloud map. By directly using the high-precision laser point cloud map instead of recovering a visual feature point map, methods with low precision and complex algorithms, such as three-dimensional reconstruction and SLAM, are avoided. Meanwhile, to solve the problem of matching the pixel points of an image to the laser point cloud, a deep neural network is used to assign feature values to the laser point cloud. Finally, a camera pose recovery algorithm based on a voting algorithm is provided.
In some embodiments, referring to fig. 13, fig. 13 is a schematic flow chart of a pose acquisition method provided in an embodiment of the present application. The pose obtaining method provided in the embodiment of the present application may be implemented through steps 301 to 308 shown in fig. 13, and steps 301 to 308 will be described below with reference to fig. 13.
In step 301, radar data is acquired.
In some embodiments, the radar data may be obtained by radar acquisition of the target space region by a radar acquisition device.
In step 302, a point cloud map is constructed based on the acquired radar data.
In some embodiments, the laser point cloud map (i.e., the three-dimensional map image of the target space region described above) is constructed by using a laser SLAM algorithm, a FAST-LIO algorithm, or the like, and the laser point cloud map is saved after construction.
As an example, referring to fig. 14, fig. 14 is an effect schematic diagram of a point cloud map provided in the embodiment of the present application, and the point cloud map shown in fig. 14 is constructed based on the acquired radar data. A point cloud map, also known as a laser point cloud, is a collection of scanned points. The laser radar system scans the ground to obtain three-dimensional coordinates of ground reflection points, and each ground reflection point is distributed in a three-dimensional space in a point form according to the three-dimensional coordinates and is called a scanning point. The three-dimensional map image described above includes a laser point cloud.
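As a minimal sketch (in Python, not part of the original disclosure), a laser point cloud of the kind described above can be voxelized into a voxel structure, for example with the Open3D library; the library choice and the voxel size are illustrative assumptions:

```python
import numpy as np
import open3d as o3d

# Hedged sketch: convert scanned laser points (N, 3) into a voxel grid that
# can later carry the bound image features.
def voxelize_point_cloud(points: np.ndarray, voxel_size: float = 0.2) -> o3d.geometry.VoxelGrid:
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    return o3d.geometry.VoxelGrid.create_from_point_cloud(pcd, voxel_size=voxel_size)
```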
In step 303, first visual data is acquired.
In some embodiments, the first visual data may be acquired by a live-action acquisition device, where the first visual data is a plurality of two-dimensional live-action images of the target space region described above.
In step 304, image feature extraction is performed based on the first visual data.
In some embodiments, the image feature extraction is performed based on the first visual data, which means that the image feature corresponding to the first visual data is obtained by performing image feature extraction (image coding) on the first visual data.
In some embodiments, each piece of first visual data may be extracted, based on a deep-learning image coding network (CNN) such as S2DNet, into a D-dimensional feature map Fi ∈ R^(W×H×D). The feature map is L2-normalized along the channel dimension to improve generality, and its width and height are consistent with those of the image. After the image features corresponding to the first visual data are obtained, the image features are stored.
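A minimal sketch (in Python/PyTorch, not part of the original disclosure) of this per-pixel feature extraction follows; the encoder stands in for S2DNet, whose exact interface is not reproduced here:

```python
import torch
import torch.nn.functional as F

# Hedged sketch: a convolutional encoder maps an image to a D-dimensional
# feature map that is L2-normalized along the channel dimension.
def extract_dense_features(encoder: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    # image: (1, 3, H, W) tensor; encoder output assumed to be (1, D, H, W)
    with torch.no_grad():
        features = encoder(image)
    return F.normalize(features, p=2, dim=1)  # unit-length descriptor per pixel
```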
In step 305, a feature voxel map is determined based on the point cloud map and the image features.
In some embodiments, the above step 305 may be implemented as follows. Acquiring the pose of an image frame: the acquisition equipment is calibrated in advance, and the extrinsic parameters between the laser radar and the camera are used; the image frame pose (x, y, z, pitch angle, yaw angle, roll angle) is obtained by aligning the time stamp of the image frame with the laser point cloud acquired by the laser radar. Laser point cloud feature assignment: after the pose of the image frame is obtained, referring to fig. 15, fig. 15 is a schematic diagram of a pose acquisition method provided in an embodiment of the present application, a ray is led from the camera optical center o through an image pixel. To ensure sparsity, only n pixels of each image are projected, with n depending on the image resolution. The next step is to bind the feature value to the point cloud: all voxels through which the ray passes are obtained by using an octree, the voxel center point of the first intersected voxel is found, and the D-dimensional feature corresponding to the pixel is bound to that voxel. When different pixels correspond to the same voxel, weighted average processing is carried out; the weight of each ray is inversely proportional to the distance between the center point of the voxel and the intersection point of the ray with the voxel, so the closer the distance, the larger the weight, and the sum of the weights is 1.
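A minimal sketch (in Python, not part of the original disclosure) of this feature binding step follows; the octree lookup is abstracted as a callable because the concrete octree interface is not specified, and all names are illustrative:

```python
import numpy as np

# Hedged sketch of laser point cloud feature assignment: for a subset of pixels,
# cast a ray from the camera optical center, find the first intersected voxel,
# and accumulate the pixel's D-dimensional feature with an inverse-distance weight.
def bind_features_to_voxels(feature_map, camera_center, pixel_rays, first_hit_voxel,
                            voxel_features, voxel_weights, step=8):
    # feature_map: (H, W, D) per-pixel features; pixel_rays: (H, W, 3) unit ray directions
    # first_hit_voxel: callable(origin, direction) -> (voxel_id, distance_to_center) or None
    H, W, _ = feature_map.shape
    for v in range(0, H, step):
        for u in range(0, W, step):
            hit = first_hit_voxel(camera_center, pixel_rays[v, u])  # assumed octree query
            if hit is None:
                continue
            voxel_id, distance_to_center = hit
            w = 1.0 / (distance_to_center + 1e-6)  # closer intersection, larger weight
            voxel_features[voxel_id] = voxel_features.get(voxel_id, 0.0) + w * feature_map[v, u]
            voxel_weights[voxel_id] = voxel_weights.get(voxel_id, 0.0) + w
    return voxel_features, voxel_weights

# After all images are processed, dividing each accumulated feature by its
# accumulated weight yields the weighted average (the weights then sum to 1).
```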
In step 306, second visual data is acquired.
In some embodiments, the second visual data is a live view captured by the image capturing device as described above.
In step 307, image feature extraction is performed based on the second visual data.
In some embodiments, the process of image feature extraction in step 307 described above is similar to the process of image feature extraction in step 304 described above.
In step 308, a voting match optimization calculation positioning result is performed based on the image features and feature voxel map of the second visual data.
In some embodiments, the above step 308 may be implemented as follows. After the voxel feature map of a scene is established offline, at positioning time S2DNet is used to extract features from the query map, and the correspondence between the 2D features of the image and the 3D points at voxel centers in the map is established through feature value matching. Specifically, after the query map is subjected to feature extraction by S2DNet, a one-to-many matching relation is first obtained for each feature, and most of the wrong matching relations are filtered out through a series of geometric filters; a voting space is then constructed for the 2D-3D matching relations (if position prior information such as a GPS position exists, the voting space can be further constrained), a vote is cast on the corresponding area for each 2D-3D matching relation, similarly to Hough voting, and the area with the largest number of votes is finally considered to represent the real camera pose area. After enough 2D-3D matches are obtained, RANSAC is used to screen the plurality of 2D-3D matches to obtain candidate 2D-3D matches, 2D-3D voting matching and checking are carried out on the candidate 2D-3D matches to obtain the target 2D-3D matches (the inlier matching pairs), and a PnP algorithm based on the target 2D-3D matches calculates the pose of the camera (the pose of the camera shooting the query map relative to the origin of the map) with six degrees of freedom (x, y, z, pitch angle, yaw angle, roll angle). The final pose (namely the target pose described above) obtained through the RANSAC and PnP algorithms is sent to the client, so that the client renders a navigation picture based on the pose.
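A minimal sketch (in Python, not part of the original disclosure) of this online positioning flow is given below, with each stage abstracted as a callable; all names are illustrative and do not correspond to a specific library:

```python
# Hedged sketch of the online positioning flow: feature extraction, 2D-3D
# matching against the voxel feature map, voting-based filtering (optionally
# constrained by a position prior), and RANSAC + PnP pose computation.
def localize(query_image, extract_features, match_2d_3d, vote_filter, ransac_pnp, position_prior=None):
    features = extract_features(query_image)               # per-pixel descriptors of the query map
    candidate_matches = match_2d_3d(features)               # one-to-many 2D-3D matching relations
    target_matches = vote_filter(candidate_matches, prior=position_prior)  # inlier matching pairs
    return ransac_pnp(target_matches)                       # six-degree-of-freedom camera pose
```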
Thus, by the pose acquisition method provided in the embodiment of the application, the target pose can be acquired more rapidly and accurately. Moreover, for a camera, a pixel corresponds to a 3D point that is not an actual real-world point, so using voxels is more consistent with the situation of actually taking pictures. By using feature matching for positioning instead of matching against a triangulated visual feature point cloud, the accuracy and real-time requirements on pose calculation at the terminal are reduced, which makes the method more suitable for mobile terminal products with limited computing power, so that the pose acquisition efficiency is effectively improved.
In some embodiments, the feature value binding can be performed in a targeted manner: pixel points whose features are of better quality can be projected instead of being selected at fixed intervals, and a gravity-direction constraint can be added during voting so that the area where the query map is located can be determined quickly.
In this way, feature assignment is performed on each target voxel in the three-dimensional map image to obtain a voxel feature map, the live-action image feature is matched with the voxel features of the voxel feature map, and the target pose is determined based on the matching result, so that dependence on the three-dimensional map image in the determination of the target pose is effectively reduced (dependence on the three-dimensional map image is converted into dependence on the voxel feature map). The accuracy of the three-dimensional map image often depends on the acquisition device that acquires it, and when the accuracy of the acquisition device is low, the three-dimensional map image cannot accurately reflect the features of the target space region; by determining the voxel feature map and using it for pose acquisition, the target pose can still be accurately acquired even when the accuracy of the three-dimensional map image is low, and the accuracy of the determined target pose is effectively ensured. In addition, by performing feature assignment on part of the voxels (the target voxels) of the three-dimensional map image rather than on all voxels based on the image features, the obtained voxel feature map is smaller, and the amount of matching computation in the process of matching with the voxel feature map is effectively reduced, so that the pose acquisition efficiency is effectively improved.
It can be appreciated that in the embodiments of the present application, related data such as three-dimensional map images and two-dimensional live-action images are related, when the embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of related data needs to comply with related laws and regulations and standards of related countries and regions.
Continuing with the description below of an exemplary structure of the pose acquisition device 455 provided in embodiments of the present application implemented as a software module, in some embodiments, as shown in fig. 2, the software module stored in the pose acquisition device 455 of the memory 450 may include: the acquisition module 4551 is configured to acquire a three-dimensional map image of a target spatial region and a plurality of two-dimensional live-action images of the target spatial region; the selection module 4552 is configured to obtain poses corresponding to the two-dimensional live-action images respectively, and select target voxels corresponding to the two-dimensional live-action images respectively from a plurality of voxels of the three-dimensional map image based on the poses; the assignment module 4553 is configured to obtain image features of each two-dimensional live-action image, and perform feature assignment on each target voxel based on the image features to obtain a voxel feature map of the target spatial region; the matching module 4554 is configured to obtain a live-action image feature of the live-action shooting image acquired by the image acquisition device, and match the live-action image feature with a voxel feature of the voxel feature map to obtain a matching result; a determining module 4555 is configured to determine a target pose of the image capturing device in the target space region based on the matching result.
In some embodiments, the selecting module 4552 is further configured to generate, in the three-dimensional map image, at least one ray corresponding to each two-dimensional live-action image, based on the pose corresponding to each two-dimensional live-action image; respectively acquiring at least one intersecting voxel of each ray and the three-dimensional map image; for each ray, the distance between each intersecting voxel of the ray and the origin of the ray is determined, and the intersecting voxel with the smallest distance is determined as the target voxel.
In some embodiments, the pose is used for indicating a target position and a pose angle of a live-action acquisition device for acquiring a two-dimensional live-action image in a target space region; the selection module 4552 is further configured to perform the following processing for each two-dimensional live-action image: determining target map coordinates corresponding to the target position in the three-dimensional map image based on the target position; determining a ray angle range of rays in the three-dimensional map image based on the attitude angle and the view angle of the live-action acquisition equipment; in the three-dimensional map image, at least one ray is generated within a ray angle range by taking the coordinates of the target map as the starting point of the ray.
In some embodiments, the selecting module 4552 is further configured to obtain a position-coordinate mapping file, where the position-coordinate mapping file is used to record a mapping relationship between each position in the target space area and a corresponding map coordinate in the three-dimensional map image, and the positions in the target space area are in one-to-one correspondence with the map coordinates in the three-dimensional map image; and querying a target mapping relation comprising the target position in the position-coordinate mapping file, and determining map coordinates in the target mapping relation as target map coordinates.
In some embodiments, the selecting module 4552 is further configured to obtain a reference view angle, where twice the size of the reference view angle is equal to the size of the view angle of the live-action capturing device; subtracting the attitude angle from the reference view angle to obtain a minimum angle value of the ray angle range, and adding the attitude angle and the reference view angle to obtain a maximum angle value of the ray angle range; the range of angles between the minimum angle value and the maximum angle value is determined as the ray angle range.
In some embodiments, the image features include pixel features of pixels in the two-dimensional live-action image; the assignment module 4553 is further configured to obtain at least one associated pixel associated with the corresponding target voxel from a plurality of pixels included in each two-dimensional live-action image; combining the pixel characteristics of each associated pixel point to determine the target voxel characteristics of the target voxels corresponding to each two-dimensional live-action image; in the three-dimensional map image, based on the target voxel characteristics of each target voxel, respectively carrying out characteristic assignment on each target voxel to obtain a voxel characteristic map of the target space region.
In some embodiments, the assignment module 4553 is further configured to perform the following processing for each target voxel corresponding to each two-dimensional live-action image: when the number of the associated pixel points is one, determining the pixel characteristics of the associated pixel points as target voxel characteristics of target voxels; and when the number of the associated pixel points is a plurality of, carrying out weighted summation on the pixel characteristics of each associated pixel point to obtain the target voxel characteristics of the target voxels.
In some embodiments, the voxel features comprise target voxel features for each target voxel in a voxel feature map; the matching module 4554 is further configured to perform feature matching on the live-action image features and each target voxel feature, so as to obtain feature matching results corresponding to each target voxel feature; when the feature matching result indicates that the matching of the live-action image feature and the target voxel feature is successful, combining the live-action shooting image and the target voxel corresponding to the target voxel feature into a shooting image-voxel pair; and determining each combined shooting picture-voxel pair as a matching result.
In some embodiments, the matching module 4554 is further configured to perform the following processing for each target voxel feature: determining feature distances between the target voxel features and the live-action image features; when the feature distance is greater than or equal to the distance threshold, determining a feature matching result corresponding to the target voxel feature as a first matching result, wherein the first matching result is used for indicating that the matching of the live-action image feature and the target voxel feature is successful; and when the feature distance is smaller than the distance threshold, determining a feature matching result corresponding to the target voxel feature as a second matching result, wherein the second matching result is used for indicating that the matching of the live-action image feature and the target voxel feature fails.
In some embodiments, the matching result includes at least one shot map-voxel pair, the shot map-voxel pair being used to indicate a mapping relationship between the live-action shooting image and the target voxel; the determining module 4555 is further configured to select a target shot map-voxel pair from the at least one shot map-voxel pair, where the target shot map-voxel pair is the shot map-voxel pair with the highest matching degree among the at least one shot map-voxel pair, and the matching degree is used to indicate how well the live-action shooting image matches the target voxel; and determine, based on the target shot map-voxel pair, a target pose of the image acquisition device in the target space region.
In some embodiments, the determining module 4555 is further configured to obtain, for a target voxel in a target capturing map-voxel pair, three-dimensional position coordinates of the target voxel in the three-dimensional map image, and obtain, from the live-action capturing map, at least one associated pixel point associated with the target voxel; determining two-dimensional position coordinates of each associated pixel point in the live-action shooting picture respectively; and carrying out gesture prediction on the image acquisition equipment based on the three-dimensional position coordinates and the two-dimensional position coordinates to obtain the target pose of the image acquisition equipment in the target space region.
In some embodiments, the pose prediction is implemented by a pose prediction model comprising a parameter transformation layer and a pose estimation layer; the determining module 4555 is further configured to invoke a parameter transformation layer to perform parameter transformation on the three-dimensional position coordinate and the two-dimensional position coordinate to obtain a parameter transformation matrix; and calling a pose estimation layer, and carrying out pose estimation on the image acquisition equipment based on the parameter transformation matrix to obtain the target pose of the image acquisition equipment in the target space region.
In some embodiments, the pose acquisition device 455 further includes: the map module is used for receiving a pose acquisition request sent by the image acquisition equipment in the navigation process; responding to a pose acquisition request, and transmitting the pose of the target to image acquisition equipment; the image acquisition device is used for combining the target pose with the live-action shooting picture, and rendering to obtain a live-action navigation map of the target space region.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the pose acquisition method according to the embodiment of the application.
The present embodiments provide a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, cause the processor to perform the pose acquisition method provided by the embodiments of the present application, for example, the pose acquisition method as shown in fig. 3.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of electronic devices including one or any combination of the above-described memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, computer-executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the application has the following beneficial effects:
(1) By performing feature assignment on each target voxel in the three-dimensional map image to obtain a voxel feature map, matching the live-action image feature with the voxel features of the voxel feature map, and determining the target pose based on the matching result, dependence on the three-dimensional map image in the determination of the target pose is effectively reduced (dependence on the three-dimensional map image is converted into dependence on the voxel feature map). The accuracy of the three-dimensional map image often depends on the acquisition device that acquires it, and when the accuracy of the acquisition device is low, the three-dimensional map image cannot accurately reflect the features of the target space region; by determining the voxel feature map and using it for pose acquisition, the target pose can still be accurately acquired even when the accuracy of the three-dimensional map image is low, and the accuracy of the determined target pose is effectively ensured. In addition, by performing feature assignment on part of the voxels (the target voxels) of the three-dimensional map image rather than on all voxels based on the image features, the obtained voxel feature map is smaller, and the amount of matching computation in the process of matching with the voxel feature map is effectively reduced, so that the pose acquisition efficiency is effectively improved.
(2) By acquiring the three-dimensional map image of the target space region and a plurality of two-dimensional live-action images of the target space region, the method is convenient for constructing a voxel characteristic map of the target space region based on the three-dimensional map image and the two-dimensional live-action images, and provides reliable data guarantee for the determination of the subsequent target pose.
(3) At least one ray corresponding to each two-dimensional live-action image is generated in the three-dimensional map image based on the pose corresponding to each two-dimensional live-action image, and for each ray the intersecting voxel closest to the starting point of the ray is determined as the target voxel, which facilitates subsequent assignment for the target voxel and makes the obtained voxel feature map more accurate. By selecting only part of the voxels in the three-dimensional map image as target voxels in a targeted manner, the subsequent assignment process does not need to assign values to all voxels in the three-dimensional map image, so that the calculation amount of the algorithm is effectively reduced, the operation efficiency is improved, and the pose acquisition efficiency is effectively improved.
(4) Feature assignment is performed on each target voxel based on the image features to obtain the voxel feature map of the target space region, so that in the process of determining the voxel feature map, only part of the voxels of the three-dimensional map image (namely the target voxels) are assigned, and all voxels of the three-dimensional map image do not need to be assigned, which effectively reduces the calculation amount of feature assignment, improves the operation efficiency, and effectively improves the pose acquisition efficiency.
(5) Feature matching is carried out between the live-action image feature and each target voxel feature to obtain the feature matching result corresponding to each target voxel feature, so that the feature matching process does not need to match all voxels of the whole three-dimensional map image, which effectively reduces the calculation amount of feature matching, improves the operation efficiency, and effectively improves the pose acquisition efficiency.
(6) The target shooting image-voxel pair is selected from at least one shooting image-voxel pair, when the shooting image-voxel pairs are multiple, the shooting image-voxel pairs are screened, and the target shooting image-voxel pairs are obtained, so that the number of the shooting image-voxel pairs for determining the target pose subsequently is effectively reduced, the time for determining the target pose is effectively reduced, the operation efficiency is improved, and the pose obtaining efficiency is effectively improved.
(7) The target shooting image-voxel pair is selected from at least one shooting image-voxel pair, when the shooting image-voxel pairs are multiple, the shooting image-voxel pair is screened, the target shooting image-voxel pair is obtained, and the accuracy of the target pose determined based on the target shooting image-voxel pair is improved effectively due to the fact that the obtained target shooting image-voxel pair is high in effectiveness.
(8) For the target voxel in the target shooting image-voxel pair, pose prediction is performed on the image acquisition device through the three-dimensional position coordinates of the target voxel in the three-dimensional map image and the two-dimensional position coordinates of each associated pixel point in the live-action shooting image, to obtain the target pose of the image acquisition device in the target space region. Because the target shooting image-voxel pair can accurately reflect the mapping relation between the live-action shooting image and the three-dimensional map image, the target pose determined based on the target shooting image-voxel pair is more accurate.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and scope of the present application are intended to be included within the scope of the present application.

Claims (17)

1. A pose acquisition method, the method comprising:
acquiring a three-dimensional map image of a target space region and a plurality of two-dimensional live-action images of the target space region;
acquiring the pose corresponding to each two-dimensional live-action image respectively, and selecting a target voxel corresponding to each two-dimensional live-action image from a plurality of voxels of the three-dimensional map image based on the pose;
Acquiring image features of each two-dimensional live-action image, and carrying out feature assignment on each target voxel based on the image features to obtain a voxel feature map of the target space region;
acquiring the real scene image characteristics of the real scene shooting image acquired by the image acquisition equipment, and matching the real scene image characteristics with the voxel characteristics of the voxel characteristic map to obtain a matching result;
and determining the target pose of the image acquisition equipment in the target space area based on the matching result.
2. The method according to claim 1, wherein selecting, based on the pose, a target voxel corresponding to each of the two-dimensional live-action images from a plurality of voxels of the three-dimensional map image, includes:
generating at least one ray corresponding to each two-dimensional live-action image in the three-dimensional map image based on the pose corresponding to each two-dimensional live-action image;
respectively acquiring at least one intersection voxel of each ray and the three-dimensional map image;
for each of the rays, a distance between each of the intersecting voxels of the ray and a start point of the ray is determined, and an intersecting voxel at which the distance is smallest is determined as the target voxel.
3. The method of claim 2, wherein the pose is used for indicating a target position and a pose angle of a live-action acquisition device acquiring the two-dimensional live-action image in the target space region;
based on the pose corresponding to each two-dimensional live-action image, generating at least one ray corresponding to each two-dimensional live-action image in the three-dimensional map image, wherein the at least one ray comprises the following steps:
the following processing is respectively executed for each two-dimensional live-action image:
determining target map coordinates corresponding to the target position in the three-dimensional map image based on the target position;
determining a ray angle range of the ray in the three-dimensional map image based on the attitude angle and the view angle of the live-action acquisition equipment;
and generating, in the three-dimensional map image, the at least one ray within the ray angle range, with the target map coordinates as the starting point of the ray.
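A minimal sketch of the ray generation in claim 3, assuming for simplicity that the ray angle range is swept only in the horizontal plane of the map; the sampling density and all names are illustrative assumptions.

    import numpy as np

    def generate_rays(target_map_coords, min_angle_deg, max_angle_deg, num_rays=8):
        # Start every ray at the target map coordinates and spread the ray
        # directions evenly across the ray angle range (horizontal sweep only).
        origin = np.asarray(target_map_coords, dtype=float)
        angles = np.radians(np.linspace(min_angle_deg, max_angle_deg, num_rays))
        directions = np.stack([np.cos(angles), np.sin(angles), np.zeros_like(angles)], axis=1)
        return [(origin, d) for d in directions]

    # Example: eight rays fanned over a 90-degree range around the attitude angle.
    rays = generate_rays((12.0, 40.0, 1.5), min_angle_deg=-15.0, max_angle_deg=75.0)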
4. The method of claim 3, wherein the determining, in the three-dimensional map image, target map coordinates corresponding to the target location based on the target location comprises:
acquiring a position-coordinate mapping file, wherein the position-coordinate mapping file is used for recording mapping relations between each position in the target space area and the corresponding map coordinates in the three-dimensional map image, and the positions in the target space area correspond one-to-one to the map coordinates in the three-dimensional map image;
and querying the position-coordinate mapping file for a target mapping relation comprising the target position, and determining the map coordinates in the target mapping relation as the target map coordinates.
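The lookup in claim 4 is essentially a table query. A minimal sketch, assuming purely for illustration that the position-coordinate mapping file is stored as JSON keyed by a formatted position string:

    import json

    def lookup_target_map_coordinates(mapping_path, target_position):
        # The mapping file records a one-to-one relation between positions in the
        # target space area and map coordinates in the three-dimensional map image.
        with open(mapping_path, "r", encoding="utf-8") as f:
            position_to_coords = json.load(f)      # e.g. {"39.984200,116.307500": [x, y, z], ...}
        key = "{:.6f},{:.6f}".format(*target_position)
        return position_to_coords.get(key)         # None if no mapping relation is recorded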
5. A method according to claim 3, wherein said determining a range of ray angles for the ray in the three-dimensional map image based on the pose angle and the perspective of the live-action acquisition device comprises:
obtaining a reference view angle, wherein twice the reference view angle is equal to the view angle of the live-action acquisition equipment;
subtracting the reference view angle from the attitude angle to obtain a minimum angle value of the ray angle range, and adding the reference view angle to the attitude angle to obtain a maximum angle value of the ray angle range;
and determining an angle range between the minimum angle value and the maximum angle value as the ray angle range.
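Claim 5 reduces to simple arithmetic once the reference view angle (half of the acquisition equipment's view angle) is known. A worked sketch, reading the range as centred on the attitude angle:

    def ray_angle_range(attitude_angle_deg, view_angle_deg):
        # Twice the reference view angle equals the view angle of the acquisition equipment.
        reference = view_angle_deg / 2.0
        return attitude_angle_deg - reference, attitude_angle_deg + reference

    # Example: a 90-degree field of view centred on a 30-degree attitude angle
    # yields a ray angle range of [-15, 75] degrees.
    assert ray_angle_range(30.0, 90.0) == (-15.0, 75.0)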
6. The method of claim 1, wherein the image features comprise pixel features of pixels in the two-dimensional live-action image;
and performing feature assignment on each target voxel based on the image features to obtain a voxel feature map of the target space region, wherein the feature assignment comprises the following steps:
acquiring at least one associated pixel point associated with a corresponding target voxel from a plurality of pixel points included in each two-dimensional live-action image;
combining the pixel features of each associated pixel point to determine the target voxel feature of the target voxel corresponding to each two-dimensional live-action image;
and in the three-dimensional map image, respectively carrying out feature assignment on each target voxel based on the target voxel feature of each target voxel to obtain a voxel feature map of the target space region.
7. The method of claim 6, wherein the determining, in combination with the pixel features of each of the associated pixel points, of the target voxel feature of the target voxel corresponding to each of the two-dimensional live-action images comprises:
and respectively executing the following processing for target voxels corresponding to the two-dimensional live-action images:
when the number of the associated pixel points is one, determining the pixel feature of the associated pixel point as the target voxel feature of the target voxel;
and when the number of the associated pixel points is multiple, carrying out weighted summation on the pixel features of each associated pixel point to obtain the target voxel feature of the target voxel.
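A minimal sketch of the aggregation in claims 6 and 7, assuming each pixel feature is a fixed-length vector; the uniform default weights are an assumption, since the claims leave the weighting itself open:

    import numpy as np

    def target_voxel_feature(associated_pixel_features, weights=None):
        # One associated pixel: its feature is used directly as the target voxel feature.
        # Several associated pixels: their features are merged by weighted summation.
        features = np.asarray(associated_pixel_features, dtype=float)
        if features.shape[0] == 1:
            return features[0]
        if weights is None:
            weights = np.full(features.shape[0], 1.0 / features.shape[0])
        weights = np.asarray(weights, dtype=float)
        return (weights[:, None] * features).sum(axis=0)

    # Example: three associated pixels with 4-dimensional pixel features.
    feature = target_voxel_feature(np.random.rand(3, 4))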
8. The method of claim 1, wherein the voxel features comprise a target voxel feature of each of the target voxels in the voxel feature map; and the step of matching the live-action image features with the voxel features of the voxel feature map to obtain a matching result comprises:
respectively carrying out feature matching on the live-action image features and the target voxel features to obtain feature matching results corresponding to the target voxel features;
when the feature matching result indicates that the live-action image features and the target voxel feature are successfully matched, combining the live-action shooting image and the target voxel corresponding to the target voxel feature into a shooting image-voxel pair;
and determining each combined shooting image-voxel pair as the matching result.
9. The method according to claim 8, wherein the performing feature matching on the live-action image features and the target voxel features to obtain feature matching results corresponding to the target voxel features includes:
the following processing is performed for each of the target voxel features:
determining feature distances between the target voxel features and the live-action image features;
when the feature distance is greater than or equal to a distance threshold, determining a feature matching result corresponding to the target voxel feature as a first matching result, wherein the first matching result is used for indicating that the live-action image features and the target voxel feature are successfully matched;
and when the feature distance is smaller than the distance threshold, determining a feature matching result corresponding to the target voxel feature as a second matching result, wherein the second matching result is used for indicating that the matching of the live-action image feature and the target voxel feature fails.
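A minimal sketch of claims 8 and 9. The claims treat a feature distance at or above the threshold as a successful match, which is read here as a similarity-style score; cosine similarity and all identifiers below are assumptions made purely for illustration:

    import numpy as np

    def build_shooting_image_voxel_pairs(live_action_feature, target_voxel_features, threshold=0.8):
        # Compare the live-action image feature with every target voxel feature and
        # collect a shooting image-voxel pair for each successful match.
        pairs = []
        q = np.asarray(live_action_feature, dtype=float)
        q = q / np.linalg.norm(q)
        for voxel_id, feature in target_voxel_features.items():
            f = np.asarray(feature, dtype=float)
            score = float(np.dot(q, f / np.linalg.norm(f)))
            if score >= threshold:          # first matching result: match succeeded
                pairs.append((voxel_id, score))
        return pairs                        # the combined pairs constitute the matching result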
10. The method of claim 1, wherein the matching result comprises at least one shooting image-voxel pair, the shooting image-voxel pair being used for indicating a mapping relationship between the live-action shooting image and the target voxel;
the determining, based on the matching result, a target pose of the image acquisition device in the target space region includes:
selecting a target shooting image-voxel pair from the at least one shooting image-voxel pair;
wherein the target shooting image-voxel pair is the shooting image-voxel pair with the highest matching degree among the at least one shooting image-voxel pair, the matching degree indicating how well the live-action shooting image matches the target voxel;
and determining the target pose of the image acquisition device in the target space region based on the target shooting image-voxel pair.
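Continuing the illustrative sketch above, picking the target shooting image-voxel pair of claim 10 amounts to keeping the pair with the highest matching degree:

    def select_target_pair(pairs):
        # Each pair carries (voxel_id, matching_degree); keep the best-matching one.
        return max(pairs, key=lambda pair: pair[1]) if pairs else None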
11. The method of claim 10, wherein the determining the target pose of the image acquisition device in the target space region based on the target shooting image-voxel pair comprises:
for the target voxel in the target shooting image-voxel pair, acquiring three-dimensional position coordinates of the target voxel in the three-dimensional map image, and acquiring at least one associated pixel point associated with the target voxel from the live-action shooting image;
determining two-dimensional position coordinates of each associated pixel point in the live-action shooting image respectively;
and carrying out pose prediction on the image acquisition device based on the three-dimensional position coordinates and the two-dimensional position coordinates to obtain the target pose of the image acquisition device in the target space region.
12. The method of claim 11, wherein the pose prediction is implemented by a pose prediction model comprising a parameter transformation layer and a pose estimation layer;
the step of predicting the pose of the image acquisition device based on the three-dimensional position coordinates and the two-dimensional position coordinates to obtain the target pose of the image acquisition device in the target space region comprises the following steps:
calling the parameter transformation layer, and performing parameter transformation on the three-dimensional position coordinates and the two-dimensional position coordinates to obtain a parameter transformation matrix;
and calling the pose estimation layer, and carrying out pose estimation on the image acquisition equipment based on the parameter transformation matrix to obtain the target pose of the image acquisition equipment in the target space region.
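Claims 11 and 12 recite a learned pose prediction model (a parameter transformation layer followed by a pose estimation layer); that model is not reproduced here. As a rough illustration of how a target pose can be recovered from the same kind of inputs, the sketch below applies a conventional PnP solve over three-dimensional position coordinates and two-dimensional position coordinates, assuming at least four correspondences and a known camera matrix:

    import numpy as np
    import cv2

    def estimate_pose_from_correspondences(points_3d, points_2d, camera_matrix):
        # Conventional stand-in for the claimed pose prediction model: solve the
        # perspective-n-point problem from 3D map coordinates and 2D pixel coordinates.
        object_points = np.asarray(points_3d, dtype=np.float64).reshape(-1, 3)
        image_points = np.asarray(points_2d, dtype=np.float64).reshape(-1, 2)
        ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, None)
        if not ok:
            return None
        rotation, _ = cv2.Rodrigues(rvec)   # orientation of the image acquisition device
        return rotation, tvec               # rotation matrix and translation (target pose)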
13. The method of claim 1, wherein after the determining of the target pose of the image acquisition device in the target space region based on the matching result, the method further comprises:
receiving a pose acquisition request sent by the image acquisition equipment in the navigation process;
responding to the pose acquisition request, and sending the target pose to the image acquisition equipment;
wherein the target pose is used by the image acquisition equipment to render a live-action navigation map of the target space area in combination with the live-action shooting image.
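The interaction in claim 13 is a plain request-response exchange. A minimal sketch, with the request fields and the pose store assumed purely for illustration:

    def handle_pose_acquisition_request(request, pose_store):
        # Look up the target pose computed for the requesting image acquisition
        # equipment and send it back so the device can render its live-action
        # navigation map together with the live-action shooting image.
        device_id = request["device_id"]              # assumed request field
        return {"device_id": device_id, "target_pose": pose_store.get(device_id)}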
14. A pose acquisition device, the device comprising:
the acquisition module is used for acquiring a three-dimensional map image of a target space area and a plurality of two-dimensional live-action images of the target space area;
the selection module is used for acquiring the pose corresponding to each two-dimensional live-action image respectively, and selecting a target voxel corresponding to each two-dimensional live-action image from a plurality of voxels of the three-dimensional map image based on the pose;
the assignment module is used for acquiring image features of each two-dimensional live-action image, and carrying out feature assignment on each target voxel based on the image features to obtain a voxel feature map of the target space region;
the matching module is used for acquiring live-action image features of a live-action shooting image acquired by the image acquisition equipment, and matching the live-action image features with voxel features of the voxel feature map to obtain a matching result;
and the determining module is used for determining the target pose of the image acquisition equipment in the target space area based on the matching result.
15. An electronic device, the electronic device comprising:
a memory for storing computer executable instructions or computer programs;
a processor for implementing the pose acquisition method according to any of claims 1 to 13 when executing computer executable instructions or computer programs stored in the memory.
16. A computer-readable storage medium storing computer-executable instructions, which when executed by a processor implement the pose acquisition method according to any one of claims 1 to 13.
17. A computer program product comprising a computer program or computer executable instructions which, when executed by a processor, implement the pose acquisition method according to any of claims 1 to 13.
CN202310146798.XA 2023-02-08 2023-02-08 Pose acquisition method, pose acquisition device, electronic equipment, storage medium and program product Pending CN116503474A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310146798.XA CN116503474A (en) 2023-02-08 2023-02-08 Pose acquisition method, pose acquisition device, electronic equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310146798.XA CN116503474A (en) 2023-02-08 2023-02-08 Pose acquisition method, pose acquisition device, electronic equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN116503474A true CN116503474A (en) 2023-07-28

Family

ID=87322002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310146798.XA Pending CN116503474A (en) 2023-02-08 2023-02-08 Pose acquisition method, pose acquisition device, electronic equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN116503474A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116772887A (en) * 2023-08-25 2023-09-19 北京斯年智驾科技有限公司 Vehicle course initialization method, system, device and readable storage medium
CN116772887B (en) * 2023-08-25 2023-11-14 北京斯年智驾科技有限公司 Vehicle course initialization method, system, device and readable storage medium
CN118657834A (en) * 2024-08-16 2024-09-17 中国科学院空天信息创新研究院 Visual positioning method, device and system

Similar Documents

Publication Publication Date Title
US11887247B2 (en) Visual localization
WO2021175050A1 (en) Three-dimensional reconstruction method and three-dimensional reconstruction device
US10186083B1 (en) Method and system for navigating in panoramic images using voxel maps
EP3274964B1 (en) Automatic connection of images using visual features
CN116503474A (en) Pose acquisition method, pose acquisition device, electronic equipment, storage medium and program product
CN112750203B (en) Model reconstruction method, device, equipment and storage medium
US10991160B1 (en) Depth hull for rendering three-dimensional models
CN114219855A (en) Point cloud normal vector estimation method and device, computer equipment and storage medium
CN113284237A (en) Three-dimensional reconstruction method, system, electronic equipment and storage medium
US11138789B1 (en) Enhanced point cloud for three-dimensional models
Huitl et al. Virtual reference view generation for CBIR-based visual pose estimation
Cui et al. Fusing surveillance videos and three‐dimensional scene: A mixed reality system
US20240135623A1 (en) Differentiable real-time radiance field rendering for large scale view synthesis
WO2023231793A1 (en) Method for virtualizing physical scene, and electronic device, computer-readable storage medium and computer program product
Englert et al. Enhancing the ar experience with machine learning services
CN115393423A (en) Target detection method and device
Kim et al. Automatic 3D city modeling using a digital map and panoramic images from a mobile mapping system
CN113191462A (en) Information acquisition method, image processing method and device and electronic equipment
CN117197325A (en) Image re-rendering method, device, electronic equipment, storage medium and program product
CN116958251A (en) Visual positioning method, visual positioning device, electronic equipment and storage medium
Sümer et al. Automatic near-photorealistic 3-D modelling and texture mapping for rectilinear buildings
Maiwald et al. Toward an Automated Pipeline for a Browser-Based, City-Scale Mobile 4D VR Application Based on Historical Images
CN117475058A (en) Image processing method, device, electronic equipment and storage medium
CN118533145A (en) Data acquisition method, device and storage medium
Younes et al. Incremental Plane-based Reconstruction and Exploration of Buildings in old Postcards.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40090972
Country of ref document: HK