CN115170914A - Pose estimation method and device, electronic equipment and storage medium - Google Patents

Pose estimation method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN115170914A
Authority
CN
China
Prior art keywords
data
inertial
image
features
image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210796618.8A
Other languages
Chinese (zh)
Inventor
潘维
袁炜灯
吴轲
郭约法
陈振良
张熙
汪万伟
柏东辉
卢旺
吴栋
钟志明
李祺威
陈浩良
胡晓军
徐云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd, Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd filed Critical Guangdong Power Grid Co Ltd
Priority to CN202210796618.8A priority Critical patent/CN115170914A/en
Publication of CN115170914A publication Critical patent/CN115170914A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a pose estimation method and device, an electronic device and a storage medium. The method comprises the following steps: acquiring image data and inertial data of the environment in which a mobile device is located at a target moment; performing feature extraction on the image data and the inertial data respectively to obtain image features of the image data and inertial features of the inertial data; determining fusion features according to the image features and the inertial features; and determining the pose of the mobile device at the target moment according to the fusion features. According to the scheme of the embodiment of the invention, the poses of the mobile device at different moments can be determined accurately.

Description

Pose estimation method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a pose estimation method and device, electronic equipment and a storage medium.
Background
Simultaneous Localization and Mapping (SLAM) is mainly used for the autonomous localization and mapping of mobile devices such as unmanned aerial vehicles and unmanned vehicles. Specifically, a mobile device in an unknown environment starts moving from an unknown position, estimates its own pose from the data returned by its sensors while moving, and builds a map of the environment. A SLAM system is divided into four parts: front-end odometry, loop-closure detection, back-end optimization, and map construction. Odometry (i.e., pose estimation of the mobile device), as the front end of SLAM, estimates the pose of the device. A good odometry technique provides high-quality initial values for the SLAM back end and for global map construction, so that the mobile device can act autonomously and accurately to perform various tasks in a complex unknown environment.
How to accurately determine the poses of the mobile device at different moments is therefore a key problem for those skilled in the art.
Disclosure of Invention
The embodiments of the invention provide a pose estimation method and device, an electronic device and a storage medium, which are used to accurately determine the poses of a mobile device at different moments.
According to an aspect of the embodiments of the present invention, there is provided a pose estimation method, including:
acquiring image data and inertial data of an environment where the mobile equipment is located at a target moment;
respectively extracting the features of the image data and the inertial data to obtain the image features of the image data and the inertial features of the inertial data;
determining a fusion characteristic according to the image characteristic and the inertia characteristic;
and determining the pose of the mobile equipment at the target moment according to the fusion features.
According to another aspect of the embodiments of the present invention, there is provided a pose estimation apparatus including:
the data acquisition module is used for acquiring image data and inertial data of the environment of the mobile equipment at the target moment;
the characteristic extraction module is used for respectively extracting the characteristics of the image data and the inertial data to obtain the image characteristics of the image data and the inertial characteristics of the inertial data;
a fusion feature determination module for determining a fusion feature according to the image feature and the inertial feature;
and the pose determining module is used for determining the pose of the mobile equipment at the target moment according to the fusion characteristics.
According to another aspect of the embodiments of the present invention, there is provided an electronic apparatus, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the pose estimation method according to any one of the embodiments of the present invention.
According to another aspect of the embodiments of the present invention, there is provided a computer-readable storage medium storing computer instructions for enabling a processor to implement the pose estimation method according to any one of the embodiments of the present invention when executed.
According to the technical scheme of the embodiments of the invention, image data and inertial data of the environment in which the mobile device is located at the target moment are obtained; feature extraction is performed on the image data and the inertial data respectively to obtain image features of the image data and inertial features of the inertial data; fusion features are determined according to the image features and the inertial features; and the pose of the mobile device at the target moment is determined according to the fusion features, so that the poses of the mobile device at different moments can be determined accurately.
It should be understood that the statements in this section do not necessarily identify key or critical features of any embodiments of the present invention, nor limit the scope of any embodiments of the present invention. Other features of embodiments of the present invention will become apparent from the following description.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart of a pose estimation method according to an embodiment of the present invention;
fig. 2 is a flowchart of a pose estimation method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a convolutional neural network for extracting image features according to a second embodiment of the present invention;
fig. 4 is a flowchart of a pose estimation method according to a second embodiment of the present invention;
fig. 5 is a schematic structural diagram of a pose estimation apparatus according to a third embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device that implements the pose estimation method according to the embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the embodiments of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art without creative effort based on the embodiments of the present invention shall fall within the protection scope of the embodiments of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the embodiments of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a pose estimation method according to an embodiment of the present invention. This embodiment is applicable to estimating the pose of a mobile device such as an unmanned aerial vehicle. The method may be executed by a pose estimation apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device; in this embodiment the electronic device may be a computer, a server, a vehicle-mounted device, or a tablet computer. Referring to fig. 1, the method includes the following steps:
Step 110, acquiring image data and inertial data of the environment in which the mobile device is located at the target moment.
The mobile device may be an unmanned aerial vehicle, an unmanned vehicle, a robot, or the like, which is not limited in this embodiment. The target moment may be any moment, for example 10:28:30 am or 4:38:50 pm, which is not limited in this embodiment. The environment in which the mobile device is located at the target moment is the place where the mobile device is at that moment, for example on a square, in the air, or in a field, which is not limited in this embodiment.
It is understood that a plurality of image sensors, inertial sensors, velocity sensors, temperature sensors, or the like may be simultaneously mounted on the mobile device.
In an optional implementation of this embodiment, the image data of the environment in which the mobile device is located at the target moment may be obtained by an image sensor installed on the mobile device, and the inertial data of the mobile device at the target moment may be obtained by an inertial sensor installed on the mobile device. In this embodiment, the inertial data may be angular velocity data, acceleration data, or both, which is not limited in this embodiment.
Step 120, performing feature extraction on the image data and the inertial data respectively to obtain image features of the image data and inertial features of the inertial data.
In an optional implementation manner of this embodiment, after the image data of the environment where the mobile device is located at the target time and the inertial data at the target time are acquired, feature extraction may be performed on the image data and the inertial data, respectively, so as to obtain an image feature of the image data and an inertial feature of the inertial data.
In a specific example of this embodiment, after the image data of the environment in which the mobile device is located at the target moment is acquired, the image data may be preprocessed and resized, for example to 640 × 192 pixels or 512 × 512 pixels (the size is not limited in this embodiment), and the preprocessed image data is then input into a feature extraction network to obtain the image features of the image data.
Furthermore, the acquired inertial data, including the angular velocity data and the acceleration data, can be input into a one-dimensional feature extraction network to obtain the inertial features of the inertial data. It can be understood that, in this embodiment, since the image data is two-dimensional, the extracted image features are also two-dimensional, while the inertial data is one-dimensional, so the extracted inertial features are one-dimensional; the image features and the inertial features in this embodiment therefore have different dimensions.
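For illustration only, the following minimal sketch (assuming a PyTorch implementation; the layer shapes and window length are illustrative, not prescribed by this embodiment) shows the dimensional difference between the two feature extraction paths:

```python
import torch
import torch.nn as nn

# 2-D convolution over the preprocessed image (e.g. 640 x 192 pixels, 3 channels)
image = torch.randn(1, 3, 192, 640)
image_features = nn.Conv2d(3, 64, kernel_size=3, padding=1)(image)   # (1, 64, 192, 640)

# 1-D convolution over a short window of inertial samples
# (6 channels: 3-axis angular velocity + 3-axis acceleration, 11 time steps)
imu = torch.randn(1, 6, 11)
inertial_features = nn.Conv1d(6, 64, kernel_size=3, padding=1)(imu)  # (1, 64, 11)
```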
Step 130, determining fusion features according to the image features and the inertial features.
In an optional implementation manner of this embodiment, after the image feature of the image data and the inertial feature of the inertial data are extracted, the fusion feature may be further determined according to the extracted image feature and the inertial feature.
In an optional implementation manner of this embodiment, after the image features of the image data and the inertial features of the inertial data are extracted, the image features and the inertial features of different dimensions may be fused, so as to obtain a fused feature.
Step 140, determining the pose of the mobile device at the target moment according to the fusion features.
In an optional implementation manner of this embodiment, after the fusion feature is determined and obtained according to the image feature and the inertial feature, the pose of the mobile device at the target time may be further determined and obtained through the fusion feature; wherein, the pose of the mobile device at the target moment may include: position information, roll angle, pitch angle, and the like, which are not limited in this embodiment.
In an optional implementation manner of this embodiment, after the fusion features are obtained, they may be input into a fully connected layer for processing, so as to obtain the pose of the mobile device at the target moment.
According to the scheme of the embodiment, image data and inertial data of the environment of the mobile equipment at the target moment are acquired; respectively extracting the features of the image data and the inertial data to obtain the image features of the image data and the inertial features of the inertial data; determining fusion characteristics according to the image characteristics and the inertia characteristics; and determining the pose of the mobile equipment at the target moment according to the fusion characteristics, so that the poses of the mobile equipment at different moments can be accurately determined.
Example two
Fig. 2 is a flowchart of a pose estimation method according to the second embodiment of the present invention. This embodiment is a further refinement of the foregoing technical solutions, and the technical solutions in this embodiment may be combined with the alternatives in one or more of the foregoing embodiments. As shown in fig. 2, the pose estimation method may include the following steps:
step 210, obtaining image data and inertial data of the environment of the mobile device at the target moment.
Step 220, scanning the image data in sequence through a plurality of convolution kernels with different sizes, and extracting image characteristic information in the image data; and pooling the image characteristic information along the horizontal direction and the vertical direction respectively to obtain the image characteristics corresponding to the image data.
In an optional implementation manner of this embodiment, after the image data and the inertial data of the environment where the mobile device is located at the target time are acquired, the image data may be further scanned sequentially by a plurality of convolution kernels of different sizes, and image feature information in the image data is extracted; and pooling the image characteristic information along the horizontal direction and the vertical direction respectively to obtain the image characteristics corresponding to the image data.
In this embodiment, the image feature extraction module is composed of a plurality of convolutional layers, each containing a plurality of convolution kernels, which scan the whole image from left to right and from top to bottom to obtain the image features. The convolutional layers at the front of the network capture local, detailed information of the image; the receptive fields of the subsequent convolutional layers grow layer by layer and capture more complex and abstract information. Through the operation of these convolutional layers, abstract representations of the image at different scales are finally obtained as the output of the image feature extraction module. In order to improve the feature extraction capability, a coordinate attention module is added to the convolutional neural network. The coordinate attention mechanism decomposes channel attention into two one-dimensional feature encoding processes that aggregate features along the two spatial directions respectively: long-range dependencies are captured in one spatial direction while precise position information is retained in the other. The resulting feature maps are then encoded into a pair of direction-aware and position-sensitive attention maps, which are applied complementarily to the input feature map to enhance the representation of the object of interest.
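The following is a minimal sketch of such a coordinate attention block (assuming a PyTorch implementation; the reduction ratio and channel sizes are illustrative assumptions and need not match the configuration in Table 1):

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over the width: one value per row
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over the height: one value per column
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):                         # x: (B, C, H, W)
        b, c, h, w = x.size()
        x_h = self.pool_h(x)                      # (B, C, H, 1), aggregated along W
        x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (B, C, W, 1), aggregated along H
        y = torch.cat([x_h, x_w], dim=2)          # concatenate the two 1-D encodings
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w                      # direction-aware, position-sensitive re-weighting
```

The two pooled branches correspond to the horizontal and vertical pooling shown in fig. 3, and the final multiplication applies the attention maps complementarily to the input feature map.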
Fig. 3 is a schematic structural diagram of a convolutional neural network for extracting image features according to the second embodiment of the present invention. It can be understood that fig. 3 shows only part of the network, not its complete structure. As can be seen from fig. 3, in this embodiment, after the image data is processed by the convolutional layers, the convolved feature maps are pooled in the X and Y directions respectively, and the pooled results are concatenated and convolved, so that an attention mechanism is introduced into the extraction of the image features of the visual image; important information about the target object can thus be emphasized or selected, and irrelevant detail information can be suppressed.
The parameter configuration of the image feature extraction network in the present embodiment may be as shown in table 1.
TABLE 1 image feature extraction network parameter configuration
Step 230, scanning the inertial data through a plurality of one-dimensional convolution kernels, and extracting inertial feature information in the inertial data; and pooling the inertial feature information along the horizontal direction and the vertical direction respectively to obtain the inertial features corresponding to the inertial data.
In an optional implementation manner of this embodiment, after the image data and the inertial data of the environment where the mobile device is located at the target time are acquired, the inertial data may be further scanned by a plurality of one-dimensional convolution kernels, and inertial feature information in the inertial data is extracted; and pooling the inertia characteristic information along the horizontal direction and the vertical direction respectively to obtain inertia characteristics corresponding to the inertia data.
It should be noted that, because the dimensions of the inertial data differ from those of the image data, the image features and the inertial features cannot be extracted by the same convolutional neural network. In this embodiment, a one-dimensional convolutional neural network is therefore used to extract the inertial features, and, in order to better capture the relations between the internal features of the input data, a self-attention mechanism is introduced into the inertial feature extraction module to dynamically generate the weights of the different connections.
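A hedged sketch of such an inertial feature extraction branch is given below, assuming a PyTorch implementation; the channel counts, window length and number of attention heads are illustrative assumptions rather than the configuration of Table 2.

```python
import torch
import torch.nn as nn

class InertialEncoder(nn.Module):
    """1-D convolutional encoder for IMU data followed by a self-attention layer."""
    def __init__(self, imu_channels=6, feat_dim=128, num_heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(imu_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Self-attention dynamically generates the weights between the time steps.
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, imu):                 # imu: (B, 6, T) -- angular velocity + acceleration
        f = self.conv(imu)                  # (B, feat_dim, T)
        f = f.transpose(1, 2)               # (B, T, feat_dim), one token per time step
        attn_out, _ = self.self_attn(f, f, f)
        return attn_out                     # inertial features f_IMU: (B, T, feat_dim)
```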
The parameter configuration of the inertial feature extraction network in this embodiment may be as shown in table 2.
TABLE 2 inertial feature extraction network parameter configuration
At present, most learning-based visual-inertial pose estimation uses a long short-term memory (LSTM) network to process the inertial data, but LSTM networks are time-consuming and computationally complex. The embodiment of the invention instead uses a one-dimensional convolutional neural network to process the inertial data and extract the inertial features, and introduces a self-attention mechanism into the inertial feature extraction network. The input received by the neural network is a set of vectors of different sizes that are related to one another, but these relations cannot be fully exploited during ordinary training, which degrades the trained model. The self-attention mechanism lets the network attend to the correlations between different parts of the whole input, thereby improving the accuracy of the pose estimation.
Step 240, cascading the image features and the inertial features, and fusing the cascaded image and inertial features through an encoder to obtain the fusion features.
In an optional implementation manner of this embodiment, after the image feature and the inertia feature are obtained, the image feature and the inertia feature may be cascaded, and the cascaded image feature and the inertia feature are fused by an encoder to obtain a fusion feature.
In this embodiment, the cascaded inertial and image features may be input into a fusion network that fuses the two to obtain the fusion features. In an optional implementation of this embodiment, the encoder of a Transformer network may be used to fuse the features of the two modalities, image and inertia. The positional encoding module in the Transformer structure abstracts the information acquired from the different sensors over the time sequence so as to ensure the accuracy of the model; and because the Transformer follows the idea of the attention mechanism, it does not depend excessively on the data of the previous moment, concentrates its learning on the effective sensor information, and computes faster. The image features and the inertial features extracted by the neural networks are cascaded and input into the encoder layers of the Transformer network for multi-sensor fusion, and the fused features are output.
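A minimal sketch of this fusion step is given below, assuming a PyTorch implementation with a learned positional encoding; d_model, the number of encoder layers and the maximum token length are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, d_model=256, num_heads=8, num_layers=2, max_len=512):
        super().__init__()
        # Learned positional encoding over the cascaded token sequence.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, f_v, f_imu):
        # f_v:   (B, N_v, d_model) image tokens (e.g. a flattened feature map)
        # f_imu: (B, N_i, d_model) inertial tokens
        tokens = torch.cat([f_v, f_imu], dim=1)              # cascade the two modalities
        tokens = tokens + self.pos_embed[:, :tokens.size(1)]
        return self.encoder(tokens)                          # fused features f_out
```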
Step 250, processing the fusion features through a fully connected layer to obtain the pose of the mobile device at the target moment.
In this embodiment, after the fusion features are obtained, they may be processed through the fully connected layer to predict the pose of the mobile device at the target moment.
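A minimal sketch of such a fully connected pose head is given below, assuming PyTorch and a pose parameterised as three translation and three rotation components; the hidden size and the averaging over tokens are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    def __init__(self, d_model=256, hidden=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),                 # [tx, ty, tz, roll, pitch, yaw]
        )

    def forward(self, f_out):                     # f_out: (B, N, d_model) fused features
        return self.fc(f_out.mean(dim=1))         # pose at the target moment: (B, 6)
```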
According to the scheme of this embodiment, the image features are extracted from the image data by the image feature extraction network, the inertial features are extracted from the inertial data by the inertial feature extraction network, and the image and inertial features are fused by the encoder, so that the pose of the mobile device can be obtained accurately, providing a basis for improving the accuracy of the SLAM system.
For better understanding of the embodiment of the present invention, fig. 4 is a flowchart of a pose estimation method according to a second embodiment of the present invention, and referring to fig. 4, the method specifically includes the following steps:
step 410, acquiring image data.
Step 411, extracting image features through a convolutional neural network.
Step 420, obtaining inertial data.
Step 421, extracting inertial features through a one-dimensional convolutional neural network with a self-attention mechanism.
Step 430, cascading the image features and the inertial features.
Step 440, fusing the image features and the inertial features.
Step 450, predicting the pose with 6 degrees of freedom.
In a specific example of this embodiment, the input image at the previous moment may be preprocessed and resized to 640 × 192 pixels to meet the input requirement of the image feature extraction network. The processed image is then used as the input of the image feature extraction network; the network is weight-initialized with the pre-trained FlowNet model, trained on the KITTI data set, and outputs the image features f_V for subsequent processing.
Furthermore, the data of the inertial sensor at the current moment is used as the input of the inertial feature extraction network, which processes the inertial data with a one-dimensional convolutional network (1D CNN) and outputs the inertial features f_IMU for subsequent processing.
Further, the image features f_V and the inertial features f_IMU obtained in the above steps are cascaded, and the cascaded image-inertial features are input into the Transformer module for multi-modal feature fusion. The fusion module uses the encoder of a Transformer network; its positional encoding module abstracts the information acquired from the different sensors over the time sequence so as to ensure the accuracy of the model and, following the idea of the attention mechanism, concentrates the learning on the effective sensor information, giving a higher computation speed. The Transformer module outputs the fused features f_out.
Further, the fused features f_out are input into the fully connected layer for processing, and the 6-degree-of-freedom pose estimate of the camera is predicted.
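The following sketch wires the illustrative modules from the previous snippets (InertialEncoder, FusionEncoder and PoseHead, defined above) into one end-to-end forward pass, assuming PyTorch; the small convolutional image branch merely stands in for the FlowNet-initialised network and is an assumption, not the exact architecture of this embodiment.

```python
import torch
import torch.nn as nn

class VIOPoseNet(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.image_net = nn.Sequential(              # stand-in for the FlowNet-style image branch
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((6, 20)),           # reduce the feature map to 6 x 20 = 120 tokens
        )
        self.imu_net = InertialEncoder(feat_dim=d_model)
        self.fusion = FusionEncoder(d_model=d_model)
        self.head = PoseHead(d_model=d_model)

    def forward(self, img, imu):
        f_v = self.image_net(img)                    # (B, d_model, 6, 20)
        f_v = f_v.flatten(2).transpose(1, 2)         # (B, 120, d_model) image tokens f_V
        f_imu = self.imu_net(imu)                    # (B, T, d_model) inertial tokens f_IMU
        f_out = self.fusion(f_v, f_imu)              # Transformer-encoder fusion
        return self.head(f_out)                      # 6-degree-of-freedom pose

# Example call with a 640 x 192 image and 11 IMU samples:
# pose = VIOPoseNet()(torch.randn(1, 3, 192, 640), torch.randn(1, 6, 11))
```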
According to the scheme of the embodiment of the invention, the camera data and the inertial data are processed by a convolutional neural network and a one-dimensional convolutional neural network respectively, an attention mechanism is introduced, and finally a Transformer encoder fuses the extracted visual and inertial features to obtain the 6-degree-of-freedom pose estimate. Compared with VIO algorithms that encode the inertial information with an LSTM, the computation time is reduced, and the efficiency of the VIO system and the accuracy of the pose estimation are improved.
In the technical scheme of the embodiments of the invention, the acquisition, storage and use of any personal information of users involved (such as face information or voice information) comply with the relevant laws and regulations and do not violate public order and good customs.
EXAMPLE III
Fig. 5 is a schematic structural diagram of a pose estimation apparatus provided in the third embodiment of the present invention. As shown in fig. 5, the apparatus includes: a data acquisition module 510, a feature extraction module 520, a fusion feature determination module 530, and a pose determination module 540.
The data acquisition module 510 is configured to acquire image data and inertial data of the environment in which the mobile device is located at the target moment;
a feature extraction module 520, configured to perform feature extraction on the image data and the inertial data, respectively, to obtain an image feature of the image data and an inertial feature of the inertial data;
a fusion feature determining module 530, configured to determine a fusion feature according to the image feature and the inertial feature;
a pose determination module 540, configured to determine a pose of the mobile device at the target time according to the fusion feature.
According to the scheme of the embodiment, the image data and the inertia data of the environment of the mobile equipment at the target moment are acquired through the data acquisition module; respectively extracting the features of the image data and the inertial data through a feature extraction module to obtain the image features of the image data and the inertial features of the inertial data; determining fusion characteristics according to the image characteristics and the inertia characteristics through a fusion characteristic determination module; the pose of the mobile equipment at the target moment is determined by the pose determination module according to the fusion characteristics, so that the poses of the mobile equipment at different moments can be accurately determined.
In an optional implementation manner of this embodiment, the data acquisition module 510 is specifically configured to obtain, through each image sensor installed in the mobile device, image data of the environment in which the mobile device is located at the target moment; and to obtain, through an inertial sensor installed in the mobile device, inertial data of the mobile device at the target moment; wherein the inertial data comprise angular velocity data and/or acceleration data.
In an optional implementation manner of this embodiment, the feature extraction module 520 is specifically configured to scan the image data sequentially through a plurality of convolution kernels with different sizes, and extract image feature information in the image data; and to pool the image feature information along the horizontal direction and the vertical direction respectively to obtain the image features corresponding to the image data.
In an optional implementation manner of this embodiment, the feature extraction module 520 is further specifically configured to scan the inertial data through a plurality of one-dimensional convolution kernels, and extract inertial feature information in the inertial data; and to pool the inertial feature information along the horizontal direction and the vertical direction respectively to obtain the inertial features corresponding to the inertial data.
In an optional implementation manner of this embodiment, the fusion feature determination module 530 is configured to cascade the image features and the inertial features, and fuse the cascaded image and inertial features through an encoder to obtain the fusion features.
In an optional implementation manner of this embodiment, the pose determination module 540 is specifically configured to process the fusion features through a fully connected layer to obtain the pose of the mobile device at the target moment.
In an optional implementation of this embodiment, the image feature and the inertial feature have different dimensions; the mobile device is an unmanned aerial vehicle, an unmanned vehicle or a robot.
The pose estimation device provided by the embodiment of the invention can execute the pose estimation method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
FIG. 6 illustrates a block diagram of an electronic device 10 that may be used to implement embodiments of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of embodiments of the inventions described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the pose estimation method.
In some embodiments, the pose estimation method can be implemented as a computer program that is tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the pose estimation method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the pose estimation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Computer programs for implementing methods of embodiments of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of embodiments of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the embodiments of the present invention may be executed in parallel, may be executed sequentially, or may be executed in different orders, as long as the desired result of the technical solution of the embodiments of the present invention can be achieved, which is not limited herein.
The above detailed description does not limit the scope of the embodiments of the present invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the embodiments of the present invention should be included in the scope of the embodiments of the present invention.

Claims (10)

1. A pose estimation method, comprising:
acquiring image data and inertial data of an environment where the mobile equipment is located at a target moment;
performing feature extraction on the image data and the inertial data respectively to obtain image features of the image data and inertial features of the inertial data;
determining fusion characteristics according to the image characteristics and the inertia characteristics;
and determining the pose of the mobile equipment at the target moment according to the fusion features.
2. The method of claim 1, wherein obtaining image data and inertial data of an environment in which the mobile device is located at the target time comprises:
acquiring image data of an environment of the mobile equipment at a target moment through each image sensor installed in the mobile equipment;
acquiring inertial data of the mobile equipment at a target moment through an inertial sensor installed in the mobile equipment; wherein the inertial data comprises angular velocity data and/or acceleration data.
3. The method according to claim 2, wherein the performing feature extraction on the image data and the inertial data to obtain image features of the image data and inertial features of the inertial data respectively comprises:
scanning the image data in sequence through a plurality of convolution kernels with different sizes, and extracting image characteristic information in the image data;
pooling the image characteristic information along a horizontal direction and a vertical direction respectively to obtain image characteristics corresponding to the image data.
4. The method according to claim 2, wherein the extracting features of the image data and the inertial data to obtain image features of the image data and inertial features of the inertial data, respectively, further comprises:
scanning the inertial data through a plurality of one-dimensional convolution kernels, and extracting inertial characteristic information in the inertial data;
pooling the inertia characteristic information along a horizontal direction and a vertical direction respectively to obtain inertia characteristics corresponding to the inertia data.
5. The method of claim 1, wherein determining a fused feature from the image feature and the inertial feature comprises:
and cascading the image features and the inertia features, and fusing the cascaded image features and the inertia features through an encoder to obtain fused features.
6. The method of claim 1, wherein determining the pose of the mobile device at the target time based on the fused features comprises:
and processing the fusion characteristics through a full connection layer to obtain the pose of the mobile equipment at the target moment.
7. The method of any of claims 1-6, wherein the image feature is a different dimension than the inertial feature;
the mobile device is an unmanned aerial vehicle, an unmanned vehicle or a robot.
8. A pose estimation apparatus, characterized by comprising:
the data acquisition module is used for acquiring image data and inertial data of the environment of the mobile equipment at the target moment;
the feature extraction module is used for respectively extracting features of the image data and the inertial data to obtain image features of the image data and inertial features of the inertial data;
the fusion characteristic determining module is used for determining fusion characteristics according to the image characteristics and the inertia characteristics;
and the pose determining module is used for determining the pose of the mobile equipment at the target moment according to the fusion characteristics.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the pose estimation method of any one of claims 1-7.
10. A computer-readable storage medium characterized in that the computer-readable storage medium stores computer instructions for causing a processor, when executed, to implement the pose estimation method according to any one of claims 1 to 7.
CN202210796618.8A 2022-07-06 2022-07-06 Pose estimation method and device, electronic equipment and storage medium Pending CN115170914A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210796618.8A CN115170914A (en) 2022-07-06 2022-07-06 Pose estimation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210796618.8A CN115170914A (en) 2022-07-06 2022-07-06 Pose estimation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115170914A true CN115170914A (en) 2022-10-11

Family

ID=83491390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210796618.8A Pending CN115170914A (en) 2022-07-06 2022-07-06 Pose estimation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115170914A (en)

Similar Documents

Publication Publication Date Title
WO2021103648A1 (en) Hand key point detection method, gesture recognition method, and related devices
CN111079619B (en) Method and apparatus for detecting target object in image
CN109683699B (en) Method and device for realizing augmented reality based on deep learning and mobile terminal
CN112668460A (en) Target detection method, electronic equipment, road side equipment and cloud control platform
CN115147558B (en) Training method of three-dimensional reconstruction model, three-dimensional reconstruction method and device
CN114550177B (en) Image processing method, text recognition method and device
CN113377888B (en) Method for training object detection model and detection object
EP3937077B1 (en) Lane marking detecting method, apparatus, electronic device, storage medium, and vehicle
CN113378712B (en) Training method of object detection model, image detection method and device thereof
CN115578433B (en) Image processing method, device, electronic equipment and storage medium
CN112085789A (en) Pose estimation method, device, equipment and medium
CN115719436A (en) Model training method, target detection method, device, equipment and storage medium
CN115578515B (en) Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device
CN111797745A (en) Training and predicting method, device, equipment and medium of object detection model
CN116188893A (en) Image detection model training and target detection method and device based on BEV
CN116249607A (en) Method and device for robotically gripping three-dimensional objects
CN115457365B (en) Model interpretation method and device, electronic equipment and storage medium
CN116453222A (en) Target object posture determining method, training device and storage medium
CN114429631B (en) Three-dimensional object detection method, device, equipment and storage medium
CN116129422A (en) Monocular 3D target detection method, monocular 3D target detection device, electronic equipment and storage medium
CN113516013B (en) Target detection method, target detection device, electronic equipment, road side equipment and cloud control platform
CN115205806A (en) Method and device for generating target detection model and automatic driving vehicle
CN115170914A (en) Pose estimation method and device, electronic equipment and storage medium
CN114266879A (en) Three-dimensional data enhancement method, model training detection method, three-dimensional data enhancement equipment and automatic driving vehicle
CN116129087A (en) Positioning method, visual map generation method and device thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination