CN114972412A - Attitude estimation method, device and system and readable storage medium - Google Patents

Attitude estimation method, device and system and readable storage medium

Info

Publication number
CN114972412A
Authority
CN
China
Prior art keywords
event
feature
standard
fusion
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210515831.7A
Other languages
Chinese (zh)
Inventor
沈益冉
郭旭成
杜博闻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ruishi Zhixin Technology Co ltd
Original Assignee
Shenzhen Ruishi Zhixin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Ruishi Zhixin Technology Co ltd filed Critical Shenzhen Ruishi Zhixin Technology Co ltd
Priority to CN202210515831.7A priority Critical patent/CN114972412A/en
Publication of CN114972412A publication Critical patent/CN114972412A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/207Analysis of motion for motion estimation over a hierarchy of resolutions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an attitude estimation method, device, system and readable storage medium, wherein the method comprises the following steps: acquiring an event image based on event data and acquiring a standard image based on video data; merging temporally adjacent event images and temporally adjacent standard images along the channel dimension respectively, and inputting them into a feature extraction branch network to extract event image features and standard image features; inputting the event image features and the standard image features into a feature fusion branch network for feature fusion to obtain a fusion feature; and inputting the fusion feature into an attitude estimation branch network for attitude regression to obtain system attitude change data. By implementing this scheme, the attitude change of the system during motion is jointly estimated by combining the data collected by the event camera and the standard camera; when the system is in a high-speed, high-dynamic-range scene, more motion information is available for estimating the attitude change of the system, thereby improving the accuracy and stability of attitude estimation.

Description

Attitude estimation method, device and system and readable storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, a system, and a readable storage medium for estimating an attitude.
Background
For autonomous navigation of intelligent vehicles, the self-positioning capability of the vehicle during motion is very important; estimating the position of the vehicle by fusing data from various sensors is known as odometry. Early vehicle positioning systems typically employed wheel speed encoders to calculate the distance travelled; however, such methods suffer from serious cumulative errors, and factors such as tire slip make the results even less predictable. With the development of computer vision technology, vision sensors are increasingly used for vehicle positioning and motion estimation. A visual sensor not only provides rich perception information but also has the advantages of low cost and small size. Researchers generally refer to the problem of obtaining the camera pose visually as Visual Odometry (VO), which over the past two decades has been widely used for the navigation and positioning of various kinds of robots.
However, the currently mainstream VO methods estimate the pose of the camera mainly from the geometric characteristics of objects in the image, so the image is required to contain a large number of stable texture features; once occlusions appear in the scene or the field of view is degraded in rain or fog, the solution accuracy of such geometric methods is severely disturbed unless other sensors (IMU, laser, radar, etc.) are available.
Disclosure of Invention
The embodiments of the present application provide an attitude estimation method, device, system and readable storage medium, which can at least alleviate the low attitude estimation accuracy that results, in the related art, from estimating the camera pose solely from the geometric characteristics of objects in an image.
A first aspect of an embodiment of the present application provides a pose estimation method, which is applied to a pose estimation system, where the pose estimation system is configured with an event camera and a standard camera, the event camera is used to collect event data, and the standard camera is used to collect video data, and the pose estimation method includes:
acquiring an event image based on the event data and acquiring a standard image based on the video data;
merging temporally adjacent event images and temporally adjacent standard images along the channel dimension respectively, and then inputting them into a feature extraction branch network to extract event image features and standard image features;
inputting the event image features and the standard image features into a feature fusion branch network for feature fusion to obtain fusion features;
and inputting the fusion characteristics into a posture estimation branch network for posture regression to obtain system posture change data.
A second aspect of the embodiments of the present application provides a pose estimation apparatus, which is applied to a pose estimation system, where the pose estimation system is configured with an event camera and a standard camera, the event camera is used for collecting event data, and the standard camera is used for collecting video data, and the pose estimation apparatus includes:
the acquisition module is used for acquiring an event image based on the event data and acquiring a standard image based on the video data;
the extraction module is used for merging temporally adjacent event images and temporally adjacent standard images along the channel dimension respectively, and then inputting them into the feature extraction branch network to extract event image features and standard image features;
the fusion module is used for inputting the event image characteristics and the standard image characteristics into a characteristic fusion branch network for characteristic fusion to obtain fusion characteristics;
and the estimation module is used for inputting the fusion characteristics to a posture estimation branch network for posture regression to obtain system posture change data.
A third aspect of the embodiments of the present application provides an attitude estimation system, including: the system comprises an event camera, a standard camera, a memory and a processor, wherein the event camera is used for collecting event data, and the standard camera is used for collecting video data; the processor is configured to execute the computer program stored on the memory, and when the processor executes the computer program, the processor implements the steps in the posture estimation method provided by the first aspect of the embodiment of the present application.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the posture estimation method provided in the first aspect of the embodiments of the present application.
In view of the above, according to the attitude estimation method, apparatus, system and readable storage medium provided in the present application, an event image is obtained based on event data, and a standard image is obtained based on video data; temporally adjacent event images and temporally adjacent standard images are respectively merged along the channel dimension and input into a feature extraction branch network to extract event image features and standard image features; the event image features and the standard image features are input into a feature fusion branch network for feature fusion to obtain a fusion feature; and the fusion feature is input into an attitude estimation branch network for attitude regression to obtain system attitude change data. Through the implementation of this scheme, the attitude change of the system during motion is jointly estimated by combining the data collected by the event camera and the standard camera; when the system is in a high-speed, high-dynamic-range scene, more motion information is available for estimating the attitude change of the system, effectively improving the accuracy and stability of attitude estimation.
Drawings
Fig. 1 is a schematic structural diagram of an attitude estimation neural network according to a first embodiment of the present application;
fig. 2 is a basic flowchart of an attitude estimation method according to a first embodiment of the present application;
fig. 3 is a schematic view of an event image according to a first embodiment of the present application.
Fig. 4 is a schematic detailed flowchart of an attitude estimation method according to a second embodiment of the present application;
fig. 5 is a schematic diagram of program modules of an attitude estimation device according to a third embodiment of the present application;
fig. 6 is a schematic structural diagram of an attitude estimation system according to a fourth embodiment of the present application.
Detailed Description
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the embodiments of the present application, it is to be understood that the terms "length", "width", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, "a plurality" means two or more unless specifically defined otherwise.
In the embodiments of the present application, unless otherwise specifically stated or limited, the terms "mounted," "connected," and "fixed" are to be construed broadly, e.g., as meaning fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the embodiments of the present application can be understood by those of ordinary skill in the art according to specific situations.
The above description is only exemplary of the present application and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Over the last two decades, VO has been widely used for the navigation and positioning of various robots, such as the Mars rovers Spirit and Opportunity developed by NASA, as well as unmanned aerial vehicles, underwater robots, land robots, and so on. These mainstream VO methods estimate the position of the camera mainly from the geometric characteristics of objects in the picture, so the picture is required to contain a large amount of stable texture features; once occlusions appear in the scene or the field of view is degraded in rain or fog, the solution accuracy of the geometric methods is severely disturbed unless other sensors (IMU, laser, radar, etc.) are available. In many practical applications such other sensors may not be usable, so methods that rely on vision alone still leave considerable room for research.
In order to solve the problem in the related art of low pose estimation accuracy caused by estimating the camera pose from the geometric characteristics of objects in an image, a first embodiment of the present application provides an attitude estimation method applied to an attitude estimation system deployed on a moving object. The attitude estimation system is configured with an event camera for acquiring event data and a standard camera for acquiring video data. The event camera is a novel bionic sensor that asynchronously captures the change of light intensity at every pixel of the camera; compared with the standard camera, it performs better when capturing fast-moving and high-dynamic-range scenes. The standard camera may preferably be implemented with a depth camera. It should be understood that the event camera and the standard camera of this embodiment may be two mutually independent cameras, or two different sensor units independently disposed in one integrated camera. In practical applications, the two cameras work in cooperation and trigger data acquisition simultaneously: the event camera records event data in the form of an event stream, which may be stored in txt format, and the standard camera records video data in the form of continuous image frames, which may be stored as a series of png pictures.
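For illustration only, the following minimal sketch shows how recordings in these two formats might be loaded; the file layout, column order and helper names are assumptions and are not specified by the present application.

```python
import glob
import numpy as np
from PIL import Image

def load_event_stream(txt_path):
    """Load an event stream stored as a txt file.
    Assumed column order: timestamp, x, y, polarity (one event per line)."""
    data = np.loadtxt(txt_path)
    t, x, y, p = data[:, 0], data[:, 1], data[:, 2], data[:, 3]
    return x.astype(int), y.astype(int), t, p.astype(int)

def load_video_frames(frame_dir):
    """Load the standard-camera recording stored as a series of png frames,
    ordered by file name."""
    paths = sorted(glob.glob(f"{frame_dir}/*.png"))
    return [np.asarray(Image.open(p)) for p in paths]
```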
It should be noted that, in order to implement the attitude estimation method of this embodiment, an attitude estimation neural network is used. As shown in fig. 1, which is a schematic structural diagram of the attitude estimation neural network provided in this embodiment, the overall network comprises a feature extraction branch network A, a feature fusion branch network B and an attitude estimation branch network C.
Fig. 2 is a schematic basic flow chart of the attitude estimation method provided in this embodiment, and the attitude estimation method includes the following steps:
step 201, acquiring an event image based on the event data, and acquiring a standard image based on the video data.
Specifically, in this embodiment the video data is composed of a plurality of standard images connected in time sequence, and the recording times of the standard images in the video are defined as t ∈ [t_0, t_1, …, t_{n-1}]; the event data collected between times t_m and t_{m+1} corresponds to the interval between the m-th and (m+1)-th standard images. It should be noted that the event data is represented as e = {x_i, y_i, t_i, p_i}, i ∈ 0, 1, …, n-1. In practical applications, the event data can be divided into voxels according to time, and the events of each voxel are then converted into a two-dimensional event image, each event image being temporally aligned with a standard image.
In an implementation manner of this embodiment, the step of acquiring an event image based on the event data includes: dividing the event data into a plurality of voxels according to the event timestamps, wherein the event data is expressed as e = {x_i, y_i, t_i, p_i}, i ∈ 0, 1, …, n-1, with (x_i, y_i) representing the pixel coordinate position, t_i the timestamp, and p_i the polarity of the event; and converting each voxel into an event image based on a preset conversion formula. The conversion formula is given by the equation images in the original filing (BDA0003641384680000051 and BDA0003641384680000061), where V(x, y, t) represents a voxel and B represents the number of voxels.
It should be understood that, in order to avoid the image blurring caused by the excessive accumulation of the local events, the embodiment may convert the event data collected at a specific time into a two-dimensional image with a height of 260 and a width of 346, specifically referring to the schematic diagram of the event image shown in fig. 3.
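Because the conversion formula survives only as an equation image, the sketch below uses a common time-based voxelization, accumulating event polarities into B temporal bins of a 260×346 grid, purely as an illustrative stand-in; the bin count and the accumulation rule are assumptions rather than the formula actually claimed.

```python
import numpy as np

def events_to_event_images(x, y, t, p, num_bins=5, height=260, width=346):
    """Divide an event slice into B voxels along time and render each voxel as a
    two-dimensional event image by accumulating signed polarities per pixel
    (illustrative stand-in for the claimed conversion formula)."""
    voxels = np.zeros((num_bins, height, width), dtype=np.float32)
    if t.size == 0:
        return voxels
    # Map each timestamp to a temporal bin index in [0, num_bins - 1].
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    bins = np.minimum((t_norm * num_bins).astype(int), num_bins - 1)
    # Accumulate +1 / -1 polarity at each (bin, y, x) location.
    np.add.at(voxels, (bins, y, x), np.where(p > 0, 1.0, -1.0))
    return voxels
```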
Step 202, merge temporally adjacent event images and temporally adjacent standard images along the channel dimension respectively, and then input them into the feature extraction branch network to extract event image features and standard image features.
Referring to fig. 1 again, in the present embodiment the adjacent standard images from the Standard Camera and the adjacent event images from the Event Camera are merged along the channel dimension and then used as the inputs of the feature extraction branch network, which performs the feature extraction. It should be understood that the number of chronologically adjacent event images and the number of chronologically adjacent standard images selected in this embodiment may preferably be two.
In an implementation manner of this embodiment, before the step of merging the temporally adjacent event images and the temporally adjacent standard images along the channel dimension, the method further includes: calculating the mean and variance over all the event images and over all the standard images respectively, and performing normalization and standardization on all the event images and standard images; and replicating each event image and each standard image three times, placing the copies in the three color channels, and representing each image as a tensor of preset dimensions.
Specifically, in this embodiment, on the one hand, the mean and variance of the pixels of all event images are calculated, and normalization and standardization are applied to all event images; each two-dimensional image representing events is replicated three times and the copies are placed in three different channels, forming a tensor of size [3, 260, 346]. Adjacent event images are merged in the channel dimension to become a tensor of size [6, 260, 346], the obtained tensor is input into the feature extraction branch network, and finally a tensor of size [1024, 5, 6] is obtained, i.e. the extracted event image feature. On the other hand, the mean and variance of all standard image pixels in the whole video data are calculated and all images are normalized; each image in the video has a height of 260, a width of 346 and three RGB channels, i.e. it is a tensor of size [3, 260, 346]. Adjacent images are merged in the channel dimension into a tensor of size [6, 260, 346], the obtained tensor is input into the feature extraction branch network, and finally a tensor of size [1024, 5, 6] is obtained, i.e. the standard image feature.
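The tensor bookkeeping described above can be summarized with the hedged sketch below; the convolutional backbone is a generic stand-in, since the patent does not disclose the internal architecture of the feature extraction branch network, and the normalization details are assumptions.

```python
import torch
import torch.nn as nn

def prepare_pair(img_t, img_t1, mean, std):
    """img_t, img_t1: [260, 346] single-channel arrays (an event image or a frame).
    Standardize, replicate into three channels, and merge the adjacent pair
    along the channel dimension into a [6, 260, 346] tensor."""
    def to_three(img):
        x = torch.as_tensor(img, dtype=torch.float32)
        x = (x - mean) / std                      # standardize with dataset statistics
        return x.unsqueeze(0).repeat(3, 1, 1)     # [3, 260, 346]
    return torch.cat([to_three(img_t), to_three(img_t1)], dim=0)

# Generic stand-in for the feature extraction branch network:
# maps a [B, 6, 260, 346] input to a [B, 1024, 5, 6] feature map.
feature_branch = nn.Sequential(
    nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
    nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(512, 1024, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((5, 6)),                 # [B, 1024, 5, 6]
)
```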
Step 203, inputting the event image characteristics and the standard image characteristics into the characteristic fusion branch network for characteristic fusion to obtain fusion characteristics.
Specifically, in an implementation manner of this embodiment, the feature fusion branch network includes a self-attention network; referring to fig. 1 again, the self-attention network includes a first self-attention network (i.e., Self-attention (event) in fig. 1) and a second self-attention network (i.e., Self-attention (image) in fig. 1). Correspondingly, the step of inputting the event image features and the standard image features into the feature fusion branch network for feature fusion to obtain a fusion feature includes: inputting the event image feature into the first self-attention network of the feature fusion branch network to acquire a first mask corresponding to the event image feature, and inputting the standard image feature into the second self-attention network to acquire a second mask corresponding to the standard image feature; multiplying the event image feature by the first mask through the first self-attention network to obtain a first coarse filtering feature, and multiplying the standard image feature by the second mask through the second self-attention network to obtain a second coarse filtering feature; and performing feature fusion based on the first coarse filtering feature and the second coarse filtering feature to obtain a fusion feature.
It should be appreciated that the self-attention network captures long-term dependencies and global channel correlations by comparing the similarity of a single channel to all channels, which focuses on the main features from the same sensor, allowing better selection of features collected by standard and event cameras.
Further, in an implementation manner of this embodiment, the step of obtaining the first mask corresponding to the event image feature includes: acquiring the first mask corresponding to the event image feature through a preset first mask calculation formula (given in the original filing as equation image BDA0003641384680000071), wherein e_i and e_j each represent the feature vector of one channel of the event image feature, W_i and W_j respectively represent learnable weight matrices that project e_i and e_j into the embedding space, and g() represents a function that projects a feature vector into a new embedding space.
And the step of obtaining a second mask corresponding to the standard image feature includes: acquiring the second mask corresponding to the standard image feature through a preset second mask calculation formula (given in the original filing as equation image BDA0003641384680000072), wherein f_i and f_j each represent the feature vector of one channel of the standard image feature.
Specifically, in this embodiment the operations in the mask calculation formulas are performed for the event image feature and the standard image feature respectively to obtain the corresponding masks, both of size [1024, 1]. The tensors e_att and f_att output after multiplying the masks by e and f respectively are both of size [1024, 5, 6]; e_att and f_att are the first coarse filtering feature and the second coarse filtering feature, i.e. the useful features obtained after self-filtering of the image features.
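Since the first and second mask formulas are available only as equation images, the sketch below shows one plausible reading of the channel self-attention described above: each of the 1024 channel vectors is projected into an embedding space, scored against all channels, squashed with a Sigmoid into a [1024, 1] mask, and used to gate the feature map. The embedding size and the pooling of pairwise scores into a per-channel mask are assumptions.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Illustrative self-attention over channels: compares each channel's feature
    vector with all channels and produces a [1024, 1] gating mask."""
    def __init__(self, channels=1024, spatial=5 * 6, embed=64):
        super().__init__()
        self.proj_i = nn.Linear(spatial, embed)   # plays the role of W_i
        self.proj_j = nn.Linear(spatial, embed)   # plays the role of W_j
        self.g = nn.Linear(spatial, spatial)      # plays the role of g()

    def forward(self, feat):                      # feat: [B, 1024, 5, 6]
        b, c, h, w = feat.shape
        vecs = feat.view(b, c, h * w)             # one feature vector per channel
        scores = self.proj_i(vecs) @ self.proj_j(vecs).transpose(1, 2)   # [B, C, C]
        mask = torch.sigmoid(scores.mean(dim=2, keepdim=True))           # [B, C, 1]
        gated = self.g(vecs) * mask               # gate every channel
        return gated.view(b, c, h, w)             # coarse filtering feature, [B, 1024, 5, 6]
```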
Further, in an implementation manner of this embodiment, the feature fusion branch network further includes a cross attention network (i.e., Cross-attention in fig. 1). The cross attention network is intended to fuse the complementary data of the different sensors and focuses more on the features of different sensors in different scenarios.
Correspondingly, the step of performing feature fusion based on the first coarse filtering feature and the second coarse filtering feature to obtain a fusion feature includes: inputting the first coarse filtering characteristic into a cross attention network to obtain a third mask corresponding to the standard image characteristic, and inputting the second coarse filtering characteristic into the cross attention network to obtain a fourth mask corresponding to the event image characteristic; performing multiplication operation on the first coarse filtering feature and the fourth mask through a cross attention network to obtain a first fine filtering feature, and performing multiplication operation on the second coarse filtering feature and the third mask to obtain a second fine filtering feature; and performing feature fusion on the first fine filtering feature and the second fine filtering feature to obtain a fusion feature.
Preferably, in this embodiment, the step of obtaining the third mask corresponding to the standard image feature includes: acquiring the third mask corresponding to the standard image feature through a preset third mask calculation formula, expressed as: m_{e→f} = Sigmoid((W_{e→f} e_i)^T W_{e→f} e_j).
And the step of obtaining a fourth mask corresponding to the event image feature includes: acquiring the fourth mask corresponding to the event image feature through a preset fourth mask calculation formula, expressed as: m_{f→e} = Sigmoid((W_{f→e} e_i)^T W_{f→e} e_j).
It should be noted that the masks m_{f→e} and m_{e→f} are both of size [1024, 1]. Multiplying e_att and f_att by the corresponding masks yields the outputs e_out and f_out, both of size [1024, 5, 6]; e_out and f_out are the first fine filtering feature and the second fine filtering feature, which are the more useful parts of the originally extracted image features. e_out and f_out are flattened from three dimensions to one dimension, giving tensors of size [30720], and connected along the channel to obtain the final output; the feature fusion formula is given in the original filing as equation images BDA0003641384680000085 and BDA0003641384680000086.
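Taking the third and fourth mask formulas above together with the sizes quoted in the text, a minimal sketch of the cross gating and fusion step might look as follows; deriving each mask from the other modality follows the prose description, while the embedding dimension and the pooling of pairwise scores into a per-channel mask are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross attention: each modality's coarse feature yields a
    [1024, 1] mask that gates the other modality; both gated features are then
    flattened and concatenated into a single fusion feature."""
    def __init__(self, channels=1024, spatial=5 * 6, embed=64):
        super().__init__()
        self.w_e2f = nn.Linear(spatial, embed)    # plays the role of W_{e->f}
        self.w_f2e = nn.Linear(spatial, embed)    # plays the role of W_{f->e}

    def _mask(self, proj, vecs):
        emb = proj(vecs)                          # [B, C, embed]
        scores = emb @ emb.transpose(1, 2)        # pairwise (W v_i)^T (W v_j)
        return torch.sigmoid(scores.mean(dim=2, keepdim=True))   # [B, C, 1]

    def forward(self, e_att, f_att):              # both [B, 1024, 5, 6]
        b, c, _, _ = e_att.shape
        e_vec, f_vec = e_att.view(b, c, -1), f_att.view(b, c, -1)
        m_e2f = self._mask(self.w_e2f, e_vec)     # third mask, applied to the image branch
        m_f2e = self._mask(self.w_f2e, f_vec)     # fourth mask, applied to the event branch
        e_out = (e_vec * m_f2e).reshape(b, -1)    # [B, 30720]
        f_out = (f_vec * m_e2f).reshape(b, -1)    # [B, 30720]
        return torch.cat([e_out, f_out], dim=1)   # fusion feature, [B, 61440]
```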
and step 204, inputting the fusion characteristics into a posture estimation branch network for posture regression to obtain system posture change data.
Specifically, in this embodiment the attitude estimation branch network includes a long short-term memory unit (i.e., LSTM in fig. 1) and a fully connected layer. The LSTM part is used to estimate the frame-to-frame motion experienced by the camera, and in practical applications a stack of two LSTM layers is preferably used to achieve better performance. This embodiment inputs the fusion feature into the long short-term memory unit of the attitude estimation branch network for temporal modelling, obtaining estimation data; the estimation data is then input into the fully connected layer for processing, and system attitude change data is output.
Specifically, in this embodiment the above fusion feature is input into the LSTM for temporal modelling, so that the output out has size [1000]; the fully connected layer then outputs x, y and z representing the displacement and the Euler angles representing the rotation, thereby obtaining the attitude change of the system between adjacent moments.
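A minimal sketch of the attitude estimation branch described above follows; the two stacked LSTM layers, the [1000]-sized output and the six-dimensional pose (three displacements plus three Euler angles) follow the text, while the fused input width is an assumption taken from the concatenation sketched earlier.

```python
import torch
import torch.nn as nn

class PoseRegressionBranch(nn.Module):
    """Temporal modelling with a stack of two LSTM layers followed by a fully
    connected head that outputs (x, y, z) displacement and three Euler angles."""
    def __init__(self, fused_dim=2 * 30720, hidden=1000):
        super().__init__()
        self.lstm = nn.LSTM(fused_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, 6)            # 3 displacements + 3 Euler angles

    def forward(self, fused_seq):                 # fused_seq: [B, T, fused_dim]
        out, _ = self.lstm(fused_seq)             # out: [B, T, 1000]
        pose = self.fc(out)                       # per-step relative pose, [B, T, 6]
        return pose[..., :3], pose[..., 3:]       # (displacement, Euler angles)
```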
Based on the technical scheme of the embodiment of the application, the event image is obtained based on the event data, and the standard image is obtained based on the video data; combining the event images adjacent to the time sequence and the standard images adjacent to the time sequence in channel dimensions respectively, and then inputting the combined images into a feature extraction branch network respectively to extract the features of the event images and the features of the standard images; inputting the event image characteristics and the standard image characteristics into a characteristic fusion branch network for characteristic fusion to obtain fusion characteristics; and inputting the fusion characteristics into a posture estimation branch network for posture regression to obtain system posture change data. Through the implementation of the scheme, the attitude change of the system in the motion process is jointly estimated by combining the data collected by the event camera and the standard camera, when the system is in a high-speed and high-dynamic-range scene, more motion information can be provided for estimating the attitude change of the system, and the accuracy and the stability of attitude estimation are effectively improved.
The method in fig. 4 is a refined attitude estimation method provided in the second embodiment of the present application, and the attitude estimation method includes:
step 401, obtaining an event image based on the event data, and obtaining a depth image based on the video data.
Specifically, in this embodiment, the event camera and the depth camera work in cooperation to trigger data acquisition simultaneously, the event camera records event data in the form of an event stream, and the depth camera records video data in the form of continuous image frames.
Step 402, merging two event images with adjacent time sequences and two depth images with adjacent time sequences in channel dimensions, inputting the merged images into a feature extraction branch network, and extracting event image features and depth image features.
Step 403, inputting the event image feature into the first self-attention network of the feature fusion branch network, obtaining a first mask corresponding to the event image feature, and inputting the depth image feature into the second self-attention network, obtaining a second mask corresponding to the depth image feature.
Step 404, performing a multiplication operation on the event image feature and the first mask through the first self-attention network to obtain a first coarse filtering feature, and performing a multiplication operation on the depth image feature and the second mask through the second self-attention network to obtain a second coarse filtering feature.
In particular, the self-attention network captures long-term dependencies and global channel correlations by comparing the similarity of a single channel to all channels, which focuses on the main features from the same sensor, allowing better selection of features collected by standard and event cameras.
Step 405, inputting the first coarse filtering feature to the cross attention network to obtain a third mask corresponding to the depth image feature, and inputting the second coarse filtering feature to the cross attention network to obtain a fourth mask corresponding to the event image feature.
And step 406, performing multiplication operation on the first coarse filtering feature and the fourth mask through the cross attention network to obtain a first fine filtering feature, and performing multiplication operation on the second coarse filtering feature and the third mask to obtain a second fine filtering feature.
In particular, the cross attention network aims to fuse complementary data of different sensors, focuses more on characteristics of different sensors in different scenes, and can further extract more useful characteristics from useful characteristics extracted from the attention network.
And 407, performing feature fusion on the first fine filtering feature and the second fine filtering feature to obtain a fusion feature.
Step 408, inputting the fusion features into a long-term and short-term memory unit of the attitude estimation branch network for modeling in time sequence to obtain estimation data; and inputting the estimated data into the full connection layer for processing, and outputting system attitude change data.
Specifically, in this embodiment, the obtained fusion feature is input to the LSTM to perform time-series modeling, so that the output out is [1000], and finally, x, y, z representing displacement and euler angles representing rotation are output by the fully-connected layer, so as to obtain the posture change of the system between adjacent time instants.
It should be understood that, the size of the serial number of each step in this embodiment does not mean the execution sequence of the step, and the execution sequence of each step should be determined by its function and inherent logic, and should not be limited uniquely to the implementation process of the embodiment of the present application.
Based on the scheme of the embodiments of the present application, the attitude change of the system can be accurately predicted in various scenes: the system is deployed on a mobile device, and its attitude change can be obtained from the content captured by the event camera and the depth camera. A large number of experiments show that, compared with methods that use only a depth camera, the present scheme greatly improves the trajectory estimation accuracy in various scenes by automatically exploiting the advantageous information from the different sources.
Fig. 5 is a posture estimation device according to a third embodiment of the present application. The attitude estimation device is applied to an attitude estimation system, the attitude estimation system is provided with an event camera and a standard camera, the event camera is used for collecting event data, and the standard camera is used for collecting video data. As shown in fig. 5, the posture estimation device mainly includes:
an obtaining module 501, configured to obtain an event image based on event data and obtain a standard image based on video data;
an extraction module 502, configured to combine event images with adjacent time sequences and standard images with adjacent time sequences in channel dimensions, and then input the combined event images and standard images to a feature extraction branch network to extract event image features and standard image features;
the fusion module 503 is configured to input the event image features and the standard image features to the feature fusion branch network for feature fusion to obtain fusion features;
and the estimation module 504 is configured to input the fusion features to the posture estimation branch network to perform posture regression, so as to obtain system posture change data.
In some embodiments of this embodiment, when the acquiring module executes the function of acquiring the event image based on the event data, the acquiring module is specifically configured to: divide the event data into a plurality of voxels according to the event timestamps, wherein the event data is expressed as e = {x_i, y_i, t_i, p_i}, i ∈ 0, 1, …, n-1, with (x_i, y_i) representing the pixel coordinate position, t_i the timestamp, and p_i the polarity of the event; and convert each voxel into an event image based on a preset conversion formula. The conversion formula is given by the equation images in the original filing (BDA0003641384680000111 and BDA0003641384680000121), where V(x, y, t) represents a voxel and B represents the number of voxels.
In some implementations of this embodiment, the attitude estimation device further includes: a processing module, configured to calculate the mean and variance over all the event images and over all the standard images respectively, and perform normalization and standardization on all the event images and standard images; and to replicate each event image and each standard image three times, place the copies in the three color channels, and represent each image as a tensor of preset dimensions.
In some embodiments of the present embodiment, the feature fusion branch network includes a self-attention network, and the self-attention network includes a first self-attention network and a second self-attention network. Correspondingly, the fusion module is specifically configured to: inputting the event image features into a first self-attention network of a feature fusion branch network to obtain a first mask corresponding to the event image features, and inputting the standard image features into a second self-attention network to obtain a second mask corresponding to the standard image features; multiplying the event image feature and the first mask code through a first self-attention network to obtain a first coarse filtering feature, and multiplying the standard image feature and the second mask code through a second self-attention network to obtain a second coarse filtering feature; and performing feature fusion based on the first coarse filtering feature and the second coarse filtering feature to obtain a fusion feature.
Further, in some embodiments of this embodiment, the fusion module, when executing the function of acquiring the first mask corresponding to the event image feature, is specifically configured to: acquire the first mask corresponding to the event image feature through a preset first mask calculation formula (given in the original filing as equation image BDA0003641384680000122), wherein e_i and e_j each represent the feature vector of one channel of the event image feature, W_i and W_j respectively represent learnable weight matrices that project e_i and e_j into the embedding space, and g() represents a function that projects a feature vector into a new embedding space. In addition, the fusion module, when executing the function of acquiring the second mask corresponding to the standard image feature, is specifically configured to: acquire the second mask corresponding to the standard image feature through a preset second mask calculation formula (given in the original filing as equation images BDA0003641384680000123 and BDA0003641384680000124), wherein f_i and f_j each represent the feature vector of one channel of the standard image feature.
Further, in other embodiments of this embodiment, the feature fusion branching network further includes a cross attention network. Correspondingly, when the fusion module performs the above-mentioned feature fusion based on the first coarse filtering feature and the second coarse filtering feature to obtain the function of the fusion feature, the fusion module is specifically configured to: inputting the first coarse filtering characteristic into a cross attention network to obtain a third mask corresponding to the standard image characteristic, and inputting the second coarse filtering characteristic into the cross attention network to obtain a fourth mask corresponding to the event image characteristic; performing multiplication operation on the first coarse filtering feature and the fourth mask through a cross attention network to obtain a first fine filtering feature, and performing multiplication operation on the second coarse filtering feature and the third mask to obtain a second fine filtering feature; and performing feature fusion on the first fine filtering feature and the second fine filtering feature to obtain a fusion feature.
Further, in some embodiments of this embodiment, when the fusion module executes the function of acquiring the third mask corresponding to the standard image feature, the fusion module is specifically configured to: acquire the third mask corresponding to the standard image feature through a preset third mask calculation formula, expressed as: m_{e→f} = Sigmoid((W_{e→f} e_i)^T W_{e→f} e_j). In addition, when the fusion module executes the function of acquiring the fourth mask corresponding to the event image feature, the fusion module is specifically configured to: acquire the fourth mask corresponding to the event image feature through a preset fourth mask calculation formula, expressed as: m_{f→e} = Sigmoid((W_{f→e} e_i)^T W_{f→e} e_j).
In some embodiments of the present embodiment, the posture estimation branching network includes a long-short term memory unit and a full connection layer. Correspondingly, the estimation module is specifically configured to: inputting the fusion characteristics into a long-term and short-term memory unit of the attitude estimation branch network to carry out modeling in time sequence to obtain estimation data; and inputting the estimated data into the full connection layer for processing, and outputting system attitude change data.
It should be noted that, the attitude estimation methods in the first and second embodiments can be implemented based on the attitude estimation device provided in this embodiment, and it can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the attitude estimation device described in this embodiment may refer to the corresponding process in the foregoing method embodiment, and details are not described here.
According to the posture estimation device provided by the present embodiment, an event image is acquired based on event data, and a standard image is acquired based on video data; combining the event images adjacent to the time sequence and the standard images adjacent to the time sequence in channel dimensions respectively, and then inputting the combined images into a feature extraction branch network respectively to extract the features of the event images and the features of the standard images; inputting the event image characteristics and the standard image characteristics into a characteristic fusion branch network for characteristic fusion to obtain fusion characteristics; and inputting the fusion characteristics into a posture estimation branch network for posture regression to obtain system posture change data. Through the implementation of the scheme, the attitude change of the system in the motion process is jointly estimated by combining the data collected by the event camera and the standard camera, when the system is in a high-speed and high-dynamic-range scene, more motion information can be provided for estimating the attitude change of the system, and the accuracy and the stability of attitude estimation are effectively improved.
Fig. 6 is an attitude estimation system according to a fourth embodiment of the present application. The attitude estimation system can be used for realizing the attitude estimation method in the embodiment, and mainly comprises the following steps:
a memory 601, a processor 602, an event camera 603, a standard camera 604 and a computer program 605 stored on the memory 601 and executable on the processor 602, the memory 601 and the processor 602 being communicatively connected. The processor 602, when executing the computer program 605, implements the method of one or both of the previous embodiments. Wherein the number of processors may be one or more.
The Memory 601 may be a high-speed Random Access Memory (RAM) Memory, or a non-volatile Memory (non-volatile Memory), such as a disk Memory. The memory 601 is used for storing executable program code, and the processor 602 is coupled with the memory 601.
Further, an embodiment of the present application also provides a computer-readable storage medium, where the computer-readable storage medium may be disposed in the above-mentioned attitude estimation system, and the computer-readable storage medium may be the memory in the foregoing embodiment shown in fig. 6.
The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the attitude estimation method in the foregoing embodiments. Further, the computer-readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a RAM, a magnetic disk, or an optical disk.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules is merely a division of logical functions, and an actual implementation may have another division, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a readable storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present application. And the aforementioned readable storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In view of the above description of the method, device, system and readable storage medium for posture estimation provided by the present application, those skilled in the art will recognize that there may be variations in the embodiments and applications of the method, device and system according to the teachings of the present application.

Claims (11)

1. A pose estimation method applied to a pose estimation system configured with an event camera for capturing event data and a standard camera for capturing video data, the pose estimation method comprising:
acquiring an event image based on the event data and acquiring a standard image based on the video data;
combining the event images adjacent to the time sequence and the standard images adjacent to the time sequence in channel dimensions respectively, and then inputting the combined images into a feature extraction branch network respectively to extract event image features and standard image features;
inputting the event image features and the standard image features into a feature fusion branch network for feature fusion to obtain fusion features;
and inputting the fusion characteristics into a posture estimation branch network for posture regression to obtain system posture change data.
2. The pose estimation method of claim 1, wherein the step of obtaining an event image based on the event data comprises:
dividing the event data into a plurality of voxels according to events; wherein the event data is expressed as e = {x_i, y_i, t_i, p_i}, i ∈ 0, 1, …, n-1, with (x_i, y_i) representing the pixel coordinate position, t_i representing a time stamp, and p_i indicating the polarity of the event;
converting each voxel into an event image based on a preset conversion formula; the conversion formula is given in the original filing as equation images FDA0003641384670000011 and FDA0003641384670000012, where V(x, y, t) represents the voxel and B represents the number of voxels.
3. The pose estimation method according to claim 1, wherein the step of merging the time-series adjacent event images and the time-series adjacent standard images respectively in a channel dimension further comprises:
calculating a mean value and a variance value for all the event images and the standard images respectively, and performing normalization processing and standardization processing on all the event images and the standard images;
and replicating each event image and each standard image three times, placing the copies in the three color channels, and representing each image by a tensor of preset dimensions.
4. The pose estimation method according to claim 1, wherein the feature fusion branch network comprises a self-attention network comprising a first self-attention network and a second self-attention network;
the step of inputting the event image features and the standard image features into a feature fusion branch network for feature fusion to obtain fusion features includes:
inputting the event image features into the first self-attention network of the feature fusion branch network, acquiring a first mask corresponding to the event image features, and inputting the standard image features into the second self-attention network, acquiring a second mask corresponding to the standard image features;
multiplying the event image feature by the first mask through the first self-attention network to obtain a first coarse filtering feature, and multiplying the standard image feature by the second mask through the second self-attention network to obtain a second coarse filtering feature;
and performing feature fusion based on the first coarse filtering feature and the second coarse filtering feature to obtain a fusion feature.
5. The pose estimation method according to claim 4, wherein the step of obtaining a first mask corresponding to the event image feature comprises:
acquiring a first mask corresponding to the event image feature through a preset first mask calculation formula; the first mask calculation formula is given in the original filing as equation image FDA0003641384670000021,
wherein e_i and e_j each represent the feature vector of one channel of the event image feature, W_i and W_j respectively represent learnable weight matrices that project e_i and e_j into the embedding space, and g() represents a function that projects the feature vector into a new embedding space;
the step of obtaining a second mask corresponding to the standard image feature includes:
acquiring a second mask corresponding to the standard image feature through a preset second mask calculation formula; the second mask calculation formula is given in the original filing as equation image FDA0003641384670000022,
wherein f_i and f_j each represent the feature vector of one channel of the standard image feature.
6. The pose estimation method of claim 4, wherein the feature fusion branch network further comprises a cross attention network;
the step of performing feature fusion based on the first coarse filtering feature and the second coarse filtering feature to obtain a fused feature includes:
inputting the first coarse filtering feature into the cross attention network to obtain a third mask corresponding to the standard image feature, and inputting the second coarse filtering feature into the cross attention network to obtain a fourth mask corresponding to the event image feature;
multiplying the first coarse filter characteristic by the fourth mask through the cross attention network to obtain a first fine filter characteristic, and multiplying the second coarse filter characteristic by the third mask to obtain a second fine filter characteristic;
and performing feature fusion on the first fine filtering feature and the second fine filtering feature to obtain a fusion feature.
7. The pose estimation method according to claim 6, wherein the step of obtaining a third mask corresponding to the standard image feature comprises:
acquiring a third mask corresponding to the standard image feature through a preset third mask calculation formula; the third mask calculation formula is expressed as: m_{e→f} = Sigmoid((W_{e→f} e_i)^T W_{e→f} e_j);
The step of obtaining a fourth mask corresponding to the event image feature includes:
acquiring a fourth mask corresponding to the event image feature through a preset fourth mask calculation formula; the fourth mask calculation formula is expressed as: m_{f→e} = Sigmoid((W_{f→e} e_i)^T W_{f→e} e_j).
8. The attitude estimation method according to any one of claims 1 to 7, characterized in that the attitude estimation branch network comprises a long short-term memory unit and a fully connected layer;
the step of inputting the fusion features into the attitude estimation branch network for attitude regression to obtain system attitude change data comprises the following steps:
inputting the fusion features into the long short-term memory unit of the attitude estimation branch network for temporal modeling to obtain estimation data;
and inputting the estimation data into the fully connected layer for processing, and outputting the system attitude change data.
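Editorial note, not part of the claims: a minimal sketch of the claim-8 regression head, an LSTM over the fused features followed by a fully connected layer. The spatial pooling of the fusion features into a vector per time step, the hidden size, and the 6-DoF output dimension are assumptions.
```python
import torch
import torch.nn as nn

class PoseRegressionHead(nn.Module):
    """Long short-term memory unit + fully connected layer (claim 8).

    pose_dim = 6 (translation + rotation) and hidden_dim are assumptions.
    """
    def __init__(self, feature_dim: int, hidden_dim: int = 256, pose_dim: int = 6):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, pose_dim)

    def forward(self, fused_seq: torch.Tensor) -> torch.Tensor:
        # fused_seq: (batch, time, feature_dim), e.g. spatially pooled fusion features
        estimates, _ = self.lstm(fused_seq)   # temporal modeling -> "estimation data"
        return self.fc(estimates)             # system attitude change per time step
```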
9. An attitude estimation device applied to an attitude estimation system configured with an event camera for capturing event data and a standard camera for capturing video data, the attitude estimation device comprising:
an acquisition module configured to obtain an event image based on the event data and obtain a standard image based on the video data;
an extraction module configured to merge temporally adjacent event images and temporally adjacent standard images in the channel dimension respectively, input the merged event images and merged standard images into a feature extraction branch network respectively, and extract event image features and standard image features;
a fusion module configured to input the event image features and the standard image features into a feature fusion branch network for feature fusion to obtain fusion features;
and an estimation module configured to input the fusion features into an attitude estimation branch network for attitude regression to obtain system attitude change data.
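Editorial note, not part of the claims: a minimal sketch of the channel-dimension merging performed by the extraction module, assuming that "temporally adjacent" means consecutive image pairs; the channel counts and image size in the example are assumptions.
```python
import torch

def merge_adjacent_in_channels(prev_img: torch.Tensor, curr_img: torch.Tensor) -> torch.Tensor:
    """Concatenate two temporally adjacent images along the channel dimension.

    prev_img, curr_img: (batch, C, H, W). The result (batch, 2C, H, W) is what the
    extraction module would feed into its feature extraction branch network.
    """
    return torch.cat([prev_img, curr_img], dim=1)

# Example: a pair of 3-channel standard frames and a pair of event images
# accumulated into 2-channel polarity frames (channel counts are assumptions).
frames = merge_adjacent_in_channels(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
events = merge_adjacent_in_channels(torch.rand(1, 2, 224, 224), torch.rand(1, 2, 224, 224))
```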
10. An attitude estimation system, comprising an event camera, a standard camera, a memory, and a processor, wherein:
the event camera is used for collecting event data, and the standard camera is used for collecting video data;
the processor is configured to execute a computer program stored on the memory;
the processor, when executing the computer program, performs the steps of the method of any one of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202210515831.7A 2022-05-12 2022-05-12 Attitude estimation method, device and system and readable storage medium Pending CN114972412A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210515831.7A CN114972412A (en) 2022-05-12 2022-05-12 Attitude estimation method, device and system and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210515831.7A CN114972412A (en) 2022-05-12 2022-05-12 Attitude estimation method, device and system and readable storage medium

Publications (1)

Publication Number Publication Date
CN114972412A true CN114972412A (en) 2022-08-30

Family

ID=82980524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210515831.7A Pending CN114972412A (en) 2022-05-12 2022-05-12 Attitude estimation method, device and system and readable storage medium

Country Status (1)

Country Link
CN (1) CN114972412A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3451288A1 (en) * 2017-09-04 2019-03-06 Universität Zürich Visual-inertial odometry with an event camera
US20210035325A1 (en) * 2019-07-31 2021-02-04 Samsung Electronics Co., Ltd. Pose estimation method, pose estimation apparatus, and training method for pose estimation
CN112836652A (en) * 2021-02-05 2021-05-25 浙江工业大学 Multi-stage human body posture estimation method based on event camera

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
路昊; 石敏; 李昊; 朱登明: "Camera pose estimation method for dynamic scenes based on deep learning" (基于深度学习的动态场景相机姿态估计方法), 高技术通讯 (High Technology Letters), no. 01, 15 January 2020 (2020-01-15) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024061269A1 (en) * 2022-09-21 2024-03-28 华为技术有限公司 Three-dimensional reconstruction method and related apparatus
WO2024078032A1 (en) * 2022-10-14 2024-04-18 华为技术有限公司 Signal processing method and apparatus, device, storage medium, and computer program
CN116708655A (en) * 2022-10-20 2023-09-05 荣耀终端有限公司 Screen control method based on event camera and electronic equipment
CN116708655B (en) * 2022-10-20 2024-05-03 荣耀终端有限公司 Screen control method based on event camera and electronic equipment
CN116561649A (en) * 2023-07-10 2023-08-08 吉林大学 Diver motion state identification method and system based on multi-source sensor data
CN116561649B (en) * 2023-07-10 2023-09-12 吉林大学 Diver motion state identification method and system based on multi-source sensor data

Similar Documents

Publication Publication Date Title
CN114972412A (en) Attitude estimation method, device and system and readable storage medium
US11100401B2 (en) Predicting depth from image data using a statistical model
US11064178B2 (en) Deep virtual stereo odometry
CN110108258B (en) Monocular vision odometer positioning method
Petrovai et al. Exploiting pseudo labels in a self-supervised learning framework for improved monocular depth estimation
Park et al. High-precision depth estimation with the 3d lidar and stereo fusion
Corke et al. Omnidirectional visual odometry for a planetary rover
US20170330375A1 (en) Data Processing Method and Apparatus
US20060083440A1 (en) System and method
JP2021518622A (en) Self-location estimation, mapping, and network training
WO2013173465A1 (en) Imaging device capable of producing three dimensional representations and methods of use
CN105141807A (en) Video signal image processing method and device
CN112183506A (en) Human body posture generation method and system
CN113673400A (en) Real scene three-dimensional semantic reconstruction method and device based on deep learning and storage medium
CN110462685A (en) Method for reconstructing three-dimensional model and system
Cai et al. X-distill: Improving self-supervised monocular depth via cross-task distillation
CN114581571A (en) Monocular human body reconstruction method and device based on IMU and forward deformation field
CN115035551A (en) Three-dimensional human body posture estimation method, device, equipment and storage medium
CN115564888A (en) Visible light multi-view image three-dimensional reconstruction method based on deep learning
CN114812558A (en) Monocular vision unmanned aerial vehicle autonomous positioning method combined with laser ranging
Hosseinzadeh et al. Unsupervised learning of camera pose with compositional re-estimation
Al Ismaeil et al. Real-time enhancement of dynamic depth videos with non-rigid deformations
CN114757984A (en) Scene depth estimation method and device of light field camera
CN114119678A (en) Optical flow estimation method, computer program product, storage medium, and electronic device
Sakai et al. Separating background and foreground optical flow fields by low-rank and sparse regularization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination