CN114972492A - Position and pose determination method and device based on aerial view and computer storage medium - Google Patents

Position and pose determination method and device based on aerial view and computer storage medium

Info

Publication number
CN114972492A
CN114972492A
Authority
CN
China
Prior art keywords
bird's-eye view
pose
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110213229.3A
Other languages
Chinese (zh)
Inventor
陈琛
王云
安利峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Microelectronics of CAS
Original Assignee
Institute of Microelectronics of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Microelectronics of CAS filed Critical Institute of Microelectronics of CAS
Priority to CN202110213229.3A priority Critical patent/CN114972492A/en
Publication of CN114972492A publication Critical patent/CN114972492A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features

Abstract

The invention discloses a pose determination method and device based on a bird's-eye view, and a computer storage medium, belonging to the technical field of visual positioning. It aims to solve the technical problems that existing methods require a large amount of annotation information, such as semantic segmentation labels, and are too costly in large-scale scenes. The method comprises the following steps: acquiring a bird's-eye view and an original image of the same scene; masking the moving objects in the bird's-eye view to obtain a masked bird's-eye view; processing the masked bird's-eye view with a deep learning network to obtain a first feature; processing the original image with a depth pose estimation network to obtain a second feature; and fusing the first feature and the second feature to obtain a scene pose.

Description

Position and pose determination method and device based on aerial view and computer storage medium
Technical Field
The invention relates to the technical field of visual positioning, in particular to a position and pose determination method and device based on an aerial view and a computer storage medium.
Background
As visual positioning technology enters the stage of robust perception, researchers have devoted themselves to methods that perform high-level scene perception and understanding in order to increase the robustness of visual positioning tasks.
Most existing methods mainly use semantic understanding to link semantic concepts (such as object classification and material composition) with the geometric structure of the environment. However, such methods usually require a large amount of annotation information, such as semantic segmentation labels, and their cost is too high in large-scale scenes.
Disclosure of Invention
In view of the above, the invention aims to provide a pose determination method and device based on a bird's-eye view, and a computer storage medium, so as to solve the technical problems that existing methods require a large amount of annotation information, such as semantic segmentation labels, and are too costly in large-scale scenes.
In a first aspect, the invention provides a pose determination method based on a bird's-eye view, comprising the following steps: acquiring a bird's-eye view and an original image of the same scene; masking the moving objects in the bird's-eye view to obtain a masked bird's-eye view; processing the masked bird's-eye view with a deep learning network to obtain a first feature; processing the original image with a depth pose estimation network to obtain a second feature; and fusing the first feature and the second feature to obtain a scene pose.
Compared with the prior art, the pose determination method based on the bird's-eye view provided by the invention first acquires a bird's-eye view and an original image of the same scene, and then masks the moving objects in the bird's-eye view to obtain a masked bird's-eye view. The masked bird's-eye view is processed with a deep learning network to obtain a first feature, and the original image is processed with a depth pose estimation network to obtain a second feature. Finally, the first feature and the second feature are fused to obtain the scene pose. In practice, because the original image is captured by a camera, the apparent size and the apparent degree of motion of an object vary with its depth from the camera (which can also be understood as a difference in viewing angle), and this viewing-angle dependence introduces a certain error into the final result. Because the bird's-eye view is a perspective drawing of the ground seen from a high viewpoint according to the perspective principle, it removes the adverse effect of viewing-angle changes on extracting the positional relationships between actual target entities and improves the understanding of the real scene. On this basis, robust features can be established for scene pose determination, and the different states of the dynamic foreground and the static background can be distinguished to a certain extent; this understanding enhances robustness and resists the ambiguity caused by dynamic and static objects. Therefore, the scene pose obtained by fusing the first feature and the second feature is more robust, and it is obtained on the basis of feature extraction. The method thus solves the technical problems that existing methods require a large amount of annotation information, such as semantic segmentation labels, and are too costly in large-scale scenes.
In a second aspect, the invention further provides a pose determination device based on a bird's-eye view, comprising a processor and a communication interface coupled with the processor; the processor is configured to run a computer program or instructions to implement any one of the bird's-eye-view-based pose determination methods described above.
In a third aspect, the present invention further provides a computer storage medium in which instructions are stored; when the instructions are executed, the bird's-eye-view-based pose determination method described above is implemented.
Compared with the prior art, the beneficial effects of the second and third aspects of the present invention are the same as those of the bird's-eye-view-based pose determination method in the above technical solution, and are not repeated here.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
fig. 1 is a flowchart illustrating steps of a pose determination method based on an aerial view according to an embodiment of the present invention;
fig. 2 is a 3D radar point cloud diagram provided in the embodiment of the present invention;
FIG. 3 is an RGB raw image according to an embodiment of the present invention;
FIG. 4 is a bird's eye view with a target detection result according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating the positions of various entities in an aerial view according to an embodiment of the present invention;
fig. 6 is a block diagram of a pose determination device based on a bird's-eye view according to an embodiment of the present invention;
fig. 7 is a hardware configuration diagram of a pose determination device based on a bird's-eye view according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a chip according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Various schematic diagrams of embodiments of the invention are shown in the drawings, which are not drawn to scale. Wherein certain details are exaggerated and possibly omitted for clarity of understanding. The shapes of various regions, layers, and relative sizes and positional relationships therebetween shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, as actually required.
In the following, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or to implicitly indicate the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.
In the present invention, unless expressly stated or limited otherwise, the term "coupled" is to be interpreted broadly; for example, "coupled" may mean fixedly coupled, detachably coupled, or integrally formed, and may mean directly connected or indirectly connected through an intermediary.
In the related art, as visual positioning technology enters the stage of robust perception, researchers have devoted themselves to methods that perform high-level scene perception and understanding in order to increase the robustness of visual positioning tasks.
Most existing methods mainly use semantic understanding to link semantic concepts (such as object classification and material composition) with the geometric structure of the environment. However, such methods usually require a large amount of annotation information, such as semantic segmentation labels, and their cost is too high in large-scale scenes.
In view of this situation, the embodiment of the invention discloses a pose determination method based on a bird's-eye view, which is used to solve the technical problems that existing methods require a large amount of annotation information, such as semantic segmentation labels, and are too costly in large-scale scenes.
The bird's-eye-view-based pose determination method is applied to a bird's-eye-view-based pose determination device. Referring to fig. 1, the method includes the following steps:
s101, acquiring a bird' S-eye view and an original image of the same scene.
In practice, the bird's-eye view can be obtained in various ways: it may be obtained by projecting radar point cloud data, or it may be obtained from an electronic map. Fig. 2 shows a bird's-eye view derived from 3D radar point cloud data.
In practice, when the bird's eye view is a coordinate-converted point cloud image, the size parameters of the bird's eye view should be the same as those of the original image.
When the data type of the aerial view is point cloud data, acquiring the aerial view comprises the following steps:
first, original point cloud data is obtained. Wherein the raw point cloud data may be obtained from the radar point cloud data.
Then, the data in the first quadrant of the coordinate system of the original point cloud data is extracted as new point cloud data. The X axis of the new point cloud data is taken as the Y axis of the initial bird's-eye view, and the Y axis of the new point cloud data is taken as the X axis of the initial bird's-eye view.
Finally, the initial bird's-eye view is processed according to the size information and the image content of the original image to obtain the bird's-eye view. The initial bird's-eye view is processed so that the size information of the resulting bird's-eye view is identical to that of the original image and so that its image content represents the same scene as the original image.
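For illustration only, the following is a minimal sketch of the above construction under stated assumptions: the function name, the use of point height as the pixel value, and the grid resolution are choices of this sketch and are not specified by the invention.

```python
import numpy as np
import cv2  # any image library would do for the final resize


def point_cloud_to_bev(points, image_size, resolution=0.1):
    """Rasterise 3D radar points (N x 3 array of x, y, z) into a bird's-eye view
    whose size matches the original image. `resolution` (metres per cell) is an
    illustrative assumption."""
    img_h, img_w = image_size

    # Keep only the first-quadrant points (x >= 0 and y >= 0), as in step S101.
    pts = points[(points[:, 0] >= 0) & (points[:, 1] >= 0)]

    # Axis swap: point-cloud X becomes the BEV Y axis (rows),
    # point-cloud Y becomes the BEV X axis (columns).
    rows = (pts[:, 0] / resolution).astype(np.int64)
    cols = (pts[:, 1] / resolution).astype(np.int64)

    # Keep the maximum height per cell as the pixel value of the initial BEV.
    bev = np.zeros((rows.max() + 1, cols.max() + 1), dtype=np.float32)
    np.maximum.at(bev, (rows, cols), pts[:, 2].astype(np.float32))

    # Resize so the size parameters match those of the original image.
    return cv2.resize(bev, (img_w, img_h))
```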
The original image can also be acquired in various ways, for example with a camera. Specifically, a camera may be used to capture an image of the three-dimensional scene to obtain an original image of that scene. Illustratively, fig. 3 shows an original image, which is an RGB image.
It should be noted that, in the embodiment of the present invention, the scene pose is determined from the combination of the bird's-eye view and the original image, so the bird's-eye view and the original image must depict the same scene.
S102, masking the moving objects in the bird's-eye view to obtain a masked bird's-eye view.
In practice, before the moving objects in the bird's-eye view are masked, they must be detected. The bird's-eye view can be processed with a target detection network to obtain a bird's-eye view with target detection results.
Specifically, target detection means detecting targets in an image and marking their positions in the image. The embodiment of the invention can adopt any existing target detection network. For example, the main idea of the R-CNN family is to generate a series of sparse candidate boxes with a heuristic method (selective search) or a CNN (an RPN), and then to classify and regress those candidate boxes. As another example, the main idea of the YOLO and SSD networks is to sample densely and uniformly at different positions of the image, possibly with different scales and aspect ratios, extract features with a CNN, and then perform classification and regression directly; the whole process needs only one stage, so these networks have the advantage of high speed.
The bird's-eye view is processed with the target detection network to obtain a bird's-eye view with target detection results. Fig. 4 shows such a bird's-eye view. Any well-performing target detection framework can produce the result shown in fig. 4, and the higher the detection accuracy, the better the performance of the final model.
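The following sketch illustrates this detection step. The embodiment does not prescribe a particular detector; torchvision's Faster R-CNN is used here only as an off-the-shelf stand-in, and the score threshold is an illustrative assumption.

```python
import torch
import torchvision

# Sketch only: any detector (R-CNN, YOLO, SSD, ...) is acceptable per the text above.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()


def detect_targets(bev_image, score_threshold=0.5):
    """bev_image: float tensor of shape (3, H, W) with values in [0, 1].
    Returns the boxes (x1, y1, x2, y2) and labels kept above the threshold."""
    with torch.no_grad():
        prediction = detector([bev_image])[0]
    keep = prediction["scores"] > score_threshold
    return prediction["boxes"][keep], prediction["labels"][keep]
```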
Fig. 5(a) shows the positions of the entities in the n-th frame bird's-eye view, and fig. 5(b) shows the positions of the entities in the (n+i)-th frame bird's-eye view, where i = 1, 2, 3. The entity drawn with a dashed box in fig. 5(b) is the moving object, which has moved between the n-th frame and the (n+i)-th frame.
In order to emphasize the importance of each entity position in the bird's-eye view and to reduce the noise caused by factors such as the dispersion of entity shapes, the embodiment of the invention masks the targets in the bird's-eye view that carries the target detection results. A mask is a template used for image filtering. For example, when extracting roads, rivers, or houses, the image is filtered pixel by pixel with an n × n matrix so that the desired feature or marker is highlighted; this matrix is the mask.
Image masks are mainly used for: (1) extracting a region of interest, in which a pre-made region-of-interest mask is multiplied with the image to be processed so that pixel values inside the region remain unchanged while values outside the region become 0; (2) shielding, in which certain regions of the image are masked so that they are excluded from processing or from parameter computation, or so that processing and statistics are applied only to the masked regions; (3) extracting structural features, in which structural features similar to the mask are detected and extracted from the image using similarity measures or image-matching methods; (4) producing images of special shapes. The image mask in the embodiment of the present invention mainly serves the shielding function: it weakens the weight of moving (dynamic) objects and increases the weight of static objects.
As a specific embodiment, masking the moving objects in the bird's-eye view with the target detection results includes: resetting the pixel values of the moving-object regions in the bird's-eye view using a mask matrix (also called a kernel), as sketched below. This allows the weights of the regions in the bird's-eye view to be redistributed later, reducing the weight of the motion regions.
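A minimal sketch of this masking step follows; the `weight` parameter and the box format (x1, y1, x2, y2) are assumptions of the sketch rather than requirements of the embodiment.

```python
import numpy as np


def mask_moving_objects(bev, boxes, weight=0.0):
    """Reset the pixel values of the detected moving-object regions in the
    bird's-eye view. `weight` is an illustrative knob: 0.0 removes the regions,
    values in (0, 1) merely down-weight them."""
    masked = bev.copy()
    for x1, y1, x2, y2 in np.asarray(boxes, dtype=int):
        masked[y1:y2, x1:x2] = masked[y1:y2, x1:x2] * weight
    return masked
```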
S103, processing the masked bird's-eye view with the deep learning network to obtain a first feature.
Deep learning is a branch of machine learning, and machine learning is a necessary path to realizing artificial intelligence. The concept of deep learning originates from research on artificial neural networks: a multi-layer perceptron containing several hidden layers is a deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, so as to discover distributed feature representations of the data. The motivation for studying deep learning is to build neural networks that simulate the human brain for analysis and learning, mimicking the mechanisms by which the human brain interprets data such as images, sounds, and text.
A deep learning network is a network built on deep learning. The deep learning network in the embodiment of the present invention may be any supervised or unsupervised deep learning network. For example, the supervised learning network may be PoseNet, DeepVO, or the like, and the unsupervised learning network may be SfMLearner or the like.
In the embodiment of the invention, the masked bird's-eye view is input into the deep learning network, and the network adjusts the weight of each target region of the masked bird's-eye view so that the weight of dynamic targets is reduced or even removed. This reduces, to a certain extent, the influence of dynamic objects on pose determination, improves the robustness of scene pose determination, and avoids noise such as edge sharpening that would be caused by directly removing dynamic objects. The features output by the deep learning network are therefore features that are not affected by viewing-angle changes, i.e., static features.
In practice, the embodiment of the invention can use the deep learning network to extract the relationship features between the targets in bird's-eye views with target detection results across different scenes. These relationship features reflect the states of the targets in the different scenes and thus enhance scene understanding.
In the present application, the relationship features between different target entities in the picture are established: foreground objects such as cars and people extracted by the target detection framework are marked on the bird's-eye view, the bird's-eye view sequence with target detection results is fed into the learning model, and a classical deep learning network extracts features that are not affected by viewing-angle changes. These features reflect the positional relationships between entities in different scenes and enhance the understanding of the overall state of the 3D scene. At the same time, the change of these positional relationships as the scene changes reflects the motion state of each entity, which improves scene understanding, allows the features of dynamic entities with large changes in motion state to be removed to a certain extent, and improves the robustness of the method in dynamic scenes.
S104, processing the original image with a depth pose estimation network to obtain a second feature.
At present, pose estimation has very important applications in robotics, automation, machine vision, and other fields. In the field of robotics in particular, accurately and quickly acquiring the six-dimensional pose of an object is essential for a robot to manipulate objects. In industrial production, accurately measuring the pose of a part allows an industrial robot to grasp an object in a specified pose and align it for installation, which is of great significance for improving industrial production efficiency.
Pose estimation methods include the following: traditional pose estimation methods mainly comprise monocular-based, binocular-based, multi-view-based, scanning-lidar-based, and non-scanning-lidar-based methods, while pose estimation methods that have developed rapidly in recent years include SLAM (Simultaneous Localization And Mapping)-based methods and multi-sensor fusion methods. The depth pose estimation network adopted in the embodiment of the invention may be a pose estimation network constructed with any of the above pose estimation methods.
Specifically, processing the original image with the depth pose estimation network to obtain the second feature includes: inputting the original image into the depth pose estimation network and processing it with that network to obtain the second feature. The second feature is the pose feature of the original image. For example, the pose feature may consist of three position parameters and three orientation parameters of the target acquired in a specific coordinate system, which may be a world coordinate system, an object coordinate system, or a camera coordinate system.
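For illustration, the following is a minimal sketch of a convolutional encoder that could serve as the depth pose estimation backbone producing the second feature; the layer sizes and channel counts are assumptions of this sketch and are not taken from the embodiment (networks such as PoseNet or SfMLearner's pose network could be used instead).

```python
import torch
import torch.nn as nn


class PoseFeatureNet(nn.Module):
    """Minimal sketch of a depth pose estimation backbone that extracts the
    'second feature' from the original RGB image."""

    def __init__(self, feat_channels=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, feat_channels, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, rgb):          # rgb: (B, 3, H, W)
        return self.encoder(rgb)     # second feature: (B, feat_channels, H/16, W/16)
```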
S105, fusing the first feature and the second feature to obtain a scene pose.
In the embodiment of the present invention, fusing the first feature and the second feature to obtain the scene pose may include:
s1051, convolving the first feature to obtain a first convolution feature.
Convolving the first feature means performing a mathematical operation between the first feature and the corresponding convolution kernel in order to extract certain specified features. Different convolution kernels extract different specified features and therefore produce different effects. The role of a convolution kernel is feature extraction: a larger kernel size implies a larger receptive field and, of course, more parameters. Since an image has local correlation in the spatial domain, the convolution process can also be regarded as an extraction of this local correlation.
In order to obtain a better combination of information, the convolution adopted by the invention is a 1 × 1 convolution, which can also be understood as a fully connected layer. The 1 × 1 convolution realizes cross-channel interaction and information integration and performs dimension reduction or expansion of the number of channels. For a single-channel feature map and a single kernel, the convolution simply multiplies by one parameter; in general, multi-kernel convolution over multiple channels realizes a linear combination of multiple features.
S1052, convolving the second feature to obtain a second convolution feature.
Convolving the second feature means performing a mathematical operation between the second feature and the corresponding convolution kernel in order to extract certain specified features. Again, different convolution kernels extract different specified features and produce different effects.
Similarly, in order to reduce the number of kernel parameters, the convolution applied to the second feature may also be a 1 × 1 convolution, which realizes cross-channel interaction and information integration and performs dimension reduction or expansion of the number of channels.
S1053, fusing the first convolution feature and the second convolution feature to obtain the scene pose.
In the embodiment of the present invention, the last dimensions of the first convolution feature and the second convolution feature obtained in step S1051 and step S1052 are the same.
The first convolution feature and the second convolution feature are concatenated, and the mean of the concatenated features is computed. Specifically, the mean of the concatenated features over dimensions one and two is taken to obtain the scene pose, as sketched below.
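A minimal sketch of steps S1051 to S1053 follows; it assumes the two feature maps share the same spatial size, interprets "dimensions one and two" as the spatial dimensions, and uses illustrative channel counts and a 6-dimensional pose output.

```python
import torch
import torch.nn as nn


class PoseFusionHead(nn.Module):
    """Sketch of S1051-S1053: 1x1 convolutions, concatenation, spatial mean."""

    def __init__(self, bev_channels, img_channels, fused_channels=128, pose_dim=6):
        super().__init__()
        self.conv_bev = nn.Conv2d(bev_channels, fused_channels, kernel_size=1)  # first feature
        self.conv_img = nn.Conv2d(img_channels, fused_channels, kernel_size=1)  # second feature
        self.fc = nn.Linear(2 * fused_channels, pose_dim)  # adjust to the preset dimension

    def forward(self, bev_feat, img_feat):
        f1 = self.conv_bev(bev_feat)           # (B, C, H, W)
        f2 = self.conv_img(img_feat)           # (B, C, H, W), same spatial size assumed
        fused = torch.cat([f1, f2], dim=1)     # splice along the channel dimension
        fused = fused.mean(dim=(2, 3))         # average over the two spatial dimensions
        return self.fc(fused)                  # scene pose, e.g. 1 x 6
```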
The first convolution feature is extracted from the bird's-eye view, and the second convolution feature is extracted from the original image, so the final scene pose is obtained from a bird's-eye view and an original image of the same scene through deep learning networks. On this basis, the understanding of the 3D real scene is improved, and robustness and accuracy are increased. Specifically, after target extraction and target masking are performed on the bird's-eye view, the features of the bird's-eye view are extracted with the deep learning network; these features can, to a certain extent, model the relationship features between target entities in different scenes. Meanwhile, the bird's-eye view removes the adverse effect of viewing-angle changes on extracting the positional relationships of actual target entities, improves the understanding of the 3D real scene, allows more robust features to be established for pose estimation, and distinguishes, to a certain extent, the different states of the dynamic foreground and the static background; that is, this understanding can enhance the robustness of VO (Visual Odometry) and its resistance to the ambiguity caused by dynamic and static objects.
To elaborate: first, the embodiment of the invention extracts features (the first feature) from the bird's-eye view sequence with target masks and fuses them with the original image features (the second feature), which enhances scene understanding, adds robust features, and reduces, to a certain extent, the influence of dynamic points on the estimation of the system pose. Moreover, the embodiment of the invention adds robust bird's-eye-view features that are not affected by viewing-angle changes and extracts the relationship features between different target entities in the bird's-eye view for deep scene understanding. These relationship features enable the system to understand the motion states of different targets in different scenes, so that, driven by a large amount of data, the weight given to dynamic points is gradually reduced during training; in other words, the influence of dynamic points on the system is reduced and the robustness of the system is further improved, while the noise such as edge sharpening caused by directly removing dynamic objects is avoided. Finally, the embodiment of the invention enhances scene understanding, thereby providing more robust features, reducing the influence of dynamic points, and ultimately improving visual positioning accuracy.
In practice, when the dimension of the scene pose obtained by fusing the first convolution feature and the second convolution feature differs from a preset dimension, the dimension of the scene pose can be adjusted so that it equals the preset dimension.
The preset dimension may be set according to the output dimension of the deep learning network. For example, when the deep learning network is the supervised model PoseNet, the output is a 1 × 6 matrix. When the deep learning network is the unsupervised model SfMLearner, the scene poses of three consecutive images need to be obtained simultaneously: the goal is to reconstruct frame t by re-projection from frames t-1 and t+1 using the scene poses and to minimize the re-projection error, so the dimension of the output is 3 × 6.
In the embodiment of the invention, the dimension of the scene pose can be adjusted in the full connection layer, so that the dimension of the scene pose is the same as the preset dimension.
The fully connected (FC) layers play the role of a "classifier" in the whole convolutional neural network. In practice, a fully connected layer can be implemented with a convolution operation: a fully connected layer whose preceding layer is also fully connected can be converted into a convolution with 1 × 1 kernels, and a fully connected layer whose preceding layer is a convolutional layer can be converted into a global convolution with kernels of size h × w, where h and w are the height and width of the preceding layer's convolution output.
Taking VGG-16 as an example, for an input of 224 × 224 × 3, the last convolutional layer produces an output of 7 × 7 × 512. If the following layer is an FC layer containing 4096 neurons, the fully connected operation can be realized with a global convolution of 4096 kernels, each of size 7 × 7 × 512; the convolution then yields an output of 1 × 1 × 4096, which is exactly the output of the fully connected layer.
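The equivalence described above can be checked numerically with the following sketch; the tensor shapes follow the VGG-16 example, and the weight-sharing step exists only to make the two forms directly comparable.

```python
import torch
import torch.nn as nn

# Output of VGG-16's last convolutional block for one image: 7 x 7 x 512.
features = torch.randn(1, 512, 7, 7)

fc = nn.Linear(512 * 7 * 7, 4096)                   # FC layer with 4096 neurons
global_conv = nn.Conv2d(512, 4096, kernel_size=7)   # one 7x7x512 kernel per neuron

# Share the same weights so the two forms can be compared.
global_conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)
global_conv.bias.data = fc.bias.data

out_fc = fc(features.flatten(1))       # shape (1, 4096)
out_conv = global_conv(features)       # shape (1, 4096, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-5))  # True
```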
Therefore, after passing through the fully connected layer, the dimension of the scene pose can be adjusted to the preset dimension.
In the embodiment of the invention, the fused features (the scene pose) not only enhance scene understanding but also improve the robustness of visual positioning in dynamic environments. In a dynamic scene, the static and dynamic objects extracted from the relationship features change by different degrees, and static objects are more useful for estimating the pose, so the deep learning network's attention to static features is continuously strengthened during training. Therefore, driven by large-scale data and the powerful self-optimization of the deep learning network, as the training data and the number of training iterations increase, the fully connected layer learns to distinguish dynamic features from static features well while gradually reducing the weight assigned to the dynamic features, so that the system maintains good performance in dynamic scenes.
Fig. 6 is a block diagram of a pose determination device based on a bird's-eye view provided by an embodiment of the present invention, for the case where each functional module corresponds to one function. As shown in fig. 6, the bird's-eye-view-based pose determination device 70 includes a communication module 701 and a processing module 702.
A communication module 701, configured to obtain an image captured by a camera.
A processing module 702, configured to support the position and orientation determining apparatus based on the bird's eye view to perform steps 101 to 103 in the foregoing embodiments.
All relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again.
In some possible implementations, the above-mentioned bird's eye view-based pose determination apparatus may further include a storage module 703 for storing program codes and data of the base station.
The processing module may be a processor or a controller, for example a Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module may be a transceiver, a transceiver circuit, a communication interface, or the like. The storage module may be a memory.
When the processing module is a processor, the communication module is a communication interface, and the storage module is a memory, the bird's-eye view-based pose determination apparatus according to the embodiment of the present invention may be the bird's-eye view-based pose determination device shown in fig. 7.
Fig. 7 is a schematic hardware structure diagram of a pose determination device based on an aerial view according to an embodiment of the invention. As shown in fig. 7, the bird's eye view-based pose determination apparatus 80 includes a processor 801 and a communication interface 802.
As shown in fig. 7, the processor may be a general-purpose Central Processing Unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the present invention. There may be one or more communication interfaces. The communication interface may use any transceiver or similar device for communicating with other devices or communication networks.
As shown in fig. 7, the above-described attitude determination device based on the bird's eye view may further include a communication line 803. The communication link may include a path for transmitting information between the aforementioned components.
Optionally, as shown in fig. 7, the pose determination device based on the bird's eye view may further include a memory 804. The memory is used for storing computer-executable instructions for implementing the inventive arrangements and is controlled by the processor for execution. The processor is used for executing the computer execution instructions stored in the memory, thereby realizing the method provided by the embodiment of the invention.
As shown in fig. 7, the memory may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may be separate and coupled to the processor via the communication line, or it may be integrated with the processor.
Optionally, the computer-executable instructions in the embodiment of the present invention may also be referred to as application program codes, which is not specifically limited in this embodiment of the present invention.
In a specific implementation, as one embodiment, the processor 801 may include one or more CPUs, such as CPU0 and CPU1 in fig. 7.
In a specific implementation, as an embodiment, as shown in fig. 7, the attitude determination device based on the bird's eye view may include a plurality of processors, such as the processor 801-1 and the processor 801-2 in fig. 7. Each of these processors may be a single core processor or a multi-core processor.
Fig. 8 is a schematic structural diagram of a chip according to an embodiment of the present invention. As shown in fig. 8, the chip 90 includes one or more than two (including two) processors 801 and a communication interface 802.
Optionally, as shown in fig. 8, the chip also includes a memory 804, which may include read-only memory and random access memory and provides operating instructions and data to the processor. A portion of the memory may also include non-volatile random access memory (NVRAM).
In some embodiments, as shown in FIG. 8, the memory stores elements, execution modules or data structures, or a subset thereof, or an expanded set thereof.
In the embodiment of the present invention, as shown in fig. 8, by calling an operation instruction stored in the memory (the operation instruction may be stored in the operating system), a corresponding operation is performed.
As shown in fig. 8, the processor, which may also be referred to as a Central Processing Unit (CPU), controls the processing operations of any one of the bird's-eye-view-based pose determination devices.
As shown in fig. 8, the memory may include read-only memory and random access memory and provides instructions and data to the processor. A portion of the memory may also include NVRAM. In application, the memory, the communication interface, and the processor are coupled together by a bus system, which may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus. For clarity of illustration, however, the various buses are labeled as the bus system 805 in fig. 8.
As shown in fig. 8, the method disclosed in the above embodiments of the present invention may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a Digital Signal Processor (DSP), an ASIC, an FPGA (field-programmable gate array) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be executed directly by a hardware decoding processor or by a combination of hardware and software modules in a decoding processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
In one possible implementation, as shown in fig. 8, the communication interface is used to obtain images captured by the camera. The processor is configured to execute steps 101 to 103 of the bird's eye view-based pose determination method in the embodiment shown in fig. 1.
In one aspect, a computer-readable storage medium is provided, in which instructions are stored, and when executed, implement the functions performed by the bird's eye-view-based pose determination device in the above embodiments.
In one aspect, a chip is provided, where the chip is applied to a bird's-eye-view-based pose determination device, and the chip includes at least one processor and a communication interface, where the communication interface is coupled with the at least one processor, and the processor is configured to execute instructions to implement the functions performed by the bird's-eye-view-based pose determination device in the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present invention are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, a terminal, a user device, or other programmable apparatus. The computer program or instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer program or instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire or wirelessly. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that integrates one or more available media. The usable medium may be a magnetic medium, such as a floppy disk, a hard disk, a magnetic tape; or optical media such as Digital Video Disks (DVDs); it may also be a semiconductor medium, such as a Solid State Drive (SSD).
While the invention has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A pose determination method based on an aerial view is characterized by comprising the following steps:
acquiring a bird-eye view and an original image of the same scene;
masking the moving object in the aerial view to obtain a masked aerial view;
processing the masked aerial view by using a deep learning network to obtain a first feature;
processing the original image by using a depth pose estimation network to obtain a second feature;
and fusing the first feature and the second feature to obtain a scene pose.
2. The bird's eye view-based pose determination method according to claim 1, wherein the bird's eye view is a coordinate-converted point cloud image, and a size parameter of the bird's eye view is the same as a size parameter of the original image.
3. The bird's eye view-based pose determination method according to claim 1, wherein the data type of the bird's eye view is point cloud data; the acquiring of the aerial view of the same scene comprises:
acquiring original point cloud data;
extracting data in a first quadrant in a two-dimensional coordinate system where the original point cloud data are located to serve as new point cloud data;
taking the X axis of the new point cloud data as the Y axis of the initial aerial view, and taking the Y axis of the new point cloud data as the X axis of the initial aerial view;
and processing the initial aerial view according to the size information and the image content of the original image to obtain the aerial view.
4. The bird's eye view-based pose determination method of claim 1, wherein the first feature is a feature that is not subject to a change in viewing angle.
5. The bird's eye view-based pose determination method of claim 1, wherein the deep learning network is a supervised deep learning network or an unsupervised deep learning network.
6. The bird's-eye-view-based pose determination method according to any one of claims 1-5, wherein the masking the moving object in the bird's-eye view to obtain the masked bird's-eye view comprises:
processing the aerial view by using a target extraction network to obtain an aerial view with a target detection result;
and masking the detection target in the aerial view with the target detection result to obtain a masked aerial view.
7. The bird's eye view-based pose determination method of claim 1, wherein the fusing the first feature and the second feature to obtain the scene pose comprises:
performing convolution on the first feature to obtain a first convolution feature;
performing convolution on the second characteristic to obtain a second convolution characteristic;
and fusing the first convolution characteristic and the second convolution characteristic to obtain a scene pose.
8. The bird's eye view-based pose determination method of claim 7, wherein after the fusing the first convolution feature and the second convolution feature to obtain the scene pose, the bird's eye view-based pose determination method further comprises:
and when the dimension of the scene pose is different from the preset dimension, adjusting the dimension of the scene pose to enable the scene pose to have the preset dimension.
9. A bird's eye view-based pose determination device comprising a processor and a communication interface coupled to the processor; the processor is configured to execute a computer program or instructions to implement the bird's eye view-based pose determination method according to any one of claims 1 to 8.
10. A computer storage medium characterized in that the computer storage medium has stored therein instructions that, when executed, implement the bird's eye view-based pose determination method according to any one of claims 1 to 8.
CN202110213229.3A 2021-02-24 2021-02-24 Position and pose determination method and device based on aerial view and computer storage medium Pending CN114972492A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110213229.3A CN114972492A (en) 2021-02-24 2021-02-24 Position and pose determination method and device based on aerial view and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110213229.3A CN114972492A (en) 2021-02-24 2021-02-24 Position and pose determination method and device based on aerial view and computer storage medium

Publications (1)

Publication Number Publication Date
CN114972492A true CN114972492A (en) 2022-08-30

Family

ID=82973952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110213229.3A Pending CN114972492A (en) 2021-02-24 2021-02-24 Position and pose determination method and device based on aerial view and computer storage medium

Country Status (1)

Country Link
CN (1) CN114972492A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115760886A (en) * 2022-11-15 2023-03-07 中国平安财产保险股份有限公司 Plot partitioning method and device based on aerial view of unmanned aerial vehicle and related equipment
CN115760886B (en) * 2022-11-15 2024-04-05 中国平安财产保险股份有限公司 Land parcel dividing method and device based on unmanned aerial vehicle aerial view and related equipment
CN116452654A (en) * 2023-04-11 2023-07-18 北京辉羲智能科技有限公司 BEV perception-based relative pose estimation method, neural network and training method thereof
CN116452654B (en) * 2023-04-11 2023-11-10 北京辉羲智能科技有限公司 BEV perception-based relative pose estimation method, neural network and training method thereof


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination