WO2023231435A1 - Visual perception method, apparatus, storage medium and electronic device - Google Patents

Info

Publication number
WO2023231435A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
grid
grid point
index
features
Prior art date
Application number
PCT/CN2023/073954
Inventor
陈少宇
程天恒
孟文明
张骞
Original Assignee
北京地平线信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京地平线信息技术有限公司
Publication of WO2023231435A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure relates to the field of visual perception technology, and in particular to a visual perception method, device, storage medium and electronic equipment.
  • Embodiments of the present disclosure provide a visual perception method, device, storage medium and electronic device.
  • a visual perception method including:
  • the second feature map is identified based on the network model corresponding to the preset perception task, and the perception result corresponding to the preset perception task is determined.
  • a visual perception device including:
  • a feature extraction module configured to perform feature extraction on multiple images from different perspectives, collected at the same moment by the multi-camera system of the vehicle for the surrounding environment of the vehicle, to obtain multiple first feature maps
  • a feature corresponding module configured to determine, based on the index features in at least one first feature map (determined by the feature extraction module) that correspond to the multiple grid points included in the bird's-eye view for the same moment, the grid point feature corresponding to each of the multiple grid points.
  • a feature map determination module configured to determine the second feature map corresponding to the bird's-eye view based on the grid point features determined by the feature corresponding module corresponding to each grid point;
  • a perception identification module configured to identify the second feature map determined by the feature map determination module based on a network model corresponding to a preset perception task, and determine a perception result corresponding to the preset perception task.
  • a computer-readable storage medium stores a computer program, and the computer program is used to execute the visual perception method described in any of the above embodiments.
  • an electronic device includes:
  • memory for storing instructions executable by the processor
  • the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the visual perception method described in any of the above embodiments.
  • The grid point features corresponding to each grid point are determined from the index features corresponding to each grid point in the bird's-eye view, so the second feature map corresponding to the bird's-eye view can be determined without combining the internal and external parameter matrices, which overcomes the reliance on the internal and external parameter matrices in the existing technology.
  • Figure 1 is a schematic flowchart of a visual perception method provided by an exemplary embodiment of the present disclosure.
  • Figure 2a is a schematic flowchart of step 102 in the embodiment shown in Figure 1 of the present disclosure.
  • Figure 2b is a schematic diagram of an exemplary grid point corresponding area in the visual perception method provided by an exemplary embodiment of the present disclosure.
  • Figure 3 is a schematic flowchart of step 1022 in the embodiment shown in Figure 2a of the present disclosure.
  • Figure 4 is a schematic flowchart of step 301 in the embodiment shown in Figure 3 of the present disclosure.
  • Figure 5 is a schematic flowchart of step 1023 in the embodiment shown in Figure 2a of the present disclosure.
  • Figure 6 is a schematic structural diagram of a visual perception device provided by an exemplary embodiment of the present disclosure.
  • Figure 7 is a schematic structural diagram of a visual perception device provided by another exemplary embodiment of the present disclosure.
  • Figure 8 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.
  • “A plurality of” may refer to two or more, and “at least one” may refer to one, two, or more than two.
  • Embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which may operate with numerous other general or special purpose computing system environments or configurations.
  • Examples of well-known terminal devices, computing systems, environments and/or configurations suitable for use with terminal devices, computer systems, servers and other electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing technology environments including any of the above systems.
  • Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system executable instructions (such as program modules) being executed by the computer system.
  • program modules may include routines, programs, object programs, components, logic, data structures, etc., that perform specific tasks or implement specific abstract data types.
  • the computer system/server may be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices linked through a communications network.
  • program modules may be located on local or remote computing system storage media including storage devices.
  • A directed attention scheme is usually used to map images obtained by vehicle-mounted cameras to a bird's-eye view, but this method has at least the following problem: it is relatively dependent on the internal and external parameter matrices.
  • The internal and external parameter matrices of the cameras in the multi-camera system are prone to change, which causes errors in the mapping results, so that the perception results of the perception task determined from the bird's-eye view are inaccurate.
  • Figure 1 is a schematic flowchart of a visual perception method provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to electronic devices, as shown in Figure 1, including the following steps:
  • Step 101 Feature extraction is performed on multiple images from different perspectives collected by the multi-camera system of the vehicle at the same time for the surrounding environment of the vehicle to obtain multiple first feature maps.
  • The vehicle here is any carrier that can be fitted with a multi-camera system, for example a road vehicle or an intelligent mobile robot. Because the position of the vehicle may change over time (for example, it changes constantly while the vehicle is driving), this embodiment requires the multiple images to be acquired through the multi-camera system at the same moment, so that the multiple images are images taken from different perspectives while the vehicle is at the same position.
  • The multi-camera system can include multiple cameras. Each camera is preferably a vehicle-mounted camera that meets automotive standards.
  • Each camera corresponds to one image and can be used to collect images of the environment surrounding the vehicle; for example, when the carrier is a road vehicle, the multi-camera system can be multiple surround-view cameras installed on it. After the multiple images are collected, a neural network model (for example, a convolutional neural network) can be used to extract features from each image separately to obtain multiple first feature maps, where each first feature map corresponds to one image.
  • Step 102 Based on the index features corresponding to the multiple grid points included in the bird's-eye view corresponding to the same time in at least one first feature map, determine the grid point features corresponding to each of the multiple grid points.
  • the bird's-eye view in this embodiment is a part of a plane (for example, a rectangular area) in which the z-axis takes a preset value in the coordinate system centered on the vehicle.
  • Each grid point can correspond to an area in the world coordinate system; for example, each grid point corresponds to a 0.5m*0.5m area in the world coordinate system, or each grid point corresponds to a 0.6m*1m area in the world coordinate system.
  • the specific size can be set according to the actual application scenario.
  • The index features can be used to determine the grid point features corresponding to each grid point in the bird's-eye view.
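  • As an illustration of the grid layout described above, the following is a minimal sketch that places grid points on a plane of constant z in a vehicle-centered coordinate system. The function name `make_bev_grid`, the 20m*20m extent and the choice of cell centers as the grid-point coordinates are assumptions made for this example and are not taken from the disclosure.

```python
import numpy as np

def make_bev_grid(x_range=(-10.0, 10.0), y_range=(-10.0, 10.0),
                  cell=(0.5, 0.5), z=0.0):
    """Return an (nx, ny, 3) array of grid-point centers (x, y, z).

    The extent, the cell size and the preset z value are illustrative
    assumptions; the text only says they depend on the application.
    """
    nx = int(round((x_range[1] - x_range[0]) / cell[0]))
    ny = int(round((y_range[1] - y_range[0]) / cell[1]))
    # Center of each cell along x and y.
    xs = x_range[0] + (np.arange(nx) + 0.5) * cell[0]
    ys = y_range[0] + (np.arange(ny) + 0.5) * cell[1]
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    gz = np.full_like(gx, z)  # the plane where z takes the preset value
    return np.stack([gx, gy, gz], axis=-1)

grid = make_bev_grid()  # 40 x 40 grid points of 0.5m*0.5m each
```

  • With a 0.5m*0.5m cell and a 20m*20m extent this gives 40*40 grid points; a 0.6m*1m cell would be obtained by passing `cell=(0.6, 1.0)` together with a suitable extent.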
  • Step 103 Determine the second feature map corresponding to the bird's-eye view based on the grid point features corresponding to each grid point.
  • The grid point feature of each grid point can be represented as a vector. According to the position of each grid point in the bird's-eye view, the grid point features corresponding to the multiple grid points can be spliced to obtain the second feature map corresponding to the bird's-eye view.
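  • The splicing step can be sketched as follows, assuming each grid point feature is a length-C vector and each grid point has a known (row, column) position in the bird's-eye view. The function name `assemble_bev_feature_map` and the (H, W, C) layout are hypothetical choices for this example.

```python
import numpy as np

def assemble_bev_feature_map(grid_features, positions, height, width):
    """Place per-grid-point feature vectors at their bird's-eye-view positions.

    grid_features: (N, C) array, one feature vector per grid point.
    positions:     (N, 2) integer array of (row, col) for each grid point.
    Returns an (height, width, C) array standing in for the second feature map.
    """
    channels = grid_features.shape[1]
    bev = np.zeros((height, width, channels), dtype=grid_features.dtype)
    bev[positions[:, 0], positions[:, 1]] = grid_features
    return bev

features = np.arange(12, dtype=float).reshape(4, 3)
positions = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
second_feature_map = assemble_bev_feature_map(features, positions, 2, 2)
```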
  • Step 104 Identify the second feature map based on the network model corresponding to the preset perception task, and determine the perception result corresponding to the preset perception task.
  • the preset perception task in this embodiment can be any visual perception task, such as a segmentation task, a detection task, a classification task, etc.
  • the operation of the perception task in this step is implemented through a network model corresponding to the visual perception task.
  • The visual perception method provided by the above embodiments of the present disclosure determines the grid point features corresponding to each grid point by determining the index features corresponding to each grid point in the bird's-eye view. There is no need to combine the internal and external parameter matrices to determine the second feature map corresponding to the bird's-eye view, which overcomes the problem of the existing technology's reliance on internal and external parameter matrices.
  • step 102 may include the following steps:
  • Step 1021 Determine multiple grid points included in the bird's-eye view.
  • The size of the bird's-eye view is determined according to the actual application scenario.
  • The bird's-eye view covers the same range of the surrounding environment as the multiple images collected by the multi-camera system.
  • The bird's-eye view is divided into multiple grid points according to the size of the bird's-eye view.
  • Each grid point has the same size. For example, each grid point corresponds to a 0.5m*0.5m area in the world coordinate system, etc.
  • Step 1022 For each of the plurality of grid points, determine the index feature corresponding to that grid point in at least one first feature map, and obtain at least one index feature corresponding to each grid point.
  • each grid point may or may not have a corresponding index area in a first feature map.
  • For example, when a grid point lies on the right side of the vehicle in the bird's-eye view, the first feature map corresponding to the image obtained through the left-view camera may not have an index feature corresponding to that grid point. As another example, as shown in Figure 2b, the grid point in the bird's-eye view on the right corresponds to the side door position of the vehicle captured by the front-view camera and the right-front camera; the side door of the vehicle was not captured by the other cameras in the multi-camera system, so this grid point does not have index features in the other images.
  • Step 1023 Based on at least one index feature corresponding to each grid point, determine the grid point feature corresponding to each grid point in the plurality of grid points.
  • The at least one index feature corresponding to each grid point can be processed through a neural network model (for example, a self-attention model): all the index features corresponding to a grid point are used as input, and the output is the grid point feature of that grid point. These grid point features are combined with each grid point's position in the bird's-eye view to determine the grid point feature at every grid point of the bird's-eye view, thereby obtaining the second feature map of the bird's-eye view. That is, in this embodiment the second feature map corresponding to the bird's-eye view can be obtained without relying on a coordinate system transformation involving the camera's internal and external parameter matrices, which overcomes the problem of dependence on the internal and external parameter matrices in related technologies, so that, once the index relationship has been determined, the corresponding second feature map of the bird's-eye view can be obtained at any time.
  • step 1022 may include the following steps:
  • Step 301 For each first feature map in the at least one first feature map, based on the internal parameter matrix and the external parameter matrix of the camera in the multi-camera system corresponding to the first feature map and the three-dimensional coordinates of the grid point, determine the mapping area of the grid point in the first feature map.
  • The mapping area of the grid point in each first feature map can be determined based on the camera's internal parameter matrix, external parameter matrix and the three-dimensional coordinates of the grid point. For example, the features in the image coordinate system are converted into a vehicle-centered coordinate system (for example, the ego-vehicle coordinate system) through a coordinate system transformation, so that the index feature corresponding to each grid point can be obtained directly when performing visual perception tasks, without combining the camera's internal parameter matrix and external parameter matrix again to perform a coordinate system transformation.
  • Step 302 Based on the position of the at least one mapping area corresponding to the grid point in the at least one first feature map, determine the at least one index feature corresponding to the grid point.
  • The grid point can be mapped to each first feature map in the plurality of first feature maps. When the position corresponding to the mapping area lies within the range of the first feature map, the mapping area is valid, and the feature corresponding to the mapping area is used as the index feature of the grid point in the first feature map. When the position corresponding to the mapping area exceeds the range of the first feature map (for example, a position coordinate is negative), the mapping area is invalid, and no index feature corresponding to the grid point exists in the first feature map corresponding to that mapping area.
  • the method provided in this embodiment can still obtain relatively accurate sensing results.
  • The process of determining the index features of the grid points through the internal and external parameter matrices provided in this embodiment is executed after the positional relationship between the vehicle and the multi-camera system has been determined, and it can be executed before step 101 is performed. After the index features corresponding to the grid points have been determined, the correspondence can be stored. When the visual perception method provided by this embodiment is actually applied, the stored correspondence between the grid points and the index features can be called directly, without recomputing it every time the perception task is performed.
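  • The validity check and the stored correspondence can be sketched together as below. This is only an illustration: the function name `build_index_table`, the dictionary layout and the use of the mapping-area center as the tested position are assumptions, not details given in the disclosure.

```python
import numpy as np

def build_index_table(centers, feat_h, feat_w):
    """Precompute, per grid point, which cameras hold a valid mapping area.

    centers: integer array of shape (num_cams, num_points, 2) with the
             (u, v) image coordinates of every grid point in every
             first feature map (assumed to be computed offline).
    Returns {point_id: [(camera_id, u, v), ...]} (hypothetical layout).
    """
    u, v = centers[..., 0], centers[..., 1]
    # A position outside the feature map (e.g. negative) is invalid.
    valid = (u >= 0) & (u < feat_w) & (v >= 0) & (v < feat_h)
    table = {}
    for p in range(centers.shape[1]):
        table[p] = [(k, int(u[k, p]), int(v[k, p]))
                    for k in range(centers.shape[0]) if valid[k, p]]
    return table

centers = np.array([[[3, 2], [-1, 4]],   # camera 0: point 0, point 1
                    [[9, 9], [5, 1]]])   # camera 1: point 0, point 1
index_table = build_index_table(centers, feat_h=8, feat_w=8)
```

  • At run time only `index_table` needs to be consulted, so the parameter matrices are not touched again after this offline step.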
  • step 301 may include the following steps:
  • Step 3011 Based on the internal parameter matrix and the external parameter matrix, map the three-dimensional coordinates of the grid points to the image coordinate system corresponding to the first feature map to obtain the image coordinates of the grid points in the image coordinate system.
  • the three-dimensional coordinates of the grid points can be mapped to the image coordinate system corresponding to the first feature map based on the following formula (1):
  • k represents the corresponding vehicle camera number.
  • Multiple cameras included in the multi-camera system can be numbered in advance, and the corresponding numbers can be used later to represent the corresponding cameras.
  • For example, the 6 cameras included in the multi-camera system can be numbered clockwise starting from the front as 1 (front), 2 (front right), 3 (rear right), 4 (rear), 5 (rear left), 6 (front left), etc.
  • c_k denotes the image coordinates of the grid point mapped into the first feature map, K_k denotes the internal parameter matrix of the camera numbered k, Rt_k denotes the external parameter matrix of the camera numbered k, and c_3D denotes the three-dimensional coordinates of the grid point.
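  • Formula (1) itself is not reproduced in this text. Using only the symbols defined here, a standard pinhole projection consistent with those definitions would read as follows; the homogeneous scale factor d_k, the 3*4 shape of Rt_k and the homogeneous extension of c_3D are assumptions, and the published formula may be written differently.

```latex
d_k \begin{bmatrix} c_k \\ 1 \end{bmatrix}
  = K_k \, Rt_k \begin{bmatrix} c_{3D} \\ 1 \end{bmatrix},
\qquad K_k \in \mathbb{R}^{3 \times 3},\quad
Rt_k \in \mathbb{R}^{3 \times 4}
```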
  • Step 3012 Determine the mapping area by taking the image coordinates as the center and combining the preset length and preset width.
  • A mapping area is determined based on the preset length and the preset width. For example, when the preset length is Kh and the preset width is Kw, the resulting mapping area is Kh*Kw, where the values of Kh and Kw can be the same or different, and the specific values can be set according to the actual application scenario.
  • Grid points are mapped to a mapping area covering a certain range, rather than just using the image coordinates as the mapping coordinates. In this case, even if inaccuracy or precision problems in the internal and external parameter matrices cause the corresponding mapping area to shift, the target can still be covered, which makes the perception result obtained from the second feature map determined by the mapping area insensitive to the internal and external parameter matrices.
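  • A minimal sketch of determining a Kh*Kw mapping area centered on the image coordinates is given below. The half-open bounds convention, the function names and the clipping of the area to the feature map are assumptions made for the example.

```python
import numpy as np

def mapping_area(center_uv, kh, kw):
    """Return (top, left, bottom, right) of a kh*kw area around (u, v).

    Bounds are half-open, i.e. rows top..bottom-1 and cols left..right-1.
    """
    u, v = center_uv
    top, left = v - kh // 2, u - kw // 2
    return top, left, top + kh, left + kw

def crop_mapping_area(feature_map, center_uv, kh, kw):
    """Crop the mapping area from an (H, W, ...) feature map, clipped to it."""
    top, left, bottom, right = mapping_area(center_uv, kh, kw)
    h, w = feature_map.shape[:2]
    return feature_map[max(top, 0):min(bottom, h), max(left, 0):min(right, w)]

bounds = mapping_area((10, 20), kh=3, kw=5)
patch = crop_mapping_area(np.arange(100).reshape(10, 10), (5, 5), 3, 3)
```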
  • step 3011 may include:
  • Step a1 Based on the internal parameter matrix and the external parameter matrix, map the three-dimensional coordinates of the grid points to the image coordinate system to obtain the precise coordinates of the grid points in the image coordinate system.
  • coordinate mapping is implemented based on the above formula (1) to obtain the precise coordinates of the grid point in the image coordinate system.
  • The precise coordinates may have integer or non-integer values.
  • Step a2 Based on the precise coordinates, take an integer coordinate at a position around the precise coordinates as the image coordinates.
  • When the precise coordinates are integers, the precise coordinates are used directly as the image coordinates, and the mapping area is determined with the image coordinates as the center point. In most cases the precise coordinates are non-integers; in that case an integer coordinate can be determined at any position around the precise coordinates, which implements rounding.
  • In this way the grid points are roughly mapped to the first feature map of each viewing angle using the internal and external parameter matrices of the camera (errors are allowed), and the image coordinates (u, v) in each viewing angle are obtained. The rounding process in this embodiment is equivalent to adding noise to the internal and external parameter matrices; therefore, over-reliance on the accuracy of the internal and external parameter matrices is avoided.
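  • Steps a1 and a2 can be sketched as follows, assuming a standard pinhole model in which a 3*4 external parameter matrix maps vehicle-frame coordinates into the camera frame. The function names, the example matrices and the use of round-to-nearest are assumptions; the text allows any integer position around the precise coordinates.

```python
import numpy as np

def project_to_image(c3d, K, Rt):
    """Step a1 (sketch): project a 3D grid point to precise image coordinates.

    K is a 3x3 internal parameter matrix and Rt a 3x4 external parameter
    matrix (assumed conventions). Returns None if the point lies behind
    the camera.
    """
    homog = np.append(np.asarray(c3d, dtype=float), 1.0)
    cam = Rt @ homog                 # point in the camera frame
    if cam[2] <= 0:
        return None
    uvw = K @ cam
    return uvw[:2] / uvw[2]          # precise, usually non-integer, (u, v)

def to_image_coords(precise_uv):
    """Step a2 (sketch): pick a nearby integer coordinate by rounding."""
    return tuple(int(x) for x in np.rint(precise_uv))

K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]])
Rt = np.hstack([np.eye(3), np.zeros((3, 1))])
precise = project_to_image((0.123, -0.2, 2.0), K, Rt)
image_uv = to_image_coords(precise)
```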
  • step 1023 may include the following steps:
  • Step 501 Execute an expansion operation on each first feature map in at least one first feature map to obtain multiple strip features corresponding to each first feature map.
  • each first feature map can be expanded through the convolution acceleration algorithm (img2col) to obtain unfolded features (unfold features).
  • The unfolded features can be understood as including multiple strip features, where each strip feature (a row of the unfolded features) corresponds to the K*K area (patch) around one feature point on the first feature map.
  • K is an integer greater than 1. The specific value can be set according to the actual scenario.
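  • The unfold operation can be sketched in plain NumPy as below. The zero padding at the border, the restriction to odd K (so that the patch is centered on the feature point) and the row-major ordering of the strips are assumptions made for this example.

```python
import numpy as np

def unfold(feature_map, k):
    """Unfold an (H, W, C) feature map into H*W strip features.

    Row i*W + j of the result is the flattened k*k patch around feature
    point (i, j). Zero padding and odd k are assumptions of this sketch.
    """
    h, w, _ = feature_map.shape
    pad = k // 2
    padded = np.pad(feature_map, ((pad, pad), (pad, pad), (0, 0)))
    rows = []
    for i in range(h):
        for j in range(w):
            rows.append(padded[i:i + k, j:j + k].reshape(-1))
    return np.stack(rows)            # shape (H*W, k*k*C)

strips = unfold(np.arange(9.0).reshape(3, 3, 1), k=3)
```

  • The explicit loops keep the sketch readable; a production implementation would typically use a vectorized im2col routine instead.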
  • Step 502 Based on the multiple strip features corresponding to each first feature map, determine the strip index feature corresponding to each index feature in the at least one index feature, and obtain at least one strip index feature corresponding to each grid point.
  • Each strip feature corresponds to a K*K area around a pixel; combined with the image feature corresponding to each grid point (that is, a feature point on the first feature map), at least one strip index feature corresponding to each grid point can be determined.
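  • Selecting the strip index features of one grid point can be sketched as a simple lookup, assuming the strips of each camera are stored in row-major order (row v*W + u for feature point (u, v)) and that the grid point's valid cameras and image coordinates are already known. The function name and the `(camera_id, u, v)` entry format are hypothetical.

```python
import numpy as np

def gather_strip_index_features(strips_per_camera, entries, feat_w):
    """Collect one strip per (camera_id, u, v) entry of a grid point.

    strips_per_camera: list of (H*W, D) arrays, one per camera.
    entries:           [(camera_id, u, v), ...] for this grid point.
    Returns an (len(entries), D) array of strip index features.
    """
    rows = [strips_per_camera[k][v * feat_w + u] for k, u, v in entries]
    return np.stack(rows)

strips_per_camera = [np.arange(12.0).reshape(6, 2),
                     100 + np.arange(12.0).reshape(6, 2)]
tokens = gather_strip_index_features(
    strips_per_camera, [(0, 1, 0), (1, 2, 1)], feat_w=3)
```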
  • Step 503 Determine the grid point feature corresponding to each grid point based on at least one strip index feature corresponding to each grid point.
  • At least one strip index feature corresponding to each grid point can be input into a self-attention model (for example, Transformer, a deep learning model of the self-attention mechanism) to obtain the grid point feature corresponding to the grid point.
  • In this embodiment, the grid point features are obtained by performing the self-attention operation only on the at least one strip index feature. Compared with related technologies that require a self-attention operation (attention) over all features in all first feature maps, embodiments of the present disclosure greatly reduce the amount of calculation and improve the processing efficiency of preset perception tasks.
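  • As a rough illustration, a single-head self-attention over the strip index features of one grid point might look like the following. The projection matrices `wq`, `wk`, `wv`, the random weights and the mean pooling used to reduce the attended tokens to one vector are all assumptions; the disclosure only states that a self-attention model (for example, Transformer) produces the grid point feature.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grid_point_feature(tokens, wq, wk, wv):
    """Self-attention over the (few) strip index features of one grid point.

    tokens: (T, D) strip index features; wq, wk, wv: (D, C) projections.
    Returns a length-C vector (mean pooling is an assumed reduction).
    """
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attended = softmax(scores) @ v            # (T, C)
    return attended.mean(axis=0)

rng = np.random.default_rng(0)
wq, wk, wv = (rng.normal(size=(9, 4)) for _ in range(3))
tokens = np.tile(rng.normal(size=(1, 9)), (2, 1))   # two identical tokens
feature = grid_point_feature(tokens, wq, wk, wv)
```

  • Because only the T tokens indexed for a grid point take part, the attention matrix is T*T rather than spanning every feature point of every first feature map.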
  • Step 1023 may also include: for each grid point in the plurality of grid points, performing feature extraction on the at least one index feature corresponding to the grid point based on the self-attention model, to obtain the grid point feature corresponding to each grid point.
  • This embodiment omits the step of unfolding the first feature map, and directly inputs the image features of the at least one area corresponding to the grid point in the at least one first feature map into the self-attention model (for example, Transformer, a deep learning model based on the self-attention mechanism) to obtain the grid point feature corresponding to the grid point (for example, a feature vector of 1*1*C').
  • Compared with related technologies that require a self-attention operation over all feature points in all first feature maps, this embodiment greatly reduces the amount of calculation; omitting the unfolding step also saves computation and further improves the processing efficiency of the preset perception task.
  • Any visual perception method provided by the embodiments of the present disclosure can be executed by any appropriate device with data processing capabilities, including but not limited to: terminal devices and servers.
  • any of the visual perception methods provided by the embodiments of the present disclosure can be executed by the processor.
  • the processor executes any of the visual perception methods mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in the memory. No further details will be given below.
  • Figure 6 is a schematic structural diagram of a visual perception device provided by an exemplary embodiment of the present disclosure. As shown in Figure 6, the device provided by this embodiment includes:
  • the feature extraction module 61 is used to extract features from multiple images from different perspectives collected by the multi-camera system of the vehicle at the same time for the surrounding environment of the vehicle to obtain multiple first feature maps.
  • the feature corresponding module 62 is configured to determine, based on the index features in the at least one first feature map (determined by the feature extraction module 61) that correspond to the multiple grid points included in the bird's-eye view for the same moment, the grid point features corresponding to each of the multiple grid points.
  • the feature map determination module 63 is configured to determine the second feature map corresponding to the bird's-eye view based on the grid point features determined by the feature corresponding module 62 corresponding to each grid point.
  • the perception identification module 64 is configured to identify the second feature map determined by the feature map determination module 63 based on the network model corresponding to the preset perception task, and determine the perception result corresponding to the preset perception task.
  • The visual perception device determines the grid point features corresponding to each grid point by determining the index features corresponding to each grid point in the bird's-eye view, and does not need to combine the internal and external parameter matrices to determine the second feature map corresponding to the bird's-eye view, which overcomes the problem of the existing technology's reliance on internal and external parameter matrices.
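  • The way the four modules feed one another can be sketched as a simple pipeline object. The class name, the field names and the use of plain callables are assumptions chosen for illustration; the stand-in lambdas in the usage lines are placeholders rather than real models.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class VisualPerceptionDevice:
    feature_extraction: Callable[[Any], Any]                # module 61
    feature_corresponding: Callable[[Sequence[Any]], Any]   # module 62
    feature_map_determination: Callable[[Any], Any]         # module 63
    perception_identification: Callable[[Any], Any]         # module 64

    def run(self, images: Sequence[Any]) -> Any:
        # One first feature map per image captured at the same moment.
        first_maps = [self.feature_extraction(img) for img in images]
        grid_features = self.feature_corresponding(first_maps)
        second_map = self.feature_map_determination(grid_features)
        return self.perception_identification(second_map)

# Toy stand-ins that only show the data flow between the modules.
device = VisualPerceptionDevice(
    feature_extraction=lambda img: img * 2,
    feature_corresponding=lambda maps: sum(maps),
    feature_map_determination=lambda feats: feats + 1,
    perception_identification=lambda bev: str(bev),
)
result = device.run([1, 2, 3])
```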
  • Figure 7 is a schematic structural diagram of a visual perception device provided by another exemplary embodiment of the present disclosure. As shown in Figure 7, the device provided by this embodiment includes:
  • Feature corresponding module 62 includes:
  • the grid point determination unit 621 is used to determine a plurality of grid points included in the bird's-eye view.
  • the index determination unit 622 is used to determine the index feature corresponding to each of the plurality of grid points in at least one of the first feature maps, and obtain at least one index feature corresponding to each of the grid points.
  • the grid point feature determining unit 623 is configured to determine the grid point features corresponding to each of the plurality of grid points based on at least one of the index features corresponding to each of the grid points.
  • the index determination unit 622 is specifically configured to, for each of the at least one first feature map, determine the mapping area of the grid point in the first feature map based on the internal parameter matrix and the external parameter matrix of the camera in the multi-camera system corresponding to the first feature map and the three-dimensional coordinates of the grid point; and, based on the position of the at least one mapping area corresponding to the grid point in at least one of the first feature maps, determine the at least one index feature corresponding to the grid point.
  • when determining the mapping area of the grid point in the first feature map based on the internal parameter matrix and the external parameter matrix of the camera in the multi-camera system corresponding to the first feature map and the three-dimensional coordinates of the grid point, the index determination unit 622 is used to map the three-dimensional coordinates of the grid point to the image coordinate system corresponding to the first feature map based on the internal parameter matrix and the external parameter matrix to obtain the image coordinates of the grid point in the image coordinate system; and, with the image coordinates as the center, determine the mapping area in combination with the preset length and the preset width.
  • when mapping the three-dimensional coordinates of the grid point to the image coordinate system corresponding to the first feature map based on the internal parameter matrix and the external parameter matrix to obtain the image coordinates of the grid point in the image coordinate system, the index determination unit 622 is used to map the three-dimensional coordinates of the grid point to the image coordinate system based on the internal parameter matrix and the external parameter matrix to obtain the precise coordinates of the grid point in the image coordinate system.
  • the grid point feature determining unit 623 is specifically configured to perform an expansion operation on each of the at least one first feature map to obtain multiple strip features corresponding to each of the first feature maps.
  • the grid point feature determination unit 623 is specifically configured to, for each of the plurality of grid points, perform feature extraction on the at least one index feature corresponding to the grid point based on a self-attention model, to obtain the grid point features corresponding to each of the grid points.
  • the electronic device may be either or both of the first device and the second device, or a stand-alone device independent of them.
  • the stand-alone device may communicate with the first device and the second device to receive the collected input signals from them.
  • Figure 8 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure.
  • electronic device 80 includes one or more processors 81 and memory 82.
  • the processor 81 may be a central processing unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 80 to perform desired functions.
  • Memory 82 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache).
  • the non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 81 may execute the program instructions to implement the visual perception methods of the various embodiments of the present disclosure described above and/or other desired functions.
  • Various contents such as input signals, signal components, noise components, etc. may also be stored in the computer-readable storage medium.
  • the electronic device 80 may also include an input device 83 and an output device 84, and these components are interconnected through a bus system and/or other forms of connection mechanisms (not shown).
  • the input device 83 may be the above-mentioned microphone or microphone array, used to capture the input signal of the sound source.
  • the input device 83 may be a communication network connector for receiving the collected input signals from the first device and the second device.
  • the input device 83 may also include, for example, a keyboard, a mouse, and the like.
  • the output device 84 can output various information to the outside, including determined distance information, direction information, etc.
  • the output device 84 may include, for example, a display, a speaker, a printer, a communication network and remote output devices connected thereto, and the like.
  • the electronic device 80 may also include any other suitable components depending on the specific application.
  • embodiments of the present disclosure may also be a computer program product that includes computer program instructions which, when executed by a processor, cause the processor to perform the steps of the visual perception method according to various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
  • the computer program product may carry program code for performing operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • embodiments of the present disclosure may also be a computer-readable storage medium having computer program instructions stored thereon.
  • the computer program instructions, when executed by a processor, cause the processor to perform the steps of the visual perception method according to various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
  • the computer-readable storage medium may be any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may include, for example, but is not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the methods and apparatus of the present disclosure may be implemented in many ways.
  • the methods and devices of the present disclosure may be implemented through software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above order for the steps of the methods is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated.
  • the present disclosure may also be implemented as programs recorded in recording media, and these programs include machine-readable instructions for implementing methods according to the present disclosure.
  • the present disclosure also covers recording media storing programs for executing methods according to the present disclosure.
  • each component or each step can be decomposed and/or recombined. These decompositions and/or recombinations should be considered equivalent versions of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

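Embodiments of the present disclosure disclose a visual perception method, apparatus, storage medium, and electronic device. The method includes: performing feature extraction separately on a plurality of images of different viewing angles that a multi-camera system of a carrier captures of the carrier's surroundings at the same moment, to obtain a plurality of first feature maps; determining, based on the index features that a plurality of grid points included in a bird's-eye view corresponding to the same moment have in at least one of the first feature maps, the grid point feature corresponding to each of the plurality of grid points; determining, based on the grid point feature corresponding to each grid point, a second feature map corresponding to the bird's-eye view; and recognizing the second feature map based on a network model corresponding to a preset perception task, to determine a perception result corresponding to the preset perception task.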

Description

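Visual perception method, apparatus, storage medium, and electronic device
The present disclosure claims priority to Chinese patent application No. 202210618710.5, filed on June 1, 2022 and entitled "Visual perception method, apparatus, storage medium, and electronic device", the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to the field of visual perception technology, and in particular to a visual perception method, apparatus, storage medium, and electronic device.
Background
In the field of autonomous driving, the perception results that the multi-camera system of a visual perception system produces in each camera's own coordinate system cannot be used directly by the downstream prediction and planning-and-control systems. Features from different viewing angles need to be fused in some way, for example by mapping them into a bird's-eye view (Bird's Eye View, BEV) so that they are expressed uniformly in the ego-vehicle coordinate system. The prior art usually uses a point attention (Point attention) scheme to map the images obtained by the multi-camera system into the bird's-eye view, but that approach depends heavily on the intrinsic and extrinsic parameter matrices.
Summary
The present disclosure is proposed to solve the above technical problem. Embodiments of the present disclosure provide a visual perception method, apparatus, storage medium, and electronic device.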
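According to one aspect of the embodiments of the present disclosure, a visual perception method is provided, including:
performing feature extraction separately on a plurality of images of different viewing angles that a multi-camera system of a carrier captures of the carrier's surroundings at the same moment, to obtain a plurality of first feature maps;
determining, based on the index features that a plurality of grid points included in a bird's-eye view corresponding to the same moment have in at least one of the first feature maps, the grid point feature corresponding to each of the plurality of grid points;
determining, based on the grid point feature corresponding to each grid point, a second feature map corresponding to the bird's-eye view;
recognizing the second feature map based on a network model corresponding to a preset perception task, to determine a perception result corresponding to the preset perception task.
According to another aspect of the embodiments of the present disclosure, a visual perception apparatus is provided, including:
a feature extraction module, configured to perform feature extraction separately on a plurality of images of different viewing angles that a multi-camera system of a carrier captures of the carrier's surroundings at the same moment, to obtain a plurality of first feature maps;
a feature correspondence module, configured to determine, based on the index features that a plurality of grid points included in a bird's-eye view corresponding to the same moment have in at least one first feature map determined by the feature extraction module, the grid point feature corresponding to each of the plurality of grid points;
a feature map determination module, configured to determine, based on the grid point feature that the feature correspondence module determined for each grid point, a second feature map corresponding to the bird's-eye view;
a perception recognition module, configured to recognize, based on a network model corresponding to a preset perception task, the second feature map determined by the feature map determination module, to determine a perception result corresponding to the preset perception task.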
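According to yet another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided. The storage medium stores a computer program, and the computer program is used to execute the visual perception method of any of the above embodiments.
According to still another aspect of the embodiments of the present disclosure, an electronic device is provided, the electronic device including:
a processor;
a memory for storing instructions executable by the processor;
the processor being configured to read the executable instructions from the memory and execute them to implement the visual perception method of any of the above embodiments.
With the visual perception method, apparatus, storage medium, and electronic device provided by the above embodiments of the present disclosure, the grid point feature corresponding to each grid point is determined from the index features corresponding to that grid point in the bird's-eye view. The second feature map corresponding to the bird's-eye view can therefore be determined without combining the intrinsic and extrinsic parameter matrices, which overcomes the prior art's heavy dependence on those matrices.
The technical solutions of the present disclosure are described in further detail below with reference to the accompanying drawings and embodiments.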
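Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from a more detailed description of its embodiments in conjunction with the accompanying drawings. The drawings provide a further understanding of the embodiments of the present disclosure and constitute a part of the specification; together with the embodiments they serve to explain the present disclosure and do not limit it. In the drawings, the same reference numerals generally denote the same components or steps.
Figure 1 is a schematic flowchart of a visual perception method provided by an exemplary embodiment of the present disclosure.
Figure 2a is a schematic flowchart of step 102 in the embodiment shown in Figure 1 of the present disclosure.
Figure 2b is a schematic diagram of an exemplary region corresponding to a grid point in a visual perception method provided by an exemplary embodiment of the present disclosure.
Figure 3 is a schematic flowchart of step 1022 in the embodiment shown in Figure 2a of the present disclosure.
Figure 4 is a schematic flowchart of step 301 in the embodiment shown in Figure 3 of the present disclosure.
Figure 5 is a schematic flowchart of step 1023 in the embodiment shown in Figure 2a of the present disclosure.
Figure 6 is a schematic structural diagram of a visual perception apparatus provided by an exemplary embodiment of the present disclosure.
Figure 7 is a schematic structural diagram of a visual perception apparatus provided by another exemplary embodiment of the present disclosure.
Figure 8 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.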
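Detailed Description
Exemplary embodiments according to the present disclosure are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them, and it should be understood that the present disclosure is not limited by the exemplary embodiments described here.
It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
Those skilled in the art will understand that terms such as "first" and "second" in the embodiments of the present disclosure are used only to distinguish different steps, devices, modules, and so on; they neither carry any specific technical meaning nor indicate a necessary logical order among them.
It should also be understood that, in the embodiments of the present disclosure, "a plurality of" may mean two or more, and "at least one" may mean one, two, or more.
It should also be understood that any component, data item, or structure mentioned in the embodiments of the present disclosure can generally be understood as one or more, unless it is explicitly limited or the context indicates the opposite.
In addition, the term "and/or" in the present disclosure merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" in the present disclosure generally indicates an "or" relationship between the objects before and after it.
It should also be understood that the description of the embodiments in the present disclosure emphasizes the differences between them; for their identical or similar parts, the embodiments may be referred to one another, and for brevity these parts are not repeated.
It should also be understood that, for ease of description, the dimensions of the parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present disclosure or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following drawings, so once an item is defined in one drawing it does not need to be discussed further in subsequent drawings.
Embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with such electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
Electronic devices such as terminal devices, computer systems, and servers may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, and so on, which perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media, including storage devices.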
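Overview of the Application
In the course of implementing the present disclosure, the inventors found that the prior art usually uses a point attention scheme to map images obtained by vehicle-mounted cameras into a bird's-eye view, and that this approach has at least the following problem: it depends heavily on the intrinsic and extrinsic parameter matrices. The intrinsic and extrinsic parameter matrices of the cameras in the multi-camera system can easily change while the vehicle is driving, which makes the mapping result wrong and in turn makes the perception result of a perception task determined from the bird's-eye view inaccurate.
Exemplary Method
Figure 1 is a schematic flowchart of a visual perception method provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device and, as shown in Figure 1, includes the following steps:
Step 101: perform feature extraction separately on a plurality of images of different viewing angles that the multi-camera system of a carrier captures of the carrier's surroundings at the same moment, to obtain a plurality of first feature maps.
Here, the carrier is any device that can carry a multi-camera system, for example a vehicle or an intelligent mobile robot. Because the carrier's position may change over time (a vehicle's position, for example, changes continuously while it is driving), this embodiment restricts the plurality of images to those obtained by the multi-camera system at the same moment, so that the images the carrier collects through the multi-camera system are views from different angles at the same carrier position. The multi-camera system may include multiple cameras, each preferably a vehicle-mounted camera that meets automotive-grade standards; each camera corresponds to one image and can be used to capture an image containing the carrier's surroundings. For example, when the carrier is a vehicle, the multi-camera system may be multiple surround-view cameras mounted on the vehicle. After the images are captured, a neural network model (for example, a convolutional neural network) may be used to perform feature extraction on each image separately, yielding a plurality of first feature maps, each of which corresponds to one image.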
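Step 102: determine, based on the index features that a plurality of grid points included in the bird's-eye view corresponding to the same moment have in at least one first feature map, the grid point feature corresponding to each of the plurality of grid points.
The bird's-eye view in this embodiment is a part (for example, a rectangular region) of a plane in which the z axis takes a preset value, in a coordinate system centered on the carrier. Each grid point may correspond to a rectangular region in the world coordinate system, for example a region of 0.5 m × 0.5 m or a region of 0.6 m × 1 m. The specific size of each grid point can be set according to the actual application scenario: the larger the range covered by the bird's-eye view, the larger the region a grid point may cover, and the length and width of a grid point may be equal or unequal. In this embodiment, operations such as coordinate system conversion and projective transformation may be used in advance to determine the index features, in at least one first feature map, of each of the grid points included in the bird's-eye view. When perception is performed with the current multi-camera system, these index features are simply retrieved to determine the grid point feature corresponding to each grid point in the bird's-eye view.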
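As a rough illustration of the grid described above, the sketch below enumerates the centres of the grid points of a rectangular bird's-eye-view region in a carrier-centred frame. The function name `make_bev_grid`, the example extents, and the choice of z = 0 for the preset plane are assumptions made for illustration; the disclosure does not fix these values.

```python
def make_bev_grid(x_range, y_range, cell_x, cell_y, z=0.0):
    """Return a 2D list of (x, y, z) grid-point centres covering the BEV rectangle.

    x_range, y_range: (min, max) extents in metres in the carrier-centred frame.
    cell_x, cell_y:   grid-point size in metres (may differ, e.g. 0.6 x 1.0).
    z:                preset height of the BEV plane (assumed to be 0 here).
    """
    nx = int(round((x_range[1] - x_range[0]) / cell_x))
    ny = int(round((y_range[1] - y_range[0]) / cell_y))
    grid = []
    for i in range(nx):
        row = []
        for j in range(ny):
            cx = x_range[0] + (i + 0.5) * cell_x  # centre of cell i along x
            cy = y_range[0] + (j + 0.5) * cell_y  # centre of cell j along y
            row.append((cx, cy, z))
        grid.append(row)
    return grid


# A small 4 m x 2 m region split into 0.5 m x 0.5 m grid points.
grid = make_bev_grid((-2.0, 2.0), (-1.0, 1.0), 0.5, 0.5)
```

Each entry is the three-dimensional coordinate that would later be projected into the camera views when the index features are precomputed.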
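Step 103: determine, based on the grid point feature corresponding to each grid point, the second feature map corresponding to the bird's-eye view.
Optionally, the grid point feature of each grid point can be represented as a vector; concatenating the grid point features of the grid points according to each grid point's position in the bird's-eye view yields the second feature map corresponding to the bird's-eye view.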
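The assembly in step 103 can be sketched as placing each feature vector at its grid position. The function name `assemble_bev_feature_map` is hypothetical, and filling grid points that have no feature with zeros is an assumption; the disclosure does not say how such grid points are handled.

```python
def assemble_bev_feature_map(grid_features, height, width):
    """Place per-grid-point feature vectors into an H x W x C nested list.

    grid_features: dict mapping (row, col) -> feature vector (list of floats).
    Grid points without an entry are filled with zeros (an assumed fallback).
    """
    channels = len(next(iter(grid_features.values())))
    bev = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for (r, c), vec in grid_features.items():
        bev[r][c] = list(vec)
    return bev


# Two grid points with 2-channel features in a 2 x 3 bird's-eye view.
bev = assemble_bev_feature_map({(0, 0): [1.0, 2.0], (1, 2): [3.0, 4.0]}, 2, 3)
```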
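Step 104: recognize the second feature map based on the network model corresponding to a preset perception task, to determine the perception result corresponding to the preset perception task.
The preset perception task in this embodiment may be any visual perception task, for example a segmentation task, a detection task, or a classification task; the perception operation of this step is carried out by the network model corresponding to that visual perception task.
With the visual perception method provided by the above embodiment of the present disclosure, the grid point feature corresponding to each grid point is determined from the index features corresponding to that grid point in the bird's-eye view. The second feature map corresponding to the bird's-eye view can therefore be determined without combining the intrinsic and extrinsic parameter matrices, which overcomes the prior art's heavy dependence on those matrices.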
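As shown in Figure 2a, on the basis of the embodiment shown in Figure 1, step 102 may include the following steps:
Step 1021: determine the plurality of grid points included in the bird's-eye view.
Optionally, the size of the bird's-eye view is determined according to the actual application scenario; typically the bird's-eye view covers the same range of the surroundings as the images captured by the multi-camera system. Optionally, the bird's-eye view is divided into a plurality of grid points according to its size, with every grid point the same size, for example each corresponding to a 0.5 m × 0.5 m region in the world coordinate system.
Step 1022: determine the index features that each of the plurality of grid points has in at least one first feature map, to obtain at least one index feature corresponding to each grid point.
In this embodiment, a grid point may or may not have a corresponding index region in a given first feature map. For example, if a grid point corresponds to the image on the right side of the carrier in the bird's-eye view, the first feature map corresponding to the image obtained by the left-view camera may contain no index feature for that grid point. As another example, as shown in Figure 2b, the grid point in the bird's-eye view on the right corresponds to the position of a vehicle's side door as captured by the front-view camera and the right-front camera; that side door is not captured by the other cameras in the multi-camera system, so the grid point has no index feature in the other images.
Step 1023: determine, based on the at least one index feature corresponding to each grid point, the grid point feature corresponding to each of the plurality of grid points.
In this embodiment, a neural network model (for example, a self-attention model) may process the at least one index feature corresponding to each grid point: all index features corresponding to one grid point are taken as input, and the grid point feature corresponding to that grid point is output. Combining this grid point feature with the grid point's position in the bird's-eye view determines the grid point feature of every grid point in the bird's-eye view, which yields the second feature map of the bird's-eye view. In other words, this embodiment obtains the second feature map corresponding to the bird's-eye view without relying on coordinate system conversion that involves the camera intrinsic and extrinsic parameter matrices. This overcomes the related art's dependence on those matrices and allows a bird's-eye view whose index relationships have been determined to obtain its second feature map at any moment.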
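As shown in Figure 3, on the basis of the embodiment shown in Figure 2a, step 1022 may include the following steps:
Step 301: for each of the at least one first feature map, determine the mapping region of the grid point in that first feature map based on the intrinsic parameter matrix and extrinsic parameter matrix of the camera in the multi-camera system that corresponds to the first feature map, and on the three-dimensional coordinates of the grid point.
In this embodiment, once the positional relationship between the carrier and the multi-camera system has been determined, the mapping region of a grid point in each first feature map can be determined from the camera's intrinsic parameter matrix, its extrinsic parameter matrix, and the grid point's three-dimensional coordinates; for example, coordinate system conversion transforms features in the image coordinate system into a coordinate system centered on the carrier (for example, the ego-vehicle coordinate system). Later, when a visual perception task is executed, the index features corresponding to each grid point are obtained directly, without performing the coordinate system conversion again with the camera's intrinsic and extrinsic parameter matrices.
Step 302: determine the at least one index feature corresponding to the grid point based on the position of the at least one mapping region that the grid point has in the at least one first feature map.
In this embodiment, a grid point may be mapped into each of the plurality of first feature maps. When the position corresponding to the mapping region lies within the first feature map, the mapping region is valid, and the feature corresponding to the mapping region is taken as the grid point's index feature in that first feature map. When the position corresponding to the mapping region lies outside the first feature map (for example, when a position coordinate is negative), the mapping region is invalid, and the corresponding first feature map contains no index feature for the grid point. The index features determined from all valid mapping regions form the at least one index feature corresponding to the grid point. Establishing the correspondence between grid points and index features in advance reduces the dependence of the grid point features on the camera intrinsic and extrinsic parameter matrices: when the carrier's position changes (for example, while a vehicle is driving), the method provided by this embodiment can still obtain a fairly accurate perception result even if the intrinsic and extrinsic parameter matrices change to some extent.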
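The validity check described above can be sketched as a simple bounds test per view. The function name `valid_indices` is hypothetical, and testing only the centre position of the mapping region (rather than the whole region) is a simplifying assumption.

```python
def valid_indices(projections, feat_w, feat_h):
    """Keep only the views in which the mapped position lies inside the feature map.

    projections: dict camera_id -> (u, v), the integer position of one grid
                 point in that camera's first feature map.
    Returns a list of (camera_id, u, v) for the valid views, ordered by camera id.
    """
    kept = []
    for cam_id, (u, v) in sorted(projections.items()):
        if 0 <= u < feat_w and 0 <= v < feat_h:
            kept.append((cam_id, u, v))
    return kept


# A grid point that maps inside the maps of cameras 1 and 2 but has a
# negative coordinate (an invalid mapping) in the map of camera 6.
indices = valid_indices({1: (10, 5), 2: (0, 7), 6: (-3, 4)}, feat_w=16, feat_h=8)
```

Only the surviving entries contribute index features for that grid point.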
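The process of determining the index features of grid points through the intrinsic and extrinsic parameter matrices, as provided by this embodiment, is performed after the positional relationship between the carrier and the multi-camera system has been determined, and may be performed before step 101. After the index features corresponding to the grid points have been determined, the correspondence may be stored. When the visual perception method provided by this embodiment is actually applied, the stored correspondence between grid points and index features can be called directly, without recomputing it for every perception task. As long as the carrier and the multi-camera system mounted on it do not change, the intrinsic and extrinsic parameter matrices are not needed to determine the index features corresponding to the grid points, which improves the processing efficiency of the perception task and overcomes the prior art's heavy dependence on those matrices.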
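One way to picture the stored correspondence is a lookup table built once and then read at run time. The sketch below is a minimal illustration under assumptions: the names `build_index_table` and `gather_index_features`, the dictionary layout, and the stand-in projection callables are all hypothetical and not taken from the disclosure.

```python
def build_index_table(grid_points, project_fns, feat_w, feat_h):
    """Precompute, once, which feature-map positions each grid point indexes.

    grid_points: dict grid_id -> (x, y, z).
    project_fns: dict camera_id -> callable mapping (x, y, z) to an integer
                 (u, v), or to None when the point cannot be projected.
    """
    table = {}
    for gid, xyz in grid_points.items():
        entries = []
        for cam_id in sorted(project_fns):
            uv = project_fns[cam_id](xyz)
            if uv is not None and 0 <= uv[0] < feat_w and 0 <= uv[1] < feat_h:
                entries.append((cam_id, uv[0], uv[1]))
        table[gid] = entries
    return table


def gather_index_features(table, feature_maps, gid):
    """At run time, read features through the stored table; no matrices are used."""
    return [feature_maps[cam][v][u] for cam, u, v in table[gid]]


# Toy setup: two hypothetical cameras with trivial stand-in projections.
project_fns = {
    1: lambda p: (int(p[0]), int(p[1])),
    2: lambda p: None,
}
table = build_index_table({"g0": (1.0, 0.0, 0.0)}, project_fns, feat_w=2, feat_h=2)
feature_maps = {1: [[[0.1], [0.2]], [[0.3], [0.4]]]}  # camera 1: 2 x 2 x 1 map
features = gather_index_features(table, feature_maps, "g0")
```

The projection callables are used only while building the table; each later frame needs just the table and the fresh first feature maps.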
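As shown in Figure 4, on the basis of the embodiment shown in Figure 3, step 301 may include the following steps:
Step 3011: based on the intrinsic parameter matrix and the extrinsic parameter matrix, map the three-dimensional coordinates of the grid point to the image coordinate system corresponding to the first feature map, to obtain the image coordinates of the grid point in the image coordinate system.
Optionally, the three-dimensional coordinates of the grid point may be mapped into the image coordinate system corresponding to the first feature map based on the following formula (1):
$c_k = K_k \cdot Rt_k \cdot c_{3D}$            Formula (1)
where $k$ denotes the number of the corresponding vehicle-mounted camera. The cameras included in the multi-camera system may be numbered in advance and referred to by number thereafter; for example, the six cameras of a multi-camera system may be numbered clockwise from straight ahead as 1 (front), 2 (front right), 3 (rear right), 4 (rear), 5 (rear left), and 6 (front left). $c_k$ denotes the image coordinates to which the grid point is mapped in the first feature map, $K_k$ denotes the intrinsic parameter matrix of camera $k$, $Rt_k$ denotes the extrinsic parameter matrix of camera $k$, and $c_{3D}$ denotes the three-dimensional coordinates of the grid point.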
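A minimal sketch of formula (1) for one camera follows. It assumes a 3×3 intrinsic matrix, a 3×4 extrinsic matrix [R | t] that maps carrier-frame coordinates to the camera frame, and the usual pinhole division by depth; the formula itself does not spell out the homogeneous coordinate or the division, so both are assumptions, as are the function names and the example matrices.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def project_grid_point(K, Rt, c3d):
    """Map a 3D grid point to image coordinates following c_k = K_k . Rt_k . c_3D.

    K:   3x3 intrinsic parameter matrix of camera k.
    Rt:  3x4 extrinsic parameter matrix [R | t] of camera k.
    c3d: (x, y, z) coordinates of the grid point.
    Returns (u, v), or None when the point is not in front of the camera.
    """
    cam = mat_vec(Rt, [c3d[0], c3d[1], c3d[2], 1.0])  # point in the camera frame
    if cam[2] <= 0:
        return None
    uvw = mat_vec(K, cam)                              # homogeneous image coordinates
    return (uvw[0] / uvw[2], uvw[1] / uvw[2])


# Illustrative matrices: focal length 100, principal point (64, 32), identity pose.
K = [[100.0, 0.0, 64.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]]
Rt = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
uv = project_grid_point(K, Rt, (0.5, -0.25, 2.0))
```

The result is generally non-integer, which is why the later steps round it and widen it into a mapping region.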
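Step 3012: determine the mapping region by taking the image coordinates as the center and combining a preset length and a preset width.
In this embodiment, a mapping region is determined from a preset length and a preset width; for example, when the preset length is $K_h$ and the preset width is $K_w$, the resulting mapping region has size $K_h \times K_w$, where $K_h$ and $K_w$ may be equal or different and are set according to the actual application scenario. This embodiment maps a grid point to a mapping region covering a certain range rather than to the image coordinates alone. As a result, even if inaccurate or imprecise intrinsic and extrinsic parameter matrices shift the mapping region somewhat, the target is still covered, so the perception result obtained from the second feature map determined from the mapping region is insensitive to the intrinsic and extrinsic parameter matrices.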
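The tolerance argument can be illustrated with a small window helper. The names `mapping_region` and `contains` are hypothetical, and odd window sizes are assumed so that the window is symmetric about its centre.

```python
def mapping_region(u, v, k_h, k_w):
    """Return (u_min, v_min, u_max, v_max), inclusive, of a k_h x k_w window
    centred on the image coordinate (u, v). Odd k_h and k_w are assumed."""
    half_w, half_h = k_w // 2, k_h // 2
    return (u - half_w, v - half_h, u + half_w, v + half_h)


def contains(region, u, v):
    """Check whether position (u, v) falls inside the mapping region."""
    u_min, v_min, u_max, v_max = region
    return u_min <= u <= u_max and v_min <= v <= v_max


# A 3 x 5 region around (10, 5). A target that a slightly wrong calibration
# would have placed at (11, 6) is still covered by the region.
region = mapping_region(10, 5, k_h=3, k_w=5)
still_covered = contains(region, 11, 6)
```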
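Optionally, step 3011 may include:
Step a1: based on the intrinsic parameter matrix and the extrinsic parameter matrix, map the three-dimensional coordinates of the grid point to the image coordinate system, to obtain the precise coordinates of the grid point in the image coordinate system.
For example, the coordinate mapping is carried out with formula (1) above to obtain the precise coordinates of the grid point in the image coordinate system; these coordinates may be integer or non-integer.
Step a2: based on the precise coordinates, obtain an integer coordinate at a position around the precise coordinates as the image coordinates.
In this embodiment, when the precise coordinates are integers, they are used directly as the image coordinates, and the mapping region is determined with those image coordinates as its center point. In most cases the precise coordinates are non-integer; an integer coordinate can then be chosen at any position around the precise coordinates, which rounds them. In this way the camera intrinsic and extrinsic parameter matrices are used to map the grid point roughly (errors are allowed) onto the first feature map of each view, giving the image coordinates (u, v) in each view. The rounding in this embodiment is equivalent to adding noise to the intrinsic and extrinsic parameter matrices, and therefore avoids excessive dependence on their accuracy.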
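Step a2 can be sketched as listing the integer positions around the precise coordinate and picking one of them. The text allows any surrounding position; choosing the nearest one, as `to_image_coordinate` does below, is an assumption, and both function names are hypothetical.

```python
import math


def neighbouring_integers(u, v):
    """List the integer positions surrounding a precise coordinate (u, v)."""
    us = sorted({math.floor(u), math.ceil(u)})
    vs = sorted({math.floor(v), math.ceil(v)})
    return [(a, b) for a in us for b in vs]


def to_image_coordinate(u, v):
    """Pick one surrounding integer position (nearest neighbour, an assumed choice)."""
    return min(neighbouring_integers(u, v),
               key=lambda p: (p[0] - u) ** 2 + (p[1] - v) ** 2)


candidates = neighbouring_integers(3.2, 7.9)  # four surrounding integer positions
image_coord = to_image_coordinate(3.2, 7.9)   # the nearest of them
```

When the precise coordinate is already an integer, the candidate list collapses to that single position, matching the first case in the text.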
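As shown in Figure 5, on the basis of the embodiment shown in Figure 2a, step 1023 may include the following steps:
Step 501: perform an unfold operation on each of the at least one first feature map, to obtain a plurality of strip features corresponding to each first feature map.
Optionally, each first feature map may be unfolded with the convolution acceleration algorithm (img2col) to obtain unfolded features (unfold features). The unfolded features can be understood as comprising a plurality of strip features, where each strip feature (one row of the unfolded features) corresponds to the K×K region (patch) around one feature point of the first feature map; K is an integer greater than 1 whose value can be set according to the actual scenario.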
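The unfold operation can be sketched in plain Python as follows. This is an illustrative version that assumes an odd K and zero padding at the borders, neither of which the text specifies; production code would more likely use a library unfold routine.

```python
def unfold(feature_map, k):
    """Expand an H x W x C feature map into H*W strip features of length k*k*C.

    Row index v * W + u holds the flattened k x k neighbourhood of feature
    point (u, v). Positions outside the map are zero-padded (an assumption),
    and k is assumed to be odd.
    """
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    r = k // 2
    strips = []
    for v in range(h):
        for u in range(w):
            strip = []
            for dv in range(-r, r + 1):
                for du in range(-r, r + 1):
                    vv, uu = v + dv, u + du
                    if 0 <= vv < h and 0 <= uu < w:
                        strip.extend(feature_map[vv][uu])
                    else:
                        strip.extend([0.0] * c)
            strips.append(strip)
    return strips


fmap = [[[1.0], [2.0]], [[3.0], [4.0]]]  # 2 x 2 map with one channel
strips = unfold(fmap, 3)                 # 4 strip features, each of length 9
```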
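Step 502: based on the plurality of strip features corresponding to each first feature map, determine the strip index feature corresponding to each of the at least one index feature, to obtain at least one strip index feature corresponding to each grid point.
Because each of the strip features obtained by unfolding a first feature map corresponds to the K×K region around one pixel, combining this with the image feature corresponding to each grid point (one feature point in the first feature map) determines the at least one strip index feature corresponding to each grid point.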
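Selecting the strip index features then amounts to a row lookup in the unfolded features. The sketch assumes that the unfolded rows are stored in row-major order (row = v × W + u) and that each index is a (camera, u, v) triple; the function name and data layout are hypothetical.

```python
def strip_index_features(unfolded, indices, feat_w):
    """Select the strip feature (one row of the unfolded map) for each index.

    unfolded: dict camera_id -> list of H*W strip features, row = v * feat_w + u.
    indices:  list of (camera_id, u, v) entries for one grid point.
    """
    return [unfolded[cam][v * feat_w + u] for cam, u, v in indices]


# Toy unfolded features for two cameras, each with a 2 x 2 feature map.
unfolded = {
    1: [[0.0], [1.0], [2.0], [3.0]],
    2: [[10.0], [11.0], [12.0], [13.0]],
}
picked = strip_index_features(unfolded, [(1, 1, 0), (2, 0, 1)], feat_w=2)
```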
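Step 503: determine the grid point feature corresponding to each grid point based on the at least one strip index feature corresponding to that grid point.
In this embodiment, the at least one strip index feature corresponding to each grid point may be input into a self-attention model (for example, a Transformer, a deep learning model based on the self-attention mechanism) to obtain the grid point feature corresponding to that grid point (for example, a 1×1×C' feature vector). The self-attention operation is performed only on the at least one strip index feature to obtain the grid point feature, whereas the related art needs to perform the self-attention operation (attention) on all feature points in all first feature maps. The embodiments of the present disclosure therefore greatly reduce the amount of computation and improve the processing efficiency of the preset perception task.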
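To make the cost argument concrete, the sketch below runs a single-head self-attention over only the handful of tokens that belong to one grid point and pools the result into one vector. It is a deliberately simplified stand-in: queries, keys, and values are the tokens themselves with no learned weights, and mean pooling is an assumed way to reduce to a single feature; the disclosure only says that a self-attention model such as a Transformer is used.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_pool(tokens):
    """Single-head self-attention over one grid point's strip index features,
    followed by mean pooling into one grid-point feature vector.

    tokens: list of equal-length feature vectors (one per index feature).
    """
    d = len(tokens[0])
    attended = []
    for q in tokens:
        # Scaled dot-product scores of this token against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        attended.append([sum(w * t[i] for w, t in zip(weights, tokens))
                         for i in range(d)])
    n = len(attended)
    return [sum(row[i] for row in attended) / n for i in range(d)]


feature = self_attention_pool([[1.0, 0.0], [1.0, 0.0]])
```

The attention matrix here is only as large as the number of index features of one grid point, rather than the number of feature points across all views.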
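In another embodiment, step 1023 may alternatively include: for each of the plurality of grid points, performing feature extraction on the at least one index feature corresponding to the grid point based on a self-attention model, to obtain the grid point feature corresponding to each grid point.
Compared with the embodiment shown in Figure 5, this embodiment omits the step of unfolding the first feature maps and directly inputs the image features of the at least one region that the grid point corresponds to in the at least one first feature map into the self-attention model (for example, a Transformer, a deep learning model based on the self-attention mechanism) to obtain the grid point feature corresponding to the grid point (for example, a 1×1×C' feature vector). Like the previous embodiment, it greatly reduces the amount of computation relative to the related art, which needs to perform the self-attention operation (attention) on all feature points in all first feature maps. In addition, compared with unfolding the first feature maps and using strip features as input, it saves the unfolding computation and thus further improves the processing efficiency of the preset perception task.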
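For this variant, the region features can be read straight from the first feature map without any unfolding. The helper below is a hypothetical sketch; skipping positions that fall outside the map is an assumption.

```python
def region_tokens(feature_map, u, v, k_h, k_w):
    """Collect the feature vectors inside the k_h x k_w region centred on (u, v),
    skipping positions that fall outside the H x W x C feature map."""
    h, w = len(feature_map), len(feature_map[0])
    tokens = []
    for vv in range(v - k_h // 2, v + k_h // 2 + 1):
        for uu in range(u - k_w // 2, u + k_w // 2 + 1):
            if 0 <= vv < h and 0 <= uu < w:
                tokens.append(feature_map[vv][uu])
    return tokens


fmap = [[[1.0], [2.0], [3.0]], [[4.0], [5.0], [6.0]]]  # 2 x 3 map, one channel
tokens = region_tokens(fmap, 2, 0, 3, 3)                # region at the top-right corner
```

The resulting tokens would then be passed to the self-attention model in place of the strip index features.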
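Any visual perception method provided by the embodiments of the present disclosure may be executed by any suitable device with data processing capabilities, including but not limited to terminal devices and servers. Alternatively, any visual perception method provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor executes any visual perception method mentioned in the embodiments of the present disclosure by calling the corresponding instructions stored in a memory. This is not repeated below.
Exemplary Apparatus
Figure 6 is a schematic structural diagram of a visual perception apparatus provided by an exemplary embodiment of the present disclosure. As shown in Figure 6, the apparatus provided by this embodiment includes:
a feature extraction module 61, configured to perform feature extraction separately on a plurality of images of different viewing angles that the multi-camera system of a carrier captures of the carrier's surroundings at the same moment, to obtain a plurality of first feature maps.
a feature correspondence module 62, configured to determine, based on the index features that a plurality of grid points included in the bird's-eye view corresponding to the same moment have in at least one first feature map determined by the feature extraction module 61, the grid point feature corresponding to each of the plurality of grid points.
a feature map determination module 63, configured to determine, based on the grid point feature that the feature correspondence module 62 determined for each grid point, the second feature map corresponding to the bird's-eye view.
a perception recognition module 64, configured to recognize, based on the network model corresponding to a preset perception task, the second feature map determined by the feature map determination module 63, to determine the perception result corresponding to the preset perception task.
With the visual perception apparatus provided by the above embodiment of the present disclosure, the grid point feature corresponding to each grid point is determined from the index features corresponding to that grid point in the bird's-eye view. The second feature map corresponding to the bird's-eye view can therefore be determined without combining the intrinsic and extrinsic parameter matrices, which overcomes the prior art's heavy dependence on those matrices.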
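Figure 7 is a schematic structural diagram of a visual perception apparatus provided by another exemplary embodiment of the present disclosure. As shown in Figure 7, in the apparatus provided by this embodiment:
the feature correspondence module 62 includes:
a grid point determination unit 621, configured to determine the plurality of grid points included in the bird's-eye view.
an index determination unit 622, configured to determine the index features that each of the plurality of grid points has in at least one of the first feature maps, to obtain at least one index feature corresponding to each grid point.
a grid point feature determination unit 623, configured to determine, based on the at least one index feature corresponding to each grid point, the grid point feature corresponding to each of the plurality of grid points.
Optionally, the index determination unit 622 is specifically configured to: for each of the at least one first feature map, determine the mapping region of the grid point in that first feature map based on the intrinsic parameter matrix and extrinsic parameter matrix of the camera in the multi-camera system that corresponds to the first feature map, and on the three-dimensional coordinates of the grid point; and determine the at least one index feature corresponding to the grid point based on the position of the at least one mapping region that the grid point has in the at least one first feature map.
Optionally, when determining the mapping region of the grid point in the first feature map based on the intrinsic parameter matrix and extrinsic parameter matrix of the camera in the multi-camera system that corresponds to the first feature map and on the three-dimensional coordinates of the grid point, the index determination unit 622 is configured to: map the three-dimensional coordinates of the grid point to the image coordinate system corresponding to the first feature map based on the intrinsic parameter matrix and the extrinsic parameter matrix, to obtain the image coordinates of the grid point in the image coordinate system; and determine the mapping region by taking the image coordinates as the center and combining a preset length and a preset width.
Optionally, when mapping the three-dimensional coordinates of the grid point to the image coordinate system corresponding to the first feature map based on the intrinsic parameter matrix and the extrinsic parameter matrix to obtain the image coordinates of the grid point in the image coordinate system, the index determination unit 622 is configured to: map the three-dimensional coordinates of the grid point to the image coordinate system based on the intrinsic parameter matrix and the extrinsic parameter matrix, to obtain the precise coordinates of the grid point in the image coordinate system; and, based on the precise coordinates, obtain an integer coordinate at a position around the precise coordinates as the image coordinates.
In some optional embodiments, the grid point feature determination unit 623 is specifically configured to: perform an unfold operation on each of the at least one first feature map, to obtain a plurality of strip features corresponding to each first feature map; determine, based on the plurality of strip features corresponding to each first feature map, the strip index feature corresponding to each of the at least one index feature, to obtain at least one strip index feature corresponding to each grid point; and determine the grid point feature corresponding to each grid point based on the at least one strip index feature corresponding to that grid point.
In other optional embodiments, the grid point feature determination unit 623 is specifically configured to: for each of the plurality of grid points, perform feature extraction on the at least one index feature corresponding to the grid point based on a self-attention model, to obtain the grid point feature corresponding to each grid point.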
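Exemplary Electronic Device
An electronic device according to an embodiment of the present disclosure is described below with reference to Figure 8. The electronic device may be either or both of a first device and a second device, or a stand-alone device independent of them; the stand-alone device may communicate with the first device and the second device to receive the collected input signals from them.
Figure 8 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure.
As shown in Figure 8, the electronic device 80 includes one or more processors 81 and a memory 82.
The processor 81 may be a central processing unit (CPU) or another form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 80 to perform desired functions.
The memory 82 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, and flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 81 may run the program instructions to implement the visual perception methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as input signals, signal components, and noise components may also be stored in the computer-readable storage medium.
In one example, the electronic device 80 may further include an input device 83 and an output device 84, and these components are interconnected through a bus system and/or another form of connection mechanism (not shown).
For example, when the electronic device 80 is the first device or the second device, the input device 83 may be the above-mentioned microphone or microphone array, used to capture the input signal of a sound source. When the electronic device 80 is a stand-alone device, the input device 83 may be a communication network connector for receiving the collected input signals from the first device and the second device.
In addition, the input device 83 may also include, for example, a keyboard and a mouse.
The output device 84 can output various information to the outside, including the determined distance information and direction information. The output device 84 may include, for example, a display, a speaker, a printer, and a communication network and the remote output devices connected to it.
Of course, for simplicity, Figure 8 shows only some of the components of the electronic device 80 that are relevant to the present disclosure and omits components such as buses and input/output interfaces. In addition, the electronic device 80 may include any other suitable components depending on the specific application.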
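Exemplary Computer Program Product and Computer-Readable Storage Medium
In addition to the above methods and devices, embodiments of the present disclosure may also be a computer program product that includes computer program instructions which, when run by a processor, cause the processor to perform the steps of the visual perception method according to various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
The computer program product may carry program code for performing the operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
In addition, embodiments of the present disclosure may also be a computer-readable storage medium having computer program instructions stored on it which, when run by a processor, cause the processor to perform the steps of the visual perception method according to various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
The computer-readable storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but is not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.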
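The basic principles of the present disclosure have been described above in conjunction with specific embodiments. It should be noted, however, that the merits, advantages, and effects mentioned in the present disclosure are only examples and not limitations, and they must not be regarded as required of every embodiment of the present disclosure. In addition, the specific details disclosed above serve only as examples and to aid understanding, not as limitations; they do not restrict the present disclosure to being implemented with those specific details.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the identical or similar parts of the embodiments may be referred to one another. Because the system embodiments largely correspond to the method embodiments, they are described relatively briefly; for the relevant details, refer to the description of the method embodiments.
The block diagrams of components, apparatuses, devices, and systems in the present disclosure are merely illustrative examples and are not intended to require or imply that connections, arrangements, or configurations must follow the manner shown in the block diagrams. As those skilled in the art will recognize, these components, apparatuses, devices, and systems may be connected, arranged, and configured in any manner. Words such as "including", "comprising", and "having" are open-ended terms that mean "including but not limited to" and may be used interchangeably with that phrase. The words "or" and "and" as used here mean "and/or" and may be used interchangeably with it, unless the context clearly indicates otherwise. The phrase "such as" as used here means "such as but not limited to" and may be used interchangeably with it.
The methods and apparatuses of the present disclosure may be implemented in many ways. For example, they may be implemented through software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the methods is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. The present disclosure therefore also covers recording media storing programs for executing the methods according to the present disclosure.
It should also be noted that, in the apparatuses, devices, and methods of the present disclosure, each component or step can be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent solutions of the present disclosure.
The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined here may be applied to other aspects without departing from the scope of the present disclosure. The present disclosure is therefore not intended to be limited to the aspects shown here, but is to be accorded the widest scope consistent with the principles and novel features disclosed here.
The above description has been given for the purposes of illustration and description. It is not intended to limit the embodiments of the present disclosure to the forms disclosed here. Although a number of example aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions, and sub-combinations of them.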

Claims (10)

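  1. A visual perception method, comprising:
    performing feature extraction separately on a plurality of images of different viewing angles that a multi-camera system of a carrier captures of the carrier's surroundings at the same moment, to obtain a plurality of first feature maps;
    determining, based on the index features that a plurality of grid points included in a bird's-eye view corresponding to the same moment have in at least one of the first feature maps, the grid point feature corresponding to each of the plurality of grid points;
    determining, based on the grid point feature corresponding to each grid point, a second feature map corresponding to the bird's-eye view;
    recognizing the second feature map based on a network model corresponding to a preset perception task, to determine a perception result corresponding to the preset perception task.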
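  2. The method according to claim 1, wherein determining, based on the index features that the plurality of grid points included in the bird's-eye view corresponding to the same moment have in at least one of the first feature maps, the grid point feature corresponding to each of the plurality of grid points comprises:
    determining the plurality of grid points included in the bird's-eye view;
    determining the index features that each of the plurality of grid points has in at least one of the first feature maps, to obtain at least one index feature corresponding to each grid point;
    determining, based on the at least one index feature corresponding to each grid point, the grid point feature corresponding to each of the plurality of grid points.
  3. The method according to claim 2, wherein determining the index features that each of the plurality of grid points has in at least one of the first feature maps, to obtain at least one index feature corresponding to each grid point, comprises:
    for each of the at least one first feature map, determining the mapping region of the grid point in the first feature map based on the intrinsic parameter matrix and extrinsic parameter matrix of the camera in the multi-camera system that corresponds to the first feature map, and on the three-dimensional coordinates of the grid point;
    determining the at least one index feature corresponding to the grid point based on the position of the at least one mapping region that the grid point has in the at least one first feature map.
  4. The method according to claim 3, wherein determining the mapping region of the grid point in the first feature map based on the intrinsic parameter matrix and extrinsic parameter matrix of the camera in the multi-camera system that corresponds to the first feature map, and on the three-dimensional coordinates of the grid point, comprises:
    mapping the three-dimensional coordinates of the grid point to the image coordinate system corresponding to the first feature map based on the intrinsic parameter matrix and the extrinsic parameter matrix, to obtain the image coordinates of the grid point in the image coordinate system;
    determining the mapping region by taking the image coordinates as the center and combining a preset length and a preset width.
  5. The method according to claim 4, wherein mapping the three-dimensional coordinates of the grid point to the image coordinate system corresponding to the first feature map based on the intrinsic parameter matrix and the extrinsic parameter matrix, to obtain the image coordinates of the grid point in the image coordinate system, comprises:
    mapping the three-dimensional coordinates of the grid point to the image coordinate system based on the intrinsic parameter matrix and the extrinsic parameter matrix, to obtain the precise coordinates of the grid point in the image coordinate system;
    obtaining, based on the precise coordinates, an integer coordinate at a position around the precise coordinates as the image coordinates.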
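  6. The method according to any one of claims 2-5, wherein determining, based on the at least one index feature corresponding to each grid point, the grid point feature corresponding to each of the plurality of grid points comprises:
    performing an unfold operation on each of the at least one first feature map, to obtain a plurality of strip features corresponding to each first feature map;
    determining, based on the plurality of strip features corresponding to each first feature map, the strip index feature corresponding to each of the at least one index feature, to obtain at least one strip index feature corresponding to each grid point;
    determining the grid point feature corresponding to each grid point based on the at least one strip index feature corresponding to that grid point.
  7. The method according to any one of claims 2-5, wherein determining, based on the at least one index feature corresponding to each grid point, the grid point feature corresponding to each of the plurality of grid points comprises:
    for each of the plurality of grid points, performing feature extraction on the at least one index feature corresponding to the grid point based on a self-attention model, to obtain the grid point feature corresponding to each grid point.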
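  8. A visual perception apparatus, comprising:
    a feature extraction module, configured to perform feature extraction separately on a plurality of images of different viewing angles that a multi-camera system of a carrier captures of the carrier's surroundings at the same moment, to obtain a plurality of first feature maps;
    a feature correspondence module, configured to determine, based on the index features that a plurality of grid points included in a bird's-eye view corresponding to the same moment have in at least one of the first feature maps determined by the feature extraction module, the grid point feature corresponding to each of the plurality of grid points;
    a feature map determination module, configured to determine, based on the grid point feature that the feature correspondence module determined for each grid point, a second feature map corresponding to the bird's-eye view;
    a perception recognition module, configured to recognize, based on a network model corresponding to a preset perception task, the second feature map determined by the feature map determination module, to determine a perception result corresponding to the preset perception task.
  9. A computer-readable storage medium, the storage medium storing a computer program, the computer program being used to execute the visual perception method according to any one of claims 1-7.
  10. An electronic device, the electronic device comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    the processor being configured to read the executable instructions from the memory and execute them to implement the visual perception method according to any one of claims 1-7.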
PCT/CN2023/073954 2022-06-01 2023-01-31 视觉感知方法、装置、存储介质和电子设备 WO2023231435A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210618710.5 2022-06-01
CN202210618710.5A CN114882465A (zh) 2022-06-01 2022-06-01 视觉感知方法、装置、存储介质和电子设备

Publications (1)

Publication Number Publication Date
WO2023231435A1 true WO2023231435A1 (zh) 2023-12-07

Family

ID=82679391

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/073954 WO2023231435A1 (zh) 2022-06-01 2023-01-31 视觉感知方法、装置、存储介质和电子设备

Country Status (2)

Country Link
CN (1) CN114882465A (zh)
WO (1) WO2023231435A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882465A (zh) * 2022-06-01 2022-08-09 北京地平线信息技术有限公司 视觉感知方法、装置、存储介质和电子设备
CN115719476A (zh) * 2022-11-11 2023-02-28 北京地平线信息技术有限公司 图像的处理方法、装置、电子设备和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160120895A (ko) * 2015-04-09 2016-10-19 한국항공대학교산학협력단 영상과 위치정보를 연계한 데이터베이스를 구축하는 방법, 상기 데이터베이스를 활용하여 측위하는 방법, 및 상기 방법들을 수행하는 전자 장치
CN109948398A (zh) * 2017-12-20 2019-06-28 深圳开阳电子股份有限公司 全景泊车的图像处理方法及全景泊车装置
CN111797650A (zh) * 2019-04-09 2020-10-20 广州文远知行科技有限公司 障碍物的识别方法、装置、计算机设备和存储介质
CN114882465A (zh) * 2022-06-01 2022-08-09 北京地平线信息技术有限公司 视觉感知方法、装置、存储介质和电子设备

Also Published As

Publication number Publication date
CN114882465A (zh) 2022-08-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23814634

Country of ref document: EP

Kind code of ref document: A1