CN114757301A - Vehicle-mounted visual perception method and device, readable storage medium and electronic equipment - Google Patents

Info

Publication number
CN114757301A
Authority
CN
China
Prior art keywords
vehicle
coordinate system
feature map
fusion
feature
Prior art date
Legal status
Granted
Application number
CN202210514756.2A
Other languages
Chinese (zh)
Other versions
CN114757301B (en)
Inventor
朱红梅
孟文明
王梦圆
李翔宇
任伟强
张骞
Current Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN202210514756.2A
Publication of CN114757301A
Application granted
Publication of CN114757301B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods


Abstract

Embodiments of the present disclosure disclose a vehicle-mounted visual perception method and apparatus, a readable storage medium, and an electronic device. The method includes: acquiring images at a plurality of consecutive moments through a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle to obtain a plurality of image frame sets, where each image frame set includes multiple image frames collected by the same vehicle-mounted camera, and each vehicle-mounted camera corresponds to one image frame at each moment; performing feature extraction on the image frames in the plurality of image frame sets through an encoding branch network in a first network model to obtain a plurality of first feature maps; performing spatial fusion and temporal fusion on the plurality of first feature maps through a fusion branch network in the first network model to obtain a second feature map in a bird's-eye-view image coordinate system; and recognizing the second feature map based on a network model corresponding to a preset perception task to determine a perception result corresponding to the preset perception task.

Description

Vehicle-mounted visual perception method and apparatus, readable storage medium, and electronic device

Technical Field

The present disclosure relates to the technical field of automatic driving, and in particular to a vehicle-mounted visual perception method and apparatus, a readable storage medium, and an electronic device.

Background

In the field of autonomous driving, the perception results produced by the multiple vehicle-mounted cameras of an exterior perception system, each in its own coordinate system, cannot be used directly by downstream prediction and planning-control systems; the dynamic and static perception results from different views must be fused in some way and expressed uniformly in the ego-vehicle coordinate system. In the prior art, the fusion of different views is usually performed separately through post-processing.

Summary of the Invention

In order to solve the above technical problems, the present disclosure is made. Embodiments of the present disclosure provide a vehicle-mounted visual perception method and apparatus, a readable storage medium, and an electronic device.

According to one aspect of the embodiments of the present disclosure, a vehicle-mounted visual perception method is provided, including:

acquiring images at a plurality of consecutive moments through a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle to obtain a plurality of image frame sets, where each image frame set includes multiple image frames collected by the same vehicle-mounted camera, and each vehicle-mounted camera corresponds to one image frame at each moment;

performing feature extraction on the image frames included in the plurality of image frame sets through an encoding branch network in a first network model to obtain a plurality of first feature maps;

performing spatial fusion and temporal fusion on the plurality of first feature maps through a fusion branch network in the first network model to obtain a second feature map in the bird's-eye-view image coordinate system; and

recognizing the second feature map based on a network model corresponding to a preset perception task, and determining a perception result corresponding to the preset perception task.

According to another aspect of the embodiments of the present disclosure, a vehicle-mounted visual perception apparatus is provided, including:

an image acquisition module configured to acquire images at a plurality of consecutive moments through a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle to obtain a plurality of image frame sets, where each image frame set includes multiple image frames collected by the same vehicle-mounted camera, and each image frame corresponds to one moment;

an encoding module configured to perform feature extraction on the image frames included in the plurality of image frame sets obtained by the image acquisition module through an encoding branch network in a first network model to obtain a plurality of first feature maps;

a fusion module configured to perform spatial fusion and temporal fusion on the plurality of first feature maps determined by the encoding module through a fusion branch network in the first network model to obtain a second feature map in the bird's-eye-view image coordinate system; and

a perception module configured to recognize the second feature map determined by the fusion module based on a network model corresponding to a preset perception task to obtain a perception result corresponding to the preset perception task.

According to yet another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided. The storage medium stores a computer program, and the computer program is used to execute the vehicle-mounted visual perception method described in any of the above embodiments.

According to still another aspect of the embodiments of the present disclosure, an electronic device is provided, the electronic device including:

a processor; and

a memory for storing instructions executable by the processor;

where the processor is configured to read the executable instructions from the memory and execute the instructions to implement the vehicle-mounted visual perception method described in any of the above embodiments.

Based on the vehicle-mounted visual perception method and apparatus, readable storage medium, and electronic device provided by the above embodiments of the present disclosure, spatial fusion and temporal fusion are implemented inside the first network model, which enables end-to-end learning of the neural network. Since post-processing fusion is not required, the complexity of performing spatial and temporal fusion of images in post-processing is effectively avoided, as is the misidentification of the same target as multiple targets during post-processing.

The technical solutions of the present disclosure will be further described in detail below with reference to the accompanying drawings and embodiments.

Brief Description of the Drawings

The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description of embodiments of the present disclosure with reference to the accompanying drawings. The accompanying drawings are provided for a further understanding of the embodiments of the present disclosure, constitute a part of the specification, serve together with the embodiments to explain the present disclosure, and do not limit the present disclosure. In the drawings, the same reference numerals generally represent the same components or steps.

FIG. 1 is a schematic structural diagram of a first network model provided by an exemplary embodiment of the present disclosure.

FIG. 2 is a schematic flowchart of a vehicle-mounted visual perception method provided by an exemplary embodiment of the present disclosure.

FIG. 3 is a schematic flowchart of step 203 in the embodiment shown in FIG. 2 of the present disclosure.

FIG. 4 is a schematic flowchart of step 2031 in the embodiment shown in FIG. 3 of the present disclosure.

FIG. 5a is a schematic flowchart of step 401 in the embodiment shown in FIG. 4 of the present disclosure.

FIG. 5b is a schematic diagram of the conversion relationship between the bird's-eye-view imaging plane and a preset plane of the ego-vehicle coordinate system in an exemplary embodiment of the present disclosure.

FIG. 6 is a schematic flowchart of step 2032 in the embodiment shown in FIG. 3 of the present disclosure.

FIG. 7 is a schematic flowchart of step 204 in the embodiment shown in FIG. 2 of the present disclosure.

FIG. 8 is a schematic structural diagram of a vehicle-mounted visual perception apparatus provided by an exemplary embodiment of the present disclosure.

FIG. 9 is a schematic structural diagram of a vehicle-mounted visual perception apparatus provided by another exemplary embodiment of the present disclosure.

FIG. 10 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.

Detailed Description

Hereinafter, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure, and it should be understood that the present disclosure is not limited by the exemplary embodiments described herein.

It should be noted that, unless otherwise specified, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.

Those skilled in the art will understand that terms such as "first" and "second" in the embodiments of the present disclosure are only used to distinguish different steps, devices, modules, or the like; they neither carry any specific technical meaning nor indicate a necessary logical order among them.

It should also be understood that, in the embodiments of the present disclosure, "a plurality" may refer to two or more, and "at least one" may refer to one, two, or more.

It should also be understood that any component, data, or structure mentioned in the embodiments of the present disclosure may generally be understood as one or more such items, unless explicitly defined otherwise or the context indicates the contrary.

In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, A and B exist at the same time, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.

It should also be understood that the description of the embodiments in the present disclosure emphasizes the differences between the embodiments; for their identical or similar aspects, the embodiments may be referred to one another, and, for brevity, these aspects are not repeated.

Meanwhile, it should be understood that, for convenience of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual proportional relationships.

The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or uses.

Techniques, methods, and apparatuses known to those of ordinary skill in the relevant art may not be discussed in detail, but, where appropriate, such techniques, methods, and apparatuses should be regarded as part of the specification.

It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be further discussed in subsequent figures.

Embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, and servers, which may operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments including any of the above systems, and the like.

Electronic devices such as terminal devices, computer systems, and servers may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, and the like, which perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.

Overview of the Application

In the process of realizing the present disclosure, the inventors found that the vehicle-mounted visual perception methods provided in the prior art usually adopt post-processing: after a neural network model outputs perception results, the perception results of different cameras are fused according to certain rules. This approach has at least the following problems: when the perception results of the overlapping regions of adjacent views are fused by post-processing, the same target is detected in images from different views because the multiple views overlap, and the same target appears at different positions in different images. Post-processing fusion therefore requires complex recognition and detection, and the same target is easily recognized as multiple targets, causing ambiguity.

Exemplary Network Structure

FIG. 1 is a schematic structural diagram of a first network model provided by an exemplary embodiment of the present disclosure. As shown in FIG. 1, the first network model in this embodiment includes: an encoding branch network 101, a fusion branch network 102, and a decoding branch network 103.

In this embodiment, feature extraction is performed on the image frames collected by multiple vehicle-mounted cameras through the encoding branch network 101 to obtain multiple first feature maps. The multiple vehicle-mounted cameras are arranged at different preset positions on the vehicle, each vehicle-mounted camera corresponds to one view, and the fields of view of the multiple cameras together can cover the surroundings of the vehicle; to avoid blind spots, the fields of view of the cameras may overlap. While the vehicle is driving, each vehicle-mounted camera continuously collects images from its set position and view at multiple moments, yielding an image frame set for each camera. In other words, each vehicle-mounted camera collects one image frame at each moment, and each image frame set includes multiple image frames collected by the same vehicle-mounted camera and arranged in time order. Optionally, the encoding branch network 101 may be any network capable of feature extraction, for example, a convolutional neural network; this embodiment does not limit the specific network structure it adopts.

Spatial fusion and temporal fusion are performed on the multiple first feature maps through the fusion branch network 102 to obtain a second feature map in the bird's-eye-view image coordinate system. Spatial fusion fuses the frames collected by the multiple vehicle-mounted cameras at the same moment to obtain a third feature map in the bird's-eye-view image coordinate system; temporal fusion fuses the third feature maps corresponding to each of the multiple moments to obtain the temporally fused second feature map.

The second feature map obtained by the fusion branch network 102 is decoded by the decoding branch network 103 to obtain a decoded second feature map. Optionally, the decoded second feature map in this embodiment may be a feature map whose structure matches the subsequent perception task. Optionally, the decoding branch network 103 may be any network capable of feature extraction, for example, a convolutional neural network; this embodiment does not limit the specific network structure.

The first network model provided in this embodiment is intended to support visual perception tasks; therefore, any visual perception task can be connected after the first network model, and visual perception tasks may include, but are not limited to, segmentation tasks, detection tasks, classification tasks, and the like.

In this embodiment, spatial fusion and temporal fusion are implemented inside the first network model, so that perturbations of the camera intrinsic and extrinsic parameters caused by camera pose changes while the vehicle is driving do not act directly on the perception results; these perturbations are overcome by training the first network model, so that the output perception results are more stable and unaffected by camera pose changes during driving. Because spatial and temporal fusion take place inside the first network model, there is no need to perform them in post-processing, which avoids fusion in post-processing and reduces its complexity, allows the first network model to be learned end to end, and offers the potential for joint learning with candidate visual perception tasks. Moreover, the perception fusion method through the first network model provided by this application places lower computational-complexity requirements on the chip.

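To make the three-branch structure above concrete, the following PyTorch sketch shows one minimal, hypothetical arrangement of the encoding, fusion, and decoding branches. The class name, layer choices, and tensor layout are illustrative assumptions, not the patent's actual architecture, and the fusion step is a stand-in that assumes the per-view maps are already aligned (the actual warping into the bird's-eye view is described in the sections below):

```python
import torch
import torch.nn as nn

class FirstNetworkModel(nn.Module):
    """Hypothetical sketch of the encoding / fusion / decoding branch structure."""

    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        # Encoding branch network 101: any feature extractor works; a tiny CNN here.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoding branch network 103: maps the fused BEV feature to a task feature space.
        self.decoder = nn.Sequential(nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())

    def fuse(self, first_maps):
        # Fusion branch network 102 (placeholder): in the real model each first feature
        # map is first warped into the bird's-eye-view coordinate system; here we simply
        # take an element-wise maximum over maps assumed to be aligned already.
        return torch.stack(first_maps, dim=0).max(dim=0).values

    def forward(self, images):
        # images: (B, N*T, 3, H, W), frames from N cameras over T consecutive moments.
        b, nt, c, h, w = images.shape
        feats = self.encoder(images.flatten(0, 1))          # first feature maps
        first_maps = list(feats.unflatten(0, (b, nt)).unbind(1))
        second_map = self.fuse(first_maps)                  # second feature map (BEV)
        return self.decoder(second_map)                     # decoded second feature map
```

A task head, such as the segmentation sketch later in this document, would then consume the decoder output.
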
Exemplary Method

FIG. 2 is a schematic flowchart of a vehicle-mounted visual perception method provided by an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device and, as shown in FIG. 2, includes the following steps:

Step 201: acquire images at a plurality of consecutive moments through a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle to obtain a plurality of image frame sets.

Each image frame set includes multiple image frames collected by the same vehicle-mounted camera, and each vehicle-mounted camera corresponds to one image frame at each moment.

Optionally, since each vehicle-mounted camera corresponds to one image frame set, multiple vehicle-mounted cameras yield multiple image frame sets. The cameras may face different directions, and their image acquisition ranges may or may not overlap. Usually, multiple vehicle-mounted cameras are arranged at multiple preset positions so that all directions around the vehicle are captured, providing more information for automatic or assisted driving.

Step 202: perform feature extraction on the image frames included in the plurality of image frame sets through the encoding branch network in the first network model to obtain a plurality of first feature maps.

Optionally, feature extraction is performed in turn on the image frames included in the plurality of image frame sets through the encoding branch network in the first network model to obtain the plurality of first feature maps; the encoding branch network in this embodiment can be understood with reference to the encoding branch network 101 provided in FIG. 1, which implements feature extraction on the image frames.

Step 203: perform spatial fusion and temporal fusion on the plurality of first feature maps through the fusion branch network in the first network model to obtain a second feature map in the bird's-eye-view image coordinate system.

Optionally, the fusion branch network in this embodiment can be understood with reference to the fusion branch network 102 provided in FIG. 1, which implements fusion of the first feature maps.

Step 204: recognize the second feature map based on the network model corresponding to a preset perception task, and determine the perception result corresponding to the preset perception task.

The preset perception task in this embodiment may be any visual perception task, for example, a segmentation task, a detection task, or a classification task; the recognition operation of this step is implemented through the network model corresponding to that visual perception task.

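As an illustration of such a task-specific network model, the sketch below is a minimal, hypothetical BEV segmentation head consuming the fused second feature map; the class name, channel sizes, and class set are assumptions for illustration only:

```python
import torch.nn as nn

class BevSegmentationHead(nn.Module):
    """Hypothetical head for one preset perception task: per-cell classification
    (e.g., free space / lane marking / obstacle) over the BEV second feature map."""

    def __init__(self, feat_ch=64, num_classes=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, num_classes, 1),   # class logits for each BEV cell
        )

    def forward(self, second_feature_map):        # (B, feat_ch, H_bev, W_bev)
        return self.head(second_feature_map)      # (B, num_classes, H_bev, W_bev)
```
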
The vehicle-mounted visual perception method provided by the above embodiment of the present disclosure implements spatial fusion and temporal fusion inside the first network model, realizing end-to-end learning of the neural network. Since post-processing fusion is not required, the complexity of performing spatial and temporal fusion of images in post-processing is effectively avoided, as is the misidentification of the same target as multiple targets during post-processing.

As shown in FIG. 3, on the basis of the embodiment shown in FIG. 2 above, step 203 may include the following steps:

Step 2031: for each of the multiple moments, perform spatial fusion on the multiple first feature maps corresponding to that moment to obtain multiple third feature maps in the bird's-eye-view image coordinate system.

Each third feature map corresponds to one moment.

In this embodiment, the image frames collected by each vehicle-mounted camera are frames in the camera coordinate system of that camera. Before fusion, the multiple first feature maps are therefore first projected into the bird's-eye-view image coordinate system, and the multiple feature maps in the bird's-eye-view image coordinate system are then fused. The fusion of multiple feature maps in the same coordinate system can be implemented by feature map fusion methods in the related art, for example, element-wise addition, concatenation along the feature channel dimension, element-wise maximum, neural network fusion, and the like.

Step 2032: perform temporal fusion on the multiple third feature maps to obtain the second feature map.

In this embodiment, each third feature map corresponds to one moment. By reconstructing the third feature map of each moment into the third feature map of a certain moment (for example, the most recent moment, corresponding to the last collected image frame) and fusing the reconstructed third feature maps, the second feature map is obtained. The fusion in this step can likewise be implemented by feature map fusion methods in the related art, for example, element-wise addition, concatenation along the feature channel dimension, element-wise maximum, neural network fusion, and the like. In this embodiment, spatial fusion is performed first and temporal fusion afterwards, which makes temporal fusion easier to implement and speeds up feature map fusion.

As shown in FIG. 4, on the basis of the embodiment shown in FIG. 3 above, step 2031 may include the following steps:

Step 401: perform a homography transformation on each of the multiple first feature maps to obtain multiple transformed feature maps in the bird's-eye-view image coordinate system.

A homography transformation (also called a projective transformation) has 8 degrees of freedom and describes the mapping between points on two planes. Converting each first feature map from its image coordinate system into the bird's-eye-view image coordinate system through a homography transformation yields the multiple transformed feature maps.

Step 402: perform point-by-point fusion on the multiple transformed feature maps to obtain a third feature map in the bird's-eye-view image coordinate system.

In this embodiment, the homography transformation produces multiple transformed feature maps in the same coordinate system and of the same size, so the transformed feature maps can be fused point by point; point-by-point fusion methods may include, but are not limited to, element-wise addition, concatenation along the feature channel dimension, element-wise maximum, neural network fusion, and the like. In this embodiment, the homography transformation maps the multiple first feature maps, which correspond to different coordinate systems, into the same coordinate system, converting the feature map of each view from the camera view to the bird's-eye view, and fusion is performed in that common coordinate system, so that the fused third feature map carries the image features of all angles around the vehicle. When a visual perception task is performed based on this third feature map, the perception results of the different vehicle-mounted cameras are fused without post-processing, which improves the fusion efficiency of the perception results and reduces the difficulty of fusion.

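The point-by-point fusion options named above can be sketched in a few lines. The following hypothetical helper assumes the transformed feature maps have already been warped onto the same BEV grid; "neural network fusion" would instead pass, for example, the channel concatenation through a small learned convolution:

```python
import torch

def pointwise_fuse(transformed_maps, mode="max"):
    """Fuse N same-sized transformed feature maps, each (B, C, H, W), point by point."""
    stacked = torch.stack(transformed_maps, dim=0)    # (N, B, C, H, W)
    if mode == "add":                                  # element-wise addition
        return stacked.sum(dim=0)
    if mode == "max":                                  # element-wise maximum
        return stacked.max(dim=0).values
    if mode == "concat":                               # concatenation along channels
        return torch.cat(transformed_maps, dim=1)      # (B, N*C, H, W)
    raise ValueError(f"unknown fusion mode: {mode}")
```
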
As shown in FIG. 5a, on the basis of the embodiment shown in FIG. 4 above, step 401 may include the following steps:

Step 4011: based on the intrinsic parameter matrix and extrinsic parameter matrix of the vehicle-mounted camera corresponding to each first feature map, determine the first transformation matrix from the ego-vehicle coordinate system to the camera coordinate system corresponding to that first feature map.

In this embodiment, the intrinsic parameter matrix corresponding to each vehicle-mounted camera is a known 3×3 matrix K, and the extrinsic parameter matrix corresponding to a vehicle-mounted camera can be computed from the preset position at which the camera is mounted on the vehicle. Based on the intrinsic and extrinsic parameter matrices, the first transformation matrix T_vcs2cam between the ego-vehicle coordinate system and the camera coordinate system can be determined; the first transformation matrix includes rotation parameters and translation parameters. For example, in an optional example, the first transformation matrix can be expressed as:

$$T_{vcs2cam}=\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

where r_11, r_12, r_13, r_21, r_22, r_23, r_31, r_32, and r_33 are rotation parameters, and t_1, t_2, and t_3 are translation parameters.

Step 4012: based on the sensing range on the preset plane of the ego-vehicle coordinate system and the scaling ratio and translation distance of the bird's-eye-view image coordinate system, determine the second transformation matrix from the bird's-eye-view image coordinate system to the preset plane of the ego-vehicle coordinate system.

Optionally, as shown in FIG. 5b, d1, d2, d3, and d4 denote the sensing range of the ego-vehicle coordinate system on the preset plane (the xy plane, whose center is denoted by O_vcs in the figure). Based on this sensing range, the translation distance of the ego-vehicle coordinate system relative to the bird's-eye-view image coordinate system can be determined (d1 and d3 in FIG. 5b). Assuming that the bird's-eye-view imaging plane (whose center may be O_BEV in FIG. 5b) coincides with (or is parallel to) the xy plane of the ego-vehicle coordinate system, and letting r denote the scaling ratio from the xy plane of the ego-vehicle coordinate system to the bird's-eye-view imaging plane, the second transformation matrix can be expressed as:

$$T_{bev2vcs\_xy}=\begin{bmatrix} \tfrac{1}{r} & 0 & -d_1 \\ 0 & \tfrac{1}{r} & -d_3 \\ 0 & 0 & 1 \end{bmatrix}$$

Step 4013: based on the first transformation matrix, the second transformation matrix, and the intrinsic parameter matrix of the vehicle-mounted camera, determine the third transformation matrix from the bird's-eye-view image coordinate system to the image coordinate system corresponding to the vehicle-mounted camera.

Optionally, in this embodiment, the conversion matrix from the bird's-eye-view image coordinate system to the camera coordinate system can be derived from the first transformation matrix and the second transformation matrix; combined with the intrinsic parameter matrix, the third transformation matrix T_bev2img between the bird's-eye-view image coordinate system and the image plane corresponding to the vehicle-mounted camera (i.e., the image coordinate system) can then be determined.

Step 4014: transform the first feature map based on the third transformation matrix to obtain the transformed feature map.

In this embodiment, determining the third transformation matrix as described above establishes the conversion relationship between the image coordinate system and the bird's-eye-view image coordinate system. Through this conversion relationship, the feature at image position (u1, v1) can be reverse-indexed from the bird's-eye-view image position (u, v); for example, the feature mapping can be implemented by the following formula (1):

$$\begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} \sim T_{bev2img}\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \qquad \text{Formula (1)}$$

where "~" denotes equality up to the homogeneous scale factor.

The transformed feature map f_n after view conversion is thereby obtained. Similarly, the above transformation is performed on the first feature map corresponding to each view to obtain N view-converted transformed feature maps {f_n}, n = 1, 2, …, N, where N is the number of vehicle-mounted cameras and can be set according to the specific scene. Finally, feature fusion is performed on these maps to obtain the third feature map F_t at moment t (one of the multiple moments) after spatial fusion.

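A minimal NumPy sketch of this reverse-index warping follows, assuming T_bev2img is given as a 3×3 array; nearest-neighbor sampling keeps it short (bilinear sampling is the more common choice), and the function name is illustrative:

```python
import numpy as np

def warp_to_bev(first_map, T_bev2img, bev_h, bev_w):
    """Reverse-index warping of formula (1): for each BEV location (u, v), map it with
    T_bev2img to an image-plane location (u1, v1) and sample the feature map there.
    first_map: (C, H, W) array in the camera's image coordinate system."""
    c, h, w = first_map.shape
    u, v = np.meshgrid(np.arange(bev_w), np.arange(bev_h))         # BEV pixel grid
    pts = T_bev2img @ np.stack([u, v, np.ones_like(u)]).reshape(3, -1)
    with np.errstate(divide="ignore", invalid="ignore"):
        u1, v1 = pts[0] / pts[2], pts[1] / pts[2]                  # dehomogenize
    u1 = np.nan_to_num(np.rint(u1), nan=-1.0, posinf=-1.0, neginf=-1.0).astype(int)
    v1 = np.nan_to_num(np.rint(v1), nan=-1.0, posinf=-1.0, neginf=-1.0).astype(int)
    valid = (u1 >= 0) & (u1 < w) & (v1 >= 0) & (v1 < h)            # in-bounds samples only
    out = np.zeros((c, bev_h, bev_w), dtype=first_map.dtype)
    out.reshape(c, -1)[:, valid] = first_map[:, v1[valid], u1[valid]]
    return out
```
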
Optionally, on the basis of the above embodiment, step 4013 may further include:

a1: based on the first transformation matrix, determine a fourth transformation matrix from the preset plane of the ego-vehicle coordinate system to the camera coordinate system.

Optionally, assume the imaging plane of the bird's-eye view coincides with (or is parallel to) the xy plane of the ego-vehicle coordinate system; the xy plane is the plane on which the z value of three-dimensional point coordinates in the ego-vehicle coordinate system equals 0 (or equals any common constant). In the coincident case, the column corresponding to the z axis (the third column) of the first transformation matrix T_vcs2cam can simply be eliminated; in the parallel case, the third column can be set to a preset value. This yields the fourth transformation matrix T_vcs_xy2cam between the imaging plane of the bird's-eye-view image coordinate system and the camera coordinate system. For example, in an optional example (where the imaging plane of the bird's-eye-view image coordinate system coincides with the xy plane of the ego-vehicle coordinate system), the fourth transformation matrix can be expressed as:

$$T_{vcs\_xy2cam}=\begin{bmatrix} r_{11} & r_{12} & t_1 \\ r_{21} & r_{22} & t_2 \\ r_{31} & r_{32} & t_3 \\ 0 & 0 & 1 \end{bmatrix}$$

a2: based on the fourth transformation matrix and the second transformation matrix, determine a fifth transformation matrix from the bird's-eye-view image coordinate system to the camera coordinate system.

Optionally, after the fourth transformation matrix is determined, combining it with the second transformation matrix from the bird's-eye-view image coordinate system to the ego-vehicle coordinate system, matrix multiplication of the fourth and second transformation matrices yields the fifth transformation matrix between the bird's-eye-view image coordinate system and the camera coordinate system. For example, the fifth transformation matrix T_bev2cam can be determined based on the following formula (2):

$$T_{bev2cam}=T_{vcs\_xy2cam}\times T_{bev2vcs\_xy} \qquad \text{Formula (2)}$$

a3: based on the fifth transformation matrix and the intrinsic parameter matrix of the vehicle-mounted camera, determine the third transformation matrix.

In this embodiment, the numbers of rows and columns of the fifth transformation matrix do not match those of the intrinsic parameter matrix. To make matrix multiplication possible (the intrinsic parameter matrix is a 3×3 matrix), a truncation operation can be performed on the fifth transformation matrix: its first three rows are taken to obtain a truncated transformation matrix. For example, the first three rows of the fifth transformation matrix T_bev2cam are taken to obtain the truncated transformation matrix T′_bev2cam. Matrix multiplication of the truncated transformation matrix and the intrinsic parameter matrix then determines the third transformation matrix, for example, as shown in the following formula (3):

Tbev2img=K×T′bev2cam 公式(3)T bev2img = K×T′ bev2cam formula (3)

Through the above processing, the third transformation matrix is obtained, based on which the first feature map in the image coordinate system can be converted into the imaging plane of the bird's-eye-view image coordinate system. In this embodiment, the third transformation matrix for converting between the bird's-eye-view image coordinate system and the image coordinate system is determined through coordinate system relationship conversion, so that features in the image coordinate system can be converted directly into the bird's-eye-view image coordinate system based on the third transformation matrix; fast spatial fusion is realized in the fusion branch of the first network model, and the specific parameters of the coordinate system conversion are determined through learning of the first network model, which improves the accuracy and speed of the conversion.

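Putting steps a1 to a3 together, a NumPy sketch of the composition is shown below. It assumes the coincident-plane case, and the concrete form of T_bev2vcs_xy follows the reconstruction given for the second transformation matrix above, whose exact signs depend on the axis conventions of FIG. 5b:

```python
import numpy as np

def bev_to_image_matrix(K, T_vcs2cam, r, d1, d3):
    """Compose the third transformation matrix T_bev2img per formulas (2) and (3),
    assuming the BEV imaging plane coincides with the ego xy plane (z = 0)."""
    # a1 / fourth matrix: drop the z column (column 3) of the 4x4 T_vcs2cam -> 4x3.
    T_vcs_xy2cam = T_vcs2cam[:, [0, 1, 3]]
    # Second transformation matrix: BEV pixel (u, v, 1) -> ego xy plane (x, y, 1);
    # the 1/r scale and -d1 / -d3 translation follow the reconstruction above.
    T_bev2vcs_xy = np.array([[1.0 / r, 0.0, -d1],
                             [0.0, 1.0 / r, -d3],
                             [0.0, 0.0, 1.0]])
    # a2 / fifth matrix, formula (2): BEV -> camera coordinates (4x3).
    T_bev2cam = T_vcs_xy2cam @ T_bev2vcs_xy
    # a3, formula (3): truncate to the first three rows, then apply the intrinsics K.
    return K @ T_bev2cam[:3]                       # 3x3 T_bev2img
```
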
As shown in FIG. 6, on the basis of the embodiment shown in FIG. 3 above, step 2032 may include the following steps:

Step 601: take the third feature map corresponding to the most recent of the multiple moments as the reference feature map.

Optionally, when the multiple moments include moments t, t-1, …, t-s, the third feature map corresponding to moment t (the most recent moment) can be taken as the reference feature map. Of course, this is only a specific example; in practical applications, the third feature map corresponding to any moment can serve as the reference feature map without affecting the operation or result of temporal fusion. When the most recent moment is chosen for the reference feature map, the result of temporal fusion corresponds to the most recent moment, that is, the output second feature map corresponds to the most recent moment (for example, the current moment), which improves the real-time performance of the perception result determined from the second feature map, and thus the real-time performance of the vehicle-mounted visual perception method.

Step 602: reconstruct each third feature map of the at least one third feature map to obtain at least one fourth feature map.

In this embodiment, so that the third feature maps corresponding to the multiple moments can be fused, each third feature map is mapped into the feature space corresponding to the reference feature map for feature fusion.

Step 603: perform point-by-point fusion on the reference feature map and the at least one fourth feature map to obtain the second feature map.

Optionally, point-by-point fusion methods may include, but are not limited to, element-wise addition, concatenation along the feature channel dimension, element-wise maximum, neural network fusion, and the like. In this embodiment, temporal fusion of the third feature maps obtained after spatial fusion for each of the multiple moments realizes spatial fusion followed by temporal fusion within the fusion branch network of the first network model; the multi-frame image frames collected from multiple views at multiple moments are fused, inside the first network model, into one second feature map in the bird's-eye-view image coordinate system, so that the first network model can learn end to end how to fuse feature maps spatially and temporally, without external post-processing for spatial and temporal fusion after the network model, which reduces the complexity of post-processing and reduces the ambiguity introduced by fusion in post-processing.

Optionally, on the basis of the above embodiment, step 602 may further include:

b1: based on the inter-frame motion transformation matrices of the vehicle-mounted camera between the most recent moment and the at least one moment corresponding to the at least one third feature map, determine at least one homography transformation matrix between the first feature map corresponding to the reference feature map and the first feature maps corresponding to the at least one third feature map.

Optionally, since the position of the vehicle may change while the vehicle-mounted camera collects images at different moments, the camera position changes correspondingly. Based on the inter-frame motion of the vehicle (each moment corresponds to one image frame) and the known height d of a vehicle-mounted camera at its set position (for example, a front-view camera) above the xy plane of the ego-vehicle coordinate system (d belongs to the camera extrinsic parameters and can be determined by existing algorithms or from information collected by sensors), the homography transformation matrix H between two consecutive frames of any view can be computed. Assuming the rotation from moment t to moment t-1 is R and the translation is m, the homography transformation matrix H can be determined based on the following formula (4):

$$H=K\left(R-\frac{m\,n^{T}}{d}\right)K^{-1} \qquad \text{Formula (4)}$$

where n = [0, 0, 1]^T is the normal of the xy plane of the ego-vehicle coordinate system, K is the intrinsic parameter matrix of the vehicle-mounted camera, and d is the height of the vehicle-mounted camera above the xy plane of the ego-vehicle coordinate system.

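A NumPy sketch of formula (4) as reconstructed above; the placement and sign of the m nᵀ/d term follow the standard plane-induced homography and are an assumption where the original equation image is unavailable:

```python
import numpy as np

def interframe_homography(K, R, m, d):
    """Formula (4) as reconstructed above: H = K (R - m n^T / d) K^-1, with
    n = [0, 0, 1]^T the normal of the ego xy plane and d the camera height above it."""
    n_T = np.array([[0.0, 0.0, 1.0]])              # row vector n^T
    m = np.asarray(m, dtype=float).reshape(3, 1)   # translation as a column vector
    return K @ (R - (m @ n_T) / d) @ np.linalg.inv(K)
```
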
b2: based on the at least one homography transformation matrix and the third transformation matrix, determine at least one conversion matrix between the reference feature map and the at least one third feature map.

In this embodiment, combining the third transformation matrix used for spatial conversion with the homography transformation matrix corresponding to each third feature map, one conversion matrix is determined for each third feature map. Optionally, the conversion matrix T_temp can be determined based on the following formula (5):

$$T_{temp}=T_{bev2img}^{-1}\times H\times T_{bev2img} \qquad \text{Formula (5)}$$

b3: reconstruct each third feature map of the at least one third feature map based on the corresponding conversion matrix of the at least one conversion matrix to obtain the at least one fourth feature map.

In this embodiment, taking the third feature map at moment t-1 as an example, the feature at position (U_{t-1}, V_{t-1}) after spatial fusion at moment t-1 can be mapped through the conversion matrix T_temp to position (U_t, V_t) of the spatially fused feature at moment t, thereby obtaining the reconstructed moment-t feature F′_t; that is, the reconstruction of the third feature map at moment t-1 into a fourth feature map is realized through the conversion matrix. By analogy, the third feature map of each moment is reconstructed to obtain at least one fourth feature map. Feature map reconstruction makes it possible to fuse the feature maps corresponding to different moments. Performing spatial fusion first and temporal fusion afterwards reduces the number of feature maps that temporal fusion must process and lowers the difficulty of temporal fusion; moreover, realizing temporal fusion by reusing the third transformation matrix from spatial fusion improves the reuse of parameters and the fusion efficiency.

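The sketch below strings formula (5) together with the warp_to_bev helper from the earlier sketch to reconstruct the t-1 BEV map into the frame of moment t and fuse it with the reference map; the composition order of T_temp and the choice of element-wise maximum are assumptions consistent with the reconstruction above:

```python
import numpy as np

def temporal_fuse(F_t, F_t_minus_1, T_bev2img, H):
    """Reconstruct the t-1 spatially fused map into the t frame via T_temp
    (formula (5)), then fuse it point by point with the reference map F_t."""
    T_temp = np.linalg.inv(T_bev2img) @ H @ T_bev2img    # BEV(t) -> BEV(t-1) indexing
    c, bev_h, bev_w = F_t.shape
    # Reuse the reverse-index warp sketched after formula (1).
    F_prime_t = warp_to_bev(F_t_minus_1, T_temp, bev_h, bev_w)   # fourth feature map
    return np.maximum(F_t, F_prime_t)                    # element-wise maximum fusion
```
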
As shown in FIG. 7, on the basis of the embodiment shown in FIG. 2 above, step 204 may include the following steps:

Step 2041: decode the second feature map through the decoding branch network in the first network model to obtain a decoded second feature map.

Optionally, the decoding branch network in this embodiment can be understood with reference to the decoding branch network 103 in the embodiment provided in FIG. 1.

Step 2042: recognize the decoded second feature map based on a second network model, and determine the perception result corresponding to the preset perception task.

In this embodiment, the second feature map can be mapped through the decoding branch network into the feature space required by the preset perception task, so that the network model corresponding to the subsequent preset perception task can process the mapped feature map directly, eliminating other intermediate processing. The network model of the preset perception task is thus connected directly to the first network model, enabling joint learning of the network models, and joint learning improves the accuracy of the perception result of the preset perception task.

Any vehicle-mounted visual perception method provided by the embodiments of the present disclosure may be executed by any appropriate device with data processing capability, including, but not limited to, a terminal device, a server, and the like. Alternatively, any vehicle-mounted visual perception method provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor executes any vehicle-mounted visual perception method mentioned in the embodiments of the present disclosure by invoking corresponding instructions stored in a memory. Details are not repeated below.

示例性装置Exemplary device

图8是本公开一示例性实施例提供的车载视觉感知装置的结构示意图。如图8所示,本实施例提供的装置包括:FIG. 8 is a schematic structural diagram of a vehicle-mounted visual perception device provided by an exemplary embodiment of the present disclosure. As shown in FIG. 8 , the device provided in this embodiment includes:

图像采集模块81,用于通过设置在车辆上预设位置的多个车载摄像头在连续多个时刻进行图像采集,得到多个图像帧集合。The image acquisition module 81 is configured to perform image acquisition at a plurality of consecutive moments through a plurality of vehicle-mounted cameras set at preset positions on the vehicle to obtain a plurality of image frame sets.

其中,每个图像帧集合中包括基于同一车载摄像头采集的多帧图像帧,每一所述车载摄像头在每一时刻对应一帧图像帧。Wherein, each image frame set includes multiple frames of image frames collected based on the same vehicle-mounted camera, and each of the vehicle-mounted cameras corresponds to one frame of image frame at each moment.

编码模块82,用于通过第一网络模型中的编码分支网络对图像采集模块81得到的多个图像帧集合中包括的图像帧进行特征提取,得到多个第一特征图。The encoding module 82 is configured to perform feature extraction on the image frames included in the multiple image frame sets obtained by the image acquisition module 81 through the encoding branch network in the first network model to obtain multiple first feature maps.

融合模块83,用于通过第一网络模型中的融合分支网络对编码模块82确定的多个第一特征图执行空间融合和时序融合,得到鸟瞰图像坐标系下的第二特征图。The fusion module 83 is configured to perform spatial fusion and time series fusion on the plurality of first feature maps determined by the encoding module 82 through the fusion branch network in the first network model, to obtain a second feature map in the bird's-eye image coordinate system.

感知模块84,用于基于预设感知任务对应的网络模型对融合模块83确定的第二特征图进行识别,得到预设感知任务对应的感知结果。The sensing module 84 is configured to identify the second feature map determined by the fusion module 83 based on the network model corresponding to the preset sensing task, and obtain the sensing result corresponding to the preset sensing task.

The vehicle-mounted visual perception device provided by the above embodiments of the present disclosure performs spatial fusion and temporal fusion inside the first neural network, thereby realizing end-to-end learning of the neural network. Because no post-processing fusion is required, the complexity of spatially and temporally fusing images in post-processing is avoided, as is the misidentification of a single target as multiple targets during post-processing.

FIG. 9 is a schematic structural diagram of a vehicle-mounted visual perception device provided by another exemplary embodiment of the present disclosure. As shown in FIG. 9, in the device provided by this embodiment, the fusion module 83 includes:

The spatial fusion unit 831 is configured to, for each of the plurality of moments, perform spatial fusion on the plurality of first feature maps corresponding to that moment, to obtain a plurality of third feature maps in the bird's-eye-view image coordinate system.

Each third feature map corresponds to one moment.

The temporal fusion unit 832 is configured to perform temporal fusion on the plurality of third feature maps to obtain the second feature map.

Optionally, the spatial fusion unit 831 is specifically configured to perform a homography transformation on each of the plurality of first feature maps, to obtain a plurality of transformed feature maps in the bird's-eye-view image coordinate system, and to fuse the plurality of transformed feature maps point by point, to obtain the third feature map in the bird's-eye-view image coordinate system.
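The patent leaves the point-by-point fusion operator unspecified; the sketch below assumes a mask-weighted mean over the per-camera transformed feature maps (the function and variable names, and the choice of mean, are illustrative):

```python
import numpy as np

def fuse_point_by_point(warped: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Fuse per-camera BEV feature maps into one third feature map.

    warped: (n_cam, C, H, W) transformed feature maps on the BEV grid.
    masks:  (n_cam, 1, H, W) 1.0 where the camera actually observes
            the BEV cell, 0.0 elsewhere.
    """
    total = (warped * masks).sum(axis=0)
    count = masks.sum(axis=0).clip(min=1e-6)  # avoid division by zero
    return total / count
```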

Optionally, when performing a homography transformation on each of the plurality of first feature maps to obtain a plurality of transformed feature maps in the bird's-eye-view image coordinate system, the spatial fusion unit 831 is configured to: determine, based on the intrinsic parameter matrix and the extrinsic parameter matrix of the vehicle-mounted camera corresponding to each first feature map, a first transformation matrix from the ego-vehicle coordinate system to the camera coordinate system corresponding to the first feature map; determine, based on the perception range on a preset plane of the ego-vehicle coordinate system and the scaling ratio and translation distance of the bird's-eye-view image coordinate system, a second transformation matrix from the bird's-eye-view image coordinate system to the preset plane of the ego-vehicle coordinate system; determine, based on the first transformation matrix, the second transformation matrix, and the intrinsic parameter matrix of the vehicle-mounted camera, a third transformation matrix from the bird's-eye-view image coordinate system to the image coordinate system corresponding to the vehicle-mounted camera; and transform the first feature map based on the third transformation matrix, to obtain the transformed feature map.

Optionally, when determining the third transformation matrix from the bird's-eye-view image coordinate system to the image coordinate system corresponding to the vehicle-mounted camera based on the first transformation matrix, the second transformation matrix, and the intrinsic parameter matrix of the vehicle-mounted camera, the spatial fusion unit 831 is configured to: determine, based on the first transformation matrix, a fourth transformation matrix from the preset plane of the ego-vehicle coordinate system to the camera coordinate system; determine, based on the fourth transformation matrix and the second transformation matrix, a fifth transformation matrix from the bird's-eye-view image coordinate system to the camera coordinate system; and determine the third transformation matrix based on the fifth transformation matrix and the intrinsic parameter matrix of the vehicle-mounted camera.
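A minimal NumPy sketch of this matrix chain, assuming the preset plane is the ground plane z = 0 of the ego-vehicle coordinate system and that the extrinsic matrix maps ego coordinates into camera coordinates (all names are illustrative):

```python
import numpy as np

def bev_to_image_homography(K, T_cam_from_ego, scale, offset):
    """Build the third transformation matrix: BEV pixel -> image pixel.

    K:              3x3 camera intrinsic parameter matrix.
    T_cam_from_ego: 4x4 first transformation matrix (ego -> camera).
    scale:          (sx, sy) metres per BEV pixel (scaling ratio).
    offset:         (ox, oy) ego coordinates of BEV pixel (0, 0)
                    (translation distance).
    """
    sx, sy = scale
    ox, oy = offset
    # Second matrix: BEV pixel (u, v, 1) -> ego ground plane (x, y, 1).
    M_bev2plane = np.array([[sx, 0.0, ox],
                            [0.0, sy, oy],
                            [0.0, 0.0, 1.0]])
    # Fourth matrix: ground plane -> camera; with z = 0, only the first
    # two rotation columns and the translation of the extrinsic survive.
    R, t = T_cam_from_ego[:3, :3], T_cam_from_ego[:3, 3]
    M_plane2cam = np.column_stack([R[:, 0], R[:, 1], t])
    # Fifth matrix: BEV pixel -> camera coordinates.
    M_bev2cam = M_plane2cam @ M_bev2plane
    # Third matrix: BEV pixel -> image pixel (a 3x3 homography).
    return K @ M_bev2cam
```

Because the resulting matrix maps BEV pixels into the source image, each first feature map can then be resampled into the bird's-eye view, for example channel by channel with cv2.warpPerspective and the cv2.WARP_INVERSE_MAP flag.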

Optionally, the temporal fusion unit 832 is specifically configured to: take the third feature map corresponding to the latest of the plurality of moments as a reference feature map; reconstruct each of the at least one remaining third feature map, to obtain at least one fourth feature map; and fuse the reference feature map and the at least one fourth feature map point by point, to obtain the second feature map.

Optionally, when reconstructing each of the at least one third feature map to obtain the at least one fourth feature map, the temporal fusion unit 832 is configured to: determine, based on the inter-frame motion transformation matrices of the vehicle-mounted camera between the latest moment and the at least one moment corresponding to the at least one third feature map, at least one homography transformation matrix between the first feature map corresponding to the reference feature map and the first feature map corresponding to the at least one third feature map; determine, based on the at least one homography transformation matrix and the third transformation matrix, at least one conversion matrix between the reference feature map and the at least one third feature map; and reconstruct each of the at least one third feature map based on each of the at least one conversion matrix, to obtain the at least one fourth feature map.
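A simplified sketch of this reconstruction, under the stronger assumption that ego-motion is planar, so each inter-frame motion reduces to a 3x3 rigid transform on the ground plane and the conversion matrix can be composed directly in BEV pixel space (all names are illustrative):

```python
import numpy as np
import cv2

def temporally_fuse(bev_feats, ego_motions, M_bev2plane):
    """Warp past third feature maps to the latest moment and fuse.

    bev_feats:   list of (C, H, W) float32 maps, ordered oldest -> latest.
    ego_motions: list of 3x3 transforms, one per past moment, taking
                 ground-plane ego coordinates at that moment into the
                 ego coordinates of the latest moment.
    M_bev2plane: 3x3 second transformation matrix (BEV pixel -> plane).
    """
    M_plane2bev = np.linalg.inv(M_bev2plane)
    ref = bev_feats[-1]
    _, h, w = ref.shape
    aligned = [ref]
    for feat, motion in zip(bev_feats[:-1], ego_motions):
        # Conversion matrix: past BEV pixel -> latest BEV pixel.
        M = M_plane2bev @ motion @ M_bev2plane
        # warpPerspective inverts M internally, sampling the past map at
        # the locations that move onto the latest BEV grid.
        aligned.append(np.stack([
            cv2.warpPerspective(ch, M, (w, h), flags=cv2.INTER_LINEAR)
            for ch in feat]))
    return np.mean(aligned, axis=0)  # point-by-point fusion (mean assumed)
```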

In some optional embodiments, the perception module 84 includes:

The decoding unit 841 is configured to decode the second feature map through the decoding branch network in the first network model, to obtain a decoded second feature map.

The feature identification unit 842 is configured to identify the decoded second feature map based on the second network model, and determine the perception result corresponding to the preset perception task.

Exemplary electronic device

Hereinafter, an electronic device according to an embodiment of the present disclosure is described with reference to FIG. 10. The electronic device may be either or both of a first device 100 and a second device 200, or a stand-alone device independent of them; the stand-alone device may communicate with the first device and the second device to receive the collected input signals from them.

FIG. 10 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure.

As shown in FIG. 10, the electronic device 10 includes one or more processors 11 and a memory 12.

The processor 11 may be a central processing unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 10 to perform desired functions.

The memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may execute the program instructions to implement the vehicle-mounted visual perception methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as input signals, signal components, and noise components may also be stored in the computer-readable storage medium.

In one example, the electronic device 10 may further include an input device 13 and an output device 14, which are interconnected through a bus system and/or another form of connection mechanism (not shown).

For example, when the electronic device is the first device 100 or the second device 200, the input device 13 may be the above-mentioned microphone or microphone array for capturing the input signal of a sound source. When the electronic device is a stand-alone device, the input device 13 may be a communication network connector for receiving the collected input signals from the first device 100 and the second device 200.

In addition, the input device 13 may also include, for example, a keyboard and a mouse.

The output device 14 may output various information to the outside, including the determined distance information, direction information, and the like. The output device 14 may include, for example, a display, a speaker, a printer, a communication network, and remote output devices connected to the network.

Of course, for simplicity, FIG. 10 shows only some of the components of the electronic device 10 that are related to the present disclosure; components such as buses and input/output interfaces are omitted. In addition, the electronic device 10 may include any other suitable components depending on the specific application.

Exemplary computer program product and computer-readable storage medium

In addition to the methods and devices described above, an embodiment of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps of the vehicle-mounted visual perception methods according to the various embodiments of the present disclosure described in the "Exemplary method" section of this specification.

The computer program product may include program code, written in any combination of one or more programming languages, for performing the operations of the embodiments of the present disclosure; the programming languages include object-oriented languages such as Java and C++, as well as conventional procedural languages such as C or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

In addition, an embodiment of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps of the vehicle-mounted visual perception methods according to the various embodiments of the present disclosure described in the "Exemplary method" section of this specification.

The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

The basic principles of the present disclosure have been described above with reference to specific embodiments. However, it should be pointed out that the advantages and effects mentioned in the present disclosure are merely examples rather than limitations, and should not be considered necessary for each embodiment of the present disclosure. In addition, the specific details disclosed above are provided only for the purposes of illustration and ease of understanding, not for limitation; the present disclosure need not be implemented using the above specific details.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Since the system embodiments basically correspond to the method embodiments, their description is relatively brief; for related details, refer to the description of the method embodiments.

The block diagrams of the devices, apparatuses, equipment, and systems referred to in the present disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements, or configurations must be made in the manner shown in the block diagrams. As those skilled in the art will appreciate, these devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner. Words such as "including", "comprising", and "having" are open-ended, mean "including but not limited to", and may be used interchangeably with that phrase. The words "or" and "and" as used herein refer to "and/or" and may be used interchangeably with it, unless the context clearly indicates otherwise. The word "such as" as used herein refers to the phrase "such as but not limited to" and may be used interchangeably with it.

The methods and apparatuses of the present disclosure may be implemented in many ways, for example, in software, hardware, firmware, or any combination thereof. The above order of the steps of the methods is for illustration only; the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specified. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the methods according to the present disclosure.

It should also be noted that, in the apparatuses, devices, and methods of the present disclosure, the components or steps may be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent solutions of the present disclosure.

The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present disclosure. Therefore, the present disclosure is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for the purposes of illustration and description. It is not intended to limit the embodiments of the present disclosure to the forms disclosed herein. Although a number of example aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.

Claims (11)

1. A vehicle-mounted visual perception method, comprising:
capturing images at a plurality of consecutive moments through a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle, to obtain a plurality of image frame sets, wherein each image frame set comprises a plurality of image frames captured by the same vehicle-mounted camera, and each vehicle-mounted camera corresponds to one image frame at each moment;
performing feature extraction on the image frames in the plurality of image frame sets through an encoding branch network in a first network model, to obtain a plurality of first feature maps;
performing spatial fusion and temporal fusion on the plurality of first feature maps through a fusion branch network in the first network model, to obtain a second feature map in a bird's-eye-view image coordinate system; and
identifying the second feature map based on a network model corresponding to a preset perception task, and determining a perception result corresponding to the preset perception task.
2. The method according to claim 1, wherein the performing spatial fusion and temporal fusion on the plurality of first feature maps through the fusion branch network in the first network model to obtain the second feature map in the bird's-eye-view image coordinate system comprises:
for each of the plurality of moments, performing spatial fusion on the plurality of first feature maps corresponding to the moment, to obtain a plurality of third feature maps in the bird's-eye-view image coordinate system, wherein each third feature map corresponds to one moment; and
performing temporal fusion on the plurality of third feature maps to obtain the second feature map.
3. The method according to claim 2, wherein the performing spatial fusion on the plurality of first feature maps corresponding to the moment to obtain the plurality of third feature maps in the bird's-eye-view image coordinate system comprises:
performing a homography transformation on each of the plurality of first feature maps, to obtain a plurality of transformed feature maps in the bird's-eye-view image coordinate system; and
fusing the plurality of transformed feature maps point by point, to obtain the third feature map in the bird's-eye-view image coordinate system.
4. The method according to claim 3, wherein the performing a homography transformation on each of the plurality of first feature maps to obtain the plurality of transformed feature maps in the bird's-eye-view image coordinate system comprises:
determining, based on an intrinsic parameter matrix and an extrinsic parameter matrix of the vehicle-mounted camera corresponding to each first feature map, a first transformation matrix from an ego-vehicle coordinate system to a camera coordinate system corresponding to the first feature map;
determining, based on a perception range on a preset plane of the ego-vehicle coordinate system and a scaling ratio and a translation distance of the bird's-eye-view image coordinate system, a second transformation matrix from the bird's-eye-view image coordinate system to the preset plane of the ego-vehicle coordinate system;
determining, based on the first transformation matrix, the second transformation matrix, and the intrinsic parameter matrix of the vehicle-mounted camera, a third transformation matrix from the bird's-eye-view image coordinate system to an image coordinate system corresponding to the vehicle-mounted camera; and
transforming the first feature map based on the third transformation matrix, to obtain the transformed feature map.
5. The method according to claim 4, wherein the determining the third transformation matrix from the bird's-eye-view image coordinate system to the image coordinate system corresponding to the vehicle-mounted camera based on the first transformation matrix, the second transformation matrix, and the intrinsic parameter matrix of the vehicle-mounted camera comprises:
determining, based on the first transformation matrix, a fourth transformation matrix from the preset plane of the ego-vehicle coordinate system to the camera coordinate system;
determining, based on the fourth transformation matrix and the second transformation matrix, a fifth transformation matrix from the bird's-eye-view image coordinate system to the camera coordinate system; and
determining the third transformation matrix based on the fifth transformation matrix and the intrinsic parameter matrix of the vehicle-mounted camera.
6. The method according to claim 4 or 5, wherein the performing temporal fusion on the plurality of third feature maps to obtain the second feature map comprises:
taking the third feature map corresponding to the latest of the plurality of moments as a reference feature map;
reconstructing each of at least one remaining third feature map, to obtain at least one fourth feature map; and
fusing the reference feature map and the at least one fourth feature map point by point, to obtain the second feature map.
7. The method according to claim 6, wherein the reconstructing each of the at least one third feature map to obtain the at least one fourth feature map comprises:
determining, based on inter-frame motion transformation matrices of the vehicle-mounted camera between the latest moment and at least one moment corresponding to the at least one third feature map, at least one homography transformation matrix between the first feature map corresponding to the reference feature map and the first feature map corresponding to the at least one third feature map;
determining at least one conversion matrix between the reference feature map and the at least one third feature map based on the at least one homography transformation matrix and the third transformation matrix; and
reconstructing each of the at least one third feature map based on each of the at least one conversion matrix, to obtain the at least one fourth feature map.
8. The method according to any one of claims 1 to 7, wherein the identifying the second feature map based on a second network model corresponding to the preset perception task and determining the perception result corresponding to the preset perception task comprise:
decoding the second feature map through a decoding branch network in the first network model, to obtain a decoded second feature map; and
identifying the decoded second feature map based on the second network model, and determining the perception result corresponding to the preset perception task.
9. A vehicle-mounted visual perception device, comprising:
an image acquisition module, configured to capture images at a plurality of consecutive moments through a plurality of vehicle-mounted cameras arranged at preset positions on a vehicle, to obtain a plurality of image frame sets, wherein each image frame set comprises a plurality of image frames captured by the same vehicle-mounted camera, and each vehicle-mounted camera corresponds to one image frame at each moment;
an encoding module, configured to perform feature extraction, through an encoding branch network in a first network model, on the image frames in the plurality of image frame sets obtained by the image acquisition module, to obtain a plurality of first feature maps;
a fusion module, configured to perform spatial fusion and temporal fusion, through a fusion branch network in the first network model, on the plurality of first feature maps determined by the encoding module, to obtain a second feature map in a bird's-eye-view image coordinate system; and
a perception module, configured to identify the second feature map determined by the fusion module, based on a network model corresponding to a preset perception task, to obtain a perception result corresponding to the preset perception task.
10. A computer-readable storage medium storing a computer program for executing the vehicle-mounted visual perception method according to any one of claims 1 to 8.
11. An electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the vehicle-mounted visual perception method according to any one of claims 1 to 8.
CN202210514756.2A 2022-05-12 2022-05-12 Vehicle-mounted visual perception method and device, readable storage medium, and electronic device Active CN114757301B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210514756.2A CN114757301B (en) 2022-05-12 2022-05-12 Vehicle-mounted visual perception method and device, readable storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210514756.2A CN114757301B (en) 2022-05-12 2022-05-12 Vehicle-mounted visual perception method and device, readable storage medium, and electronic device

Publications (2)

Publication Number Publication Date
CN114757301A true CN114757301A (en) 2022-07-15
CN114757301B CN114757301B (en) 2025-02-14

Family

ID=82334310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210514756.2A Active CN114757301B (en) 2022-05-12 2022-05-12 Vehicle-mounted visual perception method and device, readable storage medium, and electronic device

Country Status (1)

Country Link
CN (1) CN114757301B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381868A (en) * 2020-11-13 2021-02-19 北京地平线信息技术有限公司 Image depth estimation method and device, readable storage medium and electronic equipment
CN112669335A (en) * 2021-01-27 2021-04-16 东软睿驰汽车技术(沈阳)有限公司 Vehicle sensing method and device, electronic equipment and machine-readable storage medium
CN113033523A (en) * 2021-05-25 2021-06-25 杭州雄迈集成电路技术股份有限公司 Method and system for constructing falling judgment model and falling judgment method and system
CN113673444A (en) * 2021-08-19 2021-11-19 清华大学 Intersection multi-view target detection method and system based on angular point pooling
CN113902897A (en) * 2021-09-29 2022-01-07 北京百度网讯科技有限公司 Training of target detection model, target detection method, device, equipment and medium
CN114419568A (en) * 2022-01-18 2022-04-29 东北大学 A multi-view pedestrian detection method based on feature fusion

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115042821A (en) * 2022-08-12 2022-09-13 小米汽车科技有限公司 Vehicle control method, vehicle control device, vehicle and storage medium
CN115042821B (en) * 2022-08-12 2022-11-04 小米汽车科技有限公司 Vehicle control method, vehicle control device, vehicle and storage medium
CN115240171A (en) * 2022-08-17 2022-10-25 阿波罗智联(北京)科技有限公司 Road structure sensing method and device and computer program product
CN115240171B (en) * 2022-08-17 2023-08-04 阿波罗智联(北京)科技有限公司 Road structure sensing method and device
CN115578702A (en) * 2022-09-26 2023-01-06 北京百度网讯科技有限公司 Road element extraction method and device, electronic equipment, storage medium and vehicle
CN115578702B (en) * 2022-09-26 2023-12-05 北京百度网讯科技有限公司 Road element extraction method and device, electronic equipment, storage medium and vehicle
CN116543059A (en) * 2023-04-18 2023-08-04 东风汽车集团股份有限公司 Bird's-eye view feature encoding method, system, device and storage medium of image
CN117173693A (en) * 2023-11-02 2023-12-05 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device
CN117173257A (en) * 2023-11-02 2023-12-05 安徽蔚来智驾科技有限公司 3D target detection and calibration parameter enhancement method, electronic equipment and medium
CN117173693B (en) * 2023-11-02 2024-02-27 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device
CN117173257B (en) * 2023-11-02 2024-05-24 安徽蔚来智驾科技有限公司 3D target detection and calibration parameter enhancement method, electronic equipment and medium

Also Published As

Publication number Publication date
CN114757301B (en) 2025-02-14

Similar Documents

Publication Publication Date Title
CN114757301A (en) Vehicle-mounted visual perception method and device, readable storage medium and electronic equipment
US10984556B2 (en) Method and apparatus for calibrating relative parameters of collector, device and storage medium
CN112509047B (en) Pose determining method and device based on image, storage medium and electronic equipment
CN112037279B (en) Article position identification method and device, storage medium and electronic equipment
CN114821505B (en) Multi-view 3D target detection method, memory and system based on aerial view
CN112819875B (en) Monocular depth estimation method and device and electronic equipment
WO2023231435A1 (en) Visual perception method and apparatus, and storage medium and electronic device
CN113095228B (en) Method and device for detecting target in image and computer readable storage medium
CN113639782B (en) External parameter calibration method and device, equipment and medium of vehicle-mounted sensor
GB2567245A (en) Methods and apparatuses for depth rectification processing
WO2023109221A1 (en) Method and apparatus for determining homography matrix, medium, device, and program product
WO2022262273A1 (en) Optical center alignment test method and apparatus, and storage medium and electronic device
CN115719476A (en) Image processing method and device, electronic equipment and storage medium
CN115147683A (en) Pose estimation network model training method, pose estimation method and device
Kim et al. Selfsphnet: Motion estimation of a spherical camera via self-supervised learning
CN113570694B (en) Model point rendering method and device, storage medium, and electronic device
CN116168271A (en) Image processing method, model training method, device, equipment and medium
CN115620250A (en) Road surface element reconstruction method, device, electronic device and storage medium
CN115239903A (en) Map generation method and device, electronic equipment and storage medium
CN113628283B (en) Parameter calibration method and device of image pickup device, medium and electronic equipment
CN118469941B (en) Radar point cloud preprocessing method, system, equipment and storage medium
CN113837968B (en) Training of human face optical flow estimation network and human face optical flow estimation method and device
CN118468219A (en) Multi-sensor data fusion method, device, equipment and storage medium
CN115830125A (en) Method and device for determining position of shaft, electronic device, and medium
CN114926534A (en) Obstacle sensing method and device, readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant