CN114885144A - High frame rate 3D video generation method and device based on data fusion - Google Patents

High frame rate 3D video generation method and device based on data fusion

Info

Publication number
CN114885144A
CN114885144A (application CN202210293645.3A)
Authority
CN
China
Prior art keywords
frame rate
video
event stream
event
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210293645.3A
Other languages
Chinese (zh)
Other versions
CN114885144B (en)
Inventor
高跃
李思奇
李一鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202210293645.3A priority Critical patent/CN114885144B/en
Publication of CN114885144A publication Critical patent/CN114885144A/en
Application granted granted Critical
Publication of CN114885144B publication Critical patent/CN114885144B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/111Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/01Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0127Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level by changing the field or frame frequency of the incoming video signal, e.g. frame rate converter

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The application discloses a high frame rate 3D video generation method and device based on data fusion. The method includes: acquiring video below a preset frame rate and the corresponding event data from an event camera; combining adjacent image frames of the video in pairs to generate multiple groups of adjacent image frames and calculating the timestamp set of all desired intermediate frames; intercepting the event streams from the two boundary frames to each desired intermediate frame and passing them through a preset spiking neural network by forward propagation to obtain event stream data feature vectors; splicing the event stream data feature vectors with the adjacent image frames and passing the result through a preset multi-modal fusion network by forward propagation to obtain all intermediate frames, thereby generating a high frame rate video above a second preset frame rate; and performing forward propagation with a preset 3D depth estimation network to obtain all high frame rate depth maps, which together form the high frame rate 3D video. This solves the technical problem in the related art that the generated image quality is low because only the event stream is used as input and the initial brightness value of each pixel is missing.

Description

High frame rate 3D video generation method and device based on data fusion
Technical Field
The present application relates to the field of computer vision and neuromorphic computing technologies, and in particular, to a method and an apparatus for generating a high frame rate 3D video based on data fusion.
Background
On the one hand, traditional cameras are limited by their frame rate, and the professional high-speed cameras required to shoot high frame rate video are extremely expensive; on the other hand, existing approaches to generating high frame rate 3D video, i.e., high frame rate depth map video, from low frame rate video still have certain drawbacks for achieving high-speed 3D observation.
In the related art, video is generated from a pure event stream: the event stream is stacked into a grid-like tensor representation, and images are then generated with deep learning methods, so as to achieve high-speed 3D observation.
However, the related art uses only the event stream as input and therefore lacks the initial brightness value of each pixel; estimating brightness solely from a record of brightness changes is an underdetermined problem, so the quality of the generated images is low and needs to be improved.
Disclosure of Invention
The application provides a high-frame-rate 3D video generation method and device based on data fusion, and aims to solve the technical problem that in the related technology, only event streams are used as input, and the initial brightness value of each pixel point is lacked, so that the generated image quality is low.
An embodiment of a first aspect of the present application provides a high frame rate 3D video generation method based on data fusion, including the following steps: acquiring video and event data lower than a preset frame rate from an event camera; combining every two adjacent image frames in the video to generate a plurality of groups of adjacent image frames, and calculating a timestamp set of all the intermediate frames expected to be obtained; intercepting a first event stream and a second event stream from two boundary frames to an expected intermediate frame according to the timestamp set, and inputting the first event stream and the second event stream to a preset impulse neural network for forward propagation to obtain a first event stream data feature vector and a second event stream data feature vector; splicing the adjacent image frames, the first event stream data feature vector and the second event stream data feature vector, inputting the adjacent image frames, the first event stream data feature vector and the second event stream data feature vector into a preset multi-mode fusion network for forward propagation to obtain all intermediate frames, and generating a high frame rate video higher than a second preset frame rate; and based on the high frame rate video, performing forward propagation by using a preset 3D depth estimation network to obtain all the high frame rate depth maps, and combining all the high frame rate depth maps to form the high frame rate 3D video.
Optionally, in an embodiment of the present application, before inputting the first event stream and the second event stream into the preset spiking neural network for forward propagation, the method further includes: and constructing the impulse neural network based on a Spike Response model as a neuron dynamic model.
Optionally, in an embodiment of the present application, the multi-modal fusion network includes a coarse synthesis sub-network and a fine tuning sub-network, where the coarse synthesis sub-network uses a first U-Net structure, the number of input channels of an input layer is 64+2 × k, the number of output channels of an output layer is k, and the fine tuning sub-network uses a second U-Net structure, the number of input channels of the input layer is 3 × k, the number of output channels of the output layer is k, and k is the number of channels of the image frame of the video lower than the preset frame rate.
Optionally, in an embodiment of the present application, the 3D depth estimation network uses a third U-Net structure, and the number of input channels of the input layer is 3 × k, and the number of output channels of the output layer is 1.
Optionally, in an embodiment of the present application, the calculation formula of the timestamp sets of all intermediate frames is:
τ_i^{j,j+1} = t_j + i · (t_{j+1} − t_j) / n,  i = 1, 2, …, n;  j = 1, 2, …, N − 1,
where N is the total number of frames of the input low frame rate video, n is the multiple of the desired frame rate boost, and t_j is the timestamp of the j-th frame of the input low frame rate video.
The embodiment of the second aspect of the present application provides a high frame rate 3D video generating device based on data fusion, including: the first acquisition module is used for acquiring videos and event data which are lower than a preset frame rate from the event camera; the computing module is used for combining every two adjacent image frames in the video to generate a plurality of groups of adjacent image frames and computing a timestamp set of all the intermediate frames expected to be obtained; the second acquisition module is used for intercepting a first event stream and a second event stream from two boundary frames to an expected intermediate frame according to the timestamp set, and inputting the first event stream and the second event stream to a preset impulse neural network for forward propagation to obtain a first event stream data feature vector and a second event stream data feature vector; the fusion module is used for splicing the adjacent image frames, the first event stream data feature vector and the second event stream data feature vector, inputting the adjacent image frames, the first event stream data feature vector and the second event stream data feature vector into a preset multi-mode fusion network for forward propagation to obtain all intermediate frames, and generating a high frame rate video higher than a second preset frame rate; and the generating module is used for carrying out forward propagation by utilizing a preset 3D depth estimation network based on the high frame rate video to obtain all the high frame rate depth maps, and combining all the high frame rate depth maps to form the high frame rate 3D video.
Optionally, in an embodiment of the present application, the method further includes: and the construction module is used for constructing the impulse neural network based on a Spike Response model as a neuron dynamic model.
Optionally, in an embodiment of the present application, the multi-modal fusion network includes a coarse synthesis sub-network and a fine tuning sub-network, where the coarse synthesis sub-network uses a first U-Net structure, the number of input channels of an input layer is 64+2 × k, the number of output channels of an output layer is k, and the fine tuning sub-network uses a second U-Net structure, the number of input channels of an input layer is 3 × k, the number of output channels of an output layer is k, and k is the number of channels of the image frame of the video lower than the preset frame rate.
Optionally, in an embodiment of the present application, the 3D depth estimation network uses a third U-Net structure, and the number of input channels of the input layer is 3 × k, and the number of output channels of the output layer is 1.
Optionally, in an embodiment of the present application, the calculation formula of the timestamp set of all intermediate frames is:
τ_i^{j,j+1} = t_j + i · (t_{j+1} − t_j) / n,  i = 1, 2, …, n;  j = 1, 2, …, N − 1,
where N is the total number of frames of the input low frame rate video, n is the multiple of the desired frame rate boost, and t_j is the timestamp of the j-th frame of the input low frame rate video.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the data fusion based high frame rate 3D video generation method according to the above embodiments.
A fourth aspect of the present application provides a computer-readable storage medium, which stores computer instructions for causing the computer to execute the method for generating a high frame rate 3D video based on data fusion according to the foregoing embodiment.
According to the method and device of the embodiments of the present application, event data can be used to provide inter-frame motion information, the event stream is encoded with a spiking neural network, all intermediate frames are obtained through a multi-modal fusion network to generate a high frame rate video, and a 3D depth estimation network is then used to form the high frame rate 3D video, achieving effective stereoscopic observation of high-speed scenes. Using both the event stream and the low frame rate video image frames as input makes better use of the multi-modal data information and further improves the quality of the high frame rate 3D video. This solves the technical problem in the related art that the generated image quality is low because only the event stream is used as input and the initial brightness value of each pixel is missing.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a high frame rate 3D video generation method based on data fusion according to an embodiment of the present application;
FIG. 2 is a flowchart of a high frame rate 3D video generation method based on data fusion according to an embodiment of the present application;
fig. 3 is a schematic diagram of low frame rate video data and event stream data of a high frame rate 3D video generation method based on data fusion according to an embodiment of the present application;
fig. 4 is a schematic diagram of inter-frame video data of a high frame rate 3D video generation method based on data fusion according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an input event stream, a low frame rate video and generated high frame rate video data of a high frame rate 3D video generation method based on data fusion according to an embodiment of the present application;
FIG. 6 is a high frame rate depth map at 10 times frame rate boost for a high frame rate 3D video generation method based on data fusion according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a high frame rate 3D video generation apparatus based on data fusion according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The following describes the high frame rate 3D video generation method and apparatus based on data fusion according to embodiments of the present application with reference to the drawings. To address the technical problem mentioned in the Background that, in the related art, only event streams are used as input and the initial brightness value of each pixel is missing, so that the generated image quality is low, the present application provides a high frame rate 3D video generation method based on data fusion, thereby solving that problem.
Specifically, fig. 1 is a schematic flowchart of a high frame rate 3D video generation method based on data fusion according to an embodiment of the present application.
As shown in fig. 1, the high frame rate 3D video generation method based on data fusion includes the following steps:
in step S101, video and event data below a preset frame rate are acquired from an event camera.
In the actual execution process, the embodiment of the application can acquire the video and the event data with the frame rate lower than the preset frame rate from the event camera, so that the acquisition of the original data is realized, and a data base is laid for the subsequent generation of the high-frame-rate video.
It can be understood that the event camera is a biologically inspired sensor whose working principle differs greatly from that of a traditional camera. Instead of capturing the absolute light intensity of a scene at a fixed frame rate as a traditional camera does, the event camera outputs an event stream only when the scene light intensity changes. Compared with a traditional camera, it has the advantages of high dynamic range, high temporal resolution and freedom from motion blur, which is beneficial for generating high frame rate video.
As a novel vision sensor, the event camera cannot directly use the various algorithms developed for traditional cameras and images. The event camera has no concept of frame rate: each pixel works asynchronously and outputs an event whenever a light intensity change is detected. Each event is a quadruple (x, y, t, p) containing the pixel coordinates (x, y), the timestamp t and the event polarity p (p = -1 means the light intensity at that pixel decreases, p = 1 means it increases). Summarizing the event data output by all pixels yields an event list composed of individual events, which serves as the event stream data output by the camera.
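For illustration, a minimal Python sketch of such an event record and event stream is given below; the class name, field types, time unit and example values are assumptions of this sketch, not part of the original disclosure:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    x: int      # pixel column (abscissa)
    y: int      # pixel row (ordinate)
    t: float    # timestamp; the time unit depends on the camera (assumed ms here)
    p: int      # polarity: +1 = light intensity increased, -1 = decreased

# An event stream is a time-ordered list of events gathered from all pixels.
EventStream = List[Event]

event_stream: EventStream = [
    Event(x=120, y=45, t=0.3, p=+1),
    Event(x=121, y=45, t=0.7, p=-1),
]
```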
The preset frame rate may be set by a person skilled in the art, and is not limited herein.
In step S102, two adjacent image frames in the video are combined to generate a plurality of groups of adjacent image frames, and a timestamp set of all the intermediate frames is calculated.
As a possible implementation, the embodiment of the present application may combine every two adjacent image frames of the low frame rate video to generate multiple groups of adjacent image frames and, for each group of adjacent image frames, calculate the timestamp set T of all desired intermediate frames, recorded as:
T = {τ_1^{1,2}, τ_2^{1,2}, …, τ_n^{1,2}, τ_1^{2,3}, τ_2^{2,3}, …, τ_n^{2,3}, …, τ_1^{N−1,N}, τ_2^{N−1,N}, …, τ_n^{N−1,N}}.
optionally, in an embodiment of the present application, a calculation formula of the timestamp sets of all the intermediate frames is:
τ_i^{j,j+1} = t_j + i · (t_{j+1} − t_j) / n,  i = 1, 2, …, n;  j = 1, 2, …, N − 1,
where N is the total number of frames of the input low frame rate video, n is the multiple of the desired frame rate boost, and t_j is the timestamp of the j-th frame of the input low frame rate video.
Specifically, the calculation formula for the timestamps of all the intermediate frames desired may be as follows:
τ_i^{j,j+1} = t_j + i · (t_{j+1} − t_j) / n,  i = 1, 2, …, n;  j = 1, 2, …, N − 1,
where N is the total number of frames of the input low frame rate video, n is the multiple of the desired frame rate boost, and t_j is the timestamp of the j-th frame of the input low frame rate video.
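As an illustration only, the Python sketch below implements the timestamp calculation; because the original equation image is not reproduced, the exact indexing is an assumption reconstructed from the numerical examples given later in this description:

```python
from typing import List

def intermediate_timestamps(frame_ts: List[float], n: int) -> List[float]:
    """Timestamp set T of all desired intermediate frames.

    frame_ts: timestamps t_1..t_N of the input low frame rate video frames.
    n: multiple of the desired frame rate boost.
    Implements tau_i^{j,j+1} = t_j + i * (t_{j+1} - t_j) / n for i = 1..n.
    """
    T = []
    for j in range(len(frame_ts) - 1):            # adjacent frame pair (j, j+1)
        t_j, t_j1 = frame_ts[j], frame_ts[j + 1]
        for i in range(1, n + 1):
            T.append(t_j + i * (t_j1 - t_j) / n)
    return T
```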
According to the embodiment of the application, the time stamp sets of all the intermediate frames can be obtained through calculation, so that the data can be preprocessed, and a basis is provided for subsequent data fusion.
In step S103, a first event stream and a second event stream from two boundary frames to an expected intermediate frame are intercepted according to the timestamp set, and the first event stream and the second event stream are input to a preset impulse neural network for forward propagation, so as to obtain a first event stream data feature vector and a second event stream data feature vector.
Further, according to the intermediate frame timestamp set calculated in step S102, the embodiment of the present application may intercept a first event stream ε_1 and a second event stream ε_2 running from the two boundary frames to the desired intermediate frame, and input ε_1 and ε_2 into a preset spiking neural network for forward propagation to obtain a first event stream data feature vector F_1 and a second event stream data feature vector F_2. By encoding the event stream with the spiking neural network, the embodiment of the present application can better denoise the event stream data and further improve the quality of the generated video.
The first event stream ε_1 and the second event stream ε_2 can be computed as:
ε_1 = { (x, y, t, p) ∈ ε : t_j ≤ t ≤ τ_i^{j,j+1} },
ε_2 = { (x, y, t, p) ∈ ε : τ_i^{j,j+1} ≤ t ≤ t_{j+1} },
where ε denotes the complete event stream, τ_i^{j,j+1} is the timestamp of the desired intermediate frame, and t_j and t_{j+1} are the timestamps of the input low frame rate video frames adjacent to the desired intermediate frame.
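A minimal sketch of this interception step, reusing the Event records from the earlier sketch, might look as follows; the inclusive interval bounds and the example timestamps are assumptions:

```python
def slice_events(events, t_start, t_end):
    """Return the events whose timestamps fall between t_start and t_end."""
    return [e for e in events if t_start <= e.t <= t_end]

# epsilon_1: events from the left boundary frame t_j up to the desired
# intermediate frame tau; epsilon_2: events from tau up to the right
# boundary frame t_{j+1}.
t_j, tau, t_j1 = 50.0, 75.0, 100.0          # example timestamps in ms
eps_1 = slice_events(event_stream, t_j, tau)
eps_2 = slice_events(event_stream, tau, t_j1)
```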
It should be noted that the predetermined spiking neural network will be described in detail below.
Optionally, in an embodiment of the present application, before inputting the first event stream and the second event stream into a preset impulse neural network for forward propagation, the method further includes: and constructing an impulse neural network based on a Spike Response model as a neuron dynamic model.
The spiking neural network is described in detail herein.
It can be understood that the spiking neural network is a third-generation artificial neural network. Its neurons are not activated at every iteration of propagation; instead, a neuron fires only when its membrane potential reaches a certain specific value, and when it fires it generates a signal that is transmitted to other neurons to raise or lower their membrane potentials. The simulated neurons are therefore closer to biological reality and better suited to processing time-series pulse signals.
In the actual implementation process, the embodiment of the present application can use the Spike Response Model as the neuron dynamics model to construct the spiking convolutional neural network.
In particular, the spiking neural network may include an input convolutional layer, a hidden convolutional layer and an output convolutional layer. The input convolutional layer has 2 input channels (corresponding to the positive-polarity and negative-polarity events of the event stream), a 3 × 3 convolution kernel, a stride of 1 and 16 output channels; the hidden convolutional layer has 16 input channels, a 3 × 3 convolution kernel, a stride of 1 and 16 output channels; the output convolutional layer has 16 input channels, a 3 × 3 convolution kernel, a stride of 1 and 32 output channels.
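The following PyTorch sketch reproduces only the channel layout described above; the Spike Response Model dynamics are deliberately stood in for by a plain ReLU and the padding choice is an assumption, so this is a shape check rather than an implementation of the spiking network itself:

```python
import torch
import torch.nn as nn

class EventStreamEncoder(nn.Module):
    """Convolutional encoder with the channel layout described above.

    Note: the spiking (SRM) dynamics are replaced by ReLU purely to
    illustrate tensor shapes; this is an assumption of the sketch.
    """
    def __init__(self):
        super().__init__()
        self.input_conv = nn.Conv2d(2, 16, kernel_size=3, stride=1, padding=1)   # +/- polarity channels
        self.hidden_conv = nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1)
        self.output_conv = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2, H, W) tensor built from an event stream slice
        x = self.act(self.input_conv(x))
        x = self.act(self.hidden_conv(x))
        return self.output_conv(x)              # (batch, 32, H, W) feature vector F

features = EventStreamEncoder()(torch.zeros(1, 2, 128, 128))
print(features.shape)                           # torch.Size([1, 32, 128, 128])
```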
In step S104, the adjacent image frames, the first event stream data feature vector and the second event stream data feature vector are spliced and input to a preset multi-modal fusion network for forward propagation, so as to obtain all intermediate frames, and generate a high frame rate video higher than a second preset frame rate.
As a possible implementation, in the embodiment of the present application, the adjacent image frames of the low frame rate video obtained in step S102 may be spliced with the first event stream data feature vector F_1 and the second event stream data feature vector F_2 obtained in step S103, and the spliced result may be input into a preset multi-modal fusion network for forward propagation to generate one intermediate frame, completing the calculation of a single high frame rate image frame.
Specifically, the embodiment of the present application may first splice the adjacent image frames of the low frame rate video with the event stream data feature vectors F_1 and F_2 and input them into the coarse synthesis sub-network to obtain a coarse output result; the coarse output result is then spliced with the input adjacent image frames and fed into the fine tuning sub-network to obtain the final output result.
Further, in the embodiment of the present application, the above steps may be repeated for the timestamp of each expected intermediate frame calculated in step S102, so as to complete the calculation of all intermediate frames, and further generate a high frame rate video higher than the second preset frame rate.
It should be noted that the pre-configured multimodal fusion network is described in detail below.
Optionally, in an embodiment of the present application, the multi-modal fusion network includes a coarse synthesis sub-network and a fine tuning sub-network, wherein the coarse synthesis sub-network uses a first U-Net structure, the number of input channels of an input layer is 64+2 × k, the number of output channels of an output layer is k, and the fine tuning sub-network uses a second U-Net structure, the number of input channels of the input layer is 3 × k, the number of output channels of the output layer is k, and k is the number of channels of an image frame of the video lower than a preset frame rate.
A multimodal fusion network is described in detail herein.
It will be appreciated that the data fusion network comprises a coarse synthesis subnetwork and a fine tuning subnetwork. The coarse synthesis sub-network uses a first U-Net structure, the number of input channels of an input layer is 64+2 xk, and the number of output channels of an output layer is k; the fine tuning sub-network uses a second U-Net structure, the number of input channels of the input layer is 3 xk, and the number of output channels of the output layer is k.
Where k is the number of channels of the image frame of the low frame rate video input in step S101, that is, k is 1 when the image frame of the low frame rate video input in step S101 is a grayscale image, and k is 3 when the image frame of the low frame rate video input in step S101 is an RGB image.
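A small helper, shown below for illustration, reproduces this channel arithmetic; the reading that the 64 input channels come from the two 32-channel event stream feature vectors is an inference from the encoder output size given above:

```python
def fusion_channel_counts(k: int):
    """Channel counts of the two fusion sub-networks for k-channel frames."""
    coarse = {"in": 64 + 2 * k, "out": k}   # 2 x 32 event features + 2 adjacent frames
    fine = {"in": 3 * k, "out": k}          # coarse result + the 2 adjacent frames
    return coarse, fine

print(fusion_channel_counts(1))   # grayscale: ({'in': 66, 'out': 1}, {'in': 3, 'out': 1})
print(fusion_channel_counts(3))   # RGB:       ({'in': 70, 'out': 3}, {'in': 9, 'out': 3})
```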
In step S105, based on the high frame rate video, forward propagation is performed by using a preset 3D depth estimation network to obtain all the high frame rate depth maps, and all the high frame rate depth maps are combined to form the high frame rate 3D video.
In the actual execution process, in the embodiment of the present application, each high frame rate image frame obtained in the above steps may be spliced with its adjacent preceding and following high frame rate image frames, and a preset 3D depth estimation network is used for forward propagation to generate a series of high frame rate depth maps; the generated depth maps are combined to form the high frame rate 3D video, realizing high frame rate 3D video generation. In this way, event data provides inter-frame motion information, the event stream is encoded with the spiking neural network, all intermediate frames are obtained through the multi-modal fusion network to generate a high frame rate video, and the 3D depth estimation network then forms the high frame rate 3D video, achieving effective stereoscopic observation of high-speed scenes; using the event stream and the low frame rate video image frames as input makes better use of the multi-modal data information and further improves the quality of the high frame rate 3D video.
Optionally, in an embodiment of the present application, the 3D depth estimation network uses a third U-Net structure, and the number of input channels of the input layer is 3 × k, and the number of output channels of the output layer is 1.
The construction of the 3D depth estimation network is described in detail herein.
Specifically, the 3D depth estimation network constructed in the embodiment of the present application may use a third U-Net structure, where the number of input channels of the input layer is 3 × k, and the number of output channels of the output layer is 1, where k is the number of channels of the image frame of the low frame rate video input in step S101, that is, when the image frame of the low frame rate video input in step S101 is a grayscale map, k is 1, and when the image frame of the low frame rate video input in step S101 is an RGB image, k is 3.
The embodiments of the present application will be described in detail with reference to fig. 2 to 7. As shown in fig. 2, the embodiment of the present application includes the following steps:
step S201: and acquiring the low frame rate video data and the event stream data. In the actual execution process, the embodiment of the application can acquire the video and the event data of the frame rate from the event camera, so that the acquisition of the original data is realized, and a data base is laid for the subsequent generation of the high frame rate video.
It can be understood that the event camera has no concept of frame rate: each pixel of the event camera works asynchronously and outputs an event when a light intensity change is detected. Each event is a quadruple (x, y, t, p) containing the pixel coordinates (x, y), the timestamp t and the event polarity p (where p = -1 indicates that the light intensity at the pixel decreases and p = 1 indicates that it increases). Summarizing the event data output by all pixels forms an event list composed of individual events, which serves as the event stream data output by the camera.
For example, as shown in fig. 3, the frame rate of the low frame rate video acquired from the event camera in the embodiment of the present application may be 20FPS (Frames Per Second), which is 31 Frames in total, and the duration of the corresponding event stream is 1500 ms.
Step S202: data preprocessing. The embodiment of the present application can combine every two adjacent image frames of the low frame rate video and, for each group of adjacent image frames, calculate the timestamp set T of all desired intermediate frames, recorded as:
T = {τ_1^{1,2}, τ_2^{1,2}, …, τ_n^{1,2}, τ_1^{2,3}, τ_2^{2,3}, …, τ_n^{2,3}, …, τ_1^{N−1,N}, τ_2^{N−1,N}, …, τ_n^{N−1,N}},
wherein, the calculation formula of each expected intermediate frame timestamp is as follows:
τ_i^{j,j+1} = t_j + i · (t_{j+1} − t_j) / n,  i = 1, 2, …, n;  j = 1, 2, …, N − 1,
where N is the total number of frames of the input low frame rate video, n is the multiple of the desired frame rate boost, and t_j is the timestamp of the j-th frame of the input low frame rate video.
For example, in the embodiment of the present application, the input low frame rate video may include N = 31 frames at a frame rate of 20 FPS, so the timestamp of the j-th frame of the input low frame rate video is t_j = (j − 1) × 50 ms. If a high frame rate video with a frame rate boost of n = 10 is desired, the calculated timestamp set of all intermediate frames contains timestamps spaced 5 ms apart, T = {0, 5, 10, 15, 20, …} ms.
Step S203: construct the spiking neural network. In the actual implementation process, the embodiment of the present application can use the Spike Response Model as the neuron dynamics model to construct the spiking convolutional neural network.
In particular, the spiking neural network may include an input convolutional layer, a hidden convolutional layer and an output convolutional layer. The input convolutional layer has 2 input channels (corresponding to the positive-polarity and negative-polarity events of the event stream), a 3 × 3 convolution kernel, a stride of 1 and 16 output channels; the hidden convolutional layer has 16 input channels, a 3 × 3 convolution kernel, a stride of 1 and 16 output channels; the output convolutional layer has 16 input channels, a 3 × 3 convolution kernel, a stride of 1 and 32 output channels.
Step S204: compute the event stream encoding. In the embodiment of the present application, according to the intermediate frame timestamp τ_i^{j,j+1} calculated in step S202, the event streams ε_1 and ε_2 from the two boundary frames to the desired intermediate frame may be intercepted, and ε_1 and ε_2 may be respectively input into the spiking neural network obtained in step S203 for forward propagation to obtain the event stream data feature vectors F_1 and F_2.
The event streams ε_1 and ε_2 from the two boundary frames to the desired intermediate frame are computed as:
ε_1 = { (x, y, t, p) ∈ ε : t_j ≤ t ≤ τ_i^{j,j+1} },
ε_2 = { (x, y, t, p) ∈ ε : τ_i^{j,j+1} ≤ t ≤ t_{j+1} },
where ε denotes the complete event stream, τ_i^{j,j+1} is the timestamp of the desired intermediate frame, and t_j and t_{j+1} are the timestamps of the input low frame rate video frames adjacent to the desired intermediate frame.
For example, take the 15th desired intermediate frame, i.e., the 5th frame inserted between the 2nd and 3rd frames of the input low frame rate video, whose timestamp is τ_5^{2,3} = 75 ms. The event streams ε_1 and ε_2 from the two boundary frames to this desired intermediate frame are shown in Table 1 and Table 2, which list the data of ε_1 and ε_2 respectively.
Table 1: event stream ε_1 data (table contents not reproduced)
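As a consistency check, the hypothetical intermediate_timestamps helper sketched earlier reproduces the numbers of this example (20 FPS input, N = 31, n = 10):

```python
frame_ts = [j * 50.0 for j in range(31)]       # N = 31 frames at 20 FPS -> 50 ms spacing
T = intermediate_timestamps(frame_ts, n=10)    # helper from the earlier sketch

print(len(T))    # 300 intermediate frames in total (see step S207)
print(T[14])     # 15th desired intermediate frame: tau_5^{2,3} = 75.0 ms
```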
Step S205: construct the multi-modal fusion network. It will be appreciated that the data fusion network comprises a coarse synthesis sub-network and a fine tuning sub-network. The coarse synthesis sub-network adopts a U-Net structure; the number of input channels of its input layer is 64 + 2 × k and the number of output channels of its output layer is k. The fine tuning sub-network adopts a U-Net structure; the number of input channels of its input layer is 3 × k and the number of output channels of its output layer is k.
Where k is the number of channels of the image frame of the low frame rate video input in step S201, that is, k is 1 when the image frame of the low frame rate video input in step S201 is a grayscale map, and k is 3 when the image frame of the low frame rate video input in step S201 is an RGB image.
For example, in the embodiment of the present application, the image frame of the low frame rate video input in step S201 may be a grayscale map, that is, k is 1, at this time, the number of input channels of the input layer of the coarse synthesis subnetwork is 66, and the number of output channels of the output layer is 1; the number of input channels of the input layer of the fine tuning sub-network is 3; the number of output channels of the output layer is 1.
Step S206: single high frame rate image frame calculation. As a possible implementation, the embodiment of the present application may splice the adjacent image frames of the low frame rate video obtained in step S202 with the first event stream data feature vector F_1 and the second event stream data feature vector F_2 obtained in step S204, and input the spliced result into the preset multi-modal fusion network for forward propagation to generate one intermediate frame, completing the calculation of a single high frame rate image frame.
Specifically, the embodiment of the present application may first splice the adjacent image frames of the low frame rate video with the event stream data feature vectors F_1 and F_2 and input them into the coarse synthesis sub-network to obtain a coarse output result; the coarse output result is then spliced with the input adjacent image frames and fed into the fine tuning sub-network to obtain the final output result.
For example, taking the 15 th expected intermediate frame as an example, the generated intermediate frame is shown in fig. 4.
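For illustration, the two-stage forward pass described in this step can be sketched at the tensor-shape level as follows; the ordering of tensors inside each concatenation and the treatment of the sub-networks as opaque callables are assumptions:

```python
import torch

def fuse_intermediate_frame(frame_prev, frame_next, F1, F2, coarse_net, fine_net):
    """Two-stage fusion: coarse synthesis followed by refinement.

    frame_prev, frame_next: (B, k, H, W) adjacent low frame rate frames.
    F1, F2: (B, 32, H, W) event stream feature vectors.
    coarse_net, fine_net: the two U-Net sub-networks (not defined here).
    """
    coarse_in = torch.cat([frame_prev, frame_next, F1, F2], dim=1)  # (B, 2k + 64, H, W)
    coarse = coarse_net(coarse_in)                                  # (B, k, H, W)
    fine_in = torch.cat([coarse, frame_prev, frame_next], dim=1)    # (B, 3k, H, W)
    return fine_net(fine_in)                                        # intermediate frame (B, k, H, W)
```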
Step S207: all high frame rate image frame calculations. Further, the embodiment of the present application may repeat the above steps S202 to S206 for the timestamp of each desired intermediate frame calculated in step S202, completing the calculation of all intermediate frames.
For example, in the embodiment of the present application, the input low frame rate video may include N = 31 frames of images; to obtain a high frame rate video with the frame rate increased by n = 10 times, steps S202 to S206 need to be repeated a total of 300 times.
In the embodiment of the present application, all the intermediate frames obtained in step S207 are combined to form a high frame rate video, so as to implement high frame rate video generation.
For example, the input event stream, the low frame rate video and the generated high frame rate video may be as shown in fig. 5, where a high frame rate video with the frame rate increased by n = 10 times is obtained.
Step S208: construct the 3D depth estimation network. Specifically, the 3D depth estimation network constructed in the embodiment of the present application may use a third U-Net structure, where the number of input channels of the input layer is 3 × k and the number of output channels of the output layer is 1; k is the number of channels of the image frames of the low frame rate video input in step S201, i.e., k = 1 when those image frames are grayscale and k = 3 when they are RGB images.
Step S209: high frame rate 3D depth estimation calculations.
Step S210: data post-processing. In the actual execution process, in the embodiment of the present application, each high frame rate image frame obtained in the above steps may be spliced with the adjacent preceding and following high frame rate image frames and propagated forward through the preset 3D depth estimation network to generate a series of high frame rate depth maps; the generated depth maps are combined to form the high frame rate 3D video, completing high frame rate 3D video generation.
For example, as shown in fig. 6, in the embodiment of the present application, high frame rate depth map video generation under 10 times frame rate enhancement can be realized, and effective stereoscopic scene observation under a high-speed environment is realized.
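A sketch of this post-processing loop is given below for illustration; the boundary handling (repeating the first and last frames) is an assumption not stated in the description:

```python
import torch

def high_frame_rate_depth_maps(frames, depth_net):
    """Estimate one depth map per generated high frame rate frame.

    frames: list of (1, k, H, W) tensors forming the high frame rate video.
    depth_net: the 3D depth estimation U-Net (3*k input channels, 1 output).
    """
    depth_maps = []
    for i in range(len(frames)):
        prev_f = frames[max(i - 1, 0)]                  # repeat first frame at the start
        next_f = frames[min(i + 1, len(frames) - 1)]    # repeat last frame at the end
        stacked = torch.cat([prev_f, frames[i], next_f], dim=1)   # (1, 3k, H, W)
        depth_maps.append(depth_net(stacked))                     # (1, 1, H, W)
    return depth_maps
```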
According to the high frame rate 3D video generation method based on data fusion, event data can be used to provide inter-frame motion information, the event stream is encoded with a spiking neural network, all intermediate frames are obtained through the multi-modal fusion network to generate a high frame rate video, and the 3D depth estimation network is then used to form the high frame rate 3D video, achieving effective stereoscopic observation of high-speed scenes. Using the event stream and the low frame rate video image frames as input makes better use of the multi-modal data information and further improves the quality of the high frame rate 3D video. This solves the technical problem in the related art that the generated image quality is low because only the event stream is used as input and the initial brightness value of each pixel is missing.
Next, a high frame rate 3D video generation apparatus based on data fusion proposed according to an embodiment of the present application is described with reference to the drawings.
Fig. 7 is a schematic block diagram of a high frame rate 3D video generation apparatus based on data fusion according to an embodiment of the present application.
As shown in fig. 7, the high frame rate 3D video generation apparatus 10 based on data fusion includes: a first acquisition module 100, a calculation module 200, a second acquisition module 300, a fusion module 400, and a generation module 500.
Specifically, the first acquiring module 100 is configured to acquire video and event data from an event camera, where the video and event data are lower than a preset frame rate.
The calculating module 200 is configured to combine every two adjacent image frames in the video to generate a plurality of groups of adjacent image frames, and calculate a timestamp set of all the intermediate frames expected to be obtained.
The second obtaining module 300 is configured to intercept a first event stream and a second event stream from two boundary frames to an expected intermediate frame according to the timestamp set, and input the first event stream and the second event stream to a preset impulse neural network for forward propagation to obtain a first event stream data feature vector and a second event stream data feature vector.
And the fusion module 400 is configured to splice adjacent image frames, the first event stream data feature vector and the second event stream data feature vector, input the spliced image frames, the first event stream data feature vector and the second event stream data feature vector to a preset multi-modal fusion network for forward propagation, obtain all intermediate frames, and generate a high frame rate video higher than a second preset frame rate.
The generating module 500 is configured to perform forward propagation by using a preset 3D depth estimation network based on the high frame rate video, obtain all the high frame rate depth maps, and combine all the high frame rate depth maps to form the high frame rate 3D video.
Optionally, in an embodiment of the present application, the data fusion-based high frame rate 3D video generation apparatus 10 further includes: and constructing a module.
The construction module is used for constructing the impulse neural network based on a Spike Response model as a neuron dynamic model.
Optionally, in an embodiment of the present application, the multi-modal fusion network includes a coarse synthesis sub-network and a fine tuning sub-network, wherein the coarse synthesis sub-network uses a first U-Net structure, the number of input channels of an input layer is 64+2 × k, the number of output channels of an output layer is k, and the fine tuning sub-network uses a second U-Net structure, the number of input channels of the input layer is 3 × k, the number of output channels of the output layer is k, and k is the number of channels of an image frame of the video lower than a preset frame rate.
Optionally, in an embodiment of the present application, the 3D depth estimation network uses a third U-Net structure, and the number of input channels of the input layer is 3 × k, and the number of output channels of the output layer is 1.
Optionally, in an embodiment of the present application, a calculation formula of the timestamp sets of all the intermediate frames is:
τ_i^{j,j+1} = t_j + i · (t_{j+1} − t_j) / n,  i = 1, 2, …, n;  j = 1, 2, …, N − 1,
where N is the total number of frames of the input low frame rate video, n is the multiple of the desired frame rate boost, and t_j is the timestamp of the j-th frame of the input low frame rate video.
It should be noted that the foregoing explanation on the embodiment of the high frame rate 3D video generation method based on data fusion is also applicable to the high frame rate 3D video generation device based on data fusion of this embodiment, and details are not repeated here.
According to the high frame rate 3D video generation device based on data fusion, event data can be used to provide inter-frame motion information, the event stream is encoded with a spiking neural network, all intermediate frames are obtained through the multi-modal fusion network to generate a high frame rate video, and the 3D depth estimation network is then used to form the high frame rate 3D video, achieving effective stereoscopic observation of high-speed scenes. Using the event stream and the low frame rate video image frames as input makes better use of the multi-modal data information and further improves the quality of the high frame rate 3D video. This solves the technical problem in the related art that the generated image quality is low because only the event stream is used as input and the initial brightness value of each pixel is missing.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
a memory 801, a processor 802, and a computer program stored on the memory 801 and executable on the processor 802.
The processor 802, when executing the program, implements the data fusion-based high frame rate 3D video generation method provided in the above-described embodiments.
Further, the electronic device further includes:
a communication interface 803 for communicating between the memory 801 and the processor 802.
A memory 801 for storing computer programs operable on the processor 802.
The memory 801 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
If the memory 801, the processor 802 and the communication interface 803 are implemented independently, the communication interface 803, the memory 801 and the processor 802 may be connected to each other via a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
Alternatively, in practical implementation, if the memory 801, the processor 802 and the communication interface 803 are integrated into one chip, the memory 801, the processor 802 and the communication interface 803 may communicate with each other through an internal interface.
The processor 802 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present Application.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the data fusion-based high frame rate 3D video generation method as above.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, e.g., two, three, etc., unless explicitly defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more N executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of implementing the embodiments of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A high frame rate 3D video generation method based on data fusion is characterized by comprising the following steps:
acquiring video and event data lower than a preset frame rate from an event camera;
combining every two adjacent image frames in the video to generate a plurality of groups of adjacent image frames, and calculating a timestamp set of all the intermediate frames expected to be obtained;
intercepting a first event stream and a second event stream from two boundary frames to an expected intermediate frame according to the timestamp set, and inputting the first event stream and the second event stream to a preset impulse neural network for forward propagation to obtain a first event stream data feature vector and a second event stream data feature vector;
splicing the adjacent image frames, the first event stream data feature vector and the second event stream data feature vector, inputting the adjacent image frames, the first event stream data feature vector and the second event stream data feature vector into a preset multi-mode fusion network for forward propagation to obtain all intermediate frames, and generating a high frame rate video higher than a second preset frame rate;
and based on the high frame rate video, performing forward propagation by using a preset 3D depth estimation network to obtain all the high frame rate depth maps, and combining all the high frame rate depth maps to form the high frame rate 3D video.
2. The method of claim 1, further comprising, prior to inputting the first and second event streams into the pre-defined spiking neural network for forward propagation:
and constructing the impulse neural network based on a Spike Response model as a neuron dynamic model.
3. The method of claim 1, wherein the multi-modal fusion network comprises a coarse synthesis sub-network and a fine tuning sub-network, wherein the coarse synthesis sub-network uses a first U-Net structure, the number of input channels of an input layer is 64+2 xk, the number of output channels of an output layer is k, and the fine tuning sub-network uses a second U-Net structure, the number of input channels of the input layer is 3 xk, the number of output channels of the output layer is k, and k is the number of channels of the image frames of the video with the frame rate lower than the preset frame rate.
4. The method of claim 1, wherein the 3D depth estimation network uses a third U-Net structure whose input layer has 3×k input channels and whose output layer has 1 output channel.
5. The method according to any one of claims 1-4, wherein the timestamp set of all intermediate frames is calculated as:
T = { t_j + (i/n)·(t_{j+1} − t_j) | j = 1, …, N − 1; i = 1, …, n − 1 }
where N is the total number of frames of the input low frame rate video, n is the multiple by which the frame rate is to be increased, and t_j is the timestamp of the j-th frame of the input low frame rate video.
6. A high frame rate 3D video generation apparatus based on data fusion, comprising:
a first acquisition module, configured to acquire, from an event camera, video with a frame rate lower than a preset frame rate, together with event data;
a calculation module, configured to combine every two adjacent image frames in the video to generate a plurality of groups of adjacent image frames, and to calculate the timestamp set of all intermediate frames to be generated;
a second acquisition module, configured to extract, according to the timestamp set, a first event stream and a second event stream running from the two boundary frames to the desired intermediate frame, and to input the first event stream and the second event stream into a preset spiking neural network for forward propagation to obtain a first event stream feature vector and a second event stream feature vector;
a fusion module, configured to concatenate the adjacent image frames with the first event stream feature vector and the second event stream feature vector, and to input the result into a preset multi-modal fusion network for forward propagation to obtain all intermediate frames and generate a high frame rate video with a frame rate higher than a second preset frame rate;
and a generation module, configured to perform forward propagation with a preset 3D depth estimation network based on the high frame rate video to obtain all high frame rate depth maps, and to combine all the high frame rate depth maps into the high frame rate 3D video.
7. The apparatus of claim 6, further comprising: a construction module, configured to construct the spiking neural network using the Spike Response Model as the neuron dynamics model.
8. The apparatus according to any one of claims 6-7, wherein the timestamp set of all intermediate frames is calculated as:
T = { t_j + (i/n)·(t_{j+1} − t_j) | j = 1, …, N − 1; i = 1, …, n − 1 }
where N is the total number of frames of the input low frame rate video, n is the multiple by which the frame rate is to be increased, and t_j is the timestamp of the j-th frame of the input low frame rate video.
9. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the data fusion based high frame rate 3D video generation method according to any one of claims 1-5.
10. A computer readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the data fusion based high frame rate 3D video generation method according to any one of claims 1-5.
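
As an illustration of the timestamp computation in claims 1 and 5, the following Python sketch enumerates the intermediate-frame timestamps under the assumption (made explicit in the reconstructed formula above) that each interval between adjacent input frames is subdivided uniformly into n parts; the function name and data layout are illustrative and not taken from the patent.

# Illustrative only: intermediate-frame timestamps for an n-times frame-rate boost,
# assuming uniform subdivision of each interval between adjacent input frames.
def intermediate_timestamps(t, n):
    """t: list of N timestamps of the low frame rate video; n: frame-rate multiple."""
    out = []
    for j in range(len(t) - 1):
        step = (t[j + 1] - t[j]) / n
        out.append([t[j] + i * step for i in range(1, n)])  # n-1 timestamps per interval
    return out

# Example: 4 input frames at 30 fps, quadrupling the frame rate (n = 4)
print(intermediate_timestamps([0.0, 1 / 30, 2 / 30, 3 / 30], 4))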
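The "intercepting" step of claims 1 and 6 amounts to cutting two slices out of the event stream around each desired intermediate timestamp. Below is a minimal sketch, assuming events are (timestamp, x, y, polarity) tuples sorted by time and that the second slice simply covers the span from the intermediate timestamp to the right boundary frame; the claims do not specify whether that slice is additionally time-reversed.

# Illustrative only: slice the event stream around one desired intermediate timestamp.
def slice_event_streams(events, t_left, t_mid, t_right):
    # events: iterable of (timestamp, x, y, polarity), sorted by timestamp
    first = [e for e in events if t_left <= e[0] < t_mid]     # left boundary frame -> intermediate frame
    second = [e for e in events if t_mid <= e[0] <= t_right]  # intermediate frame -> right boundary frame
    return first, second

events = [(0.001, 10, 12, 1), (0.012, 11, 12, -1), (0.020, 11, 13, 1), (0.031, 12, 13, 1)]
first, second = slice_event_streams(events, t_left=0.0, t_mid=1 / 60, t_right=1 / 30)
print(len(first), len(second))  # 2 2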
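Claim 3 fixes only the channel counts of the multi-modal fusion network (64 + 2×k in, k out for the coarse synthesis sub-network; 3×k in, k out for the fine-tuning sub-network). The sketch below checks that arithmetic with plain convolutions standing in for the two U-Nets; the 32 + 32 split of the 64 event-feature channels and the reuse of the two boundary frames as the extra 2×k channels of the fine-tuning input are assumptions, not statements of the patented design.

# Illustrative only: channel bookkeeping of the multi-modal fusion network in claim 3.
import torch
import torch.nn as nn

k = 3                                              # channels per image frame (e.g. RGB)
coarse = nn.Conv2d(64 + 2 * k, k, 3, padding=1)    # stand-in for the coarse synthesis U-Net
fine = nn.Conv2d(3 * k, k, 3, padding=1)           # stand-in for the fine-tuning U-Net

frame_a = torch.randn(1, k, 128, 128)              # left boundary frame
frame_b = torch.randn(1, k, 128, 128)              # right boundary frame
feat_a = torch.randn(1, 32, 128, 128)              # first event-stream feature map (32 channels assumed)
feat_b = torch.randn(1, 32, 128, 128)              # second event-stream feature map (32 channels assumed)

rough = coarse(torch.cat([feat_a, feat_b, frame_a, frame_b], dim=1))  # 64 + 2*k channels in, k out
refined = fine(torch.cat([rough, frame_a, frame_b], dim=1))           # 3*k channels in, k out
print(refined.shape)                               # torch.Size([1, 3, 128, 128])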
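Similarly, claim 4 specifies a 3×k-to-1 channel layout for the 3D depth estimation network. One plausible reading, sketched below with a convolution standing in for the third U-Net, is that each high frame rate frame is stacked with its two temporal neighbours to form the 3×k-channel input, yielding one depth map per frame; this stacking is an assumption rather than something stated in the claims.

# Illustrative only: per-frame depth estimation with the 3*k -> 1 channel layout of claim 4.
import torch
import torch.nn as nn

k = 3
depth_net = nn.Conv2d(3 * k, 1, 3, padding=1)      # stand-in for the 3D depth estimation U-Net

video = torch.randn(10, k, 128, 128)               # a short high frame rate clip of 10 frames
depth_maps = []
for i in range(1, video.shape[0] - 1):
    triplet = torch.cat([video[i - 1], video[i], video[i + 1]], dim=0).unsqueeze(0)  # (1, 3*k, H, W)
    depth_maps.append(depth_net(triplet))          # one depth map per frame: (1, 1, H, W)

depth = torch.cat(depth_maps, dim=0)               # depth maps for the 8 interior frames
print(depth.shape)                                 # torch.Size([8, 1, 128, 128])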
CN202210293645.3A 2022-03-23 2022-03-23 High frame rate 3D video generation method and device based on data fusion Active CN114885144B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210293645.3A CN114885144B (en) 2022-03-23 2022-03-23 High frame rate 3D video generation method and device based on data fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210293645.3A CN114885144B (en) 2022-03-23 2022-03-23 High frame rate 3D video generation method and device based on data fusion

Publications (2)

Publication Number Publication Date
CN114885144A 2022-08-09
CN114885144B CN114885144B (en) 2023-02-07

Family

ID=82667857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210293645.3A Active CN114885144B (en) 2022-03-23 2022-03-23 High frame rate 3D video generation method and device based on data fusion

Country Status (1)

Country Link
CN (1) CN114885144B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160140733A1 (en) * 2014-11-13 2016-05-19 Futurewei Technologies, Inc. Method and systems for multi-view high-speed motion capture
CN111667442A (en) * 2020-05-21 2020-09-15 武汉大学 High-quality high-frame-rate image reconstruction method based on event camera
CN113888639A (en) * 2021-10-22 2022-01-04 上海科技大学 Visual odometer positioning method and system based on event camera and depth camera
CN114071114A (en) * 2022-01-17 2022-02-18 季华实验室 Event camera, depth event point diagram acquisition method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SONG Caixia et al., "Research on VR Video Transmission Methods over 5G Networks," 《电视技术》 (Video Engineering) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115883764A (en) * 2023-02-08 2023-03-31 吉林大学 Underwater high-speed video frame interpolation method and system based on data cooperation
WO2024179078A1 (en) * 2023-02-28 2024-09-06 万有引力(宁波)电子科技有限公司 Fused display method and system, and storage medium

Also Published As

Publication number Publication date
CN114885144B (en) 2023-02-07

Similar Documents

Publication Publication Date Title
Lan et al. MADNet: A fast and lightweight network for single-image super resolution
Yan et al. Multi-scale dense networks for deep high dynamic range imaging
CN111402130B (en) Data processing method and data processing device
CN114885144B (en) High frame rate 3D video generation method and device based on data fusion
US20210133920A1 (en) Method and apparatus for restoring image
CN110717851A (en) Image processing method and device, neural network training method and storage medium
CN113076685B (en) Training method of image reconstruction model, image reconstruction method and device thereof
CN111835983B (en) Multi-exposure-image high-dynamic-range imaging method and system based on generation countermeasure network
CN114885112B (en) High-frame-rate video generation method and device based on data fusion
TWI770432B (en) Method, device and electronic apparatus for image restoration and storage medium thereof
CN112270692B (en) Monocular video structure and motion prediction self-supervision method based on super-resolution
CN111079507B (en) Behavior recognition method and device, computer device and readable storage medium
CN111652921A (en) Generation method of monocular depth prediction model and monocular depth prediction method
CN114881921B (en) Anti-occlusion imaging method and device based on event and video fusion
CN113554726B (en) Image reconstruction method and device based on pulse array, storage medium and terminal
CN114841897B (en) Depth deblurring method based on self-adaptive fuzzy kernel estimation
CN114640885B (en) Video frame inserting method, training device and electronic equipment
CN100465994C (en) Method and apparatus for downscaling a digital matrix image
CN110503002B (en) Face detection method and storage medium
CN113658091A (en) Image evaluation method, storage medium and terminal equipment
CN113158970A (en) Action identification method and system based on fast and slow dual-flow graph convolutional neural network
Hua et al. An Efficient Multiscale Spatial Rearrangement MLP Architecture for Image Restoration
CN117408916A (en) Image deblurring method based on multi-scale residual Swin transducer and related product
JP7508525B2 (en) Information processing device, information processing method, and program
CN115909088A (en) Optical remote sensing image target detection method based on super-resolution feature aggregation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant