CN117934647A - Multi-task multi-mode data simulation method and device

Multi-task multi-mode data simulation method and device

Info

Publication number
CN117934647A
CN117934647A
Authority
CN
China
Prior art keywords
simulation
data
image data
original image
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311766510.5A
Other languages
Chinese (zh)
Inventor
赵蓉
林逸晗
王韬毅
曾辉
陈雨过
施路平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202311766510.5A
Publication of CN117934647A
Legal status: Pending


Abstract

The invention provides a multi-task multi-mode data simulation method and device, comprising the following steps: acquiring original image data; inputting the original image data into a pre-built multi-task multi-mode data simulation module, and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera, to obtain target simulation data; the target simulation data include target image data, target spatial differential data, and target temporal differential data. For the data streams and acquisition systems of the different input modes of the original image data, the invention designs the multi-task multi-mode data simulation module differently, adds simulation data on the basis of the input original image data, and finally outputs multi-task multi-mode target simulation data in which simulated data are primary and real data are auxiliary, thereby solving, reliably and at low cost, the problems of sparse and unobtainable corner-case data at the data source.

Description

Multi-task multi-mode data simulation method and device
Technical Field
The invention relates to the technical field of digital image acquisition, and in particular to a multi-task multi-mode data simulation method and device.
Background
The complementary vision sensor (CVS) is a novel neuromorphic vision sensor. Complementary cameras are proposed on the basis of the complementary perception theory, which, with reference to the properties of human vision, requires that the acquired data be separated into a plurality of different pathways whose data structures are complementary in different properties. The complementary vision sensor emulates the human retina and outputs multiple data modes from a single vision chip: using the same CMOS chip, it can output RGB, spatial-differential, and temporal-differential data information and encode the information in different modes.
When algorithm iteration is performed on a vision application of such an image sensor, for example an image recognition solution, a large number of images of different scenes are required as training data, so the hardware of the image sensor must undergo many adjustments and changes, and images can only be captured after each hardware setup is built, which consumes a great deal of time and cost. Algorithm iteration itself also requires a large amount of training data, especially training data with instance-level annotation information and training data of special scenes, so a reliable and low-cost multi-mode data simulation method is needed.
At present, simulators for existing image sensors are highly customized. Different image sensor developers and chip design companies that need to develop and support image sensors focus on different aspects, so the behavior-level simulation models of the related image sensors also differ; each simulation model suits only a fixed image sensor and has low extensibility. For a complementary vision sensor capable of outputting multi-modal data, no suitable simulation method exists.
Disclosure of Invention
The invention provides a multi-task multi-mode data simulation method and device, which are used for overcoming the lack of a multi-task multi-mode data simulation method in the prior art and for realizing reliable and low-cost multi-task multi-mode data simulation.
The invention provides a multi-task multi-mode data simulation method, which comprises the following steps:
acquiring original image data;
inputting the original image data into a pre-built multi-task multi-mode data simulation module, and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera, to obtain target simulation data; the target simulation data include target image data, target spatial differential data, and target temporal differential data.
According to the multi-task multi-mode data simulation method provided by the invention, the original image data are static data; inputting the original image data into the pre-built multi-task multi-mode data simulation module and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera to obtain target simulation data specifically comprises:
performing exposure simulation on the original image data according to a first exposure setting, and sampling the result of the exposure simulation to obtain target image data;
performing gray-scale exposure simulation on the original image data according to a second exposure setting to obtain a gray-scale simulation result;
obtaining target spatial differential data through a spatial differential response based on the gray-scale simulation result;
and obtaining target temporal differential data through a temporal differential response based on the gray-scale simulation result.
According to the multi-task multi-mode data simulation method provided by the invention, the original image data are dynamic data, comprising first original image data and second original image data; inputting the original image data into the pre-built multi-task multi-mode data simulation module and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera to obtain target simulation data specifically comprises:
performing exposure simulation on the first original image data according to the first exposure setting, and sampling the result of the exposure simulation to obtain target image data;
performing gray-scale exposure simulation on the first original image data according to the second exposure setting to obtain a first gray-scale simulation result, and performing gray-scale exposure simulation on the second original image data according to the second exposure setting to obtain a second gray-scale simulation result;
obtaining target spatial differential data through a spatial differential response based on the first gray-scale simulation result;
and obtaining target temporal differential data through a temporal differential response based on the first gray-scale simulation result and the second gray-scale simulation result.
According to the multi-task multi-mode data simulation method provided by the invention, the original image data are real-shot data, comprising first original image data and second original image data; inputting the original image data into the pre-built multi-task multi-mode data simulation module and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera to obtain target simulation data specifically comprises:
sampling the first original image data to obtain target image data;
obtaining target spatial differential data through a spatial differential response based on the first original image data;
and obtaining target temporal differential data through a temporal differential response based on the first original image data and the second original image data.
According to the multi-task multi-mode data simulation method provided by the invention, the first original image data comprise real-shot data from a high-dynamic-range camera and real-shot data from a high-speed camera, and the second original image data comprise the real-shot data from the high-speed camera; acquiring the original image data further comprises:
setting the high-dynamic-range camera according to a first exposure setting and setting the high-speed camera according to a second exposure setting;
and spatially aligning and time-synchronizing the high-dynamic-range camera and the high-speed camera using preset spatial calibration parameters and timestamps.
According to the multi-task multi-mode data simulation method provided by the invention, the original image data are a 3D scene based on physical simulation; inputting the original image data into the pre-built multi-task multi-mode data simulation module and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera to obtain target simulation data specifically comprises:
based on the 3D scene, obtaining the photon density on a preset imaging plane through a preset simulation tool according to a pre-written camera motion script, scene object motion script, light environment change script, and weather change script;
and obtaining target simulation data through the preset simulation tool according to the photon density and preset core indices of the sensor.
According to the multi-task multi-mode data simulation method provided by the invention, after inputting the original image data into the pre-built multi-task multi-mode data simulation module and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera to obtain the target simulation data, the method further comprises:
and constructing a target multi-task multi-mode data set based on the target simulation data.
The invention also provides a multi-task multi-mode data simulation device, which comprises:
an acquisition unit configured to acquire original image data;
a simulation unit configured to input the original image data into a pre-built multi-task multi-mode data simulation module and, according to the type of the original image data, perform simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera, to obtain target simulation data; the target simulation data include target image data, target spatial differential data, and target temporal differential data.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the multi-task multi-mode data simulation method according to any one of the above.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-task multi-mode data simulation method according to any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the multi-task multi-mode data simulation method according to any one of the above.
The invention provides a multi-task multi-mode data simulation method and device: original image data are acquired; the original image data are input into a pre-built multi-task multi-mode data simulation module, and, according to the type of the original image data, simulation is performed on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera, to obtain target simulation data; the target simulation data include target image data, target spatial differential data, and target temporal differential data. For the data streams and acquisition systems of the different input modes of the original image data, the invention designs the multi-task multi-mode data simulation module differently, adds simulation data on the basis of the input original image data, and finally outputs multi-task multi-mode target simulation data in which simulated data are primary and real data are auxiliary, thereby realizing reliable and low-cost multi-task multi-mode data simulation and solving the problems of sparse and unobtainable corner-case data at the data source.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. It is apparent that the drawings described below show some embodiments of the invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is the first schematic flow chart of the multi-task multi-mode data simulation method provided by the invention;
FIG. 2 is the second schematic flow chart of the multi-task multi-mode data simulation method provided by the invention;
FIG. 3 is a schematic view of an exposure setting in an embodiment of the multi-task multi-mode data simulation method provided by the invention;
FIG. 4 is the third schematic flow chart of the multi-task multi-mode data simulation method provided by the invention;
FIG. 5 is the fourth schematic flow chart of the multi-task multi-mode data simulation method provided by the invention;
FIG. 6 is the fifth schematic flow chart of the multi-task multi-mode data simulation method provided by the invention;
FIG. 7 is a schematic diagram of a multi-task and multi-modal data simulator according to the present invention;
Fig. 8 is a schematic structural diagram of an electronic device provided by the present invention.
Reference numerals:
710: an acquisition unit; 720: a simulation unit.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The basic principle of current mainstream image sensors is frame-based photographing and video recording, implemented by an active pixel sensor (APS) array. The active pixel sensor can only process color images arranged as pixel-matrix image frames; it offers high color reproducibility, high resolution, and high image quality, but the dynamic range of the acquired image signal is small and the shooting speed is slow.
An event camera, also known as a Dynamic Vision Sensor (DVS), is a new imaging system. Whereas a traditional camera uses a shutter to control the frame rate and all pixels record light intensity frame by frame, the event camera is sensitive to the rate of change of light intensity: each pixel independently records the change of the logarithm of the light intensity at that pixel, and when the change exceeds a threshold, a positive or negative pulse is generated. Because of its asynchronous character, the event camera is not limited by a shutter, has extremely high temporal resolution (a frame rate of about 1,000,000 fps, compared with about 100 fps for traditional cameras), and, combined with its sensitivity to change, is naturally suited to tasks such as motion monitoring. Another camera, known as DAVIS, combines a conventional Active Pixel Sensor (APS) with a DVS to record both single-frame images and event information, offering the high spatial resolution of traditional cameras together with the high temporal resolution of DVS cameras. At present, Sony and Hawk have realized various DAVIS technologies based on three-dimensional stacked fabrication processes, including setting APS and DVS at different resolutions and time-division multiplexing the same pixels to realize high-performance APS and DVS, and the like.
The complementary vision sensor (CVS) is a novel neuromorphic vision sensor. Complementary cameras are proposed on the basis of the complementary perception theory, which, with reference to the properties of human vision, requires that the acquired data be separated into a plurality of different paths whose data structures are complementary in different properties; a property dimension in which such complementarity holds is called a primitive. The primitives include temporal resolution (fast/slow complementation), spatial resolution (high/low complementation), color (complementation of spectral sensitivity ranges: color, gray scale, infrared, ultraviolet, etc.), sensitivity (high/low complementation of the response coefficient to light intensity), response mode (integral intensity or differential change), and data precision (high/low complementation). By constructing a hybrid pixel arrangement in the image sensor and designing a hybrid data readout circuit, the CVS can output RGB, spatial-differential, and temporal-differential data information using the same CMOS chip and encode the information in different modes. This information is transmitted over different data paths. The data of the different modes have strong complementary attributes, including sampling precision, sampling speed, dynamic range, sensitivity, color range, spatial resolution, and the like; the paths differ from one another and complement one another to ensure the integrity of the information, and the camera is therefore also called a complementary camera. A minimal representation sketch is given below.
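To make the multi-path output concrete, the following is a minimal sketch of how a single complementary-camera sample with RGB, SD, and TD paths might be represented in software. The class name, field names, shapes, and dtypes are illustrative assumptions, not the CVS specification.

```python
# Hypothetical container for one complementary-camera sample.
# Shapes and dtypes are illustrative assumptions, not the CVS spec.
from dataclasses import dataclass
import numpy as np

@dataclass
class CVSSample:
    rgb: np.ndarray   # H x W x 3, integral-intensity path (e.g. uint8)
    sd: np.ndarray    # H x W x 2, spatial-differential path (quantized)
    td: np.ndarray    # H x W, temporal-differential path (quantized)
    t: float          # capture timestamp in seconds
```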
Unlike conventional image sensors, the CVS is closer to the information-processing mechanism of the human retina. It simulates the computation of retinal ganglion cells at the chip level, supports isotropic and anisotropic center-surround structures, and can be used for tasks such as motion detection, scene segmentation, and target tracking. This biologically inspired sensor design is expected to greatly reduce computational complexity and improve computational efficiency.
The multi-modal output of the CVS allows it to be combined with different neural networks to construct end-to-end vision systems. It can also be applied to multi-sensor fusion, working with other, non-visual sensors to provide more robust and intelligent environmental awareness. CVS technology is expected to accelerate the practical deployment of vision algorithms and to find wide application in fields such as autonomous driving, service robots, and intelligent monitoring.
However, no simulation method suitable for the complementary vision sensor currently exists. On this basis, the invention provides a multi-task multi-mode data simulation method and device.
The multi-task multi-mode data simulation method of the present invention is described below with reference to figs. 1 to 6. As shown in fig. 1, the first schematic flow chart of the multi-task multi-mode data simulation method provided by the invention, the method includes:
Step 110: raw image data is acquired.
What is acquired are data streams in different input modes, including but not limited to static image data, dynamic data, real-shot data, and 3D virtual scenes based on physical simulation. In some embodiments, an original image dataset may be acquired directly, and simulation may be performed on it to obtain a target simulation dataset. Further, acquisition approaches for the original image data include, but are not limited to: direct shooting by a sensor, synthesis from multiple sensors, sensor shooting followed by manual processing, and neural network generation.
Step 120: inputting the original image data into the pre-built multi-task multi-mode data simulation module, and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera, to obtain target simulation data; the target simulation data include target image data, target spatial differential data, and target temporal differential data.
The present invention utilizes various complementary data modalities to construct off-chip complementarity. Specifically, the invention designs a multi-task multi-mode data simulation module. In some embodiments, the module is developed with hybrid C++ and Python programming and fully emulates a complementary camera from the physical level. It should be understood that the simulation levels include physical simulation and behavioral simulation, as shown in figs. 2 to 6; different simulation levels suit different data input sources.
In addition, in the implementation process, the invention further comprises: labeling the target simulation data. It will be appreciated that labeling complementary datasets with multiple paths is very difficult. The present embodiment first generates simulation results of complementary datasets from existing datasets by simulation, pre-labels the acquired real data with existing advanced neural-network automatic labeling techniques, and then fine-tunes the labels manually. This greatly reduces the cost of data annotation while ensuring its accuracy.
Further, the operation of the simulator falls into the following categories: (1) static image data conversion; (2) dynamic data conversion; (3) conversion of real-shot data from non-complementary cameras; and (4) simulation environment data generation.
In some embodiments, as shown in fig. 2, for the input of original image data from static data or a static dataset, the TD (temporal differential) and SD (spatial differential) paths must be complemented on the basis of the original image data. Since existing static image data are actual data already sampled through an ADC by other sensors, the static image data are input directly into the behavior simulator, which imitates the output data mode of the complementary camera. Since there are only static data, dynamic simulation needs to be added. The dynamic simulation uses a vibration method and an optical-flow warping method, imitating rigid and non-rigid motion respectively; a minimal sketch of the vibration method is given below. The benefit of converting static datasets is that a large number of semantic labels are provided by the existing datasets. The simulation flow is shown in fig. 2.
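The following is a minimal sketch of the vibration method under stated assumptions: a static image becomes a short sequence via small random rigid translations imitating camera jitter. The function name, jitter amplitude, and frame count are ours, and wrap-around edge handling is a simplification a real simulator would replace with padding.

```python
# Sketch of the "vibration" dynamic simulation: one static image becomes a
# short sequence via small rigid translations (camera jitter).
import numpy as np

def vibrate(img: np.ndarray, n_frames: int = 8, max_shift: int = 2,
            seed: int = 0) -> list[np.ndarray]:
    rng = np.random.default_rng(seed)
    frames = []
    for _ in range(n_frames):
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        # np.roll gives a cheap rigid shift; edges wrap around, which is
        # acceptable for a sketch but a real simulator would pad instead.
        frames.append(np.roll(img, shift=(dy, dx), axis=(0, 1)))
    return frames
```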
Specifically, the original image data at this time are static data. It should be noted that the source of the static data may be direct shooting by a sensor, synthesis from multiple sensors, sensor shooting followed by manual processing, neural network generation, or the like, or the data may be obtained from a static dataset. Inputting the original image data into the pre-built multi-task multi-mode data simulation module and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera to obtain target simulation data specifically comprises:
First, exposure simulation is performed on the original image data according to the first exposure setting, and the result of the exposure simulation is sampled to obtain target image data. In some embodiments, the result of the exposure simulation is sampled using a Bayer sampling pattern, as sketched below. Further, the specific sampling scheme may need to be designed according to the actual physical characteristics of the complementary camera, which the present invention does not limit.
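As a sketch of that sampling step, the following mosaics a simulated RGB exposure result with an RGGB Bayer layout; the RGGB choice is an assumption, since the patent leaves the exact pattern to the physical design of the complementary camera.

```python
# Sketch of Bayer (RGGB) sampling of a simulated RGB exposure result.
import numpy as np

def bayer_sample(rgb: np.ndarray) -> np.ndarray:
    """rgb: H x W x 3 array -> H x W single-channel Bayer mosaic."""
    h, w, _ = rgb.shape
    mosaic = np.empty((h, w), dtype=rgb.dtype)
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R at even rows, even cols
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G at even rows, odd cols
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G at odd rows, even cols
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B at odd rows, odd cols
    return mosaic
```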
Gray-scale exposure simulation is then performed on the original image data according to the second exposure setting to obtain a gray-scale simulation result. It should be appreciated that the first and second exposure settings are implemented by adjusting the cut-off, the gain, and the noise, and the specific settings need to be designed according to the actual physical characteristics of the complementary camera. When the environment is a non-high-speed, non-high-dynamic scene, the first exposure setting is determined by an automatic exposure algorithm on the first path; the second exposure setting keeps the other parameters the same, with an exposure time K times that of the first path (K < 1), where K is an adjustable parameter determined by the specific design of the pixel (a minimal sketch follows). When the environment is high-dynamic or high-speed, the exposure decision is given by one of several given strategies: for the high-dynamic scene shown in fig. 3, for example, one path is chosen as a dark-light-sensitive path (e.g., the first path) and the other as a highlight-sensitive path, and the exposure parameters of the two paths are complementary, so that the maximum light-sensitive range is obtained and an HDR image can be synthesized. Under such environmental parameters, the parameters of the first exposure setting are still determined by the automatic exposure algorithm, but the expected outcome of the automatic exposure is given by the HDR exposure rules, and the second exposure setting is given according to the complementary constraints.
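The following is a minimal sketch of deriving the second (gray-scale) exposure setting from the first in the non-high-speed, non-high-dynamic case: same gain, cut-off, and noise parameters, exposure time scaled by the adjustable factor K < 1. The class and field names are assumptions, not the patent's parameterization.

```python
# Sketch of the K-scaled second exposure setting (non-HDR, non-high-speed).
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ExposureSetting:
    exposure_time_s: float
    gain: float
    cutoff: float        # saturation / cut-off level
    noise_sigma: float   # additive-noise model parameter

def second_from_first(first: ExposureSetting, k: float) -> ExposureSetting:
    # K is an adjustable parameter fixed by the pixel design, with K < 1.
    assert 0.0 < k < 1.0
    return replace(first, exposure_time_s=k * first.exposure_time_s)
```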
After the gray-scale simulation result is obtained, target spatial differential data are obtained through the spatial differential response based on the gray-scale simulation result. It should be appreciated that, in the spatial differential response, the spatial difference (SD) value is the differenced and quantized result between the current-time output value of the photosensitive subunit in a pixel unit and the current-time output values of the photosensitive subunits in the pixel units associated with it. The SD path outputs the spatial difference between the pixel value of a pixel unit (x, y) at time t_n (the current time) and the pixel values of its associated pixel units at time t_n; the pixel units associated with the pixel unit (x, y) comprise a first pixel unit and a second pixel unit, both adjacent to it, such that the first pixel unit, the pixel unit, and the second pixel unit are not collinear.
For example, the first pixel unit and the second pixel unit are the pixel unit (x+1, y) and the pixel unit (x, y+1), respectively; note that x and y are pixel coordinates and carry no unit.
At this time, the SD path outputs a spatial difference value SD_x(x,y,t_n) between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x+1, y) at time t_n, and a spatial difference value SD_y(x,y,t_n) between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x, y+1) at time t_n:
SD_x(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x+1,y,t_n))
SD_y(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x,y+1,t_n))
Or, for example, the first pixel unit and the second pixel unit are the pixel unit (x-1, y+1) and the pixel unit (x+1, y+1), respectively. At this time, the SD path outputs a spatial difference value between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x+1, y+1) at time t_n, and a spatial difference value between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x-1, y+1) at time t_n:
SD(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x+1,y+1,t_n))
SD(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x-1,y+1,t_n))
In the above equations, Q_SD is the quantization method used by the spatial differential path.
I(x+1,y,t_n), I(x,y+1,t_n), I(x+1,y+1,t_n), and I(x-1,y+1,t_n) are the output values of the photosensitive subunits inside the pixel units (x+1, y), (x, y+1), (x+1, y+1), and (x-1, y+1), respectively, at time t_n.
All of the above signals are three-dimensional quantities, comprising the two spatial dimensions x and y and the temporal dimension t.
Correspondingly, after the gray-scale simulation result is obtained, target temporal differential data are obtained through the temporal differential response based on the gray-scale simulation result. In some embodiments, the gray-scale result is processed further, e.g. an image warping and deformation operation is performed on the gray-scale simulation result to obtain a warped result, and the target temporal differential data are then obtained through the temporal differential response based on the gray-scale simulation result and the warped result. It should be understood that, in the temporal differential response, the temporal difference (TD) value is the differenced and quantized result between the current-time output value and the previous-time output value of the photosensitive subunit in a pixel unit; the TD path outputs a time difference value TD(x,y,t_n) of the current pixel unit (x, y) at time t_n, expressed as:
TD(x,y,t_n) = Q_TD(I(x,y,t_n) - I(x,y,t_{n-1}))
In the above formula, I(x,y,t_n) and I(x,y,t_{n-1}) are the output values of the photosensitive subunit in the current pixel unit (x, y) at time t_n and at the previous time t_{n-1}, respectively, and Q_TD is the quantization method used by the temporal differential path.
Further, the responses of the spatial and temporal differentials need to be designed according to the actual physical characteristics of the complementary camera, which the present invention does not limit. A minimal sketch of the differential responses follows.
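The following sketches the SD and TD responses defined above. The uniform quantizer standing in for Q_SD and Q_TD (step size, bit width) and the wrap-around handling of border pixels are illustrative assumptions; the actual quantization is fixed by the physical design of the complementary camera.

```python
# Sketch of spatial- and temporal-differential responses with a stand-in
# uniform quantizer for Q_SD / Q_TD.
import numpy as np

def quantize(d: np.ndarray, step: float = 4.0, n_bits: int = 4) -> np.ndarray:
    lim = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(d / step), -lim, lim).astype(np.int8)

def spatial_diff(gray: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """SD_x(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x+1,y,t_n)); SD_y analogously."""
    sd_x = quantize(gray - np.roll(gray, -1, axis=1))  # neighbor at (x+1, y)
    sd_y = quantize(gray - np.roll(gray, -1, axis=0))  # neighbor at (x, y+1)
    return sd_x, sd_y

def temporal_diff(gray_now: np.ndarray, gray_prev: np.ndarray) -> np.ndarray:
    """TD(x,y,t_n) = Q_TD(I(x,y,t_n) - I(x,y,t_{n-1}))."""
    return quantize(gray_now - gray_prev)
```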
In some embodiments, TPS warping is used to perform the image warping and deformation operation on the gray-scale simulation result. TPS warping refers to thin-plate-spline (TPS) warping, a method for image distortion and deformation in image processing. It is based on a mathematical model that realizes distortion and deformation of an image by controlling local mesh deformation. The principle of TPS warping is thin-plate-spline interpolation: each point on the image is assumed to have a corresponding control point, and the shape of the image is changed by moving the control points. Specifically, TPS regards each point in the image as a point in a two-dimensional coordinate system, computes a mesh-deformation function from the positions of the control points, and then uses that function to deform the image. The key to TPS warping is computing this mesh-deformation function, which is realized by constructing a local deformation model between control points: the new position of each image point is computed from the distances and relative positions of the control points, and points near a control point are affected more strongly and thus deformed more. In one embodiment, TPS warping is used to generate small deformations of a static scene, introducing non-rigid motion to imitate the changes of a real scene and to differentiate the information content of TD (target temporal differential data) and SD (target spatial differential data).
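The following is a sketch of such a TPS-style warp using SciPy's thin-plate-spline RBF interpolator (scipy.interpolate.RBFInterpolator, SciPy 1.7+). The control-point count and jitter amplitude are illustrative assumptions chosen to produce small non-rigid deformations.

```python
# Sketch of TPS-style non-rigid warping via thin-plate-spline interpolation.
import numpy as np
from scipy.interpolate import RBFInterpolator

def tps_warp(img: np.ndarray, n_ctrl: int = 16, jitter: float = 3.0,
             seed: int = 0) -> np.ndarray:
    h, w = img.shape[:2]
    rng = np.random.default_rng(seed)
    src = rng.uniform([0, 0], [h - 1, w - 1], size=(n_ctrl, 2))  # (row, col)
    dst = src + rng.normal(scale=jitter, size=src.shape)         # moved points
    # Backward mapping: for each output coordinate, find the input coordinate.
    tps = RBFInterpolator(dst, src, kernel='thin_plate_spline')
    yy, xx = np.mgrid[0:h, 0:w]
    coords = tps(np.column_stack([yy.ravel(), xx.ravel()]))
    yi = np.clip(coords[:, 0].round().astype(int), 0, h - 1)
    xi = np.clip(coords[:, 1].round().astype(int), 0, w - 1)
    return img[yi, xi].reshape(img.shape)
```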
In some embodiments, as shown in fig. 4, the original image data input for dynamic data or a dynamic dataset, i.e., video data input, may be fed directly into the behavior simulator without adding other physical behaviors. Notably, the labels of most video datasets may be sparse. If the simulation is data generation serving a particular task, the semantic labels may need to be aligned to a particular time using optical-flow frame interpolation, as sketched after this paragraph. The simulation flow is shown in fig. 4.
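The following is a sketch of that alignment step under stated assumptions: a label map from frame A is warped to frame B's time using dense Farneback optical flow from OpenCV. Treating the semantic label as a per-pixel map warped backward along the flow is our simplification; the function name is hypothetical.

```python
# Sketch: align a per-pixel label map to another time via dense optical flow.
import cv2
import numpy as np

def warp_label_to_b(gray_a: np.ndarray, gray_b: np.ndarray,
                    label_a: np.ndarray) -> np.ndarray:
    """gray_a, gray_b: 8-bit single-channel frames; label_a: label map of A."""
    # Backward flow (B -> A): for each pixel of B, where it came from in A.
    flow_ba = cv2.calcOpticalFlowFarneback(gray_b, gray_a, None,
                                           0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_b.shape
    xx, yy = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (xx + flow_ba[..., 0]).astype(np.float32)
    map_y = (yy + flow_ba[..., 1]).astype(np.float32)
    # Nearest-neighbor sampling keeps label values discrete.
    return cv2.remap(label_a, map_x, map_y, cv2.INTER_NEAREST)
```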
Specifically, at this time the original image data are dynamic data, comprising first original image data and second original image data. It should be noted that the source of the dynamic data may be direct shooting by a sensor, synthesis from multiple sensors, sensor shooting followed by manual processing, neural network generation, or the like, or the data may be obtained from a dynamic dataset. Inputting the original image data into the pre-built multi-task multi-mode data simulation module and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera to obtain target simulation data specifically comprises:
First, exposure simulation is performed on the first original image data (original image T) according to the first exposure setting, and the result of the exposure simulation is sampled to obtain target image data. In some embodiments, the result of the exposure simulation is sampled using a Bayer sampling pattern. Further, the specific sampling scheme may need to be designed according to the actual physical characteristics of the complementary camera, which the present invention does not limit.
Then, gray-scale exposure simulation is performed on the first original image data (original image T) according to the second exposure setting to obtain a first gray-scale simulation result, and gray-scale exposure simulation is performed on the second original image data (original image T+1) according to the second exposure setting to obtain a second gray-scale simulation result. It should be appreciated that the first and second exposure settings are implemented by adjusting the cut-off, the gain, and the noise, and the specific settings need to be designed according to the actual physical characteristics of the complementary camera. When the environment is a non-high-speed, non-high-dynamic scene, the first exposure setting is determined by an automatic exposure algorithm on the first path; the second exposure setting keeps the other parameters the same, with an exposure time K times that of the first path (K < 1), where K is an adjustable parameter determined by the specific design of the pixel. When the environment is high-dynamic or high-speed, the exposure decision is given by one of several given strategies: for the high-dynamic scene shown in fig. 3, for example, one path is chosen as a dark-light-sensitive path (e.g., the first path) and the other as a highlight-sensitive path, and the exposure parameters of the two paths are complementary, so that the maximum light-sensitive range is obtained and an HDR image can be synthesized. Under such environmental parameters, the parameters of the first exposure setting are still determined by the automatic exposure algorithm, but the expected outcome of the automatic exposure is given by the HDR exposure rules, and the second exposure setting is given according to the complementary constraints.
Further, it can be understood that the original image T is the original image at time T, and the original image T+1 is the original image at time T+1.
After the gray-scale simulation results are obtained, target spatial differential data are obtained through the spatial differential response based on the first gray-scale simulation result. It should be appreciated that, in the spatial differential response, the spatial difference (SD) value is the differenced and quantized result between the current-time output value of the photosensitive subunit in a pixel unit and the current-time output values of the photosensitive subunits in the pixel units associated with it. The SD path outputs the spatial difference between the pixel value of a pixel unit (x, y) at time t_n (the current time) and the pixel values of its associated pixel units at time t_n; the pixel units associated with the pixel unit (x, y) comprise a first pixel unit and a second pixel unit, both adjacent to it, such that the first pixel unit, the pixel unit, and the second pixel unit are not collinear.
For example, the first pixel unit and the second pixel unit are the pixel unit (x+1, y) and the pixel unit (x, y+1), respectively; note that x and y are pixel coordinates and carry no unit.
At this time, the SD path outputs a spatial difference value SD_x(x,y,t_n) between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x+1, y) at time t_n, and a spatial difference value SD_y(x,y,t_n) between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x, y+1) at time t_n:
SD_x(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x+1,y,t_n))
SD_y(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x,y+1,t_n))
Or, for example, the first pixel unit and the second pixel unit are the pixel unit (x-1, y+1) and the pixel unit (x+1, y+1), respectively. At this time, the SD path outputs a spatial difference value between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x+1, y+1) at time t_n, and a spatial difference value between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x-1, y+1) at time t_n:
SD(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x+1,y+1,t_n))
SD(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x-1,y+1,t_n))
In the above equations, Q_SD is the quantization method used by the spatial differential path.
I(x+1,y,t_n), I(x,y+1,t_n), I(x+1,y+1,t_n), and I(x-1,y+1,t_n) are the output values of the photosensitive subunits inside the pixel units (x+1, y), (x, y+1), (x+1, y+1), and (x-1, y+1), respectively, at time t_n.
All of the above signals are three-dimensional quantities, comprising the two spatial dimensions x and y and the temporal dimension t.
Target temporal differential data are obtained through the temporal differential response based on the first gray-scale simulation result and the second gray-scale simulation result. It should be understood that, in the temporal differential response, the temporal difference (TD) value is the differenced and quantized result between the current-time output value and the previous-time output value of the photosensitive subunit in a pixel unit; the TD path outputs a time difference value TD(x,y,t_n) of the current pixel unit (x, y) at time t_n, expressed as:
TD(x,y,t_n) = Q_TD(I(x,y,t_n) - I(x,y,t_{n-1}))
In the above formula, I(x,y,t_n) and I(x,y,t_{n-1}) are the output values of the photosensitive subunit in the current pixel unit (x, y) at time t_n and at the previous time t_{n-1}, respectively, and Q_TD is the quantization method used by the temporal differential path.
Further, the responses of the spatial and temporal differentials need to be designed according to the actual physical characteristics of the complementary camera, which the present invention does not limit.
In some embodiments, for the input of real-shot data or of original image data from a real-shot dataset, the raw data quality is good enough that the most advanced existing large vision-task models can be used directly for pre-labeling, and the labeled data are then sent to the simulator for simulation. It should be understood that a real-shot dataset here refers to a complementary dataset built with cameras other than a complementary camera. Because such data generally have good raw quality, pre-labeling them with the most advanced existing large vision-task models before simulation avoids the influence of the data modes of the differential data and of information loss. The simulation flow is as follows.
Specifically, at this time the original image data are real-shot data, comprising first original image data and second original image data. It should be noted that the source of the real-shot data may be direct shooting by a sensor, synthesis from multiple sensors, sensor shooting followed by manual processing, neural network generation, or the like, or the data may be obtained from a dataset. Inputting the original image data into the pre-built multi-task multi-mode data simulation module and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera to obtain target simulation data specifically comprises:
First, the first original image data (original image T) are sampled to obtain target image data. In some embodiments, a Bayer sampling pattern is used for this sampling. Further, the specific sampling scheme may need to be designed according to the actual physical characteristics of the complementary camera, which the present invention does not limit.
Thereafter, target spatial differential data are obtained through the spatial differential response based on the first original image data (original image T). It should be appreciated that, in the spatial differential response, the spatial difference (SD) value is the differenced and quantized result between the current-time output value of the photosensitive subunit in a pixel unit and the current-time output values of the photosensitive subunits in the pixel units associated with it. The SD path outputs the spatial difference between the pixel value of a pixel unit (x, y) at time t_n (the current time) and the pixel values of its associated pixel units at time t_n; the pixel units associated with the pixel unit (x, y) comprise a first pixel unit and a second pixel unit, both adjacent to it, such that the first pixel unit, the pixel unit, and the second pixel unit are not collinear.
For example, the first pixel unit and the second pixel unit are the pixel unit (x+1, y) and the pixel unit (x, y+1), respectively; note that x and y are pixel coordinates and carry no unit.
At this time, the SD path outputs a spatial difference value SD_x(x,y,t_n) between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x+1, y) at time t_n, and a spatial difference value SD_y(x,y,t_n) between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x, y+1) at time t_n:
SD_x(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x+1,y,t_n))
SD_y(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x,y+1,t_n))
Or, for example, the first pixel unit and the second pixel unit are the pixel unit (x-1, y+1) and the pixel unit (x+1, y+1), respectively. At this time, the SD path outputs a spatial difference value between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x+1, y+1) at time t_n, and a spatial difference value between the pixel value of the pixel unit (x, y) at time t_n and the pixel value of the pixel unit (x-1, y+1) at time t_n:
SD(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x+1,y+1,t_n))
SD(x,y,t_n) = Q_SD(I(x,y,t_n) - I(x-1,y+1,t_n))
In the above equations, Q_SD is the quantization method used by the spatial differential path.
I(x+1,y,t_n), I(x,y+1,t_n), I(x+1,y+1,t_n), and I(x-1,y+1,t_n) are the output values of the photosensitive subunits inside the pixel units (x+1, y), (x, y+1), (x+1, y+1), and (x-1, y+1), respectively, at time t_n.
All of the above signals are three-dimensional quantities, comprising the two spatial dimensions x and y and the temporal dimension t.
The target temporal differential data are obtained through the temporal differential response based on the first original image data (original image T) and the second original image data (original image T+1). It should be understood that, in the temporal differential response, the temporal difference (TD) value is the differenced and quantized result between the current-time output value and the previous-time output value of the photosensitive subunit in a pixel unit; the TD path outputs a time difference value TD(x,y,t_n) of the current pixel unit (x, y) at time t_n, expressed as:
TD(x,y,t_n) = Q_TD(I(x,y,t_n) - I(x,y,t_{n-1}))
In the above formula, I(x,y,t_n) and I(x,y,t_{n-1}) are the output values of the photosensitive subunit in the current pixel unit (x, y) at time t_n and at the previous time t_{n-1}, respectively, and Q_TD is the quantization method used by the temporal differential path.
Further, the responses of the spatial and temporal differentials need to be designed according to the actual physical characteristics of the complementary camera, which the present invention does not limit. It can be understood that the original image T is the original image at time T, and the original image T+1 is the original image at time T+1.
In a specific embodiment, as shown in fig. 5, the first original image data comprise real-shot data from a high-dynamic-range camera and real-shot data from a high-speed camera, and the second original image data comprise the real-shot data from the high-speed camera; acquiring the original image data further comprises:
setting the high-dynamic-range camera according to the first exposure setting and setting the high-speed camera according to the second exposure setting;
and spatially aligning and time-synchronizing the high-dynamic-range camera and the high-speed camera using preset spatial calibration parameters and timestamps.
Specifically, the method used in this embodiment is to synchronize HDR (high dynamic range imaging) data with a high-speed camera: one path of the complementary camera is simulated by the high-dynamic-range camera, the other path by the high-speed camera, and labels are generated from the raw data. The benefit of this is that the actual exposure can be controlled directly, giving a more realistic reproduction of the behavior of the complementary paths in extreme environments. The high-speed camera and the high-dynamic-range camera need to be spatially aligned (via a beam splitter) and time-synchronized; a minimal sketch of this pairing and alignment follows. It should be appreciated that the first and second exposure settings are implemented by adjusting the cut-off, the gain, and the noise, and the specific settings need to be designed according to the actual physical characteristics of the complementary camera. When the environment is a non-high-speed, non-high-dynamic scene, the first exposure setting is determined by an automatic exposure algorithm on the first path; the second exposure setting keeps the other parameters the same, with an exposure time K times that of the first path (K < 1), where K is an adjustable parameter determined by the specific design of the pixel. When the environment is high-dynamic or high-speed, the exposure decision is given by one of several given strategies: for the high-dynamic scene shown in fig. 3, for example, one path is chosen as a dark-light-sensitive path (e.g., the first path) and the other as a highlight-sensitive path, and the exposure parameters of the two paths are complementary, so that the maximum light-sensitive range is obtained and an HDR image can be synthesized. Under such environmental parameters, the parameters of the first exposure setting are still determined by the automatic exposure algorithm, but the expected outcome of the automatic exposure is given by the HDR exposure rules, and the second exposure setting is given according to the complementary constraints.
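The following is a minimal sketch of that pairing and alignment under stated assumptions: each HDR frame is paired with the high-speed frame nearest in timestamp, and the high-speed frame is mapped into the HDR view with a pre-calibrated homography H from the beam-splitter calibration. The names and the homography representation are ours.

```python
# Sketch: pair HDR frames with the temporally nearest high-speed frames and
# warp them into a common view with a pre-calibrated homography.
import numpy as np
import cv2

def pair_and_align(hdr_ts: np.ndarray, hs_ts: np.ndarray,
                   hs_frames: list, H: np.ndarray,
                   out_size: tuple) -> list:
    """hdr_ts, hs_ts: timestamp arrays in seconds; out_size: (width, height)."""
    aligned = []
    for t in hdr_ts:
        j = int(np.argmin(np.abs(hs_ts - t)))        # nearest timestamp
        aligned.append(cv2.warpPerspective(hs_frames[j], H, out_size))
    return aligned
```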
In some embodiments, as shown in fig. 6, for the original image data input of simulation environment data, the original image data are a 3D scene based on physical simulation; inputting the original image data into the pre-built multi-task multi-mode data simulation module and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera to obtain target simulation data specifically comprises:
based on the 3D scene, obtaining the photon density on a preset imaging plane through a preset simulation tool according to a pre-written camera motion script, scene object motion script, light environment change script, and weather change script;
and obtaining target simulation data through the preset simulation tool according to the photon density and preset core indices of the sensor.
In particular, compared with simulation based on the input of static image data, dynamic data, or real-shot data from non-complementary cameras, simulation based on simulation environment data best shows the advantage of complementary cameras, namely that each complementary path can encode optical information that is as different as possible, so as to extend the visual perception range and reduce information redundancy. In one specific embodiment, simulations are performed in a physically based rendering (PBR) environment. With some existing 3D scenes, camera motion scripts, scene object motion scripts, light environment change scripts, and weather change scripts are pre-written, the photon density is obtained through PBR simulation, and target simulation data are then obtained from the core indices of the complementary sensor following the simulation flow of fig. 6. Because this is a simulation environment, high-density depth and semantic label information can easily be obtained as ground truth. In some embodiments, the core indices of the complementary sensor comprise a complete set of imaging indices for each data path, and one complementary sensor has at least two data sensing paths. In some embodiments, the preset core indices of the sensor for each sensing path include, but are not limited to: data precision (quantization precision), noise model, pixel size, pixel arrangement, color filter, sampling frame rate, resolution, data modality (temporal difference, spatial difference, absolute value, etc.), range of adjustable exposure parameters, and dark-current model. In one specific embodiment, as shown in fig. 6, the specific parameters include: luminous flux, photon shot noise, dark-current shot noise, dark-current fixed-pattern noise, random telegraph signal noise, flicker noise, thermal noise, offset fixed-pattern noise, and the like. A minimal behavioral sketch of such a photon-to-output model follows.
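The following sketch maps simulated photon density to a quantized sensor output and covers only a subset of the noise terms listed above (photon shot noise, dark-current shot noise, and a Gaussian thermal/read term). All parameter values are illustrative assumptions, not the patent's preset core indices.

```python
# Sketch: map simulated photon density to a quantized sensor output.
import numpy as np

def sense(photons_per_s: np.ndarray, exposure_s: float, qe: float = 0.6,
          dark_e_per_s: float = 5.0, read_sigma_e: float = 2.0,
          full_well_e: float = 10_000.0, n_bits: int = 10,
          seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # Photon shot noise: Poisson statistics on collected photoelectrons.
    signal_e = rng.poisson(qe * photons_per_s * exposure_s)
    # Dark-current shot noise, also Poisson.
    dark_e = rng.poisson(dark_e_per_s * exposure_s, photons_per_s.shape)
    # Thermal / read noise as an additive Gaussian term.
    e = signal_e + dark_e + rng.normal(0.0, read_sigma_e, photons_per_s.shape)
    e = np.clip(e, 0.0, full_well_e)  # full-well saturation (cut-off)
    return np.round(e / full_well_e * (2 ** n_bits - 1)).astype(np.uint16)
```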
Further, in some embodiments, after inputting the original image data into the pre-built multi-task multi-mode data simulation module and, according to the type of the original image data, performing simulation on the basis of the original image data in accordance with the physical parameters and simulation level of the simulated target camera to obtain the target simulation data, the method further comprises:
and constructing a target multi-task multi-mode data set based on the target simulation data.
The existing image sensor datasets of the prior art have the following disadvantages. First, their sample sizes are relatively limited, and large-scale datasets are lacking. Second, some datasets lack instance-level annotation information, which limits their application to certain tasks. In addition, these datasets cover a narrow range of scenes and lack data for some special scenes. Moreover, some datasets are complex to collect, requiring specific equipment and conditions, which limits the scale of data collection. Therefore, simulator development for a sensor, besides serving to verify the functionality of the image sensor, also encompasses obtaining a more complete dataset based on the simulator, which is likewise a great aid to the application of the sensor itself.
The multi-task multi-mode data simulation module provided by the invention can generate a multi-task multi-mode dataset. A multi-modal dataset is a dataset comprising several different types of data, such as images, text, and audio. Such datasets provide rich information for solving complex problems: for example, the combination of images and text can be used for image captioning and automatic description, and the combination of audio and text can be used for speech recognition and machine translation. By studying and analyzing multi-modal datasets, the relations among different modalities can be better understood, and more innovative, higher-performance multi-modal intelligent systems can be developed. Using multi-modal datasets not only brings higher accuracy and better results, but also better meets user needs and provides more personalized services. It should be appreciated that current multi-modal dataset collection systems are typically composed of spatially compact sensor arrays. Integration and synchronization of the collection system are important, because the core applications of multi-modal datasets are multi-modal tasks, which require good consistency, synchronization, and alignment in space and time. Currently available collection means include, but are not limited to, roof-mounted sensor arrays, hand-held sensor arrays, head-mounted sensor arrays, unmanned vehicles, and the like. Combined with Internet-of-Things technology, there are now precedents for collecting multi-modal data through network channels. These datasets generally have many paths, many modes, many tasks, large volume, and high processing difficulty, but their collection process is complex and consumes a great deal of time and cost.
In a specific embodiment, based on the multi-task multi-mode data simulation module, the invention converts the COCO/BDD100K static datasets into detection and segmentation datasets, and also converts video instance segmentation datasets such as the YouTube dataset. The scale of these data far exceeds that of the real datasets already acquired, and they serve as pre-training data for downstream training tasks. Meanwhile, more self-made datasets are generated from real-shot data and simulation data. This well addresses the problems that collecting multi-modal datasets is complicated and consumes a great deal of time and cost, enables the training of larger-scale neural network algorithms, and generalizes effectively to real environments.
Further, the invention also uses data shot directly by the complementary camera as a small-scale training set, validation set, and test set. The real data are pre-labeled using a reconstruction method and open-domain algorithms such as SAM. Data from other sensors (e.g., radar) are used as auxiliary labels referenced to the data from the complementary camera. A practical neural network training pipeline is then provided, based on pre-training with simulation data followed by semi-supervised training with sparsely labeled real data. An algorithm obtained with this training pipeline can be deployed directly on a robot system using this complementary data acquisition system.
Further, it should be appreciated that the data structure of the target multi-task multi-modal dataset allows the data to have any number of paths, where at least one primitive is complementary between each pair of paths. The data source may be generated by simulating a single APS, DVS, DAVIS, or spatial-gradient camera, by combining different types of cameras, or by manufacturing a separate camera that implements all primitives. A specific example is shown in the table below.
TABLE 2: Target multi-task multi-modal dataset in multiple data-source formats
In addition, the invention further requires the target multi-task multi-modality dataset to be designed using the CVS as a core and other cameras as auxiliary tags. The aim is to maximize the potential of complementary cameras based on CVS composition. These are merely examples, and can be extended to virtually any number of paths, the physical implementation of which is any multi-camera (APS, DVS or CVS), requiring that at least one primitive be complementary in nature between every two paths, guaranteeing low bandwidth and information integrity. The data acquisition requirements of the invention include the use of the complementary visual sensor in the chip and the use of the complementary data modality outside the chip. Compared with the existing multi-mode data set, the method has the advantages that the requirements on mutual complementation and complementation of different data are higher. The invention is more compact due to the use of complementary visual sensors.
The obtained target multi-task multi-mode data set can be provided for deep learning training. The dataset includes CVS and auxiliary camera data, together with intrinsic and extrinsic parameters determined by calibration. The dataset additionally provides lidar and IMU data to assist camera pose estimation and depth estimation tasks. The data in the target multi-task multi-mode dataset do not necessarily carry labels; unlabeled data can be used for CVS applications such as image denoising, image defogging, high-speed high-dynamic-range (HDR) image reconstruction and super-resolution. For labeled data in the target multi-modal dataset, some embodiments exploit the data integrity of the complementary camera itself, combined with a UNet-optical flow network, to train a reconstruction algorithm on semantically annotated data. Applied to the complementary camera, the reconstruction method yields image data with high frame rate and high dynamic range, which serves as the multi-modal reference to which other data such as radar and IMU are aligned. Further, in some embodiments, detection and segmentation labels can be obtained by pre-labeling the data with large-scale pre-trained DETR2 and SAM, followed by manual screening.
The multi-task multi-mode data simulation method provided by the invention solves, at the data source, the problem that corner-case data are sparse and unobtainable. Dense corner-case data can be introduced through simulation, and the complementary dataset can directly record corner cases (such as high-speed HDR scenes and flashes) that general datasets cannot capture. This is critical to the safety and high performance of open-world robotic tasks.
The invention provides a multi-task multi-mode data simulation method, which comprises: acquiring original image data; inputting the original image data into a pre-built multi-task multi-mode data simulation module, and performing simulation on the basis of the original image data according to the type of the original image data and the simulation level of a simulation target camera, to obtain target simulation data; the target simulation data includes target image data, target spatial differential data, and target temporal differential data. The invention designs the multi-task multi-mode data simulation module differently for the data flows and acquisition systems of the different input modes of the original image data, adds simulation data on the basis of the input original image data, and finally outputs multi-task multi-mode target simulation data with simulation data as the main part and real data as the auxiliary part, thereby realizing reliable and low-cost multi-task multi-mode data simulation and solving, at the data source, the problem that corner-case data are sparse and unobtainable.
The following describes the multi-task multi-mode data simulation device provided by the invention; the device described below and the method described above may be referred to in correspondence with each other. As shown in fig. 7, fig. 7 is a schematic structural diagram of the multi-task multi-mode data simulation device provided by the present invention, where the device includes:
An acquisition unit 710 for acquiring original image data;
The simulation unit 720 is configured to input the original image data into a pre-built multi-task multi-mode data simulation module, and to perform simulation on the basis of the original image data according to the type of the original image data and the physical parameters and simulation level of a simulation target camera, so as to obtain target simulation data; the target simulation data includes target image data, target spatial differential data, and target temporal differential data.
Based on the above embodiment, in the apparatus, the original image data is static data; the simulation unit 720 specifically includes:
performing exposure simulation on the original image data according to the first exposure setting, and sampling the result of the exposure simulation to obtain target image data;
Performing gray scale exposure simulation on the original image data according to the second exposure setting to obtain a gray scale simulation result;
obtaining target space differential data through space differential response based on the gray simulation result;
and obtaining target time differential data through time differential response based on the gray scale simulation result. A sketch of this static-data branch is given below.
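By way of illustration, the following is a minimal sketch of the static-data branch in Python/NumPy. It assumes the original image is a float HxWx3 array in [0, 1], models exposure as a gain followed by clipping, models sampling as striding, and uses finite differences for the spatial response; all function and parameter names are assumptions, and for static input the temporal response is shown as zero:

    import numpy as np

    def simulate_static(raw, exposure_rgb=1.0, exposure_gray=0.5, stride=2):
        """Sketch of the static-data simulation branch (assumed model)."""
        # Exposure simulation under the first exposure setting, then sampling.
        target_image = np.clip(raw * exposure_rgb, 0.0, 1.0)[::stride, ::stride]

        # Grayscale exposure simulation under the second exposure setting.
        gray = np.clip(raw.mean(axis=-1) * exposure_gray, 0.0, 1.0)

        # Spatial differential response: horizontal/vertical finite differences.
        sd_x = np.diff(gray, axis=1, append=gray[:, -1:])
        sd_y = np.diff(gray, axis=0, append=gray[-1:, :])
        target_spatial = np.stack([sd_x, sd_y])

        # Temporal differential response: a static scene yields no change,
        # so the response is identically zero (kept for a uniform interface).
        target_temporal = np.zeros_like(gray)
        return target_image, target_spatial, target_temporal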
Based on the above embodiment, in the apparatus, the raw image data is dynamic data, including first raw image data and second raw image data; the simulation unit 720 specifically includes:
performing exposure simulation on the first original image data according to a first exposure setting, and sampling a result of the exposure simulation to obtain target image data;
Gray scale exposure simulation is carried out on the first original image data according to the second exposure setting, and a first gray scale simulation result is obtained; gray scale exposure simulation is carried out on the second original image data according to the second exposure setting, and a second gray scale simulation result is obtained;
Obtaining target space differential data through space differential response based on the first gray scale simulation result;
and obtaining target time differential data through time differential response based on the first gray scale simulation result and the second gray scale simulation result. A sketch of this temporal differential step is given below.
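As an illustration, a minimal sketch of the temporal differential response for the dynamic-data branch follows, assuming a DVS-style thresholded log-intensity change between the two grayscale simulation results; the log response and the threshold value are assumptions, not the patent's prescribed model:

    import numpy as np

    def temporal_differential(gray1, gray2, theta=0.1, eps=1e-6):
        """Thresholded log-intensity change between two grayscale results."""
        d = np.log(gray2 + eps) - np.log(gray1 + eps)
        events = np.zeros(d.shape, dtype=np.int8)
        events[d > theta] = 1    # brightness increased
        events[d < -theta] = -1  # brightness decreased
        return events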
Based on the above embodiment, in the apparatus, the raw image data is real shot data, including first raw image data and second raw image data; the simulation unit 720 specifically includes:
sampling the first original image data to obtain target image data;
Obtaining target space difference data through space difference response based on the first original image data;
and obtaining target time difference data through time difference response based on the first original image data and the second original image data. A sketch of this real-shot branch is given below.
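A minimal sketch of this real-shot branch follows; unlike the static and dynamic branches, no exposure simulation is applied, and the three target modalities are derived directly from the captured frames. Striding as the sampling model and log differences as the temporal response are assumptions:

    import numpy as np

    def simulate_from_real(frame1, frame2, stride=2, eps=1e-6):
        """Derive the three target modalities directly from real frames."""
        target_image = frame1[::stride, ::stride]  # sampled first frame
        g1 = frame1.mean(axis=-1)
        g2 = frame2.mean(axis=-1)
        # Spatial difference response from the first frame.
        target_spatial = np.stack([np.diff(g1, axis=1, append=g1[:, -1:]),
                                   np.diff(g1, axis=0, append=g1[-1:, :])])
        # Temporal difference response between the two frames.
        target_temporal = np.log(g2 + eps) - np.log(g1 + eps)
        return target_image, target_spatial, target_temporal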
Based on the above embodiment, in the apparatus, the first raw image data includes data captured by a high dynamic range camera and data captured by a high speed camera, and the second raw image data includes data captured by the high speed camera; the acquiring unit 710 further includes:
Setting the high dynamic range camera according to a first exposure setting and setting the high speed camera according to a second exposure setting;
And performing spatial alignment and time synchronization of the high-dynamic-range camera and the high-speed camera using preset spatial calibration parameters and timestamps. A sketch of this alignment and synchronization step is given below.
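For illustration, the following is a minimal sketch of the alignment and synchronization step, assuming the preset spatial calibration parameters take the form of a homography and that synchronization pairs each HDR frame with the high-speed frame whose timestamp is nearest; the use of OpenCV and the nearest-timestamp pairing are assumptions:

    import numpy as np
    import cv2  # OpenCV, assumed available for perspective warping

    def align_and_sync(hdr_frames, hs_frames, hdr_ts, hs_ts, homography):
        """Pair each HDR frame with the nearest high-speed frame and warp it
        into the HDR camera's image plane using the calibration homography."""
        hs_ts = np.asarray(hs_ts)
        pairs = []
        for frame, t in zip(hdr_frames, hdr_ts):
            j = int(np.argmin(np.abs(hs_ts - t)))  # nearest timestamp
            h, w = frame.shape[:2]
            warped = cv2.warpPerspective(hs_frames[j], homography, (w, h))
            pairs.append((frame, warped))
        return pairs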
Based on the above embodiment, in the apparatus, the original image data is a 3D scene based on physical simulation; the simulation unit 720 specifically includes:
based on the 3D scene, obtaining photon density on a preset imaging plane through a preset simulation tool according to a camera motion script, a scene object motion script, a light environment change script and a weather change script which are written in advance;
and obtaining target simulation data through a preset simulation tool according to the photon density and preset core indices of the sensor. A sketch of this sensor-response step is given below.
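As an illustration of how photon density and the sensor's core indices could yield simulation data, the following is a minimal sketch of a physically based sensor response with shot noise and read noise; the specific indices used here (quantum efficiency, full-well capacity, read noise) and their values are assumptions:

    import numpy as np

    def sensor_response(photon_density, exposure_s=1e-3, qe=0.6,
                        full_well=10000, read_noise_e=2.0, rng=None):
        """Convert photon density on the imaging plane into a normalized
        sensor reading using a few core sensor indices (assumed model)."""
        rng = rng or np.random.default_rng()
        mean_e = photon_density * exposure_s * qe            # expected photo-electrons
        electrons = rng.poisson(mean_e).astype(np.float64)   # photon shot noise
        electrons += rng.normal(0.0, read_noise_e, electrons.shape)  # read noise
        electrons = np.clip(electrons, 0.0, full_well)
        return electrons / full_well                         # normalized output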
Based on the above embodiment, in the apparatus, the simulation unit 720 further includes:
and constructing a target multi-task multi-mode data set based on the target simulation data.
The multi-task multi-mode data simulation device provided by the invention acquires original image data; inputs the original image data into a pre-built multi-task multi-mode data simulation module, and performs simulation on the basis of the original image data according to the type of the original image data and the simulation level of a simulation target camera, to obtain target simulation data; the target simulation data includes target image data, target spatial differential data, and target temporal differential data. The invention designs the multi-task multi-mode data simulation module differently for the data flows and acquisition systems of the different input modes of the original image data, adds simulation data on the basis of the input original image data, and finally outputs multi-task multi-mode target simulation data with simulation data as the main part and real data as the auxiliary part, thereby realizing reliable and low-cost multi-task multi-mode data simulation and solving, at the data source, the problem that corner-case data are sparse and unobtainable.
Fig. 8 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 8, the electronic device may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, where processor 810, communication interface 820 and memory 830 communicate with one another through communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the multi-task multi-mode data simulation method, comprising: acquiring original image data; inputting the original image data into a pre-built multi-task multi-mode data simulation module, and performing simulation on the basis of the original image data according to the type of the original image data and the simulation level of a simulation target camera, to obtain target simulation data; the target simulation data includes target image data, target spatial differential data, and target temporal differential data.
Further, the logic instructions in the memory 830 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can perform the multi-task multi-mode data simulation method provided by the methods described above, the method comprising: acquiring original image data; inputting the original image data into a pre-built multi-task multi-mode data simulation module, and performing simulation on the basis of the original image data according to the type of the original image data and the simulation level of a simulation target camera, to obtain target simulation data; the target simulation data includes target image data, target spatial differential data, and target temporal differential data.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-task multi-mode data simulation method provided by the methods above, the method comprising: acquiring original image data; inputting the original image data into a pre-built multi-task multi-mode data simulation module, and performing simulation on the basis of the original image data according to the type of the original image data and the simulation level of a simulation target camera, to obtain target simulation data; the target simulation data includes target image data, target spatial differential data, and target temporal differential data.
The apparatus embodiments described above are merely illustrative; units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for simulating multi-task and multi-mode data, comprising:
acquiring original image data;
Inputting the original image data into a pre-built multi-task multi-mode data simulation module, and performing simulation on the basis of the original image data according to the type of the original image data and the simulation level of a simulation target camera, to obtain target simulation data; the target simulation data includes target image data, target spatial differential data, and target temporal differential data.
2. The multi-task multi-mode data simulation method of claim 1, wherein the raw image data is static data; the inputting of the original image data into a pre-built multi-task multi-mode data simulation module and performing simulation on the basis of the original image data according to the type of the original image data and the simulation level of a simulation target camera to obtain target simulation data specifically comprises:
performing exposure simulation on the original image data according to the first exposure setting, and sampling the result of the exposure simulation to obtain target image data;
Performing gray scale exposure simulation on the original image data according to the second exposure setting to obtain a gray scale simulation result;
obtaining target space differential data through space differential response based on the gray simulation result;
and obtaining target time differential data through time differential response based on the gray scale simulation result.
3. The multi-task multi-mode data simulation method of claim 1, wherein the raw image data is dynamic data comprising first raw image data and second raw image data; the inputting of the original image data into a pre-built multi-task multi-mode data simulation module and performing simulation on the basis of the original image data according to the type of the original image data and the simulation level of a simulation target camera to obtain target simulation data specifically comprises:
performing exposure simulation on the first original image data according to a first exposure setting, and sampling a result of the exposure simulation to obtain target image data;
Gray scale exposure simulation is carried out on the first original image data according to the second exposure setting, and a first gray scale simulation result is obtained; gray scale exposure simulation is carried out on the second original image data according to the second exposure setting, and a second gray scale simulation result is obtained;
Obtaining target space differential data through space differential response based on the first gray scale simulation result;
and obtaining target time differential data through time differential response based on the first gray scale simulation result and the second gray scale simulation result.
4. The multi-task multi-mode data simulation method of claim 1, wherein the raw image data is real-shot data comprising first raw image data and second raw image data; the inputting of the original image data into a pre-built multi-task multi-mode data simulation module and performing simulation on the basis of the original image data according to the type of the original image data and the simulation level of a simulation target camera to obtain target simulation data specifically comprises:
sampling the first original image data to obtain target image data;
Obtaining target space difference data through space difference response based on the first original image data;
and obtaining target time difference data through time difference response based on the first original image data and the second original image data.
5. The method of claim 4, wherein the first raw image data comprises data captured by a high dynamic range camera and data captured by a high speed camera, and the second raw image data comprises data captured by the high speed camera; the acquiring of the original image data further comprises:
Setting the high dynamic range camera according to a first exposure setting and setting the high speed camera according to a second exposure setting;
And carrying out spatial alignment and time synchronization on the high-dynamic-range camera and the high-speed camera by using preset spatial calibration parameters and time stamps.
6. The multi-task multi-mode data simulation method of claim 1, wherein the original image data is a 3D scene based on physical simulation; the inputting of the original image data into a pre-built multi-task multi-mode data simulation module and performing simulation on the basis of the original image data according to the type of the original image data and the simulation level of a simulation target camera to obtain target simulation data specifically comprises:
based on the 3D scene, obtaining photon density on a preset imaging plane through a preset simulation tool according to a camera motion script, a scene object motion script, a light environment change script and a weather change script which are written in advance;
and obtaining target simulation data through a preset simulation tool according to the photon density and a preset core index of the sensor.
7. The method for multi-task multi-mode data simulation according to any one of claims 1 to 6, wherein the inputting of the original image data into a pre-built multi-task multi-mode data simulation module and performing simulation on the basis of the original image data according to the type of the original image data and the simulation level and physical parameters of a simulation target camera to obtain target simulation data further comprises:
and constructing a target multi-task multi-mode data set based on the target simulation data.
8. A multi-tasking multi-modality data simulation apparatus comprising:
An acquisition unit configured to acquire original image data;
The simulation unit is configured to input the original image data into a pre-built multi-task multi-mode data simulation module, and to perform simulation on the basis of the original image data according to the type of the original image data and the physical parameters and simulation level of a simulation target camera, to obtain target simulation data; the target simulation data includes target image data, target spatial differential data, and target temporal differential data.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-tasking multi-modal data simulation method according to any of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a multi-tasking multi-modal data simulation method according to any of claims 1 to 7.
CN202311766510.5A 2023-12-20 Multi-task multi-mode data simulation method and device, Pending

Priority Applications (1)

Application Number: CN202311766510.5A; Priority Date: 2023-12-20; Filing Date: 2023-12-20; Title: Multi-task multi-mode data simulation method and device

Publications (1)

Publication Number: CN117934647A; Publication Date: 2024-04-26

Family ID: 90751327

Country Status (1)

Country: CN; Publication: CN117934647A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination