CN115984327B - Self-adaptive vision tracking method, system, equipment and storage medium


Info

Publication number
CN115984327B
CN115984327B (Application CN202310001472.8A)
Authority
CN
China
Prior art keywords
image
dynamic range
module
data set
high dynamic
Prior art date
Legal status
Active
Application number
CN202310001472.8A
Other languages
Chinese (zh)
Other versions
CN115984327A (en)
Inventor
李学龙
王之港
赵斌
崔梦瑶
Current Assignee
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date
Filing date
Publication date
Application filed by Shanghai AI Innovation Center
Priority to CN202310001472.8A
Publication of CN115984327A
Application granted
Publication of CN115984327B


Landscapes

  • Image Processing (AREA)

Abstract

The embodiment of the application relates to the technical field of image processing, and in particular to a self-adaptive vision tracking method, system, equipment, and storage medium. The method includes the following steps: first, performing feature extraction and feature fusion on the input image data and automatically correcting the image, where the input image data include a low dynamic range image and color event stream data; next, generating a high dynamic range image from the automatically corrected image; finally, taking the high dynamic range image as input and tracking it. The self-adaptive vision tracking method provided by the embodiment of the application can solve the problem of information loss in overexposed or underexposed areas of the image in existing tracking methods, and enhances image quality under complex illumination conditions.

Description

Self-adaptive vision tracking method, system, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of image processing, and in particular to a self-adaptive vision tracking method, system, equipment, and storage medium.
Background
In a laser-based wireless energy supply unmanned aerial vehicle system, stable and precise tracking of the unmanned aerial vehicle target by a ground photoelectric tracking system is a precondition for establishing the whole laser wireless energy transmission link. In the process of tracking an aerial unmanned aerial vehicle, the following problems mainly exist. First, the unmanned aerial vehicle readily undergoes attitude changes and scale changes during flight, caused by shaking of the airframe and changes in the distance to the ground. In addition, tracking is mainly based on video shot by a traditional camera, and because of the low dynamic range of the traditional camera it is difficult to adapt to environments with extreme illumination conditions or complex illumination changes.
Conventional cameras can capture only a portion of real-world luminance information (their imaging dynamic range is limited), and luminance below or above the dynamic-range thresholds is lost during shooting. In addition, existing tracking methods apply little or no illumination processing to the input image, and therefore have little ability to cope with extreme illumination conditions or complex illumination changes.
Disclosure of Invention
The embodiment of the application provides a self-adaptive visual tracking method, system, equipment, and storage medium, which solve the problem of information loss in overexposed or underexposed areas of the image in existing tracking methods and enhance image quality under complex illumination conditions.
In order to solve the above technical problems, in a first aspect, an embodiment of the present application provides an adaptive visual tracking method, including the following steps: firstly, carrying out feature extraction and feature fusion on input image data, and automatically correcting an image; the input image data includes a low dynamic range image and color event stream data; next, after the image is automatically corrected, a high dynamic range image is generated; finally, the high dynamic range image is taken as an input, and tracked.
In some exemplary embodiments, automatically correcting an image includes: generating a training data set and a simulation test data set from the simulation data set, and generating a real test data set from the real data set; training with the training data set and testing with the test data set; the test data set comprises a simulation test data set and a real test data set; extracting multi-scale event stream features, multi-scale image features and global image features from a test data set; and carrying out feature fusion on the extracted multi-scale event stream features, the multi-scale image features and the global image features to generate a high dynamic range image.
In some exemplary embodiments, a combination of a color event camera and a conventional camera is employed to obtain a real dataset; the real dataset comprises: high dynamic range images, low dynamic range images, and color event stream data.
In some exemplary embodiments, the analog data set includes: high dynamic range images, low dynamic range images, and color event stream data; the training color event stream data are generated, based on the high dynamic range image data, by using an event camera simulator to simulate random small-amplitude dithering of the high dynamic range image.
In a second aspect, embodiments of the present application further provide an adaptive vision tracking system, including an image automatic correction module and a spatial scale tracking module connected in sequence. The image automatic correction module is used for performing feature extraction and feature fusion on the input image data, automatically correcting the image, and generating a high dynamic range image; the spatial scale tracking module is used for taking the high dynamic range image as input and tracking it.
In some exemplary embodiments, the image automatic correction module includes a data generation module and a model design module. The data generation module is used for generating a test data set from the simulation data set and the real shot data set. The model design module includes an event stream feature extraction module, an image feature extraction module, and a feature fusion module: the event stream feature extraction module is used for extracting multi-scale event stream features; the image feature extraction module is used for extracting multi-scale image features and global image features; and the feature fusion module performs feature fusion on the multi-scale event stream features, the multi-scale image features, and the global image features to generate a high dynamic range image.
In some exemplary embodiments, the spatial scale tracking module includes a position estimation module and a scale estimation module; the position estimation module is used for updating position parameters; the scale estimation module is used for updating the scale parameters.
In addition, the application also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the adaptive vision tracking method described above.
In addition, the application also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the adaptive vision tracking method described above.
The technical scheme provided by the embodiment of the application has at least the following advantages:
The embodiment of the application provides a self-adaptive visual tracking method, system, equipment, and storage medium, wherein the method includes the following steps: first, performing feature extraction and feature fusion on the input image data and automatically correcting the image, where the input image data include low dynamic range images and color event stream data; next, after the image is automatically corrected, generating a high dynamic range image; finally, taking the high dynamic range image as input and tracking it.
The application provides a self-adaptive vision tracking method and system. First, in order to adaptively cope with illumination changes, the application proposes an automatic image correction module combining a traditional camera and an event camera, which solves the problem of information loss in overexposed or underexposed areas of the image in existing tracking methods and enhances image quality under complex illumination conditions. In particular, because event cameras have a high dynamic range, they can capture brightness variations under underexposure or overexposure conditions, but cannot image directly the way conventional cameras do. Therefore, the application combines the two hardware devices and proposes a dedicated neural network structure that uses the data from both devices to generate a more stable and clear image under complex illumination conditions. Further, the application constructs a streaming tracking method that first automatically corrects the image and then tracks the target using the scale estimation module and the pose estimation module.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, which are not to be construed as limiting the embodiments unless specifically indicated otherwise.
FIG. 1 is a flow chart of an adaptive vision tracking method according to an embodiment of the present application;
FIG. 2 is a flow chart of an adaptive vision tracking method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an adaptive vision tracking system according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an automatic image correction module according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
As described in the background art, existing tracking methods basically do not perform illumination processing on the input image, or perform only simple processing, and therefore cannot cope with extreme illumination conditions or complex illumination changes.
Conventional cameras can capture only a portion of real-world luminance information (their imaging dynamic range is limited), and luminance below or above the dynamic-range thresholds is lost during shooting. The purpose of high dynamic range imaging is to recover the brightness of the real world from low dynamic range images. Existing high dynamic range image correction methods fall mainly into three categories: generating a high dynamic range image from a single low dynamic range image; synthesizing a high dynamic range image from multiple low dynamic range images; and synthesizing a high dynamic range image from a single low dynamic range image assisted by images of other modalities. The implementation paths of the three schemes are detailed below.
First, generating a high dynamic range image from a single low dynamic range image. Such methods mainly comprise three parts: mapping the nonlinear low dynamic range image to a linear space; completing the lost information within the original dynamic range; and completing the lost information outside the original dynamic range (over-dark or overexposed areas). The nonlinear-to-linear image mapping is typically achieved by estimating the camera response function (CRF) parameters. The lost brightness information is usually supplemented using a generative model, but for regions where information is severely lost the generated result can differ greatly from the actual image, making it difficult to use.
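As an illustration of the first part (mapping the nonlinear LDR image into linear space), the following minimal sketch assumes a fixed gamma curve stands in for the estimated camera response function; the function names, the gamma value, and the saturation thresholds are illustrative assumptions rather than the method of this application.

```python
import numpy as np

def linearize_ldr(ldr_uint8: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    """Map a nonlinear 8-bit LDR image into approximately linear radiance.

    A real system would estimate the camera response function (CRF); here a
    fixed gamma curve is assumed in its place.
    """
    normalized = ldr_uint8.astype(np.float32) / 255.0  # [0, 1] nonlinear values
    return np.power(normalized, gamma)                 # approximate inverse CRF

def saturation_mask(ldr_uint8: np.ndarray, low: int = 5, high: int = 250) -> np.ndarray:
    """Mark pixels whose true radiance was lost to under- or over-exposure."""
    return (ldr_uint8 <= low) | (ldr_uint8 >= high)
```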
Second, synthesizing a high dynamic range image from multiple low dynamic range images. Such methods use two or more images with different low dynamic ranges for high dynamic range image synthesis, for example three low dynamic range images that are overexposed, underexposed, and normally exposed, respectively. There are two ways to acquire the low dynamic range images: one is to shoot continuously with the same camera using different apertures or exposure durations, and the other is to shoot simultaneously with multiple cameras. The first strategy is applicable only in static or near-static situations and is not suitable for fast-moving unmanned systems. The second strategy requires multiple cameras, which are inconvenient to deploy.
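For reference, a common realization of this multi-exposure strategy is the Debevec calibrate-and-merge pipeline available in OpenCV. The sketch below assumes three hypothetical bracketed shots of a static scene and illustrative exposure times; it shows the prior-art category only and is not the method proposed in this application.

```python
import cv2
import numpy as np

# Hypothetical file names for three bracketed shots of the same static scene.
paths = ["under.jpg", "normal.jpg", "over.jpg"]
exposure_times = np.array([1 / 1000, 1 / 250, 1 / 60], dtype=np.float32)
images = [cv2.imread(p) for p in paths]

# Recover the camera response curve, then merge the bracket into one HDR radiance map.
response = cv2.createCalibrateDebevec().process(images, exposure_times)
hdr = cv2.createMergeDebevec().process(images, exposure_times, response)

# Tone-map only for display; the floating-point radiance map itself is the HDR result.
preview = cv2.createTonemap(2.2).process(hdr)
cv2.imwrite("hdr_preview.png", np.clip(preview * 255, 0, 255).astype(np.uint8))
```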
Third, synthesizing a high dynamic range image from a single low dynamic range image assisted by data of other modalities. Such methods synthesize a single low dynamic range image with the output data of an analog-to-digital camera, an event camera, and the like to generate a high dynamic range image. Related art uses a strategy that fuses a monochrome event stream with a low dynamic range image for image synthesis. First, the low dynamic range image is mapped from nonlinear space to linear space. Subsequently, the luminance channel of the low dynamic range image is separated, and the monochrome event stream data are processed into a luminance map. Finally, the luminance channel and the event-stream luminance map are merged to generate a high dynamic range image. A monochrome event stream, however, cannot supplement the missing color information of the low dynamic range image.
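To make the luminance-only limitation concrete, here is a minimal numpy sketch of the related-art fusion step described above; the function name, the luminance weights, and the mask convention are assumptions for illustration.

```python
import numpy as np

def fuse_luminance(ldr_linear: np.ndarray, event_luminance: np.ndarray,
                   mask_lost: np.ndarray) -> np.ndarray:
    """Replace the luminance of clipped LDR regions with an event-derived luminance map.

    ldr_linear      : HxWx3 linear RGB image in [0, 1]
    event_luminance : HxW luminance map reconstructed from a monochrome event stream
    mask_lost       : HxW boolean mask of under-/over-exposed pixels
    """
    # Separate the luminance (Y) channel of the LDR image.
    y = (0.2126 * ldr_linear[..., 0] + 0.7152 * ldr_linear[..., 1]
         + 0.0722 * ldr_linear[..., 2])
    y_fused = np.where(mask_lost, event_luminance, y)

    # Rescale RGB by the luminance ratio. In clipped areas the original chroma is
    # unreliable, which is exactly the missing color information a monochrome
    # event stream cannot supply.
    scale = y_fused / np.maximum(y, 1e-6)
    return np.clip(ldr_linear * scale[..., None], 0.0, None)
```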
The defects of the existing high dynamic range imaging technology are classified according to the technical scheme, and can be correspondingly summarized into three aspects:
(1) For the scheme of single picture generation, the brightness information of the severely overexposed/underexposed part cannot be recovered.
(2) For a scheme of fusing a plurality of pictures, the method is difficult to meet the requirement of real-time detection and tracking.
(3) For the multi-mode fusion scheme, only brightness information can be recovered, and color information cannot be recovered.
For the scheme of single-picture generation, the brightness information of severely overexposed/underexposed parts cannot be recovered. For severely overexposed/underexposed areas (appearing as fully black/fully white areas in the picture), existing generation techniques have difficulty producing the missing information out of nothing, and data with a higher dynamic range are needed as a supplement.
For the scheme that fuses multiple pictures, when the shooting scene moves rapidly, the scene changes greatly between pictures captured sequentially with different dynamic ranges, making them difficult to synthesize. Shooting with multiple cameras places higher requirements on the payload of the unmanned aerial vehicle and on image fusion.
For the scheme that combines a single picture with monochrome event stream data, the generation of the luminance map from the monochrome event stream data is processed independently, so the result of the overall process cannot be fed back to adjust that step, and a monochrome event stream is ill-suited to providing accurate color information.
As mentioned above, since existing tracking methods basically apply no illumination processing to the input image itself, or only simple processing, they have little ability to cope with extreme illumination conditions or complex illumination changes. In order to solve this technical problem, an embodiment of the present application provides a self-adaptive vision tracking method, including the following steps: first, performing feature extraction and feature fusion on the input image data and automatically correcting the image, where the input image data include a low dynamic range image and color event stream data; next, after the image is automatically corrected, generating a high dynamic range image; finally, taking the high dynamic range image as input and tracking it.
The application provides a self-adaptive vision tracking method and system with an image automatic correction module, so that a low dynamic range image can be converted into a high dynamic range image, which in turn is more favorable for target tracking; this solves the problem of information loss in overexposed or underexposed areas of the image in existing tracking methods and enhances image quality under complex illumination conditions.
Embodiments of the present application will be described in detail below with reference to the attached drawings. However, it will be understood by those of ordinary skill in the art that in various embodiments of the present application, numerous specific details are set forth in order to provide a thorough understanding of the present application. The claimed application may be practiced without these specific details and with various changes and modifications based on the following embodiments.
Referring to fig. 1, an embodiment of the present application provides an adaptive vision tracking method, including the steps of:
S1, carrying out feature extraction and feature fusion on input image data, and automatically correcting an image; the input image data includes a low dynamic range image and color event stream data.
Step S2, after the image is automatically corrected, a high dynamic range image is generated.
And S3, taking the high dynamic range image as an input, and tracking the high dynamic range image.
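Taken together, steps S1 to S3 form a simple streaming flow. The sketch below is a minimal illustration in which correct_image and track are placeholder callables standing in for the image automatic correction module and the spatial scale tracking module described later; their names and signatures are assumptions, not an API defined by this application.

```python
from typing import Callable, Tuple
import numpy as np

Box = Tuple[int, int, int, int]  # (x, y, width, height)

def adaptive_visual_tracking(
    ldr_frame: np.ndarray,
    color_events: np.ndarray,
    correct_image: Callable[[np.ndarray, np.ndarray], np.ndarray],
    track: Callable[[np.ndarray], Box],
) -> Tuple[np.ndarray, Box]:
    """S1/S2: correct the LDR frame into an HDR frame; S3: track on the HDR frame."""
    hdr_frame = correct_image(ldr_frame, color_events)  # feature extraction + fusion
    target_box = track(hdr_frame)                       # tracking on the corrected image
    return hdr_frame, target_box
```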
It should be noted that a low dynamic range image is prone to overexposure and to overly dark regions. The application converts the low dynamic range image into a high dynamic range image through the image automatic correction module; the high dynamic range image is closer to normal illumination and to the visual effect perceived by the human eye.
The application provides a self-adaptive visual tracking method for targets under complex changes, which as a whole follows a streaming structure of first performing illumination processing on the input image and then performing target tracking. The illumination processing module is an image automatic correction module combining a traditional camera and a color event camera; it solves the problem of information loss in overexposed or underexposed areas of the image in existing tracking methods and enhances image quality under complex illumination conditions.
Referring to fig. 2, in some embodiments the image is automatically corrected by the image automatic correction module in step S1. Automatically correcting the image includes: generating a training data set and a simulation test data set from the simulation data set, and generating a real test data set from the real data set; training with the training data set and testing with the test data set, where the test data set comprises the simulation test data set and the real test data set; extracting multi-scale event stream features, multi-scale image features, and global image features from the test data set; and performing feature fusion on the extracted multi-scale event stream features, multi-scale image features, and global image features to generate a high dynamic range image.
The overall tracking scheme of the application is shown in fig. 2 and can be divided into two parts. The image automatic correction module takes a low dynamic range image and the corresponding color event stream data as input, and generates a high dynamic range image by extracting and fusing the features of the two kinds of data. Then, the high dynamic range image is taken as input for tracking. The subsequent tracking algorithm is not limited; the application adopts a specific tracking method capable of coping with pose changes and scale changes, so that the whole process can cope well with pose changes, scale changes, and illumination changes of the environment where the target is located.
In some embodiments, a combination of a color event camera and a conventional camera is employed to obtain a real dataset; the real dataset comprises: high dynamic range images (HDR images), low dynamic range images (LDR images), and color event stream data.
In some embodiments, the analog data set includes: high dynamic range images, low dynamic range images, and color event stream data; the training color event stream data is generated by simulating random small-amplitude dithering of a high dynamic range image using an event camera simulator based on the high dynamic range image data.
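As a rough illustration of how training event data can be derived from an HDR frame, the sketch below emits an event wherever the log intensity of a randomly jittered copy of the frame changes beyond a contrast threshold. Real event camera simulators additionally model timestamps, noise, and pixel bandwidth; the function names, threshold, and jitter model here are illustrative assumptions.

```python
import numpy as np

def jitter(frame: np.ndarray, max_shift: int = 2, rng=np.random) -> np.ndarray:
    """Random small-amplitude translation standing in for the 'random small-amplitude
    dithering' of the high dynamic range image described above."""
    dy, dx = rng.randint(-max_shift, max_shift + 1, size=2)
    return np.roll(frame, shift=(int(dy), int(dx)), axis=(0, 1))

def simulate_color_events(frame_prev: np.ndarray, frame_next: np.ndarray,
                          threshold: float = 0.15):
    """Emit per-pixel, per-color-channel events where the log intensity between two
    (jittered) HDR frames changes by more than a contrast threshold.

    Returns (rows, cols, channels, polarities); timestamps are omitted in this sketch.
    """
    diff = (np.log(np.maximum(frame_next, 1e-6))
            - np.log(np.maximum(frame_prev, 1e-6)))
    rows, cols, chans = np.nonzero(np.abs(diff) >= threshold)
    polarities = np.sign(diff[rows, cols, chans]).astype(np.int8)
    return rows, cols, chans, polarities

# Usage sketch: hdr is an HxWx3 float radiance image; the returned events approximate
# what a color event camera would report while the camera shakes slightly.
# events = simulate_color_events(hdr, jitter(hdr))
```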
Referring to fig. 3, an embodiment of the present application further provides an adaptive vision tracking system, including an image automatic correction module 101 and a spatial scale tracking module 102 connected in sequence. The image automatic correction module 101 is used for performing feature extraction and feature fusion on the input image data, automatically correcting the image, and generating a high dynamic range image; the spatial scale tracking module 102 is configured to take the high dynamic range image as input and track it.
Referring to fig. 4, in some embodiments the image automatic correction module 101 includes a data generation module 1011 and a model design module 1012. The data generation module 1011 is used for generating a test data set from the simulated data set and the real shot data set. The model design module 1012 includes an event stream feature extraction module 1012a, an image feature extraction module 1012b, and a feature fusion module 1012c, wherein the event stream feature extraction module 1012a is used for extracting multi-scale event stream features, the image feature extraction module 1012b is used for extracting multi-scale image features and global image features, and the feature fusion module 1012c performs feature fusion on the multi-scale event stream features, the multi-scale image features, and the global image features to generate a high dynamic range image.
The image automatic correction module provided by the application can be divided into two parts, as shown in fig. 4: data processing and model design. In the data processing part, because there is no publicly available paired dataset of color event camera data with high dynamic range (HDR) and low dynamic range (LDR) images, the application generates a dataset using a strategy that combines a simulated dataset with a real shot dataset. In the simulated dataset part, random small-amplitude dithering of the high dynamic range image is simulated with an event camera simulator, based on an existing high dynamic range/low dynamic range image dataset, to produce color event stream data. To test how well training results on the simulated dataset transfer to real use, a real dataset was also captured with a combined color event camera and traditional camera system, including HDR images, LDR images, and color event stream data.
The model design part is mainly divided into three modules, as shown in the right half of fig. 4. First, the event stream feature extraction module extracts features per event-stream color channel using a convolutional long short-term memory network, and generates multi-scale event stream features with an upsampling neural network. Meanwhile, in the LDR image feature extraction module, a linearization module and a dequantization module (which can be obtained from existing work) are used to pre-process the LDR image, mapping the nonlinear, discrete LDR image into a linear, continuous HDR image space. Then, deconvolution upsampling neural networks and convolutional neural networks are used alternately to extract multi-scale image features. In order to learn information from the normally exposed part of the global LDR image and use it to supplement the overexposed, underexposed, and other parts, upper and lower thresholds are set on the pixel brightness of the LDR image, a mask image (mask) is generated from these thresholds to separate the normally exposed area of the LDR image, and a Transformer model is used to extract global image features of the normally exposed part. Finally, the feature fusion module fuses the event stream features and the LDR image features and generates the HDR image. Specifically, a channel attention mechanism is used to select suitable event-stream color-channel features so as to effectively supplement and correct the color of the LDR image, and a spatial attention mechanism is used to locate the areas with missing color and brightness that most need recovery; according to the spatial attention result, a sub-network further extracts spatial features of the event stream. The fusion module performs feature fusion at multiple scales and upsamples the fusion result layer by layer to the target HDR image size with a deconvolution neural network, synthesizing the final HDR image. The whole training process of the corresponding neural network is end-to-end, so the feature extraction for the event stream data can be adjusted through feedback during training. In addition, the application uses color event stream data and designs a channel attention mechanism over the color channels in the network structure, which allows the color information of overexposed/over-dark areas to be completed more accurately.
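The following PyTorch sketch is a much-simplified, single-scale stand-in for this fusion design that keeps only the channel attention and spatial attention ideas. It omits the convolutional LSTM encoder, the multi-scale pyramid, the linearization/dequantization pre-processing, and the Transformer branch; the module names, channel counts, and event voxel format (six channels, assumed to be positive/negative polarity per color) are assumptions for illustration, not the network disclosed here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Weights event-stream color channels before fusion (squeeze-and-excitation style)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x).view(x.size(0), -1, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    """Highlights regions whose color/brightness most need recovery."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class FusionHDRNet(nn.Module):
    """Single-scale stand-in for the multi-scale event/LDR fusion described above."""
    def __init__(self, event_ch: int = 6, image_ch: int = 3, feat: int = 32):
        super().__init__()
        self.event_enc = nn.Sequential(nn.Conv2d(event_ch, feat, 3, padding=1),
                                       nn.ReLU(inplace=True))
        self.image_enc = nn.Sequential(nn.Conv2d(image_ch, feat, 3, padding=1),
                                       nn.ReLU(inplace=True))
        self.channel_att = ChannelAttention(feat)
        self.spatial_att = SpatialAttention()
        self.decode = nn.Sequential(
            nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, image_ch, 3, padding=1))

    def forward(self, event_voxels, ldr_linear):
        e = self.spatial_att(self.channel_att(self.event_enc(event_voxels)))
        i = self.image_enc(ldr_linear)
        return self.decode(torch.cat([e, i], dim=1))  # predicted HDR image
```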
In some embodiments, the spatial scale tracking module 102 includes a position estimation module 1021 and a scale estimation module 1022; the position estimation module 1021 is used for updating the position parameter; the scale estimation module 1022 is used to update the scale parameters.
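Since the application does not fix a particular tracking algorithm, the sketch below uses plain template matching over a small scale pyramid purely to illustrate how a position estimation step and a scale estimation step can alternate each frame; it is a toy stand-in, not the tracking method of this application.

```python
import cv2
import numpy as np

class ScaleAwareTracker:
    """Toy stand-in for the position estimation + scale estimation modules:
    position is updated by template matching, scale by searching a small pyramid."""

    def __init__(self, frame: np.ndarray, box):
        x, y, w, h = box
        self.box = box
        self.template = frame[y:y + h, x:x + w].copy()

    def update(self, frame: np.ndarray):
        _, _, w, h = self.box
        best_score, best_box = -1.0, self.box
        for s in (0.95, 1.0, 1.05):                       # scale estimation step
            tw, th = max(8, int(w * s)), max(8, int(h * s))
            tmpl = cv2.resize(self.template, (tw, th))
            res = cv2.matchTemplate(frame, tmpl, cv2.TM_CCOEFF_NORMED)
            _, score, _, loc = cv2.minMaxLoc(res)
            if score > best_score:                        # position estimation step
                best_score, best_box = score, (loc[0], loc[1], tw, th)
        self.box = best_box
        self.template = cv2.resize(self.template, (self.box[2], self.box[3]))
        return self.box
```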
Based on the above, the application provides a self-adaptive visual tracking method and system for targets under complex changes; the method as a whole follows a streaming structure of first performing illumination processing on the input image and then performing target tracking. The self-adaptive vision tracking system provided by the application designs an image automatic correction module that combines a traditional camera and an event camera, and uses the color features of the event stream and the global features of the LDR image to recover the brightness and color information of over-dark and overexposed areas; a dedicated network structure is designed to realize automatic correction of all image areas. The self-adaptive vision tracking method and system provided by the application are the only known scheme that generates a high dynamic range image using a color event camera together with a traditional camera.
Existing fusion schemes for event streams and traditional images use a monochrome event stream, so only the missing brightness information can be supplemented and the color information is still missing, whereas color images help the performance of downstream tasks such as object detection. In addition, the application uses more advanced feature expression and extraction schemes and achieves a better effect in feature fusion.
For the self-adaptive vision tracking method and system provided by the application, generation of the simulation dataset has been completed. The model structure has been preliminarily completed; compared with HDR image generation schemes that use only a single modality, it can generate HDR images of higher quality.
Referring to fig. 5, another embodiment of the present application provides an electronic device including: at least one processor 110; and a memory 111 communicatively coupled to the at least one processor; the memory 111 stores instructions executable by the at least one processor 110, the instructions being executable by the at least one processor 110 to enable the at least one processor 110 to perform any one of the method embodiments described above.
Where the memory 111 and the processor 110 are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors 110 and the memory 111 together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 110 is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor 110.
The processor 110 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 111 may be used to store data used by processor 110 in performing operations.
Another embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.
That is, it will be understood by those skilled in the art that all or part of the steps in the methods of the embodiments described above may be implemented by a program stored in a storage medium, where the program includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments described above. The aforementioned storage medium includes various media capable of storing program code, such as a USB disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
By the above technical solution, the embodiments of the present application provide a self-adaptive visual tracking method, system, device and storage medium, where the method includes the following steps: firstly, carrying out feature extraction and feature fusion on input image data, and automatically correcting an image; the input image data includes a low dynamic range image and color event stream data; next, after the image is automatically corrected, a high dynamic range image is generated; finally, the high dynamic range image is taken as an input, and tracked.
The application provides a self-adaptive vision tracking method and system. First, in order to adaptively cope with illumination changes, the application proposes an automatic image correction module combining a traditional camera and an event camera, which solves the problem of information loss in overexposed or underexposed areas of the image in existing tracking methods and enhances image quality under complex illumination conditions. In particular, because event cameras have a high dynamic range, they can capture brightness variations under underexposure or overexposure conditions, but cannot image directly the way conventional cameras do. Therefore, the application combines the two hardware devices and proposes a dedicated neural network structure that uses the data from both devices to generate a more stable and clear image under complex illumination conditions. Further, the application constructs a streaming tracking method that first automatically corrects the image and then tracks the target using the scale estimation module and the pose estimation module.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the application and that various changes in form and details may be made therein without departing from the spirit and scope of the application. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the application, and the scope of the application is therefore intended to be limited only by the appended claims.

Claims (7)

1. An adaptive vision tracking method, comprising:
Carrying out feature extraction and feature fusion on input image data, and automatically correcting the image; the input image data includes a low dynamic range image and color event stream data;
after the image is automatically corrected, generating a high dynamic range image;
Taking the high dynamic range image as an input, and tracking the high dynamic range image; automatically correcting an image, comprising:
Generating a training data set and a simulation test data set from the simulation data set, and generating a real test data set from the real data set; training with the training data set, and testing with the testing data set; the test data set comprises a simulation test data set and a real test data set;
Extracting multi-scale event stream features, multi-scale image features and global image features from the test data set;
And carrying out feature fusion on the extracted multi-scale event stream features, the multi-scale image features and the global image features to generate a high dynamic range image.
2. The adaptive vision tracking method of claim 1, wherein a combination of a color event camera and a conventional camera is used to obtain a real dataset; the real dataset comprises: high dynamic range images, low dynamic range images, and color event stream data.
3. The adaptive vision tracking method of claim 1, wherein the simulated data set comprises: high dynamic range images, low dynamic range images, and color event stream data;
The training color event stream data is generated by simulating random small-amplitude dithering of a high dynamic range image using an event camera simulator based on the high dynamic range image data.
4. An adaptive vision tracking system, comprising: the system comprises an image automatic correction module and a spatial scale tracking module which are connected in sequence;
The image automatic correction module is used for carrying out feature extraction and feature fusion on input image data and carrying out automatic correction on the image; wherein,
The automatic image correction module comprises a data generation module and a model design module, wherein the data generation module is used for generating a test data set according to a simulation data set and a real shooting data set; the model design module comprises an event stream feature extraction module, an image feature extraction module and a feature fusion module, wherein the event stream feature extraction module is used for extracting multi-scale event stream features; the image feature extraction module is used for extracting multi-scale image features and global image features; the feature fusion module performs feature fusion on the multi-scale event stream features, the multi-scale image features and the global image features to generate a high dynamic range image;
the spatial scale tracking module is used for taking the high dynamic range image as input and tracking the high dynamic range image.
5. The adaptive vision tracking system of claim 4, wherein the spatial scale tracking module comprises a position estimation module and a scale estimation module; the position estimation module is used for updating position parameters; the scale estimation module is used for updating the scale parameters.
6. An electronic device, comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the adaptive vision tracking method of any one of claims 1 to 3.
7. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the adaptive vision tracking method of any one of claims 1 to 3.
CN202310001472.8A 2023-01-03 2023-01-03 Self-adaptive vision tracking method, system, equipment and storage medium Active CN115984327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310001472.8A CN115984327B (en) 2023-01-03 2023-01-03 Self-adaptive vision tracking method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310001472.8A CN115984327B (en) 2023-01-03 2023-01-03 Self-adaptive vision tracking method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115984327A CN115984327A (en) 2023-04-18
CN115984327B (en) 2024-05-07

Family

ID=85966482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310001472.8A Active CN115984327B (en) 2023-01-03 2023-01-03 Self-adaptive vision tracking method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115984327B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112396562A (en) * 2020-11-17 2021-02-23 中山大学 Disparity map enhancement method based on RGB and DVS image fusion in high-dynamic-range scene
CN114245007A (en) * 2021-12-06 2022-03-25 西北工业大学 High frame rate video synthesis method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220015358A (en) * 2020-07-30 2022-02-08 삼성전자주식회사 Methods and systems for improving dvs features for computer vision applications


Also Published As

Publication number Publication date
CN115984327A (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US11037278B2 (en) Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures
CN110619593B (en) Double-exposure video imaging system based on dynamic scene
KR101699919B1 (en) High dynamic range image creation apparatus of removaling ghost blur by using multi exposure fusion and method of the same
CN105187723B (en) A kind of image pickup processing method of unmanned vehicle
US20230214976A1 (en) Image fusion method and apparatus and training method and apparatus for image fusion model
WO2021177324A1 (en) Image generating device, image generating method, recording medium generating method, learning model generating device, learning model generating method, learning model, data processing device, data processing method, inferring method, electronic instrument, generating method, program, and non-transitory computer-readable medium
CN111710049B (en) Method and device for determining ambient illumination in AR scene
CN105141841B (en) Picture pick-up device and its method
CN106713755A (en) Method and apparatus for processing panoramic image
CN108416754A (en) A kind of more exposure image fusion methods automatically removing ghost
CN113313661A (en) Image fusion method and device, electronic equipment and computer readable storage medium
CN102340673A (en) White balance method for video camera aiming at traffic scene
CN111382670A (en) Semantic segmentation using driver attention information
CN109284719A (en) A kind of primary data processing method and system based on machine learning
CN111028187B (en) Light-adaptive airborne double-light image reconnaissance device and method
CN113205560A (en) Calibration method, device and equipment of multi-depth camera and storage medium
CN112508814B (en) Image tone restoration type defogging enhancement method based on unmanned aerial vehicle at low altitude visual angle
CN115984327B (en) Self-adaptive vision tracking method, system, equipment and storage medium
WO2023246392A1 (en) Image acquisition method, apparatus and device, and non-transient computer storage medium
CN109495694B (en) RGB-D-based environment sensing method and device
CN107749950A (en) A kind of image pickup method and system based on deep learning
KR102584573B1 (en) Integrated image processing apparatus and method for improving image quality using complex sensor information
KR101909392B1 (en) Surround view monitoring system
CN112634186A (en) Image analysis method of unmanned aerial vehicle
WO2023272506A1 (en) Image processing method and apparatus, movable platform and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant