CN112819858B - Target tracking method, device, equipment and storage medium based on video enhancement - Google Patents

Target tracking method, device, equipment and storage medium based on video enhancement

Info

Publication number
CN112819858B
CN112819858B (application CN202110129674.1A)
Authority
CN
China
Prior art keywords
video
network
enhanced
enhancement
target tracking
Prior art date
Legal status (assumed; not a legal conclusion)
Active
Application number
CN202110129674.1A
Other languages
Chinese (zh)
Other versions
CN112819858A (en
Inventor
向国庆
文映博
严韫瑶
张鹏
贾惠柱
Current Assignee (the listed assignees may be inaccurate)
Beijing Boya Huishi Intelligent Technology Research Institute Co ltd
Original Assignee
Beijing Boya Huishi Intelligent Technology Research Institute Co ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Beijing Boya Huishi Intelligent Technology Research Institute Co ltd
Priority to CN202110129674.1A
Publication of CN112819858A
Application granted
Publication of CN112819858B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/70: Denoising; Smoothing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/73: Deblurring; Sharpening
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/90: Dynamic range modification of images or parts thereof
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a target tracking method, device, equipment and storage medium based on video enhancement, wherein the method comprises the following steps: acquiring video data to be enhanced; enhancing the video data to be enhanced through a pre-trained low-light image enhancement network to obtain enhanced video data; and carrying out target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced. The method builds and trains the low-light image enhancement network and enhances the video data to be enhanced through it, improving the contrast and chromaticity of each video frame in the video data to be enhanced, reducing the noise in each video frame, making details in the video data clearer, and facilitating identification of the target to be tracked. Target tracking is then performed on the enhanced video data on the basis of the video enhancement, so the accuracy of target tracking is greatly improved.

Description

Target tracking method, device, equipment and storage medium based on video enhancement
Technical Field
The application belongs to the technical field of video processing, and particularly relates to a target tracking method, device and equipment based on video enhancement and a storage medium.
Background
Captured images and videos are often limited by the shooting environment, and therefore suffer from defects such as insufficient brightness, low contrast and severe noise. For example, night surveillance video is limited by a severe lack of light: it is extremely dark, its details are blurred and its noise is severe, so target categories in the video are difficult to distinguish clearly, which greatly hinders target detection and tracking. It is therefore necessary to enhance such images and videos.
Currently, the related art provides some video-enhancement-based target tracking methods, such as the multi-scale Retinex (image defogging algorithm) low-light enhancement technique. After enhancement by this technique, a high-brightness image is obtained, but contrast, chromaticity and texture details are damaged to a certain extent, and the noise floor is amplified along with the brightness, so the final video images cannot satisfy the visual requirements of the human eye, objects in the video images are still difficult to distinguish clearly, and the tasks of target detection and tracking remain difficult to accomplish.
Disclosure of Invention
The target tracking method, device, equipment and storage medium based on video enhancement provided by the application enhance the video data to be enhanced through a pre-trained low-light image enhancement network, improving the contrast and chromaticity of each video frame in the video data to be enhanced, reducing the noise in each video frame, making details in the video data clearer, and facilitating identification of the target to be tracked. Target tracking is then performed on the enhanced video data on the basis of the video enhancement, so the accuracy of target tracking is greatly improved.
An embodiment of a first aspect of the present application provides a target tracking method based on video enhancement, including:
acquiring video data to be enhanced;
performing enhancement processing on the video data to be enhanced through a pre-trained low-light image enhancement network to obtain enhanced video data;
and carrying out target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced.
In some embodiments of the present application, before the video data to be enhanced is enhanced through the pre-trained low-light image enhancement network to obtain the enhanced video data, the method further includes:
constructing a network structure of a low-light image enhancement network;
acquiring a training set, wherein the training set comprises night video images;
and training the constructed low-light image enhancement network according to the training set to obtain a trained low-light image enhancement network.
In some embodiments of the present application, the network structure for constructing a low-light image enhancement network includes:
connecting the first convolution layer and the activation layer in series to obtain a feature extraction module;
sequentially connecting the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in series to obtain an image enhancement module;
sequentially connecting a preset number of the feature extraction modules in series;
connecting each feature extraction module with one image enhancement module respectively;
and connecting each image enhancement module with the fully connected layer to obtain the network structure of the low-light image enhancement network.
In some embodiments of the present application, the training the constructed low-light image enhancement network according to the training set to obtain a trained low-light image enhancement network includes:
acquiring the night video image from the training set;
inputting the night video images into a preset number of feature extraction modules which are sequentially connected in series to obtain a preset number of feature images;
respectively inputting the preset number of feature images into the image enhancement module connected with each feature extraction module to obtain an enhancement feature image corresponding to each feature image;
connecting each enhancement feature map through the full connection layer to obtain an enhancement video image corresponding to the night video image;
calculating a spatial consistency loss value, a perception loss value and a color loss value corresponding to the current training period according to the night video image and the corresponding enhanced video image thereof;
and when the spatial consistency loss value, the perception loss value and the color loss value meet preset convergence conditions, obtaining the trained low-light image enhancement network.
In some embodiments of the present application, before the inputting the night video image into the preset number of feature extraction modules connected in series in sequence, the method further includes:
regularization processing is carried out on the night video image, and pixel values of each color channel in the night video image are compressed to a preset interval.
In some embodiments of the present application, the performing, by a preset target tracking network, target tracking processing on the enhanced video data to obtain a target tracking video sequence corresponding to the video data to be enhanced, includes:
respectively carrying out target detection on each video frame in the enhanced video data through a preset target tracking network, and positioning each target to be tracked in each video frame;
tracking the track of each target to be tracked through a preset target tracking algorithm to obtain a target tracking result corresponding to each target to be tracked;
respectively carrying out smooth interpolation processing on target tracking results corresponding to each target to be tracked;
and generating a target track video according to a target tracking result corresponding to each target to be tracked after the smooth interpolation processing.
In some embodiments of the present application, the convolution kernels of the first convolution layer and the seventh convolution layer are each 3×3 in size, for outputting a 256×256×32 feature map;
the convolution kernels of the second convolution layer and the sixth convolution layer are 3×3 in size and are used for outputting a 128×128×8 feature map;
the convolution kernels of the third convolution layer and the fifth convolution layer are 5×5 in size and are used for outputting a 64×64×16 feature map;
the size of the convolution kernel of the fourth convolution layer is 5×5, for outputting a 32×32 feature map.
Embodiments of a second aspect of the present application provide a video enhancement-based object tracking apparatus, comprising:
the video acquisition module is used for acquiring video data to be enhanced;
the enhancement processing module is used for enhancing the video data to be enhanced through a pre-trained low-light image enhancement network to obtain enhanced video data;
and the target tracking module is used for carrying out target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced.
An embodiment of a third aspect of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the computer program to implement the method of the first aspect.
An embodiment of the fourth aspect of the present application provides a computer readable storage medium having stored thereon a computer program for execution by a processor to implement the method of the first aspect.
The technical scheme provided in the embodiment of the application has at least the following technical effects or advantages:
in the embodiment of the application, the low-light image enhancement network is constructed and trained, and the video data to be enhanced is enhanced through the low-light image enhancement network, improving the contrast and chromaticity of each video frame in the video data to be enhanced, reducing the noise in each video frame, making details in the video data clearer, and facilitating identification of the target to be tracked. Target tracking is then performed on the enhanced video data on the basis of the video enhancement, so the accuracy of target tracking is greatly improved.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
fig. 1 is a schematic diagram of a network structure of a low-light image enhancement network according to an embodiment of the present application;
FIG. 2 is a flow chart of a video enhancement-based object tracking method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a low-light image enhancement process using a low-light image enhancement network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a video-enhanced-based object tracking device according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 6 shows a schematic diagram of a storage medium according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs.
A method, apparatus, device and storage medium for video enhancement-based object tracking according to embodiments of the present application are described below with reference to the accompanying drawings.
The embodiment of the application provides a target tracking method based on video enhancement, which trains a low-light image enhancement network, enhances the video data to be enhanced using that network, performs target tracking on each enhanced video frame, and finally obtains a tracking video sequence, effectively improving tracking accuracy. Compared with tracking the target directly on the original low-illumination video, the method improves the tracking accuracy by 105% on average.
The embodiment of the application trains the low-light image enhancement network through the following steps S1-S3, and specifically comprises the following steps:
s1: and constructing a network structure of the low-light image enhancement network.
Specifically, the first convolution layer and the activation layer are connected in series to obtain a feature extraction module; the second, third, fourth, fifth, sixth and seventh convolution layers are sequentially connected in series to obtain an image enhancement module; a preset number of feature extraction modules are sequentially connected in series; each feature extraction module is respectively connected with one image enhancement module; and each image enhancement module is connected with the fully connected layer to obtain the network structure of the low-light image enhancement network.
The convolution kernels of the first and seventh convolution layers may each have a size of 3×3 with a stride of 1, and are used to output a 256×256×32 feature map. The convolution kernels of the second and sixth convolution layers may each have a size of 3×3 with a stride of 1, for outputting a 128×128×8 feature map. The convolution kernels of the third and fifth convolution layers may each have a size of 5×5 with a stride of 1, for outputting a 64×64×16 feature map. The convolution kernel of the fourth convolution layer has a size of 5×5 with a stride of 1, for outputting a 32×32 feature map. The preset number may be 8, 9, 10, and so on.
The embodiment of the application does not limit the convolution kernels and the step values of the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer and the seventh convolution layer, and also does not limit the size of the feature map output by each convolution layer, and can be set according to the requirements in practical application. The embodiment of the application is not limited to the specific value of the preset number, and can be set according to the requirement in practical application.
Fig. 1 shows the network structure of the low-light image enhancement network with a preset number of 8, in which the convolution kernels of the first and seventh convolution layers have a size of 3×3, and the convolution kernels of the second, third, fourth, fifth and sixth convolution layers have a size of 5×5.
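To make this structure concrete, the following is a minimal PyTorch sketch of the network described above; it is an illustration under assumptions, not the patented implementation: the class names are invented, the input is assumed to be a 256×256 RGB image, the spatial resizing between convolution layers is realised by interpolation (the text specifies a stride of 1 yet halving feature-map sizes), and the "full connection layer" is modelled as a 1×1 convolution fusing the concatenated enhancement feature maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtraction(nn.Module):
    """One feature extraction module: the first convolution layer connected
    in series with a ReLU activation layer (outputs a 256x256x32 feature map)."""
    def __init__(self, in_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 32, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x))

class ImageEnhancement(nn.Module):
    """One image enhancement module: the second to seventh convolution layers
    in series. The stated output sizes (128->64->32->64->128->256) are realised
    here by interpolation, since the text specifies a stride of 1 throughout."""
    def __init__(self):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(32, 8, 3, 1, 1),    # 2nd layer -> 128x128x8
            nn.Conv2d(8, 16, 5, 1, 2),    # 3rd layer -> 64x64x16
            nn.Conv2d(16, 32, 5, 1, 2),   # 4th layer -> 32x32 map
            nn.Conv2d(32, 16, 5, 1, 2),   # 5th layer -> 64x64x16
            nn.Conv2d(16, 8, 3, 1, 1),    # 6th layer -> 128x128x8
            nn.Conv2d(8, 32, 3, 1, 1),    # 7th layer -> 256x256x32
        ])
        self.sizes = [128, 64, 32, 64, 128, 256]

    def forward(self, x):
        for conv, s in zip(self.convs, self.sizes):
            x = F.interpolate(conv(x), size=(s, s))
        return x

class LowLightEnhanceNet(nn.Module):
    """Preset number of feature extraction modules in series, each feeding one
    image enhancement module; the 'full connection layer' is modelled as a 1x1
    convolution over the concatenated enhancement feature maps (assumption)."""
    def __init__(self, preset_number=8):
        super().__init__()
        self.extractors = nn.ModuleList(
            FeatureExtraction(3 if i == 0 else 32) for i in range(preset_number))
        self.enhancers = nn.ModuleList(
            ImageEnhancement() for _ in range(preset_number))
        self.fuse = nn.Conv2d(32 * preset_number, 3, kernel_size=1)

    def forward(self, img):                       # img: Nx3x256x256 in [0, 1]
        feats, x = [], img
        for ext in self.extractors:               # serial feature extractors
            x = ext(x)
            feats.append(x)
        enhanced = [enh(f) for enh, f in zip(self.enhancers, feats)]
        return torch.sigmoid(self.fuse(torch.cat(enhanced, dim=1)))
```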
S2: a training set is obtained, the training set including night video images.
The embodiment of the application builds the training set from a public standard video data set. A specific category of the public data set CDW2014 is used as the training data: the data set comprises 11 groups of video sequences, including severe weather, night videos, thermal imaging videos and multi-shadow videos, and each group contains 4 to 6 videos. The night video group, 6 videos in all, is selected as the training videos; used as training data, it covers most low-light enhancement application scenarios.
S3: and training the constructed low-light image enhancement network according to the training set to obtain a trained low-light image enhancement network.
A number of night video images are first acquired from the training set. The training set comprises a plurality of night videos shot in low-light night scenes, and each night video comprises multiple frames of night video images. The night video images are acquired from the training set in groups equal to the batch size of the constructed low-light image enhancement network.
Each acquired night video image is compressed to a preset size, which may be 512×512×3. Each night video image is then regularized, and the pixel values of each color channel are compressed to a preset interval, which may be [0,1]. Each night video image is then input into the preset number of feature extraction modules connected in series. For any night video image, the first convolution layer of the first feature extraction module performs a convolution operation on the image, and the ReLU activation function in the activation layer activates the convolution result to obtain the first feature map corresponding to the image. The first feature map is then input into the second feature extraction module, which performs its convolution and activation operations to obtain the second feature map. The second feature map is input into the third feature extraction module, and so on; after the image has passed sequentially through all of the preset number of feature extraction modules, the preset number of feature maps corresponding to the night video image are obtained.
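The compression and regularization step can be sketched as follows (a minimal example assuming OpenCV and NumPy; the 512×512 size and the [0, 1] interval follow the text above, and the function name is illustrative):

```python
import cv2
import numpy as np

def preprocess(frame_bgr: np.ndarray) -> np.ndarray:
    """Compress a frame to the preset size and regularize each color
    channel's pixel values into the preset interval [0, 1]."""
    resized = cv2.resize(frame_bgr, (512, 512), interpolation=cv2.INTER_AREA)
    return resized.astype(np.float32) / 255.0   # per-channel values now in [0, 1]
```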
The obtained preset number of feature maps are respectively input into the image enhancement module connected with each feature extraction module, yielding an enhancement feature map corresponding to each feature map. In each image enhancement module, the feature map generated by the connected feature extraction module is input into the module, which consists of the second, third, fourth, fifth, sixth and seventh convolution layers connected in series. The second convolution layer performs a convolution operation on the feature map and outputs a 128×128×8 feature map. The 128×128×8 feature map is input into the third convolution layer, which outputs a 64×64×16 feature map. The 64×64×16 feature map is input into the fourth convolution layer, which outputs a 32×32 feature map. The 32×32 feature map is input into the fifth convolution layer, which outputs a 64×64×16 feature map. The 64×64×16 feature map is input into the sixth convolution layer, which outputs a 128×128×8 feature map. The 128×128×8 feature map is input into the seventh convolution layer, which outputs a 256×256×32 feature map.
Each image enhancement module performs image enhancement through the above process, yielding the preset number of enhancement feature maps corresponding to the night video image. Finally, the enhancement feature maps are connected through the fully connected layer to obtain the enhanced video image corresponding to the night video image.
For each of the batch-size night video images input into the low-light enhancement network, the corresponding enhanced video image is obtained in the above manner. Training and learning on one batch-size group of night video images is called one training period of the low-light enhancement network. In the current training period, after the enhanced video image corresponding to each of the batch-size night video images is obtained, the spatial consistency loss value, perceptual loss value and color loss value corresponding to the current training period are calculated from the night video images and their corresponding enhanced video images.
The spatial consistency loss value is calculated by the following formula (1):

$$L_{spa}=\frac{1}{K}\sum_{i=1}^{K}\sum_{j\in\Omega(i)}\big(|Y_i-Y_j|-|I_i-I_j|\big)^2 \tag{1}$$

In formula (1), $L_{spa}$ is the spatial consistency loss value, $K$ is the number of local regions in the enhanced video image, $\Omega(i)$ is the set of four neighboring blocks of the current local region $i$ (the block size is set to 8×8), $Y_i$ is the average intensity of the current local region in the enhanced video image, and $I_i$ is the average intensity of the corresponding local region in the night video image.
The perceptual loss value is calculated by the following formula (2):

$$L_{per}=\frac{1}{W_{i,j}H_{i,j}C_{i,j}}\sum_{x=1}^{W_{i,j}}\sum_{y=1}^{H_{i,j}}\sum_{z=1}^{C_{i,j}}\big(F_{i,j}(I)_{x,y,z}-F_{i,j}(O)_{x,y,z}\big)^2 \tag{2}$$

In formula (2), $L_{per}$ is the perceptual loss value; $i$ and $j$ denote the $j$-th convolution layer after the $i$-th max-pooling layer of the VGG-16 (Visual Geometry Group Network-16) network; $W_{i,j}$, $H_{i,j}$ and $C_{i,j}$ are respectively the width, height and number of channels of that feature map, i.e., its size; and $F_{i,j}(I)$ and $F_{i,j}(O)$ are the feature maps at that layer of the night video image $I$ and of its corresponding enhanced video image $O$.
The color loss value is calculated by the following formula (3):

$$L_{col}=\sum_{(p,q)\in\varepsilon}\big(J^{p}-J^{q}\big)^2,\qquad \varepsilon=\{(R,G),(R,B),(G,B)\} \tag{3}$$

In formula (3), $L_{col}$ is the color loss value, $J^{p}$ denotes the average intensity of color channel $p$ of the enhanced video image, $(p,q)$ denotes a pair of channels drawn from the three RGB color channels, and $\varepsilon$ is the set of all such pairs.
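For illustration, the three losses can be implemented as in the following PyTorch sketch. It assumes formulas (1) to (3) as reconstructed above; the wrap-around neighbour handling, the choice of VGG-16 layer (torchvision features up to relu3_3) and all function names are assumptions, and ImageNet input normalization for VGG is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

def spatial_consistency_loss(enhanced, original, block=8):
    """Formula (1): keep the intensity difference between each 8x8 local
    region and its four neighbours consistent before and after enhancement."""
    Y = F.avg_pool2d(enhanced.mean(1, keepdim=True), block)  # region avg intensity
    I = F.avg_pool2d(original.mean(1, keepdim=True), block)
    loss = 0.0
    for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):        # four neighbours
        Yn = torch.roll(Y, shifts=(dy, dx), dims=(2, 3))     # wrap-around edges
        In = torch.roll(I, shifts=(dy, dx), dims=(2, 3))
        loss = loss + ((Y - Yn).abs() - (I - In).abs()).pow(2).mean()
    return loss

_vgg = vgg16(weights="IMAGENET1K_V1").features[:16].eval()   # up to relu3_3
for p in _vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(enhanced, original):
    """Formula (2): mean squared distance between VGG-16 feature maps of the
    enhanced image O and the original night image I."""
    return F.mse_loss(_vgg(enhanced), _vgg(original))

def color_loss(enhanced):
    """Formula (3): squared differences between the average intensities of
    each pair of RGB channels of the enhanced image."""
    m = enhanced.mean(dim=(2, 3))                            # (N, 3) channel means
    r, g, b = m[:, 0], m[:, 1], m[:, 2]
    return ((r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2).mean()
```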
In the current training period, after the spatial consistency loss value, the perceptual loss value and the color loss value corresponding to each night video image are calculated through formulas (1) to (3), it is judged whether the calculated values satisfy the preset convergence condition, which requires the spatial consistency loss value, the perceptual loss value and the color loss value to be smaller than a preset spatial consistency loss threshold, perceptual loss threshold and color loss threshold, respectively. If the three loss values of the current training period are each smaller than the corresponding preset threshold, the preset convergence condition is satisfied, training stops, and the low-light image enhancement network of the current training period, together with its parameters, is determined as the trained low-light image enhancement network.
If any one of the spatial consistency loss value, the perceptual loss value and the color loss value of the current training period does not satisfy the preset convergence condition, the parameters of the low-light image enhancement network are adjusted through backpropagation, another batch-size group of night video images is acquired from the training set, and training of the next training period proceeds in the manner above, until the spatial consistency loss value, the perceptual loss value and the color loss value satisfy the preset convergence condition and the trained low-light image enhancement network is obtained.
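Assembled, a training loop matching this description might look like the sketch below. It reuses the hypothetical `LowLightEnhanceNet` and loss functions from the earlier sketches; the thresholds, optimizer and learning rate are illustrative assumptions, and the data loader is assumed to yield batch-size groups of preprocessed night images.

```python
import itertools
import torch

def train(net, loader, thresholds=(0.05, 0.05, 0.05), lr=1e-4, max_periods=10000):
    """One training period = one batch-size group of night video images.
    Stop when all three loss values are below their preset thresholds."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    t_spa, t_per, t_col = thresholds
    batches = itertools.islice(itertools.cycle(loader), max_periods)
    for batch in batches:
        enhanced = net(batch)
        l_spa = spatial_consistency_loss(enhanced, batch)
        l_per = perceptual_loss(enhanced, batch)
        l_col = color_loss(enhanced)
        if l_spa < t_spa and l_per < t_per and l_col < t_col:
            break                                  # preset convergence condition met
        opt.zero_grad()
        (l_spa + l_per + l_col).backward()         # adjust parameters by backpropagation
        opt.step()
    return net
```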
After obtaining the trained low-light image enhancement network in the above manner, as shown in fig. 2, enhancement and target tracking are performed on the video data to be enhanced by the following steps.
Step 101: obtain the video data to be enhanced.
The video data to be enhanced may be video data shot by a camera in a low-light scene, video data acquired from a network that needs to be enhanced, and so on.
Step 102: perform enhancement processing on the video data to be enhanced through the pre-trained low-light image enhancement network to obtain enhanced video data.
Each frame of the obtained video data to be enhanced is compressed to the preset size, each compressed frame is regularized, and the pixel values of each color channel in each frame are compressed to the preset interval. Then, batch-size groups of images are taken from the frames of the video data to be enhanced and input into the trained low-light image enhancement network, which outputs the enhanced image corresponding to each image in the group. As shown in fig. 3, a frame is preprocessed and then input into the low-light image enhancement network to obtain the corresponding enhanced image. The low-light image enhancement network shown in fig. 3 has 8 feature extraction modules connected in series and 8 image enhancement modules, so 8 enhancement feature maps corresponding to one frame are generated; finally, the 8 enhancement feature maps are connected through the fully connected layer to obtain the final enhanced image.
In the above manner, a corresponding enhanced image is obtained through the trained low-light image enhancement network for each frame of the video data to be enhanced, thereby obtaining the enhanced video data corresponding to the video data to be enhanced.
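A per-frame enhancement loop over a video file can be sketched as follows (assuming OpenCV for video I/O, plus the hypothetical `preprocess` helper and trained network from the earlier sketches; frame sizes follow those sketches' assumptions):

```python
import cv2
import numpy as np
import torch

def enhance_video(net, path):
    """Read the video to be enhanced frame by frame, enhance each frame with
    the trained low-light image enhancement network, and collect the results."""
    net.eval()
    cap = cv2.VideoCapture(path)
    enhanced_frames = []
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break                                        # end of video
            x = torch.from_numpy(preprocess(frame))          # HxWxC in [0, 1]
            x = x.permute(2, 0, 1).unsqueeze(0)              # to 1xCxHxW
            y = net(x)[0].permute(1, 2, 0).numpy()           # back to HxWxC
            enhanced_frames.append((y * 255).clip(0, 255).astype(np.uint8))
    cap.release()
    return enhanced_frames
```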
Step 103: perform target tracking processing on the enhanced video data through the preset target tracking network to obtain the target tracking video sequence corresponding to the video data to be enhanced.
After the enhanced video data corresponding to the video data to be enhanced is obtained, the enhanced video data is input into the preset target tracking network, which performs target detection on each video frame and locates each target to be tracked. Specifically, the preset target tracking network performs Fast R-CNN target detection on each input video frame. First, candidate regions where targets to be tracked may lie are extracted from the input image with the Selective Search algorithm, and the candidate regions are mapped onto the convolutional feature layer of the network according to their spatial positions. Region normalization is then performed: an ROI (Region of Interest) Pooling operation is applied to each candidate region on the convolutional feature layer to obtain the extracted features. Finally, the extracted features are input into a fully connected layer, classified with Softmax (a logistic regression model), and the positions of the candidate regions are regressed to obtain the target detection result, i.e., each target to be tracked is located in each video frame.
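As a runnable stand-in for this detection stage, the sketch below uses torchvision's Faster R-CNN, the successor of Fast R-CNN that replaces Selective Search with a learned region proposal network; it is a substitute named plainly, since torchvision does not ship the Selective-Search-based Fast R-CNN described above, and the score threshold is an assumption.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_targets(frame_tensor, score_thresh=0.5):
    """`frame_tensor`: CxHxW float image in [0, 1]. Returns the boxes and
    labels of the targets to be tracked located in this video frame."""
    with torch.no_grad():
        out = detector([frame_tensor])[0]          # dict: boxes, labels, scores
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["labels"][keep]
```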
The track of each target to be tracked is then followed through the preset target tracking algorithm to obtain the target tracking result corresponding to each target. Specifically, tracking is performed on the targets detected by the Fast R-CNN algorithm using a preset target tracking algorithm such as Deep SORT, which produces the tracking result. Deep SORT is a multi-target tracking algorithm that associates data using a motion model and appearance information; its running speed is mainly determined by the detection algorithm. The algorithm detects the targets of each frame and then matches the previously obtained motion tracks with the current detections through a weighted Hungarian matching algorithm to form the motion track of each object. The weight is a weighted sum of the Mahalanobis distance between a detection and a motion track and the appearance similarity between image blocks.
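The weighted Hungarian matching step can be illustrated with this simplified sketch (the Mahalanobis and appearance distance matrices are assumed to be precomputed per frame; `lam` and the chi-square gating threshold are illustrative values):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(maha_dist, app_dist, lam=0.5, gate=9.4877):
    """Match existing motion tracks (rows) to current detections (columns).
    The cost is a weighted sum of the Mahalanobis distance between a track
    and a detection and the appearance distance between image blocks."""
    cost = lam * maha_dist + (1.0 - lam) * app_dist
    cost = np.where(maha_dist > gate, 1e5, cost)   # gate out implausible pairs
    rows, cols = linear_sum_assignment(cost)       # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e5]
```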
The obtained target tracking results are the positions of each target to be tracked in the image at different times. Smooth interpolation is performed on the target tracking result corresponding to each target to be tracked, and the target track video is generated from the target tracking results after smooth interpolation. In the finally generated target track video, each target to be tracked is marked by the minimum enclosing rectangle that identifies it, and the moving track of each target is marked by a curve.
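The smooth interpolation of a track can be sketched as follows (a minimal NumPy example that linearly interpolates the box centre over missing frames and then applies a moving average; the window size and function names are assumptions):

```python
import numpy as np

def smooth_track(obs_frames, centers, all_frames, window=5):
    """`obs_frames`: sorted frame indices where the target was detected;
    `centers`: (N, 2) observed box centres; `all_frames`: full frame range.
    Returns one interpolated, smoothed centre per frame."""
    xs = np.interp(all_frames, obs_frames, centers[:, 0])  # fill gaps linearly
    ys = np.interp(all_frames, obs_frames, centers[:, 1])
    kernel = np.ones(window) / window
    xs = np.convolve(xs, kernel, mode="same")              # moving-average smoothing
    ys = np.convolve(ys, kernel, mode="same")
    return np.stack([xs, ys], axis=1)
```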
In the embodiment of the application, quality evaluation may further be performed on the processing results of the low-light image enhancement network and the preset target tracking network. The evaluation parameters used for the low-light image enhancement network may be PSNR (peak signal-to-noise ratio), SSIM (structural similarity) and MAE (mean absolute error). The evaluation parameters used for the preset target tracking network may be MOTA (Multiple Object Tracking Accuracy), MOTP (Multiple Object Tracking Precision) and IDP (Identification Precision). The evaluation results show that the target tracking accuracy is ultimately improved effectively.
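The image-quality metrics can be computed as in this sketch (assuming 8-bit images and scikit-image for SSIM; the tracking metrics MOTA/MOTP/IDP require ground-truth identities and are not sketched here):

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(a, b):
    """Peak signal-to-noise ratio between two uint8 images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

def mae(a, b):
    """Mean absolute error between two uint8 images."""
    return np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64)))

def ssim(a, b):
    """Structural similarity; channel_axis=2 for HxWxC color images."""
    return structural_similarity(a, b, channel_axis=2)
```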
In the embodiment of the application, the low-light image enhancement network is constructed and trained, and the video data to be enhanced is enhanced through the low-light image enhancement network, improving the contrast and chromaticity of each video frame in the video data to be enhanced, reducing the noise in each video frame, making details in the video data clearer, and facilitating identification of the target to be tracked. Target tracking is then performed on the enhanced video data on the basis of the video enhancement, so the accuracy of target tracking is greatly improved.
The embodiment of the application also provides a target tracking device based on video enhancement, which is used for executing the target tracking method based on video enhancement provided by any embodiment. Referring to fig. 4, the apparatus includes:
a video acquisition module 401, configured to acquire video data to be enhanced;
the enhancement processing module 402 is configured to perform enhancement processing on video data to be enhanced through a pre-trained low-light image enhancement network, so as to obtain enhanced video data;
the target tracking module 403 is configured to perform target tracking processing on the enhanced video data through a preset target tracking network, so as to obtain a target tracking video sequence corresponding to the video data to be enhanced.
The apparatus further comprises: the network training module is used for constructing a network structure of the low-light image enhancement network; acquiring a training set, wherein the training set comprises night video images; and training the constructed low-light image enhancement network according to the training set to obtain a trained low-light image enhancement network.
The network training module is used for connecting the first convolution layer and the activation layer in series to obtain a feature extraction module; sequentially connecting the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in series to obtain an image enhancement module; sequentially connecting a preset number of feature extraction modules in series; connecting each feature extraction module with an image enhancement module respectively; and connecting each image enhancement module with the fully connected layer to obtain the network structure of the low-light image enhancement network. Wherein the convolution kernels of the first convolution layer and the seventh convolution layer are 3×3 in size and are used for outputting 256×256×32 feature maps; the convolution kernels of the second convolution layer and the sixth convolution layer are 3×3 in size and are used for outputting a 128×128×8 feature map; the convolution kernels of the third convolution layer and the fifth convolution layer are 5×5 in size and are used for outputting a 64×64×16 feature map; and the convolution kernel of the fourth convolution layer is 5×5 in size, for outputting a 32×32 feature map.
The network training module is used for acquiring night video images from the training set; inputting the night video images into the preset number of feature extraction modules connected in series to obtain the preset number of feature maps; respectively inputting the preset number of feature maps into the image enhancement module connected with each feature extraction module to obtain an enhancement feature map corresponding to each feature map; connecting the enhancement feature maps through the fully connected layer to obtain the enhanced video image corresponding to each night video image; calculating the spatial consistency loss value, perceptual loss value and color loss value corresponding to the current training period according to the night video images and their corresponding enhanced video images; and obtaining the trained low-light image enhancement network when the spatial consistency loss value, the perceptual loss value and the color loss value satisfy the preset convergence condition.
The network training module is used for regularizing the night video images before inputting the night video images into the preset number of feature extraction modules which are sequentially connected in series, and compressing the pixel value of each color channel in the night video images to a preset interval.
The target tracking module 403 is configured to detect a target of each video frame in the enhanced video data through a preset target tracking network, and locate each target to be tracked in each video frame; tracking the track of each target to be tracked through a preset target tracking algorithm to obtain a target tracking result corresponding to each target to be tracked; respectively carrying out smooth interpolation processing on target tracking results corresponding to each target to be tracked; and generating a target track video according to a target tracking result corresponding to each target to be tracked after the smooth interpolation processing.
The video-enhancement-based object tracking device provided by the above embodiment of the present application and the video-enhancement-based object tracking method provided by the embodiment of the present application are the same inventive concept, and have the same advantages as the method adopted, operated or implemented by the application program stored therein.
The embodiment of the application also provides electronic equipment for executing the target tracking method based on video enhancement. Referring to fig. 5, a schematic diagram of an electronic device according to some embodiments of the present application is shown. As shown in fig. 5, the electronic device 5 includes: a processor 500, a memory 501, a bus 502 and a communication interface 503, the processor 500, the communication interface 503 and the memory 501 being connected by the bus 502; the memory 501 stores a computer program executable on the processor 500, and the processor 500 executes the video enhancement-based object tracking method provided in any of the foregoing embodiments of the present application when the computer program is executed.
The memory 501 may include a high-speed random access memory (RAM: Random Access Memory), and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between this system network element and at least one other network element is implemented via at least one communication interface 503 (which may be wired or wireless); the Internet, a wide area network, a local area network, a metropolitan area network, etc. may be used.
Bus 502 may be an ISA bus, a PCI bus, an EISA bus, or the like. The buses may be classified as address buses, data buses, control buses, etc. The memory 501 is configured to store a program, and the processor 500 executes the program after receiving an execution instruction, and the video-enhancement-based object tracking method disclosed in any of the foregoing embodiments of the present application may be applied to the processor 500 or implemented by the processor 500.
The processor 500 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or by software instructions in the processor 500. The processor 500 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and can implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in hardware, in a decoding processor, or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or another storage medium well known in the art. The storage medium is located in the memory 501, and the processor 500 reads the information in the memory 501 and, in combination with its hardware, performs the steps of the above method.
The electronic device provided by the embodiment of the application and the video enhancement-based target tracking method provided by the embodiment of the application are the same in the same inventive concept, and have the same beneficial effects as the method adopted, operated or realized by the electronic device.
The present application further provides a computer readable storage medium corresponding to the video enhancement-based target tracking method provided in the foregoing embodiment, referring to fig. 6, the computer readable storage medium is shown as an optical disc 30, on which a computer program (i.e. a program product) is stored, where the computer program, when executed by a processor, performs the video enhancement-based target tracking method provided in any of the foregoing embodiments.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
The computer readable storage medium provided by the above embodiments of the present application and the video enhancement-based object tracking method provided by the embodiments of the present application have the same advantages as the method adopted, operated or implemented by the application program stored therein, because of the same inventive concept.
It should be noted that:
in the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the present application may be practiced without these specific details. In some instances, well-known structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the present application and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. A video enhancement-based target tracking method, comprising:
acquiring video data to be enhanced;
performing enhancement processing on the video data to be enhanced through a pre-trained low-light image enhancement network to obtain enhanced video data;
performing target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced;
the enhancing processing is performed on the video data to be enhanced through a pre-trained low-light image enhancing network, and before the enhanced video data is obtained, the method further comprises the following steps:
constructing a network structure of a low-light image enhancement network;
acquiring a training set, wherein the training set comprises night video images;
training the constructed low-light image enhancement network according to the training set to obtain a trained low-light image enhancement network;
the network structure for constructing the low-light image enhancement network comprises the following steps:
connecting the first convolution layer and the activation layer in series to obtain a feature extraction module;
sequentially connecting the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in series to obtain an image enhancement module;
sequentially connecting a preset number of the feature extraction modules in series;
connecting each feature extraction module with one image enhancement module respectively;
connecting each image enhancement module with a full connection layer to obtain a network structure of a low-light image enhancement network;
training the constructed low-light image enhancement network according to the training set to obtain a trained low-light image enhancement network, wherein the training set comprises the following steps:
acquiring the night video image from the training set;
inputting the night video images into a preset number of feature extraction modules which are sequentially connected in series to obtain a preset number of feature images;
respectively inputting the preset number of feature images into an image enhancement module connected with each feature extraction module to obtain an enhancement feature image corresponding to each feature image;
connecting each enhancement feature map through the full connection layer to obtain an enhancement video image corresponding to the night video image;
calculating a spatial consistency loss value, a perception loss value and a color loss value corresponding to the current training period according to the night video image and the corresponding enhanced video image thereof;
and when the spatial consistency loss value, the perception loss value and the color loss value meet preset convergence conditions, obtaining the trained low-light image enhancement network.
2. The method of claim 1, further comprising, prior to said inputting said night video images into a predetermined number of said feature extraction modules in series in sequence:
regularization processing is carried out on the night video image, and pixel values of each color channel in the night video image are compressed to a preset interval.
3. The method according to claim 1, wherein the performing, by using a preset target tracking network, target tracking processing on the enhanced video data to obtain a target tracking video sequence corresponding to the video data to be enhanced, includes:
respectively carrying out target detection on each video frame in the enhanced video data through a preset target tracking network, and positioning each target to be tracked in each video frame;
tracking the track of each target to be tracked through a preset target tracking algorithm to obtain a target tracking result corresponding to each target to be tracked;
respectively carrying out smooth interpolation processing on target tracking results corresponding to each target to be tracked;
and generating a target track video according to a target tracking result corresponding to each target to be tracked after the smooth interpolation processing.
4. The method according to any one of claims 1 to 3, wherein,
the convolution kernels of the first convolution layer and the seventh convolution layer are 3×3 in size and are used for outputting 256×256×32 feature maps;
the convolution kernels of the second convolution layer and the sixth convolution layer are 3×3 in size and are used for outputting a 128×128×8 feature map;
the convolution kernels of the third convolution layer and the fifth convolution layer are 5×5 in size and are used for outputting a 64×64×16 feature map;
the size of the convolution kernel of the fourth convolution layer is 5×5, for outputting a 32×32 feature map.
5. A video enhancement-based object tracking apparatus, comprising:
the video acquisition module is used for acquiring video data to be enhanced;
the enhancement processing module is used for enhancing the video data to be enhanced through a pre-trained low-light image enhancement network to obtain enhanced video data;
the target tracking module is used for carrying out target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced;
the network training module is used for constructing a network structure of the low-light image enhancement network; acquiring a training set, wherein the training set comprises night video images; training the constructed low-light image enhancement network according to the training set to obtain a trained low-light image enhancement network;
the network training module is used for connecting the first convolution layer and the activation layer in series to obtain a feature extraction module; sequentially connecting the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in series to obtain an image enhancement module; sequentially connecting a preset number of feature extraction modules in series; each feature extraction module is respectively connected with an image enhancement module; connecting each image adding module with the full connecting layer to obtain a network structure of the low-light image enhancement network;
the training the constructed low-light image enhancement network according to the training set to obtain a trained low-light image enhancement network comprises the following steps:
acquiring the night video image from the training set;
inputting the night video images into a preset number of feature extraction modules which are sequentially connected in series to obtain a preset number of feature images;
respectively inputting the preset number of feature images into an image enhancement module connected with each feature extraction module to obtain an enhancement feature image corresponding to each feature image;
connecting each enhancement feature map through the full connection layer to obtain an enhancement video image corresponding to the night video image;
calculating a spatial consistency loss value, a perception loss value and a color loss value corresponding to the current training period according to the night video image and the corresponding enhanced video image thereof;
and when the spatial consistency loss value, the perception loss value and the color loss value meet preset convergence conditions, obtaining the trained low-light image enhancement network.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor runs the computer program to implement the method of any one of claims 1-4.
7. A computer readable storage medium having stored thereon a computer program, wherein the program is executed by a processor to implement the method of any of claims 1-4.
CN202110129674.1A 2021-01-29 2021-01-29 Target tracking method, device, equipment and storage medium based on video enhancement Active CN112819858B (en)

Priority Applications (1)

Application Number: CN202110129674.1A; Priority Date: 2021-01-29; Filing Date: 2021-01-29; Title: Target tracking method, device, equipment and storage medium based on video enhancement (CN112819858B)

Applications Claiming Priority (1)

Application Number: CN202110129674.1A; Priority Date: 2021-01-29; Filing Date: 2021-01-29; Title: Target tracking method, device, equipment and storage medium based on video enhancement (CN112819858B)

Publications (2)

Publication Number Publication Date
CN112819858A CN112819858A (en) 2021-05-18
CN112819858B true CN112819858B (en) 2024-03-22

Family

ID=75860465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110129674.1A Active CN112819858B (en) 2021-01-29 2021-01-29 Target tracking method, device, equipment and storage medium based on video enhancement

Country Status (1)

Country Link
CN (1) CN112819858B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065533B (en) * 2021-06-01 2021-11-02 北京达佳互联信息技术有限公司 Feature extraction model generation method and device, electronic equipment and storage medium
CN113744164B (en) * 2021-11-05 2022-03-15 深圳市安软慧视科技有限公司 Method, system and related equipment for enhancing low-illumination image at night quickly
CN114827567B (en) * 2022-03-23 2024-05-28 阿里巴巴(中国)有限公司 Video quality analysis method, apparatus and readable medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460968A (en) * 2020-03-27 2020-07-28 上海大学 Video-based unmanned aerial vehicle identification and tracking method and device
CN111814755A (en) * 2020-08-18 2020-10-23 深延科技(北京)有限公司 Multi-frame image pedestrian detection method and device for night motion scene
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11037278B2 (en) * 2019-01-23 2021-06-15 Inception Institute of Artificial Intelligence, Ltd. Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460968A (en) * 2020-03-27 2020-07-28 上海大学 Video-based unmanned aerial vehicle identification and tracking method and device
CN111814755A (en) * 2020-08-18 2020-10-23 深延科技(北京)有限公司 Multi-frame image pedestrian detection method and device for night motion scene
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Meifang Yang, Xin Nie, Ryan Wen Liu. Coarse-to-Fine Luminance Estimation for Low-Light Image Enhancement in Maritime Video Surveillance. 2019 IEEE Intelligent Transportation Systems Conference (ITSC), 2019. (Cited in full.) *
Fang Luping, Weng Peiqiang, Zhou Guomin. Low-light color-code image enhancement based on deep learning. Journal of Zhejiang University of Technology, No. 4. (Cited in full.) *

Also Published As

Publication number Publication date
CN112819858A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN112819858B (en) Target tracking method, device, equipment and storage medium based on video enhancement
CN111160379B (en) Training method and device of image detection model, and target detection method and device
US20220014684A1 (en) Image display method and device
WO2020259118A1 (en) Method and device for image processing, method and device for training object detection model
US20230214976A1 (en) Image fusion method and apparatus and training method and apparatus for image fusion model
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
CN107274445B (en) Image depth estimation method and system
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN109389618B (en) Foreground and background detection method
CN110580428A (en) image processing method, image processing device, computer-readable storage medium and electronic equipment
CN113052210A (en) Fast low-illumination target detection method based on convolutional neural network
CN116681636B (en) Light infrared and visible light image fusion method based on convolutional neural network
CN114782298B (en) Infrared and visible light image fusion method with regional attention
CN110599516A (en) Moving target detection method and device, storage medium and terminal equipment
Chen et al. Visual depth guided image rain streaks removal via sparse coding
CN114581318A (en) Low-illumination image enhancement method and system
Liu et al. A shadow imaging bilinear model and three-branch residual network for shadow removal
CN111815529B (en) Low-quality image classification enhancement method based on model fusion and data enhancement
Schulz et al. Object-class segmentation using deep convolutional neural networks
US20190251695A1 (en) Foreground and background detection method
CN113537397B (en) Target detection and image definition joint learning method based on multi-scale feature fusion
CN114882469A (en) Traffic sign detection method and system based on DL-SSD model
CN111881924A (en) Dim light vehicle illumination identification method combining illumination invariance and short-exposure illumination enhancement
CN116917954A (en) Image detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant