CN112929665A - Target tracking method, device, equipment and medium combining super-resolution and video coding - Google Patents

Info

Publication number
CN112929665A
CN112929665A
Authority
CN
China
Prior art keywords
video
resolution
target tracking
super-resolution video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110121731.1A
Other languages
Chinese (zh)
Inventor
向国庆
文映博
严韫瑶
张鹏
贾惠柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Boya Huishi Intelligent Technology Research Institute Co ltd
Original Assignee
Beijing Boya Huishi Intelligent Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Boya Huishi Intelligent Technology Research Institute Co ltd
Priority to CN202110121731.1A
Publication of CN112929665A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression

Abstract

The disclosure relates to the technical field of hardware video encoder design, and in particular provides a target tracking method, apparatus, device, and medium combining super-resolution and video encoding. The method comprises the following steps: acquiring a low-resolution video to be input, and compressing, encoding, and reconstructing the low-resolution video to obtain a reconstructed intermediate video; inputting the reconstructed intermediate video into a super-resolution network for deep-learning training to obtain a trained high-resolution video; and performing a target tracking operation on the high-resolution video to obtain an enhanced target tracking video. The method uses a carefully designed joint super-resolution video coding module to raise the resolution of the low-resolution video, performs target tracking on each enhanced frame, and finally obtains the tracking video. Compared with performing target tracking directly on the low-resolution video, the method improves tracking accuracy by 80 percent on average.

Description

Target tracking method, device, equipment and medium combining super-resolution and video coding
Technical Field
The present disclosure relates to the field of hardware video encoder design technologies, and more particularly, to a target tracking method, apparatus, device, and medium combining super-resolution and video encoding.
Background
In video compression coding scenarios, bandwidth limitations often make it difficult to transmit high-quality images or videos, and images transmitted in a low-bandwidth environment mostly suffer from blocking artifacts, image blurring, and severe transmission noise. This not only degrades the subjective viewing experience but also severely hinders information extraction. In a typical case such as live streaming, if the network fluctuates strongly, the recorded video is forced into a low-quality compression mode during transmission, and what finally reaches the viewer suffers from blurred details, serious blocking artifacts, and heavy noise, making a good viewing experience difficult to obtain. This problem not only affects subjective experience but also poses a considerable obstacle to high-level image processing tasks such as target detection and tracking. Therefore, super-resolution processing of video is a challenging but significant topic. The current mainstream approach is to improve coding efficiency or increase bandwidth; although such schemes can yield high-quality video images, they do not address the problem fundamentally, and at extremely low network speeds the resulting video still cannot satisfy the human eye, let alone support target detection and tracking tasks.
Disclosure of Invention
The present disclosure solves the technical problem in the prior art that high-definition video remains difficult to transmit even after compression.
To achieve the above technical object, the present disclosure provides a target tracking method combining super-resolution and video coding, including:
acquiring a low-resolution video to be input, and compressing, coding and reconstructing the low-resolution video to obtain a reconstructed intermediate video;
inputting the reconstructed intermediate video into a super-resolution network for deep learning training to obtain a trained high-resolution video;
and carrying out target tracking operation on the high-resolution video to obtain an enhanced target tracking video.
Further, the step of compressing, encoding, and reconstructing the low-resolution video to obtain a reconstructed intermediate video specifically includes:
performing AVS3 compression encoding and reconstruction on the low-resolution video to obtain a reconstructed intermediate video, wherein the size of the intermediate video is 360×180.
Further, the step of inputting the reconstructed intermediate video into a super-resolution network for deep learning training to obtain a trained high-resolution video specifically includes:
extracting image features from the intermediate video frame by frame and performing a convolution operation to obtain a feature map;
performing a collapse operation on the feature map to obtain a collapsed convolutional layer;
mapping the collapsed convolutional layer to obtain mapped image data;
and performing a deconvolution operation on the mapped image data to obtain a trained high-resolution video image.
Further, after the deconvolution operation is performed on the mapped image data to obtain a trained high-resolution video image, the method further includes:
judging whether the loss of the trained high-resolution video image exceeds a preset loss threshold; if so, calculating the image loss, performing back-propagation, and performing image feature extraction again; if not, ending the super-resolution network deep-learning training process.
Further, performing the target tracking operation on the high-resolution video to obtain an enhanced target tracking video specifically includes:
performing Fast R-CNN target detection on image data of the high-resolution video frame by frame;
tracking according to a result calculated by a Fast R-CNN target detection algorithm, and acquiring a tracking result by using a multi-target tracking algorithm;
and performing smooth interpolation on the target tracking result processed by the multi-target tracking algorithm, and generating a target track video.
Further, the performing Fast R-CNN target detection on the image data of the high-resolution video frame by frame specifically includes:
extracting candidate regions: extracting candidate regions from the input image by using a selective search algorithm, and mapping the candidate regions onto the final convolutional feature layer according to their spatial position relationship;
performing region normalization: applying an ROI pooling operation to each candidate region on the convolutional feature layer to obtain features of fixed dimensions;
and inputting the extracted features into a fully connected layer, classifying with Softmax, and regressing the candidate-region positions to obtain the target detection result.
Further, the multi-target tracking algorithm is the Deep SORT multi-target tracking algorithm.
To achieve the above technical object, the present disclosure can also provide a target tracking apparatus combining super-resolution and video coding, including:
the video acquisition module is used for acquiring a low-resolution video to be input and obtaining a reconstructed intermediate video after the low-resolution video is compressed, coded and reconstructed;
the super-resolution learning module is used for inputting the reconstructed intermediate video into a super-resolution network for deep learning training to obtain a trained high-resolution video;
and the target tracking module is used for carrying out target tracking operation on the high-resolution video to obtain an enhanced target tracking video.
To achieve the above technical objects, the present disclosure can also provide a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above target tracking method combining super-resolution and video coding.
To achieve the above technical object, the present disclosure also provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the target tracking method combining super-resolution and video coding as described above when executing the computer program.
The beneficial effects of the present disclosure are as follows:
The disclosed target tracking apparatus based on joint super-resolution and video coding uses a carefully designed joint super-resolution video coding module to enhance the resolution of the low-resolution video, performs target tracking on each enhanced frame, and finally obtains the tracking video; this design effectively improves tracking accuracy. Compared with performing target tracking directly on the low-resolution video, the method improves tracking accuracy by 80 percent on average.
Drawings
Fig. 1 shows a schematic flow diagram of embodiment 1 of the present disclosure;
fig. 2 shows a super-resolution network deep learning flow diagram of embodiment 1 of the present disclosure;
fig. 3 shows a flow diagram of a target tracking process of embodiment 1 of the present disclosure;
fig. 4 shows a schematic structural diagram of embodiment 2 of the present disclosure;
fig. 5 shows a schematic structural diagram of embodiment 4 of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
Various structural schematics according to embodiments of the present disclosure are shown in the figures. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers, and relative sizes and positional relationships therebetween shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, as actually required.
The first embodiment is as follows:
as shown in fig. 1:
the present disclosure provides a target tracking method combining super-resolution and video coding, comprising:
s101: acquiring a low-resolution video to be input, and compressing, coding and reconstructing the low-resolution video to obtain a reconstructed intermediate video;
Specifically, the step of compressing, encoding, and reconstructing the low-resolution video to obtain a reconstructed intermediate video includes:
performing AVS3 compression encoding and reconstruction on the low-resolution video to obtain a reconstructed intermediate video, wherein the size of the intermediate video is 360×180.
S102: inputting the reconstructed intermediate video into a super-resolution network for deep learning training to obtain a trained high-resolution video;
further, the step of inputting the reconstructed intermediate video into a super-resolution network for deep learning training to obtain a trained high-resolution video specifically includes:
extracting image features from the intermediate video frame by frame and performing a convolution operation to obtain a feature map;
performing a collapse operation on the feature map to obtain a collapsed convolutional layer;
mapping the collapsed convolutional layer to obtain mapped image data;
and performing a deconvolution operation on the mapped image data to obtain a trained high-resolution video image.
Further, after the deconvolution operation is performed on the mapped image data to obtain a trained high-resolution video image, the method further includes:
judging whether the loss of the trained high-resolution video image exceeds a preset loss threshold; if so, calculating the image loss, performing back-propagation, and performing image feature extraction again; if not, ending the super-resolution network deep-learning training process.
As shown in fig. 2, the deep-learning training process of the super-resolution network in the first embodiment of the present disclosure proceeds as follows:
the video is input frame by frame, each frame is compressed to the size of 360x180, and by way of example, the video can be compressed and transmitted by using an AVS3 encoding standard, and the AVS3 is a new generation video compression standard, and the compression efficiency is extremely high. It is noted that the standard is not limited to the AVS 3.
Video reconstruction is then performed on the compressed code stream, and the reconstructed video is input into the super-resolution network frame by frame.
After entering the network, each image first undergoes feature extraction: a convolution with a 5×5 kernel and stride 1 is applied, with 56 output channels and ReLU as the activation function.
A collapsing (shrinking) operation is then performed on the resulting feature map, here using a 3×3 receptive field, 12 output channels, and ReLU as the activation function.
A mapping operation is then applied to the collapsed convolutional layer, with a 3×3 kernel, stride 1, and 12 output channels; the mapping is iterated in a loop four times, the final output has 56 channels, and the activation function is ReLU.
Finally, a deconvolution operation is applied to the mapped image to obtain the super-resolution video image, using a 9×9 kernel, one output channel, and ReLU as the activation function.
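The layer hyper-parameters above (5×5/56 feature extraction, 3×3/12 collapse, four 3×3 mapping steps, 9×9 deconvolution) can be sanity-checked with the standard convolution and transposed-convolution size formulas. The sketch below assumes a single-channel (luma) input, "same" padding, and a stride-2 deconvolution with output padding 1 for exact 2× upscaling; the text states none of these, so they are illustrative assumptions:

```python
def conv_out(n, k, s=1, p=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def deconv_out(n, k, s=1, p=0, op=0):
    """Spatial output size of a transposed convolution (deconvolution)."""
    return (n - 1) * s - 2 * p + k + op

def conv_params(c_in, c_out, k):
    """Weight + bias count of a k×k convolution layer."""
    return k * k * c_in * c_out + c_out

# Layer spec from the description: (c_in, c_out, kernel size).
layers = [
    (1, 56, 5),                             # feature extraction: 5×5, 56 channels
    (56, 12, 3),                            # collapse/shrink: 3×3, 12 channels
    (12, 12, 3), (12, 12, 3), (12, 12, 3),  # mapping iterations
    (12, 56, 3),                            # last mapping step: back to 56 channels
]

h, w = 180, 360                             # reconstructed intermediate video: 360×180
total = 0
for c_in, c_out, k in layers:
    p = k // 2                              # "same" padding (assumption)
    h, w = conv_out(h, k, 1, p), conv_out(w, k, 1, p)
    total += conv_params(c_in, c_out, k)

# 9×9 deconvolution to one output channel; stride 2 / output padding 1
# (exact 2× upscaling) is an assumption, since the text omits the stride.
total += conv_params(56, 1, 9)
h2, w2 = deconv_out(h, 9, 2, 4, 1), deconv_out(w, 9, 2, 4, 1)
print(total, (w2, h2))
```

Under these assumptions the network stays tiny (a few tens of thousands of parameters), which is consistent with the design goal of per-frame super-resolution at video rates.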
Whether the loss of the trained high-resolution video image exceeds a preset loss threshold is then judged. Two loss functions are adopted, given by formula (1) and formula (2).
[Formulas (1) and (2) appear only as images in the original document and are not reproduced here.]
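The threshold-based stopping rule above can be sketched as a training loop; `step_fn`, `loss_fn`, and the threshold below are illustrative stand-ins, not values from the disclosure:

```python
def train_until_threshold(step_fn, loss_fn, threshold, max_iters=1000):
    """Repeat the feature-extraction → collapse → mapping → deconvolution
    pipeline (one `step_fn` call) until the loss no longer exceeds
    `threshold`, mirroring the decision box in the Fig. 2 flow."""
    for i in range(max_iters):
        loss = loss_fn()
        if loss <= threshold:      # loss within threshold: training ends
            return i, loss
        step_fn()                  # back-propagate and run the pipeline again
    return max_iters, loss_fn()

# Toy stand-in: the loss halves each step, mimicking convergence.
state = {"loss": 8.0}
iters, final = train_until_threshold(
    step_fn=lambda: state.update(loss=state["loss"] / 2),
    loss_fn=lambda: state["loss"],
    threshold=0.5,
)
print(iters, final)
```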
S103: and carrying out target tracking operation on the high-resolution video to obtain an enhanced target tracking video.
Further, performing the target tracking operation on the high-resolution video to obtain an enhanced target tracking video specifically includes:
performing Fast R-CNN target detection on image data of the high-resolution video frame by frame;
tracking according to a result calculated by a Fast R-CNN target detection algorithm, and acquiring a tracking result by using a multi-target tracking algorithm;
and performing smooth interpolation on the target tracking result processed by the multi-target tracking algorithm, and generating a target track video.
Further, the performing Fast R-CNN target detection on the image data of the high-resolution video frame by frame specifically includes:
extracting candidate regions: extracting candidate regions from the input image by using a selective search algorithm, and mapping the candidate regions onto the final convolutional feature layer according to their spatial position relationship;
performing region normalization: applying an ROI pooling operation to each candidate region on the convolutional feature layer to obtain features of fixed dimensions;
and inputting the extracted features into a fully connected layer, classifying with Softmax, and regressing the candidate-region positions to obtain the target detection result.
Further, the multi-target tracking algorithm is the Deep SORT multi-target tracking algorithm.
Fig. 3 shows a schematic diagram of the target tracking process in the first embodiment of the present disclosure:
Fast R-CNN target detection is performed on each input frame. First, candidate regions are extracted: a Selective Search algorithm extracts candidate regions from the input image, which are then mapped onto the final convolutional feature layer according to their spatial positions. Next, region normalization is performed: an ROI pooling operation is applied to each candidate region on the convolutional feature layer to obtain features of fixed dimensions. Finally, the extracted features are fed into a fully connected layer, classified with Softmax, and the candidate-region positions are regressed to obtain the target detection result.
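The ROI pooling step above — converting a variable-sized candidate region on the feature layer into a fixed-dimension feature — can be sketched in pure Python for a single-channel feature map; the exact bin partitioning is an assumption, since implementations differ:

```python
def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the region `roi = (x0, y0, x1, y1)` of a 2-D feature map
    into a fixed out_h × out_w grid, as in Fast R-CNN's ROI pooling."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Bin boundaries: split the ROI into an out_h × out_w grid.
            ys = (y0 + i * h // out_h, y0 + (i + 1) * h // out_h)
            xs = (x0 + j * w // out_w, x0 + (j + 1) * w // out_w)
            row.append(max(
                feature[y][x]
                for y in range(ys[0], max(ys[1], ys[0] + 1))
                for x in range(xs[0], max(xs[1], xs[0] + 1))
            ))
        pooled.append(row)
    return pooled

fmap = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4×4 toy feature map
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))
```

Whatever the ROI's size, the output is always `out_h × out_w`, which is what lets the following fully connected layer accept regions of any shape.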
Tracking is then performed according to the Fast R-CNN detection results, and the tracking result is obtained using Deep SORT. Deep SORT is a multi-target tracking algorithm whose basic idea is tracking-by-detection: it performs data association using a motion model and appearance information, and its running speed is mainly determined by the detection algorithm. The algorithm performs target detection on each frame and then matches the previous motion trajectories against the current detections via a weighted Hungarian matching algorithm to form each object's motion trajectory. The matching weight is obtained as a weighted sum of the Mahalanobis distance between a detection and a motion trajectory and the appearance similarity of the image patches.
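The weighted matching described above can be sketched as follows: the association cost is a weighted sum of a Mahalanobis (motion) distance and an appearance dissimilarity, and a brute-force minimum-cost assignment stands in for the Hungarian algorithm (adequate only for tiny cost matrices). The weight λ = 0.5 and all distance values below are illustrative:

```python
from itertools import permutations

def combined_cost(maha, appearance, lam=0.5):
    """Deep SORT-style association cost: weighted sum of the Mahalanobis
    (motion) distance and an appearance dissimilarity. λ = 0.5 is illustrative."""
    return lam * maha + (1.0 - lam) * appearance

def assign(cost):
    """Minimum-cost one-to-one assignment of tracks to detections.
    Brute force over permutations stands in for the Hungarian algorithm."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)

maha = [[0.1, 2.0], [2.5, 0.2]]     # track-to-detection motion distances
appear = [[0.2, 1.5], [1.8, 0.1]]   # appearance dissimilarities
cost = [[combined_cost(maha[i][j], appear[i][j]) for j in range(2)]
        for i in range(2)]
print(assign(cost))                 # track 0 → detection 0, track 1 → detection 1
```

A production tracker would use an O(n³) Hungarian solver and additional gating (discarding pairs whose Mahalanobis distance exceeds a chi-square threshold), which this sketch omits.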
Finally, smooth interpolation is performed on the target tracking results processed by Deep SORT, and a target trajectory video is generated.
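Smooth interpolation of the track — filling frames where the tracker lost the target — can be sketched as linear interpolation of target positions between the nearest detected frames; this is an illustrative simplification of whatever smoothing the disclosure actually applies:

```python
def interpolate_track(track):
    """Fill `None` gaps in a per-frame list of (x, y) target positions by
    linear interpolation between the nearest surrounding detections."""
    known = [i for i, p in enumerate(track) if p is not None]
    out = list(track)
    for a, b in zip(known, known[1:]):
        for f in range(a + 1, b):
            t = (f - a) / (b - a)   # fractional position between frames a and b
            out[f] = (track[a][0] + t * (track[b][0] - track[a][0]),
                      track[a][1] + t * (track[b][1] - track[a][1]))
    return out

# Target detected at frames 0 and 3; frames 1-2 are filled by interpolation.
print(interpolate_track([(0.0, 0.0), None, None, (3.0, 6.0)]))
```

The same scheme extends to full bounding boxes by interpolating each coordinate independently.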
Example two:
as shown in fig. 4:
the present disclosure can also provide a target tracking apparatus combining super-resolution and video coding, including:
the video acquiring module 201 is configured to acquire a low-resolution video to be input, and compress, encode and reconstruct the low-resolution video to obtain a reconstructed intermediate video;
a super-resolution learning module 202, configured to input the reconstructed intermediate video into a super-resolution network for deep learning training to obtain a trained high-resolution video;
and the target tracking module 203 is configured to perform a target tracking operation on the high-resolution video to obtain an enhanced target tracking video.
The video acquisition module 201 of the present disclosure is sequentially connected to the super-resolution learning module 202 and the target tracking module 203.
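The sequential connection of the three modules can be sketched as a minimal pipeline object; the lambdas below are stand-ins for the actual acquisition, super-resolution, and tracking implementations:

```python
class JointSRTrackingDevice:
    """Chains the video-acquisition, super-resolution-learning, and
    target-tracking modules in the order described above."""
    def __init__(self, acquire, super_resolve, track):
        self.acquire = acquire              # module 201: compress/encode/reconstruct
        self.super_resolve = super_resolve  # module 202: SR network inference
        self.track = track                  # module 203: detection + tracking

    def run(self, low_res_video):
        intermediate = self.acquire(low_res_video)
        high_res = self.super_resolve(intermediate)
        return self.track(high_res)

# Stand-in modules that just tag the data as it flows through the chain.
device = JointSRTrackingDevice(
    acquire=lambda v: v + ["reconstructed"],
    super_resolve=lambda v: v + ["super-resolved"],
    track=lambda v: v + ["tracked"],
)
print(device.run(["low-res"]))
```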
Example three:
the present disclosure can also provide a computer storage medium having stored thereon a computer program for implementing the steps of the above-described target tracking method of joint super-resolution and video coding when executed by a processor.
The computer storage medium of the present disclosure may be implemented with a semiconductor memory, a magnetic core memory, a magnetic drum memory, or a magnetic disk memory.
Semiconductor memories are the main memory elements of computers; there are two types, MOS and bipolar memory elements. MOS devices have high integration and a simple process but slow speed. Bipolar elements have a complex process, high power consumption, and low integration, but high speed. The introduction of NMOS and CMOS made MOS memory dominant among semiconductor memories. NMOS is fast: for example, a 1K-bit SRAM from Intel has an access time of 45 ns. CMOS has low power consumption: a 4K-bit CMOS static memory has an access time of 300 ns. The semiconductor memories described above are all random access memories (RAM), i.e., new contents can be read and written randomly during operation. A semiconductor read-only memory (ROM), in contrast, can be read randomly but not written during operation and is used to store fixed programs and data. ROM is classified into non-rewritable fuse-type ROM and PROM, and rewritable EPROM.
The magnetic core memory has low cost and high reliability, with more than 20 years of practical experience. Magnetic core memories were widely used as main memories before the mid-1970s. The storage capacity can reach more than 10 bits, with access times as fast as 300 ns. A typical international magnetic core memory has a capacity of 4–8 MB and an access cycle of 1.0–1.5 μs. After semiconductor memory developed rapidly and replaced magnetic core memory as the main memory, magnetic core memory could still be applied as a large-capacity expansion memory.
Drum memory is an external memory based on magnetic recording. Although its information access is fast and its operation stable and reliable, it is being replaced by disk memory; it is still used as external memory for real-time process-control computers and medium and large computers. To meet the needs of small and micro computers, subminiature magnetic drums have emerged, which are small, lightweight, highly reliable, and convenient to use.
Magnetic disk memory is an external memory based on magnetic recording. It combines the advantages of drum and tape storage: its storage capacity is larger than that of a drum, its access speed is faster than that of tape storage, and it can be stored offline, so magnetic disks are widely used as large-capacity external storage in various computer systems. Magnetic disks are generally classified into two main categories: hard disk and floppy disk memories.
Hard disk memories come in a wide variety. By structure they are divided into replaceable and fixed types: the replaceable disk can be exchanged, while the fixed disk cannot. Both replaceable and fixed magnetic disks come in multi-platter combinations and single-platter structures, and both are further divided into fixed-head and movable-head types. A fixed-head magnetic disk has a small capacity, a low recording density, a high access speed, and a high cost. A movable-head magnetic disk has a high recording density (up to 1000 to 6250 bits/inch) and thus a large capacity, but a lower access speed than a fixed-head disk. The storage capacity of a magnetic disk product can reach several hundred megabytes, with a bit density of 6250 bits per inch and a track density of 475 tracks per inch. The disk packs of a multi-pack replaceable disk memory can be exchanged, giving large offline capacity as well as large capacity and high speed; such memories can store large volumes of information and are widely applied in online information retrieval systems and database management systems.
Example four:
the present disclosure also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above-mentioned target tracking method combining super-resolution and video coding when executing the computer program.
Fig. 5 is a schematic diagram of the internal structure of the electronic device in one embodiment. As shown in fig. 5, the electronic device includes a processor, a storage medium, a memory, and a network interface connected through a system bus. The storage medium of the computer device stores an operating system, a database, and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, can cause the processor to implement a target tracking method combining super-resolution and video coding. The processor of the electronic device provides computing and control capabilities to support the operation of the entire computer device. The memory of the computer device may store computer-readable instructions that, when executed by the processor, cause the processor to perform the target tracking method combining super-resolution and video encoding. The network interface of the computer device is used for connecting and communicating with a terminal. Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The electronic device includes, but is not limited to, a smartphone, a computer, a tablet, a wearable smart device, an artificial intelligence device, a mobile power source, and the like.
In some embodiments the processor may be composed of an integrated circuit, for example a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor is the control unit of the electronic device: it connects the various components of the electronic device using various interfaces and lines, and executes the various functions of the electronic device and processes its data by running or executing programs or modules stored in the memory (for example, executing remote data read/write programs) and calling data stored in the memory.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connected communication between the memory and at least one processor or the like.
Fig. 5 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 5 does not constitute a limitation of the electronic device, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the electronic device may further comprise a user interface, which may be a display (Display) or an input unit (such as a keyboard), and optionally a standard wired interface or wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, serves to display information processed in the electronic device and to present a visualized user interface.
Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus, device, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into modules is only one kind of logical functional division, and other division schemes may be adopted in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, the functional modules in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. A target tracking method combining super-resolution and video coding is characterized by comprising the following steps:
acquiring a low-resolution video to be input, and compressing, coding and reconstructing the low-resolution video to obtain a reconstructed intermediate video;
inputting the reconstructed intermediate video into a super-resolution network for deep learning training to obtain a trained high-resolution video;
and carrying out target tracking operation on the high-resolution video to obtain an enhanced target tracking video.
2. The method according to claim 1, wherein obtaining the reconstructed intermediate video after the low-resolution video is compressed, encoded, and reconstructed specifically comprises:
performing AVS3 compression encoding and reconstruction on the low-resolution video to obtain a reconstructed intermediate video, wherein the size of the intermediate video is 360×180.
3. The method according to claim 1, wherein inputting the reconstructed intermediate video into a super-resolution network for deep learning training to obtain a trained high-resolution video specifically comprises:
performing image feature extraction on the intermediate video frame by frame, and performing a convolution operation to obtain a feature map;
performing a collapse (shrinking) operation on the feature map to obtain a collapsed convolutional layer;
mapping the collapsed convolutional layer to obtain mapped image data;
and performing a deconvolution operation on the mapped image data to obtain a trained high-resolution video image.
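Although the claim describes the super-resolution network only in prose, its four steps (feature extraction by convolution, collapse/shrinking, mapping, deconvolution) can be sketched in plain NumPy. This is a toy illustration under assumed kernel sizes and channel counts, not the patent's actual network; every function name and parameter here is hypothetical.

```python
import numpy as np

def extract_features(frame, kernels):
    """Step 1: feature extraction -- valid 2-D convolution with a kernel bank."""
    k = kernels.shape[-1]
    h, w = frame.shape
    out = np.zeros((len(kernels), h - k + 1, w - k + 1))
    for c, ker in enumerate(kernels):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = np.sum(frame[i:i + k, j:j + k] * ker)
    return out

def shrink(features, proj):
    """Step 2: the collapse (shrinking) step -- a 1x1 convolution that reduces
    the channel count, implemented as a channel-mixing tensor contraction."""
    return np.tensordot(proj, features, axes=([1], [0]))

def map_features(features):
    """Step 3: non-linear mapping of the collapsed features (a ReLU here)."""
    return np.maximum(features, 0.0)

def deconv_upsample(features, scale=2):
    """Step 4: deconvolution, modelled as zero-insertion upsampling followed by
    a fixed 3x3 smoothing that stands in for learned deconvolution weights."""
    c, h, w = features.shape
    up = np.zeros((c, h * scale, w * scale))
    up[:, ::scale, ::scale] = features
    out = np.zeros_like(up)
    for ch in range(c):
        for i in range(up.shape[1]):
            for j in range(up.shape[2]):
                out[ch, i, j] = up[ch, max(0, i - 1):i + 2, max(0, j - 1):j + 2].mean()
    return out

rng = np.random.default_rng(0)
frame = rng.random((6, 6))        # one low-resolution frame
kernels = rng.random((4, 3, 3))   # 4 assumed 3x3 feature kernels
proj = rng.random((2, 4))         # collapse 4 channels down to 2
hi = deconv_upsample(map_features(shrink(extract_features(frame, kernels), proj)))
print(hi.shape)                   # spatially upscaled output, here (2, 8, 8)
```

A real network would learn the kernel, projection, and deconvolution weights end to end; the fixed weights above only make the data flow of the four claimed steps concrete.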
4. The method of claim 3, further comprising, after deconvolving the mapped image data to obtain the trained high-resolution video image:
judging whether the loss of the trained high-resolution video image exceeds a preset loss threshold; if so, calculating the image loss, performing backpropagation, and performing the image feature extraction again; if not, ending the super-resolution network deep learning training process.
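The stopping rule in claim 4 (iterate backpropagation until the loss falls within a preset threshold) can be illustrated with a deliberately tiny gradient-descent loop. The linear model, learning rate, and threshold below are all illustrative assumptions standing in for the patent's network and image loss.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((100, 4))                  # stand-in for low-resolution features
true_w = np.array([0.5, -1.0, 2.0, 0.25])
y = x @ true_w                            # stand-in for high-resolution targets

w = np.zeros(4)                           # model weights to be trained
lr, loss_threshold = 0.1, 1e-6
for step in range(10_000):
    pred = x @ w
    loss = np.mean((pred - y) ** 2)       # image loss (mean squared error)
    if loss <= loss_threshold:            # claim 4: stop once loss is within threshold
        break
    grad = 2 * x.T @ (pred - y) / len(y)  # gradient for the backpropagation step
    w -= lr * grad
print(step, loss)
```

The loop computes the loss first and only backpropagates when it still exceeds the threshold, mirroring the judge-then-propagate order of the claim.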
5. The method according to claim 1, wherein performing the target tracking operation on the high-resolution video to obtain the enhanced target tracking video specifically comprises:
performing Fast R-CNN target detection on the image data of the high-resolution video frame by frame;
tracking according to the results computed by the Fast R-CNN target detection algorithm, and obtaining tracking results using a multi-target tracking algorithm;
and performing smooth interpolation on the target tracking results processed by the multi-target tracking algorithm to generate a target trajectory video.
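The final smooth-interpolation step of claim 5 — filling gaps in a track so the trajectory video has a position for every frame — can be sketched with linear interpolation over missed detections. The function name and data are illustrative assumptions, not part of the patent.

```python
import numpy as np

def interpolate_track(frames, centers):
    """Linearly interpolate object centers across frames with missed detections."""
    frames = np.asarray(frames)
    centers = np.asarray(centers, dtype=float)
    full = np.arange(frames.min(), frames.max() + 1)   # every frame index
    xs = np.interp(full, frames, centers[:, 0])
    ys = np.interp(full, frames, centers[:, 1])
    return full, np.stack([xs, ys], axis=1)

# detections only on frames 0, 2 and 4; frames 1 and 3 were missed
full, track = interpolate_track([0, 2, 4], [(0, 0), (4, 2), (8, 4)])
print(track[1], track[3])   # → [2. 1.] [6. 3.]
```

A deployed system might instead smooth with a Kalman filter or spline, but any of these yields the per-frame trajectory needed to render the target trajectory video.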
6. The method according to claim 5, wherein performing Fast R-CNN target detection on the image data of the high-resolution video frame by frame specifically comprises:
extracting candidate regions from the input image using a selective search algorithm, and mapping the candidate regions onto the final convolutional feature layer according to their spatial positions;
performing region normalization by applying an ROI pooling operation to each candidate region on the convolutional feature layer to obtain fixed-dimension features;
and inputting the extracted features into a fully connected layer, classifying with Softmax, and regressing the positions of the candidate regions to obtain the target detection results.
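The region-normalization step — ROI pooling each candidate region to a fixed size regardless of its shape — can be sketched as follows. The feature map, ROI coordinates, and 2x2 output size are illustrative assumptions, not values from the patent.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one candidate region of a feature map to a fixed spatial size."""
    x0, y0, x1, y1 = roi                   # region in feature-map coordinates
    region = feature_map[y0:y1, x0:x1]
    oh, ow = output_size
    # split the region into an oh x ow grid of (nearly) equal bins
    h_edges = np.linspace(0, region.shape[0], oh + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], ow + 1).astype(int)
    out = np.zeros(output_size)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = region[h_edges[i]:h_edges[i + 1],
                               w_edges[j]:w_edges[j + 1]].max()
    return out

fm = np.arange(64, dtype=float).reshape(8, 8)   # toy convolutional feature layer
pooled = roi_pool(fm, (1, 1, 7, 7))             # one 6x6 candidate region
print(pooled.shape)                             # fixed-dimension (2, 2) feature
```

Because every ROI is pooled to the same output size, regions of any shape yield fixed-dimension features that can feed the fully connected classification and regression heads.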
7. The method according to claim 5 or 6, wherein the multi-target tracking algorithm is the Deep SORT multi-target tracking algorithm.
8. A target tracking apparatus combining super-resolution and video coding, comprising:
the video acquisition module is used for acquiring a low-resolution video to be input and obtaining a reconstructed intermediate video after the low-resolution video is compressed, coded and reconstructed;
the super-resolution learning module is used for inputting the reconstructed intermediate video into a super-resolution network for deep learning training to obtain a trained high-resolution video;
and the target tracking module is used for carrying out target tracking operation on the high-resolution video to obtain an enhanced target tracking video.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps corresponding to the method for target tracking with joint super resolution and video coding as claimed in any one of claims 1 to 7 when executing the computer program.
10. A computer storage medium having stored thereon computer program instructions, wherein the program instructions, when executed by a processor, are adapted to carry out the steps corresponding to the method for target tracking for joint super resolution and video coding as claimed in any of claims 1 to 7.
CN202110121731.1A 2021-01-28 2021-01-28 Target tracking method, device, equipment and medium combining super-resolution and video coding Pending CN112929665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110121731.1A CN112929665A (en) 2021-01-28 2021-01-28 Target tracking method, device, equipment and medium combining super-resolution and video coding


Publications (1)

Publication Number Publication Date
CN112929665A true CN112929665A (en) 2021-06-08

Family

ID=76168231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110121731.1A Pending CN112929665A (en) 2021-01-28 2021-01-28 Target tracking method, device, equipment and medium combining super-resolution and video coding

Country Status (1)

Country Link
CN (1) CN112929665A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301383A (en) * 2017-06-07 2017-10-27 华南理工大学 A kind of pavement marking recognition methods based on Fast R CNN
CN107481188A (en) * 2017-06-23 2017-12-15 珠海经济特区远宏科技有限公司 A kind of image super-resolution reconstructing method
CN110188807A (en) * 2019-05-21 2019-08-30 重庆大学 Tunnel pedestrian target detection method based on cascade super-resolution network and improvement Faster R-CNN
CN110443172A (en) * 2019-07-25 2019-11-12 北京科技大学 A kind of object detection method and system based on super-resolution and model compression
US20190391235A1 (en) * 2018-06-20 2019-12-26 Metawave Corporation Super-resolution radar for autonomous vehicles
CN111784624A (en) * 2019-04-02 2020-10-16 北京沃东天骏信息技术有限公司 Target detection method, device, equipment and computer readable storage medium
CN112037252A (en) * 2020-08-04 2020-12-04 深圳技术大学 Eagle eye vision-based target tracking method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210608
