CN114866784A

CN114866784A - Vehicle detection method based on compressed video DCT (discrete cosine transformation) coefficients

Info

Publication number: CN114866784A
Application number: CN202210411306.0A
Authority: CN
Inventors: 何铁军; 李晓港
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2022-04-19
Filing date: 2022-04-19
Publication date: 2022-08-05

Abstract

The invention discloses a vehicle detection method based on compressed video DCT coefficients, comprising the following steps: extracting a compressed code stream video and obtaining a first DCT coefficient corresponding to the compressed code stream video; preprocessing the first DCT coefficient to obtain a second DCT coefficient; constructing a vehicle detection model; acquiring an image sample set based on the open source image data set UA-DETRAC, and then training the vehicle detection model with the image sample set to obtain a vehicle detection network; and acquiring a vehicle detection result based on the compressed code stream video, the second DCT coefficient and the vehicle detection network. The method exploits the fact that feature information can be obtained from compressed-format data without fully decoding it and combines this with deep learning, which reduces the complexity of the vehicle detection model, reduces the computing power required for vehicle detection, and meets the requirements of edge computing.

Description

Vehicle detection method based on compressed video DCT (discrete cosine transformation) coefficients

Technical Field

The invention relates to the field of vehicle detection, in particular to a vehicle detection method based on video compression DCT coefficients.

Background

With the rapid development of the national economy, the number of motor vehicles in China has increased continuously, bringing a series of problems such as traffic congestion, so it is necessary to develop intelligent transportation systems. Road vehicle detection, an important technology in intelligent transportation systems, has developed vigorously. Video-based vehicle detection offers a large amount of information and little interference with traffic facilities.

At present, video must first be transmitted over a network before detection, and the transmitted video is in a compressed format. Pixel-domain video vehicle detection methods must fully decode the video and then detect vehicles with current mainstream deep learning methods. Pixel-domain deep learning detection can currently achieve real-time end-to-end detection with high precision, but the models are complex, the video must be fully decoded, and the consumption of computing resources is high.

Vehicle detection methods that directly use video compressed-domain information do not need to fully decode the video; they only partially decode it to obtain compressed-domain information and detect vehicles based on the feature information contained in the compressed domain, but their precision is not high.

Disclosure of Invention

To solve the above problems, the invention provides a vehicle detection method based on compressed video DCT coefficients.

To achieve this purpose, the invention provides a vehicle detection method based on compressed video DCT coefficients, which comprises the following steps:

S1: extracting a compressed code stream video, and obtaining a first DCT coefficient corresponding to the compressed code stream video;

S2: preprocessing the first DCT coefficient to obtain a second DCT coefficient;

S3: constructing a vehicle detection model;

S4: acquiring an image sample set based on the open source image data set UA-DETRAC, and then training the vehicle detection model with the image sample set to obtain a vehicle detection network;

S5: inputting the second DCT coefficient into the vehicle detection network for detection to obtain the position, type and confidence information of each vehicle, and then drawing the detection frame, type and confidence of the vehicle in the decoded frame of the compressed code stream video based on that information, the detection frame, type and confidence of the vehicle being the vehicle detection result.

Further, the compressed code stream video is an H.264 compressed code stream video.

Further, in step S1, extracting the H.264 compressed code stream video and obtaining the first DCT coefficient corresponding to the compressed code stream video specifically includes the following steps:

converting the size of the H.264 compressed code stream video to 416x416;

the image frames of the H.264 compressed code stream video comprise I frames, P frames and B frames;

obtaining, with a JM decoder, the residual DCT coefficients of the 4x4 blocks of the I frames of the H.264 compressed code stream video and the predicted values under the I-frame intra prediction mode; then applying a 4x4 block DCT to the predicted values; and finally adding the transform result to the residual DCT coefficients of the 4x4 blocks of the I frames to obtain the DCT coefficients of the 4x4 blocks of the I frames;

obtaining respective residual DCT coefficients of a P frame and a B frame of the H.264 compressed code stream video and DCT coefficients of respective reference frames, obtaining positions of respective reference coding blocks and the DCT coefficients of the respective reference coding blocks according to the DCT coefficients of the respective reference frames of the P frame and the B frame and respective motion vectors of the P frame and the B frame, and obtaining the DCT coefficients of respective 4x4 blocks of the P frame and the B frame based on the obtained respective residual DCT coefficients of the P frame and the B frame, the positions of the respective reference coding blocks and the DCT coefficients of the respective reference coding blocks;

the DCT coefficients of the 4x4 blocks of each of the I frame, the P frame and the B frame are collectively called first DCT coefficients;

converting the first DCT coefficients of the 4x4 blocks into first DCT coefficients of 8x8 blocks according to the block spatial relationship of the DCT coefficients, thereby obtaining the first DCT coefficient corresponding to the compressed code stream video.

Further, in step S1, the specific process of obtaining the DCT coefficients of the respective 4x4 blocks of the P frame and the B frame based on the obtained residual DCT coefficients of the P frame and the B frame, the positions of the respective reference coding blocks, and the DCT coefficients of the respective reference coding blocks includes the following steps:

when the reference coding block of a P frame or B frame is located at an integer-pixel position that is a multiple of 4 in the reference frame, the residual DCT coefficients of the P frame or B frame are added directly to the DCT coefficients of the reference coding block to obtain the DCT coefficients of the 4x4 block of the P frame or B frame;

when the reference coding block of a P frame or B frame is located at an integer-pixel position that is not a multiple of 4 in the reference frame, the DCT coefficients of the reference coding block are obtained from the DCT coefficients of the four adjacent blocks located at integer-pixel positions that are multiples of 4 in the reference frame; the residual DCT coefficients of the P frame or B frame are then added to the DCT coefficients of the reference coding block to obtain the DCT coefficients of the 4x4 block of the P frame or B frame.

Further, in step S2, the specific process of preprocessing the first DCT coefficient to obtain a second DCT coefficient includes:

removing the DCT coefficients of the Cb and Cr components from the first DCT coefficients and retaining the 416x416 DCT coefficients of the Y component, converting them into the 52x52x64 format, ordering the converted DCT coefficients in ZigZag order, and finally taking the first 24 DCT coefficients of the ordering result, which are the second DCT coefficients.

Further, in step S3, the specific process of constructing the vehicle detection model includes:

constructing a trunk feature extraction network based on the DarkNet-53 model, which draws on the residual network, extracts features through stacked convolution and residual structures, and reduces the size of the feature map;

constructing a regression detection network based on a feature pyramid, and detecting vehicles on feature maps of three scales, 52x52, 26x26 and 13x13;

determining a loss function, wherein the loss function comprises detection frame coordinate loss, confidence loss and classification loss.

Further, in step S4, the process of acquiring the image sample set includes:

uniformly scaling the pictures in the open source image data set UA-DETRAC to 416x416, then extracting the DCT coefficients of the compressed-format images with the Libjpeg library, and processing the extracted DCT coefficients to output the DCT coefficients of the Y component with size 52x52x24, which form the image sample set.

Further, in step S4, the specific process of training the vehicle detection model using the image sample set includes:

initializing the network weights of the vehicle detection model from a normal distribution;

setting the initial learning rate of the vehicle detection model to 1e-4, and using the Adam algorithm to adapt the learning rate in subsequent training;

setting the anchor frame sizes from the label data of the image sample set using K-means clustering and, following the idea of YOLOv3, setting anchor frames of three sizes on the feature maps of the three scales 52x52, 26x26 and 13x13;

setting the parameter values of the vehicle detection model: the detection categories, the batch size and the number of iterations;

training the vehicle detection model using the set of image samples.
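The anchor frame step above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the label data has been reduced to an array of (width, height) pairs, that nine anchors are wanted (three per scale, as in YOLOv3), and that similarity is measured by IoU of boxes sharing a corner. The names `iou_wh` and `kmeans_anchors` are hypothetical.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (N, 2) boxes and (K, 2) anchors, all aligned at one corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    box_area = boxes[:, 0:1] * boxes[:, 1:2]               # shape (N, 1)
    anchor_area = (anchors[:, 0] * anchors[:, 1])[None, :]  # shape (1, K)
    return inter / (box_area + anchor_area - inter)

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every box to the anchor it overlaps most.
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        # Move each anchor to the mean of its boxes; keep empty clusters as is.
        new = np.array([boxes[assign == j].mean(axis=0)
                        if np.any(assign == j) else anchors[j]
                        for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```

With the anchors sorted by area, one plausible assignment (again following YOLOv3 rather than anything stated here) gives the three smallest to the 52x52 map, the next three to the 26x26 map and the three largest to the 13x13 map.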

Compared with the prior art, the invention has the following beneficial technical effects:

The scheme combines the pixel-domain deep learning vehicle detection method with the method based on video compressed-domain information, so that high-precision end-to-end detection can be realized without fully decoding the picture, and resource consumption is greatly reduced.

Drawings

FIG. 1 is a flowchart illustrating a method for vehicle detection based on compressed video DCT coefficients according to an embodiment;

FIG. 2 is a diagram of a reference coding block whose DCT coefficients are obtained from neighboring blocks in one embodiment;

FIG. 3 is a diagram illustrating the spatial relationship of a DCT block to sub-blocks of an embodiment;

FIG. 4 is a schematic illustration of a Zigzag arrangement of an embodiment;

FIG. 5 is a diagram of a backbone feature extraction network architecture of one embodiment;

FIG. 6 is a diagram illustrating the overall architecture of a vehicle detection model according to an embodiment;

FIG. 7 is a test video frame of an embodiment;

FIG. 8 shows vehicle detection results according to one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The data set used in the specific implementation of the invention is the UA-DETRAC vehicle detection data set, which contains 8,250 vehicles and 1.21 million labeled target detection frames. The vehicle detection model is trained on DCT feature information extracted from the data set pictures, so that vehicle detection is realized on H.264 compressed code streams, that is, on compressed-format video.

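As shown in FIG. 1, a vehicle detection method based on compressed video DCT coefficients includes the following steps:

S1: extracting a compressed code stream video, and obtaining a first DCT coefficient corresponding to the compressed code stream video;

S2: preprocessing the first DCT coefficient to obtain a second DCT coefficient;

S3: constructing a vehicle detection model;

S4: acquiring an image sample set based on the open source image data set UA-DETRAC, and then training the vehicle detection model with the image sample set to obtain a vehicle detection network;

S5: inputting the second DCT coefficient into the vehicle detection network for detection to obtain the position, type and confidence information of each vehicle, and then drawing the detection frame, type and confidence of the vehicle in the decoded frame of the compressed code stream video based on that information.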

In an embodiment, in step S1, extracting the H.264 compressed code stream video and obtaining the first DCT coefficient corresponding to the compressed code stream video specifically includes the following steps:

converting the size of the H.264 compressed code stream video to 416x416;

In an embodiment, in the step S1, the specific process of obtaining the DCT coefficients of the respective 4x4 blocks of the P frame and the B frame based on the obtained residual DCT coefficients of the respective P frame and the B frame, the position of the respective reference coding block, and the DCT coefficients of the respective reference coding block includes the following steps:

Let x₁, x₂, x₃ and x₄ be the DCT coefficients of the four blocks that the reference coding block intersects, as shown in FIG. 2.

Let Iₙ denote the n×n identity matrix. The formula for obtaining the DCT coefficients of the reference coding block is as follows:

When the reference coding block is located at a fractional-pixel position, a 4x4 block DCT is applied to the extracted predicted value, and the residual DCT coefficients are then added to obtain the DCT coefficients.
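For the integer-position case, a minimal sketch of one standard way to derive the DCT coefficients of a displaced 4x4 block from its four aligned neighbors is given below. It is an illustration under stated assumptions, not the formula of the original: the DCT is taken to be the orthonormal DCT-II, the neighbors are ordered top-left, top-right, bottom-left, bottom-right, and the displacement is given as `dy`, `dx` in 0..3. In the pixel domain the displaced block equals a sum of row-shift and column-shift matrices (built from identity matrices) applied to the four neighbors; because the transform is orthonormal, the same sum holds after replacing every matrix by its DCT. The function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix T, so that X = T @ x @ T.T."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    t = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    t[0, :] = np.sqrt(1.0 / n)
    return t

T4 = dct_matrix(4)

def dct2(block):
    """2-D DCT of a 4x4 array."""
    return T4 @ block @ T4.T

def reference_block_dct(x1, x2, x3, x4, dy, dx):
    """DCT of the 4x4 block displaced by (dy, dx) from the top-left neighbor.

    x1..x4 are the DCT coefficients of the aligned neighbors in the assumed
    order top-left, top-right, bottom-left, bottom-right.
    """
    h, w = 4 - dy, 4 - dx                     # overlap with the top-left block
    # Shift matrices: shifted identity matrices in the pixel domain.
    up, down = np.eye(4, k=dy), np.eye(4, k=-h)
    left, right = np.eye(4, k=-dx), np.eye(4, k=w)
    U, D, L, R = dct2(up), dct2(down), dct2(left), dct2(right)
    return U @ x1 @ L + U @ x2 @ R + D @ x3 @ L + D @ x4 @ R

def inter_block_dct(residual_dct, x1, x2, x3, x4, dy, dx):
    """Add the residual coefficients to the reference block coefficients."""
    return residual_dct + reference_block_dct(x1, x2, x3, x4, dy, dx)
```

When `dy` and `dx` are both 0 the shift matrices reduce to the identity and to zero, so the result is simply `x1`, which matches the aligned case where the residual is added directly to the reference block coefficients.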

In one embodiment, in step S1, the DCT coefficient conversion process is:

The DCT coefficients extracted from the H.264 code stream are in 4x4 blocks, whereas the vehicle detection model takes 8x8-block DCT coefficients as input, so the 4x4-block DCT coefficients are converted into 8x8-block DCT coefficients according to their block spatial relationship.

Let Y₁ denote the 4x4-block DCT coefficients, let T₈ denote the 8th-order DCT transform matrix, and let T₄ denote the 4th-order DCT transform matrix, from which an auxiliary matrix is formed; let Y₂ denote the 8x8-block DCT coefficients. The DCT transforms for the different block sizes are illustrated in FIG. 3. The 4x4-block DCT coefficients are transformed into 8x8-block DCT coefficients as follows:
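A sketch of one standard way to perform this conversion is shown below; it is an assumption-laden illustration rather than the original formula. It assumes orthonormal DCT-II matrices and an 8x8 array `y1` that holds four 4x4 coefficient blocks in their 2x2 spatial arrangement. With D = diag(T4, T4), the product Dᵀ · y1 · D recovers the 8x8 pixel block, and applying the 8th-order transform gives Y2 = T8 · Dᵀ · y1 · D · T8ᵀ. The names `dct_matrix`, `merge_4x4_to_8x8` and `D` are hypothetical, and the scaling of a real H.264 integer transform is not modeled.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix T, so that X = T @ x @ T.T."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    t = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    t[0, :] = np.sqrt(1.0 / n)
    return t

def merge_4x4_to_8x8(y1):
    """Convert four 4x4 DCT blocks (2x2 layout in an 8x8 array) to one 8x8 DCT block."""
    t4, t8 = dct_matrix(4), dct_matrix(8)
    d = np.kron(np.eye(2), t4)          # block-diagonal matrix diag(T4, T4)
    pixels = d.T @ y1 @ d               # inverse 4x4 DCT of every quadrant
    return t8 @ pixels @ t8.T           # forward 8x8 DCT of the whole block
```

Both factors can be multiplied once in advance, so the conversion of each block costs two 8x8 matrix products and never needs the pixel values to be stored.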

In an embodiment, in step S2, the specific process of preprocessing the first DCT coefficient to obtain the second DCT coefficient includes:

Removing the DCT coefficients of the Cb and Cr components from the first DCT coefficients and retaining the 416x416 DCT coefficients of the Y component; converting them into the 52x52x64 format; and ordering the converted DCT coefficients in ZigZag order, as shown in FIG. 4. The DC coefficient at the upper left corner and the first 23 AC coefficients in ZigZag order are retained, giving DCT coefficients in the 52x52x24 format. These first 24 DCT coefficients of the ordering result are the second DCT coefficients.
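The preprocessing above can be sketched as follows. The sketch assumes that the Y-component coefficients arrive as a 416x416 array in which every 8x8 tile holds the coefficients of one block, and that the ZigZag order of FIG. 4 is the usual JPEG-style scan; both are assumptions, and the names `zigzag_indices` and `preprocess` are hypothetical.

```python
import numpy as np

def zigzag_indices(n=8):
    """Flat row-major indices of an n x n block in JPEG-style zigzag order."""
    order = sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Walk the anti-diagonals; odd ones go downwards, even ones upwards.
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]),
    )
    return np.array([r * n + c for r, c in order])

def preprocess(y_coeffs, keep=24):
    """Turn an (H, W) array of 8x8 DCT tiles into an (H/8, W/8, keep) tensor."""
    h, w = y_coeffs.shape
    # (H/8, 8, W/8, 8) -> (H/8, W/8, 8, 8): one 8x8 tile per grid cell.
    tiles = y_coeffs.reshape(h // 8, 8, w // 8, 8).transpose(0, 2, 1, 3)
    flat = tiles.reshape(h // 8, w // 8, 64)
    # Keep the DC coefficient and the first keep-1 AC coefficients.
    return flat[..., zigzag_indices(8)[:keep]]
```

For a 416x416 input the result has shape 52x52x24, which is the input size expected by the vehicle detection model.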

In one embodiment, in step S3, the process of constructing the vehicle detection model includes:

Constructing a trunk feature extraction network: a lightweight trunk feature extraction network is constructed based on the DarkNet-53 model, with the structure shown in FIG. 5. Drawing on the residual network idea, it extracts features through stacked convolution and residual structures and reduces the size of the feature map.

Constructing a regression detection network based on a feature pyramid, and detecting vehicles on feature maps of three scales, 52x52, 26x26 and 13x13: using the anchor frame idea, anchor frames of different sizes are set on the feature maps of different sizes, and vehicles are detected by regression from the anchor frames. The overall structure of the vehicle detection model is shown in FIG. 6. In FIG. 6, a DBL consists of a convolution layer, a batch normalization (BN) layer and an activation function, and performs downsampling and feature extraction; a Resn residual module is formed by stacking one DBL and several residual components; a residual component consists of DBLs and a residual edge, which prevents vanishing gradients and improves learning accuracy. The input is the DCT coefficients of size 52x52x24; vehicles are detected on feature maps of different sizes through the feature extraction network, and the final outputs have sizes 13x13x18, 26x26x18 and 52x52x18. The outputs contain the position of each vehicle detection frame, the vehicle type and the confidence.

Determining a loss function, the loss function comprising: detection frame coordinate loss, confidence loss, and classification loss.
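The output layout and the three loss terms can be illustrated with the sketch below. It rests on assumptions that the text does not state: a YOLOv3-style head with three anchor frames per scale, so that the 18 output channels split as 3 x (4 box values + 1 confidence + 1 class score); squared error for the coordinates; and binary cross-entropy for confidence and class. The function names are hypothetical, and predictions are assumed to have already passed through their activation functions.

```python
import numpy as np

NUM_ANCHORS = 3      # assumed: three anchor frames per scale, as in YOLOv3
BOX, CONF = 4, 1     # assumed per-anchor layout: 4 box values, 1 confidence

def split_head(output, num_anchors=NUM_ANCHORS):
    """Split an S x S x C head output into box, confidence and class parts."""
    s_h, s_w, channels = output.shape
    per_anchor, rem = divmod(channels, num_anchors)
    if rem or per_anchor <= BOX + CONF:
        raise ValueError("channel count does not match the assumed layout")
    grid = output.reshape(s_h, s_w, num_anchors, per_anchor)
    return grid[..., :BOX], grid[..., BOX], grid[..., BOX + CONF:]

def bce(p, t, eps=1e-7):
    """Element-wise binary cross-entropy between probabilities p and targets t."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

def detection_loss(pred, target):
    """Sum of coordinate, confidence and classification losses for one scale."""
    p_box, p_conf, p_cls = split_head(pred)
    t_box, t_obj, t_cls = split_head(target)
    # Coordinate and class terms count only where an anchor owns a vehicle.
    coord = np.sum(t_obj[..., None] * (p_box - t_box) ** 2)
    conf = np.sum(bce(p_conf, t_obj))
    cls = np.sum(t_obj[..., None] * bce(p_cls, t_cls))
    return coord + conf + cls
```

The same function applies unchanged to the 13x13x18, 26x26x18 and 52x52x18 outputs; a total loss would add the three scales, possibly with weights that this sketch leaves out.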

In one embodiment, in step S4, the process of acquiring the image sample set includes:

In one embodiment, in step S4, training the vehicle detection model using the image sample set includes:

training the vehicle detection model using the set of image samples.

As shown in FIG. 7, in an embodiment, a test video frame is taken during decoding and its DCT coefficients are extracted. The trained vehicle detection model processes these coefficients, and the resulting predictions are mapped onto the decoded video frame, as shown in FIG. 8. The detection result includes the detection frame, type and confidence of each vehicle. It can be seen from FIG. 8 that the invention achieves a good detection effect by using the information in compressed-format pictures in combination with deep learning.

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any combination should be considered to fall within the scope of this specification as long as the combined features do not contradict one another.

It should be noted that the terms "first/second/third" in the embodiments of the present application are used only to distinguish similar objects and do not represent a specific ordering of the objects. It should be understood that, where permitted, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.

The terms "comprising" and "having" and any variations thereof in the embodiments of the present application are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or device that comprises a list of steps or modules is not limited to the listed steps or modules but may alternatively include other steps or modules not listed or inherent to such process, method, product, or device.

The above embodiments express only several implementations of the present application, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A vehicle detection method based on compressed video DCT coefficients is characterized by comprising the following steps:

S2: preprocessing the first DCT coefficient to obtain a second DCT coefficient;

S3: constructing a vehicle detection model;

2. The method according to claim 1, wherein the compressed code stream video is an H.264 compressed code stream video.

3. The method according to claim 2, wherein in step S1, the H.264 compressed code stream video is extracted to obtain the first DCT coefficient corresponding to the compressed code stream video, and the specific process includes the following steps:

converting the size of the H.264 compressed code stream video to 416x416;

4. The method as claimed in claim 3, wherein in step S1, the specific process of obtaining the DCT coefficients of the 4x4 blocks of the P frame and the B frame based on the obtained residual DCT coefficients of the P frame and the B frame, the positions of the reference coding blocks and the DCT coefficients of the reference coding blocks comprises the following steps:

5. The method as claimed in claim 4, wherein in step S2, preprocessing the first DCT coefficient to obtain the second DCT coefficient includes:

6. The method for detecting vehicles according to claim 5, wherein in step S3, the specific process of constructing the vehicle detection model includes:

7. The method according to claim 6, wherein in step S4, the step of obtaining the image sample set comprises:

8. The method according to claim 7, wherein the step S4 of training the vehicle detection model using the image sample set comprises:

training the vehicle detection model using the set of image samples.