CN108833928B - Traffic monitoring video coding method - Google Patents

Traffic monitoring video coding method

Info

Publication number
CN108833928B
CN108833928B (application CN201810720989.1A)
Authority
CN
China
Prior art keywords
vehicle
coded
background
current
block
Prior art date
Legal status
Active
Application number
CN201810720989.1A
Other languages
Chinese (zh)
Other versions
CN108833928A (en)
Inventor
刘东
马常月
吴枫
彭秀莲
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN201810720989.1A
Publication of CN108833928A
Application granted
Publication of CN108833928B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles

Abstract

The invention discloses a traffic monitoring video coding method that realizes traffic monitoring video coding based on a vehicle database and a background database. At the cost of a certain amount of storage space, the method effectively removes the global redundancy of traffic monitoring video in the time dimension; the overall effect is that the coding performance of traffic monitoring video is effectively improved without obviously increasing the complexity of the encoder or the decoder.

Description

Traffic monitoring video coding method
Technical Field
The invention relates to the technical field of video coding, in particular to a traffic monitoring video coding method.
Background
In recent years, with the rapid development of intelligent transportation, the volume of surveillance video data has grown explosively. To store and transmit these data effectively, surveillance video coding is a problem that must be solved.
Currently, surveillance video is usually compressed with the general-purpose video coding standards H.264/AVC or H.265/HEVC. However, surveillance video has particular characteristics, such as the camera being stationary; applying generic video coding directly to surveillance video cannot fully exploit these inherent characteristics. To further improve surveillance video coding performance, researchers have proposed a series of coding techniques dedicated to surveillance video.
Generally, the content of surveillance video can be roughly divided into background content and foreground content. Accordingly, surveillance video coding can be designed from two aspects: optimizing background coding and optimizing foreground coding. Exploiting the stationary camera, optimized background coding usually generates a high-quality background frame first and then improves the coding efficiency of the whole surveillance video through quality propagation from that frame. For optimized foreground coding, researchers have successively proposed foreground coding techniques based on models and on object segmentation.
Other surveillance video coding techniques have also been proposed, for example:
Adaptive prediction based on background modeling (Xianguo Zhang, Tiejun Huang, Yonghong Tian, and Wen Gao, "Background-modeling-based adaptive prediction for surveillance video coding," IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 769-784, 2014.)
Global vehicle coding based on a vehicle 3D model database (Jing Xiao, Ruimin Hu, Liang Liao, Yu Chen, Zhongyuan Wang, and Zixiang Xiong, "Knowledge-based coding of objects for multisource surveillance video data," IEEE Transactions on Multimedia, vol. 18, no. 9, pp. 1691-1706, 2016.)
The above methods have the following disadvantages:
1. Background coding based on a high-quality background frame causes a surge in the code stream when the high-quality background frame is generated, which adversely affects network transmission, and its coding performance still leaves room for improvement.
2. Foreground coding based on models and object segmentation has difficulty segmenting the foreground finely at the pixel level, and because the segmented foreground may have an irregular shape, the code rate needed to represent the foreground is very large.
3. Adaptive prediction based on background modeling subtracts the reconstructed background frame from the current frame and the reference frame simultaneously, and then, when coding the foreground, directly uses the resulting current-frame foreground pixels for inter-frame prediction from the reference-frame foreground pixels. When the foreground pixels are poorly segmented, the improvement in foreground coding efficiency is easily compromised.
4. Global vehicle coding based on a vehicle 3D model database cannot improve the reconstruction quality of vehicles because no vehicle texture information is stored. In addition, the vehicle 3D models, the intrinsic and extrinsic parameters of the surveillance camera, and the position and attitude of each vehicle on the road required by this technique are difficult to obtain or estimate, which hinders its practical use.
Disclosure of Invention
The invention aims to provide a traffic monitoring video coding method which can improve the coding performance of traffic monitoring videos.
The purpose of the invention is realized by the following technical scheme:
a traffic monitoring video coding method mainly comprises the following steps:
step 1, processing an original traffic monitoring video sequence by adopting a foreground and background segmentation method, separating a vehicle and a background, and respectively removing redundancy existing between the separated vehicle and the background and then putting the vehicle and the background into a database.
Step 2, for the traffic monitoring video to be coded, a foreground and background segmentation method is likewise adopted to separate the vehicle to be coded and the background to be coded; for the vehicle to be coded, a matched vehicle is selected from the database by feature matching and fast motion estimation; for the background to be coded, a matched background is selected from the database based on the sum of absolute differences.
Step 3, when the inter-frame prediction mode or the intra-frame prediction mode is adopted, a preset criterion is used to judge whether rate-distortion optimization processing should be performed on the matched vehicle or the matched background for the vehicle or background to be coded; corresponding processing is carried out according to the judgment result, and coding is performed with the corresponding prediction mode.
According to the technical scheme provided by the invention, traffic monitoring video coding is realized based on a vehicle database and a background database. At the cost of a certain amount of storage space, the global redundancy of the traffic monitoring video in the time dimension can be effectively removed; the overall effect is that the coding performance of the traffic monitoring video is effectively improved without obviously increasing the complexity of the encoder or the decoder.
Drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a traffic monitoring video encoding method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a traffic monitoring video coding framework according to an embodiment of the present invention;
fig. 3 is a flowchart of removing background SIFT features from a vehicle region according to an embodiment of the present invention;
FIG. 4 is a flow chart of vehicle and background similarity analysis provided by an embodiment of the present invention;
fig. 5 is a schematic diagram of reference index bit change information according to an embodiment of the present invention;
fig. 6 is a screenshot of a test sequence provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a traffic monitoring video coding method, which mainly comprises the following steps as shown in figure 1:
step 1, processing an original traffic monitoring video sequence by adopting a foreground and background segmentation method, separating a vehicle and a background, and respectively removing redundancy existing between the separated vehicle and the background and then putting the vehicle and the background into a database.
Step 2, for the traffic monitoring video to be coded, a foreground and background segmentation method is likewise adopted to separate the vehicle to be coded and the background to be coded; for the vehicle to be coded, a matched vehicle is selected from the database by feature matching and fast motion estimation; for the background to be coded, a matched background is selected from the database based on the sum of absolute differences.
Step 3, when the inter-frame prediction mode or the intra-frame prediction mode is adopted, a preset criterion is used to judge whether rate-distortion optimization processing should be performed on the matched vehicle or the matched background for the vehicle or background to be coded; corresponding processing is carried out according to the judgment result, and coding is performed with the corresponding prediction mode.
The overall encoding framework is shown schematically in fig. 2, where the part below the dividing line corresponds to step 1 and the part above it corresponds to steps 2 and 3.
For ease of understanding, the following description will be made in detail with respect to the above three steps.
Firstly, establishing a vehicle and background database.
In the embodiment of the invention, for the original traffic monitoring video sequence, a foreground segmentation method (for example, the SuBSENSE method) is used to separate the vehicles, the background is extracted from the background model generated during foreground separation, and the vehicles and backgrounds belonging to the front section of the video sequence are used to build the database. The main implementation can proceed as follows:
1. and establishing a vehicle database.
The preferred implementation of the vehicle database establishment is as follows:
after the vehicles are separated from the front section of the original traffic monitoring video sequence and redundancy is removed, the vehicles are numbered from 1 to N, where N is the number of separated vehicles.
Initially, the database contains no vehicles. For a certain redundancy-removed vehicle v_i, similar vehicles {v_i1, v_i2, ..., v_im} are retrieved from all other vehicles by an inverted-list-based method, where m is the number of similar vehicles.
To determine the size of m, the number of SIFT features matched between vehicles v_i and v_j is considered; whether two SIFT features match may be determined by conventional techniques, or by the method described below in connection with vehicle matching.
When retrieving similar vehicles, the number of SIFT features matched between vehicle v_i and any one of the remaining vehicles v_j is compared; vehicle v_j is put into {v_i1, v_i2, ..., v_im} when the number of matched SIFT features satisfies the following formulas:

N_ij ≥ β × N_i;

N_ij ≥ min(N_0, N_i);

in the above formulas, N_ij is the number of SIFT features matched between vehicle v_i and vehicle v_j, N_i is the number of SIFT features of vehicle v_i, and β and N_0 are constants; for example, β and N_0 may be set to 0.1 and 4, respectively. After this processing, the similar vehicles {v_i1, v_i2, ..., v_im} of vehicle v_i are obtained.
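As a concrete illustration of this filter, the following is a minimal Python sketch (not from the patent; the descriptor-matching routine, the distance-ratio test, and all parameter values are illustrative assumptions):

```python
import numpy as np

def count_matched_sift(desc_i, desc_j, ratio=0.8):
    # Count descriptor matches from v_i to v_j with a classic
    # nearest/second-nearest distance-ratio test (assumed matcher).
    matches = 0
    for d in desc_i:
        dists = np.linalg.norm(desc_j - d, axis=1)
        if dists.size < 2:
            continue
        d1, d2 = np.partition(dists, 1)[:2]
        if d2 > 0 and d1 / d2 <= ratio:
            matches += 1
    return matches

def similar_vehicles(i, descriptors, beta=0.1, n0=4):
    # Return indices j with N_ij >= beta * N_i and N_ij >= min(N0, N_i).
    n_i = len(descriptors[i])
    similar = []
    for j, desc_j in enumerate(descriptors):
        if j == i:
            continue
        n_ij = count_matched_sift(descriptors[i], desc_j)
        if n_ij >= beta * n_i and n_ij >= min(n0, n_i):
            similar.append(j)
    return similar
```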
Then, pixel-level similarity of the vehicles is compared: for vehicle v_i, if the database contains no vehicles, v_i is put into the database; otherwise, v_i is compared at the pixel level with the vehicles of {v_i1, v_i2, ..., v_im} that have already been put into the database. The similarity comparison uses fast motion estimation, with the sum of absolute differences (SAD) as the loss function.

The fast motion estimation mentioned here can be implemented by conventional techniques; the specific fast motion estimation used in the vehicle matching described later can also be used.

If the computed mean SAD is smaller than a set value (for example, 5), the two vehicles are judged to be similar at the pixel level. Note that in each similarity computation, the target is vehicle v_i versus one vehicle of {v_i1, v_i2, ..., v_im} already in the database: when the SAD is computed, v_i is divided into blocks of a fixed size, and each block of v_i performs fast motion estimation over the whole image of the database vehicle; with the 16x16 blocks mentioned later, one SAD value is obtained per 16x16 block, and the mean SAD considered here is the average of the SADs of all 16x16 blocks of v_i.
If several (for example, 10) consecutive vehicles of {v_i1, v_i2, ..., v_im} that have been put into the database are not similar to vehicle v_i at the pixel level, vehicle v_i is put into the database; otherwise, vehicle v_i is not put into the database.

If it is finally decided to put vehicle v_i into the database, the vehicles of {v_i1, v_i2, ..., v_im} already in the database are compared with vehicle v_i for pixel-level similarity; any of them that is similar to v_i at the pixel level is removed from the database; once a vehicle not similar to v_i at the pixel level is encountered, this checking process stops.

Each vehicle is processed in the above manner; the vehicles finally put into the database are determined, coded, and put into the database.
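A sketch of this insertion and eviction logic follows (one plausible reading of the rule above; `mean_block_sad` is a hypothetical helper returning the mean per-16x16-block SAD between two vehicle images, vehicles are assumed to be opaque hashable objects, and the thresholds are the example values):

```python
def maybe_insert_vehicle(db, v_i, similar_in_db, mean_block_sad,
                         sad_thresh=5.0, run_len=10):
    # db: list of vehicles already stored.
    # similar_in_db: retrieved similar vehicles of v_i that are in db,
    # in retrieval order.
    if not db:
        db.append(v_i)
        return True
    dissimilar_run = 0
    for cand in similar_in_db:
        if mean_block_sad(v_i, cand) < sad_thresh:
            return False          # a pixel-level-similar vehicle is stored
        dissimilar_run += 1
        if dissimilar_run >= run_len:
            break                 # enough consecutive dissimilar vehicles
    db.append(v_i)
    # Evict stored similar vehicles, stopping at the first dissimilar one.
    for cand in similar_in_db:
        if cand in db and mean_block_sad(v_i, cand) < sad_thresh:
            db.remove(cand)
        else:
            break
    return True
```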
2. Background database establishment
For the redundancy-removed background, one frame of background is taken at a fixed interval (for example, every 20 s), encoded, and then put into the database.
In practical applications, after the surveillance camera is installed, the encoder first builds the vehicle and background databases. For vehicles, the encoder encodes at high quality the vehicles to be placed in the database, following the vehicle database establishment steps, and puts the coded vehicles into the database; information identifying these vehicles is also coded into the bitstream, so that after decoding the reconstructed images, the decoding end performs the same vehicle database establishment according to the decoded vehicle identification information. For the background, the encoder encodes a generated background frame at high quality at fixed intervals, following the background database establishment steps, and puts the coded background into the database; the high-quality coded background and the information identifying it are also coded into the bitstream, and the decoder decodes the high-quality background frame from this information and puts it into its database. In this way, identical vehicle and background databases are built at the encoder and decoder.
In the embodiment of the invention, the original traffic monitoring video sequence can be split: the front part of the data is used to build the vehicle and background databases, and the rear part serves as the traffic monitoring video to be coded. Alternatively, the first day's traffic monitoring video can be used to build the databases, and the data from the second day onward serve as the traffic monitoring video to be coded. The codec then performs traffic monitoring video encoding and decoding according to the method of the invention. Traffic surveillance video is typically stored for a period of several months; when the stored data are cleared, the above process is repeated.
Secondly, vehicle and background retrieval.
1. Vehicle retrieval.
1) Separation of the vehicle from the background and redundancy removal operations.
In the embodiment of the invention, the separation of vehicles from the background and the redundancy removal are also required for the traffic monitoring video to be coded; this operation is similar to that used when building the vehicle and background databases. It preferably proceeds as follows:
after the vehicles in the monitoring video sequence (the original traffic monitoring video sequence or the traffic monitoring video to be coded) are separated with the SuBSENSE method, because the vehicle shapes may be irregular, the pixels within the rectangular region spanning from the upper-left corner to the lower-right corner of each separated vehicle are taken as the vehicle and the remaining part as the background. The SIFT features of the vehicle are then extracted and the background SIFT features among them are removed; the flow of removing background SIFT features is shown in figure 3.
When the SuBSENSE method is used to separate the vehicles, a relatively clean background frame is generated step by step. When a vehicle is extracted from the monitoring video sequence, the background at the corresponding position is extracted from this background frame.
Taking the traffic monitoring video to be coded as an example, SIFT features are extracted from the separated current vehicle to be coded and from the corresponding background; each SIFT feature extracted from the current vehicle to be coded is retrieved within a position neighborhood on the corresponding background defined by the following formula:

(xs_c − xs_b)^2 + (ys_c − ys_b)^2 ≤ d^2;

where xs_c and ys_c are the coordinates of a SIFT feature extracted from the current vehicle to be coded, xs_b and ys_b are the coordinates of a SIFT feature extracted from the corresponding background, and d defines the range of the position neighborhood; for example, d = 5 may be set.

If the retrieved background SIFT feature with the minimum normalized Euclidean distance to a SIFT feature of the current vehicle to be coded satisfies D_min ≤ D_1, where D_min is that minimum normalized Euclidean distance and D_1 is a threshold (for example, D_1 = 1.1), a SIFT feature similar to the vehicle's SIFT feature exists in the background region; the corresponding SIFT feature of the current vehicle to be coded is a background SIFT feature and is removed from the vehicle's SIFT features.
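A numpy sketch of this background-feature removal (assumptions: keypoints given as (x, y) arrays, descriptors L2-normalized so that their Euclidean distances play the role of the normalized distance above; d and D_1 use the example values):

```python
import numpy as np

def remove_background_sift(veh_kp, veh_desc, bg_kp, bg_desc, d=5, d1=1.1):
    # veh_kp/bg_kp: (N, 2) arrays of (x, y); veh_desc/bg_desc: (N, 128).
    keep = []
    for k in range(len(veh_kp)):
        xc, yc = veh_kp[k]
        # background features inside the position neighbourhood
        near = (bg_kp[:, 0] - xc) ** 2 + (bg_kp[:, 1] - yc) ** 2 <= d * d
        if near.any():
            dists = np.linalg.norm(bg_desc[near] - veh_desc[k], axis=1)
            if dists.min() <= d1:
                continue          # similar feature in the background: drop
        keep.append(k)
    return veh_kp[keep], veh_desc[keep]
```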
2) Coarse retrieval by feature matching.
In the embodiment of the invention, SIFT features of vehicles (including vehicles in a database and vehicles to be coded) are extracted, vehicles in the database establish inverted list indexes based on the SIFT features, and for the vehicles to be coded, a plurality of candidate vehicles are roughly retrieved from the database based on SIFT feature matching. The preferred implementation of this process is as follows:
A number of candidate vehicles are first coarsely selected from the database by feature matching: the SIFT features of every vehicle in the database are quantized into visual words with the k-means algorithm, and for each visual word a corresponding mapped mean vector is computed. Each SIFT feature of each database vehicle is mapped to its nearest visual word, and the mapped SIFT feature vector is compared with the mean vector of that visual word to obtain a binarized representation of the feature vector. At the same time, each database vehicle is represented by the frequency histogram of the visual words of its SIFT features, and the histograms of all database vehicles are organized as an inverted list.

For the current vehicle to be coded, each of its SIFT features is assigned to the nearest visual word in the same way as for the database vehicles, yielding the frequency histogram of the current vehicle to be coded; the binarized representation of each SIFT feature is computed at the same time.

When comparing the similarity between the current vehicle to be coded and a database vehicle, for SIFT features mapped to the same visual word whose binarized representations differ by a Hamming distance below a threshold, the distance between frequency histograms weighted by tf-idf (term frequency-inverse document frequency) terms is used as the similarity measure; this yields a similarity comparison between the current vehicle and every database vehicle. The vehicles are sorted by the computed similarity, and the top-ranked vehicles are selected as candidate vehicles.
For example, in a particular implementation, 10 candidate vehicles may be retrieved.
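A simplified sketch of this coarse retrieval stage (illustrative only: it keeps the visual-word histograms and tf-idf weighting but omits the inverted list and the per-feature Hamming-distance check for brevity; all function names are assumptions):

```python
import numpy as np

def assign_words(desc, centers):
    # Map each SIFT descriptor to its nearest k-means center (visual word).
    d2 = ((desc[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)

def word_histogram(desc, centers):
    words = assign_words(desc, centers)
    return np.bincount(words, minlength=len(centers)).astype(float)

def coarse_retrieve(query_desc, db_descs, centers, top=10):
    db_hists = np.array([word_histogram(d, centers) for d in db_descs])
    q_hist = word_histogram(query_desc, centers)
    idf = np.log(len(db_hists) / ((db_hists > 0).sum(axis=0) + 1.0))
    q = q_hist * idf
    q /= max(q.sum(), 1e-9)
    db = db_hists * idf
    db /= np.maximum(db.sum(axis=1, keepdims=True), 1e-9)
    l1 = np.abs(db - q).sum(axis=1)     # smaller distance = more similar
    return np.argsort(l1)[:top]         # indices of top candidate vehicles
```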
3) Selecting the matching vehicle using fast motion estimation.
In the embodiment of the invention, a matched vehicle is selected from a plurality of candidate vehicles by using a rapid motion estimation mode; the preferred implementation of this process is as follows:
a. and aligning the current vehicle to be coded with each candidate vehicle.
The preferred embodiment of the alignment is as follows:
for each SIFT feature of the current vehicle to be coded, its distances to all SIFT features of a candidate vehicle are computed and sorted in ascending order; if the following conditions are met, the SIFT feature of the current vehicle is judged to have found a matching SIFT feature in that candidate vehicle:

d_1 ≤ D_2;

d_1 / d_2 ≤ α;

where d_1 and d_2 are the smallest and second-smallest distances, respectively, and D_2 and α are constants;
each SIFT feature of the current vehicle to be coded is processed in this way, yielding the SIFT matching pairs between the current vehicle and each candidate vehicle. From the matched pairs, the position offset between the current vehicle to be coded and each candidate vehicle is computed as:

MV_x = (1/n) Σ_{i=1..n} (xc_i − xv_i);

MV_y = (1/n) Σ_{i=1..n} (yc_i − yv_i);

where MV_x and MV_y are the horizontal and vertical components of the offset, n is the number of matched SIFT feature pairs, xc_i and yc_i are the coordinates of the SIFT feature of the current vehicle to be coded, xv_i and yv_i are the coordinates of the matched SIFT feature of the candidate vehicle, and i indexes the SIFT matching pairs;
outliers are removed iteratively to obtain the final position offset, and the current vehicle to be coded is aligned with the corresponding candidate vehicle according to the computed offset.

An outlier can be determined as follows: if the motion vector computed from a SIFT matching pair deviates from the mean motion vector by more than a set value, that matching pair is an outlier.
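A sketch of the offset estimation with iterative outlier rejection (the deviation threshold and iteration count are illustrative assumptions):

```python
import numpy as np

def estimate_offset(cur_pts, cand_pts, max_dev=8.0, max_iters=5):
    # cur_pts/cand_pts: (n, 2) coordinates of matched SIFT pairs.
    mv = cur_pts - cand_pts               # per-pair motion vectors
    mask = np.ones(len(mv), dtype=bool)
    for _ in range(max_iters):
        mean = mv[mask].mean(axis=0)
        dev = np.linalg.norm(mv - mean, axis=1)
        new_mask = dev <= max_dev         # keep pairs close to the mean
        if new_mask.sum() == 0 or np.array_equal(new_mask, mask):
            break
        mask = new_mask
    return mv[mask].mean(axis=0)          # (MV_x, MV_y)
```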
b. The current vehicle to be coded is divided into fixed-size 16x16 blocks, and each 16x16 block searches the candidate vehicle for the block minimizing a loss function consisting of the sum of absolute differences plus the coding rate of the motion vector. The search takes the position of the current 16x16 block as its starting point and performs eight-point diamond search within 64 pixels up, down, left, and right of the starting point. The losses of all 16x16 blocks are accumulated as the overall loss of the whole current vehicle on that candidate vehicle; finally, the candidate vehicle with the minimum overall loss is kept as the matching vehicle.
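A sketch of the per-block eight-point diamond search (the λ-weighted motion-vector magnitude is a crude stand-in for the MV coding rate; the pattern and parameters are illustrative assumptions):

```python
import numpy as np

DIAMOND = [(0, -2), (-1, -1), (1, -1), (-2, 0),
           (2, 0), (-1, 1), (1, 1), (0, 2)]      # eight-point pattern

def block_loss(block, ref, x, y, x0, y0, lam=4.0):
    h, w = ref.shape
    if x < 0 or y < 0 or x + 16 > w or y + 16 > h:
        return np.inf
    sad = np.abs(block.astype(np.int32)
                 - ref[y:y + 16, x:x + 16].astype(np.int32)).sum()
    return sad + lam * (abs(x - x0) + abs(y - y0))  # SAD + rough MV rate

def diamond_search(block, ref, x0, y0, rng=64):
    bx, by = x0, y0
    best = block_loss(block, ref, bx, by, x0, y0)
    improved = True
    while improved:
        improved = False
        for dx, dy in DIAMOND:
            x, y = bx + dx, by + dy
            if abs(x - x0) > rng or abs(y - y0) > rng:
                continue
            loss = block_loss(block, ref, x, y, x0, y0)
            if loss < best:
                best, bx, by, improved = loss, x, y, True
    return best, (bx - x0, by - y0)       # minimum loss and its MV
```

Summing `diamond_search` results over all 16x16 blocks of the vehicle gives the overall loss on one candidate.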
2. Background retrieval.
In the embodiment of the present invention, for the background to be coded, the matching background is selected from the database based on the sum of absolute differences (SAD). A preferred implementation is as follows:

with the SAD between co-located pixels of the current background to be coded and a database background as the similarity criterion, the SAD between the current background to be coded and every background in the database is computed as:

SAD = Σ_{k∈B} |pc_k − pl_k|;

where pc_k and pl_k are the k-th pixel values of the current background to be coded and of the database background, respectively, and B is the set of pixels of the current background to be coded;

the results are sorted in ascending order, and the background with the minimum SAD is taken as the matching background of the current background to be coded.
Thirdly, encoding.
1. Similarity analysis.
In the embodiment of the invention, after the matched vehicle and matched background of the current vehicle and background to be coded are determined, it is decided whether rate-distortion optimization (RDO) is performed on them. When the current vehicle and background use the inter-frame prediction mode, the matched vehicle and background are compared by RDO against the existing reference-frame information of the current vehicle and background; when the intra-frame prediction mode is used, the matched vehicle and background are compared by RDO against a rough intra-frame prediction of the current vehicle and background. The detailed flow of the vehicle and background similarity analysis is shown in fig. 4. The RDO comparisons in the inter and intra prediction modes are described in detail below.
1) Comparison of RDO in inter prediction mode.
The comparison criterion for rate-distortion optimization in the inter prediction mode is:

J = D + λ × R;

where J is the Lagrangian loss function, D is the sum of absolute differences between the prediction block and the matching block, R is the number of bits used to represent the mode information, and λ is the Lagrange multiplier;
to compare the matched vehicle and background against the existing reference frames, the Lagrangian loss of the current vehicle and background to be coded over the existing reference frames is computed first; an updated Lagrangian loss that also takes the retrieved matched vehicle and background into account is then computed; the Lagrangian losses before and after updating are compared to decide whether RDO is performed on the matched vehicle and background. A preferred implementation is as follows:
a. Compute the Lagrangian losses of the current vehicle to be coded and the current background to be coded against the existing reference frames:

For each existing reference frame of the current vehicle to be coded, the displacement of the current vehicle on that reference frame is estimated first; the optimal RDO result of the current vehicle on the existing reference frames is then obtained, and this result is finally compared with the optimal RDO result of the current vehicle on the candidate matching vehicle to decide whether RDO is performed on the matching vehicle. The related process is as follows:

In units of 4x4 blocks, the motion vectors (MV) of the inter-predicted 4x4 blocks at the position corresponding to the current vehicle, together with the picture order count (POC) information of their reference frames, are obtained; from these, the motion vector of the corresponding 4x4 block of the current vehicle is estimated by temporal scaling:

MVX_cur = MVX_ref × (POC_cur − POC_ref) / (POC_ref − POC_colref);

MVY_cur = MVY_ref × (POC_cur − POC_ref) / (POC_ref − POC_colref);

where MVX_ref and MVY_ref are the horizontal and vertical components of the motion vector of an inter-predicted 4x4 block on the existing reference frame; POC_cur, POC_ref, and POC_colref are the POC of the frame containing the current vehicle to be coded, the POC of the existing reference frame, and the POC of the reference frame of that inter-predicted 4x4 block, respectively; and MVX_cur and MVY_cur are the estimated components for the corresponding 4x4 block of the current vehicle. Every 4x4 block within the current vehicle is traversed, the number of inter-predicted 4x4 blocks and the motion vectors of the corresponding blocks of the current vehicle are recorded, and the horizontal and vertical components of the finally estimated displacement of the current vehicle are the averages over all inter-predicted 4x4 block motion vectors.

After the displacement of the current vehicle on each existing reference frame is obtained, the current vehicle is divided into fixed-size 16x16 blocks, and each 16x16 block searches all existing reference frames in turn for the block with the minimum loss, the loss consisting of the sum of absolute differences plus the coding rate of the motion vector. The search takes as its starting point the position of the current 16x16 block translated by the estimated displacement and performs eight-point diamond search within 64 pixels up, down, left, and right of the starting point. In units of 16x16 blocks, the minimum loss between every block of the current vehicle and its matching blocks in all existing reference frames is recorded; traversing each 16x16 block of the current vehicle and accumulating these minimum losses gives the Lagrangian loss of the current vehicle against the existing reference frames, denoted here J_veh_ref.

For the current background to be coded, it is divided into 16x16 blocks, and for each 16x16 block the matching block with the minimum loss is searched over all existing reference frames: the SAD between the current 16x16 block and the co-located 16x16 block of each existing reference frame is compared, and the minimum SAD is taken as the loss of that block. Traversing all 16x16 blocks of the current background and accumulating their losses gives the Lagrangian loss of the current background against the existing reference frames, denoted here J_bg_ref.
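A sketch of the POC-based motion-vector scaling and displacement estimate (a reading of the reconstruction above; the guard against a zero POC distance is an added assumption):

```python
def scale_mv(mvx_ref, mvy_ref, poc_cur, poc_ref, poc_colref):
    # Temporal scaling of a 4x4 block MV on the existing reference frame
    # to the current frame by the ratio of POC distances.
    if poc_ref == poc_colref:
        return 0.0, 0.0
    s = (poc_cur - poc_ref) / (poc_ref - poc_colref)
    return mvx_ref * s, mvy_ref * s

def estimate_displacement(inter_blocks, poc_cur, poc_ref):
    # inter_blocks: (mvx_ref, mvy_ref, poc_colref) for every inter-coded
    # 4x4 block covering the vehicle's position on the reference frame.
    mvs = [scale_mv(mx, my, poc_cur, poc_ref, pc)
           for mx, my, pc in inter_blocks]
    if not mvs:
        return 0.0, 0.0
    n = float(len(mvs))
    return sum(m[0] for m in mvs) / n, sum(m[1] for m in mvs) / n
```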
b. Taking the matched vehicle and background into account, compute the updated Lagrangian losses:

For each 16x16 block of the current vehicle to be coded, on the basis of the per-block results computed for J_veh_ref, the loss of the block against the matched vehicle is computed by fast motion estimation; this loss is compared with the block's minimum loss against the existing reference frames, and the smaller of the two is taken as the minimum loss of that 16x16 block. Every 16x16 block of the current vehicle is traversed and the per-block minimum losses are accumulated. Meanwhile, for the current vehicle to be coded, the bit-count changes comprise the position index of the matched vehicle in the database, the position at which the matched vehicle is attached in the reference frame, the reference-index (reference frame index) bit-change information, and the CTU-level indication information; combining these bit-count changes with the accumulated loss gives the updated Lagrangian loss of the current vehicle, denoted here J_veh_upd.

For each 16x16 block of the current background to be coded, on the basis of the per-block results computed for J_bg_ref, the loss of the block against the matched background is computed; this loss is compared with the block's minimum loss against the existing reference frames, and the smaller of the two is taken as the minimum loss of that 16x16 block. Every 16x16 block of the current background is traversed and the per-block minimum losses are accumulated. Meanwhile, for the current background to be coded, the bit-count changes comprise the position index of the matched background in the database and the reference-index bit-change information; combining these bit-count changes with the accumulated loss gives the updated Lagrangian loss of the current background, denoted here J_bg_upd.
The bit-count calculation of the reference-index bit-change information is described as an example:

As shown in fig. 5, for each 16x16 block of the current vehicle or background to be coded, when computing its minimum loss over the existing reference frames and the matched vehicle or background: if the matching-block index corresponding to the minimum loss is n−1, the bit count increases by 1, where n is the number of existing reference frames; otherwise, if the matching block corresponding to the minimum loss lies on the matched vehicle or background, the bit count increases by n−1−idx, where idx is the index of the matching block of that 16x16 block when the matched vehicle or background is not considered; in all other cases the bit count is unchanged. Traversing every 16x16 block of the current vehicle or background, the final reference-index bit-change information is the sum of the per-block bit changes. Combining this bit change with the previously computed Lagrangian loss gives the updated Lagrangian loss of the current vehicle or background to be coded.
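A sketch of this bit accounting (one reading of the rule above; existing reference frames are indexed 0..n−1 and the matched picture is treated as index n, an assumed convention):

```python
def ref_index_bit_change(best_idx, old_idx, n):
    # best_idx[b]: best matching-block index per 16x16 block once the
    # matched vehicle/background is available; old_idx[b]: best index
    # without it; n: number of existing reference frames.
    bits = 0
    for new, old in zip(best_idx, old_idx):
        if new == n - 1:
            bits += 1               # last existing frame: one extra bit
        elif new == n:              # block now refers to the matched picture
            bits += n - 1 - old
        # otherwise the bit count is unchanged
    return bits
```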
Finally, the Lagrangian loss J_veh_ref is compared with the updated Lagrangian loss J_veh_upd: if J_veh_upd < J_veh_ref, rate-distortion optimization is performed on the matched vehicle. Likewise, J_bg_ref is compared with the updated J_bg_upd: if J_bg_upd < J_bg_ref, rate-distortion optimization is performed on the matched background.
2) Comparison of RDO in intra prediction mode.
Similar to the inter prediction mode, the comparison criterion for rate-distortion optimization in the intra prediction mode is also:

J = D + λ × R;

where J is the Lagrangian loss function, D is the sum of absolute differences between the prediction block and the matching block, R is the number of bits used to represent the mode information, and λ is the Lagrange multiplier.
a. For the current background to be coded, rate-distortion optimization is always performed on the matched background in the intra prediction mode.
b. For the current vehicle to be coded, the loss when intra prediction is used is first roughly estimated: the current vehicle is divided into fixed-size 16x16 blocks, and for each 16x16 block the mean (DC) mode, the smoothing (planar) mode, the horizontal intra prediction mode, and the vertical intra prediction mode are estimated in turn, yielding the sum of absolute differences of the block under each mode; in this intra prediction mode estimation, the reference pixel values of the current 16x16 block are derived from the original values of the neighboring 16x16 blocks. For each 16x16 block, the SADs estimated under all modes are sorted in ascending order, and the result with the smallest SAD is taken as the best matching result of that block. Traversing all 16x16 blocks of the current vehicle and accumulating the best matching results gives the intra Lagrangian loss of the current vehicle, denoted here J_veh_intra.

Taking the matched vehicle into account, the updated Lagrangian loss is computed: for each 16x16 block of the current vehicle to be coded, on the basis of the per-block results computed for J_veh_intra, the loss (sum of absolute differences) of the block against the matched vehicle is computed by fast motion estimation; this loss is compared with the block's minimum intra-estimated SAD, and the smaller of the two is taken as the minimum loss of that block. Every 16x16 block of the current vehicle is traversed and the per-block minimum losses are accumulated. Meanwhile, for the current vehicle to be coded, the bit-count changes comprise the position index of the matched vehicle in the database, the position at which the matched vehicle is attached in the reference frame, and the CTU-level indication information; combining these bit-count changes with the accumulated loss gives the updated Lagrangian loss, denoted here J_veh_intra_upd.
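A sketch of the rough per-block intra estimation (the planar prediction here is a simplified average of the horizontal and vertical predictions, an assumption rather than the HEVC planar formula; reference pixels come from the original values of the neighboring blocks, as stated above):

```python
import numpy as np

def rough_intra_sad(block, left_col, top_row):
    # block: 16x16 original samples; left_col/top_row: 16 reference
    # pixels taken from the neighbouring blocks' ORIGINAL values.
    b = block.astype(np.int32)
    left = left_col.astype(np.int32)
    top = top_row.astype(np.int32)
    horiz = np.tile(left.reshape(16, 1), (1, 16))    # horizontal mode
    vert = np.tile(top.reshape(1, 16), (16, 1))      # vertical mode
    dc = np.full((16, 16), int(np.concatenate([left, top]).mean()))
    planar = (horiz + vert + 1) // 2                 # simplified planar
    return min(np.abs(b - p).sum() for p in (dc, horiz, vert, planar))
```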
The intra Lagrangian loss J_veh_intra is compared with the updated Lagrangian loss J_veh_intra_upd: if J_veh_intra_upd < J_veh_intra, rate-distortion optimization is performed on the matched vehicle.
2. Coding of the vehicle and the background.
1) When the inter-frame prediction mode is adopted, if rate-distortion optimization processing needs to be performed on the matched vehicle or the matched background, a new reference frame buffer is allocated, and the matched vehicle or background is attached to this newly allocated reference frame, which is used together with the existing reference frames for inter prediction of the current vehicle or background to be coded; after inter prediction, every 4x4 block covered by the current vehicle or background to be coded is traversed, and if a certain 4x4 block references information of the matched vehicle or matched background, the corresponding syntax elements are coded into the bitstream;
2) When the intra-frame prediction mode is adopted, if rate-distortion optimization processing needs to be performed on the matched vehicle or the matched background, a new reference frame buffer is likewise allocated, and the matched vehicle or background is attached to it for intra prediction of the current vehicle or background to be coded.
In both cases above, the position at which the matched vehicle is attached to the newly allocated reference frame is determined by the following formulas:
x_0 = x_c + MV_x;

y_0 = y_c + MV_y;
where x_0 and y_0 denote the position at which the matched vehicle is attached to the newly allocated reference frame, x_c and y_c denote the position of the current vehicle to be coded in the current frame, and MV_x and MV_y are the horizontal and vertical components of the offset of the current vehicle relative to the matched vehicle (obtained by the fast motion estimation described earlier);

when the matched background is attached to the reference frame, it is aligned with the position of the reference frame.
3. Coding code stream structure
In the embodiment of the invention, the coded bitstream is structured in two layers: the slice layer and the coding tree unit (CTU) layer; wherein:
slice layer: for the current vehicle to be coded, the slice layer comprises a flag (flag) which indicates whether a matching vehicle is referred to in the current slice layer; traversing 4x4 blocks covered by all vehicles in the current slice layer, and judging whether the blocks refer to matched vehicles, if a certain 4x4 block refers to a matched vehicle, marking the block as true, otherwise, marking the block as false; if the mark is true, the slice layer also comprises a syntax element which represents the number of the referenced matched vehicles in the current slice layer; for each matched vehicle, the position index of the matched vehicle in the database and the position of the matched vehicle attached to the reference frame of the new application are coded into a code stream, and the number of the referenced matched vehicles, the index of each matched vehicle and the position of each matched vehicle attached to the reference frame of the new application are coded in a fixed-length coding mode;
for the current background to be coded, the slice layer comprises a mark for indicating whether the matching background is referred in the current slice layer; traversing all 4x4 blocks covered by the background in the current slice layer, and judging whether the blocks refer to a matching background, if a certain 4x4 block refers to the matching background, marking the block as true, otherwise, marking the block as false; if the mark is true, the slice layer also contains a position index syntax element of the referenced matching background in the database, and the syntax element is coded by adopting a fixed-length coding mode;
and (3) CTU layer: for the current vehicle to be coded, the CTU layer comprises a mark for indicating whether the current CTU layer refers to the matched vehicle pixel or not; traversing each 4x4 block in the current CTU layer, if there is some 4x4 block that references a matching vehicle pixel, then marking as true, otherwise marking as false; when the flag is true, the CTU layer further includes a syntax element indicating a matching vehicle index (index);
for the current background to be encoded, the CTU layer contains a flag indicating whether the current CTU layer references matching background pixels.
In addition, tests were performed to illustrate the coding performance of the above scheme of the present invention.
The test conditions include: 1) inter configurations: Random Access (RA), Low-delay B (LDB), and Low-delay P (LDP); 2) the base quantization parameter (QP) set to {27, 32, 37, 42}. The implementation is based on HM16.7, and the test set consists of 14 self-captured test sequences, screenshots of which are shown in fig. 6. The results are shown in Tables 1 and 2.
Table 1 gives the performance comparison under the RA, LDB, and LDP settings, and Table 2 gives the encoder and decoder complexity comparison under the same settings.
[Table 1: Performance comparison results under the RA, LDB, and LDP settings — table provided as an image in the original]
[Table 2: Complexity comparison results at the encoding and decoding ends under the RA, LDB, and LDP settings — table provided as an image in the original]
As can be seen from Tables 1 and 2, relative to HM16.7, the above scheme of the embodiment of the present invention achieves code-rate savings of 35.1%, 31.3%, and 28.8% in the RA, LDB, and LDP modes, respectively, and the complexity increase at the encoding and decoding ends remains within a reasonable range.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A traffic monitoring video coding method is characterized by comprising the following steps:
processing an original traffic monitoring video sequence by adopting a foreground and background segmentation method, separating out vehicles and a background, and respectively removing redundancy existing between the separated vehicles and the background and then putting the vehicles and the background into a database;
for the traffic monitoring video to be coded, a foreground and background segmentation method is also adopted to separate the vehicle to be coded and the background to be coded; selecting matched vehicles from a database by adopting a characteristic matching and rapid motion estimation mode for the vehicles to be coded; selecting a matched background from a database on the basis of the sum of absolute differences for the background to be coded;
when an inter-frame prediction mode or an intra-frame prediction mode is adopted, judging whether the vehicle to be coded or the background to be coded needs to perform rate distortion optimization processing on a matched vehicle or a matched background by using a preset mode; performing corresponding processing according to the judgment result, and encoding by using a corresponding prediction mode;
wherein processing the original traffic monitoring video sequence by the foreground and background segmentation method, separating the vehicles and the background, removing the redundancy existing among the separated vehicles and among the backgrounds respectively, and putting them into the database comprises the following steps:
numbering the vehicles from 1 to N after the redundancy is removed, where N is the number of separated vehicles;
initially, the database contains no vehicles; for a certain redundancy-removed vehicle v_i, similar vehicles {v_i1, v_i2, …, v_im} are retrieved from all other vehicles by an inverted-list-based method, where m is the number of similar vehicles;
when retrieving similar vehicles, the number of SIFT features matched between vehicle v_i and any one of the remaining vehicles v_j is compared; vehicle v_j is put into {v_i1, v_i2, …, v_im} when the number of matched SIFT features satisfies the following formulas:

N_ij ≥ β × N_i;

N_ij ≥ min(N_0, N_i);

in the above formulas, N_ij is the number of SIFT features matched between vehicle v_i and vehicle v_j, N_i is the number of SIFT features of vehicle v_i, and β and N_0 are constants;
then, pixel-level similarity comparison is performed for the vehicle: for vehicle v_i, if the database contains no vehicles, v_i is put into the database; otherwise, v_i is compared at the pixel level with the vehicles of {v_i1, v_i2, …, v_im} already put into the database, using fast motion estimation with the sum of absolute differences as the loss function; if the computed mean of the sums of absolute differences is smaller than a set value, the two vehicles are judged to be similar at the pixel level; if consecutive vehicles of {v_i1, v_i2, …, v_im} already put into the database are not similar to vehicle v_i at the pixel level, vehicle v_i is put into the database; otherwise, vehicle v_i is not put into the database; if it is finally decided to put vehicle v_i into the database, the vehicles of {v_i1, v_i2, …, v_im} already put into the database are compared with vehicle v_i for pixel-level similarity, and any of them similar to vehicle v_i at the pixel level is removed from the database; when a vehicle not similar to vehicle v_i at the pixel level is encountered, the pixel-level similarity comparison stops;
each vehicle is processed in the above manner, the vehicles finally put into the database are determined, and those vehicles are coded and put into the database;
and for the redundancy-removed background, taking one frame of background at intervals, coding it, and putting it into the database.
2. The traffic monitoring video coding method according to claim 1, wherein, when the foreground and background segmentation method is adopted to separate the vehicle from the background, the pixels within the rectangular region from the upper-left corner to the lower-right corner of the separated vehicle are taken as the vehicle, and the remaining part is taken as the background;
for the separated current vehicle to be coded and the corresponding background, SIFT features of the vehicle to be coded and the corresponding background are respectively extracted, and for each SIFT feature extracted from the current vehicle to be coded, the following formula is adopted to search in a certain position neighborhood range on the corresponding background:
(xs_c − xs_b)^2 + (ys_c − ys_b)^2 ≤ d^2;
wherein xs_c and ys_c denote the coordinates of a SIFT feature extracted from the current vehicle to be coded, xs_b and ys_b denote the coordinates of a SIFT feature extracted from the corresponding background, and d defines the range of the position neighborhood;
if the background SIFT feature found with the minimum Euclidean distance has a normalized distance to the given SIFT feature of the current vehicle to be coded smaller than a threshold, a SIFT feature similar to that vehicle SIFT feature exists in the background region; the corresponding SIFT feature of the current vehicle to be coded is then treated as a background SIFT feature and is removed from the vehicle's SIFT features.
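A minimal sketch of this background-feature removal, assuming L2-normalized SIFT descriptors and illustrative values for d and the distance threshold:

import numpy as np

def remove_background_sift(veh_feats, bg_feats, d=16.0, dist_thresh=0.4):
    # veh_feats / bg_feats: lists of (x, y, descriptor) tuples.
    kept = []
    for (xc, yc, desc_c) in veh_feats:
        # Background features inside the position neighborhood of radius d.
        nearby = [desc_b for (xb, yb, desc_b) in bg_feats
                  if (xc - xb) ** 2 + (yc - yb) ** 2 <= d ** 2]
        if nearby:
            min_dist = min(np.linalg.norm(desc_c - desc_b) for desc_b in nearby)
            if min_dist < dist_thresh:
                continue  # background-like feature: drop it from the vehicle
        kept.append((xc, yc, desc_c))
    return kept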
3. The traffic monitoring video coding method according to claim 1, wherein the selecting matched vehicles from the database by using the methods of feature matching and fast motion estimation for the vehicles to be coded comprises:
firstly, several candidate vehicles are roughly selected from the database by feature matching: the SIFT features of each vehicle in the database are quantized into visual words with the k-means algorithm, and a mapping mean vector is computed for each visual word; each SIFT feature of each vehicle in the database is mapped to its nearest visual word, and the mapped SIFT feature vector is compared with the mapping mean vector of that visual word to obtain a binarized representation of the feature vector; at the same time, each vehicle in the database is represented by the frequency histogram of the visual words of its SIFT features, and the histograms of all database vehicles are organized as an inverted list; for the current vehicle to be coded, each SIFT feature is assigned to its nearest visual word in the same way as for the database vehicles, giving the frequency histogram of the current vehicle to be coded, and the binarized representation of each of its SIFT features is computed as well; when comparing the similarity between the current vehicle to be coded and a vehicle in the database, only SIFT features mapped to the same visual word whose binarized representations differ by a Hamming distance below a threshold are counted, and the distance between tf-idf-weighted frequency histograms is used as the similarity measure; the similarity between the current vehicle to be coded and every vehicle in the database is computed in this way, the results are sorted, and the several vehicles ranking highest in similarity are selected as candidate vehicles;
then, a matching vehicle is selected from the candidate vehicles by fast motion estimation: the current vehicle to be coded is aligned with each candidate vehicle and divided into blocks of fixed size 16x16, and each 16x16 block searches a given candidate vehicle for the block with the minimum loss function, the loss function consisting of the sum of absolute differences and the coding rate of the motion vector; the search takes the position of the current 16x16 block as the starting point and performs an eight-point diamond search within a range of 64 pixels up, down, left, and right around it; the loss functions of all 16x16 blocks are accumulated as the overall loss function of the current vehicle to be coded on that candidate vehicle; finally, the candidate vehicle with the minimum overall loss function is retained as the matching vehicle.
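A simplified Python sketch of the per-block search (the eight-neighbour pattern below is a simplified stand-in for the eight-point diamond pattern, and the motion-vector coding rate is approximated by the vector magnitude):

import numpy as np

def sad(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def search_16x16(block, ref, x0, y0, lam=4.0, search_range=64):
    # Searches around (x0, y0) in the candidate-vehicle image `ref` for the
    # 16x16 block minimizing SAD plus an approximate motion-vector rate.
    h, w = ref.shape
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]

    def cost(dy, dx):
        y, x = y0 + dy, x0 + dx
        if not (0 <= y <= h - 16 and 0 <= x <= w - 16):
            return float('inf')
        return sad(block, ref[y:y + 16, x:x + 16]) + lam * (abs(dx) + abs(dy))

    best, best_cost, improved = (0, 0), cost(0, 0), True
    while improved:
        improved = False
        for oy, ox in neighbours:
            dy, dx = best[0] + oy, best[1] + ox
            if max(abs(dy), abs(dx)) > search_range:
                continue
            c = cost(dy, dx)
            if c < best_cost:
                best, best_cost, improved = (dy, dx), c, True
    return best, best_cost

The overall loss of a candidate vehicle would then be the sum of best_cost over all its 16x16 blocks, and the candidate with the smallest sum is kept as the matching vehicle.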
4. The traffic monitoring video coding method according to claim 3, wherein the alignment of the current vehicle to be coded with each candidate vehicle is performed as follows:
for a given SIFT feature of the current vehicle to be coded, its distances to all SIFT features of each candidate vehicle are calculated and sorted in ascending order; if the following formulas are satisfied, the SIFT feature of the current vehicle to be coded is judged to have found a matching SIFT feature in the corresponding candidate vehicle:
d_1 ≤ D_2;
d_1/d_2 ≤ α;
wherein d_1 and d_2 are respectively the minimum and second-minimum distances, and D_2 and α are constants;
each SIFT feature of the current vehicle to be coded is processed in this manner to obtain the SIFT matching pairs between the current vehicle to be coded and each candidate vehicle; from the matching results, the position offset between the current vehicle to be coded and each candidate vehicle is calculated as shown in the following formulas:
MV_x = (1/n) Σ_{i=1..n} (xc_i − xv_i);
MV_y = (1/n) Σ_{i=1..n} (yc_i − yv_i);
wherein MV_x and MV_y are the horizontal and vertical components of the offset, n is the number of matched SIFT feature pairs, xc_i and yc_i are the coordinates of a SIFT feature of the current vehicle to be coded, xv_i and yv_i are the coordinates of the matched SIFT feature of the candidate vehicle, and i is the index of a SIFT feature matching pair;
abnormal points are removed iteratively to obtain the final position offset; the current vehicle to be coded is then aligned with the corresponding candidate vehicle according to the calculated position offset.
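A minimal sketch of the offset estimation, where the k-sigma rejection is one plausible reading of the iterative removal of abnormal points:

import numpy as np

def estimate_offset(matches, iters=3, k=2.0):
    # matches: list of ((xc, yc), (xv, yv)) coordinate pairs that already
    # passed the d_1 <= D_2 and d_1/d_2 <= alpha ratio test above.
    diffs = np.array([[xc - xv, yc - yv] for (xc, yc), (xv, yv) in matches],
                     dtype=np.float64)
    for _ in range(iters):
        mv = diffs.mean(axis=0)
        dev = np.linalg.norm(diffs - mv, axis=1)
        keep = dev <= k * dev.std() + 1e-9
        if keep.all() or not keep.any():
            break
        diffs = diffs[keep]
    mv_x, mv_y = diffs.mean(axis=0)  # MV_x, MV_y of the claim
    return mv_x, mv_y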
5. The traffic monitoring video coding method according to claim 1, wherein selecting the matching background from the database based on the sum of absolute differences for the background to be coded comprises:
the sum of absolute differences between the current background to be coded and the pixels at corresponding positions of a background in the database is taken as the similarity criterion, and is calculated for the current background to be coded against each background in the database as shown in the following formula:
SAD = Σ_{k∈B} |pc_k − pl_k|;
wherein pc_k and pl_k respectively denote the k-th pixel value of the current background to be coded and of a background in the database, and B is the set of pixels of the current background to be coded;
the calculated results are sorted in ascending order, and the background with the minimum sum of absolute differences is taken as the matching background of the current background to be coded.
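This selection is a plain argmin over SAD values; a minimal sketch:

import numpy as np

def select_matching_background(current_bg, db_backgrounds):
    # Returns the index of the database background with minimum SAD.
    sads = [int(np.abs(current_bg.astype(np.int32) - bg.astype(np.int32)).sum())
            for bg in db_backgrounds]
    best = int(np.argmin(sads))
    return best, sads[best]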
6. The traffic monitoring video coding method according to claim 1, wherein when the inter-frame prediction mode is adopted, the determining whether the vehicle to be coded or the background to be coded needs to perform rate distortion optimization processing on the matching vehicle or the matching background by using a predetermined mode comprises:
the comparison criterion for rate-distortion optimization in the inter-prediction mode is as follows:
J = D + λ × R;
wherein J is the Lagrangian loss function, D is the sum of absolute differences between the prediction block and the matching block, R is the number of bits used to represent the mode information, and λ is the Lagrange multiplier;
first, the Lagrangian loss functions of the current vehicle to be coded and of the current background to be coded with respect to the existing reference frames are calculated:
for each existing reference frame of the current vehicle to be coded, the motion vectors of the inter-predicted 4x4 blocks at the positions corresponding to the current vehicle to be coded, together with the picture numbers of their reference frames, are collected in units of 4x4 blocks; on this basis, the motion vector of the corresponding 4x4 block of the current vehicle to be coded is estimated as:

MVX_cur = MVX_ref × (POC_cur − POC_ref) / (POC_ref − POC_colref);
MVY_cur = MVY_ref × (POC_cur − POC_ref) / (POC_ref − POC_colref);

wherein MVX_ref and MVY_ref are respectively the horizontal and vertical components of the motion vector of an inter-predicted 4x4 block on the existing reference frame; POC_cur, POC_ref and POC_colref are respectively the picture number of the frame containing the current vehicle to be coded, the picture number of the existing reference frame, and the picture number of the reference frame of that inter-predicted 4x4 block; MVX_cur and MVY_cur are the estimated horizontal and vertical components for the corresponding 4x4 block of the current vehicle to be coded; each 4x4 block of the current vehicle to be coded is traversed, the number of inter-predicted 4x4 blocks and the estimated motion vectors of the corresponding 4x4 blocks are recorded, and the horizontal and vertical components of the finally estimated displacement of the current vehicle to be coded are taken as the average over all inter-predicted 4x4 block motion vectors;
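A minimal sketch of this scaling (our reading of the estimation formulas as standard POC-distance scaling; the degenerate-POC fallback is an assumption):

def scale_mv(mvx_ref, mvy_ref, poc_cur, poc_ref, poc_colref):
    # Scales a 4x4 block MV from the existing reference frame to the
    # current frame by the ratio of picture-number distances.
    if poc_ref == poc_colref:
        return mvx_ref, mvy_ref  # degenerate case: keep the reference MV
    s = (poc_cur - poc_ref) / (poc_ref - poc_colref)
    return mvx_ref * s, mvy_ref * s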
after the displacement of the current vehicle to be coded on each existing reference frame is obtained, the current vehicle to be coded is divided into blocks of fixed size 16x16, and each 16x16 block searches all existing reference frames in turn for the block with the minimum loss function, the loss function consisting of the sum of absolute differences and the coding rate of the motion vector; the search takes the position of the current 16x16 block, translated by the estimated displacement, as the starting point and performs an eight-point diamond search within a range of 64 pixels around it; in units of 16x16 blocks, the minimum loss function of every block of the current vehicle to be coded and its matching block over all existing reference frames are recorded; each 16x16 block of the current vehicle to be coded is traversed in turn and the per-block minima are accumulated, giving the Lagrangian loss function J_veh_ref of the current vehicle to be coded with respect to the existing reference frames;
the current background to be coded is divided into 16x16 blocks; for each current 16x16 block, the matching block with the minimum loss function is searched among all existing reference frames by comparing the sums of absolute differences between the 16x16 blocks at corresponding positions of all reference frames and the current 16x16 block, the minimum sum of absolute differences being selected as the loss function of that block; all 16x16 blocks of the current background to be coded are traversed and their loss functions accumulated, giving the Lagrangian loss function J_bg_ref of the current background to be coded;
then, taking the matching vehicle and matching background into account, updated Lagrangian loss functions are calculated:

for each 16x16 block of the current vehicle to be coded, on the basis of the per-block results computed above, the loss function of the block against the matching vehicle is calculated with fast motion estimation; the loss function of each 16x16 block against the matching vehicle is compared with that block's minimum loss function over the existing reference frames, and the smaller one is taken as the minimum loss function of the corresponding 16x16 block; each 16x16 block of the current vehicle to be coded is traversed and the per-block minima are accumulated; meanwhile, for the current vehicle to be coded, the change in bit count comprises the position index of the matching vehicle in the database, the position of the matching vehicle in the reference frame, the change of reference-index bits, and the CTU-level indication information; combining this bit-count change with the accumulated loss gives the updated Lagrangian loss function J_veh_lib;
for each 16x16 block of the current background to be coded, on the basis of the per-block results computed above, the loss function of the block against the matching background is calculated; the loss function of each 16x16 block against the matching background is compared with that block's minimum loss function over the existing reference frames, and the smaller one is taken as the minimum loss function of the corresponding 16x16 block; each 16x16 block of the current background to be coded is traversed and the per-block minima are accumulated; meanwhile, for the current background to be coded, the change in bit count comprises the position index of the matching background in the database and the change of reference-index bits; combining this bit-count change with the accumulated loss gives the updated Lagrangian loss function J_bg_lib;
finally, the Lagrangian loss function J_veh_ref is compared with the updated Lagrangian loss function J_veh_lib, and if J_veh_lib < J_veh_ref, rate-distortion optimization processing is performed on the matching vehicle; likewise, the Lagrangian loss function J_bg_ref is compared with the updated Lagrangian loss function J_bg_lib, and if J_bg_lib < J_bg_ref, rate-distortion optimization processing is performed on the matching background.
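A non-normative Python sketch of this decision (names are ours; extra_bits stands for the index/position/flag signalling overhead described above):

def updated_cost(block_costs_ref, block_costs_lib, extra_bits, lam):
    # Per-16x16-block minimum over existing references vs. the library
    # match, plus the signalling overhead weighted by lambda.
    j = sum(min(r, l) for r, l in zip(block_costs_ref, block_costs_lib))
    return j + lam * extra_bits

def use_library_match(j_ref, block_costs_ref, block_costs_lib,
                      extra_bits, lam):
    # RDO considers the matching vehicle/background only if the updated
    # Lagrangian cost beats the existing-reference-only cost.
    return updated_cost(block_costs_ref, block_costs_lib,
                        extra_bits, lam) < j_ref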
7. The traffic monitoring video coding method according to claim 1, wherein when the intra prediction mode is adopted, the determining whether the vehicle to be coded or the background to be coded needs to perform rate distortion optimization processing on the matching vehicle or the matching background by using a predetermined mode comprises:
the comparison criteria for rate-distortion optimization in intra prediction mode are:
J = D + λ × R;
wherein J is the Lagrangian loss function, D is the sum of absolute differences between the prediction block and the matching block, R is the number of bits used to represent the mode information, and λ is the Lagrange multiplier;
for the current background to be coded, rate-distortion optimization processing is always performed on the matching background in the intra prediction mode;
for the current vehicle to be coded, the loss function when intra prediction is used is first roughly estimated: the current vehicle to be coded is divided into blocks of fixed size 16x16, and for each 16x16 block the DC, planar, horizontal and vertical intra prediction modes are estimated in turn, giving the sum of absolute differences of each 16x16 block under each mode; during the intra prediction mode estimation, the reference pixel values of the current 16x16 block are derived from the original values of the neighbouring 16x16 blocks; for each 16x16 block, the sums of absolute differences estimated under all modes are sorted in ascending order, and the result with the smallest sum of absolute differences is taken as the best matching result of the current 16x16 block; all 16x16 blocks of the current vehicle to be coded are traversed and their best matching results accumulated, giving the Lagrangian loss function J_veh_intra of the current vehicle to be coded;
then, taking the matching vehicle into account, an updated Lagrangian loss function is calculated: for each 16x16 block of the current vehicle to be coded, on the basis of the per-block estimates computed above, the loss function of the block against the matching vehicle is calculated with fast motion estimation; the loss function of each 16x16 block against the matching vehicle is compared with that block's minimum intra-estimated sum of absolute differences, and the smaller one is taken as the minimum loss function of the corresponding 16x16 block; each 16x16 block of the current vehicle to be coded is traversed and the per-block minima are accumulated; meanwhile, for the current vehicle to be coded, the change in bit count comprises the position index of the matching vehicle in the database, the position of the matching vehicle in the reference frame, and the CTU-level indication information; combining this bit-count change with the accumulated loss gives the updated Lagrangian loss function J_veh_lib_intra;
finally, the Lagrangian loss function J_veh_intra is compared with the updated Lagrangian loss function J_veh_lib_intra, and if J_veh_lib_intra < J_veh_intra, rate-distortion optimization processing is performed on the matching vehicle.
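A minimal Python sketch of the rough intra estimate above (the planar predictor is a simplified average of the two directional ramps rather than the exact one, and the boundary fallback value 128 is illustrative):

import numpy as np

def rough_intra_sad_16x16(img, y, x):
    # Estimates DC, planar, horizontal and vertical modes for the 16x16
    # block at (y, x); reference pixels come from the ORIGINAL values of
    # the neighbouring blocks, as the claim specifies.
    n = 16
    blk = img[y:y + n, x:x + n].astype(np.float64)
    top = img[y - 1, x:x + n].astype(np.float64) if y > 0 else np.full(n, 128.0)
    left = img[y:y + n, x - 1].astype(np.float64) if x > 0 else np.full(n, 128.0)

    preds = {
        'dc': np.full((n, n), (top.mean() + left.mean()) / 2.0),
        'horizontal': np.repeat(left[:, None], n, axis=1),
        'vertical': np.repeat(top[None, :], n, axis=0),
    }
    preds['planar'] = (preds['horizontal'] + preds['vertical']) / 2.0
    sads = {mode: float(np.abs(blk - p).sum()) for mode, p in preds.items()}
    best = min(sads, key=sads.get)
    return best, sads[best]  # best mode and its SAD for this block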
8. The traffic monitoring video coding method according to claim 1, 6 or 7, wherein performing the corresponding processing according to the judgment result, coding with the corresponding prediction mode, and coding the information of the matching vehicle or matching background referred to during coding into the code stream comprises:
when the inter prediction mode is used, if rate-distortion optimization processing needs to be performed on the matching vehicle or matching background, a new reference frame space is allocated, and the matching vehicle or matching background is pasted onto the newly allocated reference frame, which is used together with the existing reference frames for inter prediction of the current vehicle to be coded or background to be coded; after inter prediction, each 4x4 block covered by the current vehicle to be coded or current background to be coded is traversed, and if a 4x4 block refers to the information of the matching vehicle or matching background, the corresponding syntax element is coded into the code stream;
when the intra prediction mode is used, if rate-distortion optimization processing needs to be performed on the matching vehicle or matching background, a new reference frame space is allocated, and the matching vehicle or matching background is pasted onto the newly allocated reference frame for intra prediction of the current vehicle to be coded or background to be coded;
the position at which the matching vehicle is pasted onto the newly allocated reference frame is determined as follows:

x_0 = x_c + MV_x;
y_0 = y_c + MV_y;

wherein x_0 and y_0 denote the position at which the matching vehicle is pasted onto the newly allocated reference frame, x_c and y_c denote the position of the current vehicle to be coded in the current frame, and MV_x and MV_y are the horizontal and vertical components of the offset of the current vehicle to be coded relative to the matching vehicle;
when the matched background is pasted on the reference frame, the matched background is aligned with the position of the reference frame.
9. The traffic monitoring video coding method according to claim 8, wherein the structure of the coded stream is divided into two layers, slice and coding tree unit (CTU); wherein:
slice layer: for the current vehicle to be coded, the slice layer comprises a flag indicating whether a matched vehicle is referenced in the current slice; all 4x4 blocks covered by vehicles in the current slice are traversed to judge whether they reference matched vehicles: if some 4x4 block references a matched vehicle, the flag is set to true, otherwise to false; if the flag is true, the slice layer further comprises a syntax element indicating the number of matched vehicles referenced in the current slice; for each matched vehicle, its position index in the database and the position at which it is pasted onto the newly allocated reference frame are coded into the code stream; the number of referenced matched vehicles, the index of each matched vehicle, and each paste position are coded with fixed-length coding;
for the current background to be coded, the slice layer comprises a flag indicating whether the matching background is referenced in the current slice; all 4x4 blocks covered by the background in the current slice are traversed to judge whether they reference the matching background: if some 4x4 block references the matching background, the flag is set to true, otherwise to false; if the flag is true, the slice layer further contains a syntax element for the position index of the referenced matching background in the database, coded with fixed-length coding;
CTU layer: for the current vehicle to be coded, the CTU layer comprises a flag indicating whether the current CTU references matched vehicle pixels; each 4x4 block in the current CTU is traversed, and if some 4x4 block references matched vehicle pixels, the flag is set to true, otherwise to false; when the flag is true, the CTU layer further includes a syntax element indicating the matched vehicle index;
for the current background to be encoded, the CTU layer contains a flag indicating whether the current CTU layer references matching background pixels.
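As an illustration of the slice-layer syntax for matched vehicles (the bit writer and all field widths below are hypothetical stand-ins; the claim only requires fixed-length coding):

class BitWriter:
    # Minimal illustrative bit writer; a real encoder has its own.
    def __init__(self):
        self.bits = []

    def write(self, value: int, nbits: int) -> None:
        self.bits.extend((value >> (nbits - 1 - i)) & 1 for i in range(nbits))

def write_vehicle_slice_syntax(bw, matched_vehicles, idx_bits=8, pos_bits=16):
    flag = len(matched_vehicles) > 0
    bw.write(int(flag), 1)                         # any block refers a match?
    if flag:
        bw.write(len(matched_vehicles), idx_bits)  # number of matched vehicles
        for v in matched_vehicles:
            bw.write(v['db_index'], idx_bits)      # position index in database
            bw.write(v['paste_x'], pos_bits)       # paste position on the
            bw.write(v['paste_y'], pos_bits)       # newly allocated ref frame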
CN201810720989.1A 2018-07-03 2018-07-03 Traffic monitoring video coding method Active CN108833928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810720989.1A CN108833928B (en) 2018-07-03 2018-07-03 Traffic monitoring video coding method

Publications (2)

Publication Number Publication Date
CN108833928A CN108833928A (en) 2018-11-16
CN108833928B true CN108833928B (en) 2020-06-26

Family

ID=64135268

Country Status (1)

Country Link
CN (1) CN108833928B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871024A (en) * 2019-01-04 2019-06-11 中国计量大学 A kind of UAV position and orientation estimation method based on lightweight visual odometry
CN111582251B (en) * 2020-06-15 2021-04-02 江苏航天大为科技股份有限公司 Method for detecting passenger crowding degree of urban rail transit based on convolutional neural network
CN112714322B (en) * 2020-12-28 2023-08-01 福州大学 Inter-frame reference optimization method for game video

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009111498A3 (en) * 2008-03-03 2009-12-03 Videoiq, Inc. Object matching for tracking, indexing, and search
CN105849771A (en) * 2013-12-19 2016-08-10 Metaio GmbH SLAM on a mobile device
CN104301735A (en) * 2014-10-31 2015-01-21 武汉大学 Method and system for global encoding of urban traffic surveillance video
CN104539962A (en) * 2015-01-20 2015-04-22 北京工业大学 Layered video coding method fused with visual perception features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Background-modeling-based adaptive prediction for surveillance video coding; Xianguo Zhang; IEEE Transactions on Image Processing; 2014-12-31; full text *
Global coding of multi-source surveillance video data; Jing Xiao; Data Compression Conference (DCC); 2015-12-31; full text *
Knowledge-based coding of objects for multisource surveillance video data; Jing Xiao; IEEE Transactions on Multimedia; 2016-12-31; full text *
Surveillance video coding with vehicle library; Changyue Ma; 2017 IEEE International Conference on Image Processing (ICIP); 2017-09-20; Sections 1-4 *

Similar Documents

Publication Publication Date Title
CN110087087B (en) VVC inter-frame coding unit prediction mode early decision and block division early termination method
JP6073404B2 (en) Video decoding method and apparatus
WO2018192235A1 (en) Coding unit depth determination method and device
CN108833928B (en) Traffic monitoring video coding method
CN107657228B (en) Video scene similarity analysis method and system, and video encoding and decoding method and system
EP1022667A2 (en) Methods of feature extraction of video sequences
US11070803B2 (en) Method and apparatus for determining coding cost of coding unit and computer-readable storage medium
EP3405904B1 (en) Method for processing keypoint trajectories in video
JP2015536092A5 (en)
CN103873861A (en) Coding mode selection method for HEVC (high efficiency video coding)
CN109040764B (en) HEVC screen content intra-frame rapid coding algorithm based on decision tree
CN112437310B (en) VVC intra-frame coding rapid CU partition decision method based on random forest
CN103020138A (en) Method and device for video retrieval
Ma et al. Traffic surveillance video coding with libraries of vehicles and background
CN113112519A (en) Key frame screening method based on interested target distribution
KR102261669B1 (en) Artificial Neural Network Based Object Region Detection Method, Device and Computer Program Thereof
Ma et al. Surveillance video coding with vehicle library
WO2013163197A1 (en) Macroblock partitioning and motion estimation using object analysis for video compression
CN113422959A (en) Video encoding and decoding method and device, electronic equipment and storage medium
Ma et al. An adaptive lagrange multiplier determination method for dynamic texture in HEVC
CN112770116B (en) Method for extracting video key frame by using video compression coding information
CN110519597B (en) HEVC-based encoding method and device, computing equipment and medium
CN109618152A (en) Depth divides coding method, device and electronic equipment
CN116962708B (en) Intelligent service cloud terminal data optimization transmission method and system
CN114095736B (en) Fast motion estimation video coding method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant