CN115695949A - Video concentration method based on target track motion mode - Google Patents

Video concentration method based on target track motion mode

Info

Publication number
CN115695949A
Authority
CN
China
Prior art keywords
target
track
video
tracks
motion
Prior art date
Legal status
Pending
Application number
CN202211322752.0A
Other languages
Chinese (zh)
Inventor
汪陈伍
武君胜
王佩
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date: 2022-10-27
Filing date: 2022-10-27
Publication date: 2023-02-03
Application filed by Northwestern Polytechnical University
Priority to CN202211322752.0A
Publication of CN115695949A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a video concentration method based on target track motion patterns, and relates to the technical field of image processing and video surveillance. The method comprises two stages: 1) Generation and grouping of target track motion patterns, comprising extraction of video target motion tracks, generation of track motion patterns, and grouping of the track motion patterns; 2) On-line video concentration based on the track motion patterns, comprising single-target track extraction, video background picture generation, track motion pattern matching, in-group target track rearrangement, and condensed video generation. Offline video data are trained to obtain the target track motion patterns and groups of the surveillance video scene; on-line video concentration is then applied to the surveillance video stream according to the trained track motion patterns and groups, which improves the execution efficiency of video concentration and the visual effect of the condensed video.

Description

Video concentration method based on target track motion mode
Technical Field
The invention belongs to the technical field of image processing and video monitoring, and particularly relates to a video concentration method based on a target track motion mode.
Background
With the development of computer networks and video technology, large numbers of surveillance cameras have been deployed in cities to meet public-safety needs. These cameras work continuously 24 hours a day and capture and store a huge amount of video data every day. However, browsing video data is time-consuming and labor-intensive, and massive amounts of information cannot be acquired accurately and in time by manpower alone, so how to use video data effectively is a difficult problem.
Video concentration is a target-based video summarization technology that compresses a long video into a short one for fast retrieval and browsing of the original surveillance data. It takes the target as the basic processing unit: the background picture of the surveillance video is extracted by background modeling, foreground targets are obtained with target detection and instance segmentation, and consecutive foreground detections are associated by target tracking to generate target tracks (also called target tubes). The target tracks are shifted along the time axis and rearranged, and the rearranged tracks are fused with the background picture to generate the condensed video. Video concentration not only eliminates the temporal and spatial redundancy of the source video, but also well preserves the dynamic characteristics of the moving targets.
However, shifting target tracks in video concentration also brings new problems. First, pseudo collisions between video targets can occur: two target tracks without temporal overlap in the source video cannot collide, but after the tracks are shifted along the time axis, tracks that never collided may collide, which is called a pseudo collision. Second, target track rearrangement is very time-consuming: rearrangement is the most important step of video concentration and is usually cast as a multi-objective optimization problem via a loss function, solved by global search methods such as simulated annealing or Markov chain Monte Carlo, which generally converge slowly. Third, the condensed video may have a poor visual browsing effect: a condensed video usually shows many targets per frame, and if targets with different directions and speeds are gathered into the same frames and many pseudo collisions occur among them, the user's visual browsing experience suffers. How to reduce pseudo collisions while keeping a high compression ratio, improve execution efficiency, and improve the visual browsing effect is the problem to be solved in the technical field of video concentration.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a video concentration method based on target track motion patterns, which improves the execution efficiency of video concentration and the visual browsing effect, and reduces pseudo collisions among targets.
The technical solution of the invention is as follows: the invention provides a video concentration method based on a target track motion mode, which comprises the following steps:
1) Generation and grouping of target track motion patterns. Historical surveillance video data are input, a model is constructed to train on the video data, and target track motion patterns and groups are generated;
2) On-line video concentration based on the track motion patterns. A surveillance video stream is input, video moving-target tracks are extracted, and on-line video concentration is performed with the track motion patterns and grouping results of step 1) to generate a condensed video.
The step 1) comprises the following steps:
11) Inputting offline surveillance video data, performing target detection, instance segmentation and target tracking with deep learning models, obtaining the instance mask sequence of each target across video frames (the instance mask sequence is the target track), and completing the extraction of all moving-target tracks in the video;
12 On the basis of extracting the target tracks, clustering all the target tracks by adopting a clustering algorithm, clustering the target tracks into different classes, and generating a target track motion mode;
13 On the basis of target track motion mode generation, according to the principle of less collision and direction consistency, the target track motion modes are divided to generate track motion mode groups.
The step 2) comprises the following steps:
21) Input the surveillance video stream, perform background modeling with a Gaussian mixture model, extract the background picture of the surveillance video, and update the background at a set time interval;
22) Input the surveillance video stream, subtract the background from the current frame, and obtain the foreground mask through dilation and erosion operations; then track the targets with Kalman filtering and the Hungarian matching algorithm to extract the target motion tracks;
23 Matching the extracted target motion track with the target track motion mode, and automatically attributing the target track to the corresponding track motion mode group if the matching is successful; if the matching fails, the target track is assigned to an abnormal track motion mode group;
24 For the target motion tracks belonging to the same track motion mode grouping, carrying out video concentration processing by adopting a target track rearrangement method based on dynamic search space local optimization; defining an energy loss function for target motion tracks belonging to abnormal track motion mode groups, and performing video concentration processing of target track rearrangement;
25) For each video concentration group, fuse the rearranged target tracks (target tubes) with the extracted video background picture frame by frame using a Poisson fusion algorithm, and merge the fused consecutive video frames to generate the condensed video.
The step 12) further comprises the following steps:
121) The target track is represented with the equidistant nearest-neighbor sampling point method as a coordinate vector
$$T = \left[(x_1, y_1), (x_2, y_2), \dots, (x_{2^{\omega}+1}, y_{2^{\omega}+1})\right]$$
where $\omega$ is the fold parameter used in the method and $(x_i, y_i)$ are the coordinates of a track sampling point;
122) A criterion of target track similarity is determined: the Euclidean distance between track coordinate vectors is taken as the distance between two tracks, and a smaller distance means higher similarity. The distance between the $i$-th and $j$-th tracks is $D_{i,j}$, calculated as:
$$D_{i,j} = \sum_{k=1}^{2^{\omega}+1} \sqrt{\left(x_k^i - x_k^j\right)^2 + \left(y_k^i - y_k^j\right)^2}$$
where $x_k^i$ and $y_k^i$ are the x and y coordinate values of the $k$-th sampling point of the $i$-th track, and $\omega$ is the fold parameter.
123) Compute the pairwise distances between tracks to generate a similarity measurement matrix, set the neighbor radius γ and the minimum core-point number threshold Ω, and cluster the target tracks with the DBSCAN clustering algorithm; each cluster is determined as one target track motion pattern;
124) For each target track motion pattern, a representative track is generated: the mean abscissa x and mean ordinate y of the coordinate vectors of all target tracks belonging to the same cluster are computed per sampling point to form the coordinate vector of the representative track. Assuming there are $n$ target tracks in a cluster, $\omega$ is the fold parameter, $2^{\omega}+1$ is the number of sampling points per track, and $x_k^i$ and $y_k^i$ are the x and y coordinates of the $k$-th sampling point of the $i$-th track, the calculation is:
$$\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n} x_k^i, \qquad \bar{y}_k = \frac{1}{n}\sum_{i=1}^{n} y_k^i$$
Then the coordinate vector of the representative track is:
$$\bar{T} = \left[(\bar{x}_1, \bar{y}_1), (\bar{x}_2, \bar{y}_2), \dots, (\bar{x}_{2^{\omega}+1}, \bar{y}_{2^{\omega}+1})\right]$$
the step 13) further comprises the following steps:
131) Count the sizes (height and width) of all target detection boxes of the current motion pattern group in the video, and compute the average height and width of the detection boxes;
132) Superimpose the average detection-box height and width onto the representative track so that the representative track carries one detection box in each frame; the collision value between two representative tracks is then computed by summing the numbers of intersection pixels over all pairs of detection boxes from the two tracks;
133) Divide 360° into θ equal parts in a rectangular coordinate system, representing θ direction classes; the direction of a representative track is the direction of the line from its start point to its end point; the representative track directions are classified, and the relation between different representative track directions is judged by whether they belong to the same direction class;
134) Group the representative tracks according to their collision and direction relations; each group is one target track motion pattern group. A collision threshold is set so that the representative tracks within a group have no or few collisions; meanwhile, the direction relation is considered so that motion patterns forming a circular flow are preferentially placed in one group, keeping the subsequent in-group condensed video fluent.
The step 23) further comprises the following steps:
231) The extracted target motion track is represented with the equidistant nearest-neighbor sampling point method as a coordinate vector
$$T = \left[(x_1, y_1), (x_2, y_2), \dots, (x_{2^{\omega}+1}, y_{2^{\omega}+1})\right]$$
where $\omega$ is the fold parameter used in the algorithm and $(x_i, y_i)$ are the coordinates of a track sampling point;
232 Calculating the distances of all sampling points between the target track and the representative tracks of all motion modes to serve as a basis for judging the similarity between the tracks, wherein the smaller the distance value is, the higher the similarity is;
233) The target track motion pattern with the shortest distance is found, the shortest distance is denoted $d_{min}$ and compared with a set threshold $\alpha$; if $d_{min} \le \alpha$, the target track is attributed to that target track motion pattern and to the corresponding track motion pattern group;
234) If $d_{min} > \alpha$, the target track is attributed to the abnormal track motion pattern.
The step 24) further comprises the following steps:
241) An energy loss function is defined, whose loss terms comprise a condensed-length loss term, a pseudo-collision loss term and a time-order disorder loss term; the energy loss function has the form:
$$E = \sum_{i \in P_g} \Big( \alpha_r E_r(l_i) + \sum_{j \in P_c} \big[ \alpha_c E_c(l_i, l_j) + \alpha_t E_t(l_i, l_j) \big] \Big)$$
where $l_i$ denotes the start position of track $i$ in the condensed video, $P_g$ denotes all tracks belonging to the same motion pattern group as track $i$, and $P_c$ denotes all tracks belonging to the same motion pattern as track $i$; $E_r$ is the condensed-video length loss term, $E_c$ the condensed-video pseudo-collision loss term, and $E_t$ the condensed-video track time-order disorder loss term; $\alpha_r$, $\alpha_c$ and $\alpha_t$ are the weights of the respective loss terms, balancing their effects;
242) The range of target track rearrangement positions is determined with the dynamic search space method. The start position $l_i$ of target track $O_i$ in the condensed video is a dynamically changing value; its minimum is denoted $l_{min}$ and its maximum $l_{max}$. The minimum is the maximum of the start positions of all tracks already arranged in the group, and the maximum is the maximum of the end positions of all tracks already arranged in the group:
$$l_{min} = \max_{k \in P_c} \, l_k, \qquad l_{max} = \max_{k \in P_c} \big( l_k + len(O_k) \big)$$
where $P_c$ denotes the tracks of the same motion pattern as track $i$ that have already been arranged, and $len(O_k)$ is the frame length of the $k$-th track;
243) Within the start-position search space and under the energy loss function, the target tracks in the group are condensed and rearranged with a locally optimized greedy algorithm.
The step 121) further comprises the following steps:
1211 Mark the coordinates of the starting point and the end point of the target track, and draw a straight line from the starting point to the end point;
1212) Divide the straight line into $2^{\omega}$ equal parts, where the fold parameter $\omega$ represents the number of halvings and typically has the value range $\omega \in \{3, 4, 5, 6\}$; the line then has $2^{\omega}-1$ division points;
1213) Taking each division point on the straight line as the center of a circle, find the point on the target track closest to that division point and take it as a sampling point of the target track;
1214) The start and end points of the target track, together with the $(2^{\omega}-1)$ sampling points, form the coordinate vector representing the target track, of the form
$$T = \left[(x_1, y_1), (x_2, y_2), \dots, (x_{2^{\omega}+1}, y_{2^{\omega}+1})\right]$$
The invention has the beneficial effects that:
(1) The target track is represented with the equidistant nearest-neighbor sampling point method, which represents the position and position-change information of the track more accurately, unifies the representation length of tracks of different lengths, facilitates the distance calculation between tracks, and improves the efficiency and accuracy of track clustering and track matching;
(2) A two-stage processing mode is adopted: motion patterns are generated by training on offline video data, and the video stream is condensed on line. The training stage makes full use of historical data, so the generated track motion patterns are more complete and accurate; the on-line stage matches tracks against the existing motion patterns and rearranges target tracks with dynamic-search-space local optimization, improving the execution efficiency of video concentration;
(3) Track motion patterns are grouped according to the principle of fewer collisions and direction consistency; in the on-line stage, detected target tracks are assigned to different motion pattern groups and target track rearrangement is performed within each group, so that few pseudo collisions occur among targets in the condensed video, the condensed video contains target tracks of the same circular flow, and the visual browsing effect is good.
Drawings
Fig. 1 is a schematic diagram illustrating the steps of the video concentration method according to the present invention.
FIG. 2 is a schematic diagram of a method for equidistant nearest neighbor sampling points according to the present invention.
FIG. 3 is a schematic diagram of a method for measuring similarity between target tracks according to the present invention.
Fig. 4 is a schematic diagram of grouping target track motion patterns according to the present invention.
Detailed Description
Referring to fig. 2, in the equidistant nearest-neighbor sampling point method for a target track, a straight line connects the start point and the end point of the track, and the line is divided into 8, 16, 32 or 64 equal parts according to the resolution of the video image. Taking an 8-part division as an example, a circle is drawn around each division point on the line and the nearest point on the track is searched for as a sampling point of the target track. The coordinates of the sampling points on the track (the start point, the end point and the 7 interior points, 9 in total) together form the coordinate vector representing the target track.
Referring to fig. 3, in the method for measuring similarity between target tracks, after the tracks are represented with equidistant nearest-neighbor sampling points, the Euclidean distances between corresponding sampling points of the two tracks are calculated in sequence and summed; the sum is the distance between the two target tracks. The smaller the distance value, the higher the similarity.
Referring to fig. 4, there are 6 track motion patterns, where a and b, c and d, and e and f are the representative tracks of three pairs of motion patterns with similar tracks but opposite directions, and large collisions exist within each of these three pairs of representative tracks. The principle of motion pattern grouping is to place motion patterns whose representative tracks collide little into one group, while the motion patterns within a group follow one circular flow direction as far as possible. After grouping, there are two groups: a, c and e form one group, and b, d and f form the other.
Referring to fig. 1, the present invention provides a video concentration method based on target track motion pattern grouping, comprising:
1) Generation and grouping of target track motion patterns. Historical surveillance video data are input, a model is constructed to train on the video data, and target track motion patterns and groups are generated.
The specific process of the step 1) is as follows:
11) Inputting offline surveillance video data, performing target detection and instance segmentation with a YOLACT++ model, and performing target tracking with a DeepSORT model to obtain the instance mask sequence of each target across video frames, the instance mask sequence being the target track; this completes the extraction of all moving-target tracks in the video;
12 On the basis of target track extraction, clustering operation is carried out on all target tracks by adopting a DBSCAN clustering algorithm, the target tracks are clustered into different classes, and a target track motion mode is generated;
the specific process of the step 12) is as follows:
121) The target track is represented with the equidistant nearest-neighbor sampling point algorithm as a coordinate vector
$$T = \left[(x_1, y_1), (x_2, y_2), \dots, (x_{2^{\omega}+1}, y_{2^{\omega}+1})\right]$$
where $\omega$ is the fold parameter used in the algorithm and $(x_i, y_i)$ are the coordinates of a track sampling point;
as shown in fig. 2, the specific process of step 121) is as follows:
1211 Mark the coordinates of the starting point and the end point of the target track, and draw a straight line from the starting point to the end point;
1212) Divide the straight line into $2^{\omega}$ equal parts, where the fold parameter $\omega$ represents the number of halvings and typically has the value range $\omega \in \{3, 4, 5, 6\}$; the line then has $2^{\omega}-1$ division points;
1213) Taking each division point on the straight line as the center of a circle, find the point on the target track closest to that division point and take it as a sampling point of the target track;
1214) The start and end points of the target track, together with the $(2^{\omega}-1)$ sampling points, form the coordinate vector representing the target track, of the form
$$T = \left[(x_1, y_1), (x_2, y_2), \dots, (x_{2^{\omega}+1}, y_{2^{\omega}+1})\right]$$
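As a non-limiting illustration of steps 1211)-1214), the following is a minimal Python/NumPy sketch of the equidistant nearest-neighbor sampling, assuming the track is given as a per-frame sequence of centroid coordinates; the function and variable names are illustrative, not part of the invention.

```python
import numpy as np

def sample_track(track_xy, omega=3):
    # track_xy: (N, 2) per-frame centroid coordinates of one target (assumed input).
    # omega: fold parameter; the start-end line is halved omega times, giving
    # 2**omega segments and 2**omega - 1 interior division points.
    track_xy = np.asarray(track_xy, dtype=float)
    start, end = track_xy[0], track_xy[-1]
    # Interior division points of the straight line from start to end.
    ts = np.linspace(0.0, 1.0, 2 ** omega + 1)[1:-1]
    div_pts = start + ts[:, None] * (end - start)
    # The track point nearest to each division point becomes a sampling point.
    d = np.linalg.norm(track_xy[None, :, :] - div_pts[:, None, :], axis=2)
    nearest = track_xy[d.argmin(axis=1)]
    # Coordinate vector: start point, the 2**omega - 1 sampled points, end point.
    return np.vstack([start, nearest, end])  # shape (2**omega + 1, 2)
```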
122) Determine the similarity criterion of target tracks: the Euclidean distance between track coordinate vectors is taken as the distance between two tracks, and a smaller distance means higher similarity. As shown in fig. 3, the distance between the $i$-th and $j$-th tracks is $D_{i,j}$, calculated as:
$$D_{i,j} = \sum_{k=1}^{2^{\omega}+1} \sqrt{\left(x_k^i - x_k^j\right)^2 + \left(y_k^i - y_k^j\right)^2}$$
where $x_k^i$ and $y_k^i$ are the x and y coordinate values of the $k$-th sampling point of the $i$-th track, and $\omega$ is the fold parameter.
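A corresponding sketch of the distance $D_{i,j}$ of step 122), under the same assumptions (coordinate vectors as returned by the sampling sketch above):

```python
import numpy as np

def track_distance(T_i, T_j):
    # D_ij of step 122): sum over the 2**omega + 1 corresponding sampling
    # points of the pointwise Euclidean distances.
    return float(np.linalg.norm(np.asarray(T_i) - np.asarray(T_j), axis=1).sum())
```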
123) Compute the pairwise distances between tracks to generate a similarity measurement matrix, set the neighbor radius γ and the minimum core-point number threshold Ω, and cluster the target tracks with the DBSCAN clustering algorithm; each cluster is determined as one target track motion pattern;
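Step 123) can be sketched with scikit-learn's DBSCAN on a precomputed distance matrix, where eps stands for the neighbor radius γ and min_samples for the core-point threshold Ω; this is one possible realization, not necessarily the inventors' implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_tracks(coord_vectors, gamma, min_core_pts):
    # Similarity measurement matrix: pairwise track distances of step 122),
    # reusing track_distance from the sketch above.
    n = len(coord_vectors)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = track_distance(coord_vectors[i], coord_vectors[j])
    # Each label >= 0 is one target track motion pattern; -1 marks noise tracks.
    return DBSCAN(eps=gamma, min_samples=min_core_pts,
                  metric="precomputed").fit_predict(dist)
```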
124) For each target track motion pattern, a representative track is generated: the mean abscissa x and mean ordinate y of the coordinate vectors of all target tracks belonging to the same cluster are computed per sampling point to form the coordinate vector of the representative track. Assuming there are $n$ target tracks in a cluster, $\omega$ is the fold parameter, $2^{\omega}+1$ is the number of sampling points per track, and $x_k^i$ and $y_k^i$ are the x and y coordinates of the $k$-th sampling point of the $i$-th track, the calculation is:
$$\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n} x_k^i, \qquad \bar{y}_k = \frac{1}{n}\sum_{i=1}^{n} y_k^i$$
Then the coordinate vector of the representative track is:
$$\bar{T} = \left[(\bar{x}_1, \bar{y}_1), (\bar{x}_2, \bar{y}_2), \dots, (\bar{x}_{2^{\omega}+1}, \bar{y}_{2^{\omega}+1})\right]$$
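Step 124) reduces to a per-sampling-point mean; a one-function sketch under the same assumptions:

```python
import numpy as np

def representative_track(cluster_vectors):
    # Step 124): per-sampling-point mean of the coordinate vectors in one cluster.
    return np.mean(np.stack(cluster_vectors), axis=0)  # shape (2**omega + 1, 2)
```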
13 On the basis of target track motion mode generation, according to the principle of less collision and direction consistency, the target track motion modes are divided to generate track motion mode groups.
The specific process of the step 13) is as follows:
131) Count the sizes (height and width) of all target detection boxes of the current motion pattern group in the video, and compute the average height and width of the detection boxes;
132) Superimpose the average detection-box height and width onto the representative track so that the representative track carries one detection box in each frame; the collision value between two representative tracks is then computed by summing the numbers of intersection pixels over all pairs of detection boxes from the two tracks;
133) Divide 360° into 8 equal parts in a rectangular coordinate system, representing 8 direction classes; the direction of a representative track is the direction of the line from its start point to its end point; the representative track directions are classified, and the relation between different representative track directions is judged by whether they belong to the same class;
134) As shown in fig. 4, the representative tracks are grouped according to their collision and direction relations; each group is one target track motion pattern group. A collision threshold is set so that the representative tracks within a group have no or few collisions; meanwhile, the direction relation is considered so that motion patterns forming a circular flow are preferentially placed in one group, keeping the subsequent in-group condensed video fluent.
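The direction classification of step 133) and a simplified greedy version of the grouping of step 134) might look as follows; the collision matrix is assumed precomputed as in step 132), and the circular-flow preference of the patent is not modeled in this sketch.

```python
import numpy as np

def direction_class(rep_track, theta=8):
    # Step 133): angle of the start-to-end line, quantized into theta classes.
    dx, dy = rep_track[-1] - rep_track[0]
    angle = np.degrees(np.arctan2(dy, dx)) % 360.0
    return int(angle // (360.0 / theta))

def group_patterns(n_patterns, collision, threshold):
    # Step 134), simplified: a pattern joins the first group whose members it
    # collides with by at most `threshold`; otherwise it opens a new group.
    groups = []
    for i in range(n_patterns):
        for g in groups:
            if all(collision[i][j] <= threshold for j in g):
                g.append(i)
                break
        else:
            groups.append([i])
    return groups
```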
2) On-line video concentration based on the track motion patterns. A surveillance video stream is input, video moving-target tracks are extracted, and on-line video concentration is performed with the track motion patterns and grouping results of step 1) to generate a condensed video.
The specific process of step 2) is as follows:
21) Input the surveillance video stream, perform background modeling with a Gaussian mixture model, extract the background picture of the surveillance video, and update the background at a set time interval;
22) Input the surveillance video stream, subtract the background from the current frame, and obtain the foreground mask through dilation and erosion operations; then track the targets with Kalman filtering and the Hungarian matching algorithm to extract the target motion tracks;
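Steps 21)-22) map directly onto OpenCV primitives; a hedged sketch follows, with the tracker (Kalman filtering plus Hungarian matching) omitted and the input path purely illustrative.

```python
import cv2

cap = cv2.VideoCapture("surveillance.mp4")  # illustrative input
mog = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = mog.apply(frame)                           # current frame minus modeled background
    fg = cv2.erode(cv2.dilate(fg, kernel), kernel)  # dilation, then erosion
    background = mog.getBackgroundImage()           # background picture, updated over time
cap.release()
```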
23 Matching the extracted target motion track with the target track motion mode, and if the matching is successful, automatically attributing the target track to a corresponding track motion mode group; if the matching fails, the target track is assigned to an abnormal track motion mode group;
the specific process of the step 23) is as follows:
231) The extracted target motion track is represented with the equidistant nearest-neighbor sampling point method. As shown in fig. 2, the target motion track is represented as a coordinate vector
$$T = \left[(x_1, y_1), (x_2, y_2), \dots, (x_{2^{\omega}+1}, y_{2^{\omega}+1})\right]$$
where $\omega$ is the fold parameter used in the algorithm and $(x_i, y_i)$ are the coordinates of a track sampling point;
232 Calculating the distances of all sampling points between the target track and the representative tracks of all motion modes to serve as a basis for judging the similarity between the tracks, wherein the smaller the distance value is, the higher the similarity is;
233) The target track motion pattern with the shortest distance is found, the shortest distance is denoted $d_{min}$ and compared with a set threshold $\alpha$; if $d_{min} \le \alpha$, the target track is attributed to that target track motion pattern and to the corresponding track motion pattern group;
234) If $d_{min} > \alpha$, the target track is attributed to the abnormal track motion pattern.
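Steps 231)-234) amount to a nearest-pattern lookup with a rejection threshold; a sketch reusing track_distance from the earlier sketch, where -1 stands for the abnormal track motion pattern:

```python
import numpy as np

def match_pattern(track_vec, rep_tracks, alpha):
    # Distance to every pattern's representative track; d_min decides the group.
    dists = [track_distance(track_vec, rep) for rep in rep_tracks]
    k = int(np.argmin(dists))
    return k if dists[k] <= alpha else -1  # -1: the abnormal track motion pattern
```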
24) For target motion tracks belonging to the same track motion pattern group, video concentration is performed with a target track rearrangement method based on dynamic-search-space local optimization; for target motion tracks belonging to the abnormal track motion pattern group, video concentration with target track rearrangement is performed directly according to the energy loss function;
the specific process of the step 24) is as follows:
241) An energy loss function is defined, whose loss terms comprise a condensed-length loss term, a pseudo-collision loss term and a time-order disorder loss term; the energy loss function has the form:
$$E = \sum_{i \in P_g} \Big( \alpha_r E_r(l_i) + \sum_{j \in P_c} \big[ \alpha_c E_c(l_i, l_j) + \alpha_t E_t(l_i, l_j) \big] \Big)$$
where $l_i$ denotes the start position of track $i$ in the condensed video, $P_g$ denotes all tracks belonging to the same motion pattern group as track $i$, and $P_c$ denotes all tracks belonging to the same motion pattern as track $i$; $E_r$ is the condensed-video length loss term, $E_c$ the condensed-video pseudo-collision loss term, and $E_t$ the condensed-video track time-order disorder loss term; $\alpha_r$, $\alpha_c$ and $\alpha_t$ are the weights of the respective loss terms, balancing their effects;
242) The range of target track rearrangement positions is determined with the dynamic search space method. The start position $l_i$ of target track $O_i$ in the condensed video is a dynamically changing value; its minimum is denoted $l_{min}$ and its maximum $l_{max}$. The minimum is the maximum of the start positions of all tracks already arranged in the group, and the maximum is the maximum of the end positions of all tracks already arranged in the group:
$$l_{min} = \max_{k \in P_c} \, l_k, \qquad l_{max} = \max_{k \in P_c} \big( l_k + len(O_k) \big)$$
where $P_c$ denotes the tracks of the same motion pattern as track $i$ that have already been arranged, and $len(O_k)$ is the frame length of the $k$-th track;
243) Within the start-position search space and under the energy loss function, the target tracks in the group are condensed and rearranged with a locally optimized greedy algorithm.
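Steps 242)-243) can be sketched as a greedy placement over the dynamic search space; here `energy` stands in for the loss function of step 241) and is an assumption of this sketch, as is the list-of-lengths input.

```python
def rearrange_group(track_lengths, energy):
    # track_lengths[i] = len(O_i) in frames; energy(l, placed) evaluates the
    # loss of starting the next track at frame l against the (start, length)
    # pairs already arranged in the group -- both names are illustrative.
    placed, starts = [], []
    for length in track_lengths:
        if placed:
            l_min = max(s for s, _ in placed)      # max start position so far
            l_max = max(s + n for s, n in placed)  # max end position so far
        else:
            l_min = l_max = 0
        best = min(range(l_min, l_max + 1), key=lambda l: energy(l, placed))
        placed.append((best, length))
        starts.append(best)
    return starts
```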
25) For each video concentration group, fuse the rearranged target tracks (target tubes) with the extracted video background picture frame by frame using a Poisson fusion algorithm, and merge the fused consecutive video frames to generate the condensed video.
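Step 25)'s per-frame Poisson fusion is available in OpenCV as seamlessClone; in this sketch, patch, patch_mask and center (the paste position of the target tube in the current frame) are assumed inputs.

```python
import cv2

# One target of one condensed frame; repeat per target and per frame.
fused = cv2.seamlessClone(patch, background, patch_mask, center, cv2.NORMAL_CLONE)
```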

Claims (8)

1. A video concentration method based on a target track motion mode is characterized by comprising the following specific processes:
1) Generation and grouping of target track motion patterns: inputting historical surveillance video data, constructing a model to train on the video data, and generating target track motion patterns and groups;
2) On-line video concentration based on the track motion patterns: inputting a surveillance video stream, extracting video moving-target tracks, and performing on-line video concentration with the track motion patterns and grouping results of step 1) to generate a condensed video.
2. The method for concentrating video based on target track motion pattern according to claim 1, wherein the specific process of step 1) is as follows:
11) Inputting offline surveillance video data, performing target detection, instance segmentation and target tracking with deep learning models, obtaining the instance mask sequence of each target across video frames (the instance mask sequence being the target track), and completing the extraction of all moving-target tracks in the video;
12 On the basis of extracting the target tracks, clustering all the target tracks by adopting a clustering algorithm, clustering the target tracks into different classes, and generating a target track motion mode;
13 On the basis of target track motion mode generation, according to the principle of less collision and direction consistency, the target track motion modes are divided to generate track motion mode groups.
3. The video concentration method based on a target track motion pattern according to claim 1, wherein the specific process of step 2) is as follows:
21) Inputting the surveillance video stream, performing background modeling with a Gaussian mixture model, extracting the background picture of the surveillance video, and updating the background at a set time interval;
22) Inputting the surveillance video stream, subtracting the background from the current frame, and obtaining the foreground mask through dilation and erosion operations; then tracking the targets with Kalman filtering and the Hungarian matching algorithm to extract the target motion tracks;
23 Matching the extracted target motion track with the target track motion mode, and automatically attributing the target track to the corresponding track motion mode group if the matching is successful; if the matching fails, the target track is assigned to an abnormal track motion mode group;
24 For the target motion tracks belonging to the same track motion mode grouping, carrying out video concentration processing by adopting a target track rearrangement method based on dynamic search space local optimization; defining an energy loss function for target motion tracks belonging to abnormal track motion mode groups, and performing video concentration processing of target track rearrangement;
25) For each video concentration group, performing image fusion processing frame by frame on the rearranged target tracks (target tubes) and the extracted video background picture with a Poisson fusion algorithm, and merging the fused consecutive video frames to generate a condensed video.
4. The method as claimed in claim 2, wherein the clustering algorithm is used to cluster all target tracks, and the target tracks are clustered into different classes to generate the target track motion pattern, wherein the step 12) comprises the following specific steps:
121 The target track is expressed by adopting an equal-spacing nearest adjacent sampling point algorithm, the target track is expressed as a coordinate vector, and the length of the coordinate vector is the number of sampling points;
122 Determine the similarity criterion of the target track, and take the euclidean distance between the coordinate vectors of the tracks as the distance between the two tracks, wherein the smaller the distance value is, the higher the similarity is.
123 Calculating and generating a similarity measurement matrix of the distances between the tracks, setting the radius gamma of adjacent points and the threshold omega of the minimum number of core points by adopting a DBSCAN clustering algorithm, and clustering the target tracks, wherein each cluster is determined as a target track motion mode;
124 For each target trajectory motion pattern, a representative trajectory is generated. And respectively calculating the average value of the abscissa x and the average value of the ordinate y of the coordinate vectors of all target tracks belonging to the same cluster according to the sampling points to form the coordinate vector representing the track.
5. The method as in claim 2 wherein the step 13) comprises the following steps:
131) Counting the sizes (height and width) of all target detection boxes of the current motion pattern group in the video, and computing the average height and width of the detection boxes;
132) Superimposing the average detection-box height and width onto the representative track so that the representative track carries one detection box in each frame, and then computing the collision value between two representative tracks by summing the numbers of intersection pixels over all pairs of detection boxes from the two tracks;
133) Dividing 360° into θ equal parts in a rectangular coordinate system, representing θ direction classes, taking the direction of the line from the start point to the end point of a representative track as the track's direction, classifying the representative track directions, and judging the relation between different representative track directions by whether they belong to the same direction class;
134) Grouping the representative tracks according to their collision and direction relations, each group being one target track motion pattern group; a collision threshold is set so that the representative tracks within a group have no or few collisions; meanwhile, the direction relation is considered so that motion patterns forming a circular flow are preferentially placed in one group, keeping the subsequent in-group condensed video fluent.
6. The method according to claim 3, wherein the extracted target motion track is matched with the target track motion patterns, and if the matching succeeds, the target track is automatically attributed to the corresponding track motion pattern group; the specific process of step 23) is as follows:
231 The extracted target motion track is expressed by adopting an equidistant nearest adjacent sampling point representation method, and the target motion track is expressed as a coordinate vector;
232 Calculating the distances of all sampling points between the target track and the representative tracks of all motion modes, and taking the distances as the basis for judging the similarity between the tracks, wherein the smaller the distance value is, the higher the similarity is;
233) Finding the target track motion pattern with the shortest distance, denoting the shortest distance $d_{min}$, and comparing it with a set threshold $\alpha$; if $d_{min} \le \alpha$, the target track is attributed to that target track motion pattern and to the corresponding track motion pattern group;
234) If $d_{min} > \alpha$, the target track is attributed to the abnormal track motion pattern.
7. The method according to claim 3, wherein for target motion tracks belonging to the same track motion pattern group, video concentration is performed with the target track rearrangement method based on dynamic-search-space local optimization; the specific process of step 24) is as follows:
241) Defining an energy loss function, whose loss terms comprise a condensed-length loss term, a pseudo-collision loss term and a time-order disorder loss term; the energy loss function has the form:
$$E = \sum_{i \in P_g} \Big( \alpha_r E_r(l_i) + \sum_{j \in P_c} \big[ \alpha_c E_c(l_i, l_j) + \alpha_t E_t(l_i, l_j) \big] \Big)$$
where $l_i$ denotes the start position of track $i$ in the condensed video, $P_g$ denotes all tracks belonging to the same motion pattern group as track $i$, and $P_c$ denotes all tracks belonging to the same motion pattern as track $i$; $E_r$ is the condensed-video length loss term, $E_c$ the condensed-video pseudo-collision loss term, and $E_t$ the condensed-video track time-order disorder loss term; $\alpha_r$, $\alpha_c$ and $\alpha_t$ are the weights of the respective loss terms, balancing their effects;
242) Determining the range of target track rearrangement positions with the dynamic search space method. The start position $l_i$ of target track $O_i$ in the condensed video is a dynamically changing value; its minimum is denoted $l_{min}$ and its maximum $l_{max}$. The minimum is the maximum of the start positions of all tracks already arranged in the group, and the maximum is the maximum of the end positions of all tracks already arranged in the group:
$$l_{min} = \max_{k \in P_c} \, l_k, \qquad l_{max} = \max_{k \in P_c} \big( l_k + len(O_k) \big)$$
where $P_c$ denotes the tracks of the same motion pattern as track $i$ that have already been arranged, and $len(O_k)$ is the frame length of the $k$-th track;
243) Within the start-position search space and under the energy loss function, condensing and rearranging the target tracks in the group with a locally optimized greedy algorithm.
8. The method as claimed in claim 4, wherein the step 121) comprises the following steps:
1211 Mark the coordinates of the starting point and the end point of the target track, and draw a straight line from the starting point to the end point;
1212) Dividing the straight line into $2^{\omega}$ equal parts, where the fold parameter $\omega$ represents the number of halvings and typically has the value range $\omega \in \{3, 4, 5, 6\}$, the line then having $2^{\omega}-1$ division points;
1213) Taking each division point on the straight line as the center of a circle, finding the point on the target track closest to that division point, and taking it as a sampling point of the target track;
1214) The start and end points of the target track, together with the $(2^{\omega}-1)$ sampling points, form the coordinate vector representing the target track, of the form
$$T = \left[(x_1, y_1), (x_2, y_2), \dots, (x_{2^{\omega}+1}, y_{2^{\omega}+1})\right]$$
CN202211322752.0A 2022-10-27 2022-10-27 Video concentration method based on target track motion mode Pending CN115695949A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211322752.0A CN115695949A (en) 2022-10-27 2022-10-27 Video concentration method based on target track motion mode


Publications (1)

Publication Number Publication Date
CN115695949A 2023-02-03

Family

ID=85098422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211322752.0A Pending CN115695949A (en) 2022-10-27 2022-10-27 Video concentration method based on target track motion mode

Country Status (1)

Country Link
CN (1) CN115695949A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116074642A (en) * 2023-03-28 2023-05-05 石家庄铁道大学 Monitoring video concentration method based on multi-target processing unit



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination