CN111401143A - Pedestrian tracking system and method - Google Patents

Pedestrian tracking system and method

Info

Publication number
CN111401143A
CN111401143A
Authority
CN
China
Prior art keywords
target
frame
tracking
target object
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010118386.1A
Other languages
Chinese (zh)
Inventor
谢英红
李路
韩晓微
涂斌斌
李华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang University
Original Assignee
Shenyang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang University filed Critical Shenyang University
Priority to CN202010118386.1A priority Critical patent/CN111401143A/en
Publication of CN111401143A publication Critical patent/CN111401143A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian tracking system and method, relating to the technical field of computer vision. A target frame containing a target object is determined in the first of a plurality of video frames. For each subsequent frame, a current target frame containing the target object is determined in the current frame from the previously determined target frame; the current target frame is input into a pre-trained VGG-16 network to obtain a candidate feature map of the target frame in the image; the candidate feature map is input into a pre-trained RPN network to obtain a plurality of target candidate regions; the features of the target candidate regions are pooled with a plurality of convolution kernels of different sizes to obtain a plurality of regions of interest for the target object; a fully connected operation is performed on the features of the regions of interest to distinguish the target from the background and obtain a plurality of tracking affine frames of the target object; and non-maximum suppression is performed on the tracking affine frames to obtain the tracking result for the target object in the current frame.

Description

Pedestrian tracking system and method
Technical Field
The invention relates to the technical field of computer vision, in particular to a pedestrian tracking system and a pedestrian tracking method.
Background
Pedestrian tracking recognizes and tracks pedestrian targets in video and images by means of computer vision techniques. Pedestrian recognition and tracking is treated as a key research topic by many countries because the technology is advanced and widely applicable: in national defense it can serve battlefield reconnaissance, target tracking, and precision guidance; in urban traffic it can serve intelligent transportation, violation detection, and autonomous driving; and in public security it can serve crowd-flow monitoring.
Many pedestrian tracking methods and apparatus are disclosed in the prior art. Although these systems and methods employ many popular neural network techniques, none provides a dedicated solution for accurately locating a deformed target.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a pedestrian tracking system and a pedestrian tracking method.
The technical scheme adopted by the invention is as follows:
in one aspect, the present invention provides a pedestrian tracking system comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is to execute the executable instructions to determine that a first frame of a plurality of video frames includes a target box of a target object; for a subsequent frame except the first frame in the plurality of video frames, determining a current target frame including a target object in the current frame according to the determined target frame; inputting the current target frame into a pre-trained VGG-16 network, and acquiring a candidate feature map of the target frame in the image; inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions; performing pooling operation on the characteristics of the target candidate regions through a plurality of convolution kernels with different sizes to obtain a plurality of interested regions aiming at the target object; performing full-link operation on the characteristics of the multiple interested areas, distinguishing a target from a background, and obtaining multiple tracking affine frames of a target object; and performing non-maximum suppression on the plurality of tracking affine frames to obtain a tracking result of the target object of the current frame.
On the other hand, the invention also provides a pedestrian tracking method, which is realized by adopting the pedestrian tracking system and comprises the following steps:
step 1: determining that a first frame of the plurality of video frames includes a target frame of a target object;
step 2: for the subsequent frames except the first frame, determining a current target frame including the target object in the current frame according to the determined target frame;
and step 3: and adjusting the determined target frame into a fixed size, inputting the fixed size into a pre-trained VGG-16 network, acquiring a candidate feature map of the target frame in the current frame, and designing a loss function.
Step 4: inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions;
the target candidate region is a region in which a plurality of shapes and positions of a target object in the current frame exist simultaneously.
Step 5: performing a pooling operation on the features of the target candidate regions with a plurality of convolution kernels of different sizes to obtain a plurality of regions of interest for the target object;
the plurality of convolution kernels of different sizes includes three convolution kernels for roughly describing different deformations of the target object.
Step 6: performing a fully connected operation on the features of the multiple regions of interest and distinguishing the target from the background to obtain multiple tracking affine frames of the target object, then comparing the tracking affine frames with a reference target frame to obtain the affine tracking frame with the largest overlapping area;
and 7: performing non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object of the current frame;
and 8: and (3) judging whether the number of the next frame of the current image is less than the total frame number of the video, if not, directly finishing, if so, returning to the step (2), and tracking the next frame of the image until all the frames of the video are tracked.
The beneficial effects of the above technical solution are as follows:
the method and the device utilize the affine transformation parameter information of the previous frame image to cut the current target image, reduce the search range and improve the algorithm efficiency. In addition, during the pooling operation, convolution kernels with different sizes and shapes are applied to preliminarily simulate the deformation of the target, and the target position can be accurately extracted.
Drawings
FIG. 1 is a block diagram of an implementation of an embodiment of the invention using a computer architecture.
Fig. 2 is a flowchart of a pedestrian tracking algorithm according to an embodiment of the present invention.
FIG. 3 is a schematic block diagram of a process flow of an embodiment of the present invention.
Fig. 4 is a comparison graph of the effects of the horizontal NMS and the affine transformation NMS of the embodiment of the present invention.
FIG. 5 is a graph of the tracking results of the embodiment of the present invention.
Fig. 6 shows a network structure of VGG-16 according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
In one aspect, the present invention provides a pedestrian tracking system comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is to execute the executable instructions to determine that a first frame of a plurality of video frames includes a target box of a target object; for a subsequent frame except the first frame in the plurality of video frames, determining a current target frame including a target object in the current frame according to the determined target frame; inputting the current target frame into a pre-trained VGG-16 network, and acquiring a candidate feature map of the target frame in the image; inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions; performing pooling operation on the characteristics of the target candidate regions through a plurality of convolution kernels with different sizes to obtain a plurality of interested regions aiming at the target object; performing full-link operation on the characteristics of the multiple interested areas, distinguishing a target from a background, and obtaining multiple tracking affine frames of a target object; and performing non-maximum suppression on the plurality of tracking affine frames to obtain a tracking result of the target object of the current frame.
As shown in fig. 1, a schematic diagram of an electronic system 600 suitable for implementing embodiments of the present disclosure is shown. The electronic system in fig. 1 is only an example and places no limitation on the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 1, electronic system 600 may include a processing device (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following may be connected to the I/O interface 605: input devices 606 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; output devices 607 such as a liquid crystal display (LCD), speaker, or vibrator; storage devices 608 such as magnetic tape or hard disk; and communication devices 609, which allow the electronic system 600 to communicate with other devices, wirelessly or by wire, to exchange data.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer-readable medium may be embodied in the electronic system (also referred to herein as an "affine multi-task regression-based pedestrian tracking system"), or it may exist separately without being assembled into the electronic system. The computer-readable medium carries one or more programs which, when executed by the electronic system, cause the electronic system to: 1) determine that a previous frame of the plurality of video frames comprises a target frame of a target object; 2) determine a current target frame including the target object in the current frame according to the determined target frame; 3) input the current target frame into a pre-trained first neural network and acquire a candidate feature map of the target frame in the current frame; 4) input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; 5) pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; 6) perform a fully connected operation on the features of the regions of interest to distinguish the target from the background and obtain a plurality of tracking affine frames of the target object; and 7) perform non-maximum suppression on the tracking affine frames to obtain the tracking result for the target object in the current frame.
On the other hand, the invention also provides a pedestrian tracking method based on affine multitask regression, as shown in fig. 2, implemented with the affine multitask regression-based pedestrian tracking system and comprising the following steps:
step 1: determining that a first frame of the plurality of video frames includes a target frame of a target object;
the method comprises the steps of initializing the size of an original image, setting the size of the original image to be m × n (unit: pixel), manually marking the position of a target frame of the frame when t =1, marking the central position of the target frame as (cx, cy), wherein t represents the image of the t-th frame, t is a positive integer, cx and cy are the horizontal and vertical coordinates of the central position of the target frame respectively, and the target frame comprises an object to be tracked, such as the object 301 in FIG. 3.
Initializing the affine transformation parameters: U1 = [r1, r2, r3, r4, r5, r6]^T.
step 2: determining a current target frame including the target object in the current frame according to the determined target frame;
for example, assuming that two side lengths of a circumscribed rectangle of the target frame in the t-1 frame are marked as a, b, on the t-1 frame image, a picture of size (2 a) × (2 b), such as a rectangular frame marked as 302 in fig. 3, is cut out centering on the target center point (cx, cy) of the t-1 frame, in the present application, the purpose of centering on the center point of the target of the previous frame is to make the cut-out picture include target information, because the coordinates of the center point of the target of two adjacent frames do not change much, and as long as the coordinates of the center point of the target of two adjacent frames do not change much, the target to be tracked can be included in the cut-out sub-picture as long as the target of a sufficient size is cut out at a position near the target center point.
Step 3: adjusting the determined target frame to a fixed size, inputting it into a pre-trained VGG-16 network, acquiring a candidate feature map of the target frame in the current frame, and designing a loss function.
The cropped target frame is resized to a fixed size and fed into a pre-trained neural network, for example the VGG-16 network, and the feature map output after the fifth convolution stage of the network is taken as the candidate feature map of the target frame in the image, as indicated by reference numeral 303 in fig. 3.
In this embodiment, accuracy and running efficiency are balanced by implementing the various embodiments of the present application with the classic VGG-16 structure. Fig. 6 shows an exemplary VGG-16 network, comprising 13 convolutional layers (201) and 3 fully connected layers (203). The convolutional layers use 3 × 3 filters with stride 1. Assuming the network input has size m × n × 3 (m and n positive integers), the input matrix is padded with one ring of zeros, so its first two dimensions become (m + 2) × (n + 2); after the 3 × 3 convolution, the first two dimensions of the feature matrix are again m × n, the same as the input. Down-sampling is performed by 2 × 2 max-pooling layers (202) with stride 2, which halve the spatial dimensions, and the convolution and pooling stages are repeated with increasing numbers of filters (e.g., up to 256 and beyond). Each convolution result is passed through a ReLU activation function, and the activated feature maps of the final stage are fed into the fully connected layers, whose output constitutes the feature representation used by the subsequent modules.
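As a sketch of this feature-extraction step, the following uses torchvision's pretrained VGG-16 in place of the patent's own trained network (an assumption), with 224 × 224 as one plausible choice of the "fixed size"; conv5_feature_map is an illustrative name:

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# The `features` module holds the 13 convolutional layers with their
# ReLU activations and max-pooling stages (201/202 above).
vgg = models.vgg16(pretrained=True).features.eval()

def conv5_feature_map(patch, size=(224, 224)):
    """Resize the cropped patch (a PIL image) to a fixed size and return
    the feature map after the last convolution stage."""
    x = TF.to_tensor(TF.resize(patch, size)).unsqueeze(0)
    with torch.no_grad():
        return vgg(x)  # shape (1, 512, 7, 7) for a 224 x 224 input
```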
After the VGG-16 network is constructed, it is trained: training samples are fed into the network and the outputs are compared with standard (ground-truth) data on a test data set, and training continues until a predetermined performance threshold (e.g., 98%) is reached on that comparison.
In addition, the loss must first be calculated and regressed to optimize the affine transformation parameters. The loss function for the whole VGG-16 network described above is designed as:

L(p, tc, v, v*, u, u*) = Lc(p, tc) + α1 · tc · Lr(v, v*) + α2 · tc · La(u, u*)  (1)

wherein α1 and α2 are learning rates (weighting coefficients); p is the predicted class probability, and Lc(p, tc) is the logarithmic loss for class tc, shown in equation (2):

Lc(p, tc) = -log p_tc  (2)

i denotes the index of the regression box whose loss is being calculated; tc is the class label, for example tc = 1 represents the target and tc = 0 represents the background; the subscripts x, y, w, and h denote the abscissa, ordinate, width, and height, respectively.

The parameters vi = (vx, vy, vw, vh) form the ground-truth rectangular bounding-box tuple, comprising the center abscissa, center ordinate, width, and height; vi* = (vx*, vy*, vw*, vh*) is the predicted target-frame tuple with the same components. ui = (r1, r2, r3, r4, r5, r6) is the affine parameter tuple of the real target region, i.e., the values of the six components of its fixed affine transformation structure; ui* = (r1*, r2*, r3*, r4*, r5*, r6*) is the corresponding tuple predicted for the target region. La denotes the affine bounding-box parameter loss function and Lr the rectangular bounding-box parameter loss function.

Letting (w, w*) represent either (vi, vi*) or (ui, ui*), both losses are defined as:

L(w, w*) = Σj smoothL1(wj − wj*)  (3)

smoothL1(x) = 0.5 x², if |x| < 1; |x| − 0.5, otherwise  (4)

wherein x is a real number.
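A sketch of equations (1)-(4) in PyTorch; the tensor shapes and the default weights a1 = a2 = 1 are assumptions for illustration:

```python
import torch

def smooth_l1(x):
    """Equation (4): 0.5*x^2 where |x| < 1, |x| - 0.5 otherwise."""
    ax = x.abs()
    return torch.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(p, tc, v_pred, v_true, u_pred, u_true, a1=1.0, a2=1.0):
    """Equation (1): class log loss plus smooth-L1 regression over the
    rectangle tuple v and the affine tuple u; the regression terms only
    contribute for target samples (tc == 1)."""
    l_cls = -torch.log(p[tc])                       # eq. (2)
    l_rect = smooth_l1(v_pred - v_true).sum()       # eq. (3), w = v
    l_aff = smooth_l1(u_pred - u_true).sum()        # eq. (3), w = u
    return l_cls + tc * (a1 * l_rect + a2 * l_aff)  # eq. (1)
```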
Step 4: inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions;
the target candidate region is a region in which a plurality of shapes and positions of a target object in the current frame exist simultaneously.
The feature map obtained from the neural network is input into an RPN (Region Proposal Network), and a large number of candidate regions for the target, for example 2000, are extracted, as indicated by reference numeral 304 in fig. 3. Unlike the VGG-16 network, the RPN is a network that generates many candidate regions of different sizes. The candidate regions cover the various shapes and positions at which the target may exist in the current frame. The method thus estimates in advance a set of regions where the target may lie and then performs optimizing regression on them, so that more accurate tracking regions can be screened out.
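The patent does not give the RPN's internals; as a rough illustration of what a region proposal stage does, the sketch below lays an anchor grid over the feature map (the stride, scales, and ratios are assumed values), which the trained RPN would then score and refine into the roughly 2000 candidate regions:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """One (cx, cy, w, h) anchor per scale/ratio at every feature cell."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    anchors.append((cx, cy, s * r ** 0.5, s / r ** 0.5))
    return np.array(anchors)
```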
Step 5: performing a pooling operation on the features of the target candidate regions with a plurality of convolution kernels of different sizes to obtain a plurality of regions of interest for the target object;
the plurality of convolution kernels of different sizes includes three convolution kernels for roughly describing different deformations of the target object.
The features of these candidate regions of different sizes are pooled to obtain a plurality of regions of interest (ROIs) for the target object, for example as shown by reference numeral 305 in fig. 3. To account for target deformation, several convolution kernels of different sizes are designed in the pooling layer, for example three kernels of 7 × 7, 5 × 9, and 9 × 5. Multiple differently shaped pooling kernels can roughly describe the deformation of the target: 7 × 7 and 5 × 9 can describe a person standing under different cameras, while 9 × 5 can describe a bending action, and so on. Of course, pooling kernels of other sizes can be designed for different application scenarios.
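One plausible reading of the three differently shaped pooling kernels is pooling each candidate region's features to the three output grids 7 × 7, 5 × 9, and 9 × 5; a sketch with adaptive max-pooling (the function name and this reading are assumptions):

```python
import torch.nn.functional as F

POOL_SHAPES = [(7, 7), (5, 9), (9, 5)]  # upright / tall-narrow / bent postures

def multi_shape_pool(roi_feat):
    """Pool one candidate region's feature map (C x H x W) with the three
    differently shaped kernels, one output per deformation mode."""
    return [F.adaptive_max_pool2d(roi_feat, shape) for shape in POOL_SHAPES]
```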
Step 6: performing a fully connected operation on the features of the multiple regions of interest and distinguishing the target from the background to obtain multiple tracking affine frames of the target object, then comparing the tracking affine frames with a reference target frame to obtain the affine tracking frame with the largest overlapping area;
the result of the pooling, i.e. the features of the multiple regions of interest (ROIs), is subjected to a full join operation. Here, the full linking operation is to concatenate a plurality of ROI features in sequence. Such as that indicated by reference numeral 306 in fig. 3. Then, the series-connected features are subjected to score comparison by using a softmax function, and the score of the target/background result of the compared target area is obtained. For example, a region with a score greater than a certain threshold is determined as a target region, otherwise, the region is a background region.
Step 7: performing non-maximum suppression on the tracking affine frames to obtain the tracking result for the target object in the current frame;
the obtained affine area determined as the target area is subjected to non-local maximum suppression (for example, as indicated by reference numeral 308 in fig. 3), and a tracking result of the t-th frame image, that is, a corresponding affine parameter and a frame are obtained. Such as that shown at reference numeral 309 in fig. 3. In one embodiment, the multiple tracked affine frames may be compared with a reference target frame (i.e., a target frame tracked in a previous frame), and an affine tracking frame with a largest overlapping area is obtained as a final tracking result.
Affine transformation is used here to represent the geometric deformation of the target. The affine transformation parameters of the tracking result for the target region of frame t are written U_t, with the structure U_t = [r1, r2, r3, r4, r5, r6]^T. The corresponding affine transformation matrix M(U_t) has a Lie group structure: ga(2) is the Lie algebra corresponding to the affine Lie group GA(2), and the matrices Gj (j = 1, …, 6) are the generators of GA(2) and form a basis of ga(2). The generators of GA(2) are the six 3 × 3 matrices each having a single entry 1 in one of the six free positions of the top two rows and 0 elsewhere, so that:

M(U_t) = exp(Σj rj · Gj)  (5)

For Lie group matrices, the Riemannian distance is defined through the matrix logarithm:

ρ(X, Y) = ||log(X⁻¹ Y)||  (6)

where X and Y are elements of the Lie group. Given N symmetric positive definite matrices X_q (q = 1, …, N), their intrinsic mean is defined as:

μ = exp((1/N) Σ_{q=1..N} log X_q)  (7)
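A sketch of equations (5)-(7) with SciPy; mapping the six parameters onto the top two rows of a 3 × 3 matrix reflects the elementary-generator basis above, and the closed-form log-Euclidean mean is one common realization of the intrinsic mean (both are assumptions, since the patent does not spell out the computation):

```python
import numpy as np
from scipy.linalg import expm, logm

def affine_matrix(u):
    """Equation (5): exponential map of sum_j r_j * G_j for U = [r1..r6]."""
    A = np.zeros((3, 3))
    A[:2, :] = np.asarray(u, dtype=float).reshape(2, 3)  # six free entries
    return expm(A)

def riemann_distance(X, Y):
    """Equation (6): norm of the matrix logarithm of X^-1 Y."""
    return np.linalg.norm(logm(np.linalg.inv(X) @ Y), 'fro')

def intrinsic_mean(mats):
    """Equation (7): exp of the average matrix logarithm of the X_q."""
    return expm(sum(logm(X) for X in mats) / len(mats))
```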
and carrying out non-maximum suppression on the tracking affine frames to obtain a tracking result of the t frame image. A plurality of different target areas can be obtained through regression, and in order to obtain a detection algorithm with the highest accuracy correctly, an affine transformation non-maximum suppression method is adopted to screen out the final tracking result. In addition, the loss function is designed, the affine deformation of the target is taken into consideration, and the accuracy of predicting the position of the target is improved.
In current object detection methods, non-maximum suppression (NMS) is widely used to post-process detection candidates. The present method can estimate both axis-aligned bounding boxes and tilted bounding boxes, applying normal NMS to the former and tilted NMS to the affine-transformed bounding boxes. In affine-transformation non-maximum suppression, the conventional intersection-over-union (IoU) computation is modified to the IoU between the two affine bounding boxes. The effect of the algorithm is shown in fig. 4: the frames numbered 401 are candidate frames before non-maximum suppression, the frame numbered 402 is the frame obtained after normal NMS, and the frame numbered 403 is the frame obtained by the affine-transformation non-maximum suppression of the present application. The tracking frame obtained by the present method is visibly more accurate.
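A sketch of affine-transformation NMS, computing the IoU between tilted boxes as polygons (shapely is an assumed dependency; boxes are given as 4 × 2 corner arrays):

```python
import numpy as np
from shapely.geometry import Polygon

def affine_iou(corners_a, corners_b):
    """IoU between two (possibly tilted) boxes given by their corners."""
    pa, pb = Polygon(corners_a), Polygon(corners_b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter + 1e-9)

def affine_nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS, identical to the normal version except that the
    overlap test uses the polygon IoU above."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if affine_iou(boxes[i], boxes[j]) < iou_thr]
    return keep
```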
Step 8: determine whether the index t + 1 is less than the total number of frames in the video; if so, return to step 2 to track the (t+1)-th frame image. The algorithm ends once all video frames have been tracked. Some of the resulting tracking frames are shown as the black frames indicated by arrows 501, 502, 503, and 504 in fig. 5.
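Putting steps 1-8 together, a skeleton of the frame loop (run_network_and_nms is a hypothetical helper standing in for steps 4-7, the target state is assumed to be stored as (cx, cy, a, b), and conversion between array and PIL image is omitted):

```python
def track_video(frames, init_state):
    """Step 1: manual target frame on frame 0; then loop over frames."""
    states = [init_state]                     # (cx, cy, a, b) per frame
    for t in range(1, len(frames)):           # step 8: stop at the last frame
        cx, cy, a, b = states[-1]
        patch = crop_search_region(frames[t], cx, cy, a, b)  # step 2
        feats = conv5_feature_map(patch)                     # step 3
        states.append(run_network_and_nms(feats))            # steps 4-7
    return states
```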
According to the method and device, the current target image is cropped using the affine transformation parameters of the previous frame, which narrows the search range and improves algorithm efficiency. In addition, the cropped image is fed into the VGG-16 network to compute features before being passed to the RPN network, avoiding repeated feature extraction and further improving efficiency. During the pooling operation, convolution kernels of different sizes and shapes preliminarily model the deformation of the target, so the target position can be extracted accurately. In the present application, the features output by the highest network layer serve as the semantic model and the affine transformation results serve as the spatial model; the two are complementary, because the highest-layer features contain more semantic information but less spatial information. Furthermore, the multi-task loss function above, which includes affine transformation parameter regression, optimizes network performance.
In the above pedestrian tracking system, the candidate regions obtained from the RPN network are regions of the various shapes and positions at which the target object may exist in the current frame; furthermore, step 5 pools the features of the multiple target candidate regions with multiple convolution kernels of different sizes to obtain multiple regions of interest for the target object.
The foregoing description covers only the preferred embodiments of the present disclosure and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above features; it also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example solutions in which the above features are interchanged with technical features of similar function disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (4)

1. A pedestrian tracking system, characterized by: comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is to execute the executable instructions to determine that a first frame of a plurality of video frames includes a target box of a target object; for a subsequent frame except the first frame in the plurality of video frames, determining a current target frame including a target object in the current frame according to the determined target frame; inputting the current target frame into a pre-trained VGG-16 network, and acquiring a candidate feature map of the target frame in the image; inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions; performing pooling operation on the characteristics of the target candidate regions through a plurality of convolution kernels with different sizes to obtain a plurality of interested regions aiming at the target object; performing full-link operation on the characteristics of the multiple interested areas, distinguishing a target from a background, and obtaining multiple tracking affine frames of a target object; and performing non-maximum suppression on the plurality of tracking affine frames to obtain a tracking result of the target object of the current frame.
2. A pedestrian tracking method implemented by the pedestrian tracking system of claim 1, comprising the steps of:
step 1: determining that a first frame of the plurality of video frames includes a target frame of a target object;
step 2: for the subsequent frames except the first frame, determining a current target frame including the target object in the current frame according to the determined target frame;
and step 3: adjusting the determined target frame into a fixed size, inputting the fixed size into a pre-trained VGG-16 network, acquiring a candidate feature map of the target frame in the current frame, and designing a loss function;
and 4, step 4: inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions;
and 5: performing pooling operation on the characteristics of the target candidate regions through a plurality of convolution kernels with different sizes to obtain a plurality of interested regions aiming at the target object;
step 6: performing full-link operation on the features of the multiple regions of interest, distinguishing a target from a background, comparing the multiple tracking affine frames with a reference target frame to obtain an affine tracking frame with the largest overlapping area, and thus obtaining multiple tracking affine frames of the target object;
and 7: performing non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object of the current frame;
and 8: and (3) judging whether the number of the next frame of the current image is less than the total frame number of the video, if not, directly finishing, if so, returning to the step (2), and tracking the next frame of the image until all the frames of the video are tracked.
3. The pedestrian tracking method according to claim 2, wherein the target candidate regions in step 4 are regions of the various shapes and positions at which the target object may exist in the current frame.
4. A pedestrian tracking method according to claim 2, wherein said plurality of convolution kernels of different sizes in step 5 comprises three convolution kernels for roughly describing different deformations of said target object.
CN202010118386.1A 2020-02-26 2020-02-26 Pedestrian tracking system and method Pending CN111401143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010118386.1A CN111401143A (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010118386.1A CN111401143A (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method

Publications (1)

Publication Number Publication Date
CN111401143A true CN111401143A (en) 2020-07-10

Family

ID=71430460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010118386.1A Pending CN111401143A (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method

Country Status (1)

Country Link
CN (1) CN111401143A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111754541A (en) * 2020-07-29 2020-10-09 腾讯科技(深圳)有限公司 Target tracking method, device, equipment and readable storage medium
CN111915647A (en) * 2020-07-16 2020-11-10 郑州轻工业大学 Object label guided self-adaptive video target tracking method
CN114253253A (en) * 2020-09-24 2022-03-29 科沃斯商用机器人有限公司 Target identification method and device based on artificial intelligence and robot

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915647A (en) * 2020-07-16 2020-11-10 郑州轻工业大学 Object label guided self-adaptive video target tracking method
CN111915647B (en) * 2020-07-16 2021-08-13 郑州轻工业大学 Object label guided self-adaptive video target tracking method
CN111754541A (en) * 2020-07-29 2020-10-09 腾讯科技(深圳)有限公司 Target tracking method, device, equipment and readable storage medium
CN111754541B (en) * 2020-07-29 2023-09-19 腾讯科技(深圳)有限公司 Target tracking method, device, equipment and readable storage medium
CN114253253A (en) * 2020-09-24 2022-03-29 科沃斯商用机器人有限公司 Target identification method and device based on artificial intelligence and robot

Similar Documents

Publication Publication Date Title
US11475660B2 (en) Method and system for facilitating recognition of vehicle parts based on a neural network
US10755120B2 (en) End-to-end lightweight method and apparatus for license plate recognition
CN111709416B (en) License plate positioning method, device, system and storage medium
CN107529650B (en) Closed loop detection method and device and computer equipment
CN109035304B (en) Target tracking method, medium, computing device and apparatus
CN111401143A (en) Pedestrian tracking system and method
CN113936302B (en) Training method and device for pedestrian re-recognition model, computing equipment and storage medium
CN111401196A (en) Method, computer device and computer readable storage medium for self-adaptive face clustering in limited space
CN111428566B (en) Deformation target tracking system and method
CN113793297A (en) Pose determination method and device, electronic equipment and readable storage medium
CN113537070B (en) Detection method, detection device, electronic equipment and storage medium
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN115546705B (en) Target identification method, terminal device and storage medium
CN115147598A (en) Target detection segmentation method and device, intelligent terminal and storage medium
CN111428567B (en) Pedestrian tracking system and method based on affine multitask regression
US11420623B2 (en) Systems for determining object importance in on-road driving scenarios and methods thereof
CN111160282B (en) Traffic light detection method based on binary Yolov3 network
CN115761646B (en) Pedestrian tracking method, equipment and storage medium for industrial park
CN114120259A (en) Empty parking space identification method and system, computer equipment and storage medium
CN112487927A (en) Indoor scene recognition implementation method and system based on object associated attention
CN111353464B (en) Object detection model training and object detection method and device
CN116958954B (en) License plate recognition method, device and storage medium based on key points and bypass correction
CN115661556B (en) Image processing method and device, electronic equipment and storage medium
CN113838085A (en) Catering tax source monitoring target tracking algorithm and system
CN116612454A (en) Vehicle image processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination