CN111428567B - Pedestrian tracking system and method based on affine multitask regression - Google Patents

Pedestrian tracking system and method based on affine multitask regression

Info

Publication number
CN111428567B
CN111428567B (application CN202010118387.6A)
Authority
CN
China
Prior art keywords
target
frame
affine
tracking
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010118387.6A
Other languages
Chinese (zh)
Other versions
CN111428567A (en)
Inventor
谢英红
韩晓微
刘天惠
涂斌斌
唐璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang University
Original Assignee
Shenyang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang University filed Critical Shenyang University
Priority to CN202010118387.6A
Publication of CN111428567A
Application granted
Publication of CN111428567B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/02 Affine transformations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian tracking system and method based on affine multitask regression, in the technical field of computer vision. The method determines a target frame containing a target object in the previous frame of a sequence of video frames; determines the current target frame containing the target object in the current frame from the previously determined target frame; inputs the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame; inputs the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pools the features of the target candidate regions to obtain a plurality of regions of interest for the target object; applies a fully connected operation to the features of the regions of interest and separates target from background, thereby obtaining a plurality of tracking affine frames of the target object; and performs non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame.

Description

Pedestrian tracking system and method based on affine multitask regression
Technical Field
The invention relates to the technical field of computer vision, in particular to a pedestrian tracking system and method based on affine multitask regression.
Background
Pedestrian tracking identifies and tracks pedestrian targets in video and image sequences by means of computer vision techniques.
The prior patent application CN108629791A provides a pedestrian tracking method and device and a cross-camera pedestrian tracking method and device. The pedestrian tracking method comprises: acquiring a video; performing pedestrian detection on at least some of the video frames to obtain the pedestrian frames in each of those video frames; for each obtained pedestrian frame, processing the image block it contains with a trained convolutional neural network to obtain the feature vector of that pedestrian frame; and matching all pedestrian frames based on their feature vectors to obtain a pedestrian tracking result comprising at least one pedestrian trajectory. The method and device are not limited by position information, have good robustness, achieve accurate and efficient pedestrian tracking, and readily realize pedestrian tracking across cameras.
CN107292908A discloses a pedestrian tracking method based on KLT feature points. HOG features preserve the geometric and photometric properties of the image well under deformation, and under Gamma normalization the pedestrian pose can vary over a wide range while most fine motions do not affect the detection result, so HOG with an SVM classifier is selected for pedestrian detection. The detection result is then tracked with the KLT algorithm; KLT, a further development of the optical-flow method, has good real-time performance, does not easily lose the tracked target, and can track existing targets in real time. Combining a detection algorithm with a tracking algorithm addresses the problems that many current tracking setups have fixed, immovable cameras or cannot track a specific target; it also compensates for the slow detection speed caused by the high computational complexity of HOG and SVM.
CN110414439A discloses an anti-occlusion pedestrian tracking method based on multi-peak detection. Pedestrian detection is first performed to obtain an initial position, and the tracker parameters and pedestrian template are initialized. In each subsequent frame, the position of the feature-fusion response peak is taken as the center of the predicted pedestrian position; the target response peak Fmax, the average peak-to-correlation energy (APCE), and their thresholds are computed, and the joint confidence formed from these peaks is used to detect multiple peaks in the filter response, realizing pedestrian occlusion judgment. Updating of the filter parameters and the pedestrian target template is suspended in occluded frames, achieving anti-occlusion pedestrian tracking. The method adaptively fuses FHOG and ColorName features as descriptors, improving the robustness of the tracker to pedestrian deformation and illumination changes; suspending template and filter updates in occluded frames alleviates tracking-position drift.
CN108509859A discloses a non-overlapping-area pedestrian tracking method based on a deep neural network, comprising the following steps: (1) detecting the current pedestrian target in the surveillance video image with the YOLO algorithm and cropping the pedestrian target picture; (2) tracking and predicting the detection result with a Kalman filter; (3) extracting deep features of the pictures (the candidate pedestrian pictures and the target pedestrian pictures from step (2)) with a convolutional neural network, and storing the candidate pedestrians' pictures and features; (4) computing and ranking the similarity between the features of the target pedestrian and the candidate pedestrians to identify the target pedestrian. The method achieves high detection and tracking precision, which helps improve the pedestrian recognition rate.
However, neither the above methods nor other popular deep-learning networks currently offer a dedicated solution for accurately localizing deformed targets.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a pedestrian tracking system and method based on affine multitask regression. By introducing affine transformations into a deep-learning network, accurate tracking of deformed targets is achieved.
In order to solve the technical problems, the invention adopts the following technical scheme:
in one aspect, the invention provides an affine multitasking regression-based pedestrian tracking system comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is configured to execute the executable instructions to: initialize affine parameters by determining a target frame containing a target object in the previous frame of a plurality of video frames; determine the current target frame containing the target object in the current frame from the determined target frame; input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame; input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; apply a fully connected operation to the features of the regions of interest and separate target from background, thereby obtaining a plurality of tracking affine frames of the target object; and perform non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame, namely the affine parameters, tracking frame, and center-point coordinates.
In another aspect, the invention also provides a pedestrian tracking method based on affine multitask regression, implemented with the above pedestrian tracking system, the method comprising the following steps:
step 1: determining a target frame containing a target object in the first frame of the plurality of video frames;
step 2: determining the current target frame containing the target object in the current frame according to the determined target frame;
step 3: adjusting the determined target frame to a fixed size and inputting it into a pre-trained first neural network, obtaining a candidate feature map of the target frame in the current frame, and designing a loss function comprising an affine bounding-box parameter loss function and a rectangular bounding-box parameter loss function;
the first neural network is a VGG-16 network;
the loss function of the VGG-16 network is expressed as:

$$L(p, tc, u_i, v_i) = L_c(p, tc) + \alpha_1 \sum_i L_{reg}(u_i, u_i^*) + \alpha_2 \sum_i L_{reg}(v_i, v_i^*)$$

where $\alpha_1$ and $\alpha_2$ are learning rates, and $L_c(p, tc) = -\log p_{tc}$ is the log loss for category $tc$;

$i$ denotes the index of the regression box whose loss is being computed;

$tc$ denotes the category label, e.g. $tc = 1$ for the target and $tc = 0$ for the background;

the variables $x, y, w, h$ denote the abscissa, ordinate, width, and height, respectively;

the parameter $v_i = (v_x, v_y, v_w, v_h)$ is the ground-truth rectangular bounding-box tuple, comprising the center-point abscissa, ordinate, width, and height; $v_i^* = (v_x^*, v_y^*, v_w^*, v_h^*)$ is the predicted target-frame tuple, likewise comprising the center-point abscissa, ordinate, width, and height;

$u_i = (r1, r2, r3, r4, r5, r6)$ is the affine parameter tuple of the real target region, $(r1, \ldots, r6)$ being the values of the six components of the fixed affine-transformation structure of the real target region;

$u_i^* = (r1^*, r2^*, r3^*, r4^*, r5^*, r6^*)$ is the affine parameter tuple of the predicted target region, its components being the values of the six components of the fixed affine-transformation structure of the predicted region;

$L_{reg}(u_i, u_i^*)$ denotes the affine bounding-box parameter loss function, and $L_{reg}(v_i, v_i^*)$ denotes the rectangular bounding-box parameter loss function;

letting $(w, w^*)$ denote $(u_i, u_i^*)$ or $(v_i, v_i^*)$, both are defined as:

$$L_{reg}(w, w^*) = \sum_j \mathrm{smooth}_{L1}(w_j - w_j^*), \qquad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

where x is a real number.
Step 4: inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions;
the second neural network is an RPN network.
Step 5: pooling the features of the target candidate regions to obtain a plurality of regions of interest for the target object;
step 6: applying a fully connected operation to the features of the regions of interest and separating target from background, thereby obtaining a plurality of tracking affine frames of the target object;
step 7: performing non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame, namely the affine parameters, tracking frame, and center-point coordinates.
Step 7.1: scoring the features corresponding to each tracking affine frame and comparing the scores to obtain the target/background scores of the compared target regions;
step 7.2: judging a region whose score exceeds a given threshold to be a target region, and otherwise a background region;
step 7.3: performing non-maximum suppression on the features of the determined target regions to obtain the tracking result of the target object in the current frame;
step 8: judging whether the frame number of the next frame is smaller than the total number of video frames; if not, ending directly; if so, returning to step 2 to track the next frame, until all video frames have been tracked.
The beneficial effects of the above technical solution are as follows:
The method crops the current target image using the affine transformation parameter information of the previous frame, which narrows the search range and improves algorithm efficiency. In addition, the cropped image is passed through the VGG-16 network to compute features before being fed to the RPN network, avoiding repeated feature-extraction computation and further improving efficiency. The features output by the highest layer of the network serve as a semantic model, and the affine transformation result serves as a spatial model; the two complement each other, since the highest-layer features contain more semantic information but less spatial information. Furthermore, the multitask loss function described above, which includes affine-transformation parameter regression, optimizes network performance.
Drawings
FIG. 1 is a block diagram of a computer architecture implementing an embodiment of the present invention.
Fig. 2 is a flowchart of the pedestrian tracking algorithm according to an embodiment of the present invention.
Fig. 3 is a schematic block diagram of the processing flow of an embodiment of the present invention.
Fig. 4 compares the effects of horizontal (axis-aligned) NMS and affine-transformation NMS according to an embodiment of the present invention.
FIG. 5 shows tracking results according to an embodiment of the present invention.
Fig. 6 shows the VGG-16 network structure according to an embodiment of the invention.
Detailed Description
The following describes the embodiments of the present invention in detail with reference to the drawings.
In one aspect, the invention provides an affine multitasking regression-based pedestrian tracking system comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is configured to execute the executable instructions to: initialize affine parameters by determining a target frame containing a target object in the previous frame of a plurality of video frames; determine the current target frame containing the target object in the current frame from the determined target frame; input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame; input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; apply a fully connected operation to the features of the regions of interest and separate target from background, thereby obtaining a plurality of tracking affine frames of the target object; and perform non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame.
As shown in fig. 1, a schematic diagram of an electronic system 600 suitable for use in implementing embodiments of the present disclosure is shown. The electronic system shown in fig. 1 is only one example and should not be construed as limiting the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 1, the electronic system 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or loaded from a storage device 608 into a random-access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic system 600. The processing device 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic system 600 to communicate wirelessly or by wire with other devices to exchange data. While fig. 1 shows an electronic system 600 having various devices, it should be understood that not all of the illustrated devices must be implemented or provided; more or fewer devices may be implemented or provided instead. Each block shown in fig. 1 may represent one device or multiple devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 601.
It should be noted that, the computer readable medium according to the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Whereas in embodiments of the present disclosure, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic system described above (also referred to herein as the "pedestrian tracking system based on affine multitask regression"), or may exist separately without being assembled into the electronic system. The computer readable medium carries one or more programs which, when executed by the electronic system, cause it to: 1) determine a target frame containing a target object in the previous frame of a plurality of video frames; 2) determine the current target frame containing the target object in the current frame from the determined target frame; 3) input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame in the current frame; 4) input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; 5) pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; 6) apply a fully connected operation to the features of the regions of interest and separate target from background, thereby obtaining a plurality of tracking affine frames of the target object; and 7) perform non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame.
In another aspect, the invention also provides a pedestrian tracking method based on affine multitask regression, implemented with the pedestrian tracking system described above. As shown in fig. 2, the method comprises the following steps:
step 1: determining a target frame containing a target object in the first frame of the plurality of video frames;
The size of the original image is initialized; let it be m × n (unit: pixels). At t = 1, the position of the target frame in that frame is manually marked, and the center position of the target frame is recorded as (cx, cy). Here t denotes the t-th frame image, t is a positive integer, and cx and cy are the abscissa and ordinate of the center of the target frame; the target frame contains the object to be tracked, e.g. as shown by reference numeral 301 in fig. 3.
The affine transformation parameters are initialized as $U_1 = [r1, r2, r3, r4, r5, r6]^T$.
Step 2: determining a current target frame including the target object in the current frame according to the determined target frame;
in this embodiment, a current target frame including the target object in the current frame is determined according to the determined target frame. Specifically, the input t (t > 2) frame picture is cut, and the target frame of the t frame is determined by taking the center coordinates (cx, cy) of the target frame tracked or identified by the t-1 frame as the center. For example, assume that two sides of the circumscribed rectangle of the target frame in the t-1 frame are noted as: a, b, a picture of size (2 a) × (2 b), for example, a rectangular frame denoted by reference numeral 302 in fig. 3, is cut out on the t-th frame image centering on the target center point (cx, cy) of the t-1 st frame. In the present application, the purpose of centering on the center point of the object of the previous frame is to make the clipped picture contain the object information, because the coordinates of the center points of the objects in two adjacent frames do not change greatly, and as long as the sub-picture is clipped at the position near the center point, the object to be tracked can be contained.
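For illustration, a minimal sketch of this cropping step, assuming NumPy arrays in H x W x C layout (the function name and the clamping to image borders are assumptions, not taken from the patent):

```python
import numpy as np

def crop_search_region(frame: np.ndarray, cx: int, cy: int, a: int, b: int) -> np.ndarray:
    """Crop a (2a) x (2b) search window from `frame` (H x W x C), centered on
    the previous frame's target center (cx, cy). Clamping to the image
    borders is an assumption; the patent does not specify border handling."""
    h, w = frame.shape[:2]
    x0, x1 = max(0, cx - a), min(w, cx + a)
    y0, y1 = max(0, cy - b), min(h, cy + b)
    return frame[y0:y1, x0:x1]
```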
Step 3: adjusting the determined target frame to a fixed size, inputting it into a pre-trained first neural network, obtaining a candidate feature map of the target frame in the current frame, and designing the loss function.
The cropped target frame is resized to a fixed size and fed into a pre-trained neural network, for example the VGG-16 network, and the feature map of the image after the fifth convolution stage of the network is taken, i.e. the candidate feature map of the target frame in the image is obtained, as shown at 303 in fig. 3.
The first neural network is a VGG-16 network; an exemplary VGG-16 structure is shown in fig. 6. As shown there, the network comprises 13 convolutional layers (201) and 3 fully connected layers (203). Specifically, a convolutional layer is first built with 3 × 3 filters and stride 1. Assuming the network input size is m × n × 3 (m and n positive integers), to keep the first two dimensions of the feature matrix after convolution equal to those of the input matrix, i.e. m × n, a ring of zeros is added around the input matrix; its dimensions become (m + 2) × (n + 2), and the 3 × 3 convolution is then applied, so the first two dimensions of the convolved feature matrix remain m × n. The max-pooling layer 202 is then built with a 2 × 2 filter of stride 2. Next, three convolutions are performed with 256 identical filters, followed by pooling, then three more convolutions and another pooling. The activation function used throughout is the standard ReLU function. After several such rounds, the resulting 7 × 7 × 512 feature map is fully connected (fully connected layer 203) into 4096 units, and a softmax activation (activation layer 204) then outputs the result identified among 1000 classes.
After the network is constructed, it is trained on the ImageNet dataset, which is divided into a training set and a test set and covers, for example, 1000 categories. Each datum has a corresponding label vector, each label vector corresponding to a different category, such as target object or background. The present application is not concerned with the specific classification of the input image; the dataset is used simply to train the weights of the VGG-16 network. Specifically, the ImageNet training images are resized to 224 × 224 × 3 and fed into the VGG-16 network to train it, yielding the weight parameters of every layer or unit of the network. Then a predetermined test dataset (which may also be of size 224 × 224 × 3, for example) and the label vectors of the corresponding categories are input into the trained VGG-16 network; the network output is compared with the standard data, and the parameters (weights) of the network are adjusted according to the error. These steps are repeated until the test accuracy reaches a predetermined standard, e.g. above 98%.
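A minimal sketch of the conv5 feature extraction, using torchvision's pretrained VGG-16 as a stand-in for the patent's ImageNet-trained network (truncating before the last max-pool to obtain the "feature map after the fifth convolution stage" is an assumption based on the description above):

```python
import torch
import torchvision

# Pretrained VGG-16; `features` holds the 13 convolutional layers and 5 max-pools.
vgg16 = torchvision.models.vgg16(weights="IMAGENET1K_V1")
# Drop the final max-pool so the output is the conv5 feature map rather than
# the pooled 7 x 7 grid (the truncation point is an assumption).
conv5 = torch.nn.Sequential(*list(vgg16.features.children())[:-1]).eval()

with torch.no_grad():
    crop = torch.randn(1, 3, 224, 224)   # stand-in for the resized target crop
    feat = conv5(crop)                   # (1, 512, 14, 14) candidate feature map
```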
Step 4: inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions;
the feature map obtained from the neural network above is input into the RPN (Region Proposal Network), and a number of target candidate regions are extracted, e.g. 2000 candidate regions, as indicated by reference numeral 304 in fig. 3. Unlike the VGG-16 network, the RPN is a network that generates many candidate regions of different sizes. The candidate regions are regions of various shapes and locations where the target may exist in the current frame. The idea is to estimate in advance a number of regions where the target may be, perform optimizing regression on these regions, and screen out a more accurate tracking region.
The second neural network is an RPN network.
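For illustration, a minimal RPN-head sketch over the conv5 feature map (the channel width and the number k of anchors per location are assumptions; the patent gives neither):

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """A 3x3 conv followed by two 1x1 sibling heads: per-anchor objectness
    scores and per-anchor box regression deltas."""
    def __init__(self, in_channels: int = 512, k: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # target/background per anchor
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # (x, y, w, h) deltas per anchor

    def forward(self, feat):
        h = self.relu(self.conv(feat))
        return self.cls(h), self.reg(h)
```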
Step 5: pooling the features of the target candidate regions to obtain a plurality of regions of interest for the target object;
the features of these differently sized candidate regions are pooled to obtain multiple regions of interest (ROIs) for the target object. Here, in consideration of target deformation, several pooling kernels of different sizes are designed in the pooling layer, for example three kernels: 7 × 7, 5 × 9, and 9 × 5, as indicated by reference numeral 305 in fig. 3. The multiple different pooling kernels give an initial description of the target's deformation: for example, 7 × 7 and 5 × 9 can describe a person standing under different cameras, while 9 × 5 can describe a person bending over, and so on. Pooling kernels of other sizes can of course be designed for different application scenarios.
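A minimal sketch of the multi-kernel pooling, reading the three kernel sizes as ROI output sizes and using torchvision's roi_align (the 1/16 spatial scale assumes VGG-16's conv5 stride):

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 512, 14, 14)                      # conv5 feature map
rois = torch.tensor([[0.0, 10.0, 10.0, 120.0, 200.0]])  # (batch_idx, x1, y1, x2, y2) in image coordinates

# Three pooled descriptions of the same candidate region, one per kernel shape.
pooled = [
    roi_align(feat, rois, output_size=size, spatial_scale=1.0 / 16)
    for size in [(7, 7), (5, 9), (9, 5)]
]
```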
Step 6: applying a fully connected operation to the features of the regions of interest and separating target from background, thereby obtaining a plurality of tracking affine frames of the target object;
the pooling results, i.e. the features of the multiple regions of interest (ROIs), are passed through a fully connected operation; here, the fully connected operation concatenates the multiple ROI features in sequence, as indicated by reference numeral 306 in fig. 3. The concatenated features are then scored with a softmax function to yield the target/background scores of the compared target regions. For example, a region whose score exceeds a given threshold is judged to be a target region, and otherwise a background region.
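For illustration, a minimal sketch of the concatenation and softmax scoring (the hidden width and the threshold value are assumptions):

```python
import torch
import torch.nn as nn

class TargetScorer(nn.Module):
    """Concatenate the multi-kernel ROI features and score target vs. background."""
    def __init__(self, in_features: int, hidden: int = 1024):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                nn.Linear(hidden, 2))  # [background, target]

    def forward(self, pooled_list):
        x = torch.cat([p.flatten(start_dim=1) for p in pooled_list], dim=1)
        return torch.softmax(self.fc(x), dim=1)

# With the 7x7, 5x9, and 9x5 pooled maps above: 512 * (49 + 45 + 45) inputs.
scorer = TargetScorer(in_features=512 * 139)
scores = scorer(pooled)          # (num_rois, 2); column 1 is the target score
is_target = scores[:, 1] > 0.5   # threshold value is an illustrative assumption
```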
Step 7: performing non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame.
Step 7.1: scoring the features corresponding to each tracking affine frame and comparing the scores to obtain the target/background scores of the compared target regions;
step 7.2: judging a region whose score exceeds a given threshold to be a target region, and otherwise a background region; and
step 7.3: performing non-maximum suppression on the features of the determined target regions to obtain the tracking result of the target object in the current frame.
Non-maximum suppression (reference numeral 308 in fig. 3) is performed on the affine regions determined to be target regions, yielding the tracking result of the t-th frame image, i.e. the corresponding affine parameters and frame, as indicated by reference numeral 309 in fig. 3. In one embodiment, the multiple tracking affine frames may be compared with a reference target frame (i.e. the target frame tracked in the previous frame), and the affine tracking frame with the largest overlap area is taken as the final tracking result, as sketched below. The specific algorithm is described in what follows.
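For illustration, a minimal sketch of this largest-overlap selection using the shapely geometry library (representing affine frames as 4-point quadrilaterals is an assumption):

```python
from shapely.geometry import Polygon

def pick_best_frame(quads, reference_quad):
    """Return the index of the tracking affine frame (a 4-point polygon
    [(x, y), ...]) whose overlap area with the reference frame, i.e. the
    previous frame's tracking result, is largest."""
    ref = Polygon(reference_quad)
    overlaps = [Polygon(q).intersection(ref).area for q in quads]
    return max(range(len(quads)), key=lambda i: overlaps[i])
```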
Optionally, the loss and regression are computed first to optimize the affine transformation parameters. The loss function for the entire VGG-16 network described above can be expressed, for example, as:

$$L(p, tc, u_i, v_i) = L_c(p, tc) + \alpha_1 \sum_i L_{reg}(u_i, u_i^*) + \alpha_2 \sum_i L_{reg}(v_i, v_i^*) \qquad (1)$$

where $\alpha_1$ and $\alpha_2$ are learning rates, and $p$ gives the log loss for class $tc$, with the formula shown in (2):

$$L_c(p, tc) = -\log p_{tc} \qquad (2)$$

$i$ denotes the index of the regression box whose loss is being computed;

$tc$ denotes the category label, e.g. $tc = 1$ for the target and $tc = 0$ for the background;

the variables $x, y, w, h$ denote the abscissa, ordinate, width, and height, respectively;

the parameter $v_i = (v_x, v_y, v_w, v_h)$ is the ground-truth rectangular bounding-box tuple, comprising the center-point abscissa, ordinate, width, and height; $v_i^* = (v_x^*, v_y^*, v_w^*, v_h^*)$ is the predicted target-frame tuple, likewise comprising the center-point abscissa, ordinate, width, and height;

$u_i = (r1, r2, r3, r4, r5, r6)$ is the affine parameter tuple of the real target region, $(r1, \ldots, r6)$ being the values of the six components of the fixed affine-transformation structure of the real target region;

$u_i^* = (r1^*, r2^*, r3^*, r4^*, r5^*, r6^*)$ is the affine parameter tuple of the predicted target region, its components being the values of the six components of the fixed affine-transformation structure of the predicted region;

$L_{reg}(u_i, u_i^*)$ denotes the affine bounding-box parameter loss function, and $L_{reg}(v_i, v_i^*)$ denotes the rectangular bounding-box parameter loss function;

letting $(w, w^*)$ denote $(u_i, u_i^*)$ or $(v_i, v_i^*)$, both are defined as:

$$L_{reg}(w, w^*) = \sum_j \mathrm{smooth}_{L1}(w_j - w_j^*) \qquad (3)$$

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

where x is a real number.
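A minimal sketch of this multitask loss in PyTorch, following the reconstruction above (masking the regression terms to target boxes and the default weights are assumptions):

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, tc, u_pred, u_true, v_pred, v_true,
                   alpha1: float = 1.0, alpha2: float = 1.0):
    """Log loss for target/background plus smooth-L1 regression losses for
    the 6-d affine tuples u and the 4-d rectangular tuples v. Counting the
    regression terms only for target boxes (tc == 1) is an assumption."""
    loss_c = F.cross_entropy(cls_logits, tc)  # equals -log p_tc
    pos = (tc == 1).float()
    loss_aff = (F.smooth_l1_loss(u_pred, u_true, reduction="none").sum(dim=1) * pos).sum()
    loss_rect = (F.smooth_l1_loss(v_pred, v_true, reduction="none").sum(dim=1) * pos).sum()
    return loss_c + alpha1 * loss_aff + alpha2 * loss_rect
```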
Affine transformation is used herein to represent the geometric deformation of the target. The affine transformation parameters of the tracking result for the target region of the t-th frame are denoted $U_t$, with the structure $U_t = [r1, r2, r3, r4, r5, r6]^T$. The corresponding affine transformation matrix

$$M(U_t) = \begin{bmatrix} r1 & r2 & r3 \\ r4 & r5 & r6 \\ 0 & 0 & 1 \end{bmatrix} \qquad (4)$$

has a Lie-group structure: it is an element of the affine Lie group GA(2), and ga(2) is the Lie algebra corresponding to GA(2). The matrices $G_j$ ($j = 1, \ldots, 6$) are the generators of ga(2) and form a basis of it; they are the six $3 \times 3$ matrices each containing a single 1 in one of the six affine entries and zeros elsewhere:

$$G_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \ldots, \quad G_6 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} \qquad (5)$$

For matrix Lie groups, the Riemannian distance is defined via the matrix logarithm:

$$d(X, Y) = \lVert \log(Y X^{-1}) \rVert \qquad (6)$$

where $X$ and $Y$ are elements of the matrix Lie group. For $N$ symmetric positive-definite matrices $X_q$, the intrinsic mean is defined as:

$$\mu = \arg\min_X \sum_{q=1}^{N} d(X, X_q)^2 \qquad (7)$$

where $q \in [1, N]$ is an index;
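For illustration, a minimal sketch of the Riemannian distance (6) and a Karcher-style iteration approximating the intrinsic mean (7), using SciPy's matrix logarithm and exponential (the norm choice, update scheme, and iteration count are assumptions):

```python
import numpy as np
from scipy.linalg import expm, inv, logm

def riemann_dist(X: np.ndarray, Y: np.ndarray) -> float:
    """d(X, Y) = ||log(Y X^-1)||, with the Frobenius norm as an assumption."""
    return np.linalg.norm(logm(Y @ inv(X)), ord="fro")

def intrinsic_mean(mats, iters: int = 20) -> np.ndarray:
    """Karcher-style iteration: repeatedly move the estimate along the mean
    log-map of the samples toward the minimizer of (7)."""
    mu = mats[0].copy()
    for _ in range(iters):
        tangent = sum(logm(X @ inv(mu)) for X in mats) / len(mats)
        mu = expm(tangent) @ mu
    return mu
```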
and non-maximum suppression is performed on the tracking affine frames to obtain the tracking result of the t-th frame image. Regression can yield multiple different target regions; to correctly obtain the detection with the highest accuracy, this application adopts an affine-transformation non-maximum-suppression method to screen out the final tracking result. In addition, the design of the loss function takes the affine deformation of the target into account, improving the accuracy of the predicted target position.
In current object-detection methods, non-maximum suppression (NMS) is widely used to post-process detection candidates. Since both an axis-aligned bounding box and a tilted bounding box are estimated, ordinary NMS can be performed on the axis-aligned boxes, or a tilted NMS can be performed on the affine-transformed bounding boxes; the latter becomes affine-transformation non-maximum suppression. In affine-transformation NMS, the conventional intersection-over-union (IoU) computation is modified to the IoU between two affine bounding boxes. The effect of the algorithm is shown in fig. 4, where the frames numbered 401 are the candidate tracking frames before non-maximum suppression, frame 402 is the tracking frame obtained after ordinary NMS, and frame 403 is the tracking frame obtained by the affine-transformation NMS of the present application. The tracking frame obtained by the present method is clearly more accurate.
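A minimal sketch of affine-transformation NMS with polygon IoU, using the shapely geometry library (the greedy score-ordered loop is the standard NMS scheme; the threshold value is an assumption):

```python
from shapely.geometry import Polygon

def affine_nms(quads, scores, iou_thresh: float = 0.5):
    """Greedy NMS over affine boxes given as 4-point polygons [(x, y), ...].
    Keeps the highest-scoring boxes and suppresses any box whose polygon IoU
    with an already kept box exceeds the threshold."""
    polys = [Polygon(q) for q in quads]
    order = sorted(range(len(polys)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        ok = True
        for j in keep:
            inter = polys[i].intersection(polys[j]).area
            union = polys[i].union(polys[j]).area
            if union > 0 and inter / union > iou_thresh:
                ok = False
                break
        if ok:
            keep.append(i)
    return keep
```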
Step 8: determining whether t + 1 is smaller than the total number of video frames, and if so, returning to step 2 to track the (t + 1)-th frame image. The algorithm ends once all video frames have been tracked. Some of the tracking-result frames are shown as the black frames indicated by arrows 501, 502, 503, and 504 in fig. 5.
In this method, the current target image is cropped using the affine transformation parameter information of the previous frame, narrowing the search range and improving algorithm efficiency. In addition, the cropped image is passed through the VGG-16 network to compute features before being fed to the RPN network, avoiding repeated feature-extraction computation and further improving efficiency. Moreover, in the pooling operation, kernels of different sizes and shapes are applied to give a preliminary simulation of target deformation, which helps extract the target position more accurately. The features output by the highest layer of the network serve as a semantic model, and the affine transformation result serves as a spatial model; the two complement each other, since the highest-layer features contain more semantic information but less spatial information. Furthermore, the multitask loss function described above, which includes affine-transformation parameter regression, optimizes network performance.
In the pedestrian tracking system, the first neural network is a VGG-16 network, and the second neural network is an RPN network.
In the pedestrian tracking system described above, the candidate regions obtained from the second neural network are regions of various shapes and positions where the target object may exist in the current frame. In addition, step 5 pools the features of the target candidate regions through several pooling kernels of different sizes to obtain the multiple regions of interest for the target object. For example, the kernels may comprise three sizes that give an initial description of different target deformations. Specifically, as described above, in consideration of target deformation, several kernels of different sizes are designed in the pooling layer, for example 7 × 7, 5 × 9, and 9 × 5: the first two can describe a person standing under different cameras, while 9 × 5 can describe a person bending over, and so on. Pooling kernels of other sizes can of course be designed for different application scenarios.
The foregoing description covers only the preferred embodiments of the present disclosure and the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example solutions in which the above features are replaced with (but not limited to) features of similar function disclosed in the embodiments of the present disclosure.

Claims (3)

1. A pedestrian tracking method based on affine multitask regression, characterized by comprising the following steps:
step 1: determining a target frame containing a target object in the first frame of the plurality of video frames;
step 2: determining the current target frame containing the target object in the current frame according to the determined target frame;
step 3: adjusting the determined target frame to a fixed size and inputting it into a pre-trained first neural network, obtaining a candidate feature map of the target frame in the current frame, and designing a loss function comprising an affine bounding-box parameter loss function and a rectangular bounding-box parameter loss function;
the loss function is expressed as:

$$L(p, tc, u_i, v_i) = L_c(p, tc) + \alpha_1 \sum_i L_{reg}(u_i, u_i^*) + \alpha_2 \sum_i L_{reg}(v_i, v_i^*)$$

wherein $\alpha_1$ and $\alpha_2$ are learning rates; $L_c(p, tc) = -\log p_{tc}$ is the log loss for category $tc$;

$i$ denotes the index of the regression box whose loss is being computed;

$tc$ denotes the category label, $tc = 1$ denoting the target and $tc = 0$ the background;

the variables $x, y, w, h$ denote the abscissa, ordinate, width, and height, respectively;

the parameter $v_i = (v_x, v_y, v_w, v_h)$ is the ground-truth rectangular bounding-box tuple, comprising the center-point abscissa, ordinate, width, and height; $v_i^* = (v_x^*, v_y^*, v_w^*, v_h^*)$ is the predicted target-frame tuple, comprising the center-point abscissa, ordinate, width, and height;

$u_i = (r1, r2, r3, r4, r5, r6)$ is the affine parameter tuple of the real target region, $(r1, \ldots, r6)$ being the values of the six components of the fixed affine-transformation structure of the real target region;

$u_i^* = (r1^*, r2^*, r3^*, r4^*, r5^*, r6^*)$ is the affine parameter tuple of the predicted target region, its components being the values of the six components of the fixed affine-transformation structure of the predicted target region;

$L_{reg}(u_i, u_i^*)$ denotes the affine bounding-box parameter loss function;

$L_{reg}(v_i, v_i^*)$ denotes the rectangular bounding-box parameter loss function;

letting $(w, w^*)$ denote $(u_i, u_i^*)$ or $(v_i, v_i^*)$, both are defined as:

$$L_{reg}(w, w^*) = \sum_j \mathrm{smooth}_{L1}(w_j - w_j^*), \qquad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

wherein x is a real number;
step 4: inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions;
step 5: pooling the features of the target candidate regions to obtain a plurality of regions of interest for the target object;
step 6: applying a fully connected operation to the features of the regions of interest and separating target from background, thereby obtaining a plurality of tracking affine frames of the target object;
step 7: performing non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame, namely the affine parameters, tracking frame, and center-point coordinates;
step 8: judging whether the frame number of the next frame is smaller than the total number of video frames; if not, ending directly; if so, returning to step 2 to track the next frame, until all video frames have been tracked;
the pedestrian tracking method based on affine multitask regression is implemented on the following pedestrian tracking system, which comprises a memory and a processor;
the memory is used for storing computer-executable instructions;
the processor is configured to execute the executable instructions to: initialize affine parameters by determining a target frame containing a target object in the previous frame of a plurality of video frames; determine the current target frame containing the target object in the current frame from the determined target frame; input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame; input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; apply a fully connected operation to the features of the regions of interest and separate target from background, thereby obtaining a plurality of tracking affine frames of the target object; and perform non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame, namely the affine parameters, tracking frame, and center-point coordinates.
2. The pedestrian tracking method based on affine multitask regression of claim 1, wherein the first neural network is a VGG-16 network and the second neural network is an RPN network.
3. The pedestrian tracking method based on affine multitask regression according to claim 1, wherein said step 7 specifically comprises:
step 7.1: scoring the features corresponding to each tracking affine frame and comparing the scores to obtain the target/background scores of the compared target regions;
step 7.2: judging a region whose score exceeds a given threshold to be a target region, and otherwise a background region;
step 7.3: performing non-maximum suppression on the features of the determined target regions to obtain the tracking result of the target object in the current frame.
CN202010118387.6A 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression Active CN111428567B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010118387.6A CN111428567B (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010118387.6A CN111428567B (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression

Publications (2)

Publication Number Publication Date
CN111428567A CN111428567A (en) 2020-07-17
CN111428567B (en) 2024-02-02

Family

ID=71547182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010118387.6A Active CN111428567B (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression

Country Status (1)

Country Link
CN (1) CN111428567B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112055172B (en) * 2020-08-19 2022-04-19 浙江大华技术股份有限公司 Method and device for processing monitoring video and storage medium
WO2022133911A1 (en) * 2020-12-24 2022-06-30 深圳市大疆创新科技有限公司 Target detection method and apparatus, movable platform, and computer-readable storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093480A (en) * 2013-01-15 2013-05-08 沈阳大学 Particle filtering video image tracking method based on dual model
CN105389832A (en) * 2015-11-20 2016-03-09 沈阳大学 Video object tracking method based on Grassmann manifold and projection group
CN106683091A (en) * 2017-01-06 2017-05-17 北京理工大学 Target classification and attitude detection method based on depth convolution neural network
US9946960B1 (en) * 2017-10-13 2018-04-17 StradVision, Inc. Method for acquiring bounding box corresponding to an object in an image by using convolutional neural network including tracking network and computing device using the same
CN108280855A (en) * 2018-01-13 2018-07-13 福州大学 A kind of insulator breakdown detection method based on Fast R-CNN
CN109255351A (en) * 2018-09-05 2019-01-22 华南理工大学 Bounding box homing method, system, equipment and medium based on Three dimensional convolution neural network
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolution gating cycle neural unit
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN110399808A (en) * 2019-07-05 2019-11-01 桂林安维科技有限公司 A kind of Human bodys' response method and system based on multiple target tracking
CN110458864A (en) * 2019-07-02 2019-11-15 南京邮电大学 Based on the method for tracking target and target tracker for integrating semantic knowledge and example aspects
CN110781350A (en) * 2019-09-26 2020-02-11 武汉大学 Pedestrian retrieval method and system oriented to full-picture monitoring scene

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093480A (en) * 2013-01-15 2013-05-08 沈阳大学 Particle filtering video image tracking method based on dual model
CN105389832A (en) * 2015-11-20 2016-03-09 沈阳大学 Video object tracking method based on Grassmann manifold and projection group
CN106683091A (en) * 2017-01-06 2017-05-17 北京理工大学 Target classification and attitude detection method based on depth convolution neural network
US9946960B1 (en) * 2017-10-13 2018-04-17 StradVision, Inc. Method for acquiring bounding box corresponding to an object in an image by using convolutional neural network including tracking network and computing device using the same
CN108280855A (en) * 2018-01-13 2018-07-13 福州大学 A kind of insulator breakdown detection method based on Fast R-CNN
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109255351A (en) * 2018-09-05 2019-01-22 华南理工大学 Bounding box homing method, system, equipment and medium based on Three dimensional convolution neural network
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolution gating cycle neural unit
CN110458864A (en) * 2019-07-02 2019-11-15 南京邮电大学 Based on the method for tracking target and target tracker for integrating semantic knowledge and example aspects
CN110399808A (en) * 2019-07-05 2019-11-01 桂林安维科技有限公司 A kind of Human bodys' response method and system based on multiple target tracking
CN110781350A (en) * 2019-09-26 2020-02-11 武汉大学 Pedestrian retrieval method and system oriented to full-picture monitoring scene

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Target tracking based on Grassmann manifolds and projection groups; 谢英红, 庞彦伟, 韩晓微, 田丹; Chinese Journal of Scientific Instrument, No. 05; full text *
Robust visual tracking based on convolutional neural networks and conformal predictors; 高琳, 王俊峰, 范勇, 陈念年; Acta Optica Sinica, Vol. 37, No. 8; full text *
Efficient visual target tracking algorithm based on deep spectral convolutional neural networks; 郭强, 芦晓红, 谢英红, 孙鹏; Infrared and Laser Engineering, No. 06; full text *

Also Published As

Publication number Publication date
CN111428567A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
US11436739B2 (en) Method, apparatus, and storage medium for processing video image
CN111401201B (en) Aerial image multi-scale target detection method based on spatial pyramid attention drive
Lee et al. Simultaneous traffic sign detection and boundary estimation using convolutional neural network
CN109961009B (en) Pedestrian detection method, system, device and storage medium based on deep learning
CN110059558B (en) Orchard obstacle real-time detection method based on improved SSD network
CN111797893B (en) Neural network training method, image classification system and related equipment
CN109035304B (en) Target tracking method, medium, computing device and apparatus
CN111401516B (en) Searching method for neural network channel parameters and related equipment
CN112926410B (en) Target tracking method, device, storage medium and intelligent video system
US11948340B2 (en) Detecting objects in video frames using similarity detectors
JP2018523877A (en) System and method for object tracking
CN104424634A (en) Object tracking method and device
CN110910445B (en) Object size detection method, device, detection equipment and storage medium
US20230137337A1 (en) Enhanced machine learning model for joint detection and multi person pose estimation
CN111368634B (en) Human head detection method, system and storage medium based on neural network
CN111931764A (en) Target detection method, target detection framework and related equipment
CN111428566B (en) Deformation target tracking system and method
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111428567B (en) Pedestrian tracking system and method based on affine multitask regression
CN111401143A (en) Pedestrian tracking system and method
CN110490058B (en) Training method, device and system of pedestrian detection model and computer readable medium
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
Fan et al. Covered vehicle detection in autonomous driving based on faster rcnn
CN117372928A (en) Video target detection method and device and related equipment
CN116453109A (en) 3D target detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant