CN111428567B - Pedestrian tracking system and method based on affine multitask regression - Google Patents

Pedestrian tracking system and method based on affine multitask regression

Info

Publication number
CN111428567B
CN111428567B (application CN202010118387.6A)
Authority
CN
China
Prior art keywords
target
frame
affine
tracking
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010118387.6A
Other languages
Chinese (zh)
Other versions
CN111428567A (en)
Inventor
谢英红
韩晓微
刘天惠
涂斌斌
唐璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang University
Original Assignee
Shenyang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang University filed Critical Shenyang University
Priority to CN202010118387.6A
Publication of CN111428567A
Application granted
Publication of CN111428567B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/02 Affine transformations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian tracking system and method based on affine multitask regression, in the technical field of computer vision. The method determines a target frame containing a target object in the previous frame of a sequence of video frames; determines the current target frame containing the target object in the current frame from the previously determined target frame; inputs the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame; inputs the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pools the features of the target candidate regions to obtain a plurality of regions of interest for the target object; applies a fully connected operation to the features of the regions of interest and separates target from background, thereby obtaining a plurality of tracking affine frames of the target object; and performs non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame.

Description

Pedestrian tracking system and method based on affine multitask regression
Technical Field
The invention relates to the technical field of computer vision, in particular to a pedestrian tracking system and method based on affine multitask regression.
Background
Pedestrian tracking identifies and tracks pedestrian targets in video and image sequences by means of computer vision techniques.
The prior patent application CN108629791A provides a pedestrian tracking method and device and a cross-camera pedestrian tracking method and device. The pedestrian tracking method comprises: acquiring a video; performing pedestrian detection on at least some of the video frames to obtain the pedestrian frames in each of those video frames; for each obtained pedestrian frame, processing the image block it contains with a trained convolutional neural network to obtain the feature vector of that pedestrian frame; and matching all pedestrian frames based on their feature vectors to obtain a pedestrian tracking result comprising at least one pedestrian trajectory. The method and device are not limited by position information, have good robustness, achieve accurate and efficient pedestrian tracking, and readily realize pedestrian tracking across cameras.
CN107292908A discloses a pedestrian tracking method based on KLT feature points. HOG features preserve the geometric and photometric properties of the image well under deformation, and under Gamma normalization the pedestrian pose can vary over a wide range while most fine motions do not affect the detection result, so HOG with an SVM classifier is selected for pedestrian detection. The detection result is then tracked with the KLT algorithm; KLT, a further development of the optical-flow method, has good real-time performance, does not easily lose the tracked target, and can track existing targets in real time. Combining a detection algorithm with a tracking algorithm addresses the problems that many current tracking setups have fixed, immovable cameras or cannot track a specific target; it also compensates for the slow detection speed caused by the high computational complexity of HOG and SVM.
CN110414439A discloses an anti-occlusion pedestrian tracking method based on multi-peak detection. Pedestrian detection is first performed to obtain an initial position, and the tracker parameters and pedestrian template are initialized. In each subsequent frame, the position of the feature-fusion response peak is taken as the center of the predicted pedestrian position; the target response peak Fmax, the average peak-to-correlation energy (APCE), and their thresholds are computed, and the joint confidence formed from these peaks is used to detect multiple peaks in the filter response, realizing pedestrian occlusion judgment. Updating of the filter parameters and the pedestrian target template is suspended in occluded frames, achieving anti-occlusion pedestrian tracking. The method adaptively fuses FHOG and ColorName features as descriptors, improving the robustness of the tracker to pedestrian deformation and illumination changes; suspending template and filter updates in occluded frames alleviates tracking-position drift.
CN108509859A discloses a non-overlapping-area pedestrian tracking method based on a deep neural network, comprising the following steps: (1) detecting the current pedestrian target in the surveillance video image with the YOLO algorithm and cropping the pedestrian target picture; (2) tracking and predicting the detection result with a Kalman filter; (3) extracting deep features of the pictures (the candidate pedestrian pictures and the target pedestrian pictures from step (2)) with a convolutional neural network, and storing the candidate pedestrians' pictures and features; (4) computing and ranking the similarity between the features of the target pedestrian and the candidate pedestrians to identify the target pedestrian. The method achieves high detection and tracking precision, which helps improve the pedestrian recognition rate.
However, neither the above methods nor other popular deep-learning networks currently offer a dedicated solution for accurately localizing deformed targets.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a pedestrian tracking system and method based on affine multitask regression. By introducing affine transformations into a deep-learning network, accurate tracking of deformed targets is achieved.
In order to solve the technical problems, the invention adopts the following technical scheme:
in one aspect, the invention provides an affine multitasking regression-based pedestrian tracking system comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is configured to execute the executable instructions to: initialize affine parameters by determining a target frame containing a target object in the previous frame of a plurality of video frames; determine the current target frame containing the target object in the current frame from the determined target frame; input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame; input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; apply a fully connected operation to the features of the regions of interest and separate target from background, thereby obtaining a plurality of tracking affine frames of the target object; and perform non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame, namely the affine parameters, tracking frame, and center-point coordinates.
In another aspect, the invention also provides a pedestrian tracking method based on affine multitask regression, implemented with the above pedestrian tracking system, the method comprising the following steps:
step 1: determining a target frame containing a target object in the first frame of the plurality of video frames;
step 2: determining the current target frame containing the target object in the current frame according to the determined target frame;
step 3: adjusting the determined target frame to a fixed size and inputting it into a pre-trained first neural network, obtaining a candidate feature map of the target frame in the current frame, and designing a loss function comprising an affine bounding-box parameter loss function and a rectangular bounding-box parameter loss function;
the first neural network is a VGG-16 network;
the loss function of the VGG-16 network is expressed as:

$$L(p, tc, u_i, v_i) = L_c(p, tc) + \alpha_1 \sum_i L_{reg}(u_i, u_i^*) + \alpha_2 \sum_i L_{reg}(v_i, v_i^*)$$

where $\alpha_1$ and $\alpha_2$ are learning rates, and $L_c(p, tc) = -\log p_{tc}$ is the log loss for category $tc$;

$i$ denotes the index of the regression box whose loss is being computed;

$tc$ denotes the category label, e.g. $tc = 1$ for the target and $tc = 0$ for the background;

the variables $x, y, w, h$ denote the abscissa, ordinate, width, and height, respectively;

the parameter $v_i = (v_x, v_y, v_w, v_h)$ is the ground-truth rectangular bounding-box tuple, comprising the center-point abscissa, ordinate, width, and height; $v_i^* = (v_x^*, v_y^*, v_w^*, v_h^*)$ is the predicted target-frame tuple, likewise comprising the center-point abscissa, ordinate, width, and height;

$u_i = (r1, r2, r3, r4, r5, r6)$ is the affine parameter tuple of the real target region, $(r1, \ldots, r6)$ being the values of the six components of the fixed affine-transformation structure of the real target region;

$u_i^* = (r1^*, r2^*, r3^*, r4^*, r5^*, r6^*)$ is the affine parameter tuple of the predicted target region, its components being the values of the six components of the fixed affine-transformation structure of the predicted region;

$L_{reg}(u_i, u_i^*)$ denotes the affine bounding-box parameter loss function, and $L_{reg}(v_i, v_i^*)$ denotes the rectangular bounding-box parameter loss function;

letting $(w, w^*)$ denote $(u_i, u_i^*)$ or $(v_i, v_i^*)$, both are defined as:

$$L_{reg}(w, w^*) = \sum_j \mathrm{smooth}_{L1}(w_j - w_j^*), \qquad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

where x is a real number.
Step 4: inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions;
the second neural network is an RPN network.
Step 5: pooling the features of the target candidate regions to obtain a plurality of regions of interest for the target object;
step 6: applying a fully connected operation to the features of the regions of interest and separating target from background, thereby obtaining a plurality of tracking affine frames of the target object;
step 7: performing non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame, namely the affine parameters, tracking frame, and center-point coordinates.
Step 7.1: scoring the features corresponding to each tracking affine frame and comparing the scores to obtain the target/background scores of the compared target regions;
step 7.2: judging a region whose score exceeds a given threshold to be a target region, and otherwise a background region;
step 7.3: performing non-maximum suppression on the features of the determined target regions to obtain the tracking result of the target object in the current frame;
step 8: judging whether the frame number of the next frame is smaller than the total number of video frames; if not, ending directly; if so, returning to step 2 to track the next frame, until all video frames have been tracked.
The beneficial effects of the above technical solution are as follows:
The method crops the current target image using the affine transformation parameter information of the previous frame, which narrows the search range and improves algorithm efficiency. In addition, the cropped image is passed through the VGG-16 network to compute features before being fed to the RPN network, avoiding repeated feature-extraction computation and further improving efficiency. The features output by the highest layer of the network serve as a semantic model, and the affine transformation result serves as a spatial model; the two complement each other, since the highest-layer features contain more semantic information but less spatial information. Furthermore, the multitask loss function described above, which includes affine-transformation parameter regression, optimizes network performance.
Drawings
FIG. 1 is a block diagram of a computer architecture implementing an embodiment of the present invention.
Fig. 2 is a flowchart of the pedestrian tracking algorithm according to an embodiment of the present invention.
Fig. 3 is a schematic block diagram of the processing flow of an embodiment of the present invention.
Fig. 4 compares the effects of horizontal (axis-aligned) NMS and affine-transformation NMS according to an embodiment of the present invention.
FIG. 5 shows tracking results according to an embodiment of the present invention.
Fig. 6 shows the VGG-16 network structure according to an embodiment of the invention.
Detailed Description
The following describes the embodiments of the present invention in detail with reference to the drawings.
In one aspect, the invention provides an affine multitasking regression-based pedestrian tracking system comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is configured to execute the executable instructions to: initialize affine parameters by determining a target frame containing a target object in the previous frame of a plurality of video frames; determine the current target frame containing the target object in the current frame from the determined target frame; input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame; input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; apply a fully connected operation to the features of the regions of interest and separate target from background, thereby obtaining a plurality of tracking affine frames of the target object; and perform non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame.
As shown in fig. 1, a schematic diagram of an electronic system 600 suitable for use in implementing embodiments of the present disclosure is shown. The electronic system shown in fig. 1 is only one example and should not be construed as limiting the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 1, the electronic system 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or loaded from a storage device 608 into a random-access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic system 600. The processing device 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic system 600 to communicate wirelessly or by wire with other devices to exchange data. While fig. 1 shows an electronic system 600 having various devices, it should be understood that not all of the illustrated devices must be implemented or provided; more or fewer devices may be implemented or provided instead. Each block shown in fig. 1 may represent one device or multiple devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 601.
It should be noted that, the computer readable medium according to the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Whereas in embodiments of the present disclosure, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic system described above (also referred to herein as the "pedestrian tracking system based on affine multitask regression"), or may exist separately without being assembled into the electronic system. The computer readable medium carries one or more programs which, when executed by the electronic system, cause it to: 1) determine a target frame containing a target object in the previous frame of a plurality of video frames; 2) determine the current target frame containing the target object in the current frame from the determined target frame; 3) input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame in the current frame; 4) input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; 5) pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; 6) apply a fully connected operation to the features of the regions of interest and separate target from background, thereby obtaining a plurality of tracking affine frames of the target object; and 7) perform non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame.
In another aspect, the invention also provides a pedestrian tracking method based on affine multitask regression, implemented with the pedestrian tracking system described above. As shown in fig. 2, the method comprises the following steps:
step 1: determining a target frame containing a target object in the first frame of the plurality of video frames;
The size of the original image is initialized; let it be m × n (unit: pixels). At t = 1, the position of the target frame in that frame is manually marked, and the center position of the target frame is recorded as (cx, cy). Here t denotes the t-th frame image, t is a positive integer, and cx and cy are the abscissa and ordinate of the center of the target frame; the target frame contains the object to be tracked, e.g. as shown by reference numeral 301 in fig. 3.
The affine transformation parameters are initialized as $U_1 = [r1, r2, r3, r4, r5, r6]^T$.
Step 2: determining a current target frame including the target object in the current frame according to the determined target frame;
in this embodiment, a current target frame including the target object in the current frame is determined according to the determined target frame. Specifically, the input t (t > 2) frame picture is cut, and the target frame of the t frame is determined by taking the center coordinates (cx, cy) of the target frame tracked or identified by the t-1 frame as the center. For example, assume that two sides of the circumscribed rectangle of the target frame in the t-1 frame are noted as: a, b, a picture of size (2 a) × (2 b), for example, a rectangular frame denoted by reference numeral 302 in fig. 3, is cut out on the t-th frame image centering on the target center point (cx, cy) of the t-1 st frame. In the present application, the purpose of centering on the center point of the object of the previous frame is to make the clipped picture contain the object information, because the coordinates of the center points of the objects in two adjacent frames do not change greatly, and as long as the sub-picture is clipped at the position near the center point, the object to be tracked can be contained.
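For illustration, a minimal sketch of this cropping step, assuming NumPy arrays in H x W x C layout (the function name and the clamping to image borders are assumptions, not taken from the patent):

```python
import numpy as np

def crop_search_region(frame: np.ndarray, cx: int, cy: int, a: int, b: int) -> np.ndarray:
    """Crop a (2a) x (2b) search window from `frame` (H x W x C), centered on
    the previous frame's target center (cx, cy). Clamping to the image
    borders is an assumption; the patent does not specify border handling."""
    h, w = frame.shape[:2]
    x0, x1 = max(0, cx - a), min(w, cx + a)
    y0, y1 = max(0, cy - b), min(h, cy + b)
    return frame[y0:y1, x0:x1]
```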
Step 3: adjusting the determined target frame to a fixed size, inputting it into a pre-trained first neural network, obtaining a candidate feature map of the target frame in the current frame, and designing the loss function.
The cropped target frame is resized to a fixed size and fed into a pre-trained neural network, for example the VGG-16 network, and the feature map of the image after the fifth convolution stage of the network is taken, i.e. the candidate feature map of the target frame in the image is obtained, as shown at 303 in fig. 3.
The first neural network is a VGG-16 network; an exemplary VGG-16 structure is shown in fig. 6. As shown there, the network comprises 13 convolutional layers (201) and 3 fully connected layers (203). Specifically, a convolutional layer is first built with 3 × 3 filters and stride 1. Assuming the network input size is m × n × 3 (m and n positive integers), to keep the first two dimensions of the feature matrix after convolution equal to those of the input matrix, i.e. m × n, a ring of zeros is added around the input matrix; its dimensions become (m + 2) × (n + 2), and the 3 × 3 convolution is then applied, so the first two dimensions of the convolved feature matrix remain m × n. The max-pooling layer 202 is then built with a 2 × 2 filter of stride 2. Next, three convolutions are performed with 256 identical filters, followed by pooling, then three more convolutions and another pooling. The activation function used throughout is the standard ReLU function. After several such rounds, the resulting 7 × 7 × 512 feature map is fully connected (fully connected layer 203) into 4096 units, and a softmax activation (activation layer 204) then outputs the result identified among 1000 classes.
After the network is constructed, it is trained on the ImageNet dataset, which is divided into a training set and a test set and covers, for example, 1000 categories. Each datum has a corresponding label vector, each label vector corresponding to a different category, such as target object or background. The present application is not concerned with the specific classification of the input image; the dataset is used simply to train the weights of the VGG-16 network. Specifically, the ImageNet training images are resized to 224 × 224 × 3 and fed into the VGG-16 network to train it, yielding the weight parameters of every layer or unit of the network. Then a predetermined test dataset (which may also be of size 224 × 224 × 3, for example) and the label vectors of the corresponding categories are input into the trained VGG-16 network; the network output is compared with the standard data, and the parameters (weights) of the network are adjusted according to the error. These steps are repeated until the test accuracy reaches a predetermined standard, e.g. above 98%.
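A minimal sketch of the conv5 feature extraction, using torchvision's pretrained VGG-16 as a stand-in for the patent's ImageNet-trained network (truncating before the last max-pool to obtain the "feature map after the fifth convolution stage" is an assumption based on the description above):

```python
import torch
import torchvision

# Pretrained VGG-16; `features` holds the 13 convolutional layers and 5 max-pools.
vgg16 = torchvision.models.vgg16(weights="IMAGENET1K_V1")
# Drop the final max-pool so the output is the conv5 feature map rather than
# the pooled 7 x 7 grid (the truncation point is an assumption).
conv5 = torch.nn.Sequential(*list(vgg16.features.children())[:-1]).eval()

with torch.no_grad():
    crop = torch.randn(1, 3, 224, 224)   # stand-in for the resized target crop
    feat = conv5(crop)                   # (1, 512, 14, 14) candidate feature map
```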
Step 4: inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions;
the feature map obtained from the neural network above is input into the RPN (Region Proposal Network), and a number of target candidate regions are extracted, e.g. 2000 candidate regions, as indicated by reference numeral 304 in fig. 3. Unlike the VGG-16 network, the RPN is a network that generates many candidate regions of different sizes. The candidate regions are regions of various shapes and locations where the target may exist in the current frame. The idea is to estimate in advance a number of regions where the target may be, perform optimizing regression on these regions, and screen out a more accurate tracking region.
The second neural network is an RPN network.
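For illustration, a minimal RPN-head sketch over the conv5 feature map (the channel width and the number k of anchors per location are assumptions; the patent gives neither):

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """A 3x3 conv followed by two 1x1 sibling heads: per-anchor objectness
    scores and per-anchor box regression deltas."""
    def __init__(self, in_channels: int = 512, k: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # target/background per anchor
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # (x, y, w, h) deltas per anchor

    def forward(self, feat):
        h = self.relu(self.conv(feat))
        return self.cls(h), self.reg(h)
```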
Step 5: pooling the features of the target candidate regions to obtain a plurality of regions of interest for the target object;
the features of these differently sized candidate regions are pooled to obtain multiple regions of interest (ROIs) for the target object. Here, in consideration of target deformation, several pooling kernels of different sizes are designed in the pooling layer, for example three kernels: 7 × 7, 5 × 9, and 9 × 5, as indicated by reference numeral 305 in fig. 3. The multiple different pooling kernels give an initial description of the target's deformation: for example, 7 × 7 and 5 × 9 can describe a person standing under different cameras, while 9 × 5 can describe a person bending over, and so on. Pooling kernels of other sizes can of course be designed for different application scenarios.
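A minimal sketch of the multi-kernel pooling, reading the three kernel sizes as ROI output sizes and using torchvision's roi_align (the 1/16 spatial scale assumes VGG-16's conv5 stride):

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 512, 14, 14)                      # conv5 feature map
rois = torch.tensor([[0.0, 10.0, 10.0, 120.0, 200.0]])  # (batch_idx, x1, y1, x2, y2) in image coordinates

# Three pooled descriptions of the same candidate region, one per kernel shape.
pooled = [
    roi_align(feat, rois, output_size=size, spatial_scale=1.0 / 16)
    for size in [(7, 7), (5, 9), (9, 5)]
]
```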
Step 6: applying a fully connected operation to the features of the regions of interest and separating target from background, thereby obtaining a plurality of tracking affine frames of the target object;
the pooling results, i.e. the features of the multiple regions of interest (ROIs), are passed through a fully connected operation; here, the fully connected operation concatenates the multiple ROI features in sequence, as indicated by reference numeral 306 in fig. 3. The concatenated features are then scored with a softmax function to yield the target/background scores of the compared target regions. For example, a region whose score exceeds a given threshold is judged to be a target region, and otherwise a background region.
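For illustration, a minimal sketch of the concatenation and softmax scoring (the hidden width and the threshold value are assumptions):

```python
import torch
import torch.nn as nn

class TargetScorer(nn.Module):
    """Concatenate the multi-kernel ROI features and score target vs. background."""
    def __init__(self, in_features: int, hidden: int = 1024):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                nn.Linear(hidden, 2))  # [background, target]

    def forward(self, pooled_list):
        x = torch.cat([p.flatten(start_dim=1) for p in pooled_list], dim=1)
        return torch.softmax(self.fc(x), dim=1)

# With the 7x7, 5x9, and 9x5 pooled maps above: 512 * (49 + 45 + 45) inputs.
scorer = TargetScorer(in_features=512 * 139)
scores = scorer(pooled)          # (num_rois, 2); column 1 is the target score
is_target = scores[:, 1] > 0.5   # threshold value is an illustrative assumption
```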
Step 7: performing non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame.
Step 7.1: scoring the features corresponding to each tracking affine frame and comparing the scores to obtain the target/background scores of the compared target regions;
step 7.2: judging a region whose score exceeds a given threshold to be a target region, and otherwise a background region; and
step 7.3: performing non-maximum suppression on the features of the determined target regions to obtain the tracking result of the target object in the current frame.
Non-maximum suppression (reference numeral 308 in fig. 3) is performed on the affine regions determined to be target regions, yielding the tracking result of the t-th frame image, i.e. the corresponding affine parameters and frame, as indicated by reference numeral 309 in fig. 3. In one embodiment, the multiple tracking affine frames may be compared with a reference target frame (i.e. the target frame tracked in the previous frame), and the affine tracking frame with the largest overlap area is taken as the final tracking result, as sketched below. The specific algorithm is described in what follows.
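For illustration, a minimal sketch of this largest-overlap selection using the shapely geometry library (representing affine frames as 4-point quadrilaterals is an assumption):

```python
from shapely.geometry import Polygon

def pick_best_frame(quads, reference_quad):
    """Return the index of the tracking affine frame (a 4-point polygon
    [(x, y), ...]) whose overlap area with the reference frame, i.e. the
    previous frame's tracking result, is largest."""
    ref = Polygon(reference_quad)
    overlaps = [Polygon(q).intersection(ref).area for q in quads]
    return max(range(len(quads)), key=lambda i: overlaps[i])
```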
Optionally, the loss and regression are computed first to optimize the affine transformation parameters. The loss function for the entire VGG-16 network described above can be expressed, for example, as:

$$L(p, tc, u_i, v_i) = L_c(p, tc) + \alpha_1 \sum_i L_{reg}(u_i, u_i^*) + \alpha_2 \sum_i L_{reg}(v_i, v_i^*) \qquad (1)$$

where $\alpha_1$ and $\alpha_2$ are learning rates, and $p$ gives the log loss for class $tc$, with the formula shown in (2):

$$L_c(p, tc) = -\log p_{tc} \qquad (2)$$

$i$ denotes the index of the regression box whose loss is being computed;

$tc$ denotes the category label, e.g. $tc = 1$ for the target and $tc = 0$ for the background;

the variables $x, y, w, h$ denote the abscissa, ordinate, width, and height, respectively;

the parameter $v_i = (v_x, v_y, v_w, v_h)$ is the ground-truth rectangular bounding-box tuple, comprising the center-point abscissa, ordinate, width, and height; $v_i^* = (v_x^*, v_y^*, v_w^*, v_h^*)$ is the predicted target-frame tuple, likewise comprising the center-point abscissa, ordinate, width, and height;

$u_i = (r1, r2, r3, r4, r5, r6)$ is the affine parameter tuple of the real target region, $(r1, \ldots, r6)$ being the values of the six components of the fixed affine-transformation structure of the real target region;

$u_i^* = (r1^*, r2^*, r3^*, r4^*, r5^*, r6^*)$ is the affine parameter tuple of the predicted target region, its components being the values of the six components of the fixed affine-transformation structure of the predicted region;

$L_{reg}(u_i, u_i^*)$ denotes the affine bounding-box parameter loss function, and $L_{reg}(v_i, v_i^*)$ denotes the rectangular bounding-box parameter loss function;

letting $(w, w^*)$ denote $(u_i, u_i^*)$ or $(v_i, v_i^*)$, both are defined as:

$$L_{reg}(w, w^*) = \sum_j \mathrm{smooth}_{L1}(w_j - w_j^*) \qquad (3)$$

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

where x is a real number.
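A minimal sketch of this multitask loss in PyTorch, following the reconstruction above (masking the regression terms to target boxes and the default weights are assumptions):

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, tc, u_pred, u_true, v_pred, v_true,
                   alpha1: float = 1.0, alpha2: float = 1.0):
    """Log loss for target/background plus smooth-L1 regression losses for
    the 6-d affine tuples u and the 4-d rectangular tuples v. Counting the
    regression terms only for target boxes (tc == 1) is an assumption."""
    loss_c = F.cross_entropy(cls_logits, tc)  # equals -log p_tc
    pos = (tc == 1).float()
    loss_aff = (F.smooth_l1_loss(u_pred, u_true, reduction="none").sum(dim=1) * pos).sum()
    loss_rect = (F.smooth_l1_loss(v_pred, v_true, reduction="none").sum(dim=1) * pos).sum()
    return loss_c + alpha1 * loss_aff + alpha2 * loss_rect
```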
Affine transformation is used herein to represent the geometric deformation of the target. The affine transformation parameters of the tracking result for the target region of the t-th frame are denoted $U_t$, with the structure $U_t = [r1, r2, r3, r4, r5, r6]^T$. The corresponding affine transformation matrix

$$M(U_t) = \begin{bmatrix} r1 & r2 & r3 \\ r4 & r5 & r6 \\ 0 & 0 & 1 \end{bmatrix} \qquad (4)$$

has a Lie-group structure: it is an element of the affine Lie group GA(2), and ga(2) is the Lie algebra corresponding to GA(2). The matrices $G_j$ ($j = 1, \ldots, 6$) are the generators of ga(2) and form a basis of it; they are the six $3 \times 3$ matrices each containing a single 1 in one of the six affine entries and zeros elsewhere:

$$G_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \ldots, \quad G_6 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} \qquad (5)$$

For matrix Lie groups, the Riemannian distance is defined via the matrix logarithm:

$$d(X, Y) = \lVert \log(Y X^{-1}) \rVert \qquad (6)$$

where $X$ and $Y$ are elements of the matrix Lie group. For $N$ symmetric positive-definite matrices $X_q$, the intrinsic mean is defined as:

$$\mu = \arg\min_X \sum_{q=1}^{N} d(X, X_q)^2 \qquad (7)$$

where $q \in [1, N]$ is an index;
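For illustration, a minimal sketch of the Riemannian distance (6) and a Karcher-style iteration approximating the intrinsic mean (7), using SciPy's matrix logarithm and exponential (the norm choice, update scheme, and iteration count are assumptions):

```python
import numpy as np
from scipy.linalg import expm, inv, logm

def riemann_dist(X: np.ndarray, Y: np.ndarray) -> float:
    """d(X, Y) = ||log(Y X^-1)||, with the Frobenius norm as an assumption."""
    return np.linalg.norm(logm(Y @ inv(X)), ord="fro")

def intrinsic_mean(mats, iters: int = 20) -> np.ndarray:
    """Karcher-style iteration: repeatedly move the estimate along the mean
    log-map of the samples toward the minimizer of (7)."""
    mu = mats[0].copy()
    for _ in range(iters):
        tangent = sum(logm(X @ inv(mu)) for X in mats) / len(mats)
        mu = expm(tangent) @ mu
    return mu
```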
and non-maximum suppression is performed on the tracking affine frames to obtain the tracking result of the t-th frame image. Regression can yield multiple different target regions; to correctly obtain the detection with the highest accuracy, this application adopts an affine-transformation non-maximum-suppression method to screen out the final tracking result. In addition, the design of the loss function takes the affine deformation of the target into account, improving the accuracy of the predicted target position.
In current object-detection methods, non-maximum suppression (NMS) is widely used to post-process detection candidates. Since both an axis-aligned bounding box and a tilted bounding box are estimated, ordinary NMS can be performed on the axis-aligned boxes, or a tilted NMS can be performed on the affine-transformed bounding boxes; the latter becomes affine-transformation non-maximum suppression. In affine-transformation NMS, the conventional intersection-over-union (IoU) computation is modified to the IoU between two affine bounding boxes. The effect of the algorithm is shown in fig. 4, where the frames numbered 401 are the candidate tracking frames before non-maximum suppression, frame 402 is the tracking frame obtained after ordinary NMS, and frame 403 is the tracking frame obtained by the affine-transformation NMS of the present application. The tracking frame obtained by the present method is clearly more accurate.
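A minimal sketch of affine-transformation NMS with polygon IoU, using the shapely geometry library (the greedy score-ordered loop is the standard NMS scheme; the threshold value is an assumption):

```python
from shapely.geometry import Polygon

def affine_nms(quads, scores, iou_thresh: float = 0.5):
    """Greedy NMS over affine boxes given as 4-point polygons [(x, y), ...].
    Keeps the highest-scoring boxes and suppresses any box whose polygon IoU
    with an already kept box exceeds the threshold."""
    polys = [Polygon(q) for q in quads]
    order = sorted(range(len(polys)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        ok = True
        for j in keep:
            inter = polys[i].intersection(polys[j]).area
            union = polys[i].union(polys[j]).area
            if union > 0 and inter / union > iou_thresh:
                ok = False
                break
        if ok:
            keep.append(i)
    return keep
```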
Step 8: determining whether t + 1 is smaller than the total number of video frames, and if so, returning to step 2 to track the (t + 1)-th frame image. The algorithm ends once all video frames have been tracked. Some of the tracking-result frames are shown as the black frames indicated by arrows 501, 502, 503, and 504 in fig. 5.
In this method, the current target image is cropped using the affine transformation parameter information of the previous frame, narrowing the search range and improving algorithm efficiency. In addition, the cropped image is passed through the VGG-16 network to compute features before being fed to the RPN network, avoiding repeated feature-extraction computation and further improving efficiency. Moreover, in the pooling operation, kernels of different sizes and shapes are applied to give a preliminary simulation of target deformation, which helps extract the target position more accurately. The features output by the highest layer of the network serve as a semantic model, and the affine transformation result serves as a spatial model; the two complement each other, since the highest-layer features contain more semantic information but less spatial information. Furthermore, the multitask loss function described above, which includes affine-transformation parameter regression, optimizes network performance.
In the pedestrian tracking system, the first neural network is a VGG-16 network, and the second neural network is an RPN network.
In the pedestrian tracking system described above, the candidate regions obtained from the second neural network are regions of various shapes and positions where the target object may exist in the current frame. In addition, step 5 pools the features of the target candidate regions through several pooling kernels of different sizes to obtain the multiple regions of interest for the target object. For example, the kernels may comprise three sizes that give an initial description of different target deformations. Specifically, as described above, in consideration of target deformation, several kernels of different sizes are designed in the pooling layer, for example 7 × 7, 5 × 9, and 9 × 5: the first two can describe a person standing under different cameras, while 9 × 5 can describe a person bending over, and so on. Pooling kernels of other sizes can of course be designed for different application scenarios.
The foregoing description covers only the preferred embodiments of the present disclosure and the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example solutions in which the above features are replaced with (but not limited to) features of similar function disclosed in the embodiments of the present disclosure.

Claims (3)

1. A pedestrian tracking method based on affine multitask regression, characterized by comprising the following steps:
step 1: determining a target frame containing a target object in the first frame of the plurality of video frames;
step 2: determining the current target frame containing the target object in the current frame according to the determined target frame;
step 3: adjusting the determined target frame to a fixed size and inputting it into a pre-trained first neural network, obtaining a candidate feature map of the target frame in the current frame, and designing a loss function comprising an affine bounding-box parameter loss function and a rectangular bounding-box parameter loss function;
the loss function is expressed as:

$$L(p, tc, u_i, v_i) = L_c(p, tc) + \alpha_1 \sum_i L_{reg}(u_i, u_i^*) + \alpha_2 \sum_i L_{reg}(v_i, v_i^*)$$

wherein $\alpha_1$ and $\alpha_2$ are learning rates; $L_c(p, tc) = -\log p_{tc}$ is the log loss for category $tc$;

$i$ denotes the index of the regression box whose loss is being computed;

$tc$ denotes the category label, $tc = 1$ denoting the target and $tc = 0$ the background;

the variables $x, y, w, h$ denote the abscissa, ordinate, width, and height, respectively;

the parameter $v_i = (v_x, v_y, v_w, v_h)$ is the ground-truth rectangular bounding-box tuple, comprising the center-point abscissa, ordinate, width, and height; $v_i^* = (v_x^*, v_y^*, v_w^*, v_h^*)$ is the predicted target-frame tuple, comprising the center-point abscissa, ordinate, width, and height;

$u_i = (r1, r2, r3, r4, r5, r6)$ is the affine parameter tuple of the real target region, $(r1, \ldots, r6)$ being the values of the six components of the fixed affine-transformation structure of the real target region;

$u_i^* = (r1^*, r2^*, r3^*, r4^*, r5^*, r6^*)$ is the affine parameter tuple of the predicted target region, its components being the values of the six components of the fixed affine-transformation structure of the predicted target region;

$L_{reg}(u_i, u_i^*)$ denotes the affine bounding-box parameter loss function;

$L_{reg}(v_i, v_i^*)$ denotes the rectangular bounding-box parameter loss function;

letting $(w, w^*)$ denote $(u_i, u_i^*)$ or $(v_i, v_i^*)$, both are defined as:

$$L_{reg}(w, w^*) = \sum_j \mathrm{smooth}_{L1}(w_j - w_j^*), \qquad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

wherein x is a real number;
step 4: inputting the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions;
step 5: pooling the features of the target candidate regions to obtain a plurality of regions of interest for the target object;
step 6: applying a fully connected operation to the features of the regions of interest and separating target from background, thereby obtaining a plurality of tracking affine frames of the target object;
step 7: performing non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame, namely the affine parameters, tracking frame, and center-point coordinates;
step 8: judging whether the frame number of the next frame is smaller than the total number of video frames; if not, ending directly; if so, returning to step 2 to track the next frame, until all video frames have been tracked;
the pedestrian tracking method based on affine multitask regression is implemented on the following pedestrian tracking system, which comprises a memory and a processor;
the memory is used for storing computer-executable instructions;
the processor is configured to execute the executable instructions to: initialize affine parameters by determining a target frame containing a target object in the previous frame of a plurality of video frames; determine the current target frame containing the target object in the current frame from the determined target frame; input the current target frame into a pre-trained first neural network to obtain a candidate feature map of the target frame; input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; apply a fully connected operation to the features of the regions of interest and separate target from background, thereby obtaining a plurality of tracking affine frames of the target object; and perform non-maximum suppression on the tracking affine frames to obtain the tracking result of the target object in the current frame, namely the affine parameters, tracking frame, and center-point coordinates.
2. The pedestrian tracking method based on affine multitask regression of claim 1, wherein the first neural network is a VGG-16 network and the second neural network is an RPN network.
3. The pedestrian tracking method based on affine multitask regression according to claim 1, wherein said step 7 specifically comprises:
step 7.1: scoring the features corresponding to each tracking affine frame and comparing the scores to obtain the target/background scores of the compared target regions;
step 7.2: judging a region whose score exceeds a given threshold to be a target region, and otherwise a background region;
step 7.3: performing non-maximum suppression on the features of the determined target regions to obtain the tracking result of the target object in the current frame.
CN202010118387.6A 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression Active CN111428567B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010118387.6A CN111428567B (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010118387.6A CN111428567B (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression

Publications (2)

Publication Number Publication Date
CN111428567A CN111428567A (en) 2020-07-17
CN111428567B (en) 2024-02-02

Family

ID=71547182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010118387.6A Active CN111428567B (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method based on affine multitask regression

Country Status (1)

Country Link
CN (1) CN111428567B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112055172B (en) * 2020-08-19 2022-04-19 浙江大华技术股份有限公司 Method and device for processing monitoring video and storage medium
WO2022133911A1 (en) * 2020-12-24 2022-06-30 深圳市大疆创新科技有限公司 Target detection method and apparatus, movable platform, and computer-readable storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093480A (en) * 2013-01-15 2013-05-08 沈阳大学 Particle filtering video image tracking method based on dual model
CN105389832A (en) * 2015-11-20 2016-03-09 沈阳大学 Video object tracking method based on Grassmann manifold and projection group
CN106683091A (en) * 2017-01-06 2017-05-17 北京理工大学 Target classification and attitude detection method based on depth convolution neural network
US9946960B1 (en) * 2017-10-13 2018-04-17 StradVision, Inc. Method for acquiring bounding box corresponding to an object in an image by using convolutional neural network including tracking network and computing device using the same
CN108280855A (en) * 2018-01-13 2018-07-13 福州大学 A kind of insulator breakdown detection method based on Fast R-CNN
CN109255351A (en) * 2018-09-05 2019-01-22 华南理工大学 Bounding box homing method, system, equipment and medium based on Three dimensional convolution neural network
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolution gating cycle neural unit
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN110399808A (en) * 2019-07-05 2019-11-01 桂林安维科技有限公司 A kind of Human bodys' response method and system based on multiple target tracking
CN110458864A (en) * 2019-07-02 2019-11-15 南京邮电大学 Based on the method for tracking target and target tracker for integrating semantic knowledge and example aspects
CN110781350A (en) * 2019-09-26 2020-02-11 武汉大学 Pedestrian retrieval method and system oriented to full-picture monitoring scene

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093480A (en) * 2013-01-15 2013-05-08 沈阳大学 Particle filtering video image tracking method based on dual model
CN105389832A (en) * 2015-11-20 2016-03-09 沈阳大学 Video object tracking method based on Grassmann manifold and projection group
CN106683091A (en) * 2017-01-06 2017-05-17 北京理工大学 Target classification and attitude detection method based on depth convolution neural network
US9946960B1 (en) * 2017-10-13 2018-04-17 StradVision, Inc. Method for acquiring bounding box corresponding to an object in an image by using convolutional neural network including tracking network and computing device using the same
CN108280855A (en) * 2018-01-13 2018-07-13 福州大学 A kind of insulator breakdown detection method based on Fast R-CNN
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109255351A (en) * 2018-09-05 2019-01-22 华南理工大学 Bounding box homing method, system, equipment and medium based on Three dimensional convolution neural network
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolution gating cycle neural unit
CN110458864A (en) * 2019-07-02 2019-11-15 南京邮电大学 Based on the method for tracking target and target tracker for integrating semantic knowledge and example aspects
CN110399808A (en) * 2019-07-05 2019-11-01 桂林安维科技有限公司 A kind of Human bodys' response method and system based on multiple target tracking
CN110781350A (en) * 2019-09-26 2020-02-11 武汉大学 Pedestrian retrieval method and system oriented to full-picture monitoring scene

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Target tracking based on Grassmann manifolds and projection groups; 谢英红, 庞彦伟, 韩晓微, 田丹; Chinese Journal of Scientific Instrument, No. 05; full text *
Robust visual tracking based on convolutional neural networks and conformal predictors; 高琳, 王俊峰, 范勇, 陈念年; Acta Optica Sinica, Vol. 37, No. 8; full text *
Efficient visual target tracking algorithm based on deep spectral convolutional neural networks; 郭强, 芦晓红, 谢英红, 孙鹏; Infrared and Laser Engineering, No. 06; full text *

Also Published As

Publication number Publication date
CN111428567A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
US11436739B2 (en) Method, apparatus, and storage medium for processing video image
CN111401201B (en) Aerial image multi-scale target detection method based on spatial pyramid attention drive
Lee et al. Simultaneous traffic sign detection and boundary estimation using convolutional neural network
CN109961009B (en) Pedestrian detection method, system, device and storage medium based on deep learning
CN110059558B (en) Orchard obstacle real-time detection method based on improved SSD network
CN111797893B (en) Neural network training method, image classification system and related equipment
CN109035304B (en) Target tracking method, medium, computing device and apparatus
CN111401516B (en) Searching method for neural network channel parameters and related equipment
CN112926410B (en) Target tracking method, device, storage medium and intelligent video system
US11948340B2 (en) Detecting objects in video frames using similarity detectors
JP2018523877A (en) System and method for object tracking
CN104424634A (en) Object tracking method and device
CN110910445B (en) Object size detection method, device, detection equipment and storage medium
US20230137337A1 (en) Enhanced machine learning model for joint detection and multi person pose estimation
CN111368634B (en) Human head detection method, system and storage medium based on neural network
CN111931764A (en) Target detection method, target detection framework and related equipment
CN111428566B (en) Deformation target tracking system and method
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111428567B (en) Pedestrian tracking system and method based on affine multitask regression
CN111401143A (en) Pedestrian tracking system and method
CN110490058B (en) Training method, device and system of pedestrian detection model and computer readable medium
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
Fan et al. Covered vehicle detection in autonomous driving based on faster rcnn
CN117372928A (en) Video target detection method and device and related equipment
CN116453109A (en) 3D target detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant