CN111401143A - Pedestrian tracking system and method - Google Patents

Pedestrian tracking system and method

Info

Publication number
CN111401143A
CN111401143A
Authority
CN
China
Prior art keywords
target
frame
tracking
target object
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010118386.1A
Other languages
Chinese (zh)
Inventor
谢英红
李路
韩晓微
涂斌斌
李华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang University
Original Assignee
Shenyang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang University filed Critical Shenyang University
Priority to CN202010118386.1A priority Critical patent/CN111401143A/en
Publication of CN111401143A publication Critical patent/CN111401143A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pedestrian tracking system and method, relating to the technical field of computer vision. A target frame containing a target object is determined in the first of a plurality of video frames. For each subsequent frame, a current target frame containing the target object is determined in the current frame from the previously determined target frame; the current target frame is input into a pre-trained VGG-16 network to obtain a candidate feature map of the target frame in the image; the candidate feature map is input into a pre-trained RPN network to obtain a plurality of target candidate regions; the features of the target candidate regions are pooled with a plurality of convolution kernels of different sizes to obtain a plurality of regions of interest for the target object; a fully connected operation is performed on the features of the regions of interest to distinguish the target from the background and obtain a plurality of tracking affine frames of the target object; and non-maximum suppression is performed on the tracking affine frames to obtain the tracking result for the target object in the current frame.

Description

Pedestrian tracking system and method
Technical Field
The invention relates to the technical field of computer vision, in particular to a pedestrian tracking system and a pedestrian tracking method.
Background
Pedestrian tracking recognizes and tracks pedestrian targets in video and images by means of computer vision techniques. Pedestrian recognition and tracking is treated as a key research topic by many countries because the technology is advanced and widely applicable: in national defense it can serve battlefield reconnaissance, target tracking, and precision guidance; in urban traffic it can serve intelligent transportation, violation detection, and autonomous driving; and in public security it can serve crowd-flow monitoring.
Many pedestrian tracking methods and apparatus are disclosed in the prior art. Although these systems and methods employ many popular neural network techniques, none provides a dedicated solution for accurately locating a deformed target.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a pedestrian tracking system and a pedestrian tracking method.
The technical scheme adopted by the invention is as follows:
in one aspect, the present invention provides a pedestrian tracking system comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is to execute the executable instructions to determine that a first frame of a plurality of video frames includes a target box of a target object; for a subsequent frame except the first frame in the plurality of video frames, determining a current target frame including a target object in the current frame according to the determined target frame; inputting the current target frame into a pre-trained VGG-16 network, and acquiring a candidate feature map of the target frame in the image; inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions; performing pooling operation on the characteristics of the target candidate regions through a plurality of convolution kernels with different sizes to obtain a plurality of interested regions aiming at the target object; performing full-link operation on the characteristics of the multiple interested areas, distinguishing a target from a background, and obtaining multiple tracking affine frames of a target object; and performing non-maximum suppression on the plurality of tracking affine frames to obtain a tracking result of the target object of the current frame.
On the other hand, the invention also provides a pedestrian tracking method, which is realized by adopting the pedestrian tracking system and comprises the following steps:
step 1: determining that a first frame of the plurality of video frames includes a target frame of a target object;
step 2: for the subsequent frames except the first frame, determining a current target frame including the target object in the current frame according to the determined target frame;
and step 3: and adjusting the determined target frame into a fixed size, inputting the fixed size into a pre-trained VGG-16 network, acquiring a candidate feature map of the target frame in the current frame, and designing a loss function.
Step 4: inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions;
the target candidate region is a region in which a plurality of shapes and positions of a target object in the current frame exist simultaneously.
Step 5: performing a pooling operation on the features of the target candidate regions with a plurality of convolution kernels of different sizes to obtain a plurality of regions of interest for the target object;
the plurality of convolution kernels of different sizes includes three convolution kernels for roughly describing different deformations of the target object.
Step 6: performing a fully connected operation on the features of the multiple regions of interest and distinguishing the target from the background to obtain multiple tracking affine frames of the target object, then comparing the tracking affine frames with a reference target frame to obtain the affine tracking frame with the largest overlapping area;
and 7: performing non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object of the current frame;
and 8: and (3) judging whether the number of the next frame of the current image is less than the total frame number of the video, if not, directly finishing, if so, returning to the step (2), and tracking the next frame of the image until all the frames of the video are tracked.
The beneficial effects of the above technical solution are as follows:
the method and the device utilize the affine transformation parameter information of the previous frame image to cut the current target image, reduce the search range and improve the algorithm efficiency. In addition, during the pooling operation, convolution kernels with different sizes and shapes are applied to preliminarily simulate the deformation of the target, and the target position can be accurately extracted.
Drawings
FIG. 1 is a block diagram of an implementation of an embodiment of the invention using a computer architecture.
Fig. 2 is a flowchart of a pedestrian tracking algorithm according to an embodiment of the present invention.
FIG. 3 is a schematic block diagram of a process flow of an embodiment of the present invention.
Fig. 4 is a comparison graph of the effects of the horizontal NMS and the affine transformation NMS of the embodiment of the present invention.
FIG. 5 is a graph of the tracking results of the embodiment of the present invention.
Fig. 6 shows a network structure of VGG-16 according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
In one aspect, the present invention provides a pedestrian tracking system comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is to execute the executable instructions to determine that a first frame of a plurality of video frames includes a target box of a target object; for a subsequent frame except the first frame in the plurality of video frames, determining a current target frame including a target object in the current frame according to the determined target frame; inputting the current target frame into a pre-trained VGG-16 network, and acquiring a candidate feature map of the target frame in the image; inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions; performing pooling operation on the characteristics of the target candidate regions through a plurality of convolution kernels with different sizes to obtain a plurality of interested regions aiming at the target object; performing full-link operation on the characteristics of the multiple interested areas, distinguishing a target from a background, and obtaining multiple tracking affine frames of a target object; and performing non-maximum suppression on the plurality of tracking affine frames to obtain a tracking result of the target object of the current frame.
As shown in fig. 1, a schematic diagram of an electronic system 600 suitable for implementing embodiments of the present disclosure is shown. The electronic system in fig. 1 is only an example and places no limitation on the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 1, electronic system 600 may include a processing device (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following may be connected to the I/O interface 605: input devices 606 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; output devices 607 such as a liquid crystal display (LCD), speaker, or vibrator; storage devices 608 such as magnetic tape or hard disk; and communication devices 609, which allow the electronic system 600 to communicate with other devices, wirelessly or by wire, to exchange data.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer-readable medium may be embodied in the electronic system (also referred to herein as an "affine multi-task regression-based pedestrian tracking system"), or it may exist separately without being assembled into the electronic system. The computer-readable medium carries one or more programs which, when executed by the electronic system, cause the electronic system to: 1) determine that a previous frame of the plurality of video frames comprises a target frame of a target object; 2) determine a current target frame including the target object in the current frame according to the determined target frame; 3) input the current target frame into a pre-trained first neural network and acquire a candidate feature map of the target frame in the current frame; 4) input the candidate feature map into a pre-trained second neural network to obtain a plurality of target candidate regions; 5) pool the features of the target candidate regions to obtain a plurality of regions of interest for the target object; 6) perform a fully connected operation on the features of the regions of interest to distinguish the target from the background and obtain a plurality of tracking affine frames of the target object; and 7) perform non-maximum suppression on the tracking affine frames to obtain the tracking result for the target object in the current frame.
On the other hand, the invention also provides a pedestrian tracking method based on affine multitask regression, as shown in fig. 2, implemented with the affine multitask regression-based pedestrian tracking system and comprising the following steps:
step 1: determining that a first frame of the plurality of video frames includes a target frame of a target object;
the method comprises the steps of initializing the size of an original image, setting the size of the original image to be m × n (unit: pixel), manually marking the position of a target frame of the frame when t =1, marking the central position of the target frame as (cx, cy), wherein t represents the image of the t-th frame, t is a positive integer, cx and cy are the horizontal and vertical coordinates of the central position of the target frame respectively, and the target frame comprises an object to be tracked, such as the object 301 in FIG. 3.
Initializing the affine transformation parameters: U1 = [r1, r2, r3, r4, r5, r6]^T.
step 2: determining a current target frame including the target object in the current frame according to the determined target frame;
for example, assuming that two side lengths of a circumscribed rectangle of the target frame in the t-1 frame are marked as a, b, on the t-1 frame image, a picture of size (2 a) × (2 b), such as a rectangular frame marked as 302 in fig. 3, is cut out centering on the target center point (cx, cy) of the t-1 frame, in the present application, the purpose of centering on the center point of the target of the previous frame is to make the cut-out picture include target information, because the coordinates of the center point of the target of two adjacent frames do not change much, and as long as the coordinates of the center point of the target of two adjacent frames do not change much, the target to be tracked can be included in the cut-out sub-picture as long as the target of a sufficient size is cut out at a position near the target center point.
Step 3: adjusting the determined target frame to a fixed size, inputting it into a pre-trained VGG-16 network, acquiring a candidate feature map of the target frame in the current frame, and designing a loss function.
The cropped target frame is resized to a fixed size and fed into a pre-trained neural network, for example the VGG-16 network, and the feature map output after the fifth convolution stage of the network is taken as the candidate feature map of the target frame in the image, as indicated by reference numeral 303 in fig. 3.
In this embodiment, accuracy and running efficiency are balanced by implementing the various embodiments of the present application with the classic VGG-16 structure. Fig. 6 shows an exemplary VGG-16 network, comprising 13 convolutional layers (201) and 3 fully connected layers (203). The convolutional layers use 3 × 3 filters with stride 1. Assuming the network input has size m × n × 3 (m and n positive integers), the input matrix is padded with one ring of zeros, so its first two dimensions become (m + 2) × (n + 2); after the 3 × 3 convolution, the first two dimensions of the feature matrix are again m × n, the same as the input. Down-sampling is performed by 2 × 2 max-pooling layers (202) with stride 2, which halve the spatial dimensions, and the convolution and pooling stages are repeated with increasing numbers of filters (e.g., up to 256 and beyond). Each convolution result is passed through a ReLU activation function, and the activated feature maps of the final stage are fed into the fully connected layers, whose output constitutes the feature representation used by the subsequent modules.
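As a sketch of this feature-extraction step, the following uses torchvision's pretrained VGG-16 in place of the patent's own trained network (an assumption), with 224 × 224 as one plausible choice of the "fixed size"; conv5_feature_map is an illustrative name:

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# The `features` module holds the 13 convolutional layers with their
# ReLU activations and max-pooling stages (201/202 above).
vgg = models.vgg16(pretrained=True).features.eval()

def conv5_feature_map(patch, size=(224, 224)):
    """Resize the cropped patch (a PIL image) to a fixed size and return
    the feature map after the last convolution stage."""
    x = TF.to_tensor(TF.resize(patch, size)).unsqueeze(0)
    with torch.no_grad():
        return vgg(x)  # shape (1, 512, 7, 7) for a 224 x 224 input
```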
After the VGG-16 network is constructed, it is trained: training samples are fed into the network and the outputs are compared with standard (ground-truth) data on a test data set, and training continues until a predetermined performance threshold (e.g., 98%) is reached on that comparison.
In addition, the loss must first be calculated and regressed to optimize the affine transformation parameters. The loss function for the whole VGG-16 network described above is designed as:

L(p, tc, v, v*, u, u*) = Lc(p, tc) + α1 · tc · Lr(v, v*) + α2 · tc · La(u, u*)  (1)

wherein α1 and α2 are learning rates (weighting coefficients); p is the predicted class probability, and Lc(p, tc) is the logarithmic loss for class tc, shown in equation (2):

Lc(p, tc) = -log p_tc  (2)

i denotes the index of the regression box whose loss is being calculated; tc is the class label, for example tc = 1 represents the target and tc = 0 represents the background; the subscripts x, y, w, and h denote the abscissa, ordinate, width, and height, respectively.

The parameters vi = (vx, vy, vw, vh) form the ground-truth rectangular bounding-box tuple, comprising the center abscissa, center ordinate, width, and height; vi* = (vx*, vy*, vw*, vh*) is the predicted target-frame tuple with the same components. ui = (r1, r2, r3, r4, r5, r6) is the affine parameter tuple of the real target region, i.e., the values of the six components of its fixed affine transformation structure; ui* = (r1*, r2*, r3*, r4*, r5*, r6*) is the corresponding tuple predicted for the target region. La denotes the affine bounding-box parameter loss function and Lr the rectangular bounding-box parameter loss function.

Letting (w, w*) represent either (vi, vi*) or (ui, ui*), both losses are defined as:

L(w, w*) = Σj smoothL1(wj − wj*)  (3)

smoothL1(x) = 0.5 x², if |x| < 1; |x| − 0.5, otherwise  (4)

wherein x is a real number.
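A sketch of equations (1)-(4) in PyTorch; the tensor shapes and the default weights a1 = a2 = 1 are assumptions for illustration:

```python
import torch

def smooth_l1(x):
    """Equation (4): 0.5*x^2 where |x| < 1, |x| - 0.5 otherwise."""
    ax = x.abs()
    return torch.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(p, tc, v_pred, v_true, u_pred, u_true, a1=1.0, a2=1.0):
    """Equation (1): class log loss plus smooth-L1 regression over the
    rectangle tuple v and the affine tuple u; the regression terms only
    contribute for target samples (tc == 1)."""
    l_cls = -torch.log(p[tc])                       # eq. (2)
    l_rect = smooth_l1(v_pred - v_true).sum()       # eq. (3), w = v
    l_aff = smooth_l1(u_pred - u_true).sum()        # eq. (3), w = u
    return l_cls + tc * (a1 * l_rect + a2 * l_aff)  # eq. (1)
```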
Step 4: inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions;
the target candidate region is a region in which a plurality of shapes and positions of a target object in the current frame exist simultaneously.
The feature map obtained from the neural network is input into an RPN (Region Proposal Network), and a large number of candidate regions for the target, for example 2000, are extracted, as indicated by reference numeral 304 in fig. 3. Unlike the VGG-16 network, the RPN is a network that generates many candidate regions of different sizes. The candidate regions cover the various shapes and positions at which the target may exist in the current frame. The method thus estimates in advance a set of regions where the target may lie and then performs optimizing regression on them, so that more accurate tracking regions can be screened out.
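The patent does not give the RPN's internals; as a rough illustration of what a region proposal stage does, the sketch below lays an anchor grid over the feature map (the stride, scales, and ratios are assumed values), which the trained RPN would then score and refine into the roughly 2000 candidate regions:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """One (cx, cy, w, h) anchor per scale/ratio at every feature cell."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    anchors.append((cx, cy, s * r ** 0.5, s / r ** 0.5))
    return np.array(anchors)
```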
Step 5: performing a pooling operation on the features of the target candidate regions with a plurality of convolution kernels of different sizes to obtain a plurality of regions of interest for the target object;
the plurality of convolution kernels of different sizes includes three convolution kernels for roughly describing different deformations of the target object.
The features of these candidate regions of different sizes are pooled to obtain a plurality of regions of interest (ROIs) for the target object, for example as shown by reference numeral 305 in fig. 3. To account for target deformation, several convolution kernels of different sizes are designed in the pooling layer, for example three kernels of 7 × 7, 5 × 9, and 9 × 5. Multiple differently shaped pooling kernels can roughly describe the deformation of the target: 7 × 7 and 5 × 9 can describe a person standing under different cameras, while 9 × 5 can describe a bending action, and so on. Of course, pooling kernels of other sizes can be designed for different application scenarios.
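One plausible reading of the three differently shaped pooling kernels is pooling each candidate region's features to the three output grids 7 × 7, 5 × 9, and 9 × 5; a sketch with adaptive max-pooling (the function name and this reading are assumptions):

```python
import torch.nn.functional as F

POOL_SHAPES = [(7, 7), (5, 9), (9, 5)]  # upright / tall-narrow / bent postures

def multi_shape_pool(roi_feat):
    """Pool one candidate region's feature map (C x H x W) with the three
    differently shaped kernels, one output per deformation mode."""
    return [F.adaptive_max_pool2d(roi_feat, shape) for shape in POOL_SHAPES]
```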
Step 6: performing a fully connected operation on the features of the multiple regions of interest and distinguishing the target from the background to obtain multiple tracking affine frames of the target object, then comparing the tracking affine frames with a reference target frame to obtain the affine tracking frame with the largest overlapping area;
the result of the pooling, i.e. the features of the multiple regions of interest (ROIs), is subjected to a full join operation. Here, the full linking operation is to concatenate a plurality of ROI features in sequence. Such as that indicated by reference numeral 306 in fig. 3. Then, the series-connected features are subjected to score comparison by using a softmax function, and the score of the target/background result of the compared target area is obtained. For example, a region with a score greater than a certain threshold is determined as a target region, otherwise, the region is a background region.
Step 7: performing non-maximum suppression on the tracking affine frames to obtain the tracking result for the target object in the current frame;
the obtained affine area determined as the target area is subjected to non-local maximum suppression (for example, as indicated by reference numeral 308 in fig. 3), and a tracking result of the t-th frame image, that is, a corresponding affine parameter and a frame are obtained. Such as that shown at reference numeral 309 in fig. 3. In one embodiment, the multiple tracked affine frames may be compared with a reference target frame (i.e., a target frame tracked in a previous frame), and an affine tracking frame with a largest overlapping area is obtained as a final tracking result.
Affine transformation is used here to represent the geometric deformation of the target. The affine transformation parameters of the tracking result for the target region of frame t are written U_t, with the structure U_t = [r1, r2, r3, r4, r5, r6]^T. The corresponding affine transformation matrix M(U_t) has a Lie group structure: ga(2) is the Lie algebra corresponding to the affine Lie group GA(2), and the matrices Gj (j = 1, …, 6) are the generators of GA(2) and form a basis of ga(2). The generators of GA(2) are the six 3 × 3 matrices each having a single entry 1 in one of the six free positions of the top two rows and 0 elsewhere, so that:

M(U_t) = exp(Σj rj · Gj)  (5)

For Lie group matrices, the Riemannian distance is defined through the matrix logarithm:

ρ(X, Y) = ||log(X⁻¹ Y)||  (6)

where X and Y are elements of the Lie group. Given N symmetric positive definite matrices X_q (q = 1, …, N), their intrinsic mean is defined as:

μ = exp((1/N) Σ_{q=1..N} log X_q)  (7)
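A sketch of equations (5)-(7) with SciPy; mapping the six parameters onto the top two rows of a 3 × 3 matrix reflects the elementary-generator basis above, and the closed-form log-Euclidean mean is one common realization of the intrinsic mean (both are assumptions, since the patent does not spell out the computation):

```python
import numpy as np
from scipy.linalg import expm, logm

def affine_matrix(u):
    """Equation (5): exponential map of sum_j r_j * G_j for U = [r1..r6]."""
    A = np.zeros((3, 3))
    A[:2, :] = np.asarray(u, dtype=float).reshape(2, 3)  # six free entries
    return expm(A)

def riemann_distance(X, Y):
    """Equation (6): norm of the matrix logarithm of X^-1 Y."""
    return np.linalg.norm(logm(np.linalg.inv(X) @ Y), 'fro')

def intrinsic_mean(mats):
    """Equation (7): exp of the average matrix logarithm of the X_q."""
    return expm(sum(logm(X) for X in mats) / len(mats))
```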
and carrying out non-maximum suppression on the tracking affine frames to obtain a tracking result of the t frame image. A plurality of different target areas can be obtained through regression, and in order to obtain a detection algorithm with the highest accuracy correctly, an affine transformation non-maximum suppression method is adopted to screen out the final tracking result. In addition, the loss function is designed, the affine deformation of the target is taken into consideration, and the accuracy of predicting the position of the target is improved.
In current object detection methods, non-maximum suppression (NMS) is widely used to post-process detection candidates. The present method can estimate both axis-aligned bounding boxes and tilted bounding boxes, applying normal NMS to the former and tilted NMS to the affine-transformed bounding boxes. In affine-transformation non-maximum suppression, the conventional intersection-over-union (IoU) computation is modified to the IoU between the two affine bounding boxes. The effect of the algorithm is shown in fig. 4: the frames numbered 401 are candidate frames before non-maximum suppression, the frame numbered 402 is the frame obtained after normal NMS, and the frame numbered 403 is the frame obtained by the affine-transformation non-maximum suppression of the present application. The tracking frame obtained by the present method is visibly more accurate.
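A sketch of affine-transformation NMS, computing the IoU between tilted boxes as polygons (shapely is an assumed dependency; boxes are given as 4 × 2 corner arrays):

```python
import numpy as np
from shapely.geometry import Polygon

def affine_iou(corners_a, corners_b):
    """IoU between two (possibly tilted) boxes given by their corners."""
    pa, pb = Polygon(corners_a), Polygon(corners_b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter + 1e-9)

def affine_nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS, identical to the normal version except that the
    overlap test uses the polygon IoU above."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if affine_iou(boxes[i], boxes[j]) < iou_thr]
    return keep
```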
Step 8: determine whether the index t + 1 is less than the total number of frames in the video; if so, return to step 2 to track the (t+1)-th frame image. The algorithm ends once all video frames have been tracked. Some of the resulting tracking frames are shown as the black frames indicated by arrows 501, 502, 503, and 504 in fig. 5.
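Putting steps 1-8 together, a skeleton of the frame loop (run_network_and_nms is a hypothetical helper standing in for steps 4-7, the target state is assumed to be stored as (cx, cy, a, b), and conversion between array and PIL image is omitted):

```python
def track_video(frames, init_state):
    """Step 1: manual target frame on frame 0; then loop over frames."""
    states = [init_state]                     # (cx, cy, a, b) per frame
    for t in range(1, len(frames)):           # step 8: stop at the last frame
        cx, cy, a, b = states[-1]
        patch = crop_search_region(frames[t], cx, cy, a, b)  # step 2
        feats = conv5_feature_map(patch)                     # step 3
        states.append(run_network_and_nms(feats))            # steps 4-7
    return states
```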
According to the method and device, the current target image is cropped using the affine transformation parameters of the previous frame, which narrows the search range and improves algorithm efficiency. In addition, the cropped image is fed into the VGG-16 network to compute features before being passed to the RPN network, avoiding repeated feature extraction and further improving efficiency. During the pooling operation, convolution kernels of different sizes and shapes preliminarily model the deformation of the target, so the target position can be extracted accurately. In the present application, the features output by the highest network layer serve as the semantic model and the affine transformation results serve as the spatial model; the two are complementary, because the highest-layer features contain more semantic information but less spatial information. Furthermore, the multi-task loss function above, which includes affine transformation parameter regression, optimizes network performance.
In the above pedestrian tracking system, the candidate regions obtained from the RPN network are regions of the various shapes and positions at which the target object may exist in the current frame; furthermore, step 5 pools the features of the multiple target candidate regions with multiple convolution kernels of different sizes to obtain multiple regions of interest for the target object.
The foregoing description covers only the preferred embodiments of the present disclosure and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above features; it also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example solutions in which the above features are interchanged with technical features of similar function disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (4)

1. A pedestrian tracking system, characterized by: comprising a memory and a processor;
the memory is used for storing computer executable instructions;
the processor is to execute the executable instructions to determine that a first frame of a plurality of video frames includes a target box of a target object; for a subsequent frame except the first frame in the plurality of video frames, determining a current target frame including a target object in the current frame according to the determined target frame; inputting the current target frame into a pre-trained VGG-16 network, and acquiring a candidate feature map of the target frame in the image; inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions; performing pooling operation on the characteristics of the target candidate regions through a plurality of convolution kernels with different sizes to obtain a plurality of interested regions aiming at the target object; performing full-link operation on the characteristics of the multiple interested areas, distinguishing a target from a background, and obtaining multiple tracking affine frames of a target object; and performing non-maximum suppression on the plurality of tracking affine frames to obtain a tracking result of the target object of the current frame.
2. A pedestrian tracking method implemented by the pedestrian tracking system of claim 1, comprising the steps of:
step 1: determining that a first frame of the plurality of video frames includes a target frame of a target object;
step 2: for the subsequent frames except the first frame, determining a current target frame including the target object in the current frame according to the determined target frame;
and step 3: adjusting the determined target frame into a fixed size, inputting the fixed size into a pre-trained VGG-16 network, acquiring a candidate feature map of the target frame in the current frame, and designing a loss function;
and 4, step 4: inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions;
and 5: performing pooling operation on the characteristics of the target candidate regions through a plurality of convolution kernels with different sizes to obtain a plurality of interested regions aiming at the target object;
step 6: performing full-link operation on the features of the multiple regions of interest, distinguishing a target from a background, comparing the multiple tracking affine frames with a reference target frame to obtain an affine tracking frame with the largest overlapping area, and thus obtaining multiple tracking affine frames of the target object;
and 7: performing non-maximum suppression on the tracking affine frames to obtain a tracking result of the target object of the current frame;
and 8: and (3) judging whether the number of the next frame of the current image is less than the total frame number of the video, if not, directly finishing, if so, returning to the step (2), and tracking the next frame of the image until all the frames of the video are tracked.
3. The pedestrian tracking method according to claim 2, wherein the target candidate regions in step 4 are regions of the various shapes and positions at which the target object may exist in the current frame.
4. A pedestrian tracking method according to claim 2, wherein said plurality of convolution kernels of different sizes in step 5 comprises three convolution kernels for roughly describing different deformations of said target object.
CN202010118386.1A 2020-02-26 2020-02-26 Pedestrian tracking system and method Pending CN111401143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010118386.1A CN111401143A (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010118386.1A CN111401143A (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method

Publications (1)

Publication Number Publication Date
CN111401143A true CN111401143A (en) 2020-07-10

Family

ID=71430460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010118386.1A Pending CN111401143A (en) 2020-02-26 2020-02-26 Pedestrian tracking system and method

Country Status (1)

Country Link
CN (1) CN111401143A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111754541A (en) * 2020-07-29 2020-10-09 腾讯科技(深圳)有限公司 Target tracking method, device, equipment and readable storage medium
CN111915647A (en) * 2020-07-16 2020-11-10 郑州轻工业大学 Object label guided self-adaptive video target tracking method
CN114253253A (en) * 2020-09-24 2022-03-29 科沃斯商用机器人有限公司 Target identification method and device based on artificial intelligence and robot

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915647A (en) * 2020-07-16 2020-11-10 郑州轻工业大学 Object label guided self-adaptive video target tracking method
CN111915647B (en) * 2020-07-16 2021-08-13 郑州轻工业大学 Object label guided self-adaptive video target tracking method
CN111754541A (en) * 2020-07-29 2020-10-09 腾讯科技(深圳)有限公司 Target tracking method, device, equipment and readable storage medium
CN111754541B (en) * 2020-07-29 2023-09-19 腾讯科技(深圳)有限公司 Target tracking method, device, equipment and readable storage medium
CN114253253A (en) * 2020-09-24 2022-03-29 科沃斯商用机器人有限公司 Target identification method and device based on artificial intelligence and robot

Similar Documents

Publication Publication Date Title
US11475660B2 (en) Method and system for facilitating recognition of vehicle parts based on a neural network
US10755120B2 (en) End-to-end lightweight method and apparatus for license plate recognition
CN111709416B (en) License plate positioning method, device, system and storage medium
CN107529650B (en) Closed loop detection method and device and computer equipment
CN109035304B (en) Target tracking method, medium, computing device and apparatus
CN111401143A (en) Pedestrian tracking system and method
CN113936302B (en) Training method and device for pedestrian re-recognition model, computing equipment and storage medium
CN111401196A (en) Method, computer device and computer readable storage medium for self-adaptive face clustering in limited space
CN111428566B (en) Deformation target tracking system and method
CN113793297A (en) Pose determination method and device, electronic equipment and readable storage medium
CN113537070B (en) Detection method, detection device, electronic equipment and storage medium
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN115546705B (en) Target identification method, terminal device and storage medium
CN115147598A (en) Target detection segmentation method and device, intelligent terminal and storage medium
CN111428567B (en) Pedestrian tracking system and method based on affine multitask regression
US11420623B2 (en) Systems for determining object importance in on-road driving scenarios and methods thereof
CN111160282B (en) Traffic light detection method based on binary Yolov3 network
CN115761646B (en) Pedestrian tracking method, equipment and storage medium for industrial park
CN114120259A (en) Empty parking space identification method and system, computer equipment and storage medium
CN112487927A (en) Indoor scene recognition implementation method and system based on object associated attention
CN111353464B (en) Object detection model training and object detection method and device
CN116958954B (en) License plate recognition method, device and storage medium based on key points and bypass correction
CN115661556B (en) Image processing method and device, electronic equipment and storage medium
CN113838085A (en) Catering tax source monitoring target tracking algorithm and system
CN116612454A (en) Vehicle image processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination