WO2020151166A1 - Multi-target tracking method and device, computer device, and readable storage medium - Google Patents

Multi-target tracking method and device, computer device, and readable storage medium (多目标跟踪方法、装置、计算机装置及可读存储介质)

Info

Publication number: WO2020151166A1
Authority: WIPO (PCT)
Prior art keywords: target, target frame, frame, filtered, image
Application number: PCT/CN2019/091158
Other languages: English (en), French (fr)
Inventor: 杨国青
Original Assignee: 平安科技(深圳)有限公司
Priority date note: The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.
Application filed by 平安科技(深圳)有限公司
Publication of WO2020151166A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments

Definitions

  • This application relates to the field of image processing technology, and in particular to a multi-target tracking method, device, computer device and non-volatile readable storage medium.
  • Multi-target tracking refers to tracking multiple moving objects (such as cars and pedestrians in traffic videos) in a video or image sequence to obtain the position of the moving object in each frame.
  • Multi-target tracking has a wide range of applications in video surveillance, autonomous driving, and video entertainment.
  • Current multi-target tracking mainly adopts the track-by-detection architecture.
  • In this architecture, a detector detects the position information of each target on every frame of the video or image sequence, and the target position information of the current frame is then matched against the target position information of the previous frame. If the detector's accuracy is low, there are many false detections, or the detection boxes deviate too much from the true boxes, tracking accuracy degrades directly, and targets are tracked incorrectly or lost.
  • The first aspect of the present application provides a multi-target tracking method, the method including:
  • using a target detector to detect targets of a predetermined type in an image to obtain target frames of the predetermined type of target;
  • using a target classifier to score the target frames to obtain scores indicating that the target frames belong to a specified target;
  • deleting target frames whose scores are lower than a preset threshold to obtain filtered target frames;
  • using a feature extractor to extract features of the filtered target frames to obtain feature vectors of the filtered target frames;
  • matching the filtered target frames against each target frame of the previous frame of the image according to the feature vectors to obtain updated target frames.
  • a second aspect of the present application provides a multi-target tracking device, the device including:
  • the detection module is configured to use a target detector to detect a predetermined type of target in the image to obtain a target frame of the predetermined type of target;
  • a scoring module for scoring the target frame using a target classifier to obtain a score that the target frame belongs to a designated target;
  • a deleting module configured to delete a target frame whose score is lower than a preset threshold in the target frame to obtain a filtered target frame
  • An extraction module for extracting the features of the filtered target frame by using a feature extractor to obtain the feature vector of the filtered target frame
  • the matching module is configured to match the filtered target frame with each target frame of the previous frame of the image according to the feature vector to obtain an updated target frame.
  • a third aspect of the present application provides a computer device, the computer device includes a processor, and the processor is configured to implement the multi-target tracking method when executing computer-readable instructions stored in a memory.
  • a fourth aspect of the present application provides a non-volatile readable storage medium having computer readable instructions stored thereon, and when the computer readable instructions are executed by a processor, the multi-target tracking method is implemented.
  • This application uses a target detector to detect targets of a predetermined type in an image to obtain target frames of the predetermined type; uses a target classifier to score each target frame to obtain a score indicating that the target frame belongs to a specified target; deletes target frames whose scores are lower than a preset threshold to obtain filtered target frames; uses a feature extractor to extract features of the filtered target frames to obtain their feature vectors; and matches the filtered target frames against the target frames of the previous frame of the image according to the feature vectors to obtain updated target frames.
  • the present application solves the problem of dependence on the target detector in the existing multi-target tracking scheme, and improves the accuracy and robustness of tracking.
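  • To make the claimed pipeline concrete, the following is a minimal sketch of one tracking iteration. The callables detector, classifier and extractor, the (box, feature) track layout, and the 0.5 distance cutoff are illustrative assumptions for this sketch (the application only requires a preset threshold and a preset difference value), not a prescribed implementation.

```python
import numpy as np

def track_frame(image, prev_tracks, detector, classifier, extractor,
                score_thresh=0.7, max_dist=0.5):
    """One iteration of the claimed pipeline. `detector`, `classifier` and
    `extractor` are callables supplied by the caller; `prev_tracks` is a list
    of (box, feature_vector) pairs from the previous frame."""
    boxes = detector(image)                                   # step 101: target frames of the predetermined type
    scored = [(b, classifier(image, b)) for b in boxes]       # step 102: score each frame for the specified target
    kept = [b for b, s in scored if s >= score_thresh]        # step 103: delete low-scoring frames
    feats = [extractor(image, b) for b in kept]               # step 104: one feature vector per filtered frame
    tracks = []
    for box, feat in zip(kept, feats):                        # step 105: match against the previous frame
        dists = [1 - np.dot(feat, pf) / (np.linalg.norm(feat) * np.linalg.norm(pf) + 1e-8)
                 for _, pf in prev_tracks]
        if dists and min(dists) <= max_dist:
            tracks.append((box, feat, int(np.argmin(dists))))  # matched to an existing target
        else:
            tracks.append((box, feat, None))                   # stored as a new target
    return tracks
```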
  • Fig. 1 is a flowchart of a multi-target tracking method provided by an embodiment of the present application.
  • Fig. 2 is a structural diagram of a multi-target tracking device provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the multi-target tracking method of the present application is applied to one or more computer devices.
  • The computer device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • FIG. 1 is a flowchart of a multi-target tracking method provided in Embodiment 1 of the present application.
  • the multi-target tracking method is applied to a computer device.
  • the multi-target tracking method of the present application tracks a specified type of moving object (such as a pedestrian) in a video or image sequence, and obtains the position of the moving object in each frame of the image.
  • the multi-target tracking method can solve the problem of dependence on the target detector in the existing multi-target tracking solution, and improve the accuracy and robustness of tracking.
  • the multi-target tracking method includes:
  • Step 101 Use a target detector to detect a predetermined type of target in an image to obtain a target frame of the predetermined type of target.
  • the predetermined type of target may include pedestrians, cars, airplanes, ships, and so on.
  • the predetermined type of target may be one type of target (for example, pedestrians) or multiple types of targets (for example, pedestrians and cars).
  • the target detector may be a neural network model with classification and regression functions.
  • the target detector may be a Faster Region-Based Convolutional Neural Network (Faster RCNN) model.
  • the Faster RCNN model includes the Region Proposal Network (RPN) and the Fast Region-based Convolution Neural Network (Fast RCNN).
  • the region suggestion network and the fast region convolutional neural network have a shared convolutional layer, and the convolutional layer is used to extract a feature map of an image.
  • the region suggestion network generates a candidate frame of the image according to the feature map, and inputs the generated candidate frame into the fast regional convolutional neural network.
  • the fast area convolutional neural network screens and adjusts the candidate frame according to the feature map to obtain the target frame of the image.
  • Before the target detector is used to detect targets of a predetermined type in an image, it needs to be trained on a training sample set.
  • During training, the convolutional layers extract a feature map of each sample image in the training sample set, the region proposal network obtains candidate frames in each sample image according to the feature map, and the Fast RCNN screens and adjusts the candidate frames according to the feature map to obtain the target frames of each sample image.
  • the target detector detects target frames of predetermined types of targets (for example, pedestrians, cars, airplanes, ships, etc.).
  • In a preferred embodiment, the Faster RCNN model adopts the ZF architecture, and the region proposal network and the Fast RCNN share 5 convolutional layers.
  • In a specific embodiment, the training sample set can be used to train the Faster RCNN model according to the following steps:
  • (1) Initialize the region proposal network with an ImageNet pre-trained model and train it on the training sample set;
  • (2) Use the region proposal network trained in (1) to generate candidate frames for each sample image in the training sample set, and use these candidate frames to train the Fast RCNN. At this point the region proposal network and the Fast RCNN do not yet share convolutional layers;
  • (3) Initialize the region proposal network with the Fast RCNN trained in (2), and train the region proposal network on the training sample set;
  • (4) Initialize the Fast RCNN with the region proposal network trained in (3), keep the shared convolutional layers fixed, and train the Fast RCNN on the training sample set. At this point the region proposal network and the Fast RCNN share the same convolutional layers and form a unified network model.
  • the regional suggestion network selects many candidate boxes, and several candidate boxes with the highest scores can be screened according to the target classification score of the candidate boxes and input to the fast regional convolutional neural network to speed up training and detection.
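  • As an illustration of this screening step, the sketch below keeps only the highest-scoring proposals before they are passed on to the Fast RCNN; the array layout and the cutoff of 300 proposals are assumptions chosen for the example, not values fixed by the application.

```python
import numpy as np

def top_k_proposals(boxes, objectness_scores, k=300):
    """Keep the k proposals with the highest target-classification score.
    boxes: (N, 4) array of candidate boxes; objectness_scores: (N,) array."""
    order = np.argsort(objectness_scores)[::-1][:k]   # indices of the k best-scoring candidate boxes
    return boxes[order], objectness_scores[order]
```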
  • the backpropagation algorithm can be used to train the region suggestion network, and the network parameters of the region suggestion network can be adjusted during the training process to minimize the loss function.
  • the loss function indicates the difference between the prediction confidence of the candidate frame predicted by the region suggestion network and the true confidence.
  • the loss function can include two parts: target classification loss and regression loss.
  • The loss function can be defined as:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*)$$

  • where $i$ is the index of a candidate frame in a training batch (mini-batch); $L_{cls}(p_i, p_i^*)$ is the target classification loss of the candidate frame; $N_{cls}$ is the size of the training batch, for example 256; $p_i$ is the predicted probability that the $i$-th candidate frame is a target; and $p_i^*$ is the ground-truth (GT) label, which is 1 if the candidate frame is positive (that is, the assigned label is a positive label, called a positive candidate frame) and 0 if the candidate frame is negative (that is, the assigned label is a negative label, called a negative candidate frame). The classification loss can be computed as $L_{cls}(p_i, p_i^*) = -\log[\,p_i^* p_i + (1-p_i^*)(1-p_i)\,]$.
  • $p_i^* L_{reg}(t_i, t_i^*)$ is the regression loss of the candidate frame; $\lambda$ is the balance weight, which can be taken as 10; $N_{reg}$ is the number of candidate frames. The regression loss can be computed as $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$, where $t_i = (t_x, t_y, t_w, t_h)$ is a coordinate vector representing the 4 parameterized coordinates of the candidate frame (for example, the top-left corner coordinates and the width and height), and $t_i^* = (t_x^*, t_y^*, t_w^*, t_h^*)$ is the coordinate vector of the GT bounding box corresponding to a positive candidate frame. $R$ is the robust loss function (smooth L1), defined as:

$$\text{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
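  • A minimal numeric sketch of the loss just defined, assuming NumPy arrays for predictions and targets; it mirrors the classification term, the smooth-L1 regression term, and the λ weighting, and is illustrative rather than the application's training code.

```python
import numpy as np

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=10.0, n_cls=256):
    """p: (N,) predicted object probabilities; p_star: (N,) 0/1 GT labels;
    t, t_star: (N, 4) parameterized box coordinates; N_reg = number of candidate boxes."""
    cls = -np.log(p_star * p + (1 - p_star) * (1 - p) + 1e-12)   # L_cls per candidate box
    reg = p_star * smooth_l1(t - t_star).sum(axis=1)             # regression counted only for positives
    n_reg = len(p)
    return cls.sum() / n_cls + lam * reg.sum() / n_reg
```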
  • The training method of the Fast RCNN can refer to the training method of the region proposal network and is not repeated here.
  • In this embodiment, hard negative mining (HNM) is added to the training of the Fast RCNN. For negative samples that the Fast RCNN misclassifies as positive (that is, hard examples), their information is recorded, and in the next training iteration these negative samples are fed into the training sample set again with an increased loss weight, strengthening their influence on the classifier. This ensures that the classifier keeps being trained on harder negative samples, so that the features it learns progress from easy to hard and cover a more diverse sample distribution.
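  • The following sketch illustrates the hard-negative-mining idea described above, assuming the classifier's scores on negative samples are available as NumPy arrays; the 0.5 hardness threshold and the 2x weight boost are illustrative choices, not values taken from the application.

```python
import numpy as np

def mine_hard_negatives(neg_scores, neg_indices, hard_thresh=0.5, weight_boost=2.0):
    """Negatives scored above `hard_thresh` were (nearly) misclassified as positives.
    Return their indices and per-sample loss weights for the next training iteration.
    neg_scores: (M,) array of predicted object scores; neg_indices: (M,) array of sample indices."""
    hard_mask = neg_scores > hard_thresh
    hard_indices = neg_indices[hard_mask]           # record the hard examples to re-feed into training
    weights = np.ones_like(neg_scores)
    weights[hard_mask] *= weight_boost              # increase the loss weight of hard negatives
    return hard_indices, weights
```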
  • In other embodiments, the target detector may also be another neural network model, such as a region-based convolutional neural network (RCNN) model or a Faster RCNN model.
  • When the target detector is used to detect targets of a predetermined type in an image, the image is input to the target detector, which detects the predetermined type of target in the image and outputs the positions of the target frames of the predetermined type of target in the image. For example, the target detector outputs 6 target frames in the image.
  • A target frame can be presented in the form of a rectangular box.
  • The position of a target frame may be represented by position coordinates, which may include the top-left corner coordinates (x, y) and the width and height (w, h).
  • The target detector can also output the type of each target frame, for example, 5 pedestrian-type target frames (called pedestrian target frames) and 1 car-type target frame (called a car target frame). This method does not place high accuracy requirements on the target detector, and the types of the target frames output by the target detector may be inaccurate.
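  • For context, a pre-trained detection model can produce target frames in exactly the (x, y, w, h) form discussed here. The sketch below uses torchvision's Faster R-CNN as a stand-in detector (the application does not prescribe this library or backbone, and the file name is hypothetical), converting its (x1, y1, x2, y2) boxes to top-left coordinates plus width and height.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Stand-in detector; the weights argument name varies across torchvision versions.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open("frame_000123.jpg").convert("RGB")     # hypothetical frame from the video sequence
with torch.no_grad():
    outputs = model([to_tensor(image)])[0]                # dict with 'boxes', 'labels', 'scores'

for (x1, y1, x2, y2), label, score in zip(outputs["boxes"].tolist(),
                                          outputs["labels"].tolist(),
                                          outputs["scores"].tolist()):
    x, y, w, h = x1, y1, x2 - x1, y2 - y1                 # convert to top-left (x, y) plus (w, h)
    print(f"type={label} score={score:.2f} box=({x:.0f}, {y:.0f}, {w:.0f}, {h:.0f})")
```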
  • Step 102 Use a target classifier to score the target frame, and obtain a score that the target frame belongs to a specified target.
  • the image and the position of the target frame are input into a target classifier, and the target classifier scores each target frame to obtain a score for each target frame.
  • The specified target is included in the predetermined type of target. For example, the predetermined type of target includes pedestrians and cars, and the specified target includes pedestrians.
  • There may be multiple target frames of the predetermined type of target. Scoring the target frames with the target classifier means scoring each target frame separately to obtain, for each target frame, a score indicating that it belongs to the specified target. For example, in an application that tracks pedestrians, the 5 pedestrian target frames and 1 car target frame obtained are scored, and the score of each target frame belonging to a pedestrian is obtained.
  • The target frames of the predetermined type of target detected by the target detector may contain target frames of non-specified targets, and the purpose of scoring the target frames with the target classifier is to identify the target frames of non-specified targets. If a target frame belongs to the specified target, its score for the specified target is high; if it does not, its score for the specified target is low. For example, if the specified target is a pedestrian, an input pedestrian target frame may receive a score of 0.9, while an input car target frame may receive a score of 0.1.
  • the target classifier may be a neural network model.
  • the target classifier may be a Region-based Fully Convolutional Network (R-FCN) model.
  • the R-FCN model also includes a regional proposal network. Compared with the Faster RCNN model, the R-FCN model has a deeper shared convolutional layer and can obtain more abstract features for scoring.
  • the R-FCN model obtains a position-sensitive score map of the target frame, and scores the target frame according to the position-sensitive score map.
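  • R-FCN's position-sensitive score maps are not reproduced here; as a simplified stand-in, the sketch below scores each detected frame with an ordinary binary classifier applied to the cropped region, which is enough to illustrate the per-frame scoring interface this step assumes. The classifier module, crop size, and preprocessing are assumptions for the example.

```python
import torch

def score_boxes(image_tensor, boxes, classifier, crop_size=(128, 64)):
    """image_tensor: (3, H, W) float tensor; boxes: list of (x, y, w, h);
    classifier: any torch module mapping a (1, 3, h, w) crop to a single logit."""
    scores = []
    for x, y, w, h in boxes:
        crop = image_tensor[:, int(y):int(y + h), int(x):int(x + w)]
        crop = torch.nn.functional.interpolate(crop.unsqueeze(0), size=crop_size,
                                               mode="bilinear", align_corners=False)
        with torch.no_grad():
            scores.append(torch.sigmoid(classifier(crop)).item())   # score that the frame is the specified target
    return scores
```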
  • Before the target classifier is used to score the target frames, the classifier needs to be trained on a training sample set. The training of the target classifier can refer to the prior art and is not repeated here.
  • Step 103 Delete the target frame whose score is lower than the preset threshold in the target frame, and obtain the filtered target frame.
  • the filtered target frame is the target frame of the specified target.
  • It can be determined whether the score of each target frame belonging to the specified target is lower than the preset threshold (for example, 0.7); if the score of a target frame belonging to the specified target is lower than the preset threshold, the target frame is regarded as a false detection and is deleted. For example, if the scores of the five pedestrian target frames are 0.9, 0.8, 0.7, 0.8, and 0.9, and the score of the one car target frame is 0.1, then the score of the car target frame is lower than the preset threshold, so the car target frame is deleted and 5 pedestrian target frames remain.
  • The preset threshold can be set according to actual needs.
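  • Continuing the running example (five pedestrian frames and one car frame), a minimal sketch of the deletion step; the placeholder box names are illustrative and the 0.7 threshold matches the example value above.

```python
def filter_boxes(boxes, scores, threshold=0.7):
    """Drop target frames whose specified-target score is below the preset threshold."""
    return [(b, s) for b, s in zip(boxes, scores) if s >= threshold]

boxes = ["ped1", "ped2", "ped3", "ped4", "ped5", "car1"]   # placeholders for the six detected frames
scores = [0.9, 0.8, 0.7, 0.8, 0.9, 0.1]
print(filter_boxes(boxes, scores))   # the car frame (score 0.1) is deleted, 5 pedestrian frames remain
```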
  • Step 104 Extract the features of the filtered target frame using a feature extractor to obtain the feature vector of the filtered target frame.
  • the filtered target frame is input to a feature extractor, and the feature extractor extracts the features of the filtered target frame to obtain the feature vector of the filtered target frame.
  • Using the feature extractor to extract the features of the screened target frames is to extract the features of each screened target frame to obtain the feature vector of each screened target frame.
  • the feature extractor may be a neural network model.
  • a re-identification (Re-Identification, ReID) method may be used to extract the features of the screened target frame.
  • the method is used to track pedestrians, and the ReID method may be used, such as the part-aligned ReID (part-aligned ReID) method to extract the characteristics of the pedestrian target frame after screening (referred to as pedestrian re-identification characteristics).
  • the extracted features of the filtered target frame may include global features and local features.
  • Methods of extracting local features can include image dicing, positioning using key points (such as skeleton key points), and posture/angle correction.
  • the method is used to track pedestrians, and the feature extraction convolutional neural network (CNN) model can be used to extract the features of the screened target frame.
  • the feature extraction CNN model includes three linear sub-networks FEN-C1, FEN-C2, FEN-C3.
  • For each filtered target frame, 14 skeleton key points in the target frame can be extracted, and 7 regions of interest (ROIs) can be obtained according to the 14 skeleton key points.
  • The regions of interest include 3 large regions (head, upper body, and lower body) and 4 small limb regions.
  • The target frame is passed through the complete feature-extraction CNN model to obtain a global feature. The 3 large regions pass through the FEN-C2 and FEN-C3 sub-networks to obtain three local features, and the 4 limb regions pass through the FEN-C3 sub-network to obtain four local features. All 8 features are concatenated at different scales, finally yielding a pedestrian re-identification feature that fuses the global feature with local features at multiple scales.
  • the extracted feature vector of the filtered target frame is a 128-dimensional feature vector.
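  • The part-aligned ReID network itself is not reproduced here; the sketch below only shows the interface this step assumes, producing one L2-normalized 128-dimensional feature vector per filtered target frame. The toy backbone and its layer sizes are illustrative assumptions, not the feature extractor described above.

```python
import torch
import torch.nn as nn

class ToyReIDExtractor(nn.Module):
    """Illustrative stand-in for the feature-extraction CNN: crop -> 128-d embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, crop):                           # crop: (N, 3, H, W)
        feat = self.backbone(crop)
        return nn.functional.normalize(feat, dim=1)    # L2-normalized 128-d re-identification feature

extractor = ToyReIDExtractor()
dummy_crop = torch.rand(1, 3, 256, 128)                # one filtered pedestrian frame, resized
print(extractor(dummy_crop).shape)                     # torch.Size([1, 128])
```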
  • Step 105 Match the screened target frame with each target frame of the previous frame of the image according to the feature vector to obtain an updated target frame.
  • The difference value between each filtered target frame and each target frame of the previous frame of the image may be calculated according to the feature vectors, and the target frame of the previous frame that matches each filtered target frame may be determined according to the difference values, yielding the updated target frames.
  • the filtered target frame includes target frame A1, target frame A2, target frame A3, and target frame A4, and the target frame of the previous frame of image includes target frame B1, target frame B2, target frame B3, and target frame B4.
  • For target frame A1, the difference values between A1 and B1, A1 and B2, A1 and B3, and A1 and B4 are calculated, and the pair of target frames whose difference value is smallest and not greater than a preset difference value (for example, target frame A1 and target frame B1) is determined as a matched pair.
  • Similarly, for target frame A2, the difference values between A2 and B1, A2 and B2, A2 and B3, and A2 and B4 are calculated, and the pair whose difference value is smallest and not greater than the preset difference value (for example, target frame A2 and target frame B2) is determined as a matched pair; the same is done for target frames A3 and A4 (for example, matching A3 with B3 and A4 with B4). The updated target frames therefore include target frames A1, A2, A3, and A4, corresponding respectively to target frames B1, B2, B3, and B4 in the previous frame.
  • The cosine distance between the feature vector of a filtered target frame and the feature vector of each target frame of the previous frame of the image may be calculated and used as the difference value between the filtered target frame and that target frame of the previous frame.
  • Alternatively, the Euclidean distance between the feature vector of a filtered target frame and the feature vector of each target frame of the previous frame of the image may be calculated and used as the difference value.
  • If the difference values between a filtered target frame and all target frames of the previous frame of the image are greater than the preset difference value, the filtered target frame is stored as a new target frame.
  • It should be noted that if the first frame of a sequence of consecutively captured images is being processed, that is, there is no previous frame, the feature vector of each filtered target frame obtained in step 104 is stored directly.
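  • A minimal sketch of the matching rule just described, using cosine distance as the difference value: each filtered frame is paired with the previous-frame target that minimizes the distance, provided the minimum does not exceed a preset difference value, and is otherwise kept as a new target. The 0.5 cutoff and the example names A1..A4 / B1..B4 are illustrative.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def match_to_previous(curr_feats, prev_feats, max_diff=0.5):
    """curr_feats: feature vectors of the filtered frames (e.g. A1..A4);
    prev_feats: feature vectors of the previous frame's targets (e.g. B1..B4).
    Returns, per current frame, the index of the matched previous target or None (new target)."""
    matches = []
    for feat in curr_feats:
        diffs = [cosine_distance(feat, pf) for pf in prev_feats]
        best = int(np.argmin(diffs)) if diffs else None
        matches.append(best if best is not None and diffs[best] <= max_diff else None)
    return matches
```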
  • In summary, according to the above multi-target tracking method, a target detector is used to detect targets of a predetermined type in an image to obtain target frames of the predetermined type; a target classifier is used to score the target frames to obtain scores indicating that the target frames belong to a specified target; target frames whose scores are lower than a preset threshold are deleted to obtain filtered target frames; a feature extractor is used to extract the features of the filtered target frames to obtain their feature vectors; and the filtered target frames are matched against each target frame of the previous frame of the image according to the feature vectors to obtain updated target frames. This application solves the problem of dependence on the target detector in existing multi-target tracking schemes and improves the accuracy and robustness of tracking.
  • FIG. 2 is a structural diagram of a multi-target tracking device provided in Embodiment 2 of the present application.
  • the multi-target tracking device 20 is applied to a computer device.
  • the multi-target tracking of this device tracks a specified type of moving object (such as a pedestrian) in a video or image sequence, and obtains the position of the moving object in each frame of the image.
  • the multi-target tracking device 20 can solve the problem of dependence on the target detector in the existing multi-target tracking solution, and improve the accuracy and robustness of tracking.
  • the multi-target tracking device 20 may include a detection module 201, a scoring module 202, a deletion module 203, an extraction module 204, and a matching module 205.
  • the detection module 201 is configured to use a target detector to detect a predetermined type of target in an image to obtain a target frame of the predetermined type of target.
  • the predetermined type of target may include pedestrians, cars, airplanes, ships, and so on.
  • the predetermined type of target may be one type of target (for example, pedestrians) or multiple types of targets (for example, pedestrians and cars).
  • the target detector may be a neural network model with classification and regression functions.
  • the target detector may be a Faster Region-Based Convolutional Neural Network (Faster RCNN) model.
  • the Faster RCNN model includes the Region Proposal Network (RPN) and the Fast Region-based Convolution Neural Network (Fast RCNN).
  • the region suggestion network and the fast region convolutional neural network have a shared convolutional layer, and the convolutional layer is used to extract a feature map of an image.
  • the region suggestion network generates a candidate frame of the image according to the feature map, and inputs the generated candidate frame into the fast regional convolutional neural network.
  • the fast area convolutional neural network screens and adjusts the candidate frame according to the feature map to obtain the target frame of the image.
  • Before the target detector is used to detect targets of a predetermined type in an image, it needs to be trained on a training sample set.
  • During training, the convolutional layers extract a feature map of each sample image in the training sample set, the region proposal network obtains candidate frames in each sample image according to the feature map, and the Fast RCNN screens and adjusts the candidate frames according to the feature map to obtain the target frames of each sample image.
  • the target detector detects target frames of predetermined types of targets (for example, pedestrians, cars, airplanes, ships, etc.).
  • In a preferred embodiment, the Faster RCNN model adopts the ZF architecture, and the region proposal network and the Fast RCNN share 5 convolutional layers.
  • In a specific embodiment, the training sample set can be used to train the Faster RCNN model according to the same four alternating training steps described in the method embodiment above: train the region proposal network from an ImageNet-initialized model, use it to generate candidate frames for training the Fast RCNN, re-initialize and re-train the region proposal network from the trained Fast RCNN, and finally train the Fast RCNN with the shared convolutional layers held fixed, so that the two networks share the same convolutional layers and form a unified network model.
  • the regional suggestion network selects many candidate boxes, and several candidate boxes with the highest scores can be screened according to the target classification score of the candidate boxes and input to the fast regional convolutional neural network to speed up training and detection.
  • the backpropagation algorithm can be used to train the region suggestion network, and the network parameters of the region suggestion network can be adjusted during the training process to minimize the loss function.
  • the loss function indicates the difference between the prediction confidence of the candidate frame predicted by the region suggestion network and the true confidence.
  • the loss function can include two parts: target classification loss and regression loss.
  • The loss function can be defined as:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*)$$

  • where $i$ is the index of a candidate frame in a training batch (mini-batch); $L_{cls}(p_i, p_i^*)$ is the target classification loss of the candidate frame; $N_{cls}$ is the size of the training batch, for example 256; $p_i$ is the predicted probability that the $i$-th candidate frame is a target; and $p_i^*$ is the ground-truth (GT) label, which is 1 if the candidate frame is positive (that is, the assigned label is a positive label, called a positive candidate frame) and 0 if the candidate frame is negative (that is, the assigned label is a negative label, called a negative candidate frame). The classification loss can be computed as $L_{cls}(p_i, p_i^*) = -\log[\,p_i^* p_i + (1-p_i^*)(1-p_i)\,]$.
  • $p_i^* L_{reg}(t_i, t_i^*)$ is the regression loss of the candidate frame; $\lambda$ is the balance weight, which can be taken as 10; $N_{reg}$ is the number of candidate frames; $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$, where $t_i = (t_x, t_y, t_w, t_h)$ represents the 4 parameterized coordinates of the candidate frame and $t_i^* = (t_x^*, t_y^*, t_w^*, t_h^*)$ is the coordinate vector of the corresponding GT bounding box. $R$ is the robust loss function (smooth L1), defined as:

$$\text{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
  • the training method of the fast regional convolutional network can refer to the training method of the regional suggestion network, which will not be repeated here.
  • In this embodiment, hard negative mining (HNM) is added to the training of the Fast RCNN. For negative samples that the Fast RCNN misclassifies as positive (that is, hard examples), their information is recorded, and in the next training iteration these negative samples are fed into the training sample set again with an increased loss weight, strengthening their influence on the classifier. This ensures that the classifier keeps being trained on harder negative samples, so that the features it learns progress from easy to hard and cover a more diverse sample distribution.
  • In other embodiments, the target detector may also be another neural network model, such as a region-based convolutional neural network (RCNN) model or a Faster RCNN model.
  • When the target detector is used to detect targets of a predetermined type in an image, the image is input to the target detector, which detects the predetermined type of target in the image and outputs the positions of the target frames of the predetermined type of target in the image. For example, the target detector outputs 6 target frames in the image.
  • A target frame can be presented in the form of a rectangular box.
  • The position of a target frame may be represented by position coordinates, which may include the top-left corner coordinates (x, y) and the width and height (w, h).
  • The target detector can also output the type of each target frame, for example, 5 pedestrian-type target frames (called pedestrian target frames) and 1 car-type target frame (called a car target frame). This method does not place high accuracy requirements on the target detector, and the types of the target frames output by the target detector may be inaccurate.
  • the scoring module 202 is configured to score the target frame by using a target classifier to obtain a score that the target frame belongs to a designated target.
  • the image and the position of the target frame are input into a target classifier, and the target classifier scores each target frame to obtain a score for each target frame.
  • The specified target is included in the predetermined type of target. For example, the predetermined type of target includes pedestrians and cars, and the specified target includes pedestrians.
  • There may be multiple target frames of the predetermined type of target. Scoring the target frames with the target classifier means scoring each target frame separately to obtain, for each target frame, a score indicating that it belongs to the specified target. For example, in an application that tracks pedestrians, the 5 pedestrian target frames and 1 car target frame obtained are scored, and the score of each target frame belonging to a pedestrian is obtained.
  • The target frames of the predetermined type of target detected by the target detector may contain target frames of non-specified targets, and the purpose of scoring the target frames with the target classifier is to identify the target frames of non-specified targets. If a target frame belongs to the specified target, its score for the specified target is high; if it does not, its score for the specified target is low. For example, if the specified target is a pedestrian, an input pedestrian target frame may receive a score of 0.9, while an input car target frame may receive a score of 0.1.
  • the target classifier may be a neural network model.
  • the target classifier may be a Region-based Fully Convolutional Network (R-FCN) model.
  • the R-FCN model also includes a regional proposal network. Compared with the Faster RCNN model, the R-FCN model has a deeper shared convolutional layer and can obtain more abstract features for scoring.
  • the R-FCN model obtains a position-sensitive score map of the target frame, and scores the target frame according to the position-sensitive score map.
  • Before the target classifier is used to score the target frames, the classifier needs to be trained on a training sample set. The training of the target classifier can refer to the prior art and is not repeated here.
  • the deleting module 203 is configured to delete the target frame whose score is lower than the preset threshold in the target frame to obtain the filtered target frame.
  • the filtered target frame is the target frame of the specified target.
  • It can be determined whether the score of each target frame belonging to the specified target is lower than the preset threshold (for example, 0.7); if the score of a target frame belonging to the specified target is lower than the preset threshold, the target frame is regarded as a false detection and is deleted. For example, if the scores of the five pedestrian target frames are 0.9, 0.8, 0.7, 0.8, and 0.9, and the score of the one car target frame is 0.1, then the score of the car target frame is lower than the preset threshold, so the car target frame is deleted and 5 pedestrian target frames remain.
  • The preset threshold can be set according to actual needs.
  • the extraction module 204 is configured to extract the features of the screened target frame using a feature extractor to obtain the feature vector of the screened target frame.
  • the filtered target frame is input to a feature extractor, and the feature extractor extracts the features of the filtered target frame to obtain the feature vector of the filtered target frame.
  • Using the feature extractor to extract the features of the screened target frames is to extract the features of each screened target frame to obtain the feature vector of each screened target frame.
  • the feature extractor may be a neural network model.
  • a re-identification (Re-Identification, ReID) method may be used to extract the features of the screened target frame.
  • the method is used to track pedestrians, and the ReID method can be used, for example, the part-aligned ReID (part-aligned ReID) method extracts the characteristics of the pedestrian target frame after screening (referred to as pedestrian re-identification characteristics).
  • the extracted features of the filtered target frame may include global features and local features.
  • Methods of extracting local features can include image dicing, positioning using key points (such as skeleton key points), and posture/angle correction.
  • the method is used to track pedestrians, and the feature extraction convolutional neural network (CNN) model can be used to extract the features of the screened target frame.
  • the feature extraction CNN model includes three linear sub-networks FEN-C1, FEN-C2, FEN-C3.
  • For each filtered target frame, 14 skeleton key points in the target frame can be extracted, and 7 regions of interest (ROIs) can be obtained according to the 14 skeleton key points.
  • The regions of interest include 3 large regions (head, upper body, and lower body) and 4 small limb regions.
  • The target frame is passed through the complete feature-extraction CNN model to obtain a global feature. The 3 large regions pass through the FEN-C2 and FEN-C3 sub-networks to obtain three local features, and the 4 limb regions pass through the FEN-C3 sub-network to obtain four local features. All 8 features are concatenated at different scales, finally yielding a pedestrian re-identification feature that fuses the global feature with local features at multiple scales.
  • the extracted feature vector of the filtered target frame is a 128-dimensional feature vector.
  • the matching module 205 is configured to match the filtered target frame with each target frame of the previous frame of the image according to the feature vector to obtain an updated target frame.
  • The difference value between each filtered target frame and each target frame of the previous frame of the image may be calculated according to the feature vectors, and the target frame of the previous frame that matches each filtered target frame may be determined according to the difference values, yielding the updated target frames.
  • the filtered target frame includes target frame A1, target frame A2, target frame A3, and target frame A4, and the target frame of the previous frame of image includes target frame B1, target frame B2, target frame B3, and target frame B4.
  • For target frame A1, the difference values between A1 and B1, A1 and B2, A1 and B3, and A1 and B4 are calculated, and the pair of target frames whose difference value is smallest and not greater than a preset difference value (for example, target frame A1 and target frame B1) is determined as a matched pair.
  • Similarly, for target frame A2, the difference values between A2 and B1, A2 and B2, A2 and B3, and A2 and B4 are calculated, and the pair whose difference value is smallest and not greater than the preset difference value (for example, target frame A2 and target frame B2) is determined as a matched pair; the same is done for target frames A3 and A4 (for example, matching A3 with B3 and A4 with B4). The updated target frames therefore include target frames A1, A2, A3, and A4, corresponding respectively to target frames B1, B2, B3, and B4 in the previous frame.
  • The cosine distance between the feature vector of a filtered target frame and the feature vector of each target frame of the previous frame of the image may be calculated and used as the difference value between the filtered target frame and that target frame of the previous frame.
  • Alternatively, the Euclidean distance between the feature vector of a filtered target frame and the feature vector of each target frame of the previous frame of the image may be calculated and used as the difference value.
  • If the difference values between a filtered target frame and all target frames of the previous frame of the image are greater than the preset difference value, the filtered target frame is stored as a new target frame.
  • It should be noted that if the first frame of a sequence of consecutively captured images is being processed, that is, there is no previous frame, the feature vector of each filtered target frame obtained by the extraction module 204 is stored directly.
  • This embodiment provides a multi-target tracking device 20.
  • the multi-target tracking is to track a specified type of moving object (such as a pedestrian) in a video or image sequence to obtain the position of the moving object in each frame of the image.
  • The multi-target tracking device 20 uses a target detector to detect targets of a predetermined type in the image to obtain target frames of the predetermined type; uses a target classifier to score the target frames to obtain scores indicating that the target frames belong to a specified target; deletes target frames whose scores are lower than a preset threshold to obtain filtered target frames; uses a feature extractor to extract the features of the filtered target frames to obtain their feature vectors; and matches the filtered target frames against each target frame of the previous frame of the image according to the feature vectors to obtain updated target frames.
  • This embodiment solves the problem of dependence on the target detector in the existing multi-target tracking scheme, and improves the accuracy and robustness of tracking.
  • This embodiment provides a readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the steps in the above multi-target tracking method embodiment are implemented, for example, steps 101-105 shown in FIG. 1:
  • Step 101 Use a target detector to detect a predetermined type of target in an image to obtain a target frame of the predetermined type of target;
  • Step 102 Use a target classifier to score the target frame, and obtain a score that the target frame belongs to a specified target;
  • Step 103 Delete the target frame whose score is lower than a preset threshold in the target frame to obtain a filtered target frame;
  • Step 104 Extract the features of the filtered target frame using a feature extractor to obtain the feature vector of the filtered target frame;
  • Step 105 Match the screened target frame with each target frame of the previous frame of the image according to the feature vector to obtain an updated target frame.
  • each module in the above-mentioned device embodiment is realized, for example, the modules 201-205 in FIG. 2:
  • the detection module 201 is configured to use a target detector to detect a predetermined type of target in an image to obtain a target frame of the predetermined type of target;
  • the scoring module 202 is configured to score the target frame by using a target classifier to obtain the score that the target frame belongs to a designated target;
  • the deleting module 203 is configured to delete the target frame whose score is lower than a preset threshold in the target frame to obtain the filtered target frame;
  • the extraction module 204 is configured to extract the features of the filtered target frame using a feature extractor to obtain the feature vector of the filtered target frame;
  • the matching module 205 is configured to match the filtered target frame with each target frame of the previous frame of the image according to the feature vector to obtain an updated target frame.
  • FIG. 3 is a schematic diagram of a computer device provided in Embodiment 4 of this application.
  • the computer device 30 includes a memory 301, a processor 302, and computer-readable instructions 303 that are stored in the memory 301 and can run on the processor 302, such as a multi-target tracking program.
  • When the processor 302 executes the computer-readable instructions 303, the steps in the above multi-target tracking method embodiment are implemented, for example, steps 101-105 shown in FIG. 1:
  • Step 101 Use a target detector to detect a predetermined type of target in an image to obtain a target frame of the predetermined type of target;
  • Step 102 Use a target classifier to score the target frame, and obtain a score that the target frame belongs to a specified target;
  • Step 103 Delete the target frame whose score is lower than a preset threshold in the target frame to obtain a filtered target frame;
  • Step 104 Extract the features of the filtered target frame using a feature extractor to obtain the feature vector of the filtered target frame;
  • Step 105 Match the screened target frame with each target frame of the previous frame of the image according to the feature vector to obtain an updated target frame.
  • each module in the above-mentioned device embodiment is realized, for example, the modules 201-205 in FIG. 2:
  • the detection module 201 is configured to use a target detector to detect a predetermined type of target in an image to obtain a target frame of the predetermined type of target;
  • the scoring module 202 is configured to score the target frame by using a target classifier to obtain the score that the target frame belongs to a designated target;
  • the deleting module 203 is configured to delete the target frame whose score is lower than a preset threshold in the target frame to obtain the filtered target frame;
  • the extraction module 204 is configured to extract the features of the filtered target frame using a feature extractor to obtain the feature vector of the filtered target frame;
  • the matching module 205 is configured to match the filtered target frame with each target frame of the previous frame of the image according to the feature vector to obtain an updated target frame.
  • the computer-readable instruction 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete the method .
  • the computer-readable instruction 303 may be divided into the detection module 201, the scoring module 202, the deletion module 203, the extraction module 204, and the matching module 205 in FIG. 2.
  • the specific functions of each module refer to the second embodiment.
  • the computer device 30 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • The schematic diagram in FIG. 3 is only an example of the computer device 30 and does not constitute a limitation on the computer device 30; the computer device 30 may include more or fewer components than shown, combine certain components, or have different components.
  • the computer device 30 may also include input and output devices, network access devices, buses, etc.
  • the so-called processor 302 may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Ready-made programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor 302 may also be any conventional processor, etc.
  • The processor 302 is the control center of the computer device 30 and connects the various parts of the entire computer device 30 through various interfaces and lines.
  • The memory 301 may be used to store the computer-readable instructions 303, and the processor 302 implements the various functions of the computer device 30 by running or executing the computer-readable instructions or modules stored in the memory 301 and calling the data stored in the memory 301.
  • The memory 301 may mainly include a program storage area and a data storage area.
  • The program storage area may store an operating system and application programs required by at least one function (such as a sound playback function and an image playback function); the data storage area may store data created according to the use of the computer device 30 (such as audio data and a phone book).
  • The memory 301 may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • If the integrated modules of the computer device 30 are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application may also be completed by instructing relevant hardware through computer-readable instructions, which may be stored in a readable storage medium; when the computer-readable instructions are executed by a processor, the steps of the foregoing method embodiments can be implemented.
  • The computer-readable medium may include: any entity or device capable of carrying the computer-readable instructions, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of hardware plus software functional modules.
  • The above-mentioned software functional modules are stored in a readable storage medium and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute parts of the methods described in the various embodiments of this application.

Abstract

一种多目标跟踪方法、装置、计算机装置及非易失性可读存储介质。所述多目标跟踪方法包括:利用目标检测器检测图像中的预定类型目标,得到所述预定类型目标的目标框;利用目标分类器对所述目标框打分,得到所述目标框属于指定目标的分数;删除所述目标框中所述分数低于预设阈值的目标框,得到筛选后的目标框;利用特征提取器提取所述筛选后的目标框的特征,得到所述筛选后的目标框的特征向量;根据所述特征向量将所述筛选后的目标框与所述图像的前一帧图像的各个目标框进行匹配,得到更新后的目标框。本申请解决了现有多目标跟踪方案中对目标检测器的依赖问题,并且提高了跟踪的精度和鲁棒性。

Description

多目标跟踪方法、装置、计算机装置及可读存储介质
本申请要求于2019年01月23日提交中国专利局,申请号为201910064677.4,发明名称为“多目标跟踪方法、装置、计算机装置及计算机存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及图像处理技术领域,具体涉及一种多目标跟踪方法、装置、计算机装置及非易失性可读存储介质。
背景技术
多目标跟踪是指对视频或图像序列中多个运动物体(例如交通视频中的汽车和行人)进行跟踪,得到运动物体在每一帧的位置。多目标跟踪在视频监控、自动驾驶和视频娱乐等领域有广泛的应用。
目前的多目标跟踪主要采用了track by detection架构,在视频或图像序列的每帧图像上通过检测器检测出各个目标的位置信息,然后将当前帧的目标位置信息和前一帧的目标位置信息进行匹配。如果检测器的精度不高、出现大量的错检或者检测框跟真实框的偏差过大,就会直接导致跟踪的精度变差、跟踪错误或丢失目标。
发明内容
鉴于以上内容,有必要提出一种多目标跟踪方法、装置、计算机装置及非易失性可读存储介质,其可以解决现有多目标跟踪方案中对目标检测器的依赖问题,并且提高了跟踪的精度和鲁棒性。
本申请的第一方面提供一种多目标跟踪方法,所述方法包括:
利用目标检测器检测图像中的预定类型目标,得到所述预定类型目标的目标框;
利用目标分类器对所述目标框打分,得到所述目标框属于指定目标的分数;
删除所述目标框中所述分数低于预设阈值的目标框,得到筛选后的目标框;
利用特征提取器提取所述筛选后的目标框的特征,得到所述筛选后的目标框的特征向量;
根据所述特征向量将所述筛选后的目标框与所述图像的前一帧图像的各个目标框进行匹配,得到更新后的目标框。
本申请的第二方面提供一种多目标跟踪装置,所述装置包括:
检测模块,用于利用目标检测器检测图像中的预定类型目标,得到所述预定类型目标的目标框;
打分模块,用于利用目标分类器对所述目标框打分,得到所述目标框属于指定目标的分数;
删除模块,用于删除所述目标框中所述分数低于预设阈值的目标框,得到筛选后的目标框;
提取模块,用于利用特征提取器提取所述筛选后的目标框的特征,得到所述筛选后的目标框的特征向量;
匹配模块,用于根据所述特征向量将所述筛选后的目标框与所述图像的前一帧图像的各个目标框进行匹配,得到更新后的目标框。
本申请的第三方面提供一种计算机装置,所述计算机装置包括处理器,所述处理器用于执行存储器中存储的计算机可读指令时实现所述多目标跟踪方法。
本申请的第四方面提供一种非易失性可读存储介质,其上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现所述多目标跟踪方法。
本申请利用目标检测器检测图像中的预定类型目标,得到所述预定类型目标的目标框;利用目标分类器对所述目标框打分,得到所述目标框属于指定目标的分数;删除所述目标框中所述分数低于预设阈值的目标框,得到筛选后的目标框;利用特征提取器提取所述筛选后的目标框的特征,得到所述筛选后的目标框的特征向量;根据所述特征向量将所述筛选后的目标框与所述图像的前一帧图像的各个目标框进行匹配,得到更新后的目标框。本申请解决了现有多目标跟踪方案中对目标检测器的依赖问题,并且提高了跟踪的精度和鲁棒性。
附图说明
图1是本申请实施例提供的多目标跟踪方法的流程图。
图2是本申请实施例提供的多目标跟踪装置的结构图。
图3是本申请实施例提供的计算机装置的示意图。
具体实施方式
为了能够更清楚地理解本申请的上述目的、特征和优点,下面结合附图和具体实施例对本申请进行详细描述。需要说明的是,在不冲突的情况下,本申请的实施例及实施例中的特征可以相互组合。
在下面的描述中阐述了很多具体细节以便于充分理解本申请,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中在本申请的说明书中所使用的术语只是为了描述具体的实施例的目的,不是旨在于限制本申请。
优选地,本申请的多目标跟踪方法应用在一个或者多个计算机装置中。所述计算机装置是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备,其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable Gate Array,FPGA)、数字处理器(Digital Signal Processor,DSP)、嵌入式设备等。
所述计算机装置可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述计算机装置可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。
实施例一
图1是本申请实施例一提供的多目标跟踪方法的流程图。所述多目标跟踪方法应用于计算机装置。
本申请多目标跟踪方法对视频或图像序列中指定类型的运动物体(例如行人)进行跟踪,得到运动物体在每一帧图像中的位置。所述多目标跟踪方法可以解决现有多目标跟踪方案中对目标检测器的依赖问题,并且提高了跟踪的精度和鲁棒性。
如图1所示,所述多目标跟踪方法包括:
步骤101,利用目标检测器检测图像中的预定类型目标,得到所述预定类型目标的目标框。
所述预定类型目标可以包括行人、汽车、飞机、船只等。所述预定类型目标可以是一种类型的目标(例如行人),也可以是多种类型的目标(例如行人和汽车)。
所述目标检测器可以是具有分类和回归功能的神经网络模型。在本实施例中,所述目标检测器可以是加快区域卷积神经网络(Faster Region-Based Convolutional Neural Network,Faster RCNN)模型。
Faster RCNN模型包括区域建议网络(Region Proposal Network,RPN)和快速区域卷积神经网络(Fast Region-based Convolution Neural Network,Fast RCNN)。
所述区域建议网络和所述快速区域卷积神经网络有共享的卷积层,所述卷积层用于提取图像的特征图。所述区域建议网络根据所述特征图生成图像的候选框,并将生成的候选框输入所述快速区域卷积神经网络。所述快速区域卷积神经网络根据所述特征图对所述候选框进行筛选和调整,得到图像的目标框。
在利用目标检测器检测图像中的预定类型目标之前,所述目标检测器需要使用训练样本集进行训练。在训练时,所述卷积层提取训练样本集中各个样本图像的特征图,所述区域建议网络根据所述特征图获取所述各个样本图像中的候选框,所述快速区域卷积神经网络根据所述特征图对所述候选框进行筛选和调整,得到所述各个样本图像的目标框。目标检测器检测预定类型目标(例如行人、汽车、飞机、船只等)的目标框。
在一较佳实施例中,所述加快区域卷积神经网络模型采用ZF框架,所述区域建议网络和所述快速区域卷积神经网络共享5个卷积层。
在一具体实施例中,可以按照以下步骤使用训练样本集对加快区域卷积神经网络模型进行训练:
(1)使用Imagenet模型初始化所述区域建议网络,使用所述训练样本集训练所述区域建议网络;
(2)使用(1)中训练后的区域建议网络生成训练样本集中各个样本图像的候选框,利用所述候选框训练所述快速区域卷积神经网络。此时,区域建议网络和快速区域卷积神经网络还没有共享卷积层;
(3)使用(2)中训练后的快速区域卷积神经网络初始化所述区域建议网络,使用训练样本集训练所述区域建议网络;
(4)使用(3)中训练后的区域建议网络初始化所述快速区域卷积神经网络,并保持所述卷积层固定,使用训练样本集训练所述快速区域卷积神经网络。此时,区域建议网络和快速区域卷积神经网络共享相同的卷积层,构成了一个统一的网络模型。
区域建议网络选取的候选框较多,可以根据候选框的目标分类得分筛选了若干个得分最高的候选框输入到快速区域卷积神经网络,以加快训练和检测的速度。
可以使用反向传播算法对区域建议网络进行训练,训练过程中调整区域建议网络的网络参数,使损失函数最小化。损失函数指示区域建议网络预测的候选框的预测置信度与真实置信度的差异。损失函数可以包括目标分类损失和回归损失两部分。
损失函数可以定义为：

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*)$$

其中，i为一个训练批量（mini-batch）中候选框的索引。$L_{cls}(p_i, p_i^*)$ 是候选框的目标分类损失。$N_{cls}$ 为训练批量的大小，例如256。$p_i$ 是第i个候选框为目标的预测概率。$p_i^*$ 是GT标签，若候选框为正（即分配的标签为正标签，称为正候选框），$p_i^*$ 为1；若候选框为负（即分配的标签为负标签，称为负候选框），$p_i^*$ 为0。$L_{cls}$ 可以计算为 $L_{cls}(p_i, p_i^*) = -\log[\,p_i^* p_i + (1-p_i^*)(1-p_i)\,]$。
$p_i^* L_{reg}(t_i, t_i^*)$ 是候选框的回归损失。λ为平衡权重，可以取为10。$N_{reg}$ 为候选框的数量。$L_{reg}$ 可以计算为 $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$。$t_i$ 是一个坐标向量，即 $t_i=(t_x, t_y, t_w, t_h)$，表示候选框的4个参数化坐标（例如候选框左上角的坐标以及宽度、高度）。$t_i^*$ 是与正候选框对应的GT边界框的坐标向量，即 $t_i^*=(t_x^*, t_y^*, t_w^*, t_h^*)$（例如真实目标框左上角的坐标以及宽度、高度）。R为具有鲁棒性的损失函数（smoothL1），定义为：

$$\text{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
快速区域卷积网络的训练方法可以参照区域建议网络的训练方法,此处不再赘述。
在本实施例中,在快速区域卷积网络的训练中加入负样本难例挖掘(Hard Negative Mining,HNM)方法。对于被快速区域卷积网络错误地分类为正样本的负样本(即难例),将这些负样本的信息记录下来,在下次迭代训练的过程中,将这些负样本再次输入到训练样本集中,并且加大其损失的权重,增强其对分类器的影响,这样能够保证不停的针对更难的负样本进行分类,使得分类器学到的特征由易到难,涵盖的样本分布也更具多样性。
在其他的实施例中,所述目标检测器还可以是其他的神经网络模型,例如区域卷积神经网络(RCNN)模型、加快卷积神经网络(Faster RCNN)模型。
利用目标检测器检测图像中的预定类型目标时,将所述图像输入所述目标检测器,所述目标检测器对图像中的预定类型目标进行检测,输出所述图像中的预定类型目标的目标框的位置。例如,所述目标检测器输出所述图像中的6个目标框。目标框可以以矩形框的形式呈现。目标框的位置可以用位置坐标表示,所述位置坐标可以包括左上角坐标(x,y)和宽高(w,h)。
所述目标检测器还可以输出每个目标框的类型,例如输出5个行人类型的目标框(称为行人目标框)和1个汽车类型的目标框(称为汽车目标框)。本方法对目标检测器的精度要求不高,所述目标检测器输出的目标框的类型可能是不准确的。
步骤102,利用目标分类器对所述目标框打分,得到所述目标框属于指定目标的分数。
将所述图像和所述目标框的位置输入目标分类器,所述目标分类器对每个目标框打分,得到每个目标框的分数。
所述指定目标包含在所述预定类型目标中。例如,所述预定类型目标包括行人和汽车,所述指定目标包括行人。
预定类型目标的目标框可以是多个,利用目标分类器对目标框打分是对每个目标框分别进行打分,得到每个目标框属于指定目标的分数。例如,在对行人进行跟踪的应用中,对得到的5个行人目标框和1个汽车目标框进行打分,得到每个目标框属于行人的分数。
目标检测器检测得到的预定类型目标的目标框中可能含有非指定目标的目标框,目标分类器对所述目标框打分的目的是要识别出非指定目标的目标框。若目标框属于指定目标,则属于指定目标的分数较高;若目标框不属于指定目标,则属于指定目标的分数较低。例如,指定目标是行人,输入的是行人目标框,得到的分数为0.9,输入的是汽车目标框,得到的分数为0.1。
所述目标分类器可以是神经网络模型。在本实施例中,所述目标分类器 可以是区域全卷积网络(Region-based Fully Convolutional Network,R-FCN)模型。
R-FCN模型也包括区域建议网络。与Faster RCNN模型相比,R-FCN模型具有更深的共享卷积层,可以获得更加抽象的特征用于打分。
R-FCN模型获取目标框的的位置敏感得分图(position-sensitive score map),根据所述位置敏感得分图对所述目标框打分。
在利用目标分类器对所述目标框进行打分之前,需要使用训练样本集对目标检测器进行训练。目标分类器的训练可以参考现有技术,此处不再赘述。
步骤103,删除所述目标框中所述分数低于预设阈值的目标框,得到筛选后的目标框。
筛选后的目标框也就是指定目标的目标框。
可以判断所述目标框中每个目标框属于指定目标的分数是否低于所述预设阈值(例如0.7),若目标框属于指定目标的分数低于所述预设阈值,则删除该目标框。若目标框属于指定目标的分数低于所述预设阈值,则认定该目标框是错检,删除该目标框。例如,得到的5个行人目标框的分数分别是0.9、0.8、0.7、0.8、0.9,得到的1个汽车目标框的分数是0.1,汽车目标框的分数低于所述预设阈值,则删除该汽车目标框,剩下5个行人目标框。
所述预设阈值可以根据实际需要进行设置。
步骤104,利用特征提取器提取所述筛选后的目标框的特征,得到所述筛选后的目标框的特征向量。
将所述筛选后的目标框输入到特征提取器,所述特征提取器提取所述筛选后的目标框的特征,得到所述筛选后的目标框的特征向量。
筛选后的目标框可以有多个,利用特征提取器提取筛选后的目标框的特征是提取每个筛选后的目标框的特征,得到每个筛选后的目标框的特征向量。
所述特征提取器可以是神经网络模型。在本实施例中,可以采用重识别(Re-Identification,ReID)方法提取筛选后的目标框的特征。例如,所述方法用于对行人进行跟踪,可以采用ReID方法,例如部位对齐ReID(part-aligned ReID)方法提取筛选后的行人目标框的特征(称为行人重识别特征)。
提取的所述筛选后的目标框的特征可以包括全局特征和局部特征。提取局部特征的方式可以包括图像切块、利用关键点(例如骨架关键点)定位以及姿态/角度矫正等。
在一具体实施例中,所述方法用于对行人进行跟踪,可以利用特征提取卷积神经网络(CNN)模型提取筛选后的目标框的特征。所述特征提取CNN模型包括线性的三个子网络FEN-C1、FEN-C2、FEN-C3。对于每个筛选后的目标框,可以提取目标框中的14个骨架关键点,根据所述14个骨架关键点获取7个感兴趣区域(Region of interest,ROI))区域,所述7个感兴趣区域包括头、上身、下身3个大区域和4个四肢小区域。目标框经过完整的特征提取CNN模型得到全局特征。3个大区域经过FEN-C2和FEN-C3子网络得到三个局部特征。四个四肢区域经过FEN-C3子网络得到四个局部特征。所有8个特征在不同的尺度进行联结,最终得到一个融合全局特征和多个尺度 局部特征的行人重识别特征。
在一具体实施例中,提取的筛选后的目标框的特征向量是128维的特征向量。
步骤105,根据所述特征向量将所述筛选后的目标框与所述图像的前一帧图像的各个目标框进行匹配,得到更新后的目标框。
可以根据所述特征向量计算所述筛选后的目标框与所述前一帧图像的各个目标框的差异值,根据所述差异值确定所述筛选后的目标框中与所述前一帧图像的各个目标框匹配的目标框,得到更新后的目标框。
例如,筛选后的目标框包括目标框A1、目标框A2、目标框A3、目标框A4,前一帧图像的目标框包括目标框B1、目标框B2、目标框B3、目标框B4。对于目标框A1,计算目标框A1与目标框B1、目标框A1与目标框B2、目标框A1与目标框B3、目标框A1与目标框B4的差异值,将差异值最小且不大于预设差异值的一组目标框(例如目标框A1与目标框B1)确定为匹配的目标框。类似地,对于目标框A2,计算目标框A2与目标框B1、目标框A2与目标框B2、目标框A2与目标框B3、目标框A2与目标框B4的差异值,将差异值最小且不大于预设差异值的一组目标框(例如目标框A2与目标框B2)确定为匹配的目标框;对于目标框A3,计算目标框A3与目标框B1、目标框A3与目标框B2、目标框A3与目标框B3、目标框A3与目标框B4的差异值,将差异值最小且不大于预设差异值的一组目标框(例如目标框A3与目标框B3)确定为匹配的目标框;对于目标框A4,计算目标框A4与目标框B1、目标框A4与目标框B2、目标框A4与目标框B3、目标框A4与目标框B4的差异值,将差异值最小且不大于预设差异值的一组目标框(例如目标框A4与目标框B4)确定为匹配的目标框。因此,更新后的目标框包括目标框A1、目标框A2、目标框A3、目标框A4,分别对应前一帧图像中目标框B1、目标框B2、目标框B3、目标框B4。
可以计算所述筛选后的目标框的特征向量与前一帧图像的各个目标框的特征向量的余弦距离,将所述余弦距离作为所述筛选后的目标框与所述前一帧图像的各个目标框的差异值。
或者,可以计算所述筛选后的目标框的特征向量与前一帧图像的各个目标框的特征向量的欧氏距离,将所述欧氏距离作为所述筛选后的目标框与所述前一帧图像的各个目标框的差异值。
如果所述筛选后的目标框与所述前一帧图像的各个目标框的差异值均大于预设差异值,则将所述筛选后的目标框存储为新的目标框。
需要说明的是,如果是对连续拍摄的多帧图像中的第一帧图像进行处理,即不存在前一帧图像,则在步骤104得到筛选后的目标框的特征向量之后,直接将筛选后的目标框的特征向量进行存储。
综上所述,根据上述目标跟踪方法,利用目标检测器检测图像中的预定类型目标,得到所述预定类型目标的目标框;利用目标分类器对所述目标框打分,得到所述目标框属于指定目标的分数;删除所述目标框中所述分数低于预设阈值的目标框,得到筛选后的目标框;利用特征提取器提取所述筛选 后的目标框的特征,得到所述筛选后的目标框的特征向量;根据所述特征向量将所述筛选后的目标框与所述图像的前一帧图像的各个目标框进行匹配,得到更新后的目标框。本申请解决了现有多目标跟踪方案中对目标检测器的依赖问题,并且提高了跟踪的精度和鲁棒性。
实施例二
图2是本申请实施例二提供的多目标跟踪装置的结构图。所述多目标跟踪装置20应用于计算机装置。本装置的多目标跟踪对视频或图像序列中指定类型的运动物体(例如行人)进行跟踪,得到运动物体在每一帧图像中的位置。所述多目标跟踪装置20可以解决现有多目标跟踪方案中对目标检测器的依赖问题,并且提高了跟踪的精度和鲁棒性。如图2所示,所述多目标跟踪装置20可以包括检测模块201、打分模块202、删除模块203、提取模块204、匹配模块205。
检测模块201,用于利用目标检测器检测图像中的预定类型目标,得到所述预定类型目标的目标框。
所述预定类型目标可以包括行人、汽车、飞机、船只等。所述预定类型目标可以是一种类型的目标(例如行人),也可以是多种类型的目标(例如行人和汽车)。
所述目标检测器可以是具有分类和回归功能的神经网络模型。在本实施例中,所述目标检测器可以是加快区域卷积神经网络(Faster Region-Based Convolutional Neural Network,Faster RCNN)模型。
Faster RCNN模型包括区域建议网络(Region Proposal Network,RPN)和快速区域卷积神经网络(Fast Region-based Convolution Neural Network,Fast RCNN)。
所述区域建议网络和所述快速区域卷积神经网络有共享的卷积层,所述卷积层用于提取图像的特征图。所述区域建议网络根据所述特征图生成图像的候选框,并将生成的候选框输入所述快速区域卷积神经网络。所述快速区域卷积神经网络根据所述特征图对所述候选框进行筛选和调整,得到图像的目标框。
在利用目标检测器检测图像中的预定类型目标之前,所述目标检测器需要使用训练样本集进行训练。在训练时,所述卷积层提取训练样本集中各个样本图像的特征图,所述区域建议网络根据所述特征图获取所述各个样本图像中的候选框,所述快速区域卷积神经网络根据所述特征图对所述候选框进行筛选和调整,得到所述各个样本图像的目标框。目标检测器检测预定类型目标(例如行人、汽车、飞机、船只等)的目标框。
在一较佳实施例中,所述加快区域卷积神经网络模型采用ZF框架,所述区域建议网络和所述快速区域卷积神经网络共享5个卷积层。
In a specific embodiment, the Faster RCNN model may be trained with the training sample set according to the following steps (an illustrative sketch of this alternating schedule is given after the list):

(1) Initialize the region proposal network with an Imagenet model, and train the region proposal network with the training sample set;

(2) Use the region proposal network trained in (1) to generate the candidate frames of the sample images in the training sample set, and train the fast region-based convolutional neural network with the candidate frames. At this point, the region proposal network and the fast region-based convolutional neural network do not yet share convolutional layers;

(3) Initialize the region proposal network with the fast region-based convolutional neural network trained in (2), and train the region proposal network with the training sample set;

(4) Initialize the fast region-based convolutional neural network with the region proposal network trained in (3), keep the convolutional layers fixed, and train the fast region-based convolutional neural network with the training sample set. At this point, the region proposal network and the fast region-based convolutional neural network share the same convolutional layers and form a unified network model.
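The following structural sketch shows only the ordering of the four steps and the hand-off of weights between the two sub-networks; `train_rpn`, `gen_proposals` and `train_fast_rcnn` are caller-supplied placeholders standing in for the actual training routines, which the text above does not prescribe.

```python
# Structural sketch of the 4-step alternating training schedule described above.
# The three callables are hypothetical placeholders for the real training code.

def alternate_training(train_set, imagenet_weights,
                       train_rpn, gen_proposals, train_fast_rcnn):
    # (1) initialize the RPN from an Imagenet model and train it
    rpn = train_rpn(init=imagenet_weights, data=train_set)

    # (2) generate candidate frames with the trained RPN and train Fast RCNN on them
    #     (no convolutional layers are shared yet at this point)
    fast_rcnn = train_fast_rcnn(init=imagenet_weights, data=train_set,
                                proposals=gen_proposals(rpn, train_set))

    # (3) re-initialize the RPN from the trained Fast RCNN and train it again
    rpn = train_rpn(init=fast_rcnn.conv_weights, data=train_set)

    # (4) initialize Fast RCNN from the retrained RPN, keep the shared
    #     convolutional layers fixed, and train the Fast RCNN heads
    fast_rcnn = train_fast_rcnn(init=rpn.conv_weights, data=train_set,
                                proposals=gen_proposals(rpn, train_set),
                                freeze_shared_conv=True)
    return rpn, fast_rcnn   # a unified model sharing the same convolutional layers
```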
The region proposal network selects a relatively large number of candidate frames, so a number of candidate frames with the highest target classification scores may be selected according to the target classification scores of the candidate frames and input into the fast region-based convolutional neural network, so as to speed up training and detection.
A back-propagation algorithm may be used to train the region proposal network. During training, the network parameters of the region proposal network are adjusted to minimize a loss function. The loss function indicates the difference between the predicted confidence and the true confidence of the candidate frames predicted by the region proposal network. The loss function may include two parts: a target classification loss and a regression loss.
$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i, t_i^*)$

where $i$ is the index of a candidate frame in a training mini-batch.

$L_{cls}(p_i, p_i^*)$ is the target classification loss of the candidate frame. $N_{cls}$ is the size of the training mini-batch, for example 256. $p_i$ is the predicted probability that the $i$-th candidate frame is a target. $p_i^*$ is the GT (ground-truth) label: if the candidate frame is positive (that is, the assigned label is a positive label, and the frame is called a positive candidate frame), $p_i^*$ is 1; if the candidate frame is negative (that is, the assigned label is a negative label, and the frame is called a negative candidate frame), $p_i^*$ is 0. $L_{cls}$ may be computed as the log loss over the two classes (target and non-target):

$L_{cls}(p_i, p_i^*) = -\left[p_i^* \log p_i + (1 - p_i^*)\log(1 - p_i)\right]$

$p_i^* L_{reg}(t_i, t_i^*)$ is the regression loss of the candidate frame. $\lambda$ is a balancing weight, which may be taken as 10. $N_{reg}$ is the number of candidate frames. $L_{reg}$ may be computed as

$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$

$t_i$ is a coordinate vector, that is, $t_i = (t_x, t_y, t_w, t_h)$, representing the 4 parameterized coordinates of the candidate frame (for example, the top-left corner coordinates plus the width and height). $t_i^*$ is the coordinate vector of the GT bounding box corresponding to a positive candidate frame, that is, $t_i^* = (t_x^*, t_y^*, t_w^*, t_h^*)$ (for example, the top-left corner coordinates plus the width and height of the real target frame). $R$ is the robust loss function (smooth L1), defined as:

$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$
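Purely as an illustration of the two-part loss above, the following NumPy sketch evaluates the smooth L1 regression term and the combined classification-plus-regression loss for one mini-batch; the array shapes and the default values of N_cls, N_reg and λ follow the examples in the text and are otherwise assumptions.

```python
import numpy as np

def smooth_l1(x):
    """Robust smooth L1 loss R(x), applied element-wise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=None, lam=10.0):
    """p: predicted target probabilities, shape (N,)
    p_star: ground-truth labels (1 positive, 0 negative), shape (N,)
    t, t_star: parameterized box coordinates, shape (N, 4)."""
    n_reg = n_reg or len(p)
    eps = 1e-12
    # classification term: log loss over target / non-target
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    # regression term: smooth L1 over the 4 coordinates, counted for positives only
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

# toy mini-batch of 4 candidate frames
p = np.array([0.9, 0.2, 0.7, 0.4]); p_star = np.array([1, 0, 1, 0])
t = np.zeros((4, 4)); t_star = np.full((4, 4), 0.3)
print(rpn_loss(p, p_star, t, t_star))
```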
The training method of the fast region-based convolutional neural network may refer to the training method of the region proposal network and is not described in detail here.

In this embodiment, a Hard Negative Mining (HNM) method is added to the training of the fast region-based convolutional neural network. For the negative samples that are wrongly classified as positive samples by the fast region-based convolutional neural network (that is, hard examples), the information of these negative samples is recorded. In the next training iteration, these negative samples are input into the training sample set again, and the weights of their losses are increased to strengthen their influence on the classifier. This ensures that classification is continuously performed on harder negative samples, so that the features learned by the classifier progress from easy to hard and the covered sample distribution becomes more diverse.
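A simplified sketch of this hard negative mining loop is shown below; `train_fn` and `score_fn` are caller-supplied placeholders for the real Fast RCNN training and scoring routines, and the score threshold and per-sample weight boost are illustrative assumptions.

```python
# Sketch of hard negative mining: negatives that the classifier wrongly scores
# as positive are recorded, re-fed into the training pool, and given a larger
# loss weight in the next iteration.

def train_with_hard_negative_mining(train_fn, score_fn, positives, negatives,
                                    iterations=5, weight_boost=2.0,
                                    positive_threshold=0.5):
    model = None
    pool = [(s, 1.0) for s in positives] + [(s, 1.0) for s in negatives]
    for _ in range(iterations):
        model = train_fn(pool)                    # weighted training pass
        hard = [s for s in negatives
                if score_fn(model, s) >= positive_threshold]   # misclassified negatives
        pool += [(s, weight_boost) for s in hard]  # re-feed them with a larger loss weight
    return model
```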
In other embodiments, the target detector may also be another neural network model, for example, a Region-based Convolutional Neural Network (RCNN) model or a Fast Region-based Convolutional Neural Network (Fast RCNN) model.
When the target detector is used to detect the predetermined type of target in the image, the image is input into the target detector, and the target detector detects the predetermined type of target in the image and outputs the positions of the target frames of the predetermined type of target in the image. For example, the target detector outputs 6 target frames in the image. A target frame may be presented in the form of a rectangular frame. The position of a target frame may be represented by position coordinates, and the position coordinates may include the top-left corner coordinates (x, y) and the width and height (w, h).

The target detector may also output the type of each target frame, for example, output 5 target frames of the pedestrian type (referred to as pedestrian target frames) and 1 target frame of the car type (referred to as a car target frame). This method does not have a high requirement on the accuracy of the target detector, and the types of the target frames output by the target detector may be inaccurate.
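Purely for illustration, a detector output of this form could be represented as follows; the field names and coordinate values are assumptions, not part of the patent.

```python
# Illustrative representation of the detector output described above: each target
# frame carries its position (top-left corner, width, height) and a possibly
# inaccurate type label.
detections = [
    {"box": (12, 34, 40, 90),   "type": "pedestrian"},
    {"box": (60, 30, 42, 95),   "type": "pedestrian"},
    {"box": (110, 28, 45, 92),  "type": "pedestrian"},
    {"box": (160, 31, 41, 88),  "type": "pedestrian"},
    {"box": (210, 29, 43, 94),  "type": "pedestrian"},
    {"box": (300, 80, 120, 60), "type": "car"},
]
```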
The scoring module 202 is configured to use a target classifier to score the target frames to obtain the scores of the target frames belonging to a specified target.

The image and the positions of the target frames are input into the target classifier, and the target classifier scores each target frame to obtain the score of each target frame.

The specified target is included in the predetermined type of target. For example, the predetermined type of target includes pedestrians and cars, and the specified target includes pedestrians.

There may be multiple target frames of the predetermined type of target. Scoring the target frames with the target classifier means scoring each target frame separately to obtain the score of each target frame belonging to the specified target. For example, in an application that tracks pedestrians, the 5 obtained pedestrian target frames and the 1 obtained car target frame are scored to obtain the score of each target frame belonging to a pedestrian.

The target frames of the predetermined type of target detected by the target detector may contain target frames of non-specified targets, and the purpose of scoring the target frames with the target classifier is to identify the target frames of non-specified targets. If a target frame belongs to the specified target, its score of belonging to the specified target is relatively high; if a target frame does not belong to the specified target, its score of belonging to the specified target is relatively low. For example, if the specified target is a pedestrian, an input pedestrian target frame obtains a score of 0.9, and an input car target frame obtains a score of 0.1.

The target classifier may be a neural network model. In this embodiment, the target classifier may be a Region-based Fully Convolutional Network (R-FCN) model.

The R-FCN model also includes a region proposal network. Compared with the Faster RCNN model, the R-FCN model has deeper shared convolutional layers and can obtain more abstract features for scoring.

The R-FCN model obtains position-sensitive score maps of the target frames, and scores the target frames according to the position-sensitive score maps.
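For illustration only, the following NumPy sketch shows the position-sensitive voting idea: an ROI is divided into a k x k grid of bins, each bin is pooled from its own group of score maps, and the bins are averaged into a per-class score. The grid size, class count, channel layout and plain average pooling are assumptions and are not details taken from the patent.

```python
import numpy as np

def psroi_score(score_maps, roi, k=3, num_classes=2):
    """score_maps: array (k*k*num_classes, H, W) of position-sensitive score maps
    (assumed channel layout: bin-major, class-minor). roi: (x0, y0, x1, y1) in map
    coordinates. Returns a probability per class for the ROI."""
    x0, y0, x1, y1 = roi
    bin_w, bin_h = (x1 - x0) / k, (y1 - y0) / k
    votes = np.zeros(num_classes)
    for i in range(k):              # vertical bin index
        for j in range(k):          # horizontal bin index
            ys = slice(int(y0 + i * bin_h), max(int(y0 + (i + 1) * bin_h), int(y0 + i * bin_h) + 1))
            xs = slice(int(x0 + j * bin_w), max(int(x0 + (j + 1) * bin_w), int(x0 + j * bin_w) + 1))
            for c in range(num_classes):
                m = score_maps[(i * k + j) * num_classes + c]
                votes[c] += m[ys, xs].mean()    # pool bin (i, j) from its own map
    votes /= k * k                              # average the bin votes
    return np.exp(votes) / np.exp(votes).sum()  # softmax over classes

rng = np.random.default_rng(0)
maps = rng.normal(size=(3 * 3 * 2, 40, 60))     # k=3, 2 classes (target / background)
print(psroi_score(maps, roi=(10, 5, 40, 35)))   # two class probabilities summing to 1
```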
Before the target classifier is used to score the target frames, the target classifier needs to be trained with a training sample set. The training of the target classifier may refer to the prior art and is not described in detail here.
The deleting module 203 is configured to delete, from the target frames, the target frames whose scores are lower than a preset threshold, to obtain the filtered target frames.

The filtered target frames are the target frames of the specified target.

It may be determined whether the score of each target frame belonging to the specified target is lower than the preset threshold (for example, 0.7). If the score of a target frame belonging to the specified target is lower than the preset threshold, the target frame is regarded as a false detection and is deleted. For example, if the scores of the 5 obtained pedestrian target frames are 0.9, 0.8, 0.7, 0.8 and 0.9, and the score of the 1 obtained car target frame is 0.1, the score of the car target frame is lower than the preset threshold, so the car target frame is deleted and the 5 pedestrian target frames remain.

The preset threshold may be set according to actual needs.

The extraction module 204 is configured to use a feature extractor to extract the features of the filtered target frames to obtain the feature vectors of the filtered target frames.

The filtered target frames are input into the feature extractor, and the feature extractor extracts the features of the filtered target frames to obtain the feature vectors of the filtered target frames.

There may be multiple filtered target frames. Extracting the features of the filtered target frames with the feature extractor means extracting the features of each filtered target frame to obtain the feature vector of each filtered target frame.

The feature extractor may be a neural network model. In this embodiment, a re-identification (Re-Identification, ReID) method may be used to extract the features of the filtered target frames. For example, when the method is used to track pedestrians, a ReID method, for example a part-aligned ReID method, may be used to extract the features of the filtered pedestrian target frames (referred to as pedestrian re-identification features).

The extracted features of the filtered target frames may include global features and local features. The ways of extracting local features may include image partitioning, localization using key points (for example, skeleton key points), and pose/angle correction.

In a specific embodiment, the method is used to track pedestrians, and a feature extraction convolutional neural network (CNN) model may be used to extract the features of the filtered target frames. The feature extraction CNN model includes three linear sub-networks FEN-C1, FEN-C2 and FEN-C3. For each filtered target frame, 14 skeleton key points in the target frame may be extracted, and 7 regions of interest (Region of interest, ROI) are obtained according to the 14 skeleton key points. The 7 regions of interest include 3 large regions (head, upper body and lower body) and 4 small limb regions. The target frame passes through the complete feature extraction CNN model to obtain a global feature. The 3 large regions pass through the FEN-C2 and FEN-C3 sub-networks to obtain three local features. The 4 limb regions pass through the FEN-C3 sub-network to obtain four local features. All 8 features are concatenated at different scales, and finally a pedestrian re-identification feature that fuses the global feature and the multi-scale local features is obtained.

In a specific embodiment, the extracted feature vector of a filtered target frame is a 128-dimensional feature vector.
The matching module 205 is configured to match, according to the feature vectors, the filtered target frames with the target frames of the previous frame of the image to obtain the updated target frames.

The difference values between the filtered target frames and the target frames of the previous frame image may be calculated according to the feature vectors, and the target frames in the filtered target frames that match the target frames of the previous frame image are determined according to the difference values, so as to obtain the updated target frames.

For example, the filtered target frames include target frame A1, target frame A2, target frame A3 and target frame A4, and the target frames of the previous frame image include target frame B1, target frame B2, target frame B3 and target frame B4. For target frame A1, the difference values between A1 and B1, A1 and B2, A1 and B3, and A1 and B4 are calculated, and the pair of target frames with the smallest difference value that is not greater than a preset difference value (for example, A1 and B1) is determined as matching target frames. Similarly, for target frame A2, the difference values between A2 and B1, A2 and B2, A2 and B3, and A2 and B4 are calculated, and the pair with the smallest difference value that is not greater than the preset difference value (for example, A2 and B2) is determined as matching target frames; for target frame A3, the difference values between A3 and B1, A3 and B2, A3 and B3, and A3 and B4 are calculated, and the pair with the smallest difference value that is not greater than the preset difference value (for example, A3 and B3) is determined as matching target frames; for target frame A4, the difference values between A4 and B1, A4 and B2, A4 and B3, and A4 and B4 are calculated, and the pair with the smallest difference value that is not greater than the preset difference value (for example, A4 and B4) is determined as matching target frames. Therefore, the updated target frames include target frame A1, target frame A2, target frame A3 and target frame A4, which correspond to target frame B1, target frame B2, target frame B3 and target frame B4 in the previous frame image, respectively.

The cosine distance between the feature vector of a filtered target frame and the feature vector of each target frame of the previous frame image may be calculated, and the cosine distance is taken as the difference value between the filtered target frame and each target frame of the previous frame image.

Alternatively, the Euclidean distance between the feature vector of a filtered target frame and the feature vector of each target frame of the previous frame image may be calculated, and the Euclidean distance is taken as the difference value between the filtered target frame and each target frame of the previous frame image.

If the difference values between a filtered target frame and all target frames of the previous frame image are greater than the preset difference value, the filtered target frame is stored as a new target frame.

It should be noted that if the image being processed is the first frame of a sequence of continuously captured images, that is, there is no previous frame image, the feature vectors of the filtered target frames obtained by the module 204 are stored directly.
This embodiment provides a multi-target tracking device 20. The multi-target tracking tracks moving objects of a specified type (for example, pedestrians) in a video or image sequence to obtain the positions of the moving objects in each frame of the image. The multi-target tracking device 20 uses a target detector to detect a predetermined type of target in an image to obtain the target frames of the predetermined type of target; uses a target classifier to score the target frames to obtain the scores of the target frames belonging to a specified target; deletes the target frames whose scores are lower than a preset threshold to obtain the filtered target frames; uses a feature extractor to extract the features of the filtered target frames to obtain the feature vectors of the filtered target frames; and matches, according to the feature vectors, the filtered target frames with the target frames of the previous frame of the image to obtain the updated target frames. This embodiment solves the problem of dependence on the target detector in existing multi-target tracking solutions, and improves the accuracy and robustness of tracking.
Embodiment 3
This embodiment provides a readable storage medium on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, the steps in the above multi-target tracking method embodiment are implemented, for example, Steps 101-105 shown in FIG. 1:

Step 101: use a target detector to detect a predetermined type of target in an image to obtain the target frames of the predetermined type of target;

Step 102: use a target classifier to score the target frames to obtain the scores of the target frames belonging to a specified target;

Step 103: delete, from the target frames, the target frames whose scores are lower than a preset threshold, to obtain the filtered target frames;

Step 104: use a feature extractor to extract the features of the filtered target frames to obtain the feature vectors of the filtered target frames;

Step 105: match, according to the feature vectors, the filtered target frames with the target frames of the previous frame of the image to obtain the updated target frames.

Alternatively, when the computer-readable instructions are executed by the processor, the functions of the modules in the above device embodiment are implemented, for example, modules 201-205 in FIG. 2:

the detection module 201, configured to use a target detector to detect a predetermined type of target in an image to obtain the target frames of the predetermined type of target;

the scoring module 202, configured to use a target classifier to score the target frames to obtain the scores of the target frames belonging to a specified target;

the deleting module 203, configured to delete, from the target frames, the target frames whose scores are lower than a preset threshold, to obtain the filtered target frames;

the extraction module 204, configured to use a feature extractor to extract the features of the filtered target frames to obtain the feature vectors of the filtered target frames;

the matching module 205, configured to match, according to the feature vectors, the filtered target frames with the target frames of the previous frame of the image to obtain the updated target frames.
Embodiment 4
FIG. 3 is a schematic diagram of a computer device provided in Embodiment 4 of the present application. The computer device 30 includes a memory 301, a processor 302, and computer-readable instructions 303 that are stored in the memory 301 and executable on the processor 302, for example, a multi-target tracking program. When the processor 302 executes the computer-readable instructions 303, the steps in the above multi-target tracking method embodiment are implemented, for example, Steps 101-105 shown in FIG. 1:

Step 101: use a target detector to detect a predetermined type of target in an image to obtain the target frames of the predetermined type of target;

Step 102: use a target classifier to score the target frames to obtain the scores of the target frames belonging to a specified target;

Step 103: delete, from the target frames, the target frames whose scores are lower than a preset threshold, to obtain the filtered target frames;

Step 104: use a feature extractor to extract the features of the filtered target frames to obtain the feature vectors of the filtered target frames;

Step 105: match, according to the feature vectors, the filtered target frames with the target frames of the previous frame of the image to obtain the updated target frames.

Alternatively, when the computer-readable instructions are executed by the processor, the functions of the modules in the above device embodiment are implemented, for example, modules 201-205 in FIG. 2:

the detection module 201, configured to use a target detector to detect a predetermined type of target in an image to obtain the target frames of the predetermined type of target;

the scoring module 202, configured to use a target classifier to score the target frames to obtain the scores of the target frames belonging to a specified target;

the deleting module 203, configured to delete, from the target frames, the target frames whose scores are lower than a preset threshold, to obtain the filtered target frames;

the extraction module 204, configured to use a feature extractor to extract the features of the filtered target frames to obtain the feature vectors of the filtered target frames;

the matching module 205, configured to match, according to the feature vectors, the filtered target frames with the target frames of the previous frame of the image to obtain the updated target frames.
Exemplarily, the computer-readable instructions 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete the method. For example, the computer-readable instructions 303 may be divided into the detection module 201, the scoring module 202, the deleting module 203, the extraction module 204 and the matching module 205 in FIG. 2; for the specific functions of the modules, refer to Embodiment 2.

The computer device 30 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. A person skilled in the art can understand that FIG. 3 is merely an example of the computer device 30 and does not constitute a limitation on the computer device 30; the computer device 30 may include more or fewer components than shown, or combine certain components, or have different components. For example, the computer device 30 may further include input/output devices, network access devices, buses, and the like.

The processor 302 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor or the like. The processor 302 is the control center of the computer device 30 and uses various interfaces and lines to connect the parts of the entire computer device 30.

The memory 301 may be configured to store the computer-readable instructions 303. The processor 302 implements the various functions of the computer device 30 by running or executing the computer-readable instructions or modules stored in the memory 301 and calling the data stored in the memory 301. The memory 301 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (for example, a sound playing function, an image playing function, and the like), and the like; the data storage area may store data created according to the use of the computer device 30 (for example, audio data, a phone book, and the like), and the like. In addition, the memory 301 may include a high-speed random access memory, and may also include a non-volatile memory, for example, a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a flash memory device, or another solid-state storage device.
If the modules integrated in the computer device 30 are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a readable storage medium. Based on such an understanding, all or part of the processes in the methods of the above embodiments of the present application may also be completed by instructing relevant hardware through computer-readable instructions. The computer-readable instructions may be stored in a readable storage medium, and when executed by a processor, the computer-readable instructions can implement the steps of the above method embodiments. The computer-readable medium may include: any entity or device capable of carrying the computer-readable instructions, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

In the several embodiments provided in the present application, it should be understood that the disclosed system, device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for example, the division of the modules is merely a logical function division, and there may be other division manners in actual implementation.

The modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The above integrated module may be implemented in the form of hardware, or in the form of hardware plus software functional modules.

The above software functional modules are stored in a readable storage medium and include several instructions to enable a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of the present application.

It is obvious to a person skilled in the art that the present application is not limited to the details of the above exemplary embodiments, and the present application can be implemented in other specific forms without departing from the spirit or basic features of the present application. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of the present application is defined by the appended claims rather than the above description, and it is therefore intended that all changes falling within the meaning and scope of the equivalent elements of the claims be included in the present application. Any reference sign in the claims should not be regarded as limiting the claim concerned. In addition, it is obvious that the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or devices stated in the system claims may also be implemented by one module or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application and are not limiting. Although the present application has been described in detail with reference to the preferred embodiments, a person of ordinary skill in the art should understand that modifications or equivalent replacements can be made to the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.

Claims (20)

  1. A multi-target tracking method, characterized in that the method comprises:
    using a target detector to detect a predetermined type of target in an image to obtain target frames of the predetermined type of target;
    using a target classifier to score the target frames to obtain scores of the target frames belonging to a specified target;
    deleting, from the target frames, target frames whose scores are lower than a preset threshold, to obtain filtered target frames;
    using a feature extractor to extract features of the filtered target frames to obtain feature vectors of the filtered target frames;
    matching, according to the feature vectors, the filtered target frames with target frames of a previous frame of the image to obtain updated target frames.
  2. The method according to claim 1, characterized in that the target detector is a Faster Region-Based Convolutional Neural Network model, the Faster Region-Based Convolutional Neural Network model comprises a region proposal network and a fast region-based convolutional neural network, and the Faster Region-Based Convolutional Neural Network model is trained according to the following steps before detecting the predetermined type of target in the image:
    a first training step of initializing the region proposal network with an Imagenet model and training the region proposal network with a training sample set;
    a second training step of using the region proposal network trained in the first training step to generate candidate frames of sample images in the training sample set, and training the fast region-based convolutional neural network with the candidate frames;
    a third training step of initializing the region proposal network with the fast region-based convolutional neural network trained in the second training step, and training the region proposal network with the training sample set;
    a fourth training step of initializing the fast region-based convolutional neural network with the region proposal network trained in the third training step, keeping the convolutional layers fixed, and training the fast region-based convolutional neural network with the training sample set.
  3. The method according to claim 2, characterized in that the Faster Region-Based Convolutional Neural Network model adopts a ZF framework, and the region proposal network and the fast region-based convolutional neural network share 5 convolutional layers.
  4. The method according to claim 1, characterized in that the target classifier is a region-based fully convolutional network model.
  5. The method according to claim 1, characterized in that the using a feature extractor to extract the features of the filtered target frames comprises:
    extracting the features of the filtered target frames by using a re-identification method.
  6. The method according to claim 1, characterized in that the matching, according to the feature vectors, the filtered target frames with the target frames of the previous frame of the image comprises:
    calculating, according to the feature vectors, difference values between the filtered target frames and the target frames of the previous frame image, and determining, according to the difference values, the target frames in the filtered target frames that match the target frames of the previous frame image.
  7. The method according to claim 6, characterized in that the calculating, according to the feature vectors, the difference values between the filtered target frames and the target frames of the previous frame image comprises:
    calculating cosine distances between the feature vectors of the filtered target frames and the feature vectors of the target frames of the previous frame image, and taking the cosine distances as the difference values between the filtered target frames and the target frames of the previous frame image; or
    calculating Euclidean distances between the feature vectors of the filtered target frames and the feature vectors of the target frames of the previous frame image, and taking the Euclidean distances as the difference values between the filtered target frames and the target frames of the previous frame image.
  8. A multi-target tracking device, characterized in that the device comprises:
    a detection module, configured to use a target detector to detect a predetermined type of target in an image to obtain target frames of the predetermined type of target;
    a scoring module, configured to use a target classifier to score the target frames to obtain scores of the target frames belonging to a specified target;
    a deleting module, configured to delete, from the target frames, target frames whose scores are lower than a preset threshold, to obtain filtered target frames;
    an extraction module, configured to use a feature extractor to extract features of the filtered target frames to obtain feature vectors of the filtered target frames;
    a matching module, configured to match, according to the feature vectors, the filtered target frames with target frames of a previous frame of the image to obtain updated target frames.
  9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory stores at least one computer-readable instruction, and the processor executes the at least one computer-readable instruction to implement the following steps:
    using a target detector to detect a predetermined type of target in an image to obtain target frames of the predetermined type of target;
    using a target classifier to score the target frames to obtain scores of the target frames belonging to a specified target;
    deleting, from the target frames, target frames whose scores are lower than a preset threshold, to obtain filtered target frames;
    using a feature extractor to extract features of the filtered target frames to obtain feature vectors of the filtered target frames;
    matching, according to the feature vectors, the filtered target frames with target frames of a previous frame of the image to obtain updated target frames.
  10. The computer device according to claim 9, characterized in that the target detector is a Faster Region-Based Convolutional Neural Network model, the Faster Region-Based Convolutional Neural Network model comprises a region proposal network and a fast region-based convolutional neural network, and before the using a target detector to detect a predetermined type of target in an image to obtain the target frames of the predetermined type of target, the processor further executes the at least one computer-readable instruction to implement the following steps:
    a first training step of initializing the region proposal network with an Imagenet model and training the region proposal network with a training sample set;
    a second training step of using the region proposal network trained in the first training step to generate candidate frames of sample images in the training sample set, and training the fast region-based convolutional neural network with the candidate frames;
    a third training step of initializing the region proposal network with the fast region-based convolutional neural network trained in the second training step, and training the region proposal network with the training sample set;
    a fourth training step of initializing the fast region-based convolutional neural network with the region proposal network trained in the third training step, keeping the convolutional layers fixed, and training the fast region-based convolutional neural network with the training sample set.
  11. The computer device according to claim 10, characterized in that the Faster Region-Based Convolutional Neural Network model adopts a ZF framework, and the region proposal network and the fast region-based convolutional neural network share 5 convolutional layers.
  12. The computer device according to claim 9, characterized in that the target classifier is a region-based fully convolutional network model.
  13. The computer device according to claim 9, characterized in that the using a feature extractor to extract the features of the filtered target frames comprises:
    extracting the features of the filtered target frames by using a re-identification method.
  14. The computer device according to claim 9, characterized in that the matching, according to the feature vectors, the filtered target frames with the target frames of the previous frame of the image comprises:
    calculating, according to the feature vectors, difference values between the filtered target frames and the target frames of the previous frame image, and determining, according to the difference values, the target frames in the filtered target frames that match the target frames of the previous frame image.
  15. The computer device according to claim 14, characterized in that the calculating, according to the feature vectors, the difference values between the filtered target frames and the target frames of the previous frame image comprises:
    calculating cosine distances between the feature vectors of the filtered target frames and the feature vectors of the target frames of the previous frame image, and taking the cosine distances as the difference values between the filtered target frames and the target frames of the previous frame image; or
    calculating Euclidean distances between the feature vectors of the filtered target frames and the feature vectors of the target frames of the previous frame image, and taking the Euclidean distances as the difference values between the filtered target frames and the target frames of the previous frame image.
  16. A non-volatile readable storage medium, wherein at least one computer-readable instruction is stored on the non-volatile readable storage medium, characterized in that when the at least one computer-readable instruction is executed by a processor, the following steps are implemented:
    using a target detector to detect a predetermined type of target in an image to obtain target frames of the predetermined type of target;
    using a target classifier to score the target frames to obtain scores of the target frames belonging to a specified target;
    deleting, from the target frames, target frames whose scores are lower than a preset threshold, to obtain filtered target frames;
    using a feature extractor to extract features of the filtered target frames to obtain feature vectors of the filtered target frames;
    matching, according to the feature vectors, the filtered target frames with target frames of a previous frame of the image to obtain updated target frames.
  17. The non-volatile readable storage medium according to claim 16, characterized in that the target detector is a Faster Region-Based Convolutional Neural Network model, the Faster Region-Based Convolutional Neural Network model comprises a region proposal network and a fast region-based convolutional neural network, and before the using a target detector to detect a predetermined type of target in an image to obtain the target frames of the predetermined type of target, the following steps are further implemented when the at least one computer-readable instruction is executed by the processor:
    a first training step of initializing the region proposal network with an Imagenet model and training the region proposal network with a training sample set;
    a second training step of using the region proposal network trained in the first training step to generate candidate frames of sample images in the training sample set, and training the fast region-based convolutional neural network with the candidate frames;
    a third training step of initializing the region proposal network with the fast region-based convolutional neural network trained in the second training step, and training the region proposal network with the training sample set;
    a fourth training step of initializing the fast region-based convolutional neural network with the region proposal network trained in the third training step, keeping the convolutional layers fixed, and training the fast region-based convolutional neural network with the training sample set.
  18. The non-volatile readable storage medium according to claim 17, characterized in that the using a feature extractor to extract the features of the filtered target frames comprises:
    extracting the features of the filtered target frames by using a re-identification method.
  19. The non-volatile readable storage medium according to claim 16, characterized in that the matching, according to the feature vectors, the filtered target frames with the target frames of the previous frame of the image comprises:
    calculating, according to the feature vectors, difference values between the filtered target frames and the target frames of the previous frame image, and determining, according to the difference values, the target frames in the filtered target frames that match the target frames of the previous frame image.
  20. The non-volatile readable storage medium according to claim 16, characterized in that the calculating, according to the feature vectors, the difference values between the filtered target frames and the target frames of the previous frame image comprises:
    calculating cosine distances between the feature vectors of the filtered target frames and the feature vectors of the target frames of the previous frame image, and taking the cosine distances as the difference values between the filtered target frames and the target frames of the previous frame image; or
    calculating Euclidean distances between the feature vectors of the filtered target frames and the feature vectors of the target frames of the previous frame image, and taking the Euclidean distances as the difference values between the filtered target frames and the target frames of the previous frame image.
PCT/CN2019/091158 2019-01-23 2019-06-13 多目标跟踪方法、装置、计算机装置及可读存储介质 WO2020151166A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910064677.4 2019-01-23
CN201910064677.4A CN109886998A (zh) 2019-01-23 2019-01-23 多目标跟踪方法、装置、计算机装置及计算机存储介质

Publications (1)

Publication Number Publication Date
WO2020151166A1 true WO2020151166A1 (zh) 2020-07-30

Family

ID=66926556

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/091158 WO2020151166A1 (zh) 2019-01-23 2019-06-13 多目标跟踪方法、装置、计算机装置及可读存储介质

Country Status (2)

Country Link
CN (1) CN109886998A (zh)
WO (1) WO2020151166A1 (zh)


Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826403B (zh) * 2019-09-27 2020-11-24 深圳云天励飞技术有限公司 跟踪目标确定方法及相关设备
CN110992401A (zh) * 2019-11-25 2020-04-10 上海眼控科技股份有限公司 目标跟踪方法、装置、计算机设备和存储介质
CN111091091A (zh) * 2019-12-16 2020-05-01 北京迈格威科技有限公司 目标对象重识别特征的提取方法、装置、设备及存储介质
CN111340092B (zh) * 2020-02-21 2023-09-22 浙江大华技术股份有限公司 一种目标关联处理方法及装置
CN111401224B (zh) * 2020-03-13 2023-05-23 北京字节跳动网络技术有限公司 目标检测方法、装置及电子设备
CN113766175A (zh) * 2020-06-04 2021-12-07 杭州萤石软件有限公司 目标监控方法、装置、设备及存储介质
CN111783797B (zh) * 2020-06-30 2023-08-18 杭州海康威视数字技术股份有限公司 目标检测方法、装置及存储介质
CN111881908B (zh) * 2020-07-20 2024-04-05 北京百度网讯科技有限公司 目标检测模型的修正方法、检测方法、装置、设备及介质
CN111931641B (zh) * 2020-08-07 2023-08-22 华南理工大学 基于权重多样性正则化的行人重识别方法及其应用
CN112055172B (zh) * 2020-08-19 2022-04-19 浙江大华技术股份有限公司 一种监控视频的处理方法、装置以及存储介质
CN112183558A (zh) * 2020-09-30 2021-01-05 北京理工大学 一种基于YOLOv3的目标检测和特征提取一体化网络
CN116862946A (zh) * 2022-03-25 2023-10-10 影石创新科技股份有限公司 运动视频生成方法、装置、终端设备以及存储介质
CN115348385B (zh) * 2022-07-06 2024-03-01 深圳天海宸光科技有限公司 一种枪球联动的足球检测方法及系统


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416250B (zh) * 2017-02-10 2021-06-22 浙江宇视科技有限公司 人数统计方法及装置

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001099922A (ja) * 1999-09-30 2001-04-13 Mitsubishi Electric Corp 多目標追尾装置
CN108229524A (zh) * 2017-05-25 2018-06-29 北京航空航天大学 一种基于遥感图像的烟囱和冷凝塔检测方法
CN107679455A (zh) * 2017-08-29 2018-02-09 平安科技(深圳)有限公司 目标跟踪装置、方法及计算机可读存储介质
CN108121986A (zh) * 2017-12-29 2018-06-05 深圳云天励飞技术有限公司 目标检测方法及装置、计算机装置和计算机可读存储介质

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070175A (zh) * 2020-09-04 2020-12-11 湖南国科微电子股份有限公司 视觉里程计方法、装置、电子设备及存储介质
CN112257809B (zh) * 2020-11-02 2023-07-14 浙江大华技术股份有限公司 目标检测网络优化方法和装置、存储介质及电子设备
CN112257809A (zh) * 2020-11-02 2021-01-22 浙江大华技术股份有限公司 目标检测网络优化方法和装置、存储介质及电子设备
CN112418278A (zh) * 2020-11-05 2021-02-26 中保车服科技服务股份有限公司 一种多类物体检测方法、终端设备及存储介质
CN112633352A (zh) * 2020-12-18 2021-04-09 浙江大华技术股份有限公司 一种目标检测方法、装置、电子设备及存储介质
CN112633352B (zh) * 2020-12-18 2023-08-29 浙江大华技术股份有限公司 一种目标检测方法、装置、电子设备及存储介质
CN112712119A (zh) * 2020-12-30 2021-04-27 杭州海康威视数字技术股份有限公司 确定目标检测模型的检测准确率的方法和装置
CN112712119B (zh) * 2020-12-30 2023-10-24 杭州海康威视数字技术股份有限公司 确定目标检测模型的检测准确率的方法和装置
CN112800873A (zh) * 2021-01-14 2021-05-14 知行汽车科技(苏州)有限公司 确定目标方向角的方法、装置、系统及存储介质
CN112733741A (zh) * 2021-01-14 2021-04-30 苏州挚途科技有限公司 交通标识牌识别方法、装置和电子设备
CN113378969B (zh) * 2021-06-28 2023-08-08 北京百度网讯科技有限公司 一种目标检测结果的融合方法、装置、设备及介质
CN113378969A (zh) * 2021-06-28 2021-09-10 北京百度网讯科技有限公司 一种目标检测结果的融合方法、装置、设备及介质
CN113628245A (zh) * 2021-07-12 2021-11-09 中国科学院自动化研究所 多目标跟踪方法、装置、电子设备和存储介质
CN113628245B (zh) * 2021-07-12 2023-10-31 中国科学院自动化研究所 多目标跟踪方法、装置、电子设备和存储介质
CN113470078A (zh) * 2021-07-15 2021-10-01 浙江大华技术股份有限公司 一种目标跟踪方法、装置及系统

Also Published As

Publication number Publication date
CN109886998A (zh) 2019-06-14

Similar Documents

Publication Publication Date Title
WO2020151166A1 (zh) 多目标跟踪方法、装置、计算机装置及可读存储介质
WO2020151167A1 (zh) 目标跟踪方法、装置、计算机装置及可读存储介质
CN108121986B (zh) 目标检测方法及装置、计算机装置和计算机可读存储介质
CN111460926B (zh) 一种融合多目标跟踪线索的视频行人检测方法
Xu et al. An enhanced Viola-Jones vehicle detection method from unmanned aerial vehicles imagery
CN109087510B (zh) 交通监测方法及装置
JP5919665B2 (ja) 情報処理装置、物体追跡方法および情報処理プログラム
JP7044898B2 (ja) ナンバープレート認識方法、および、そのシステム
CN109977782B (zh) 基于目标位置信息推理的跨店经营行为检测方法
CN103699905B (zh) 一种车牌定位方法及装置
WO2016131300A1 (zh) 一种自适应跨摄像机多目标跟踪方法及系统
KR101896357B1 (ko) 객체를 검출하는 방법, 디바이스 및 프로그램
CN106845430A (zh) 基于加速区域卷积神经网络的行人检测与跟踪方法
KR20180042254A (ko) 오브젝트 추적을 위한 시스템들 및 방법들
CN108960115B (zh) 基于角点的多方向文本检测方法
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
CN105046278B (zh) 基于Haar特征的Adaboost检测算法的优化方法
CN110781785A (zh) 基于Faster RCNN算法改进的交通场景下行人检测方法
Tang et al. Multiple-kernel adaptive segmentation and tracking (MAST) for robust object tracking
CN111931571B (zh) 基于在线增强检测的视频文字目标追踪方法与电子设备
CN111882586A (zh) 一种面向剧场环境的多演员目标跟踪方法
CN113850136A (zh) 基于yolov5与BCNN的车辆朝向识别方法及系统
CN116091892A (zh) 一种基于卷积神经网络的快速目标检测方法
JP2022521540A (ja) オンライン学習を利用した物体追跡のための方法およびシステム
US20220398400A1 (en) Methods and apparatuses for determining object classification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19911904

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19911904

Country of ref document: EP

Kind code of ref document: A1