WO2022193990A1 - Detection and tracking method, apparatus, device, storage medium and computer program product - Google Patents

Detection and tracking method, apparatus, device, storage medium and computer program product

Info

Publication number
WO2022193990A1
Authority
WO (WIPO PCT)
Prior art keywords
frame, tracking, target, feature points, video
Application number
PCT/CN2022/079697
Other languages
English (en), French (fr)
Inventor
毛曙源
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2022193990A1
Priority to US 17/976,287 (published as US20230047514A1)

Classifications

    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248: Analysis of motion using feature-based methods involving reference images or patches
    • G06T 7/269: Analysis of motion using gradient-based methods
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/30221: Sports video; sports image
    • G06V 2201/07: Target detection

Definitions

  • the present application relates to the field of video processing, and relates to, but is not limited to, a detection and tracking method, apparatus, device, storage medium and computer program product.
  • A method of detecting every video frame of a video stream is adopted, that is, the bounding box of an object is detected in each video frame, and the bounding boxes of objects in adjacent video frames are matched and associated according to their categories.
  • the embodiments of the present application provide a detection and tracking method, apparatus, device, storage medium and computer program product, which can improve the real-time performance and stability of target detection and tracking.
  • the technical solution is as follows:
  • the embodiment of the present application provides a detection and tracking method, the method is executed by an electronic device, and the method includes:
  • performing feature point analysis on a video frame sequence to obtain feature points on each video frame in the video frame sequence; performing, through a first thread, target detection on an extracted frame based on the feature points to obtain a target frame in the extracted frame, the extracted frame being a video frame extracted from the video frame sequence using a target step size;
  • tracking, through a second thread, the target frame in a current frame based on the feature points and the target frame in the extracted frame, to obtain the target frame in the current frame; and outputting the target frame in the current frame.
  • An embodiment of the present application provides a detection and tracking device, which includes:
  • an analysis module configured to perform feature point analysis on the video frame sequence to obtain feature points on each video frame in the video frame sequence
  • the detection module is configured to perform target detection on the extracted frame based on the feature points through the first thread, and obtain the target frame in the extracted frame, and the extracted frame is the video frame extracted from the video frame sequence by using the target step size;
  • the tracking module is configured to track the target frame in the current frame based on the feature points and the target frame in the extracted frame through the second thread, and obtain the target frame in the current frame;
  • an output module configured to output the target frame in the current frame.
  • An embodiment of the present application provides a computer device, the computer device including a processor and a memory, where the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the detection and tracking method described above.
  • An embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the storage medium, and the computer program is loaded and executed by a processor to implement the above detection and tracking method.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the above-mentioned detection and tracking method.
  • In the embodiments of the present application, the feature points of each video frame are obtained, target detection on the extracted frames and target frame tracking on the current frame are then performed through the first thread and the second thread respectively, and finally the target frame of each video frame is obtained.
  • The above method splits target detection and target tracking into two threads, so the detection algorithm does not affect the tracking frame rate: even if the detection thread takes a long time, the terminal can still output the target frame of each video frame. Moreover, the target detection process is performed only on the extracted frames and there is no need to detect every video frame, which reduces the time consumed by the detection process, so the target frame of each video frame can be output in real time, improving the real-time performance and stability of target detection and tracking.
  • FIG. 1 is a schematic diagram of a multi-target detection and tracking system provided by an exemplary embodiment of the present application
  • FIG. 2 is a flowchart of a detection and tracking method provided by an exemplary embodiment of the present application
  • FIG. 3 is a schematic diagram of a target frame provided by an exemplary embodiment of the present application.
  • FIG. 4 is a schematic diagram of a time sequence relationship of a multi-target real-time detection system provided by an exemplary embodiment of the present application;
  • FIG. 5 is a flowchart of a detection and tracking method provided by another exemplary embodiment of the present application.
  • FIG. 6 is a flowchart of a detection and tracking method provided by another exemplary embodiment of the present application.
  • FIG. 7 is a flowchart of a third thread provided by an exemplary embodiment of the present application.
  • FIG. 8 is a flowchart of a second thread provided by an exemplary embodiment of the present application.
  • FIG. 9 is a schematic diagram of a video frame provided by an exemplary embodiment of the present application.
  • FIG. 10 is a schematic diagram of a video frame provided by another exemplary embodiment of the present application.
  • FIG. 11 is a schematic diagram of a video frame provided by another exemplary embodiment of the present application.
  • FIG. 12 is a structural block diagram of a detection and tracking apparatus provided by an exemplary embodiment of the present application.
  • FIG. 13 shows a structural block diagram of an electronic device provided by an exemplary embodiment of the present application.
  • Object detection refers to scanning and searching for objects in images and videos (a series of images), that is, locating and identifying objects in a scene.
  • Target tracking refers to tracking the motion characteristics of a target in a video without identifying the tracking target. Therefore, detection and tracking of images can be widely used for target recognition and tracking in computer vision; for example, it can be used for target detection and tracking in automatic driving scenarios.
  • the first thread refers to the detection thread, which outputs the detected object target frame and category by detecting the input video frame.
  • an object detection algorithm detects an object in the video frame, and outputs a target frame and class of the object.
  • For example, a One-Stage algorithm, a Two-Stage algorithm or an Anchor-free algorithm (all target detection methods) can be used to detect the video frame.
  • the second thread refers to the tracking thread, which realizes the tracking of the target frame through the matching pair of target feature points.
  • For example, the target frame of the previous frame includes feature points x1, x2 and x3, whose coordinates in the previous frame are a, b and c respectively, and whose coordinates in the current frame are a', b' and c' respectively. By calculating the displacement and scale change between a, b, c and a', b', c', the displacement and scale of the target frame of the current frame relative to the target frame of the previous frame are obtained, and thus the target frame of the current frame is obtained.
  • the third thread refers to the motion analysis thread, which extracts feature points from the initial frame and outputs the feature points of each video frame by tracking.
  • For example, a corner detection algorithm such as the Harris algorithm, the FAST (Features from Accelerated Segment Test) algorithm, or the GFTT (Good Features to Track) algorithm can be used to extract the feature points of the initial frame.
  • An optical flow tracking algorithm, such as the Lucas-Kanade algorithm, can be used to track the feature points of the previous frame of the current frame.
  • Fig. 1 shows a structural block diagram of a multi-target detection and tracking system according to an exemplary embodiment of the present application.
  • the multi-target detection and tracking system is provided with three processing threads.
  • the first thread 121 is used to detect the target of the extracted frame to obtain the detection target frame of the extracted frame;
  • the second thread 122 is used to track the motion trajectory of the target frame of the previous frame of the current frame and, combined with the detection target frame of the extracted frame, obtain the target frame of the current frame;
  • the third thread 123 is used to extract feature points from the initial frame to obtain the feature points on the initial frame, and to track the feature points of the previous frame of the current frame to obtain the feature points of the current frame (i.e., of each frame).
  • In the first thread 121, the orientation of the extracted frame is adjusted, the adjusted extracted frame is detected to obtain the detection target frame of the extracted frame, and the detection target frame is input into the second thread 122.
  • Each video frame, together with its feature points, is input into the second thread 122; when a target frame exists in the previous frame, the second thread 122 obtains the tracking target frame of the current frame based on the previous frame.
  • When the first thread 121 has not input a new detection target frame, the tracking target frame of the current frame obtained by the second thread 122 is used as the target frame of the current frame, and the target frame of the current frame is output;
  • when the second thread 122 receives the detection target frame of the latest extracted frame input by the first thread 121, it obtains the tracking target frame of that detection target frame in the current frame, merges the duplicate frames between this tracking target frame and the tracking target frame obtained from the previous frame, obtains the target frame of the current frame, and outputs the target frame of the current frame.
  • the above-mentioned multi-target detection and tracking system may run at least on an electronic device, and the electronic device may be a server or a server group, or a terminal. That is to say, the above-mentioned multi-target detection and tracking system can be run on at least the terminal, or on the server, or on both the terminal and the server.
  • the detection and tracking method in this embodiment of the present application may be implemented by a terminal, or by a server or a server group, or by mutual interaction between the terminal and the server.
  • target frame for short.
  • the number of the above-mentioned terminals and servers may be more or less.
  • the above-mentioned terminal may be only one, or the above-mentioned terminal may be dozens or hundreds, or more.
  • the above server may be only one, or the above server may be dozens or hundreds, or more. This embodiment of the present application does not limit the number of terminals, device types, and the number of servers.
  • the following embodiments take the application of the multi-target real-time detection and tracking system to a terminal as an example for explanation.
  • FIG. 2 shows a detection and tracking method according to an exemplary embodiment of the present application, which is illustrated by applying the method to the multi-target detection and tracking system shown in FIG. 1 .
  • the method includes:
  • Step 220 Perform feature point analysis on the video frame sequence to obtain feature points on each video frame in the video frame sequence.
  • the terminal in response to the input video frame sequence, performs feature point analysis on the video frame sequence to obtain feature points on each video frame in the video frame sequence.
  • Feature points refer to pixel points in the video frame that have distinct characteristics and can effectively reflect the essential characteristics of the video frame, and the feature points can identify target objects in the video frame.
  • the matching of the target object can be completed, that is, the target object can be identified and classified.
  • In some embodiments, the feature points are points with rich local information obtained by algorithmic analysis; for example, feature points exist at the corners of the image and in regions where the texture changes drastically. It is worth noting that the feature points have scale invariance, that is, a consistent property that allows them to be recognized across different images.
  • Feature point analysis refers to feature point extraction and feature point tracking on input video frames.
  • the terminal in response to the input video frame sequence, performs feature point extraction on the initial frame, and obtains the tracking feature points of the next frame through feature point tracking, and sequentially tracks the feature points of all video frames.
  • In some embodiments, the Harris algorithm can be used to extract feature points, that is, a fixed window is set on the initial video frame, the window is slid over the image in arbitrary directions, and the degree of grayscale change of the pixels within the window is examined. If, for sliding in any direction, the degree of grayscale change of a pixel is greater than a grayscale-change threshold, or if, among multiple pixels, the degree of grayscale change of a pixel is greater than that of every other pixel among the multiple pixels, the pixel is determined to be a feature point.
  • In some embodiments, a feature point extraction algorithm (such as the FAST-9 algorithm) can be used for feature point extraction, that is, each pixel on the initial video frame is examined, and a pixel is determined to be a feature point when it meets certain conditions.
  • The conditions here at least include: determining the number of neighbouring pixels whose absolute pixel difference with respect to the examined pixel exceeds a pixel-difference threshold, and judging whether this number is greater than or equal to a number threshold; the condition is met when the number is greater than or equal to the number threshold. For example, there are 16 pixels on a circle centred at pixel P with a radius of 3. First, the pixel differences between P and four of these pixels are calculated; assuming the number threshold is 3, if at least three of the four absolute pixel differences exceed the pixel-difference threshold, the next step of judgment is performed, otherwise P is determined not to be a feature point. In the next step, the pixel differences between P and all 16 pixels on the circle are calculated; if at least 9 of these 16 absolute pixel differences exceed the pixel-difference threshold, P is determined to be a feature point.
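  • As a rough illustration, the criterion just described might be implemented as follows (a sketch only; the pixel-difference threshold is an arbitrary example value, and, unlike standard FAST, contiguity of the circle pixels is not checked because the description above does not require it):

```python
# Offsets of the 16 pixels on a circle of radius 3 around a candidate pixel P.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_feature_point(img, x, y, diff_threshold=20, quick_count=3, full_count=9):
    """Simplified FAST-9 style test on a grayscale image (2D array); (x, y) must lie
    at least 3 pixels away from every image border."""
    p = int(img[y, x])
    # Quick rejection: the four pixels at the top, right, bottom and left of the circle.
    quick = [CIRCLE[i] for i in (0, 4, 8, 12)]
    if sum(abs(int(img[y + dy, x + dx]) - p) > diff_threshold for dx, dy in quick) < quick_count:
        return False
    # Full test: at least `full_count` of the 16 circle pixels must differ strongly from P.
    hits = sum(abs(int(img[y + dy, x + dx]) - p) > diff_threshold for dx, dy in CIRCLE)
    return hits >= full_count
```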
  • Lucas-Kanade optical flow algorithm is used to track the feature points of the previous frame.
  • Step 240 performing target detection on the extracted frame based on the feature points through the first thread to obtain a target frame in the extracted frame.
  • the extracted frame is the video frame extracted from the video frame sequence using the target step size; the target step size is the frame interval for extracting the video frame sequence, if the target step size is 2, that is, one video frame is extracted for every two video frames.
  • In some embodiments, the target step size is a fixed value; for example, the video frame sequence is extracted with a target step size of 2. In some embodiments, the target step size may be variable, that is, it may take many values; for example, if the 0th frame, the 3rd frame, the 7th frame and the 12th frame are extracted, the target step size between the first and second extractions is 3, the target step size between the second and third extractions is 4, and the target step size between the third and fourth extractions is 5.
  • the target step size can be set according to the time-consuming of the detection algorithm. For example, if it takes three frames to detect each video frame, the terminal sets the target step size to 3.
  • a step size of 3 may be used to decimate the sequence of video frames.
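  • For illustration, extraction with a fixed or a variable target step size could look like the following sketch (the step values are just the example values mentioned above):

```python
def extract_frames_fixed(num_frames, target_step=3):
    # Fixed target step size: keep frame 0, then every target_step-th frame after it.
    return list(range(0, num_frames, target_step))

def extract_frames_variable(start=0, steps=(3, 4, 5)):
    # Variable target step size, e.g. the 0th, 3rd, 7th and 12th frames mentioned above.
    indices = [start]
    for step in steps:
        indices.append(indices[-1] + step)
    return indices  # extract_frames_variable() -> [0, 3, 7, 12]
```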
  • the first thread is used to detect the target of the extracted frame, and obtain the detection target frame of the extracted frame.
  • the One-Stage algorithm, the Two-Stage algorithm or the Anchor-free algorithm may be used to detect the video frame.
  • the detection algorithm often takes more than one frame, that is, it is impossible to detect every video frame.
  • the technical solutions provided by the embodiments of the present application perform multi-thread detection and tracking on the video frame sequence.
  • the target box is used to identify objects.
  • the target box is represented as a bounding box of the object, and class information of the object is displayed within the bounding box.
  • Figure 3 shows the target frame 301 of the mobile phone, the target frame 302 of the orange, the target frame 303 of the mouse, and the target frame 304 of the water cup.
  • the bounding box of the object, and the name of the object is also displayed in the bounding box.
  • the target frame is represented as a texture of the object, that is, a texture is added around the object to increase the interest of the video frame.
  • the type of the target frame is not limited.
  • the target frame includes a tracking target frame and a detection target frame.
  • the tracking target frame refers to the target frame obtained by tracking the target frame of the previous frame;
  • the detection target frame refers to the target frame obtained by detecting the video frame.
  • Step 260 by using the second thread to track the target frame in the current frame based on the feature points and the target frame in the extracted frame, to obtain the target frame in the current frame.
  • FIG. 4 shows a schematic diagram of a time sequence relationship of a multi-target real-time detection system according to an exemplary embodiment of the present application.
  • The duration of video frame tracking is less than the video frame acquisition interval (i.e., the image acquisition shown in Figure 4), so the tracking operation is performed for every video frame, while the detection frame rate (i.e., the video frame detection shown in Figure 4) is relatively low and image detection cannot be performed on every video frame; therefore, image detection is performed on the extracted frames, and the extraction step size in Figure 4 is 3.
  • For example, when tracking is performed on the 2nd video frame, the detection of the 0th video frame has just been completed.
  • In this case, the target frame detected in the 0th frame needs to be "transferred" to the 2nd frame so that it can be fused with the tracking frame of the 2nd frame, which is equivalent to performing the tracking from frame 0 to frame 2 again.
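  • One way to picture this "transfer" is to reuse the per-frame feature point lists produced by the motion analysis thread, match feature points by label between the extracted frame and the current frame, and shift the detected box accordingly. A sketch (all names are illustrative; shift_box stands for any box-offset routine, such as the ones sketched later in this description):

```python
def transfer_detection_box(det_box, det_frame_points, cur_frame_points, shift_box):
    """Carry a detection box from the extracted frame it was detected in to the current frame.

    det_box          : (x1, y1, x2, y2) detected in the extracted frame
    det_frame_points : dict mapping feature-point label -> (x, y) in the extracted frame
    cur_frame_points : dict mapping feature-point label -> (x, y) in the current frame
    shift_box        : callable (box, prev_pts, cur_pts) -> shifted box
    """
    x1, y1, x2, y2 = det_box
    # Feature points that lie inside the detection box in the extracted frame.
    inside = {label: p for label, p in det_frame_points.items()
              if x1 <= p[0] <= x2 and y1 <= p[1] <= y2}
    # Match them by label with the feature points of the current frame.
    common = [label for label in inside if label in cur_frame_points]
    if not common:
        return det_box                      # no matches: leave the box where it was
    prev_pts = [inside[label] for label in common]
    cur_pts = [cur_frame_points[label] for label in common]
    return shift_box(det_box, prev_pts, cur_pts)
```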
  • Based on the feature points and the target frame in the extracted frame, the second thread performs target frame tracking in the current frame to obtain the target frame in the current frame, which is divided into the following two situations:
  • In the first situation, when the first thread does not output the first target frame, the second thread tracks the second target frame in the current frame based on the feature points to obtain the target frame in the current frame; the first target frame is the target frame detected in the most recent extracted frame before the current frame in the video frame sequence, and the second target frame is the target frame tracked in the previous frame of the current frame. For example, when there is no target frame in the previous frame of the current frame, there is also no tracking target frame in the current frame obtained based on the target frame of the previous frame.
  • For example, when the currently input video frame is the 1st frame and the first thread has not output the detection frame of the 0th frame, the second thread tracks the target frame in the 0th frame based on the feature points of the 0th frame and the feature points of the 1st frame to obtain the tracking target frame of the 1st frame, and this tracking target frame is the target frame of the 1st frame.
  • The target frame on the 0th frame itself is obtained by tracking based on the target frame of the frame preceding the 0th frame.
  • Tracking the target frame in the 0th frame to obtain the tracking target frame of the 1st frame can be implemented as follows: first, the tracking feature points of the current frame and the target feature points of the previous frame of the current frame are obtained; then, the second thread forms multiple sets of feature point matching pairs from the tracking feature points of the current frame and the target feature points of the previous frame, where the target feature points are the feature points located in the second target frame; next, multiple sets of feature point offset vectors are determined for the multiple sets of feature point matching pairs, which can be obtained by calculation; then, the target frame offset vector of the second target frame is calculated based on the multiple sets of feature point offset vectors; finally, the second target frame is offset according to the target frame offset vector to obtain the target frame in the current frame.
  • For example, the target feature points of the 0th frame are x1, x2 and x3, whose coordinates in the 0th frame are a, b and c respectively, and the feature points x1, x2 and x3 correspond to the tracking feature points x1', x2' and x3' in the 1st frame.
  • the feature points x1 and x1' form a matching pair of feature points
  • x2 and x2' form a matching pair of feature points.
  • x3, x3' form feature point matching pairs, and obtain multiple sets of feature point offset vectors as (a, a'), (b, b'), (c, c').
  • The coordinates of the target frame in the 0th frame are denoted as m.
  • In some embodiments, the target frame offset vector is the average vector of the multiple sets of feature point offset vectors; in this case the target frame coordinate of the 1st frame is m + ((a, a') + (b, b') + (c, c'))/3.
  • In some embodiments, the target frame offset vector is a weighted vector of the multiple sets of feature point offset vectors. For example, if the weight of the offset vector (a, a') is 0.2, the weight of the offset vector (b, b') is 0.4 and the weight of the offset vector (c, c') is 0.4, the target frame coordinate of the 1st frame is m + (0.2(a, a') + 0.4(b, b') + 0.4(c, c')).
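  • A small sketch of this offset computation in plain NumPy (the simple average and the weighted average correspond to the two cases above; the function and parameter names are illustrative):

```python
import numpy as np

def shift_target_box(box_coordinate, prev_pts, cur_pts, weights=None):
    """Offset a target frame coordinate m by the (weighted) mean of the feature point offsets.

    box_coordinate : reference coordinate of the target frame in the previous frame (m above)
    prev_pts       : Nx2 coordinates of the target feature points in the previous frame
    cur_pts        : Nx2 coordinates of the same points tracked into the current frame
    weights        : optional per-point weights; if omitted, a simple average is used
    """
    offsets = np.asarray(cur_pts, float) - np.asarray(prev_pts, float)  # per-point offset vectors
    if weights is None:
        box_offset = offsets.mean(axis=0)                               # average vector
    else:
        w = np.asarray(weights, float)
        box_offset = (offsets * w[:, None]).sum(axis=0) / w.sum()       # weighted vector
    return np.asarray(box_coordinate, float) + box_offset

# Example with the weights 0.2, 0.4, 0.4 used above (coordinates are made up):
# shift_target_box((100, 100), [(10, 10), (20, 20), (30, 30)],
#                  [(12, 11), (22, 21), (33, 31)], weights=[0.2, 0.4, 0.4])
```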
  • In the second situation, when the first thread outputs the first target frame, the second thread tracks the first target frame and the second target frame in the current frame based on the feature points to obtain the target frame in the current frame;
  • the first target frame is the target frame detected in the most recent extracted frame of the current frame in the video frame sequence
  • the second target frame is the target frame tracked in the previous frame of the current frame.
  • In some embodiments, the above method includes the following steps: the second thread tracks the first target frame in the current frame based on the feature points to obtain a first tracking frame; the second thread tracks the second target frame in the current frame based on the feature points to obtain a second tracking frame; and the duplicate frames in the first tracking frame and the second tracking frame are merged to obtain the target frame in the current frame.
  • For example, when the current frame is the 2nd frame and the first thread outputs the detection target frame of the 0th frame, the second thread tracks the detection target frame of the 0th frame in the 2nd frame to obtain the first tracking frame, tracks the target frame of the 1st frame in the 2nd frame based on the feature points to obtain the second tracking frame, and merges the repeated frames in the first tracking frame and the second tracking frame to obtain the target frame in the 2nd frame.
  • Step 280 output the target frame in the current frame.
  • the terminal obtains the target frame of the current frame and completes the output of the target frame of the current frame.
  • the above method divides detection and tracking into two thread operations.
  • the detection algorithm does not affect the tracking frame rate. Even if the detection thread takes a long time, the terminal can output the target frame of each video frame.
  • This method can not only output the target frame of each video frame in real time, but the delay of the real-time output also does not increase significantly as the number of target frames grows.
  • In addition, the target detection process is performed only on the extracted frames and there is no need to detect every video frame, which reduces the time consumed by the detection process, so the target frame of each video frame can be output in real time, improving the real-time performance and stability of target detection and tracking.
  • FIG. 5 shows a detection and tracking method of an exemplary embodiment of the present application, wherein steps 220 , 240 , 260 , and 280 have been described above and will not be repeated here.
  • In step 260, before the repeated frames in the first tracking frame and the second tracking frame are merged to obtain the target frame in the current frame, the following steps are further included:
  • Step 250-1 based on the fact that the intersection over union (IoU, Intersection over Union) of the first tracking frame and the second tracking frame is greater than the IoU threshold, it is determined that there are duplicate frames in the first tracking frame and the second tracking frame.
  • In some embodiments, the second thread tracks the first target frame in the current frame based on the feature points to obtain the first tracking frame, and the second thread tracks the second target frame in the current frame based on the feature points to obtain the second tracking frame.
  • IoU is a standard for the accuracy of detecting a corresponding object in a specific data set.
  • this standard is used to measure the correlation between the tracking target frame and the detection target frame. The higher the correlation, the higher the value.
  • For example, the area where the tracking target frame is located is S1, the area where the detection target frame is located is S2, the intersection of S1 and S2 is S3, and the union of S1 and S2 is S4; then the IoU is S3/S4.
  • In some embodiments, the terminal pre-stores an IoU threshold.
  • the IoU threshold is 0.5.
  • When the IoU of the first tracking frame and the second tracking frame in the current frame is greater than 0.5, it is determined that the first tracking frame and the second tracking frame contain duplicate frames; if the IoU of the first tracking frame and the second tracking frame in the current frame is not greater than 0.5, it is determined that the first tracking frame and the second tracking frame do not contain duplicate frames.
  • Step 250-2 based on the fact that the IoU of the first tracking frame and the second tracking frame is greater than the IoU threshold and that the first tracking frame and the second tracking frame are of the same category, it is determined that the first tracking frame and the second tracking frame contain duplicate frames.
  • steps 250-1 and 250-2 are parallel steps, that is, only executing step 250-1 or only executing step 250-2 can complete the judgment of the repeated frame.
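  • A sketch of the duplicate-frame check of steps 250-1 and 250-2, assuming axis-aligned boxes given as (x1, y1, x2, y2) tuples (the 0.5 threshold is the example value above, and the category comparison is optional, as in step 250-2):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)                 # S3 above
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])            # S1
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])            # S2
    union = area_a + area_b - inter                                   # S4
    return inter / union if union > 0 else 0.0

def is_duplicate(box_a, box_b, category_a=None, category_b=None, iou_threshold=0.5):
    """Duplicate-frame check: IoU above the threshold (step 250-1), optionally also
    requiring the same category (step 250-2)."""
    if iou(box_a, box_b) <= iou_threshold:
        return False
    if category_a is not None and category_b is not None:
        return category_a == category_b
    return True
```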
  • In some embodiments, at least one of the following methods is used to perform the repeated-frame merging in step 260:
  • Method 1 In response to the existence of duplicate frames in the first tracking frame and the second tracking frame, determine the first tracking frame as the target frame of the current frame;
  • the first tracking frame is determined as the target frame of the current frame.
  • Method 2 In response to the existence of duplicate frames in the first tracking frame and the second tracking frame, determine the tracking frame with the highest confidence in the first tracking frame and the second tracking frame as the target frame of the current frame;
  • That is, after the determination that the first tracking frame and the second tracking frame contain duplicate frames is completed, the tracking frame with the higher confidence among the first tracking frame and the second tracking frame is determined as the target frame of the current frame.
  • the target detection algorithm is used to output the confidence score of the target frame
  • the terminal deletes the target frame whose score is lower than the confidence threshold, and uses the tracking frame whose confidence is greater than or equal to the confidence threshold as the target frame of the current frame.
  • Method 3 In response to the presence of a duplicate frame between the first tracking frame and the second tracking frame, and the first tracking frame is at the boundary of the current frame, the second tracking frame is determined as the target frame of the current frame.
  • Since the target frame is represented as the bounding box of the object, when the detection target frame obtained by detecting the adjacent extracted frame cannot completely surround the object, that is, when the object is not completely displayed in the adjacent extracted frame, the second tracking frame is determined as the target frame of the current frame.
  • the above-mentioned methods 1, 2 and 3 are parallel methods, that is, only method 1, only method 2, or only method 3 can be performed to complete the merging of repeated frames.
  • The above method realizes the judgment of whether duplicate frames exist in the current frame and the merging of the duplicate frames, which ensures that the target frames of the current frame are clear and orderly and avoids repeated target frames with the same function appearing in the current frame.
  • FIG. 6 shows a detection and tracking method according to an exemplary embodiment of the present application, wherein steps 240 , 260 and 280 have been described above and will not be repeated here.
  • Step 221 extracting feature points of the initial frame in the video frame sequence through the third thread, to obtain the feature points of the initial frame;
  • feature point extraction is first performed on the initial frame through the third thread 123 .
  • Step 222 through the third thread, based on the feature points of the initial frame, perform feature point tracking on the ith frame in the video frame sequence to obtain the feature points of the ith frame in the video frame sequence; the ith frame is a video frame located after the initial frame, the starting value of i is the frame number of the initial frame plus one, and i is a positive integer.
  • In this way, the feature points of the ith frame can be obtained, where the ith frame is a video frame after the initial frame and the starting value of i is the frame number of the initial frame plus one. It is worth noting that the third thread 123 only performs feature point extraction on the initial frame and does not perform feature point extraction on the ith video frame.
  • Step 223 using the third thread to track the feature points of the i+1 th frame in the video frame sequence based on the feature points of the ith frame, to obtain the feature points of the i+1 th frame in the video frame sequence.
  • the feature point of the ith+1 th frame in the video frame sequence is obtained.
  • the third thread performs optical flow tracking on the feature points of the i-th frame to obtain the feature points of the i+1-th frame in the video frame sequence.
  • For example, the Lucas-Kanade optical flow algorithm can be used to track the feature points of the previous frame.
  • the feature points of the video frame sequence can be extracted and tracked.
  • In some embodiments, after the third thread performs feature point tracking based on the feature points of the ith frame to obtain the feature points of the (i+1)th frame in the video frame sequence, the method further includes deleting and supplementing the feature points of the (i+1)th frame.
  • Deleting feature points of the (i+1)th frame: in response to the first feature point in the (i+1)th frame meeting a deletion condition, the first feature point in the (i+1)th frame is deleted, where the deletion condition includes at least one of the following:
  • the first feature point is a feature point that fails to track.
  • For example, the third thread performs feature point tracking based on the feature points of the ith frame and obtains a first feature point of the (i+1)th frame in the video frame sequence for which no feature point that can form a matching pair can be found in the ith frame; such a point is a feature point that fails to track.
  • the terminal in response to the distance between the first feature point of the i+1 th frame and the adjacent feature points being less than the distance threshold D, deletes the first feature point in the i+1 th frame.
  • the distance threshold D is selected according to the amount of calculation and the size of the image. For example, the value range of the distance threshold D is 5 to 20.
  • Supplementing feature points of the (i+1)th frame: in response to a target area in the (i+1)th frame meeting a point-supplementing condition, new feature points are extracted from the target area, where the point-supplementing condition includes: the target area is an area in which the feature point tracking result is empty.
  • For example, there are 50 feature points in the target area of the ith frame, and after feature point tracking there are only 20 feature points in the target area of the (i+1)th frame; when the feature point tracking result of the (i+1)th frame in the target area is determined to be empty, the operation of extracting new feature points from the target area is performed. For the extraction method, refer to step 220.
  • For example, the target area of the ith frame is the "mobile phone" area, that is, a target frame can be added to the "mobile phone" through its 50 feature points; when only 20 feature points remain in the "mobile phone" area of the (i+1)th frame, the terminal cannot add a target frame to the mobile phone. In this case, new feature points need to be extracted from the "mobile phone" area before the terminal can add a target frame to the mobile phone. It is worth noting that the third thread does not itself add a target frame to the "mobile phone" area; it only provides the terminal with the possibility of adding a target frame to the mobile phone. The operation of adding a target frame to the "mobile phone" area is implemented in the second thread.
  • The above method realizes feature point extraction on the initial frame and feature point tracking on the video frames, improves the stability of the feature points of adjacent frames by deleting and supplementing feature points, and ensures that the second thread can obtain the target frame through the feature points of adjacent frames.
  • In some embodiments, performing feature point analysis on the video frame sequence to obtain the feature points on each video frame in the video frame sequence can be implemented by the method shown in FIG. 7, which is a flowchart of the third thread; the method includes:
  • Step 701 input a video frame sequence.
  • In response to an operation of starting multi-target real-time detection, the terminal inputs a video frame sequence.
  • Step 702 Determine whether the current frame is an initial frame.
  • the terminal determines whether the current frame is the initial frame; if the current frame is the initial frame, step 706 is performed; if the current frame is not the initial frame, step 703 is performed.
  • Step 703 Perform feature point tracking on the feature points in the previous frame of the current frame to obtain a tracking result.
  • the image coordinates of the feature points in the current frame are obtained by tracking the feature points of the previous frame through an optical flow tracking algorithm, and the optical flow tracking algorithm includes but is not limited to: Lucas-Kanade optical flow.
  • Step 704 based on the tracking result, perform non-maximum value suppression on the feature points.
  • the non-maximum suppression of the feature points means that the terminal deletes the feature points that fail to track, and when the distance between the two feature points is less than the distance threshold, deletes one feature point of the two feature points.
  • Deletion strategies include but are not limited to: deleting one at random; scoring feature points based on the feature point gradient, and deleting the one with a lower score. Refer to step 506 for the distance threshold.
  • Step 705 the feature points are supplemented.
  • In response to the need to extract new feature points in regions of the current frame where no feature points are tracked, new feature points are extracted; for the extraction method, refer to step 706.
  • Step 706 Extract the feature points of the initial frame to obtain the feature points of the initial frame.
  • In response to the current frame being the initial frame, the terminal performs a feature point extraction operation on the initial frame.
  • the terminal extracts feature points in the initial frame to ensure that the minimum interval between the feature points is not less than the interval threshold (the interval threshold is selected depending on the amount of calculation and the size of the image, such as 5 to 20).
  • Feature extraction methods include but are not limited to: Harris, FAST, Good Feature To Tracker, etc.
  • the terminal assigns a feature point label to each new feature point, where the label increases from 0.
  • Step 707 output the feature point list of the current frame.
  • a feature point list of each video frame in the video frame sequence is output.
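  • Putting the flowchart of FIG. 7 together, a per-frame sketch of the third (motion analysis) thread could look as follows, using OpenCV for extraction and Lucas-Kanade tracking (the class name and parameter values are illustrative; the suppression here is distance-only and omits the gradient-based scoring mentioned in step 704):

```python
import cv2
import numpy as np

class MotionAnalysisThread:
    """Sketch of the third thread: extract on the initial frame, then track, suppress, supplement."""

    def __init__(self, min_distance=10):
        self.min_distance = min_distance        # distance/interval threshold (e.g. 5 to 20)
        self.prev_gray = None
        self.points = {}                        # feature-point label -> (x, y)
        self.next_label = 0                     # labels increase from 0

    def process(self, frame_gray):
        if self.prev_gray is not None and self.points:
            # Track the previous frame's feature points with Lucas-Kanade optical flow.
            labels = list(self.points)
            prev = np.float32([self.points[lab] for lab in labels]).reshape(-1, 1, 2)
            cur, status, _err = cv2.calcOpticalFlowPyrLK(self.prev_gray, frame_gray, prev, None)
            tracked = {lab: (float(p[0]), float(p[1]))
                       for lab, p, s in zip(labels, cur.reshape(-1, 2), status.ravel()) if s == 1}
            self.points = self._suppress(tracked)   # drop failed / too-close points
        self._supplement(frame_gray)                # initial extraction, or re-extraction in empty regions
        self.prev_gray = frame_gray
        return dict(self.points)                    # feature point list of the current frame

    def _suppress(self, pts):
        kept = {}
        for lab, p in pts.items():
            if all(np.hypot(p[0] - q[0], p[1] - q[1]) >= self.min_distance for q in kept.values()):
                kept[lab] = p
        return kept

    def _supplement(self, frame_gray):
        mask = np.full(frame_gray.shape, 255, np.uint8)
        for x, y in self.points.values():           # block out neighbourhoods of existing points
            cv2.circle(mask, (int(x), int(y)), self.min_distance, 0, -1)
        new = cv2.goodFeaturesToTrack(frame_gray, maxCorners=500, qualityLevel=0.01,
                                      minDistance=self.min_distance, mask=mask)
        for p in ([] if new is None else new.reshape(-1, 2)):
            self.points[self.next_label] = (float(p[0]), float(p[1]))
            self.next_label += 1
```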
  • In some embodiments, the first thread performing target detection on the extracted frame based on the feature points to obtain the target frame in the extracted frame can be realized as follows: through the first thread, the terminal takes the extracted frames of the input video frame sequence and outputs the bounding boxes and classes of the detected objects.
  • Target detection algorithms include but are not limited to: One-Stage algorithm, Two-Stage algorithm, and Anchor-free algorithm.
  • In some embodiments, before detection, the terminal adjusts the orientation of the extracted frame to the direction of gravity to improve the detection effect.
  • In some embodiments, the second thread tracking the target frame in the current frame based on the feature points to obtain the target frame in the current frame can be implemented by the method shown in FIG. 8, which is a flowchart of the second thread according to an exemplary embodiment of the present application; the method includes:
  • Step 801 input a list of adjacent video frames and corresponding feature points.
  • In response to the third thread outputting the feature points of the video frame sequence, the terminal inputs the adjacent video frames and the corresponding feature point lists into the second thread.
  • Step 802 Match the current frame with the feature points of the previous frame.
  • the feature points of the current frame and the feature points of the previous frame are matched by the feature point labels to obtain feature point matching pairs.
  • Step 803 track the target frame of the previous frame.
  • the terminal determines the feature points in the target frame of the previous frame, and calculates the displacement and scale of the target frame of the previous frame in the current frame according to the feature point matching pair.
  • the calculation methods include but are not limited to: median flow method, homography matrix method, etc.
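  • A sketch of such a calculation in the spirit of the median-flow method (the scale is taken as the median ratio of pairwise point distances; applying cv2.findHomography to the matched pairs would be an alternative, and the function name is illustrative):

```python
import numpy as np

def track_box_median_flow(box, prev_pts, cur_pts):
    """Estimate the displacement and scale of a target frame from matched feature points.

    box      : (x1, y1, x2, y2) of the target frame in the previous frame
    prev_pts : Nx2 matched feature points inside the box in the previous frame
    cur_pts  : Nx2 corresponding points in the current frame
    """
    prev_pts, cur_pts = np.asarray(prev_pts, float), np.asarray(cur_pts, float)
    if len(prev_pts) == 0:
        return box                                      # nothing to track with
    dx, dy = np.median(cur_pts - prev_pts, axis=0)      # median displacement of the points

    # Scale change: median ratio of pairwise point distances before and after tracking.
    i, j = np.triu_indices(len(prev_pts), k=1)
    if len(i):
        d_prev = np.linalg.norm(prev_pts[i] - prev_pts[j], axis=1)
        d_cur = np.linalg.norm(cur_pts[i] - cur_pts[j], axis=1)
        scale = float(np.median(d_cur / np.maximum(d_prev, 1e-6)))
    else:
        scale = 1.0

    cx, cy = (box[0] + box[2]) / 2 + dx, (box[1] + box[3]) / 2 + dy   # shifted centre
    w, h = (box[2] - box[0]) * scale, (box[3] - box[1]) * scale       # scaled size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```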
  • Step 804 it is determined whether there is a new target frame.
  • the terminal determines whether the first thread outputs the detection target frame, and if so, executes step 805 ; if not, executes step 808 .
  • Step 805 perform feature point matching between the current frame and the detection frame.
  • In response to the first thread outputting the detection target frame, the terminal performs feature point matching between the current frame and the detection frame through the feature point labels to obtain feature point matching pairs.
  • Step 806 track the target frame of the detection frame.
  • the terminal determines the feature points in the target frame, and calculates the displacement and scale of the detection target frame in the current frame according to the feature point matching pair.
  • the calculation methods include but are not limited to: median flow method, homography matrix method, etc.
  • Step 807 In the current frame, add a fusion frame of the detection target frame and the tracking target frame.
  • the tracking target frame and the detection target frame may overlap.
  • the overlapping judgment criteria are:
  • the IOU of the tracking target frame and the detection target frame is greater than the IOU threshold, for example, the IOU threshold may be 0.5.
  • Based on determining that the tracking target frame and the detection target frame overlap, the terminal performs an overlapping-frame fusion operation.
  • the two target frames need to be fused into one target frame through a strategy to obtain a fusion frame.
  • In some embodiments, the fusion strategy includes at least the following methods: the detection target frame is always selected as the target frame of the current frame; or, according to the target detection algorithm, the terminal obtains the confidence scores of the tracking target frame and the detection target frame and deletes the target frame with the smaller confidence score from the current frame; or, when the detection target frame is close to the boundary of the current frame, the terminal determines that the object is detected incompletely and takes the tracking target frame as the target frame of the current frame, otherwise the terminal takes the detection target frame as the target frame of the current frame.
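  • A sketch of this fusion logic, combining the confidence rule with the image-boundary rule (the iou helper is the one sketched earlier; the border margin and function name are illustrative assumptions):

```python
def fuse_overlapping_boxes(det_box, det_conf, trk_box, trk_conf,
                           frame_w, frame_h, iou_threshold=0.5, border_margin=5):
    """Merge an overlapping detection target frame and tracking target frame into one box."""
    if iou(det_box, trk_box) <= iou_threshold:
        return [det_box, trk_box]                       # no overlap: keep both boxes

    x1, y1, x2, y2 = det_box
    near_border = (x1 <= border_margin or y1 <= border_margin or
                   x2 >= frame_w - border_margin or y2 >= frame_h - border_margin)
    if near_border:
        return [trk_box]                                # detection likely clipped at the image edge
    return [det_box] if det_conf >= trk_conf else [trk_box]
```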
  • Step 808 Output all target frames of the current frame.
  • Based on the above steps 801 to 807, the terminal outputs all target frames of the current frame.
  • FIG. 10 shows a schematic diagram of a video frame provided by another exemplary embodiment of the present application.
  • FIG. 11 shows a schematic diagram of a video frame according to another exemplary embodiment of the present application.
  • In response to a video of a football match being input, the terminal detects target frames such as the player 1101, the goal 1102 and the football 1103, tracks these targets frame by frame, and subsequent football match analysis can be performed based on the tracking results.
  • Specifically, the terminal performs feature point analysis on the video frame sequence of the football video to obtain the feature points on each video frame in the video frame sequence; through the first thread, target detection is performed on the extracted frames based on the feature points, and the terminal obtains the target frames in the extracted frames, where the extracted frames are the video frames extracted from the video frame sequence using the target step size; through the second thread, the target frames are tracked in the current frame based on the feature points, and the terminal obtains the target frames in the current frame; finally, the terminal outputs the target frames in the current frame.
  • FIG. 12 is a structural block diagram of a detection and tracking device provided by an exemplary embodiment of the present application. As shown in FIG. 12 , the device includes:
  • the analysis module 1010 is configured to perform feature point analysis on the video frame sequence to obtain feature points on each video frame in the video frame sequence;
  • the detection module 1020 is configured to perform target detection on the extracted frame based on the feature points through the first thread to obtain the target frame in the extracted frame, where the extracted frame is a video frame extracted from the video frame sequence using the target step size;
  • the tracking module 1030 is configured to, through the second thread, track the target frame in the current frame based on the feature points and the target frame in the extracted frame, to obtain the target frame in the current frame;
  • the output module 1050 is configured to output the target frame in the current frame.
  • In some embodiments, the tracking module 1030 is further configured to, when the first thread does not output the first target frame, track the second target frame in the current frame based on the feature points through the second thread to obtain the target frame in the current frame.
  • In some embodiments, the tracking module 1030 is further configured to, when the first thread outputs the first target frame, track the first target frame and the second target frame in the current frame based on the feature points through the second thread to obtain the target frame in the current frame.
  • the first target frame is the target frame detected in the most recent extracted frame before the current frame in the video frame sequence
  • the second target frame is the target frame tracked in the previous frame of the current frame.
  • In some embodiments, the tracking module 1030 includes a tracking sub-module 1031 and a merging module 1032, where the tracking sub-module 1031 is configured to track the first target frame in the current frame based on the feature points through the second thread to obtain the first tracking frame.
  • the tracking sub-module 1031 is further configured to track the second target frame in the current frame based on the feature points through the second thread to obtain the second tracking frame.
  • the merging module 1032 is configured to merge the repeated frames in the first tracking frame and the second tracking frame to obtain the target frame in the current frame.
  • In some embodiments, the apparatus further includes a determination module 1040, where the determination module 1040 is configured to determine, based on the IoU of the first tracking frame and the second tracking frame being greater than the IoU threshold, that the first tracking frame and the second tracking frame contain duplicate frames.
  • In some embodiments, the determination module 1040 is further configured to determine, based on the IoU of the first tracking frame and the second tracking frame being greater than the IoU threshold and the first tracking frame and the second tracking frame being of the same category, that the first tracking frame and the second tracking frame contain duplicate frames.
  • the determining module 1040 is further configured to determine that the first tracking frame is the target frame of the current frame in response to the presence of a duplicate frame between the first tracking frame and the second tracking frame.
  • In some embodiments, the determination module 1040 is further configured to, in response to the existence of duplicate frames in the first tracking frame and the second tracking frame, determine the tracking frame with the higher confidence among the first tracking frame and the second tracking frame as the target frame of the current frame.
  • In some embodiments, the determination module 1040 is further configured to, in response to the existence of duplicate frames in the first tracking frame and the second tracking frame and the first tracking frame being at the boundary of the current frame, determine the second tracking frame as the target frame of the current frame.
  • In some embodiments, the tracking module 1030 is further configured to acquire the tracking feature points of the current frame and the target feature points of the previous frame of the current frame, and to form, through the second thread, multiple sets of feature point matching pairs from the tracking feature points of the current frame and the target feature points of the previous frame, where the target feature points are the feature points located in the second target frame.
  • the tracking module 1030 is further configured to determine multiple sets of feature point offset vectors of multiple sets of feature point matching pairs.
  • the tracking module 1030 is further configured to calculate the target frame offset vector of the second target frame based on the multiple sets of feature point offset vectors.
  • the tracking module 1030 is further configured to offset the second target frame according to the target frame offset vector to obtain the target frame in the current frame.
  • the analysis module 1010 is further configured to perform feature point extraction on an initial frame in the video frame sequence through a third thread to obtain feature points of the initial frame.
  • In some embodiments, the analysis module 1010 is further configured to perform, through the third thread, feature point tracking on the ith frame in the video frame sequence based on the feature points of the initial frame, to obtain the feature points of the ith frame in the video frame sequence, where the ith frame is a video frame located after the initial frame and the starting value of i is the frame number of the initial frame plus one.
  • the analysis module 1010 is further configured to perform feature point tracking on the i+1 th frame in the video frame sequence based on the feature points of the ith frame through the third thread, to obtain the ith frame in the video frame sequence. Feature points of i+1 frame.
  • the analysis module 1010 is further configured to perform optical flow tracking on the feature points of the ith frame through the third thread, so as to obtain the feature points of the ith+1th frame in the video frame sequence.
  • the analysis module 1010 is further configured to delete the first feature point in the i+1th frame in response to the first feature point in the i+1th frame meeting the deletion condition; wherein the deletion condition includes At least one of the following: the first feature point is a feature point that fails to track; the distance between the first feature point and adjacent feature points is less than a distance threshold.
  • the analysis module 1010 is further configured to extract new feature points from the target area in response to the target area in the i+1th frame satisfying the point-replenishment condition; wherein the point-replenishment condition includes: the target area is the area where the feature point tracking result is empty.
  • the above device divides detection and tracking into two thread operations.
  • the detection algorithm does not affect the tracking frame rate. Even if the detection thread takes a long time, the terminal can output the target frame of each video frame.
  • This method can not only output the target frame of the video frame in real time, but also the delay of real-time output will not increase significantly with the increase of the number of target frames.
  • the above device also realizes the judgment of whether there is a repeated frame in the current frame and the combination of the repeated frames, which ensures that the target frames of the current frame are clear and orderly, and avoids the repeated occurrence of target frames with the same function in the current frame.
  • The above apparatus also realizes feature point extraction on the initial frame and feature point tracking on the other frames, improves the stability of the feature points of adjacent frames by deleting and supplementing feature points, and ensures that the second thread can obtain the target frame through the feature points of adjacent frames.
  • FIG. 13 shows a structural block diagram of an electronic device 1300 provided by an exemplary embodiment of the present application.
  • the electronic device 1300 may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • Electronic device 1300 may also be called user equipment, portable terminal, laptop terminal, desktop terminal, and the like by other names.
  • the electronic device 1300 includes: a processor 1301 and a memory 1302 .
  • the processor 1301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 1301 may be implemented in at least one hardware form among digital signal processing (DSP, Digital Signal Processing), field programmable gate array (FPGA, Field-Programmable Gate Array), and programmable logic array (PLA, Programmable Logic Array).
  • the processor 1301 may also include a main processor and a coprocessor.
  • the main processor is a processor used to process data in the wake-up state, also called a central processing unit (CPU, Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state.
  • the processor 1301 may be integrated with a graphics processor (GPU, Graphics Processing Unit), and the GPU is used for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 1301 may further include an artificial intelligence (AI, Artificial Intelligence) processor for processing computing operations related to machine learning.
  • Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. Memory 1302 may also include high-speed random access memory, as well as non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1302 is used to store at least one instruction, and the at least one instruction is executed by the processor 1301 to implement the detection and tracking method provided by the method embodiments of this application.
  • the electronic device 1300 may also optionally include: a peripheral device interface 1303 and at least one peripheral device.
  • the processor 1301, the memory 1302 and the peripheral device interface 1303 can be connected through a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 1303 through a bus, a signal line or a circuit board.
  • the peripheral equipment includes at least one of a radio frequency circuit 1304 , a display screen 1305 , a camera assembly 1306 , an audio circuit 1307 , a positioning assembly 1308 and a power supply 1309 .
  • the peripheral device interface 1303 may be used to connect at least one peripheral device related to input/output (I/O, Input/Output) to the processor 1301 and the memory 1302 .
  • processor 1301, memory 1302, and peripheral device interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of processor 1301, memory 1302, and peripheral device interface 1303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 1304 is used for receiving and transmitting radio frequency (RF, Radio Frequency) signals, also called electromagnetic signals.
  • the radio frequency circuit 1304 communicates with communication networks and other communication devices via electromagnetic signals.
  • the radio frequency circuit 1304 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • radio frequency circuitry 1304 includes an antenna system, an RF transceiver, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and the like.
  • the radio frequency circuit 1304 may communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocol includes but is not limited to at least one of the following: World Wide Web, Metropolitan Area Network, Intranet, various generations of mobile communication networks (2G, 3G, 4G and 5G), wireless local area networks and wireless fidelity (WiFi, Wireless Fidelity) networks.
  • the radio frequency circuit 1304 may further include a circuit related to Near Field Communication (NFC, Near Field Communication), which is not limited in this application.
  • the display screen 1305 is used to display a user interface (UI, User Interface).
  • the UI can include graphics, text, icons, video, and any combination thereof.
  • the display screen 1305 also has the ability to acquire touch signals on or above the surface of the display screen 1305 .
  • the touch signal may be input to the processor 1301 as a control signal for processing.
  • the display screen 1305 may also be used to provide at least one of virtual buttons and a virtual keyboard, also referred to as soft buttons and soft keyboards.
  • the display screen 1305 may be a flexible display screen disposed on a curved or folded surface of the electronic device 1300. The display screen 1305 may even be configured in a non-rectangular, irregular shape, that is, as a special-shaped screen.
  • the display screen 1305 can be made of materials such as a liquid crystal display (LCD, Liquid Crystal Display), an organic light-emitting diode (OLED, Organic Light-Emitting Diode).
  • the camera assembly 1306 is used to capture images or video.
  • camera assembly 1306 includes a front-facing camera and a rear-facing camera.
  • the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal.
  • there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blur function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, virtual reality (VR, Virtual Reality) shooting, or other fused shooting functions.
  • the camera assembly 1306 may also include a flash.
  • the flash can be a single color temperature flash or a dual color temperature flash. Dual color temperature flash refers to the combination of warm light flash and cold light flash, which can be used for light compensation under different color temperatures.
  • Audio circuitry 1307 may include a microphone and speakers.
  • the microphone is used to collect the sound waves of the user and the environment, convert the sound waves into electrical signals, and input them to the processor 1301 for processing, or to the radio frequency circuit 1304 to realize voice communication.
  • the microphone may also be an array microphone or an omnidirectional collection microphone.
  • the speaker is used to convert the electrical signal from the processor 1301 or the radio frequency circuit 1304 into sound waves.
  • the loudspeaker can be a traditional thin-film loudspeaker or a piezoelectric ceramic loudspeaker.
  • When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into sound waves audible to humans, but also into sound waves inaudible to humans for purposes such as distance measurement.
  • the audio circuit 1307 may also include a headphone jack.
  • the positioning component 1308 is used to locate the current geographic location of the electronic device 1300 to implement navigation or Location Based Service (LBS).
  • the positioning component 1308 may be a positioning component based on the Global Positioning System (GPS, Global Positioning System) of the United States, the Beidou system of China, or the Galileo system of the European Union.
  • Power supply 1309 is used to power various components in electronic device 1300 .
  • the power source 1309 may be alternating current, direct current, disposable batteries, or rechargeable batteries.
  • the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. Wired rechargeable batteries are batteries that are charged through wired lines, and wireless rechargeable batteries are batteries that are charged through wireless coils.
  • the rechargeable battery can also be used to support fast charging technology.
  • the electronic device 1300 also includes one or more sensors 1310 .
  • the one or more sensors 1310 include, but are not limited to, an acceleration sensor 1311, a gyro sensor 1312, a pressure sensor 1313, a fingerprint sensor 1314, an optical sensor 1315, and a proximity sensor 1316.
  • the acceleration sensor 1311 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established by the electronic device 1300 .
  • the acceleration sensor 1311 can be used to detect the components of the gravitational acceleration on the three coordinate axes.
  • the processor 1301 can control the display screen 1305 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1311.
  • the acceleration sensor 1311 can also be used to collect motion data of a game or of the user.
  • the gyroscope sensor 1312 can detect the body direction and rotation angle of the electronic device 1300 , and the gyroscope sensor 1312 can cooperate with the acceleration sensor 1311 to collect the 3D actions of the user on the electronic device 1300 .
  • the processor 1301 can implement the following functions according to the data collected by the gyro sensor 1312: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 1313 may be disposed on the side frame of the electronic device 1300 and/or on the lower layer of the display screen 1305.
  • the processor 1301 can perform left and right hand identification or quick operation according to the holding signal collected by the pressure sensor 1313.
  • the processor 1301 controls the operability controls on the UI interface according to the user's pressure operation on the display screen 1305.
  • the operability controls include at least one of button controls, scroll bar controls, icon controls, and menu controls.
  • the fingerprint sensor 1314 is used to collect the user's fingerprint, and the processor 1301 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 1314, or the fingerprint sensor 1314 identifies the user's identity according to the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 1301 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings.
  • the fingerprint sensor 1314 may be disposed on the front, back, or side of the electronic device 1300 . When the electronic device 1300 is provided with physical buttons or a manufacturer's logo, the fingerprint sensor 1314 may be integrated with the physical buttons or the manufacturer's logo.
  • Optical sensor 1315 is used to collect ambient light intensity.
  • the processor 1301 can control the display brightness of the display screen 1305 according to the ambient light intensity collected by the optical sensor 1315 .
  • the processor 1301 may also dynamically adjust the shooting parameters of the camera assembly 1306 according to the ambient light intensity collected by the optical sensor 1315 .
  • Proximity sensor 1316, also referred to as a distance sensor, is typically provided on the front panel of electronic device 1300.
  • Proximity sensor 1316 is used to collect the distance between the user and the front of electronic device 1300 .
  • When the proximity sensor 1316 detects that the distance between the user and the front of the electronic device 1300 gradually decreases, the processor 1301 controls the display screen 1305 to switch from the bright-screen state to the off-screen state; when the proximity sensor 1316 detects that the distance between the user and the front of the electronic device 1300 gradually increases, the processor 1301 controls the display screen 1305 to switch from the off-screen state to the bright-screen state.
  • Those skilled in the art can understand that the structure shown in FIG. 13 does not constitute a limitation on the electronic device 1300, which may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • Embodiments of the present application further provide a computer-readable storage medium, where at least one instruction, at least one piece of program, a code set, or an instruction set is stored in the storage medium, and the at least one instruction, the at least one piece of program, the code set, or the instruction set is loaded and executed by a processor to implement the detection and tracking method provided by the above method embodiments.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the detection and tracking method provided by the above method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses a detection and tracking method, apparatus, device, storage medium, and computer program product, belonging to the field of video processing. The method includes: performing feature point analysis on a video frame sequence to obtain the feature points on each video frame in the video frame sequence; performing, through a first thread, target detection on an extracted frame based on the feature points to obtain a target box in the extracted frame, the extracted frame being a video frame extracted from the video frame sequence using a target step size; performing, through a second thread, target box tracking in the current frame based on the feature points and the target box in the extracted frame to obtain a target box in the current frame; and outputting the target box in the current frame. The above method splits target detection and target tracking into two thread operations, where the detection algorithm does not affect the tracking frame rate; even if the detection thread takes a long time, the terminal can still output the target box of every video frame. The method can output the target box of each video frame in real time, improving the real-time performance and stability of target detection and tracking.

Description

检测跟踪方法、装置、设备、存储介质及计算机程序产品
相关申请的交叉引用
本申请基于申请号为202110287909.X、申请日为2021年03月17日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及视频处理领域,涉及但不限于一种检测跟踪方法、装置、设备、存储介质及计算机程序产品。
背景技术
为了实现对视频流的实时分析,需要在视频帧中检测和跟踪特定类别的物体(比如运动人体),并且实时输出物体的包围框和类别。
相关技术中采取对视频流的每一视频帧都进行检测的方法,即通过在每一个视频帧中检测出物体的包围框,将相邻视频帧的物体的包围框按照类别进行匹配关联。
但是对每一视频帧都进行检测往往耗时严重,难以保证实时输出物体的包围框和类别。
发明内容
本申请实施例提供了一种检测跟踪方法、装置、设备、存储介质及计算机程序产品,能够提高对目标检测跟踪的实时性和稳定性。所述技术方案如下:
本申请实施例提供一种检测跟踪方法,该方法由电子设备执行,方法包括:
对视频帧序列进行特征点分析,得到视频帧序列中每帧视频帧上的特征点;
通过第一线程基于特征点对抽取帧进行目标检测,得到抽取帧中的目标框,抽取帧是采用目标步长在视频帧序列中抽取的视频帧;
通过第二线程基于特征点和抽取帧中的目标框,在当前帧中进行目标框跟踪,得到当前帧中的目标框;
输出当前帧中的目标框。
本申请实施例提供一种检测跟踪装置,该装置包括:
分析模块,配置为对视频帧序列进行特征点分析,得到视频帧序列中 每帧视频帧上的特征点;
检测模块,配置为通过第一线程基于特征点对抽取帧进行目标检测,得到抽取帧中的目标框,抽取帧是采用目标步长在视频帧序列中抽取的视频帧;
跟踪模块,配置为通过第二线程基于特征点和抽取帧中的目标框,在当前帧中进行目标框跟踪,得到当前帧中的目标框;
输出模块,配置为输出当前帧中的目标框。
本申请实施例提供一种计算机设备,该所述计算机设备包括:处理器和存储器,所述存储器存储有计算机程序,所述计算机程序由所述处理器加载并执行以实现如上所述的检测跟踪方法。
本申请实施例提供一种计算机可读存储介质,该存储介质存储有计算机程序,该计算机程序由处理器加载并执行以实现如上的检测跟踪方法。
本申请实施例提供一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述检测跟踪方法。
本申请实施例提供的技术方案带来的有益效果至少包括:
通过对视频帧序列进行特征点分析得到每帧视频帧的特征点序列,然后分别通过第一线程和第二线程,对抽取帧进行目标检测和对当前帧进行目标跟踪,最终得到每帧的目标框。上述方法将目标检测和目标跟踪分为两个线程操作,其中,检测算法并不会影响跟踪帧率,即使检测线程耗费时间较长,终端也能输出每帧视频帧的目标框,且目标检测过程是针对抽取帧实现的,无需对每一视频帧都进行检测,从而能够降低检测过程的耗时,进而能够实时输出视频帧的目标框,提高目标检测跟踪的实时性和稳定性。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请一示例性实施例提供的多目标检测跟踪系统的示意图;
图2是本申请一个示例性实施例提供的检测跟踪方法的流程图;
图3是本申请一个示例性实施例提供的目标框的示意图;
图4是本申请一个示例性实施例提供的多目标实时检测系统的时序关系示意图;
图5是本申请另一个示例性实施例提供的检测跟踪方法的流程图;
图6是本申请另一个示例性实施例提供的检测跟踪方法的流程图;
图7是本申请一个示例性实施例提供的第三线程的流程图;
图8是本申请一个示例性实施例提供的第二线程的流程图;
图9是本申请一个示例性实施例提供的视频帧的示意图;
图10是本申请另一个示例性实施例提供的视频帧的示意图;
图11是本申请另一个示例性实施例提供的视频帧的示意图;
图12是本申请一个示例性实施例提供的检测跟踪装置的结构框图;
图13示出了本申请一个示例性实施例提供的电子设备的结构框图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
首先,对本申请实施例中涉及的名词进行简单介绍:
检测跟踪:目标检测指在图像和视频(一系列的图像)中扫描和搜寻目标,即在一个场景中对目标进行定位和识别。目标跟踪指在视频中对目标的运动特征进行跟踪,并不对跟踪目标进行识别,故针对图像的检测跟踪可以广泛的应用于计算机视觉中的目标识别与追踪,例如,可以应用在自动驾驶场景下的目标检测和追踪。
第一线程:指检测线程,通过对输入的视频帧进行检测,输出检测到的物体目标框和类别。在一些实施例中,响应于输入视频帧,通过目标检测算法对视频帧中的物体进行检测,并输出物体的目标框和类别。示例性的,可以采用One-Stage(一种目标检测方法)算法、Two-Stage(一种目标检测方法)算法或Anchor-free(一种目标检测方法)算法对视频帧进行检测。
第二线程:指跟踪线程,通过目标特征点的匹配对实现目标框的跟踪。在一些实施例中,上一帧的目标框包含特征点x1、x2、x3,其在上一帧的坐标分别为a、b、c,特征点x1、x2、x3在当前帧中的坐标分别为a’、b’、c’,通过计算a、b、c和a’、b’、c’的位移和尺度,计算当前帧目标框和上一帧目标框的位移和尺度,因此得到当前帧的目标框。
第三线程:指运动分析线程,通过对初始帧进行特征点提取,并通过跟踪输出每帧视频帧的特征点。在一些实施例中,可以采用角点检测算法(例如Harris算法)、加速分段特征点测试算法(FAST,Features From Accelerated Segment Test)或特征点跟踪算法(GFTT,Good Feature To Tracker)进行特征点提取。在一些实施例中,可以采用光流跟踪算法实现对当前帧的上一帧的特征点跟踪,示例性的,可以采用光流跟踪算法(例如Lucas-Kanade算法)实现对当前帧的上一帧的特征点跟踪。
图1示出了本申请一个示例性实施例的多目标检测跟踪系统的结构框 图。该多目标检测跟踪系统上设置有三个处理线程,第一线程121用于检测抽取帧的目标,得到抽取帧的检测目标框;第二线程122用于跟踪当前帧的上一帧中目标框的运动轨迹,并结合抽取帧的检测目标框,得到当前帧的目标框;第三线程123用于对初始帧进行特征点提取,得到初始帧上的特征点,并对当前帧的上一帧的特征点进行跟踪,得到当前帧(每一帧)的特征点。
响应于将每帧视频帧输入第三线程123,进行特征点提取和跟踪,得到包含特征点的每帧视频帧,将每帧视频帧输入第二线程122。
响应于将抽取帧输入第一线程121,对抽取帧进行方向调整,检测调整后的抽取帧,得到抽取帧的检测目标框,并将检测目标框输入第二线程122。
基于第二线程122输入包含特征点的每帧视频帧,且上一帧存在目标框,第二线程122得到基于上一帧的当前帧的跟踪目标框。
在第二线程122未收到第一线程121输入的最近一个抽取帧的检测目标框时,将上述第二线程122得到的当前帧的跟踪目标框作为当前帧的目标框,输出当前帧的目标框;
在第二线程122收到第一线程121输入的最近一个抽取帧的检测目标框时,得到检测目标框在当前帧的跟踪目标框,将上述检测目标框在当前帧的跟踪目标框和上一帧的跟踪目标框进行重复框合并,得到当前帧的目标框,输出当前帧的目标框。
在一些实施例中,上述多目标检测跟踪系统可以至少运行在电子设备上,电子设备可以是服务器或服务器群组,也可以是终端。也就是说,上述多目标检测跟踪系统可以至少运行在终端上,或运行在服务器上,或运行在终端和服务器上。本申请实施例的检测跟踪方法可以由终端来实现,也可以由服务器或服务器群组来实现,还可以由终端和服务器共同交互实现。
上述检测目标框和跟踪目标框,可以简称为目标框。
本领域技术人员可以知晓,上述终端和服务器的数量可以更多或更少。比如上述终端可以仅为一个,或者上述终端为几十个或几百个,或者更多数量。上述服务器可以仅为一个,或者上述服务器为几十个或几百个,或者更多数量。本申请实施例对终端的数量和设备类型、服务器的数量不加以限定。
下述实施例以多目标实时检测跟踪系统应用于终端为例,进行解释说明。
为实现对多目标的实时检测跟踪,采取如图2所示的方法。
图2示出了本申请一个示例性实施例的检测跟踪方法,以该方法应用于图1所示的多目标检测跟踪系统举例说明,该方法包括:
步骤220,对视频帧序列进行特征点分析,得到视频帧序列中每帧视频帧上的特征点。
本申请实施例中,响应于输入视频帧序列,终端对视频帧序列进行特征点分析,得到视频帧序列中每帧视频帧上的特征点。特征点指视频帧中具有鲜明特性、能够有效反映视频帧本质特征的像素点,并且,特征点能够标识视频帧中的目标物体。在一些实施例中,通过对不同的特征点进行匹配,能够完成对目标物体的匹配,即对目标物体进行识别和分类。
在一些实施例中,特征点是由算法分析得到的含有丰富局部信息的点,例如,特征点存在于图像的拐角、纹理剧烈变化的区域。值得注意的是,特征点具有尺度不变性,即在不同图片中能够被识别出来的统一性质。
特征点分析指通过对输入的视频帧进行特征点提取和特征点跟踪。在本申请实施例中,响应于输入视频帧序列,终端对初始帧进行特征点提取,并通过特征点跟踪得到下一帧的跟踪特征点,依次跟踪得到所有视频帧的特征点。
在一些实施例中,可以采用Harris进行特征点提取,即通过对初始视频帧中设置一个固定窗口,使用该窗口在图像上进行任意方向上的滑动,比较滑动前与滑动后两种情况下,窗口中的像素点的灰度变化程度。如果存在任意方向上的滑动,像素点的灰度变化程度大于灰度变化阈值,或者,在多个像素点中,任一像素点的灰度变化程度大于该多个像素点中的每一像素点的灰度变化程度,则确定该像素点为特征点。
在一些实施例中,可以采用特征点提取算法(例如FAST-9算法)进行特征点提取,即通过对初始视频帧上的每个像素点进行检测,当像素点满足特定条件时,即认定该像素点为特征点,这里的特定条件至少包括:确定与像素点之间像素差的绝对值超过像素差阈值的目标相邻像素点的数量,并判断该是否大于或等于数量阈值,当该数量大于或等于数量阈值时则符合特定条件。举例来说,在一个以像素点P为圆心,半径为3的圆上存在16个像素点,计算圆周上下左右四个像素点(即像素点P的目标相邻像素点)与像素点P的像素差,假设数量阈值为3,若四个像素差的绝对值中有至少三个超过像素差阈值,则进入下一步判断,否则认定像素点P不是特征点;基于对像素点P的下一步判断,计算上述圆周的16个像素点与P的像素差,若这16个像素差中至少存在9个像素差的绝对值超过像素差阈值,则认定像素点P是特征点
在一些实施例中,采用Lucas-Kanade光流算法实现对上一帧特征点的跟踪。
步骤240,通过第一线程基于特征点对抽取帧进行目标检测,得到抽取帧中的目标框。
抽取帧是采用目标步长在视频帧序列中抽取的视频帧;目标步长是对视频帧序列进行抽取的帧间隔,如目标步长为2,即每两个视频帧抽取一个视频帧。在一些实施例中,目标步长为固定值,如以目标步长为2对视频帧序列进行抽取;在一些实施例中,目标步长可以是一个变量,也就是说, 目标步长存在多种可能,如抽取第0帧、第3帧、第7帧、第12帧,上述第二次抽取与第一次抽取的目标步长为3,第三次抽取与第二次抽取的目标步长为4,第四次抽取与第三次抽取的目标步长为5。
在一些实施例中,目标步长可以依据检测算法的耗时进行设置。如,对每一个视频帧进行检测,需要三帧的时长,则终端将目标步长设置为3。
在一些实施例中,可以采用步长为3对视频帧序列进行抽取。第一线程用于检测抽取帧的目标,得到抽取帧的检测目标框。示意性的,可以采用One-Stage算法、Two-Stage算法或Anchor-free算法对视频帧进行检测。
例如,检测算法耗时往往大于1帧,即无法对每帧视频帧都进行检测,基于此本申请实施例提供的技术方案对视频帧序列进行多线程检测跟踪。
目标框用于标识物体。在一些实施例中,目标框表示为物体的包围框,并在包围框内显示物体的类别信息。示意性的,如图3所示,图3中示出了手机的目标框301、橙子的目标框302、鼠标的目标框303和水杯的目标框304,在这四个目标框中,不仅包括物体的包围框,在包围框内还显示物体的名称。在一些实施例中,目标框表现为物体的贴图,即在物体周围添加贴图,以增加视频帧的趣味性。在本申请实施例中,对目标框的种类不加以限定。
在本申请实施例中,目标框包括跟踪目标框和检测目标框。其中,跟踪目标框指基于对上一帧的目标框进行跟踪,得到的目标框;检测目标框指基于对视频帧进行检测,得到的目标框。
步骤260,通过第二线程基于特征点和抽取帧中的目标框,在当前帧中进行目标框跟踪,得到当前帧中的目标框。
为论述第二线程的作用,首先介绍该多目标实时检测系统的时序关系。示意性的,图4示出了本申请一个示例性实施例的多目标实时检测系统的时序关系示意图。图4中显示,视频帧跟踪的时长小于视频帧采集(即图4中所示的图像采集)的间隔,且对每帧视频帧均执行跟踪操作,而检测帧率(即图4中所示的视频帧检测)较低,无法对每帧视频帧均执行图像检测,进而采取对抽取帧进行图像检测,图4中抽取的步长为3。当跟踪线程处理完第2帧视频帧时,第0帧视频帧的检测刚刚完成,此时需要将第0帧检测得到的目标框“转移”到第2帧从而与第2帧的跟踪框进行融合,相当于再做一次第0帧到第2帧的跟踪。
在一些实施例中,通过第二线程基于特征点和抽取帧中的目标框,在当前帧中进行目标框跟踪,得到当前帧中的目标框,分为以下两种情况:
第一种、在第一线程未输出有第一目标框的情况下,通过第二线程基于特征点在当前帧中对第二目标框进行跟踪,得到当前帧中的目标框;其中,第一目标框是视频帧序列中位于当前帧的最近一个抽取帧中检测到的目标框,第二目标框是当前帧的上一帧中跟踪到的目标框。例如,当前帧的上一帧中不存在目标框时,则当前帧也不存在基于上一帧目标框得到的 跟踪目标框。
结合参考图4,在当前输入的视频帧为第1帧时,第一线程未输出第0帧的检测框,此时第二线程基于第0帧的特征点和第1帧的特征点,对第0帧中的目标框进行跟踪,得到第1帧的跟踪目标框,此时,该跟踪目标框即为第1帧的目标框。
值得注意的是,当第0帧为初始帧时,第0帧上不存在目标框,因此第1帧也不存在基于第0帧得到的跟踪目标框。当第0帧不为初始帧时,第0帧上的目标框是基于第0帧的上一帧目标框跟踪得到的。
在一些实施例中,上述基于第0帧的特征点和第1帧的特征点,对第0帧中的目标框进行跟踪,得到第1帧的跟踪目标框,可由下述方法得到:首先,获取当前帧的跟踪特征点和当前帧的上一帧的目标特征点;然后,通过第二线程将当前帧的跟踪特征点和上一帧的目标特征点组成多组特征点匹配对,目标特征点是位于第二目标框中的特征点;确定多组特征点匹配对的多组特征点偏移向量;这里,可以通过计算得到多组特征点匹配对的多组特征点偏移向量;再然后,基于多组特征点偏移向量,计算得到第二目标框的目标框偏移向量;最后,根据目标框偏移向量对第二目标框进行偏移,得到当前帧中的目标框。
示意性的,第0帧的目标特征点为x1、x2、x3,其在第0帧的坐标分别为a、b、c,特征点x1、x2、x3在第1帧中对应的跟踪特征点为x1’、x2’、x3’,其在第1帧的坐标分别为a’、b’、c’,上述特征点x1、x1’组成特征点匹配对,x2、x2’组成特征点匹配对,x3、x3’组成特征点匹配对,得到多组特征点偏移向量为(a,a’)、(b,b’)、(c,c’)。假设第0帧的目标框的坐标表示为m。
在一些实施例中,目标框偏移向量为多组特征点偏移向量的平均向量,则第1帧的目标框坐标为m+((a,a’)+(b,b’)+(c,c’))/3。
在一些实施例中,目标框偏移向量为多组特征点偏移向量的加权向量,示意性的,偏移向量(a,a’)的权重为0.2,偏移向量(b,b’)的权重为0.4,偏移向量(c,c’)的权重为0.4,则第1帧的目标框坐标为m+(0.2(a,a’)+0.4(b,b’)+0.4(c,c’))。
第二种、在第一线程输出有第一目标框的情况下,通过第二线程基于特征点在当前帧中对第一目标框和第二目标框进行跟踪,得到当前帧中的目标框;其中,第一目标框是视频帧序列中位于当前帧的最近一个抽取帧中检测到的目标框,第二目标框是当前帧的上一帧中跟踪到的目标框。
在一些实施例中,上述方法包括以下步骤:通过第二线程基于特征点在当前帧中对第一目标框进行跟踪,得到第一跟踪框;通过第二线程基于特征点在当前帧中对第二目标框进行跟踪,得到第二跟踪框;将第一跟踪框和第二跟踪框中的重复框进行合并,得到当前帧中的目标框。
结合参考图4,当前帧为第2帧时,第一线程输出有第0帧的检测目标 框,通过第二线程对第0帧的检测目标框进行跟踪,得到第一跟踪框,通过第二线程基于特征点在第2帧中对第1帧的目标框进行跟踪,得到第二跟踪框,将第一跟踪框和第二跟踪框中的重复框进行合并,得到第2帧中的目标框。
上述基于特征点实现目标框的跟踪在上文已进行说明,在此不再赘述。
步骤280,输出当前帧中的目标框。
通过上述步骤,终端得到当前帧的目标框并完成当前帧目标框的输出。
综上所述,上述方法将检测和跟踪分为两个线程操作,其中,检测算法并不会影响跟踪帧率,即使检测线程耗费时间较长,终端也能输出每帧视频帧的目标框,该方法不仅能实时输出视频帧的目标框,且实时输出的延时并不会随目标框个数增加而显著增加。并且,目标检测过程是针对抽取帧实现的,无需对每一视频帧都进行检测,从而能够降低检测过程的耗时,进而能够实时输出视频帧的目标框,提高目标检测跟踪的实时性和稳定性。
为实现对重复框的判断,图5示出了本申请一个示例性实施例的检测跟踪方法,其中步骤220,步骤240,步骤260,步骤280在上述已有说明,在此不再赘述。其中,步骤260中将第一跟踪框和第二跟踪框中的重复框进行合并,得到当前帧中的目标框之前,还包括以下步骤:
步骤250-1,基于第一跟踪框和第二跟踪框的并交比(IoU,Intersection over Union)大于IoU阈值,确定第一跟踪框和第二跟踪框存在重复框。
本申请实施例中,通过第二线程基于特征点在当前帧中对第一目标框进行跟踪得到第一跟踪框,通过第二线程基于特征点在当前帧中对第二目标框进行跟踪,得到第二跟踪框。
IoU是在特定数据集中检测相应物体准确度的一个标准,在本申请实施例中,这个标准用于测量跟踪目标框和检测目标框之间的相关度,相关度越高,该值越高。示意性的,跟踪目标框所在区域为S1,检测目标框所在区域为S2,S1与S2的交集为S3,S1与S2组成区域S4,则IoU为S3/S4。
在一些实施例中,计算第一跟踪框和第二跟踪框在当前帧的IoU,终端预先存储有并交比阈值,示意性的,该IoU阈值为0.5,当第一跟踪框和第二跟踪框在当前帧的IoU大于0.5时,即确定第一跟踪框和第二跟踪框存在重复框;若第一跟踪框和第二跟踪框在当前帧的IoU不大于0.5时,即确定第一跟踪框和第二跟踪框不存在重复框。
本申请实施例中,无论第一跟踪框和第二跟踪框的类别是否相同,都可以认为第一跟踪框和第二跟踪框存在重复框。
步骤250-2,基于第一跟踪框和第二跟踪框的IoU大于IoU阈值,且,第一跟踪框和第二跟踪框的类别相同,确定第一跟踪框和第二跟踪框存在重复框。
在一些实施例中,当第一跟踪框和第二跟踪框在当前帧的IoU大于IoU 阈值0.5时,且,第一跟踪框和第二跟踪框中物体为同一类别时,即确定第一跟踪框和第二跟踪框存在重复框。
上述步骤250-1和步骤250-2为并列步骤,即,仅执行步骤250-1或仅执行步骤250-2,即可完成对重复框的判断。
基于图2的可选实施例中,执行步骤260中重复框合并存在以下至少一种方法:
方法一:响应于第一跟踪框和第二跟踪框存在重复框,将第一跟踪框确定为当前帧的目标框;
基于上述步骤250-1和步骤250-2完成对第一跟踪框和第二跟踪框存在重复框的判断,将第一跟踪框确定为当前帧的目标框。
方法二:响应于第一跟踪框和第二跟踪框存在重复框,将第一跟踪框和第二跟踪框中置信度最高的跟踪框确定为当前帧的目标框;
基于上述步骤250-1和步骤250-2完成对第一跟踪框和第二跟踪框存在重复框的判断,将第一跟踪框和第二跟踪框中置信度最高的一个跟踪框确定为当前帧的目标框。
在一些实施例中,采用目标检测算法输出目标框的置信度评分,终端删除评分低于置信度阈值的目标框,并将置信度大于或等于置信度阈值的跟踪框作为当前帧的目标框。
方法三:响应于第一跟踪框和第二跟踪框存在重复框,且第一跟踪框处于当前帧的边界,将第二跟踪框确定为当前帧的目标框。
基于上述步骤250-1和步骤250-2完成对第一跟踪框和第二跟踪框存在重复框的判断,当第一跟踪框处于当前帧的边界时,确定第二跟踪框为当前帧的目标框。
在一些实施例中,当目标框表现为物体的包围框时,当检测相邻抽取帧得到的检测目标框无法完全包围整个物体时,即,在相邻抽取帧中物体无法完全显示时,确定第二跟踪框为当前帧的目标框。
上述方法一、二和三为并列方法,即,仅执行方法一、仅执行方法二或仅执行方法三,都可完成对重复框的合并。
综上所述,上述方法实现了对当前帧中是否存在重复框的判断和进行了重复框的合并,保证了当前帧的目标框彼此清晰有序,避免当前帧中重复出现作用相同的目标框。
为实现对特征点的提取和跟踪,图6示出了本申请一个示例性实施例的检测跟踪方法,其中步骤240、步骤260、步骤280在上述已有说明,不再赘述。
步骤221,通过第三线程对视频帧序列中的初始帧进行特征点提取,得到初始帧的特征点;
在一些实施例中,结合参考图1,响应于终端输入视频帧序列,首先通过第三线程123对初始帧进行特征点提取。
步骤222,通过第三线程基于初始帧的特征点,对视频帧序列中的第i帧进行特征点跟踪,得到视频帧序列中的第i帧的特征点;第i帧为位于初始帧之后的视频帧,i的起始编号为初始帧的帧号加一,i为正整数。
在一些实施例中,结合参考图1,响应于终端通过第三线程123对初始帧的特征点进行特征点跟踪,可得到第i帧的特征点,其中第i帧为位于初始帧之后的视频帧,i的起始编号为初始帧的帧号加一。值得注意的是,第三线程123只对初始帧进行特征点提取,并不对第i帧视频帧进行特征点提取。
步骤223,通过第三线程基于第i帧的特征点,对视频帧序列中的第i+1帧进行特征点跟踪,得到视频帧序列中的第i+1帧的特征点。
在一些实施例中,结合参考图1,响应于终端通过第三线程123对第i帧的特征点进行特征点跟踪,得到视频帧序列中的第i+1帧的特征点。
示意性的,通过第三线程对所述第i帧的特征点进行光流跟踪,得到所述视频帧序列中的第i+1帧的特征点,例如,可以采用Lucas-Kanade光流算法实现对上一帧特征点的跟踪。
通过上述步骤221至步骤223,即可实现对视频帧序列特征点的提取和跟踪。在一些实施例中,通过第三线程基于第i帧的特征点进行特征点跟踪,得到视频帧序列中的第i+1帧的特征点,还包括对第i+1帧特征点的删除和补充。
对第i+1帧特征点的删除:响应于第i+1帧中的第一特征点满足删除条件,删除第i+1帧中的第一特征点;其中,删除条件包括如下至少之一:
(1)第一特征点是跟踪失败的特征点。
在一些实施例中,通过第三线程基于第i帧的特征点进行特征点跟踪,得到视频帧序列中的第i+1帧的第一特征点,第一特征点是在第i帧中无法找到能与之构成特征点匹配对的特征点,即为跟踪失败的特征点。
(2)第一特征点与相邻特征点的距离小于距离阈值。
在一些实施例中,响应于第i+1帧的第一特征点与相邻特征点的距离小于距离阈值D,终端删除第i+1帧中的第一特征点。示意性的,距离阈值D视计算量和图像大小选取,如距离阈值D的取值范围为5至20。
对第i+1帧特征点的补充:响应于第i+1帧中的目标区域满足补点条件,从目标区域中提取新增特征点;其中,补点条件包括:目标区域是特征点跟踪结果为空的区域。
在一些实施例中,第i帧的目标区域内存在50个特征点,通过特征点跟踪,第i+1帧的目标区域内存在20个特征点,此时判断第i+1帧的特征点跟踪结果为空,此时进行从目标区域中提取新增特征点的操作,提取方法可以参考步骤220。
示意性的,第i帧的目标区域为“手机”区域,即通过50个特征点可对“手机”添加目标框,当第i+1帧的“手机”区域中仅存在20个特征点 时,此时终端无法对手机添加目标框,此时需从“手机”区域中提取新增特征点,终端才可对手机添加目标框。值得注意的是,上述第三线程并不对“手机”区域添加目标框,仅表示终端存在对手机添加目标框的可能性,对“手机”区域添加目标框的操作在第二线程实现。
综上所述,上述方法实现了对初始帧的提取和对视频帧的特征点跟踪,并通过删除特征点和增加特征点的方式,提高了相邻帧特征点的稳定性,且保证了第二线程能通过相邻帧特征点得到目标框。
基于图2的可选实施例中,对视频帧序列进行特征点分析,得到视频帧序列中每帧视频帧上的特征点,可由图7所示的方法实现,图7示出了第三线程的流程图,其方法包括:
步骤701,输入视频帧序列。
响应于开始执行多目标实时检测的操作,终端输入视频帧序列。
步骤702,判断当前帧是否为初始帧。
基于终端输入的视频帧序列,终端对当前帧是否为初始帧进行判断;若当前帧为初始帧,则执行步骤706;若当前帧不是初始帧,则执行步骤703。
步骤703,对当前帧的上一帧中的特征点进行特征点跟踪,得到跟踪结果。
响应于当前帧不是初始帧,通过光流跟踪算法跟踪上一帧特征点得到特征点在当前帧的图像坐标,光流跟踪算法包括但不限于:Lucas-Kanade光流。
步骤704,基于跟踪结果,对特征点进行非极大值抑制。
这里,对特征点进行非极大值抑制是指终端删除跟踪失败的特征点,并当两个特征点之间距离小于距离阈值时,删除掉两个特征点中一个特征点。删除策略包括但不限于:随机删除一个;基于特征点梯度给特征点评分,删除评分较低的一个。距离阈值参考步骤506。
步骤705,特征点补点。
响应于在当前帧上没有跟踪特征点的区域提取新的特征点,新的特征点提取方法参考步骤706。
步骤706,初始帧的特征点提取,得到初始帧的特征点。
响应于当前帧是初始帧,终端进行对初始帧的特征点提取操作。终端在初始帧中提取特征点,确保特征点之间最低间隔不小于间隔阈值(间隔阈值视计算量和图像大小选取,如可以取值5至20),特征提取方法包括但不限于:Harris、FAST、Good Feature To Tracker等。终端给每个新特征点分配一个特征点标号,其中标号从0开始递增。
步骤707,输出当前帧的特征点列表。
基于上述步骤701至步骤706,输出视频帧序列中每个视频帧的特征点列表。
基于图2的可选实施例中,通过第一线程基于特征点对抽取帧进行目标检测,得到抽取帧中的目标框,可由下述方法实现:通过第一线程,终端输入视频帧序列的抽取帧,输出检测到的物体包围框和类别。目标检测算法包括但不限于:One-Stage算法、Two-Stage算法和Anchor-free算法等。在一些实施例中,在检测之前终端先将抽取帧调整成重力方向来提升检测效果。
基于图2的可选实施例中,通过第二线程基于特征点在当前帧中对目标框进行跟踪,得到当前帧中的目标框,可由图8所示的方法来实现,图8示出了本申请一个示例性实施例的第二线程的流程图,该方法包括:
步骤801,输入相邻视频帧和对应的特征点列表。
响应于第三线程输出视频帧序列的特征点,终端将相邻视频帧和对应的特征点列表输入第二线程。
步骤802,将当前帧与上一帧特征点进行匹配。
通过特征点标号将当前帧的特征点和上一帧的特征点进行匹配,得到特征点匹配对。
步骤803,跟踪上一帧目标框。
基于上一帧的每个目标框,终端确定上一帧目标框内的特征点,根据特征点匹配对计算上一帧目标框在当前帧的位移和尺度。计算方式包括但不限于:中值流法、单应性矩阵法等。
步骤804,判断是否有新增目标框。
终端判断第一线程是否输出检测目标框,如果是,则执行步骤805;如果否,则执行步骤808。
步骤805,将当前帧与检测帧进行特征点匹配。
响应于第一线程输出检测目标框,终端通过特征点标号进行当前帧与检测帧特征点匹配,得到特征点匹配对。
步骤806,跟踪检测帧目标框。
基于检测帧的每个目标框,终端确定目标框内的特征点,根据特征点匹配对计算检测目标框在当前帧的位移和尺度。计算方式包括但不限于:中值流法、单应性矩阵法等。
步骤807,在当前帧中,新增目标框与跟踪目标框的融合框。
基于重复检测,跟踪目标框和检测目标框可能会重叠,重叠判断标准为:
(1)跟踪目标框和检测目标框的IOU大于IOU阈值,例如,该IOU阈值可以取值为0.5。
(2)跟踪目标框和检测目标框的物体类别相同。
基于终端确定跟踪目标框和检测目标框重叠,终端执行重叠框融合操作。
在一些实施例中,当跟踪目标框和检测目标框重叠时,需要通过策略 将这两个目标框融合成一个目标框,得到融合框,融合策略至少包括以下方法:当前帧目标框始终选取检测目标框;依据目标检测算法,终端得到跟踪目标框和检测目标框的置信度评分,终端在当前帧中删除置信度评分较小的目标框;当检测目标框靠近当前帧边界时,终端确定物体检测不全,此时终端确定跟踪目标框为当前帧的目标框,否则终端确定检测目标框为当前帧的目标框。
步骤808,输出当前帧的所有目标框。
基于上述步骤801至步骤807,终端输出当前帧所有目标框。
下面对本申请实施例的应用场景进行说明:
在一些实施例中,当用户使用终端扫描真实环境中特定类别的物体时,终端的显示屏上弹出3D的增强现实(AR,Augmented Reality)特效,示意性的,图9示出了本申请一个示例性实施例提供的视频帧的示意图,图10示出了本申请另一个示例性实施例提供的视频帧的示意图。其中,当用户使用终端扫描图9中的饮料901时,饮料901周围出现带有颜色的立体文字902;当用户使用终端扫描图10中的植物1001时,植物周围弹出卡通挂件1002。
在一些实施例中,图11示出了本申请再一个示例性实施例的视频帧的示意图,响应于输入一段足球比赛视频,终端检测运动员1101、球门1102、足球1103等目标框,并在连续帧中跟踪这些目标,基于跟踪的结果可以进行后续的足球比赛分析。
在一些实施例中,终端对足球视频的视频帧序列进行特征点分析,得到视频帧序列中每帧视频帧上的特征点;通过第一线程基于特征点对抽取帧进行目标检测,终端得到抽取帧中的目标框,抽取帧是采用目标步长在视频帧序列中抽取的视频帧;通过第二线程基于特征点在当前帧中对目标框进行跟踪,终端得到当前帧中的目标框;终端输出当前帧中的目标框。
图12是本申请一个示例性实施例提供的检测跟踪装置的结构框图,如图12所示,该装置包括:
分析模块1010,配置为对视频帧序列进行特征点分析,得到视频帧序列中每帧视频帧上的特征点;检测模块1020,配置为通过第一线程基于特征点对抽取帧进行目标检测,得到抽取帧中的目标框,抽取帧是采用目标步长在视频帧序列中抽取的视频帧;跟踪模块1030,配置为通过第二线程基于特征点和所述抽取帧中的目标框,在当前帧中进行目标框跟踪,得到当前帧中的目标框;输出模块1050,配置为输出当前帧中的目标框。
在一个可选的实施例中,跟踪模块1030还配置为在第一线程未输出有第一目标框的情况下,通过第二线程基于特征点在当前帧中对第二目标框进行跟踪,得到当前帧中的目标框。
在一个可选的实施例中,跟踪模块1030还配置为在第一线程输出有第一目标框的情况下,通过第二线程基于特征点在当前帧中对第一目标框和 第二目标框进行跟踪,得到当前帧中的目标框。其中,第一目标框是视频帧序列中位于当前帧之前的最近一个抽取帧中检测到的目标框,第二目标框是当前帧的上一帧中跟踪到的目标框。
在一个可选的实施例中,跟踪模块1030包括跟踪子模块1031和合并模块1032;其中,踪子模块1031配置为通过第二线程基于特征点在当前帧中对第一目标框进行跟踪,得到第一跟踪框。
在一个可选的实施例中,跟踪子模块1031还配置为通过第二线程基于特征点在当前帧中对第二目标框进行跟踪,得到第二跟踪框。
在一个可选的实施例中,合并模块1032配置为将第一跟踪框和第二跟踪框中的重复框进行合并,得到当前帧中的目标框。
在一个可选的实施例中,装置还包括确定模块1040;其中,确定模块1040配置为基于第一跟踪框和第二跟踪框的并交比IoU大于IoU阈值,确定第一跟踪框和第二跟踪框存在重复框。
在一个可选的实施例中,确定模块1040还配置为基于第一跟踪框和第二跟踪框的并交比IoU大于IoU阈值,且,第一跟踪框和第二跟踪框的类别相同,确定第一跟踪框和第二跟踪框存在重复框。
在一个可选的实施例中,确定模块1040还配置为响应于第一跟踪框和第二跟踪框存在重复框,确定第一跟踪框为当前帧的目标框。
在一个可选的实施例中,确定模块1040还配置为响应于第一跟踪框和第二跟踪框存在重复框,确定第一跟踪框和第二跟踪框中置信度高的跟踪框为当前帧的目标框。
在一个可选的实施例中,确定模块1040还配置为响应于第一跟踪框和第二跟踪框存在重复框,且第一跟踪框处于当前帧的边界,确定第二跟踪框为当前帧的目标框。
在一个可选的实施例中,跟踪模块1030还配置为获取当前帧的跟踪特征点和当前帧的上一帧的目标特征点,并通过第二线程将当前帧的跟踪特征点和上一帧的目标特征点组成多组特征点匹配对,目标特征点是位于第二目标框中的特征点。
在一个可选的实施例中,跟踪模块1030还配置为确定多组特征点匹配对的多组特征点偏移向量。
在一个可选的实施例中,跟踪模块1030还配置为基于多组特征点偏移向量,计算得到第二目标框的目标框偏移向量。
在一个可选的实施例中,跟踪模块1030还配置为根据目标框偏移向量对第二目标框进行偏移,得到当前帧中的目标框。
在一个可选的实施例中,分析模块1010还配置为通过第三线程对视频帧序列中的初始帧进行特征点提取,得到初始帧的特征点。
在一个可选的实施例中,分析模块1010还配置为通过第三线程基于初始帧的特征点,对视频帧序列中的第i帧进行特征点跟踪,得到视频帧序列 中的第i帧的特征点,第i帧为位于初始帧之后的视频帧,i的起始编号为初始帧的帧号加一。
在一个可选的实施例中,分析模块1010还配置为通过第三线程基于第i帧的特征点,对视频帧序列中的第i+1帧进行特征点跟踪,得到视频帧序列中的第i+1帧的特征点。
在一个可选的实施例中,分析模块1010还配置为通过第三线程对第i帧的特征点进行光流跟踪,得到视频帧序列中的第i+1帧的特征点。
在一个可选的实施例中,分析模块1010还配置为响应于第i+1帧中的第一特征点满足删除条件,删除第i+1帧中的第一特征点;其中,删除条件包括如下至少之一:第一特征点是跟踪失败的特征点;第一特征点与相邻特征点的距离小于距离阈值。
在一个可选的实施例中,分析模块1010还配置为响应于第i+1帧中的目标区域满足补点条件,从目标区域中提取新增特征点;其中,补点条件包括:目标区域是特征点跟踪结果为空的区域。
综上所述,上述装置将检测和跟踪分为两个线程操作,其中,检测算法并不会影响跟踪帧率,即使检测线程耗费时间较长,终端也能输出每帧视频帧的目标框,该方法不仅能实时输出视频帧的目标框,且实时输出的延时并不会随目标框个数增加而显著增加。
上述装置还实现了对当前帧中是否存在重复框的判断和进行了重复框的合并,保证了当前帧的目标框彼此清晰有序,避免当前帧中重复出现作用相同的目标框。
上述装置还实现了对初始帧的提取和对其他帧的特征点跟踪,并通过删除特征点和增加特征点的方式,提高了相邻帧特征点的稳定性,且保证了第二线程能通过相邻帧特征点得到目标框。
图13示出了本申请一个示例性实施例提供的电子设备1300的结构框图。该电子设备1300可以是便携式移动终端,比如:智能手机、平板电脑、动态影像专家压缩标准音频层面3(MP3,Moving Picture Experts Group Audio Layer III)、动态影像专家压缩标准音频层面4(MP4,Moving Picture Experts Group Audio Layer IV)播放器、笔记本电脑或台式电脑。电子设备1300还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,电子设备1300包括有:处理器1301和存储器1302。
处理器1301可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器1301可以采用数字信号处理(DSP,Digital Signal Processing)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)、可编程逻辑阵列(PLA,Programmable Logic Array)中的至少一种硬件形式来实现。处理器1301也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称中央处理器(CPU,Central  Processing Unit);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器1301可以集成有图像处理器(GPU,Graphics Processing Unit),GPU用于负责显示屏所需要显示的内容的渲染和绘制。在一些实施例中,处理器1301还可以包括人工智能(AI,Artificial Intelligence)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器1302可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1302还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1302中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器1301所执行以实现本申请中方法实施例提供的图像修复方法。
在一些实施例中,电子设备1300还可选包括有:外围设备接口1303和至少一个外围设备。处理器1301、存储器1302和外围设备接口1303之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口1303相连。外围设备包括:射频电路1304、显示屏1305、摄像头组件1306、音频电路1307、定位组件1308和电源1309中的至少一种。
外围设备接口1303可被用于将输入/输出(I/O,Input/Output)相关的至少一个外围设备连接到处理器1301和存储器1302。在一些实施例中,处理器1301、存储器1302和外围设备接口1303被集成在同一芯片或电路板上;在一些其他实施例中,处理器1301、存储器1302和外围设备接口1303中的任意一个或两个可以在单独的芯片或电路板上实现,本实施例对此不加以限定。
射频电路1304用于接收和发射射频(RF,Radio Frequency)信号,也称电磁信号。射频电路1304通过电磁信号与通信网络以及其他通信设备进行通信。射频电路1304将电信号转换为电磁信号进行发送,或者,将接收到的电磁信号转换为电信号。例如,射频电路1304包括:天线系统、RF收发器、一个或多个放大器、调谐器、振荡器、数字信号处理器、编解码芯片组、用户身份模块卡等等。射频电路1304可以通过至少一种无线通信协议来与其它终端进行通信。该无线通信协议包括但不限于以下至少之一:万维网、城域网、内联网、各代移动通信网络(2G、3G、4G及5G)、无线局域网和无线保真(WiFi,Wireless Fidelity)网络。在一些实施例中,射频电路1304还可以包括近距离无线通信(NFC,Near Field Communication)有关的电路,本申请对此不加以限定。
显示屏1305用于显示用户界面(UI,User Interface)。该UI可以包括图形、文本、图标、视频及其它们的任意组合。当显示屏1305是触摸显示屏时,显示屏1305还具有采集在显示屏1305的表面或表面上方的触摸信号的能力。该触摸信号可以作为控制信号输入至处理器1301进行处理。此 时,显示屏1305还可以用于提供虚拟按钮和虚拟键盘中的至少之一,也称软按钮和软键盘。在一些实施例中,显示屏1305可以为一个,设置在电子设备1300的前面板;在另一些实施例中,显示屏1305可以为至少两个,分别设置在电子设备1300的不同表面或呈折叠设计;在另一些实施例中,显示屏1305可以是柔性显示屏,设置在电子设备1300的弯曲表面上或折叠面上。甚至,显示屏1305还可以设置成非矩形的不规则图形,也即异形屏。显示屏1305可以采用液晶显示屏(LCD,Liquid Crystal Display)、有机发光二极管(OLED,Organic Light-Emitting Diode)等材质制备。
摄像头组件1306用于采集图像或视频。例如,摄像头组件1306包括前置摄像头和后置摄像头。通常,前置摄像头设置在终端的前面板,后置摄像头设置在终端的背面。在一些实施例中,后置摄像头为至少两个,分别为主摄像头、景深摄像头、广角摄像头、长焦摄像头中的任意一种,以实现主摄像头和景深摄像头融合实现背景虚化功能、主摄像头和广角摄像头融合实现全景拍摄以及虚拟现实(VR,Virtual Reality)拍摄功能或者其它融合拍摄功能。在一些实施例中,摄像头组件1306还可以包括闪光灯。闪光灯可以是单色温闪光灯,也可以是双色温闪光灯。双色温闪光灯是指暖光闪光灯和冷光闪光灯的组合,可以用于不同色温下的光线补偿。
音频电路1307可以包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器1301进行处理,或者输入至射频电路1304以实现语音通信。出于立体声采集或降噪的目的,麦克风可以为多个,分别设置在电子设备1300的不同部位。麦克风还可以是阵列麦克风或全向采集型麦克风。扬声器则用于将来自处理器1301或射频电路1304的电信号转换为声波。扬声器可以是传统的薄膜扬声器,也可以是压电陶瓷扬声器。当扬声器是压电陶瓷扬声器时,不仅可以将电信号转换为人类可听见的声波,也可以将电信号转换为人类听不见的声波以进行测距等用途。在一些实施例中,音频电路1307还可以包括耳机插孔。
定位组件1308用于定位电子设备1300的当前地理位置,以实现导航或基于位置的服务(LBS,Location Based Service)。定位组件1308可以是基于美国的全球定位系统(GPS,Global Positioning System)、中国的北斗系统或俄罗斯的伽利略系统的定位组件。
电源1309用于为电子设备1300中的各个组件进行供电。电源1309可以是交流电、直流电、一次性电池或可充电电池。当电源1309包括可充电电池时,该可充电电池可以是有线充电电池或无线充电电池。有线充电电池是通过有线线路充电的电池,无线充电电池是通过无线线圈充电的电池。该可充电电池还可以用于支持快充技术。
在一些实施例中,电子设备1300还包括有一个或多个传感器1310。该一个或多个传感器1310包括但不限于:加速度传感器1311、陀螺仪传感器1312、压力传感器1313、指纹传感器1314、光学传感器1315以及接近传 感器1316。
加速度传感器1311可以检测以电子设备1300建立的坐标系的三个坐标轴上的加速度大小。比如,加速度传感器1311可以用于检测重力加速度在三个坐标轴上的分量。处理器1301可以根据加速度传感器1313采集的重力加速度信号,控制显示屏1305以横向视图或纵向视图进行用户界面的显示。加速度传感器1313还可以用于游戏或者用户的运动数据的采集。
陀螺仪传感器1312可以检测电子设备1300的机体方向及转动角度,陀螺仪传感器1312可以与加速度传感器1311协同采集用户对电子设备1300的3D动作。处理器1301根据陀螺仪传感器1312采集的数据,可以实现如下功能:动作感应(比如根据用户的倾斜操作来改变UI)、拍摄时的图像稳定、游戏控制以及惯性导航。
压力传感器1313设置在电子设备1300的以下至少之一的下层:侧边框和显示屏1305。当压力传感器1313设置在电子设备1300的侧边框时,可以检测用户对电子设备1300的握持信号,由处理器1301根据压力传感器1313采集的握持信号进行左右手识别或快捷操作。当压力传感器1313设置在显示屏1305的下层时,由处理器1301根据用户对显示屏1305的压力操作,实现对UI界面上的可操作性控件进行控制。可操作性控件包括按钮控件、滚动条控件、图标控件、菜单控件中的至少一种。
指纹传感器1314用于采集用户的指纹,由处理器1301根据指纹传感器1314采集到的指纹识别用户的身份,或者,由指纹传感器1314根据采集到的指纹识别用户的身份。在识别出用户的身份为可信身份时,由处理器1301授权该用户执行相关的敏感操作,该敏感操作包括解锁屏幕、查看加密信息、下载软件、支付及更改设置等。指纹传感器1314可以被设置在电子设备1300的正面、背面或侧面。当电子设备1300上设置有物理按键或厂商Logo时,指纹传感器1314可以与物理按键或厂商Logo集成在一起。
光学传感器1315用于采集环境光强度。在一些实施例中,处理器1301可以根据光学传感器1315采集的环境光强度,控制显示屏1305的显示亮度。当环境光强度较高时,调高显示屏1305的显示亮度;当环境光强度较低时,调低显示屏1305的显示亮度。在另一个实施例中,处理器1301还可以根据光学传感器1315采集的环境光强度,动态调整摄像头组件1306的拍摄参数。
接近传感器1316,也称距离传感器,通常设置在电子设备1300的前面板。接近传感器1316用于采集用户与电子设备1300的正面之间的距离。在一些实施例中,当接近传感器1316检测到用户与电子设备1300的正面之间的距离逐渐变小时,由处理器1301控制显示屏1305从亮屏状态切换为息屏状态;当接近传感器1316检测到用户与电子设备1300的正面之间的距离逐渐变大时,由处理器1301控制显示屏1305从息屏状态切换为亮屏状态。
本领域技术人员可以理解,图13中示出的结构并不构成对电子设备1300的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
本申请实施例还提供一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现上述方法实施例提供的检测跟踪方法。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述方法实施例提供的检测跟踪方法。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (15)

  1. 一种检测跟踪方法,所述方法由电子设备执行,所述方法包括:
    对视频帧序列进行特征点分析,得到所述视频帧序列中每帧视频帧上的特征点;
    通过第一线程基于所述特征点对抽取帧进行目标检测,得到所述抽取帧中的目标框,所述抽取帧是采用目标步长在所述视频帧序列中抽取的视频帧;
    通过第二线程基于所述特征点和所述抽取帧中的目标框,在当前帧中进行目标框跟踪,得到所述当前帧中的目标框;
    输出所述当前帧中的目标框。
  2. 根据权利要求1所述的方法,其中,所述通过第二线程基于所述特征点和所述抽取帧中的目标框,在当前帧中进行目标框跟踪,得到所述当前帧中的目标框,包括:
    在所述第一线程未输出有第一目标框的情况下,通过所述第二线程基于所述特征点在所述当前帧中对第二目标框进行跟踪,得到所述当前帧中的目标框;
    在所述第一线程输出有所述第一目标框的情况下,通过所述第二线程基于所述特征点在所述当前帧中对所述第一目标框和所述第二目标框进行跟踪,得到所述当前帧中的目标框;
    其中,所述第一目标框是所述视频帧序列中位于所述当前帧之前的最近一个抽取帧中检测到的目标框,所述第二目标框是所述当前帧的上一帧中跟踪到的目标框。
  3. 根据权利要求2所述的方法,其中,所述通过所述第二线程基于所述特征点在所述当前帧中对所述第一目标框和所述第二目标框进行跟踪,得到所述当前帧中的目标框,包括:
    通过所述第二线程基于所述特征点在所述当前帧中对所述第一目标框进行跟踪,得到第一跟踪框;
    通过所述第二线程基于所述特征点在所述当前帧中对所述第二目标框进行跟踪,得到第二跟踪框;
    将所述第一跟踪框和所述第二跟踪框中的重复框进行合并,得到所述当前帧中的目标框。
  4. 根据权利要求3所述的方法,其中,所述将所述第一跟踪框和所述第二跟踪框中的重复框进行合并,得到所述当前帧中的目标框之前,所述方法还包括:
    如果所述第一跟踪框和所述第二跟踪框的并交比IoU大于IoU阈值,确定所述第一跟踪框和所述第二跟踪框存在重复框。
  5. 根据权利要求4所述的方法,其中,所述如果所述第一跟踪框和所 述第二跟踪框的并交比IoU大于IoU阈值,确定所述第一跟踪框和所述第二跟踪框存在重复框,包括:
    如果所述第一跟踪框和所述第二跟踪框的并交比IoU大于IoU阈值,且,所述第一跟踪框和所述第二跟踪框的类别相同,确定所述第一跟踪框和所述第二跟踪框存在重复框。
  6. 根据权利要求4或5所述的方法,其中,所述将所述第一跟踪框和所述第二跟踪框中的重复框进行合并,得到所述当前帧中的目标框,包括执行以下任意一种处理:
    如果所述第一跟踪框和所述第二跟踪框存在重复框,将所述第一跟踪框确定为所述当前帧的目标框;
    如果所述第一跟踪框和所述第二跟踪框存在重复框,将所述第一跟踪框和所述第二跟踪框中置信度最高的跟踪框确定为所述当前帧的目标框;
    如果所述第一跟踪框和所述第二跟踪框存在重复框,且所述第一跟踪框处于所述当前帧的边界,将所述第二跟踪框确定为所述当前帧的目标框。
  7. 根据权利要求2所述的方法,其中,所述通过所述第二线程基于所述特征点在所述当前帧中对第二目标框进行跟踪,得到所述当前帧中的目标框,包括:
    获取所述当前帧的跟踪特征点和所述当前帧的上一帧的目标特征点;
    通过所述第二线程将所述当前帧的跟踪特征点和所述上一帧的目标特征点组成多组特征点匹配对,所述目标特征点是位于所述第二目标框中的特征点;
    确定所述多组特征点匹配对的多组特征点偏移向量;
    基于所述多组特征点偏移向量,计算得到所述第二目标框的目标框偏移向量;
    根据所述目标框偏移向量对所述第二目标框进行偏移,得到所述当前帧中的目标框。
  8. 根据权利要求1至5任一项所述的方法,其中,所述对视频帧序列进行特征点分析,得到所述视频帧序列中每帧视频帧上的特征点,包括:
    通过第三线程对所述视频帧序列中的初始帧进行特征点提取,得到所述初始帧的特征点;
    通过所述第三线程基于所述初始帧的特征点,对所述视频帧序列中的第i帧进行特征点跟踪,得到所述视频帧序列中的第i帧的特征点,所述第i帧为位于所述初始帧之后的视频帧,i的起始编号为所述初始帧的帧号加一,i为正整数;
    通过所述第三线程基于所述第i帧的特征点,对所述视频帧序列中的第i+1帧进行特征点跟踪,得到所述视频帧序列中的第i+1帧的特征点。
  9. 根据权利要求8所述的方法,其中,所述通过所述第三线程基于所述第i帧的特征点,对所述视频帧序列中的第i+1帧进行特征点跟踪,得到 所述视频帧序列中的第i+1帧的特征点,包括:
    通过所述第三线程对所述第i帧的特征点进行光流跟踪,得到所述视频帧序列中的第i+1帧的特征点。
  10. 根据权利要求8所述的方法,其中,所述方法还包括:
    如果所述第i+1帧中的第一特征点满足删除条件,删除所述第i+1帧中的所述第一特征点;
    其中,所述删除条件包括如下至少之一:
    所述第一特征点是跟踪失败的特征点;
    所述第一特征点与相邻特征点的距离小于距离阈值。
  11. 根据权利要求8所述的方法,其中,所述方法还包括:
    如果所述第i+1帧中的目标区域满足补点条件,从所述目标区域中提取新增特征点;
    其中,所述补点条件包括:所述目标区域是特征点跟踪结果为空的区域。
  12. 一种检测跟踪装置,所述装置包括:
    分析模块,配置为对视频帧序列进行特征点分析,得到所述视频帧序列中每帧视频帧上的特征点;
    检测模块,配置为通过第一线程基于所述特征点对抽取帧进行目标检测,得到所述抽取帧中的目标框,所述抽取帧是采用目标步长在所述视频帧序列中抽取的视频帧;
    跟踪模块,配置为通过第二线程基于所述特征点和所述抽取帧中的目标框,在当前帧中进行目标框跟踪,得到所述当前帧中的目标框;
    输出模块,配置为输出所述当前帧中的目标框。
  13. 一种计算机设备,所述计算机设备包括:处理器和存储器,所述存储器存储有计算机程序,所述计算机程序由所述处理器加载并执行以实现如权利要求1至11任一项所述的检测跟踪方法。
  14. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序由处理器加载并执行以实现如权利要求1至11任一项所述的检测跟踪方法。
  15. 一种计算机程序产品或计算机程序,所述计算机程序产品或计算机程序包括计算机指令,所述计算机指令存储在计算机可读存储介质中;
    当电子设备的处理器从所述计算机可读存储介质读取所述计算机指令,并执行所述计算机指令时,实现权利要求1至11任一项所述的检测跟踪方法。
PCT/CN2022/079697 2021-03-17 2022-03-08 检测跟踪方法、装置、设备、存储介质及计算机程序产品 WO2022193990A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/976,287 US20230047514A1 (en) 2021-03-17 2022-10-28 Method and apparatus for detection and tracking, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110287909.XA CN113706576A (zh) 2021-03-17 2021-03-17 检测跟踪方法、装置、设备及介质
CN202110287909.X 2021-03-17

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/976,287 Continuation US20230047514A1 (en) 2021-03-17 2022-10-28 Method and apparatus for detection and tracking, and storage medium

Publications (1)

Publication Number Publication Date
WO2022193990A1 true WO2022193990A1 (zh) 2022-09-22

Family

ID=78647830

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/079697 WO2022193990A1 (zh) 2021-03-17 2022-03-08 检测跟踪方法、装置、设备、存储介质及计算机程序产品

Country Status (3)

Country Link
US (1) US20230047514A1 (zh)
CN (1) CN113706576A (zh)
WO (1) WO2022193990A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862148B (zh) * 2020-06-05 2024-02-09 中国人民解放军军事科学院国防科技创新研究院 实现视觉跟踪的方法、装置、电子设备及介质
CN113706576A (zh) * 2021-03-17 2021-11-26 腾讯科技(深圳)有限公司 检测跟踪方法、装置、设备及介质
CN114445710A (zh) * 2022-01-29 2022-05-06 北京百度网讯科技有限公司 图像识别方法、装置、电子设备以及存储介质
CN117649537B (zh) * 2024-01-30 2024-04-26 浙江省公众信息产业有限公司 监控视频对象识别跟踪方法、系统、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140334668A1 (en) * 2013-05-10 2014-11-13 Palo Alto Research Center Incorporated System and method for visual motion based object segmentation and tracking
CN110111363A (zh) * 2019-04-28 2019-08-09 深兰科技(上海)有限公司 一种基于目标检测的跟踪方法及设备
CN110610510A (zh) * 2019-08-29 2019-12-24 Oppo广东移动通信有限公司 目标跟踪方法、装置、电子设备及存储介质
CN110799984A (zh) * 2018-07-27 2020-02-14 深圳市大疆创新科技有限公司 跟踪控制方法、设备、计算机可读存储介质
CN110930434A (zh) * 2019-11-21 2020-03-27 腾讯科技(深圳)有限公司 目标对象跟踪方法、装置、存储介质和计算机设备
CN113706576A (zh) * 2021-03-17 2021-11-26 腾讯科技(深圳)有限公司 检测跟踪方法、装置、设备及介质


Also Published As

Publication number Publication date
US20230047514A1 (en) 2023-02-16
CN113706576A (zh) 2021-11-26

Similar Documents

Publication Publication Date Title
US11678734B2 (en) Method for processing images and electronic device
US11205282B2 (en) Relocalization method and apparatus in camera pose tracking process and storage medium
WO2022193990A1 (zh) 检测跟踪方法、装置、设备、存储介质及计算机程序产品
WO2019101021A1 (zh) 图像识别方法、装置及电子设备
CN111079576B (zh) 活体检测方法、装置、设备及存储介质
CN109947886B (zh) 图像处理方法、装置、电子设备及存储介质
WO2020221012A1 (zh) 图像特征点的运动信息确定方法、任务执行方法和设备
US11210810B2 (en) Camera localization method and apparatus, terminal, and storage medium
CN110807361A (zh) 人体识别方法、装置、计算机设备及存储介质
CN111127509B (zh) 目标跟踪方法、装置和计算机可读存储介质
WO2020249025A1 (zh) 身份信息的确定方法、装置及存储介质
CN110570460A (zh) 目标跟踪方法、装置、计算机设备及计算机可读存储介质
CN113627413B (zh) 数据标注方法、图像比对方法及装置
CN111754386A (zh) 图像区域屏蔽方法、装置、设备及存储介质
CN111862148A (zh) 实现视觉跟踪的方法、装置、电子设备及介质
CN111860064B (zh) 基于视频的目标检测方法、装置、设备及存储介质
CN111931712A (zh) 人脸识别方法、装置、抓拍机及系统
WO2023066373A1 (zh) 确定样本图像的方法、装置、设备及存储介质
CN111191579A (zh) 物品检测方法、装置、终端及计算机可读存储介质
CN111611414A (zh) 车辆检索方法、装置及存储介质
CN111757146B (zh) 视频拼接的方法、系统及存储介质
CN111160248A (zh) 物品跟踪的方法、装置、计算机设备及存储介质
CN115221888A (zh) 实体提及的识别方法、装置、设备及存储介质
CN112861565A (zh) 确定轨迹相似度的方法、装置、计算机设备和存储介质
CN116342866A (zh) 目标检测方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22770345

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01/02/2024)