WO2024012367A1 - Visual-target tracking method and apparatus, and device and storage medium - Google Patents

Visual-target tracking method and apparatus, and device and storage medium

Info

Publication number
WO2024012367A1
WO2024012367A1 · PCT/CN2023/106311 · CN2023106311W
Authority
WO
WIPO (PCT)
Prior art keywords
target
visual
video frame
visual object
tracking
Prior art date
Application number
PCT/CN2023/106311
Other languages
French (fr)
Chinese (zh)
Inventor
张伟俊
Original Assignee
影石创新科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 影石创新科技股份有限公司
Publication of WO2024012367A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Definitions

  • Embodiments of the present invention relate to the field of computer vision technology, and in particular to a visual target tracking method, device, equipment and storage medium.
  • Tracking visual targets in videos is a technology that, given the size and position of a visual target in a specific image frame, predicts the size and position of the object corresponding to that visual target in subsequent image frames of the video sequence. It is widely used in many fields such as video surveillance, human-computer interaction, and multimedia analysis.
  • Embodiments of the present invention provide a visual target tracking method, device, equipment and storage medium, which can accurately track specific visual targets without interference from visual targets of the same category.
  • embodiments of the present invention provide a visual target tracking method for use in electronic devices.
  • the method includes: inputting the current video frame of the video to be processed into a preset target tracking model, and marking, in the current video frame, the first image area corresponding to the target visual object determined based on the key video frame of the video to be processed; performing target detection on the current video frame, and identifying second image areas corresponding to multiple visual objects of the same type as the target visual object; and using a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, determining, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
  • On the one hand, the above visual target tracking method uses a target tracking model that has completed target tracking on the previous image frame to detect the target object to be tracked; on the other hand, it detects other objects of the same type as the target object in the current image frame. A target classification model that distinguishes a specific visual object from multiple visual objects of the same type is then used to classify the detected target object and the other same-type objects, distinguishing the target object to be tracked. In other words, the target object proposed by the target tracking model and the other objects of the same type are extracted, and the classifier performs a secondary detection on these extracted visual objects, picking out the target visual object from among visual objects of the same type. This secondary detection avoids interference from similar objects, so the specific visual target is tracked accurately without interference from visual targets of the same category.
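As a minimal sketch of this secondary-detection idea (the function names and the scoring interface are illustrative stand-ins, not the patent's models), the candidate region proposed by the tracking model and the same-category detections can be pooled and re-scored by the classifier, keeping the highest-scoring region as the target:

```python
def track_and_verify(frame, tracker, detector, classifier):
    """Return the region judged to contain the target visual object.

    tracker(frame)    -> candidate region from the target tracking model
    detector(frame)   -> list of regions of same-category visual objects
    classifier(region)-> score; higher means 'more likely the tracked target'
    """
    first_region = tracker(frame)          # tracking model's candidate
    same_class_regions = detector(frame)   # same-category detections
    candidates = [first_region] + same_class_regions
    # Secondary detection: re-score every candidate with the target
    # classification model, which suppresses same-category interference.
    return max(candidates, key=classifier)
```

The tracker's output is deliberately kept in the candidate pool, so the classifier can either confirm it or override it with a same-category detection.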
  • the method further includes:
  • Target detection is performed on the current video frame, and multiple second image areas corresponding to visual objects having the same category information as the target visual object are identified.
  • the process of setting the target classification model includes:
  • the tracking visual target sample is used as a positive sample, and the interfering visual target sample is used as a negative sample.
  • the pre-built classifier is trained multiple times to obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.
  • the method further includes:
  • the tracking visual target sample is obtained from the tracking visual target sample list, and the interfering visual target sample is obtained from the interfering visual target sample list.
  • before adding the functional area of the target visual object in the current video frame to the tracking visual target sample list, the method further includes:
  • adding the functional area of the target visual object in the current video frame to the tracking visual target sample list includes:
  • extracting the category information of the target visual object from the key video frame of the video to be processed includes:
  • the visual object corresponding to the trigger signal is obtained as the target visual object, together with the category information.
  • an embodiment of the present invention provides a visual target tracking device, which is provided in an electronic device.
  • the device includes:
  • a marking module configured to input the current video frame of the video to be processed into a preset target tracking model, and mark, in the current video frame, the first image area corresponding to the target visual object determined based on the key video frame of the video to be processed;
  • a detection module configured to perform target detection on the current video frame and identify second image areas corresponding to multiple visual objects of the same type as the target visual object;
  • a classification module configured to use a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, and determine, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
  • the device further includes:
  • Optimization module configured to, when the functional area of the target visual object in the current video frame is inconsistent with the first image area, use the functional area of the target visual object in the current video frame to optimize the preset target tracking model.
  • the device further includes:
  • An extraction module configured to respond to a trigger operation for the video to be processed and extract the category information of the target visual object from the key video frame of the video to be processed;
  • the detection module is specifically configured to perform target detection on the current video frame according to the category information, and identify the second image areas corresponding to multiple visual objects having the same category information as the target visual object.
  • the device further includes a target classification model setting module, and the target classification model setting module includes:
  • the sample acquisition sub-module is used to acquire, from target-tracked video frames of the video to be processed, the tracking visual target sample corresponding to the target visual object and the interfering visual target samples corresponding to visual objects that have the category information other than the target visual object;
  • the training sub-module is used to take the tracking visual target sample as a positive sample and the interfering visual target sample as a negative sample, train the pre-built classifier multiple times, and obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.
  • the target classification model setting module further includes:
  • the first adding sub-module is used to add the target visual object in the functional area of the current video frame to the tracking visual target sample list;
  • the second adding sub-module is used to obtain third image areas by deleting the functional area of the target visual object in the current video frame from the set composed of the first image area and the multiple second image areas, and to add them to the interfering visual target sample list;
  • the sample acquisition sub-module is specifically configured to obtain the tracking visual target sample from the tracking visual target sample list, and obtain the interfering visual target sample from the interfering visual target sample list.
  • the target classification model setting module further includes:
  • Time acquisition submodule, used to obtain the time information of the current video frame;
  • the first added sub-module includes:
  • the deletion subunit is used to delete elements in the tracking visual target sample list whose storage time exceeds a preset time length or elements ranked first in the tracking visual target sample list.
  • the extraction module includes:
  • the response sub-module is used to respond to the trigger operation for the video to be processed and mark the target visual object selected in the key video frame of the video to be processed;
  • a detection submodule is used to perform category detection on the target visual object and obtain category information of the target visual object.
  • the extraction module further includes:
  • An identification submodule configured to identify and display multiple visual objects whose categories are associated with user-input category information in key video frames of the video to be processed;
  • the receiving submodule is configured to, when receiving a trigger signal for a visual object associated with the category information, obtain the visual object corresponding to the trigger signal as the target visual object and the category information.
  • embodiments of the present invention provide an electronic device, including: at least one processor; and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method provided in the first aspect.
  • embodiments of the present invention provide a non-transitory computer-readable storage medium.
  • the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the method provided in the first aspect.
  • Figure 1 is a flow chart of the steps of the visual target tracking method proposed by the embodiment of the present invention.
  • Figure 2 is a step flow chart of another visual target tracking method according to an embodiment of the present invention.
  • Figure 3 is a functional module diagram of the visual target tracking device proposed by the embodiment of the present invention.
  • Figure 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
  • the visual target tracking method proposed in the embodiment of the present invention can be applied to electronic devices such as terminals and servers.
  • FIG. 1 is a flow chart of the steps of the visual target tracking method proposed by the embodiment of the present invention. As shown in Figure 1, the steps include:
  • S110 Input the current video frame of the video to be processed into a preset target tracking model, and mark, in the current video frame, the first image area corresponding to the target visual object determined based on the key video frame of the video to be processed.
  • the target visual object is a moving visual target that needs to be tracked.
  • the moving visual target in the image sequence of the video to be processed is detected, extracted, identified, and tracked, and the motion parameters of the moving visual target, such as position, speed, acceleration, and motion trajectory, are obtained.
  • the visual objects in visual target tracking can be objects in the image frames of the video, for example, they can be people, cars, animals, robots, etc. in the video image frames.
  • the image area corresponding to the visual object on the image can represent: the area in the image where the pixels displaying the visual object are located.
  • Each frame of image in the video to be processed can be used as the current video frame in turn, or an image frame can be extracted from the image sequence of the video to be processed every preset number of frames as the current video frame. For example, extract an image from the video to be processed every N (N is greater than or equal to 1) frames as the current video frame, and detect the position and size of the specified visual object in the current video frame.
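The periodic frame-extraction step above can be sketched as a small helper (the function name is illustrative; `n` plays the role of N, with n = 1 meaning every frame is processed):

```python
def sample_frames(frames, n=1):
    """Yield every n-th frame (n >= 1) of an image sequence as the
    'current video frame' to run tracking on."""
    for idx, frame in enumerate(frames):
        if idx % n == 0:
            yield frame
```

For example, with n = 3 the tracker processes frames 0, 3, 6, 9, ... of the video to be processed.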
  • the preset target tracking model can be a correlation-filter tracker such as the Discriminative Correlation Filter (DCF), or a tracker based on convolutional neural network (CNN) technology such as the Siamese-network visual tracker SiamRPN (Evolution of Siamese Visual Tracking with Very Deep Networks).
  • the preset target tracking model can be optimized using the tracking results after visual object tracking is completed for each frame of the video to be processed, thereby improving the robustness of the preset target tracking model.
  • the method of marking the target visual object determined based on the key video frame of the video to be processed in the first image area corresponding to the current video frame may be frame selection, highlighting, etc.
  • the key video frame of the video to be processed may be the first frame of the video to be processed, or the image with the best quality in the video to be processed.
  • the embodiment of the present invention responds to a user selection instruction on the key video frame, and determines the visual object corresponding to the selected object as the target visual object in other video frames of the video to be processed.
  • An example electronic device of the present invention performs S110 as follows: determine the target visual object based on the first frame image of the video to be processed, input the first frame image and the second frame image into the target tracking model, use the target tracking model to frame-select the position of the target visual object in the second frame image, and output the first image area corresponding to the target visual object in the second frame image.
  • S120 Perform target detection on the current video frame, and identify second image areas corresponding to multiple visual objects of the same type as the target visual object.
  • Methods for target detection on the current video frame include: detection methods based on hand-crafted features (such as template matching, key-point matching, or key-feature methods), or detection methods based on convolutional neural network technology (such as YOLO, SSD, R-CNN, or Mask R-CNN).
  • For example, if the type of the target visual object is pedestrian, all pedestrians in the current video frame are detected.
  • S130 Using a target classification model that distinguishes specific visual objects from multiple visual objects of the same type, determine, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
  • the functional area of the target visual object in the current video frame includes: the display area of the target visual object in the current video frame, or an area that triggers an intelligent device, such as a mobile robot, a drone, or a camera with a pan/tilt head, to perform mechanical movement; in response to an instruction computed from the functional area of the target visual object in the current video frame, the device moves, or rotates its pan/tilt head, to physically track the target visual object.
  • when the target classification model determines the functional area of the target visual object in the current video frame, it can output the current image frame with the functional area frame-selected.
  • the target classification model can classify multiple visual objects of the same type into target visual objects and non-target visual objects.
  • the target tracking model is used to identify the possible target visual object in the current video frame; suppose the corresponding image area is A.
  • Target detection is performed on the current video frame, and multiple visual objects of the same type as the target visual object are identified in image areas of the current video frame: B, C, D, and A.
  • Use the target classification model to classify image areas A, B, C, and D, and determine that image area A is the area where the target visual object is displayed in the current video frame. Because of the two computations, by the target tracking model and by the target classification model, the target visual object tracked in the current video frame is more accurate; in particular, the target classification model classifies visual objects of the same type, which effectively avoids interference from visual objects of the same type.
  • Embodiments of the present invention also propose determining whether the functional area of the target visual object output by the target classification model in the current video frame is consistent with the first image area detected by the target tracking model in the current video frame.
  • If the functional area is consistent with the first image area of the current video frame, the target tracking model can currently detect the target visual object in the current video frame accurately.
  • FIG. 2 is a flow chart of the steps of another visual target tracking method according to the embodiment of the present invention. The steps include:
  • S210 In response to a triggering operation on the video to be processed, extract the category information of the target visual object from the key video frame of the video to be processed.
  • Executing S210 may be implemented by executing sub-step S211 or executing sub-step S212.
  • S211 In response to the triggering operation on the video to be processed, mark the target visual object selected in the key video frame of the video to be processed; perform category detection on the target visual object to obtain category information of the target visual object.
  • the electronic device displays key video frames, the user selects a specific image area, and the electronic device performs category detection on the selected specific image area to obtain category information of the target visual object.
  • the electronic device can calculate the intersection-over-union of each detection result with the initial target bounding box selected by the user, and select the category of the detection result with the largest intersection-over-union as the category information.
  • the electronic device can also send the key video frame and the initial target bounding box marking the specific image area to a CNN-based target classification algorithm to obtain the category with the highest score as the category information.
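The intersection-over-union selection described two bullets above can be sketched as follows (the `(x1, y1, x2, y2)` box format and the helper names are assumptions for illustration, not taken from the patent):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def category_from_detections(initial_box, detections):
    """Pick the category of the detection that overlaps the user's
    initial target bounding box the most."""
    best = max(detections, key=lambda d: iou(initial_box, d["box"]))
    return best["category"]
```

Here each detection is represented as a dict with a `box` and a `category` key; the detection with the largest overlap supplies the category information.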
  • S212 In the key video frame of the video to be processed, identify and display multiple visual objects whose categories are associated with the category information input by the user; when receiving a trigger signal for a visual object associated with the category information, obtain the visual object corresponding to the trigger signal as the target visual object, together with the category information.
  • the electronic device receives category information input by the user, detects key video frames, identifies multiple visual objects corresponding to the category information, and displays the multiple visual objects.
  • the visual object selected by the user is determined as the target visual object, thereby obtaining the target visual object and category information of the target visual object.
  • S220 Input the current video frame of the video to be processed into a preset target tracking model, and mark, in the current video frame, the first image area corresponding to the target visual object determined based on the key video frame of the video to be processed.
  • S230 Perform target detection on the current video frame according to the category information, and identify second image areas corresponding to multiple visual objects having the same category information as the target visual object.
  • S240 Using a target classification model that distinguishes specific visual objects from multiple visual objects of the same type, determine, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
  • the target classification model may be obtained by: obtaining, from video frames of the video to be processed that have undergone target tracking, tracking visual target samples corresponding to the target visual object and interfering visual target samples corresponding to visual objects that have the category information other than the target visual object; and using the tracking visual target samples as positive samples and the interfering visual target samples as negative samples, training the pre-built classifier multiple times to obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.
  • the classifier can be a nearest neighbor classifier, a decision tree, a support vector machine classifier, etc.
  • the visual target tracking method includes periodically extracting the current video frame of the video to be processed and performing visual object tracking on the Nth extracted current video frame. In the process of determining the target visual object, visual objects of the same category are added to the interfering visual target sample list, and the target visual object identified in the Nth extracted current video frame is added to the tracking visual target sample list.
  • the elements in the tracking visual target sample list are used as positive samples, and the elements in the interfering visual target sample list are used as negative samples.
  • the pre-built classifier is trained multiple times to obtain a target classification model that can classify similar visual objects during visual object tracking of the (N+1)th extracted current video frame; alternatively, the target classification model used during target tracking of the (N-1)th extracted current video frame is further trained, obtaining the target classification model that classifies similar visual objects during visual object tracking of the (N+1)th extracted current video frame.
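The per-frame cycle described above, tracking, detecting same-category objects, classifying with the current model, updating the two sample lists, and retraining for the next frame, can be sketched as follows (all callables are stand-ins for the patent's models, and `train_classifier` represents any of the classifiers mentioned below):

```python
def tracking_loop(frames, tracker, detector, train_classifier):
    """Orchestrate one pass over the extracted current video frames.

    tracker(frame)                 -> tracker's candidate region
    detector(frame)                -> same-category regions
    train_classifier(pos, neg)     -> scoring function (higher = target)
    """
    positives, negatives = [], []  # tracking / interfering sample lists
    classify = None                # no classifier before the first training
    results = []
    for frame in frames:
        first_area = tracker(frame)
        others = detector(frame)
        if classify is not None:
            # secondary detection with the model trained on earlier frames
            target = max([first_area] + others, key=classify)
        else:
            target = first_area    # fall back to the tracker's output
        results.append(target)
        # update sample lists: target -> positives, the rest -> negatives
        positives.append(target)
        negatives.extend(a for a in [first_area] + others if a != target)
        # (re)train for use on the next extracted frame
        classify = train_classifier(positives, negatives)
    return results
```

Note that the classifier trained on frame N is only applied to frame N+1, matching the description above.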
  • Yet another visual target tracking method includes adding the functional area obtained based on the current video frame to the tracking visual target sample list, and adding the image areas where visual objects with the category information other than the target visual object are located to the interfering visual target sample list. The negative samples for training the target classification model are obtained from the interfering visual target sample list, and the positive samples for training the target classification model are obtained from the tracking visual target sample list.
  • Obtaining, from the target-tracked video frames, the tracking visual target samples corresponding to the target visual object and the interfering visual target samples corresponding to visual objects with the category information other than the target visual object includes: obtaining the tracking visual target samples from the tracking visual target sample list, and obtaining the interfering visual target samples from the interfering visual target sample list.
  • Another visual target tracking method includes steps during the Nth execution of the visual target tracking method:
  • K101 Input the Nth frame image of the video to be processed into the preset target tracking model, and mark, in the Nth frame image of the video to be processed, the first image area A corresponding to the target visual object determined based on the key video frame of the video to be processed.
  • K102 Perform target detection on the Nth frame image of the video to be processed, and identify second image areas B, C, and D corresponding to multiple visual objects of the same type as the target visual object.
  • K103 Using a target classification model that distinguishes specific visual objects from multiple visual objects of the same type, determine, from the first image area and the multiple second image areas, the functional area B of the target visual object in the Nth frame image of the video to be processed.
  • the interfering visual target sample list is D = {D1, D2, ..., DL}.
  • the interfering visual target sample list includes elements D1, D2, ..., DL, which include the image areas of other visual objects of the same type as the target visual object obtained from the Nth frame image of the video to be processed (A, C, and D), and the image areas of other visual objects of the same type as the target visual object obtained from the (N-i)th frame images, where i is less than N and greater than or equal to 1.
  • K105 Add image area B to the tracking visual target sample list.
  • the tracking visual target sample list is T = {T1, T2, ..., TM}.
  • the tracking visual target sample list includes the tracking visual target sample obtained from the Nth frame image of the video to be processed.
  • the tracking visual target samples and interference visual target samples are determined based on the current video frame, and are added to the tracking visual target sample list and the interference visual target sample list respectively.
  • K106 Obtain tracking visual target samples from the tracking visual target sample list as positive samples and interfering visual target samples from the interfering visual target sample list as negative samples; train the pre-built classifier, or further train the target classification model trained during the (N-1)th execution of the visual target tracking method, to obtain the target classification model used in the (N+1)th execution of the visual target tracking method.
  • Methods for training target classification models include:
  • the squared difference of gray values is used to define the distance between two samples, and the nearest neighbor classifier is used for target classification.
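A minimal version of this nearest-neighbor classification, with each sample patch represented as a flat list of gray values (an assumption made for illustration; the patent does not fix a representation):

```python
def sq_gray_distance(a, b):
    """Sum of squared gray-value differences between two equal-size patches."""
    return sum((pa - pb) ** 2 for pa, pb in zip(a, b))

def nearest_neighbor_label(patch, positives, negatives):
    """Nearest-neighbor classification of a candidate patch:
    True  -> nearest training sample is a tracking target sample (positive)
    False -> nearest training sample is an interfering sample (negative)."""
    d_pos = min(sq_gray_distance(patch, p) for p in positives)
    d_neg = min(sq_gray_distance(patch, n) for n in negatives)
    return d_pos <= d_neg
```

Any other classifier mentioned above (decision tree, support vector machine) could replace the nearest-neighbor rule without changing the surrounding pipeline.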
  • the method further includes:
  • Add the functional area of the target visual object in the current video frame to the tracking visual target sample list.
  • When the above method is used, the time information of the current video frame can be obtained so that inapplicable elements can subsequently be removed.
  • When a sample is added, its adding time (image frame number, or absolute time) can be recorded at the same time, and a history forgetting mechanism can be implemented, such as:
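One possible sketch of such a forgetting mechanism (the class and parameter names are illustrative, not from the patent): each sample is stored with the frame number at which it was added, samples stored longer than a preset length are dropped, and the first-ranked (oldest) element is evicted once the list is full.

```python
from collections import deque

class SampleList:
    """Sample list with a history-forgetting mechanism."""

    def __init__(self, max_len=50, max_age=300):
        self.max_len = max_len    # capacity of the list
        self.max_age = max_age    # preset storage-time length, in frames
        self.entries = deque()    # (frame_no, sample) pairs, oldest first

    def add(self, sample, frame_no):
        # forget samples whose storage time exceeds the preset length
        while self.entries and frame_no - self.entries[0][0] > self.max_age:
            self.entries.popleft()
        # forget the first-ranked (oldest) element when the list is full
        if len(self.entries) >= self.max_len:
            self.entries.popleft()
        self.entries.append((frame_no, sample))

    def samples(self):
        return [s for _, s in self.entries]
```

The same structure can back both the tracking visual target sample list and the interfering visual target sample list.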
  • the embodiment of the present invention trains a target classification model on tracking target samples and interfering target samples collected in historical frames of the video to be processed, and uses it to update and correct the output of the tracking algorithm on the current video frame. The algorithm is therefore not easily interfered with by targets of the same category and is highly robust to occlusion and interference by targets of the same category, significantly improving the usability of the tracking algorithm in real scenarios.
  • Figure 3 is a functional module diagram of the visual target tracking device proposed by the embodiment of the present invention.
  • The above device is provided in an electronic device. As shown in Figure 3, the device includes:
  • Marking module 31 configured to input the current video frame of the video to be processed into a preset target tracking model, and mark the target visual object determined based on the key video frame of the video to be processed when the current video frame corresponds to the first image area;
  • Detection module 32, configured to perform target detection on the current video frame and identify the second image areas corresponding to multiple visual objects of the same type as the target visual object;
  • Classification module 33, configured to use a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, and determine, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
  • the visual target tracking device provided by the embodiment shown in Figure 3 can be used to implement the technical solutions of the method embodiments shown in Figures 1 and 2 of this specification. For its implementation principles and technical effects, further reference can be made to the relevant descriptions in the method embodiments.
  • the device also includes:
  • Optimization module, configured to, when the functional area of the target visual object in the current video frame is inconsistent with the first image area, use the functional area of the target visual object in the current video frame to optimize the preset target tracking model.
  • the device also includes:
  • An extraction module configured to respond to a trigger operation for the video to be processed and extract the category information of the target visual object from the key video frame of the video to be processed;
  • the detection module is specifically configured to perform target detection on the current video frame according to the category information, and identify the second image areas corresponding to multiple visual objects of the same category as the target visual object.
  • the device further includes a target classification model setting module, which includes:
  • a sample acquisition submodule, used to obtain, from the video frames of the video to be processed that have undergone target tracking, the tracking visual target samples corresponding to the target visual object, and the interfering visual target samples corresponding to visual objects, other than the target visual object, that have the category information;
  • a training submodule, used to train a pre-built classifier multiple times with the tracking visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.
  • the target classification model setting module also includes:
  • the first adding sub-module is used to add the target visual object in the functional area of the current video frame to the tracking visual target sample list;
  • the second adding sub-module is used to obtain the third image areas that remain after the functional area of the target visual object in the current video frame is removed from the set composed of the first image area and the multiple second image areas, and add them to the interfering visual target sample list;
  • the sample acquisition sub-module is specifically configured to obtain the tracking visual target sample from the tracking visual target sample list, and obtain the interfering visual target sample from the interfering visual target sample list.
  • the target classification model setting module also includes:
  • Time acquisition submodule, used to obtain the time information of the current video frame;
  • the first adding sub-module includes:
  • an adding subunit, used to add the functional area of the target visual object in the current video frame to the last position of the tracking visual target sample list, where the elements of the tracking visual target sample list are arranged from largest to smallest by their corresponding time information;
  • a deletion subunit, used to delete elements in the tracking visual target sample list whose storage time exceeds a preset time length, or the element ranked first in the tracking visual target sample list.
  • the extraction module includes:
  • the response sub-module is used to respond to the trigger operation for the video to be processed and mark the target visual object selected in the key video frame of the video to be processed;
  • a detection submodule is used to perform category detection on the target visual object and obtain category information of the target visual object.
  • the extraction module also includes:
  • An identification submodule configured to identify and display multiple visual objects whose categories are associated with user-input category information in key video frames of the video to be processed;
  • the receiving submodule is configured to, when receiving a trigger signal for a visual object associated with the category information, obtain the visual object corresponding to the trigger signal as the target visual object and the category information.
  • the device provided in the above-described embodiments may be, for example, a chip or a chip module.
  • the devices provided by the above-described embodiments are used to execute the technical solutions of the above-described method embodiments. For its implementation principles and technical effects, further reference can be made to the relevant descriptions in the method embodiments, which will not be described again here.
  • For each module/unit included in each device described in the above embodiments, it may be a software module/unit or a hardware module/unit, or it may be partly a software module/unit and partly a hardware module/unit.
  • For each device applied to or integrated in a chip, each module/unit included in it can be implemented in hardware such as circuits, or at least some of the modules/units can be implemented as a software program running on a processor integrated inside the chip, with the remaining modules/units implemented in hardware such as circuits.
  • For each device applied to or integrated in a chip module, each module/unit included in it can be implemented in hardware such as circuits, and different modules/units can be located in the same component (such as a chip or a circuit module) or in different components of the chip module; alternatively, at least some of the modules/units can be implemented as a software program running on a processor integrated inside the chip module, with the remaining modules/units implemented in hardware such as circuits.
  • For each device applied to or integrated in electronic terminal equipment, each module/unit included in it can be implemented in hardware such as circuits, and different modules/units can be located in the same component (e.g., a chip or a circuit module) or in different components within the electronic terminal equipment; alternatively, at least some of the modules/units can be implemented as a software program running on a processor integrated inside the electronic terminal equipment, with the remaining (if any) modules/units implemented in hardware such as circuits.
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
  • the electronic device 400 includes a processor 410, a memory 411, and a computer program stored in the memory 411 and executable on the processor 410.
  • when the processor 410 executes the program, the steps in the foregoing method embodiments are implemented.
  • the electronic equipment provided by the above embodiments can be used to execute the technical solutions of the method embodiments shown above. For its implementation principles and technical effects, reference can be made to the relevant descriptions in the method embodiments, which will not be repeated here.
  • Embodiments of the present invention provide a non-transitory computer-readable storage medium.
  • the non-transitory computer-readable storage medium stores computer instructions.
  • the computer instructions cause the computer to execute the methods of the embodiments shown in Figures 1 and 2 of this specification.
  • Non-transitory computer-readable storage media may refer to non-volatile computer storage media.
  • the above-mentioned non-transitory computer-readable storage medium may adopt any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such a propagated data signal may take a variety of forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operations described herein may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Therefore, features defined as "first" and "second" may explicitly or implicitly include at least one such feature.
  • "plurality" means at least two, for example two, three, etc., unless otherwise clearly and specifically limited.
  • the word "if" as used herein may be interpreted as "when", "upon", "in response to determining", or "in response to detecting".
  • the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (a stated condition or event) is detected", or "in response to detecting (a stated condition or event)".
  • terminals involved in the embodiments of the present invention may include, but are not limited to, personal computers (PCs), personal digital assistants (PDAs), wireless handheld devices, tablet computers, mobile phones, MP3 players, MP4 players, etc.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • each functional unit in each embodiment of this specification may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the above-mentioned integrated unit implemented in the form of a software functional unit can be stored in a computer-readable storage medium.
  • the above-mentioned software functional unit is stored in a storage medium and includes a number of instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some of the steps of the methods described in the various embodiments of this specification.
  • the aforementioned storage media include: a USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present invention relate to the technical field of computer vision. Provided are a visual-target tracking method and apparatus, and a device and a storage medium, which can accurately track a specific visual target without interference from visual targets of the same category. The method comprises: inputting into a preset target tracking model the current video frame of a video to be processed, and marking a corresponding first image area of a target visual object in the current video frame, which target visual object is determined on the basis of a key video frame of said video; performing target detection on the current video frame, so as to identify second image areas corresponding to a plurality of visual objects of the same type as the target visual object; and by using a target classification model for distinguishing a specific visual object from among the plurality of visual objects of the same type, determining, from the first image area and the plurality of second image areas, a functional area of the target visual object in the current video frame.

Description

Visual target tracking method, device, equipment and storage medium

[Technical Field]

Embodiments of the present invention relate to the field of computer vision technology, and in particular to a visual target tracking method, device, equipment and storage medium.

[Background Art]

Tracking a visual target in a video is a technique that, given the size and position of the visual target in a specific image frame, predicts the size and position of the object corresponding to that visual target in subsequent image frames of the video sequence. It is widely used in many fields such as video surveillance, human-computer interaction and multimedia analysis.

In practical applications in the current field of visual target tracking, whether for traditional algorithms based on DCF technology or deep-learning-based tracking algorithms represented by SiamRPN, robustness is poor in scenarios with interference from targets of the same category (for example, interfering vehicles of similar color, or interfering pedestrians with similar appearance and structural information). In particular, once the currently tracked target is occluded by a similar interfering target, the algorithm easily drifts onto the interfering target.
[Summary of the Invention]

Embodiments of the present invention provide a visual target tracking method, device, equipment and storage medium, which can accurately track a specific visual target without interference from visual targets of the same category.

In a first aspect, embodiments of the present invention provide a visual target tracking method, applied to an electronic device. The method includes: inputting the current video frame of the video to be processed into a preset target tracking model, and marking the first image area corresponding, in the current video frame, to the target visual object determined based on the key video frame of the video to be processed; performing target detection on the current video frame to identify second image areas corresponding to multiple visual objects of the same type as the target visual object; and using a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, determining from the first image area and the multiple second image areas the functional area of the target visual object in the current video frame.

For the current frame being processed, the above visual target tracking method, on the one hand, uses the target tracking model that completed target tracking on the previous frame to detect the target object to be tracked, and on the other hand detects other objects in the current frame of the same type as that target object. A target classification model capable of distinguishing a specific visual object from multiple visual objects of the same type is then applied to the detected candidate and to the other same-type objects in order to single out the target object to be tracked. In other words, the visual object that the tracking model regards as the target, together with the other objects of the same type, is extracted, and a classifier performs a secondary detection on these extracted visual objects to pick out the target visual object among the different visual objects of the same type. This secondary-detection approach avoids interference from similar objects and achieves accurate tracking of a specific visual target without interference from visual targets of the same category.
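The per-frame flow described above can be sketched as follows; `tracker`, `detector`, and `classifier` are hypothetical stand-ins for the preset target tracking model, the same-category detector, and the target classification model, and the scoring interface is an assumption of this sketch, not an interface fixed by the method:

```python
def track_frame(frame, tracker, detector, classifier, category):
    """One iteration of the secondary-detection tracking scheme (a sketch).

    tracker(frame)             -> first image area proposed by the tracking model
    detector(frame, category)  -> second image areas of same-category objects
    classifier.score(frame, r) -> confidence that region r is the tracked target
    """
    first_area = tracker(frame)
    second_areas = detector(frame, category)
    # Secondary detection: score every candidate region and keep the best one.
    candidates = [first_area] + list(second_areas)
    functional_area = max(candidates, key=lambda r: classifier.score(frame, r))
    # If the classifier disagrees with the tracker, correct the tracking model.
    if functional_area != first_area:
        tracker.update(frame, functional_area)
    return functional_area
```

The point of the design is that the tracker's own proposal is treated as just one candidate among the same-category detections, so an occluding interferer that fooled the tracker can still be rejected by the classifier.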
In one possible implementation, after the functional area of the target visual object in the current video frame is determined from the first image area and the multiple second image areas, the method further includes:

when the functional area of the target visual object in the current video frame is inconsistent with the first image area, using the functional area of the target visual object in the current video frame to optimize the preset target tracking model.
In one possible implementation, the method further includes:

in response to a trigger operation for the video to be processed, extracting the category information of the target visual object from the key video frame of the video to be processed.

Performing target detection on the current video frame and identifying the second image areas corresponding to multiple visual objects of the same category as the target visual object includes:

performing target detection on the current video frame according to the category information, and identifying the second image areas corresponding to multiple visual objects of the same category as the target visual object.
In one possible implementation, the process of setting the target classification model includes:

obtaining, from video frames of the video to be processed that have undergone target tracking, tracking visual target samples corresponding to the target visual object, and interfering visual target samples corresponding to visual objects, other than the target visual object, that have the category information;

training a pre-built classifier multiple times with the tracking visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.
In one possible implementation, after the functional area of the target visual object in the current video frame is determined, the method further includes:

adding the functional area of the target visual object in the current video frame to a tracking visual target sample list;

obtaining the third image areas that remain after the functional area of the target visual object in the current video frame is removed from the set composed of the first image area and the multiple second image areas, and adding them to an interfering visual target sample list.

Obtaining, from the video frames that have undergone target tracking, the tracking visual target samples corresponding to the target visual object, and the interfering visual target samples corresponding to visual objects, other than the target visual object, that have the category information, includes:

obtaining the tracking visual target samples from the tracking visual target sample list, and obtaining the interfering visual target samples from the interfering visual target sample list.
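The sample-collection step above can be sketched as follows; `crop` is a hypothetical helper for extracting a region's image patch from the frame, and plain Python lists stand in for the two sample lists:

```python
def collect_samples(frame, functional_area, first_area, second_areas,
                    tracking_list, interfering_list, crop):
    """Update the classifier's sample lists after one tracked frame.

    The chosen functional area becomes a positive (tracking) sample; every
    other candidate region (the "third image areas") becomes a negative
    (interfering) sample. `crop` is a placeholder for patch extraction.
    """
    tracking_list.append(crop(frame, functional_area))
    # Third image areas: all candidates except the chosen functional area.
    candidates = [first_area] + list(second_areas)
    for region in candidates:
        if region != functional_area:
            interfering_list.append(crop(frame, region))
    return tracking_list, interfering_list
```

These lists are exactly the positive and negative sample pools used when the pre-built classifier is retrained.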
In one possible implementation, before the functional area of the target visual object in the current video frame is added to the tracking visual target sample list, the method further includes:

obtaining the time information of the current video frame.

Adding the functional area of the target visual object in the current video frame to the tracking visual target sample list includes:

adding the functional area of the target visual object in the current video frame to the last position of the tracking visual target sample list, where the elements of the tracking visual target sample list are arranged from largest to smallest by their corresponding time information;

deleting elements in the tracking visual target sample list whose storage time exceeds a preset time length, or the element ranked first in the tracking visual target sample list.
In one possible implementation, in response to a trigger operation for the video to be processed, extracting the category information of the target visual object from the key video frame of the video to be processed includes:

in response to the trigger operation for the video to be processed, marking the target visual object selected in the key video frame of the video to be processed;

performing category detection on the target visual object to obtain the category information of the target visual object.

In one possible implementation, in response to a trigger operation for the video to be processed, extracting the category information of the target visual object from the key video frame of the video to be processed includes:

in the key video frame of the video to be processed, identifying and displaying multiple visual objects whose categories are associated with category information input by a user;

when a trigger signal for a visual object associated with the category information is received, taking the visual object corresponding to the trigger signal as the target visual object and obtaining the category information.
In a second aspect, embodiments of the present invention provide a visual target tracking device, provided in an electronic device. The device includes:

a marking module, configured to input the current video frame of the video to be processed into a preset target tracking model, and mark the first image area corresponding, in the current video frame, to the target visual object determined based on the key video frame of the video to be processed;

a detection module, configured to perform target detection on the current video frame and identify the second image areas corresponding to multiple visual objects of the same type as the target visual object;

a classification module, configured to use a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, and determine, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
In one possible implementation, the device further includes:

an optimization module, configured to, when the functional area of the target visual object in the current video frame is inconsistent with the first image area, use the functional area of the target visual object in the current video frame to optimize the preset target tracking model.

In one possible implementation, the device further includes:

an extraction module, configured to respond to a trigger operation for the video to be processed and extract the category information of the target visual object from the key video frame of the video to be processed;

the detection module is specifically configured to perform target detection on the current video frame according to the category information, and identify the second image areas corresponding to multiple visual objects of the same category as the target visual object.
In one possible implementation, the device further includes a target classification model setting module, which includes:

a sample acquisition submodule, used to obtain, from the video frames of the video to be processed that have undergone target tracking, the tracking visual target samples corresponding to the target visual object, and the interfering visual target samples corresponding to visual objects, other than the target visual object, that have the category information;

a training submodule, used to train a pre-built classifier multiple times with the tracking visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.

In one possible implementation, the target classification model setting module further includes:

a first adding submodule, used to add the functional area of the target visual object in the current video frame to the tracking visual target sample list;

a second adding submodule, used to obtain the third image areas that remain after the functional area of the target visual object in the current video frame is removed from the set composed of the first image area and the multiple second image areas, and add them to the interfering visual target sample list;

the sample acquisition submodule is specifically used to obtain the tracking visual target samples from the tracking visual target sample list, and obtain the interfering visual target samples from the interfering visual target sample list.
In one possible implementation, the target classification model setting module further includes:

a time acquisition submodule, used to obtain the time information of the current video frame;

the first adding submodule includes:

an adding subunit, used to add the functional area of the target visual object in the current video frame to the last position of the tracking visual target sample list, where the elements of the tracking visual target sample list are arranged from largest to smallest by their corresponding time information;

a deletion subunit, used to delete elements in the tracking visual target sample list whose storage time exceeds a preset time length, or the element ranked first in the tracking visual target sample list.
其中一种可能的实现方式中,所述提取模块包括:In one possible implementation, the extraction module includes:
响应子模块,用于响应针对待处理视频的触发操作,标记所述待处理视频的关键视频帧选定的目标视觉对象;The response sub-module is used to respond to the trigger operation for the video to be processed and mark the target visual object selected in the key video frame of the video to be processed;
检测子模块,用于对所述目标视觉对象进行类别检测,获得所述目标视觉对象的类别信息。A detection submodule is used to perform category detection on the target visual object and obtain category information of the target visual object.
其中一种可能的实现方式中,所述提取模块还包括:In one possible implementation, the extraction module further includes:
识别子模块,用于在所述待处理视频的关键视频帧,识别并显示类别与用户输入类别信息关联的多个视觉对象; An identification submodule, configured to identify and display multiple visual objects whose categories are associated with user-input category information in key video frames of the video to be processed;
接收子模块,用于接收到针对关联所述类别信息的视觉对象的触发信号时,获取触发信号对应视觉对象为所述目标视觉对象和所述类别信息。The receiving submodule is configured to, upon receiving a trigger signal for a visual object associated with the category information, take the visual object corresponding to the trigger signal as the target visual object and obtain the category information.
第三方面,本发明实施例提供一种电子设备,包括:至少一个处理器;以及与所述处理器通信连接的至少一个存储器,其中:所述存储器存储有可被所述处理器执行的程序指令,所述处理器调用所述程序指令能够执行第一方面提供的方法。In a third aspect, embodiments of the present invention provide an electronic device, including: at least one processor; and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method provided in the first aspect.
第四方面,本发明实施例提供一种非暂态计算机可读存储介质,所述非暂态计算机可读存储介质存储计算机指令,所述计算机指令使所述计算机执行第一方面提供的方法。In a fourth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the method provided in the first aspect.
应当理解的是,本发明实施例的第二~四方面与本发明实施例的第一方面的技术方案一致,各方面及对应的可行实施方式所取得的有益效果相似,不再赘述。It should be understood that the second to fourth aspects of the embodiments of the present invention are consistent with the technical solution of the first aspect of the embodiments of the present invention, and the beneficial effects achieved by each aspect and corresponding feasible implementations are similar, and will not be described again.
【附图说明】[Brief Description of the Drawings]
为了更清楚地说明本发明实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本说明书的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其它的附图。In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this specification; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.
图1是本发明实施例提出的视觉目标跟踪方法步骤流程图;Figure 1 is a flow chart of the steps of the visual target tracking method proposed by the embodiment of the present invention;
图2是本发明实施例另一种视觉目标跟踪方法的步骤流程图;Figure 2 is a step flow chart of another visual target tracking method according to an embodiment of the present invention;
图3是本发明实施例提出的视觉目标跟踪装置的功能模块图;Figure 3 is a functional module diagram of the visual target tracking device proposed by the embodiment of the present invention;
图4为本发明实施例提供的一种电子设备的结构示意图。Figure 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
【具体实施方式】[Detailed Description of the Embodiments]
为了更好的理解本说明书的技术方案,下面结合附图对本发明实施例进行详细描述。In order to better understand the technical solutions of this specification, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
应当明确,所描述的实施例仅仅是本说明书一部分实施例,而不是全部的实施例。基于本说明书中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其它实施例,都属于本说明书保护的范围。It should be clear that the described embodiments are only some, rather than all, of the embodiments of this specification. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this specification.
在本发明实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本说明书。在本发明实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。The terminology used in the embodiments of the present invention is for the purpose of describing specific embodiments only and is not intended to limit this specification. The singular forms "a/an", "said", and "the" used in the embodiments of the present invention and in the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
本发明实施例提出的视觉目标跟踪方法可以应用于终端、服务器等电子设备。The visual target tracking method proposed in the embodiment of the present invention can be applied to electronic devices such as terminals and servers.
图1是本发明实施例提出的视觉目标跟踪方法步骤流程图,如图1所示,步骤包括:Figure 1 is a flow chart of the steps of the visual target tracking method proposed by the embodiment of the present invention. As shown in Figure 1, the steps include:
S110:将所述待处理视频的当前视频帧输入预先设定的目标跟踪模型,标记基于所述待处理视频的关键视频帧确定的目标视觉对象在所述当前视频帧对应第一图像区域。S110: Input the current video frame of the video to be processed into a preset target tracking model, and mark the target visual object determined based on the key video frame of the video to be processed in the first image area corresponding to the current video frame.
目标视觉对象是需要跟踪的运动视觉目标,对待处理视频的图像序列中的运动视觉目标进行检测、提取、识别和跟踪,获得运动视觉目标的运动参数,如位置、速度、加速度和运动轨迹等。The target visual object is a moving visual target that needs to be tracked. The moving visual target in the image sequence of the video to be processed is detected, extracted, identified, and tracked, and the motion parameters of the moving visual target, such as position, speed, acceleration, and motion trajectory, are obtained.
视觉目标跟踪中的视觉对象可以是在视频的图像帧中的物体,例如可以是视频图像帧中的人、车、动物以及机器人等。The visual objects in visual target tracking can be objects in the image frames of the video, for example, they can be people, cars, animals, robots, etc. in the video image frames.
视觉对象在图像上对应的图像区域可以表示:图像中显示视觉对象的像素点所在区域。The image area corresponding to the visual object on the image can represent: the area in the image where the pixels displaying the visual object are located.
可以依次将待处理视频中的每帧图像作为当前视频帧,或者每隔预设帧数,从待处理视频的图像序列中抽取图像帧,作为当前视频帧。例如,每隔N(N大于或者等于1)帧抽取待处理视频中的图像,作为当前视频帧,检测指定视觉对象在当前视频帧的位置和大小。Each frame of image in the video to be processed can be used as the current video frame in turn, or an image frame can be extracted from the image sequence of the video to be processed every preset number of frames as the current video frame. For example, extract an image from the video to be processed every N (N is greater than or equal to 1) frames as the current video frame, and detect the position and size of the specified visual object in the current video frame.
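As an illustrative sketch (not part of the claimed method), the periodic frame extraction described above can be expressed as follows; the video is represented only by its frame count, and `interval` stands for the assumed sampling step N:

```python
def sample_frame_indices(num_frames, interval):
    """Return the indices of frames taken as the 'current video frame',
    one every `interval` frames (interval >= 1), starting from frame 0."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return list(range(0, num_frames, interval))
```

With `interval == 1` every frame is processed; larger values trade tracking granularity for computation.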
预先设定的目标跟踪模型可以是判别相关滤波器(Discriminative Correlation Filter,DCF)等相关滤波类的跟踪器,也可以是孪生网络视觉跟踪器(Evolution of Siamese Visual Tracking with Very Deep Networks,SiamRPN)等基于卷积神经网络(CNN)技术的跟踪器。The preset target tracking model may be a correlation-filter tracker such as the Discriminative Correlation Filter (DCF), or a tracker based on convolutional neural network (CNN) technology, such as the Siamese-network visual tracker SiamRPN (Evolution of Siamese Visual Tracking with Very Deep Networks).
设定的目标跟踪模型可以在待处理视频每一帧图像完成视觉对象追踪后,通过追踪结果进行优化,提高设定的目标跟踪模型的鲁棒性。After visual object tracking is completed for each frame of the video to be processed, the preset target tracking model can be optimized using the tracking result, improving its robustness.
标记基于所述待处理视频的关键视频帧确定的目标视觉对象在所述当前视频帧对应第一图像区域的方式可以是框选、突出显示等。The method of marking the target visual object determined based on the key video frame of the video to be processed in the first image area corresponding to the current video frame may be frame selection, highlighting, etc.
待处理视频的关键视频帧可以是待处理视频的第一帧图像,还可以是待处理视频中质量最好的一帧图像。本发明实施例基于关键视频帧响应用户选择指令,确定指定物体对应视觉对象为目标视觉对象,以在待处理视频的其他视频帧中进行追踪。The key video frame of the video to be processed may be the first frame of the video to be processed, or the frame with the best quality in the video to be processed. The embodiment of the present invention responds to a user selection instruction based on the key video frame and determines the visual object corresponding to the specified object as the target visual object, so that it can be tracked in the other video frames of the video to be processed.
本发明一种示例电子设备执行S110:基于待处理视频第一帧图像,确定目标视觉对象,将第一帧图像和第二帧图像输入目标跟踪模型,利用目标跟踪模型在第二帧图像中框选出目标视觉对象的位置,输出目标视觉对象在第二帧图像对应第一图像区域。In an example of the present invention, an electronic device performs S110 as follows: the target visual object is determined based on the first frame image of the video to be processed, the first frame image and the second frame image are input into the target tracking model, the target tracking model box-selects the position of the target visual object in the second frame image, and the first image area corresponding to the target visual object in the second frame image is output.
S120:对所述当前视频帧进行目标检测,识别出多个类型与所述目标视觉对象相同的视觉对象对应第二图像区域。S120: Perform target detection on the current video frame, and identify multiple visual objects of the same type as the target visual object corresponding to the second image area.
对当前视频帧进行目标检测的方式包括:基于手工特征的检测方法(如模板匹配法、关键点匹配法、关键特征法等),也可以是基于卷积神经网络技术的检测方法(如YOLO,SSD,R-CNN,Mask R-CNN等)。Methods for target detection on the current video frame include: detection methods based on manual features (such as template matching method, key point matching method, key feature method, etc.), or detection methods based on convolutional neural network technology (such as YOLO, SSD, R-CNN, Mask R-CNN, etc.).
示例地,目标视觉对象的类型是行人,检测出当前视频帧中所有的行人。For example, the type of the target visual object is pedestrians, and all pedestrians in the current video frame are detected.
S130:利用从多个类型相同视觉对象中区分出特定视觉对象的目标分类模型,从所述第一图像区域和多个所述第二图像区域中确定所述目标视觉对象在所述当前视频帧的功能区域。S130: Using a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, determine, from the first image area and the plurality of second image areas, the functional area of the target visual object in the current video frame.
目标视觉对象在所述当前视频帧的功能区域包括:目标视觉对象在所述当前视频帧的显示区域或触发智能设备执行机械运动的区域,例如移动机器人、无人机、带云台的摄像头,响应计算出目标视觉对象在当前视频帧的功能区域的指令,触发执行移动到相应位置,或者转动云台,对目标视觉对象进行实际的追踪。The functional area of the target visual object in the current video frame includes: the display area of the target visual object in the current video frame, or an area that triggers a smart device to perform a mechanical movement. For example, a mobile robot, a drone, or a camera with a gimbal may, in response to an instruction carrying the computed functional area of the target visual object in the current video frame, move to a corresponding position or rotate the gimbal, thereby physically tracking the target visual object.
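As a hedged illustration of the gimbal case above, the following sketch converts a functional area into pan/tilt angle errors that a controller could act on. The (x, y, w, h) region format and the field-of-view values are assumptions for illustration only, not taken from the source:

```python
def pantilt_error(region, frame_w, frame_h, fov_x_deg=90.0, fov_y_deg=60.0):
    """Return (yaw, pitch) angles in degrees that would re-centre the
    target's functional area `region` = (x, y, w, h) in the frame.
    The FOV defaults are illustrative assumptions."""
    x, y, w, h = region
    cx, cy = x + w / 2.0, y + h / 2.0
    # Normalised offset in [-0.5, 0.5] relative to the frame centre.
    dx = (cx - frame_w / 2.0) / frame_w
    dy = (cy - frame_h / 2.0) / frame_h
    return dx * fov_x_deg, dy * fov_y_deg
```

A gimbal or drone controller would feed these errors into its motion loop each time a new functional area is computed.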
目标分类模型确定所述目标视觉对象在所述当前视频帧的功能区域,可以输出框选出功能区域的当前图像帧。After the target classification model determines the functional area of the target visual object in the current video frame, it can output the current image frame with the functional area box-selected.
目标分类模型可以将多个类型相同的视觉对象分类为目标视觉对象和非目标视觉对象。The target classification model can classify multiple visual objects of the same type into target visual objects and non-target visual objects.
示例地,利用目标跟踪模型识别出目标视觉对象在当前视频帧可能的图像区域为A,对所述当前视频帧进行目标检测,识别出类型与目标视觉对象相同的多个视觉对象在当前视频帧的图像区域:B、C、D、A。利用目标分类模型对图像区域A、B、C、D进行分类,确定图像区域A为当前视频帧上显示目标视觉对象的区域。由于经过了目标跟踪模型和目标分类模型的两次计算,保证在当前视频帧追踪到的目标视觉对象更加准确;其中目标分类模型是针对同类型视觉对象进行分类的,能够有效避免同类别视觉对象的干扰。For example, the target tracking model identifies A as the likely image area of the target visual object in the current video frame, and target detection on the current video frame identifies the image areas of multiple visual objects of the same type as the target visual object: B, C, D, and A. The target classification model then classifies image areas A, B, C, and D, and determines that image area A is the area displaying the target visual object in the current video frame. Because both the target tracking model and the target classification model are applied, the target visual object tracked in the current video frame is more accurate; since the target classification model classifies visual objects of the same type, interference from same-category visual objects can be effectively avoided.
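The two-stage selection in this example (the tracker proposes a region, the detector adds same-type candidates, and the classifier picks the final region) can be sketched as follows; `score_fn` is a hypothetical stand-in for the trained target classification model, returning a higher score the more likely a region is the tracked target:

```python
def select_functional_area(tracker_region, detector_regions, score_fn):
    """Pool the tracker's region with same-category detections and let the
    target classification model pick the most likely tracked target.
    `score_fn(region)` stands in for the trained classifier."""
    candidates = [tracker_region] + list(detector_regions)
    # De-duplicate while keeping order, in case the detector re-finds
    # the tracker's region (as with A in the example above).
    seen, pool = set(), []
    for r in candidates:
        if r not in seen:
            seen.add(r)
            pool.append(r)
    return max(pool, key=score_fn)
```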
本发明实施例还提出,当目标分类模型输出目标视觉对象在所述当前视频帧的功能区域后,判断该功能区域与目标跟踪模型检测出的目标视觉对象在所述当前视频帧的第一图像区域是否一致;当两者一致时,说明目标跟踪模型当前能够准确地在当前视频帧检测出目标视觉对象。Embodiments of the present invention further propose that after the target classification model outputs the functional area of the target visual object in the current video frame, it is determined whether that functional area is consistent with the first image area of the target visual object detected by the target tracking model in the current video frame; when the two are consistent, the target tracking model is currently able to accurately detect the target visual object in the current video frame.
当所述目标视觉对象在所述当前视频帧的功能区域与所述第一图像区域不一致时,利用所述目标视觉对象在所述当前视频帧的功能区域对所述预先设定的目标跟踪模型进行优化。When the functional area of the target visual object in the current video frame is inconsistent with the first image area, the preset target tracking model is optimized using the functional area of the target visual object in the current video frame.
可以使用目标分类模型输出的框选出目标视觉对象的当前视频帧,对目标跟踪模型进行训练,或对目标跟踪模型进行重新初始化,使得目标跟踪模型在对下一帧的图像识别中更加准确。The current video frame with the target visual object box-selected, as output by the target classification model, can be used to train the target tracking model or to re-initialize it, so that the target tracking model is more accurate when recognizing the next frame.
本发明另一种实施例提出另一种视觉目标跟踪方法,图2是本发明实施例另一种视觉目标跟踪方法的步骤流程图,步骤包括:Another embodiment of the present invention provides another visual target tracking method. Figure 2 is a step flow chart of another visual target tracking method according to the embodiment of the present invention. The steps include:
S210:响应针对待处理视频的触发操作,从所述待处理视频的关键视频帧提取目标视觉对象的类别信息。S210: In response to a triggering operation on the video to be processed, extract the category information of the target visual object from the key video frame of the video to be processed.
执行S210可以通过执行子步骤S211或执行子步骤S212实现。Executing S210 may be implemented by executing sub-step S211 or executing sub-step S212.
S211:响应针对待处理视频的触发操作,标记所述待处理视频的关键视频帧选定的目标视觉对象;对所述目标视觉对象进行类别检测,获得所述目标视觉对象的类别信息。S211: In response to the triggering operation on the video to be processed, mark the target visual object selected in the key video frame of the video to be processed; perform category detection on the target visual object to obtain category information of the target visual object.
示例地,电子设备显示关键视频帧,用户选择特定图像区域,电子设备对选择的特定图像区域进行类别检测,获得目标视觉对象的类别信息。电子设备可以计算每个检测结果与用户选择的初始目标边界框的交并比,选取交并比最大的检测结果的类别作为类别信息。电子设备还可以将关键视频帧和对特定图像区域标注的初始目标边界框送入基于CNN的目标分类算法,得到分数最高的类别作为类别信息。For example, the electronic device displays a key video frame, the user selects a specific image area, and the electronic device performs category detection on the selected image area to obtain the category information of the target visual object. The electronic device may compute the intersection-over-union (IoU) between each detection result and the initial target bounding box selected by the user, and take the category of the detection result with the largest IoU as the category information. The electronic device may also feed the key video frame and the initial target bounding box annotated on the specific image area into a CNN-based target classification algorithm, and take the category with the highest score as the category information.
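The intersection-over-union comparison described above can be sketched as follows; boxes are assumed to be (x1, y1, x2, y2) tuples and detections are (box, category) pairs, which are illustrative conventions rather than requirements of the source:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def category_of_selection(initial_box, detections):
    """Return the category of the detection whose box best overlaps the
    user-selected initial target bounding box."""
    best_box, best_cat = max(detections, key=lambda d: iou(initial_box, d[0]))
    return best_cat
```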
S212:在所述待处理视频的关键视频帧,识别并显示类别与用户输入类别信息关联的多个视觉对象;接收到针对关联所述类别信息的视觉对象的触发信号时,获取触发信号对应视觉对象为所述目标视觉对象和所述类别信息。S212: In the key video frame of the video to be processed, identify and display multiple visual objects whose categories are associated with the category information input by the user; when a trigger signal for a visual object associated with the category information is received, take the visual object corresponding to the trigger signal as the target visual object and obtain the category information.
示例地,电子设备接收用户输入类别信息,对关键视频帧进行检测,识别出类别信息对应的多个视觉对象,显示多个视觉对象。将用户选择的视觉对象确定为目标视觉对象,从而获得目标视觉对象和目标视觉对象的类别信息。For example, the electronic device receives category information input by the user, detects key video frames, identifies multiple visual objects corresponding to the category information, and displays the multiple visual objects. The visual object selected by the user is determined as the target visual object, thereby obtaining the target visual object and category information of the target visual object.
S220:将所述待处理视频的当前视频帧输入预先设定的目标跟踪模型,标记基于所述待处理视频的关键视频帧确定的目标视觉对象在所述当前视频帧对应第一图像区域。S220: Input the current video frame of the video to be processed into a preset target tracking model, and mark the target visual object determined based on the key video frame of the video to be processed in the first image area corresponding to the current video frame.
S230:根据所述类别信息,对所述当前视频帧进行目标检测,识别出多个类型与所述目标视觉对象相同的视觉对象对应第二图像区域。S230: Perform target detection on the current video frame according to the category information, and identify the second image areas corresponding to multiple visual objects of the same type as the target visual object.
S240:利用从多个类型相同视觉对象中区分出特定视觉对象的目标分类模型,从所述第一图像区域和多个所述第二图像区域中确定所述目标视觉对象在所述当前视频帧的功能区域。S240: Using a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, determine, from the first image area and the plurality of second image areas, the functional area of the target visual object in the current video frame.
目标分类模型的获得方式可以是:在所述待处理视频经过目标追踪的视频帧中获取所述目标视觉对象对应跟踪视觉目标样本,以及除所述目标视觉对象外具有所述类别信息的视觉对象对应干扰视觉目标样本;将所述跟踪视觉目标样本作为正样本,所述干扰视觉目标样本作为负样本,对预先搭建的分类器进行多次训练,获得从具有相同类别信息的多个视觉对象中区分出特定视觉对象的所述目标分类模型。The target classification model may be obtained as follows: from the video frames of the video to be processed that have undergone target tracking, obtain tracking visual target samples corresponding to the target visual object, and interfering visual target samples corresponding to visual objects other than the target visual object that have the category information; then train a pre-built classifier multiple times with the tracking visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from among multiple visual objects having the same category information.
分类器可以是最近邻分类器,决策树、支持向量机分类器等。The classifier can be a nearest neighbor classifier, a decision tree, a support vector machine classifier, etc.
本发明再一种实施例提出再一种视觉目标跟踪方法,本发明再一种实施例中视觉目标跟踪方法包括周期性提取待处理视频的当前视频帧,对第N次提取出的当前视频帧进行视觉对象追踪,在确定目标视觉对象过程中,获得同类别的视觉对象,添加到干扰视觉目标样本列表,将在第N次提取出的当前视频帧识别出的目标视觉对象添加到跟踪视觉目标样本列表。采用跟踪视觉目标样本列表中的元素作为正样本,采用干扰视觉目标样本列表中的元素作为负样本,对预先搭建的分类器进行多次训练,得到目标分类模型,以在对第N+1次提取出的当前视频帧进行视觉对象追踪过程中,对同类视觉对象进行分类;或者对在第N-1次提取出的当前视频帧进行视觉对象追踪过程中使用的目标分类模型进行训练,得到在对第N+1次提取出的当前视频帧进行视觉对象追踪过程中,对同类视觉对象进行分类的目标分类模型。Yet another embodiment of the present invention provides a further visual target tracking method, which includes periodically extracting the current video frame of the video to be processed and performing visual object tracking on the Nth extracted current video frame. In the process of determining the target visual object, visual objects of the same category are obtained and added to the interfering visual target sample list, and the target visual object identified in the Nth extracted current video frame is added to the tracking visual target sample list. Using the elements of the tracking visual target sample list as positive samples and the elements of the interfering visual target sample list as negative samples, a pre-built classifier is trained multiple times to obtain a target classification model for classifying same-category visual objects during visual object tracking on the (N+1)th extracted current video frame; alternatively, the target classification model used during visual object tracking on the (N-1)th extracted current video frame is further trained, to obtain the target classification model used to classify same-category visual objects during visual object tracking on the (N+1)th extracted current video frame.
再一种视觉目标跟踪方法包括将基于当前视频帧得到的当前视频帧的功能区域添加到跟踪视觉目标样本列表,将除所述目标视觉对象外具有所述类别信息的视觉对象所在图像区域添加到干扰视觉目标样本列表。从干扰视觉目标样本列表中获取训练目标分类模型的负样本,从跟踪视觉目标样本列表中获取训练目标分类模型的正样本。This further visual target tracking method includes adding the functional area obtained from the current video frame to the tracking visual target sample list, and adding the image areas of visual objects other than the target visual object that have the category information to the interfering visual target sample list. Negative samples for training the target classification model are obtained from the interfering visual target sample list, and positive samples are obtained from the tracking visual target sample list.
将所述目标视觉对象在所述当前视频帧的功能区域添加到跟踪视觉目标样本列表;获得在所述第一图像区域和多个所述第二图像区域组成的集合中删除所述目标视觉对象在所述当前视频帧的功能区域后的第三图像区域,添加到干扰视觉目标样本列表;Add the functional area of the target visual object in the current video frame to the tracking visual target sample list; obtain the third image areas remaining after the functional area of the target visual object in the current video frame is removed from the set consisting of the first image area and the plurality of second image areas, and add them to the interfering visual target sample list;
在所述待处理视频经过目标追踪的视频帧中获取所述目标视觉对象对应跟踪视觉目标样本,以及除所述目标视觉对象外具有所述类别信息的视觉对象对应干扰视觉目标样本,包括:从所述跟踪视觉目标样本列表中获取所述跟踪视觉目标样本,从所述干扰视觉目标样本列表中获取所述干扰视觉目标样本。Obtaining, from the video frames of the video to be processed that have undergone target tracking, the tracking visual target samples corresponding to the target visual object and the interfering visual target samples corresponding to visual objects other than the target visual object that have the category information includes: obtaining the tracking visual target samples from the tracking visual target sample list, and obtaining the interfering visual target samples from the interfering visual target sample list.
再一种视觉目标跟踪方法在第N次执行视觉目标跟踪方法过程中,包括步骤:In the Nth execution of this further visual target tracking method, the method includes the following steps:
K101:将所述待处理视频的第N帧图像输入预先设定的目标跟踪模型,标记基于所述待处理视频的关键视频帧确定的目标视觉对象在所述待处理视频的第N帧图像对应第一图像区域A。K101: Input the Nth frame image of the video to be processed into the preset target tracking model, and mark the target visual object, determined based on the key video frame of the video to be processed, as corresponding to first image area A in the Nth frame image of the video to be processed.
K102:对所述待处理视频的第N帧图像进行目标检测,识别出多个类型与所述目标视觉对象相同的视觉对象对应第二图像区域B、C、D。 K102: Perform target detection on the Nth frame image of the video to be processed, and identify multiple visual objects of the same type as the target visual object corresponding to the second image areas B, C, and D.
K103:利用从多个类型相同视觉对象中区分出特定视觉对象的目标分类模型,从所述第一图像区域和多个所述第二图像区域中确定所述目标视觉对象在所述待处理视频的第N帧图像的功能区域B。K103: Using a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, determine, from the first image area and the plurality of second image areas, functional area B of the target visual object in the Nth frame image of the video to be processed.
K104:将图像区域A、C、D添加到干扰视觉目标样本列表,干扰视觉目标样本列表D={D1,D2,...,DL},干扰视觉目标样本列表包括元素D1,D2,...,DL,其中D1,D2,...,DL中包括基于待处理视频的第N帧图像获得的类型与目标视觉对象相同的其他视觉对象所在图像区域:A、C、D,以及基于第N-i帧图像获得的类型与目标视觉对象相同的其他视觉对象所在图像区域,i小于N并大于等于1。K104: Add image areas A, C, and D to the interfering visual target sample list D = {D1, D2, ..., DL}. The elements D1, D2, ..., DL include the image areas of other visual objects of the same type as the target visual object obtained from the Nth frame image of the video to be processed (A, C, and D), as well as the image areas of other visual objects of the same type obtained from the (N-i)th frame images, where i is less than N and greater than or equal to 1.
K105:将图像区域B添加到跟踪视觉目标样本列表,跟踪视觉目标样本列表T={T1,T2,...,TM},跟踪视觉目标样本列表包括基于待处理视频的第N帧图像获得的目标视觉对象所在图像区域:B,以及基于第N-i帧图像获得的目标视觉对象所在图像区域,i小于N并大于等于1。K105: Add image area B to the tracking visual target sample list T = {T1, T2, ..., TM}. The tracking visual target sample list includes the image area of the target visual object obtained from the Nth frame image of the video to be processed (B), as well as the image areas of the target visual object obtained from the (N-i)th frame images, where i is less than N and greater than or equal to 1.
对待处理视频的历史图像帧执行视觉目标跟踪过程中,同样基于当时的当前视频帧确定了跟踪视觉目标样本和干扰视觉目标样本,并分别添加到了跟踪视觉目标样本列表和干扰视觉目标样本列表。While visual target tracking was performed on the historical image frames of the video to be processed, tracking visual target samples and interfering visual target samples were likewise determined from each then-current video frame and added to the tracking visual target sample list and the interfering visual target sample list, respectively.
K106:从跟踪视觉目标样本列表获取跟踪视觉目标样本作为正样本,从干扰视觉目标样本列表获取干扰视觉目标样本作为负样本,训练预先搭建的分类器,或者训练在第N-1次执行视觉目标跟踪过程中训练得到的目标分类模型,得到在第N+1次执行视觉目标跟踪方法过程中使用的目标分类模型。K106: Obtain tracking visual target samples from the tracking visual target sample list as positive samples and interfering visual target samples from the interfering visual target sample list as negative samples, and train the pre-built classifier, or further train the target classification model obtained during the (N-1)th execution of visual target tracking, to obtain the target classification model used during the (N+1)th execution of the visual target tracking method.
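Steps K104 and K105 above amount to per-frame list bookkeeping. A minimal sketch, with image regions represented by labels for illustration:

```python
def update_sample_lists(first_region, second_regions, functional_region,
                        tracking_list, interference_list):
    """Append this frame's functional area to the tracking list T, and the
    remaining candidate regions (e.g. A, C, D when B was chosen) to the
    interference list D, as in steps K104 and K105."""
    candidates = [first_region] + [r for r in second_regions
                                   if r != first_region]
    interference_list.extend(r for r in candidates if r != functional_region)
    tracking_list.append(functional_region)
```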
训练目标分类模型方式包括:Methods for training target classification models include:
对跟踪视觉目标样本和干扰视觉目标样本分别提取灰度、颜色、纹理、梯度直方图等多类特征,使用决策树作为分类器。Extract multiple types of features such as grayscale, color, texture, gradient histogram, etc. from the tracking visual target samples and interference visual target samples respectively, and use decision trees as classifiers.
使用基于CNN方法提取样本特征,使用支持向量机作为分类器。Use CNN-based method to extract sample features, and use support vector machine as a classifier.
使用灰度值平方差来定义两个样本之间的距离,使用最近邻分类器进行目标分类。The squared difference of gray values is used to define the distance between two samples, and the nearest neighbor classifier is used for target classification.
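The nearest-neighbour variant just described (squared grey-value difference as the inter-sample distance) can be sketched as follows; samples are assumed to be equal-length flat lists of grey values, an illustrative simplification of image patches:

```python
def sq_distance(a, b):
    """Sum of squared grey-value differences between two equal-size patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbor_label(query, positives, negatives):
    """Classify `query` as the tracked target (True) or an interferer (False)
    according to its nearest sample under the squared grey-value distance."""
    best_pos = min(sq_distance(query, p) for p in positives)
    best_neg = min(sq_distance(query, n) for n in negatives)
    return best_pos <= best_neg
```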
在将所述目标视觉对象在所述当前视频帧的功能区域添加到跟踪视觉目标样本列表之前,所述方法还包括:Before adding the target visual object in the functional area of the current video frame to the tracking visual target sample list, the method further includes:
获取所述当前视频帧的时间信息; Obtain the time information of the current video frame;
将所述目标视觉对象在所述当前视频帧的功能区域添加到跟踪视觉目标样本列表,包括:Add the functional area of the target visual object in the current video frame to the tracking visual target sample list, including:
将所述目标视觉对象在所述当前视频帧的功能区域添加到所述跟踪视觉目标样本列表的最后一位;所述跟踪视觉目标样本列表中的元素按照对应时间信息从大到小排列;Add the functional area of the target visual object in the current video frame to the last position of the tracking visual target sample list, where the elements in the tracking visual target sample list are arranged from largest to smallest according to their corresponding time information;
删除所述跟踪视觉目标样本列表中存储时间超过预设时间长度的元素或排列在所述跟踪视觉目标样本列表第一位的元素。Delete elements in the tracking visual target sample list whose storage time exceeds a preset time length or elements ranked first in the tracking visual target sample list.
针对干扰视觉目标样本列表的元素添加,也可以采用上述方法,在将除所述目标视觉对象外具有所述类别信息的视觉对象所在图像区域添加到干扰视觉目标样本列表过程中,可以获取时间信息,以在后续删除不适用的元素。The above method can also be used when adding elements to the interfering visual target sample list: in the process of adding the image areas of visual objects other than the target visual object that have the category information, the time information can be recorded so that inapplicable elements can be deleted later.
在添加跟踪视觉目标样本和干扰视觉目标样本过程中,可以同时记录样本的加入时间(图像的帧号,或者绝对时间),可以执行历史遗忘机制,例如:During the process of adding tracking visual target samples and interference visual target samples, the adding time of the sample (image frame number, or absolute time) can be recorded at the same time, and a history forgetting mechanism can be implemented, such as:
在第K帧,丢弃掉第K-J帧以前加入的样本。At the Kth frame, discard the samples added before the K-Jth frame.
设置集合的最大容量为L,在元素个数达到L的时候丢弃最早加入的样本。Set the maximum capacity of the collection to L, and discard the earliest added sample when the number of elements reaches L.
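Both forgetting rules above can be sketched with a time-ordered sample list, where `max_age` plays the role of J (samples from before frame K-J are dropped) and `max_len` the role of the capacity L:

```python
from collections import deque

class SampleList:
    """Time-ordered sample list with the two history-forgetting rules
    described above: an age limit of `max_age` frames and a capacity of
    `max_len` elements (oldest evicted first)."""
    def __init__(self, max_age, max_len):
        self.max_age = max_age
        self.samples = deque(maxlen=max_len)  # capacity rule: oldest evicted

    def add(self, frame_no, sample):
        self.samples.append((frame_no, sample))
        # Age rule: drop samples added before frame (frame_no - max_age).
        while self.samples and self.samples[0][0] < frame_no - self.max_age:
            self.samples.popleft()
```

The same structure can back both the tracking-sample list and the interfering-sample list.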
本发明实施例通过在待处理视频历史帧中采集的跟踪目标样本和干扰目标样本来训练目标分类模型,对当前视频帧的跟踪算法输出结果进行更新校正,使得算法不易被同类别目标干扰,对同类别目标遮挡/干扰具有很强的鲁棒性,显著提升了跟踪算法在实际场景中的可用性。The embodiments of the present invention train the target classification model with tracking target samples and interfering target samples collected from historical frames of the video to be processed, and use it to update and correct the output of the tracking algorithm on the current video frame, so that the algorithm is not easily interfered with by targets of the same category and is highly robust to occlusion/interference from same-category targets, significantly improving the usability of the tracking algorithm in real scenarios.
图3是本发明实施例提出的视觉目标跟踪装置的功能模块图,上述视觉目标跟踪装置设置在电子设备中,如图3所示,所述装置包括:Figure 3 is a functional module diagram of the visual target tracking apparatus proposed by an embodiment of the present invention. The above visual target tracking apparatus is provided in an electronic device. As shown in Figure 3, the apparatus includes:
标记模块31,用于将所述待处理视频的当前视频帧输入预先设定的目标跟踪模型,标记基于所述待处理视频的关键视频帧确定的目标视觉对象在所述当前视频帧对应第一图像区域;Marking module 31, configured to input the current video frame of the video to be processed into a preset target tracking model, and mark the target visual object determined based on the key video frame of the video to be processed when the current video frame corresponds to the first image area;
检测模块32,用于对所述当前视频帧进行目标检测,识别出多个类型与所述目标视觉对象相同的视觉对象对应第二图像区域;The detection module 32 is configured to perform target detection on the current video frame and identify multiple visual objects of the same type as the target visual object corresponding to the second image area;
分类模块33,用于利用从多个类型相同视觉对象中区分出特定视觉对象的目标分类模型,从所述第一图像区域和多个所述第二图像区域中确定所述目标视觉对象在所述当前视频帧的功能区域。The classification module 33 is configured to use a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, to determine, from the first image area and the plurality of second image areas, the functional area of the target visual object in the current video frame.
The visual target tracking apparatus provided by the embodiment shown in Figure 3 can be used to execute the technical solutions of the method embodiments shown in Figures 1 and 2 of this specification; for its implementation principles and technical effects, reference may be made to the relevant descriptions in the method embodiments.
Optionally, the apparatus further includes:
an optimization module, configured to, when the functional area of the target visual object in the current video frame is inconsistent with the first image region, optimize the preset target tracking model using the functional area of the target visual object in the current video frame.
Optionally, the apparatus further includes:
an extraction module, configured to extract, in response to a trigger operation on the video to be processed, category information of the target visual object from a key video frame of the video to be processed;
The detection module is specifically configured to perform object detection on the current video frame according to the category information, and to identify second image regions corresponding to a plurality of visual objects whose recorded type is the same as that of the target visual object.
Optionally, the apparatus further includes a target classification model setting module, which includes:
a sample acquisition submodule, configured to obtain, from video frames of the video to be processed on which target tracking has been performed, tracked visual target samples corresponding to the target visual object, and interfering visual target samples corresponding to visual objects other than the target visual object that have the category information;
a training submodule, configured to train a pre-built classifier multiple times using the tracked visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from a plurality of visual objects having the same category information.
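As a concrete (and deliberately simplified) illustration of this positive/negative training step, the sketch below fits a nearest-centroid scorer on tracked-target features versus interfering-target features. The patent does not specify the classifier; the centroid rule and the plain-tuple feature vectors are assumptions for this example.

```python
# Toy stand-in for training the "pre-built classifier": tracked visual target
# samples serve as positives, interfering visual target samples as negatives.
# A nearest-centroid rule replaces whatever model a real implementation uses.

def train_target_classifier(positive_feats, negative_feats):
    def centroid(feats):
        dim = len(feats[0])
        return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]

    pos_c = centroid(positive_feats)
    neg_c = centroid(negative_feats)

    def score(feat):
        # Positive score: closer to the tracked target's centroid than to the
        # distractors' centroid; negative score: the opposite.
        sq_dist = lambda c: sum((a - b) ** 2 for a, b in zip(feat, c))
        return sq_dist(neg_c) - sq_dist(pos_c)

    return score
```

The returned `score` function plays the role of the target classification model used to re-rank the candidate regions of the current frame.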
Optionally, the target classification model setting module further includes:
a first adding submodule, configured to add the functional area of the target visual object in the current video frame to a tracked visual target sample list;
a second adding submodule, configured to obtain the third image regions that remain after the functional area of the target visual object in the current video frame is removed from the set formed by the first image region and the plurality of second image regions, and to add them to an interfering visual target sample list;
The sample acquisition submodule is specifically configured to obtain the tracked visual target samples from the tracked visual target sample list, and to obtain the interfering visual target samples from the interfering visual target sample list.
Optionally, the target classification model setting module further includes:
a time acquisition submodule, configured to obtain time information of the current video frame;
The first adding submodule includes:
an adding subunit, configured to append the functional area of the target visual object in the current video frame to the end of the tracked visual target sample list, the elements of which are arranged according to their corresponding time information from largest to smallest;
a deletion subunit, configured to delete elements of the tracked visual target sample list whose storage time exceeds a preset length of time, or the element ranked first in the tracked visual target sample list.
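The time-stamped sample-list bookkeeping above can be sketched as follows. The sketch keeps the oldest sample at the head of the list, so that the head-deletion rule removes the stalest entry; the tuple layout, the `max_age` expiry policy, and the capacity bound are assumptions for illustration.

```python
# Sketch of maintaining the tracked visual target sample list: each new sample
# carries the current frame's time information; samples stored longer than a
# preset length of time expire, and when the list exceeds its capacity the
# element at the head (the oldest) is dropped.
from collections import deque

def add_sample(samples, region, timestamp, max_age, max_len):
    samples.append((timestamp, region))
    # Delete elements whose storage time exceeds the preset length of time.
    while samples and timestamp - samples[0][0] > max_age:
        samples.popleft()
    # Delete the element ranked first when the list is over capacity.
    while len(samples) > max_len:
        samples.popleft()
    return samples
```

A `deque` is used here because both ends of the list are touched: new samples enter at the tail and expired or excess samples leave from the head.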
Optionally, the extraction module includes:
a response submodule, configured to mark, in response to a trigger operation on the video to be processed, the target visual object selected in a key video frame of the video to be processed;
a detection submodule, configured to perform category detection on the target visual object to obtain the category information of the target visual object.
Optionally, the extraction module further includes:
an identification submodule, configured to identify and display, in a key video frame of the video to be processed, a plurality of visual objects whose categories are associated with category information input by a user;
a receiving submodule, configured to, upon receiving a trigger signal for a visual object associated with the category information, take the visual object corresponding to the trigger signal as the target visual object and obtain the category information.
The apparatus provided by the embodiments shown above is used to execute the technical solutions of the method embodiments shown above; for its implementation principles and technical effects, reference may be made to the relevant descriptions in the method embodiments, which will not be repeated here.
The apparatus provided by the embodiments shown above may be, for example, a chip or a chip module. It is used to execute the technical solutions of the method embodiments shown above; for its implementation principles and technical effects, reference may be made to the relevant descriptions in the method embodiments, which will not be repeated here.
Each module/unit included in each apparatus described in the above embodiments may be a software module/unit, a hardware module/unit, or partly a software module/unit and partly a hardware module/unit. For example, for an apparatus applied to or integrated in a chip, all of its modules/units may be implemented by hardware such as circuits; alternatively, at least some of the modules/units may be implemented by a software program running on a processor integrated inside the chip, while the remaining modules/units are implemented by hardware such as circuits. For an apparatus applied to or integrated in a chip module, all of its modules/units may be implemented by hardware such as circuits, and different modules/units may be located in the same component (for example, a chip or a circuit module) of the chip module or in different components; alternatively, at least some of the modules/units may be implemented by a software program running on a processor integrated inside the chip module, while the remaining modules/units are implemented by hardware such as circuits. For an apparatus applied to or integrated in an electronic terminal device, all of its modules/units may be implemented by hardware such as circuits, and different modules/units may be located in the same component (for example, a chip or a circuit module) within the electronic terminal device or in different components; alternatively, at least some of the modules/units may be implemented by a software program running on a processor integrated inside the electronic terminal device, while the remaining modules/units (if any) are implemented by hardware such as circuits.
Figure 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. The electronic device 400 includes a processor 410, a memory 411, and a computer program stored in the memory 411 and executable on the processor 410. When the processor 410 executes the program, the steps in the foregoing method embodiments are implemented. The electronic device provided by this embodiment can be used to execute the technical solutions of the method embodiments shown above; for its implementation principles and technical effects, reference may be made to the relevant descriptions in the method embodiments, which will not be repeated here.
An embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the visual target tracking method provided by the embodiments shown in Figures 1 and 2 of this specification. A non-transitory computer-readable storage medium may refer to a non-volatile computer storage medium.
The above non-transitory computer-readable storage medium may be any combination of one or more computer-readable media. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), and the like, or any suitable combination of the above.
Computer program code for performing the operations of this specification may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.
In the description of the embodiments of the present invention, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this specification. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of different embodiments or examples, provided they are not mutually contradictory.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of this specification, "a plurality of" means at least two, for example two or three, unless otherwise expressly and specifically limited.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logical function or process, and the scope of the preferred embodiments of this specification includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of this specification belong.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
It should be noted that the terminals involved in the embodiments of the present invention may include, but are not limited to, personal computers (PCs), personal digital assistants (PDAs), wireless handheld devices, tablet computers, mobile phones, MP3 players, MP4 players, and the like.
In the several embodiments provided in this specification, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a division by logical function, and there may be other ways of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of this specification may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The above software functional unit is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of this specification. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely preferred embodiments of this specification and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this specification shall fall within its scope of protection.

Claims (11)

  1. A visual target tracking method, characterized in that the method comprises:
    inputting a current video frame of a video to be processed into a preset target tracking model, and marking a first image region corresponding, in the current video frame, to a target visual object determined from a key video frame of the video to be processed;
    performing object detection on the current video frame, and identifying second image regions corresponding to a plurality of visual objects of the same type as the target visual object; and
    determining, from the first image region and the plurality of second image regions, a functional area of the target visual object in the current video frame by using a target classification model that distinguishes a specific visual object from a plurality of visual objects of the same type.
  2. The method according to claim 1, characterized in that, after the functional area of the target visual object in the current video frame is determined from the first image region and the plurality of second image regions, the method further comprises:
    when the functional area of the target visual object in the current video frame is inconsistent with the first image region, optimizing the preset target tracking model using the functional area of the target visual object in the current video frame.
  3. The method according to claim 1, characterized in that the method further comprises:
    in response to a trigger operation on the video to be processed, extracting category information of the target visual object from the key video frame of the video to be processed;
    wherein performing object detection on the current video frame and identifying second image regions corresponding to a plurality of visual objects whose recorded type is the same as that of the target visual object comprises:
    performing object detection on the current video frame according to the category information, and identifying the second image regions corresponding to the plurality of visual objects whose recorded type is the same as that of the target visual object.
  4. The method according to claim 3, characterized in that the process of setting the target classification model comprises:
    obtaining, from video frames of the video to be processed on which target tracking has been performed, tracked visual target samples corresponding to the target visual object, and interfering visual target samples corresponding to visual objects other than the target visual object that have the category information; and
    training a pre-built classifier multiple times using the tracked visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from a plurality of visual objects having the same category information.
  5. The method according to claim 4, characterized in that, after the functional area of the target visual object in the current video frame is determined, the method further comprises:
    adding the functional area of the target visual object in the current video frame to a tracked visual target sample list; and
    obtaining the third image regions that remain after the functional area of the target visual object in the current video frame is removed from the set formed by the first image region and the plurality of second image regions, and adding them to an interfering visual target sample list;
    wherein obtaining, from the video frames of the video to be processed on which target tracking has been performed, the tracked visual target samples corresponding to the target visual object and the interfering visual target samples corresponding to the visual objects other than the target visual object that have the category information comprises:
    obtaining the tracked visual target samples from the tracked visual target sample list, and obtaining the interfering visual target samples from the interfering visual target sample list.
  6. The method according to claim 5, characterized in that, before the functional area of the target visual object in the current video frame is added to the tracked visual target sample list, the method further comprises:
    obtaining time information of the current video frame;
    wherein adding the functional area of the target visual object in the current video frame to the tracked visual target sample list comprises:
    appending the functional area of the target visual object in the current video frame to the end of the tracked visual target sample list, the elements of which are arranged according to their corresponding time information from largest to smallest; and
    deleting elements of the tracked visual target sample list whose storage time exceeds a preset length of time, or the element ranked first in the tracked visual target sample list.
  7. The method according to claim 3, characterized in that extracting the category information of the target visual object from the key video frame of the video to be processed in response to a trigger operation on the video to be processed comprises:
    in response to the trigger operation on the video to be processed, marking the target visual object selected in the key video frame of the video to be processed; and
    performing category detection on the target visual object to obtain the category information of the target visual object.
  8. The method according to claim 3, characterized in that extracting the category information of the target visual object from the key video frame of the video to be processed in response to a trigger operation on the video to be processed comprises:
    identifying and displaying, in the key video frame of the video to be processed, a plurality of visual objects whose categories are associated with category information input by a user; and
    upon receiving a trigger signal for a visual object associated with the category information, taking the visual object corresponding to the trigger signal as the target visual object and obtaining the category information.
  9. A visual target tracking apparatus, characterized in that the apparatus comprises:
    a marking module, configured to input a current video frame of a video to be processed into a preset target tracking model, and to mark a first image region corresponding, in the current video frame, to a target visual object determined from a key video frame of the video to be processed;
    a detection module, configured to perform object detection on the current video frame and identify second image regions corresponding to a plurality of visual objects of the same type as the target visual object; and
    a classification module, configured to use a target classification model that distinguishes a specific visual object from a plurality of visual objects of the same type to determine, from the first image region and the plurality of second image regions, a functional area of the target visual object in the current video frame.
  10. An electronic device, comprising:
    at least one processor; and
    at least one memory communicatively connected to the processor, characterized in that
    the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to execute the method according to any one of claims 1 to 8.
  11. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions cause a computer to execute the method according to any one of claims 1 to 8.
PCT/CN2023/106311 2022-07-11 2023-07-07 Visual-target tracking method and apparatus, and device and storage medium WO2024012367A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210813372.0 2022-07-11
CN202210813372.0A CN115393755A (en) 2022-07-11 2022-07-11 Visual target tracking method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2024012367A1 true WO2024012367A1 (en) 2024-01-18

Family

ID=84117176

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/106311 WO2024012367A1 (en) 2022-07-11 2023-07-07 Visual-target tracking method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN115393755A (en)
WO (1) WO2024012367A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115393755A (en) * 2022-07-11 2022-11-25 影石创新科技股份有限公司 Visual target tracking method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104599287A (en) * 2013-11-01 2015-05-06 株式会社理光 Object tracking method and device and object recognition method and device
CN110349176A (en) * 2019-06-28 2019-10-18 华中科技大学 Method for tracking target and system based on triple convolutional networks and perception interference in learning
CN112639815A (en) * 2020-03-27 2021-04-09 深圳市大疆创新科技有限公司 Target tracking method, target tracking apparatus, movable platform, and storage medium
US20220366576A1 (en) * 2020-01-06 2022-11-17 Shanghai Sensetime Lingang Intelligent Technology Co., Ltd. Method for target tracking, electronic device, and storage medium
CN115393755A (en) * 2022-07-11 2022-11-25 影石创新科技股份有限公司 Visual target tracking method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN115393755A (en) 2022-11-25

Similar Documents

Publication Publication Date Title
US20220051061A1 (en) Artificial intelligence-based action recognition method and related apparatus
WO2022127180A1 (en) Target tracking method and apparatus, and electronic device and storage medium
EP3467707A1 (en) System and method for deep learning based hand gesture recognition in first person view
CN109325440B (en) Human body action recognition method and system
WO2020211624A1 (en) Object tracking method, tracking processing method, corresponding apparatus and electronic device
CN111582185B (en) Method and device for recognizing images
WO2021103868A1 (en) Method for structuring pedestrian information, device, apparatus and storage medium
WO2018063608A1 (en) Place recognition algorithm
WO2024012367A1 (en) Visual-target tracking method and apparatus, and device and storage medium
CN113378770B (en) Gesture recognition method, device, equipment and storage medium
CN112527113A (en) Method and apparatus for training gesture recognition and gesture recognition network, medium, and device
CN111767831B (en) Method, apparatus, device and storage medium for processing image
CN103105924A (en) Man-machine interaction method and device
CN111967433A (en) Action identification method based on self-supervision learning network
CN112150457A (en) Video detection method, device and computer readable storage medium
CN111127837A (en) Alarm method, camera and alarm system
CN111340213B (en) Neural network training method, electronic device, and storage medium
CN115527269A (en) Intelligent human body posture image identification method and system
CN111783674A (en) Face recognition method and system based on AR glasses
WO2024012371A1 (en) Target tracking method and apparatus, and device and storage medium
US8498978B2 (en) Slideshow video file detection
CN106934339B (en) Target tracking and tracking target identification feature extraction method and device
CN117372928A (en) Video target detection method and device and related equipment
CN115482436B (en) Training method and device for image screening model and image screening method
CN111571567A (en) Robot translation skill training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23838859

Country of ref document: EP

Kind code of ref document: A1