WO2024012367A1 - Visual-target tracking method and apparatus, and device and storage medium - Google Patents

Visual-target tracking method and apparatus, and device and storage medium

Info

Publication number
WO2024012367A1
WO2024012367A1 · PCT/CN2023/106311 · CN2023106311W
Authority
WO
WIPO (PCT)
Prior art keywords
target
visual
video frame
visual object
tracking
Prior art date
Application number
PCT/CN2023/106311
Other languages
French (fr)
Chinese (zh)
Inventor
张伟俊
Original Assignee
影石创新科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 影石创新科技股份有限公司
Publication of WO2024012367A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Definitions

  • Embodiments of the present invention relate to the field of computer vision technology, and in particular to a visual target tracking method, device, equipment and storage medium.
  • Tracking visual targets in videos is a technology that, given the size and position of a visual target in a specific image frame, predicts the size and position of the object corresponding to that visual target in subsequent image frames of the video sequence. It is widely used in many fields such as video surveillance, human-computer interaction, and multimedia analysis.
  • Embodiments of the present invention provide a visual target tracking method, device, equipment and storage medium, which can accurately track specific visual targets without interference from visual targets of the same category.
  • embodiments of the present invention provide a visual target tracking method for use in electronic devices.
  • the method includes: inputting the current video frame of the video to be processed into a preset target tracking model, and marking, in the current video frame, the first image area corresponding to the target visual object determined based on the key video frame of the video to be processed; performing target detection on the current video frame, and identifying second image areas corresponding to multiple visual objects of the same type as the target visual object; and using a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, determining, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
  • On the one hand, the above visual target tracking method uses a target tracking model that has completed target tracking on the previous image frame to detect the target object to be tracked; on the other hand, it detects other objects of the same type as the target object in the current image frame. A target classification model that distinguishes a specific visual object from multiple visual objects of the same type is then used to classify the detected target object and the other same-type objects, distinguishing the target object to be tracked. In other words, the target object proposed by the target tracking model and the other objects of the same type are extracted, and the classifier performs a secondary detection on these extracted visual objects, picking out the target visual object from among visual objects of the same type. This secondary detection avoids interference from similar objects, so the specific visual target is tracked accurately without interference from visual targets of the same category.
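As a minimal sketch of this secondary-detection idea (the function names and the scoring interface are illustrative stand-ins, not the patent's models), the candidate region proposed by the tracking model and the same-category detections can be pooled and re-scored by the classifier, keeping the highest-scoring region as the target:

```python
def track_and_verify(frame, tracker, detector, classifier):
    """Return the region judged to contain the target visual object.

    tracker(frame)    -> candidate region from the target tracking model
    detector(frame)   -> list of regions of same-category visual objects
    classifier(region)-> score; higher means 'more likely the tracked target'
    """
    first_region = tracker(frame)          # tracking model's candidate
    same_class_regions = detector(frame)   # same-category detections
    candidates = [first_region] + same_class_regions
    # Secondary detection: re-score every candidate with the target
    # classification model, which suppresses same-category interference.
    return max(candidates, key=classifier)
```

The tracker's output is deliberately kept in the candidate pool, so the classifier can either confirm it or override it with a same-category detection.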
  • the method further includes:
  • Target detection is performed on the current video frame, and multiple second image areas corresponding to visual objects having the same category information as the target visual object are identified.
  • the process of setting the target classification model includes:
  • the tracking visual target sample is used as a positive sample, and the interfering visual target sample is used as a negative sample.
  • the pre-built classifier is trained multiple times to obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.
  • the method further includes:
  • the tracking visual target sample is obtained from the tracking visual target sample list, and the interfering visual target sample is obtained from the interfering visual target sample list.
  • before adding the functional area of the target visual object in the current video frame to the tracking visual target sample list, the method further includes:
  • adding the functional area of the target visual object in the current video frame to the tracking visual target sample list includes:
  • extracting the category information of the target visual object from the key video frame of the video to be processed includes:
  • the visual object corresponding to the trigger signal is obtained as the target visual object, together with the category information.
  • an embodiment of the present invention provides a visual target tracking device, which is provided in an electronic device.
  • the device includes:
  • a marking module configured to input the current video frame of the video to be processed into a preset target tracking model, and mark, in the current video frame, the first image area corresponding to the target visual object determined based on the key video frame of the video to be processed;
  • a detection module configured to perform target detection on the current video frame and identify second image areas corresponding to multiple visual objects of the same type as the target visual object;
  • a classification module configured to use a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, and determine, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
  • the device further includes:
  • Optimization module configured to, when the functional area of the target visual object in the current video frame is inconsistent with the first image area, use the functional area of the target visual object in the current video frame to optimize the preset target tracking model.
  • the device further includes:
  • An extraction module configured to respond to a trigger operation for the video to be processed and extract the category information of the target visual object from the key video frame of the video to be processed;
  • the detection module is specifically configured to perform target detection on the current video frame according to the category information, and identify the second image areas corresponding to multiple visual objects having the same category information as the target visual object.
  • the device further includes a target classification model setting module, and the target classification model setting module includes:
  • the sample acquisition sub-module is used to acquire, from target-tracked video frames of the video to be processed, the tracking visual target sample corresponding to the target visual object and the interfering visual target samples corresponding to visual objects that have the category information other than the target visual object;
  • the training sub-module is used to take the tracking visual target sample as a positive sample and the interfering visual target sample as a negative sample, train the pre-built classifier multiple times, and obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.
  • the target classification model setting module further includes:
  • the first adding sub-module is used to add the target visual object in the functional area of the current video frame to the tracking visual target sample list;
  • the second adding sub-module is used to obtain third image areas by deleting the functional area of the target visual object in the current video frame from the set composed of the first image area and the multiple second image areas, and to add them to the interfering visual target sample list;
  • the sample acquisition sub-module is specifically configured to obtain the tracking visual target sample from the tracking visual target sample list, and obtain the interfering visual target sample from the interfering visual target sample list.
  • the target classification model setting module further includes:
  • Time acquisition submodule, used to obtain the time information of the current video frame;
  • the first added sub-module includes:
  • the deletion subunit is used to delete elements in the tracking visual target sample list whose storage time exceeds a preset time length or elements ranked first in the tracking visual target sample list.
  • the extraction module includes:
  • the response sub-module is used to respond to the trigger operation for the video to be processed and mark the target visual object selected in the key video frame of the video to be processed;
  • a detection submodule is used to perform category detection on the target visual object and obtain category information of the target visual object.
  • the extraction module further includes:
  • An identification submodule configured to identify and display multiple visual objects whose categories are associated with user-input category information in key video frames of the video to be processed;
  • the receiving submodule is configured to, when receiving a trigger signal for a visual object associated with the category information, obtain the visual object corresponding to the trigger signal as the target visual object and the category information.
  • embodiments of the present invention provide an electronic device, including: at least one processor; and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method provided in the first aspect.
  • embodiments of the present invention provide a non-transitory computer-readable storage medium.
  • the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the method provided in the first aspect.
  • Figure 1 is a flow chart of the steps of the visual target tracking method proposed by the embodiment of the present invention.
  • Figure 2 is a step flow chart of another visual target tracking method according to an embodiment of the present invention.
  • Figure 3 is a functional module diagram of the visual target tracking device proposed by the embodiment of the present invention.
  • Figure 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
  • the visual target tracking method proposed in the embodiment of the present invention can be applied to electronic devices such as terminals and servers.
  • FIG. 1 is a flow chart of the steps of the visual target tracking method proposed by the embodiment of the present invention. As shown in Figure 1, the steps include:
  • S110 Input the current video frame of the video to be processed into a preset target tracking model, and mark, in the current video frame, the first image area corresponding to the target visual object determined based on the key video frame of the video to be processed.
  • the target visual object is a moving visual target that needs to be tracked.
  • the moving visual target in the image sequence of the video to be processed is detected, extracted, identified, and tracked, and the motion parameters of the moving visual target, such as position, speed, acceleration, and motion trajectory, are obtained.
  • the visual objects in visual target tracking can be objects in the image frames of the video, for example, they can be people, cars, animals, robots, etc. in the video image frames.
  • the image area corresponding to the visual object on the image can represent: the area in the image where the pixels displaying the visual object are located.
  • Each frame of image in the video to be processed can be used as the current video frame in turn, or an image frame can be extracted from the image sequence of the video to be processed every preset number of frames as the current video frame. For example, extract an image from the video to be processed every N (N is greater than or equal to 1) frames as the current video frame, and detect the position and size of the specified visual object in the current video frame.
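The periodic frame-extraction step above can be sketched as a small helper (the function name is illustrative; `n` plays the role of N, with n = 1 meaning every frame is processed):

```python
def sample_frames(frames, n=1):
    """Yield every n-th frame (n >= 1) of an image sequence as the
    'current video frame' to run tracking on."""
    for idx, frame in enumerate(frames):
        if idx % n == 0:
            yield frame
```

For example, with n = 3 the tracker processes frames 0, 3, 6, 9, ... of the video to be processed.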
  • the preset target tracking model can be a correlation-filter tracker such as the Discriminative Correlation Filter (DCF), or a tracker based on convolutional neural network (CNN) technology such as the Siamese-network visual tracker SiamRPN (Evolution of Siamese Visual Tracking with Very Deep Networks).
  • the preset target tracking model can be optimized using the tracking results after visual object tracking is completed for each frame of the video to be processed, thereby improving the robustness of the preset target tracking model.
  • the method of marking the target visual object determined based on the key video frame of the video to be processed in the first image area corresponding to the current video frame may be frame selection, highlighting, etc.
  • the key video frame of the video to be processed may be the first frame of the video to be processed, or the image with the best quality in the video to be processed.
  • the embodiment of the present invention responds to a user selection instruction on the key video frame, and determines the visual object corresponding to the selected object as the target visual object in other video frames of the video to be processed.
  • An example electronic device of the present invention performs S110 as follows: determine the target visual object based on the first frame image of the video to be processed, input the first frame image and the second frame image into the target tracking model, use the target tracking model to frame-select the position of the target visual object in the second frame image, and output the first image area corresponding to the target visual object in the second frame image.
  • S120 Perform target detection on the current video frame, and identify second image areas corresponding to multiple visual objects of the same type as the target visual object.
  • Methods for target detection on the current video frame include: detection methods based on hand-crafted features (such as template matching, key-point matching, or key-feature methods), or detection methods based on convolutional neural network technology (such as YOLO, SSD, R-CNN, or Mask R-CNN).
  • For example, if the type of the target visual object is pedestrian, all pedestrians in the current video frame are detected.
  • S130 Using a target classification model that distinguishes specific visual objects from multiple visual objects of the same type, determine, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
  • the functional area of the target visual object in the current video frame includes: the display area of the target visual object in the current video frame, or an area that triggers an intelligent device, such as a mobile robot, a drone, or a camera with a pan/tilt head, to perform mechanical movement; in response to an instruction computed from the functional area of the target visual object in the current video frame, the device moves, or rotates its pan/tilt head, to physically track the target visual object.
  • when the target classification model determines the functional area of the target visual object in the current video frame, it can output the current image frame with the functional area frame-selected.
  • the target classification model can classify multiple visual objects of the same type into target visual objects and non-target visual objects.
  • the target tracking model is used to identify the possible target visual object in the current video frame; suppose the corresponding image area is A.
  • Target detection is performed on the current video frame, and multiple visual objects of the same type as the target visual object are identified in image areas of the current video frame: B, C, D, and A.
  • Use the target classification model to classify image areas A, B, C, and D, and determine that image area A is the area where the target visual object is displayed in the current video frame. Because of the two computations, by the target tracking model and by the target classification model, the target visual object tracked in the current video frame is more accurate; in particular, the target classification model classifies visual objects of the same type, which effectively avoids interference from visual objects of the same type.
  • Embodiments of the present invention also propose determining whether the functional area of the target visual object output by the target classification model in the current video frame is consistent with the first image area detected by the target tracking model in the current video frame.
  • If the functional area is consistent with the first image area of the current video frame, the target tracking model can currently detect the target visual object in the current video frame accurately.
  • FIG. 2 is a flow chart of the steps of another visual target tracking method according to the embodiment of the present invention. The steps include:
  • S210 In response to a triggering operation on the video to be processed, extract the category information of the target visual object from the key video frame of the video to be processed.
  • Executing S210 may be implemented by executing sub-step S211 or executing sub-step S212.
  • S211 In response to the triggering operation on the video to be processed, mark the target visual object selected in the key video frame of the video to be processed; perform category detection on the target visual object to obtain category information of the target visual object.
  • the electronic device displays key video frames, the user selects a specific image area, and the electronic device performs category detection on the selected specific image area to obtain category information of the target visual object.
  • the electronic device can calculate the intersection-over-union of each detection result with the initial target bounding box selected by the user, and select the category of the detection result with the largest intersection-over-union as the category information.
  • the electronic device can also send the key video frame and the initial target bounding box marking the specific image area to a CNN-based target classification algorithm to obtain the category with the highest score as the category information.
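The intersection-over-union selection described two bullets above can be sketched as follows (the `(x1, y1, x2, y2)` box format and the helper names are assumptions for illustration, not taken from the patent):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def category_from_detections(initial_box, detections):
    """Pick the category of the detection that overlaps the user's
    initial target bounding box the most."""
    best = max(detections, key=lambda d: iou(initial_box, d["box"]))
    return best["category"]
```

Here each detection is represented as a dict with a `box` and a `category` key; the detection with the largest overlap supplies the category information.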
  • S212 In the key video frame of the video to be processed, identify and display multiple visual objects whose categories are associated with the category information input by the user; when receiving a trigger signal for a visual object associated with the category information, obtain the visual object corresponding to the trigger signal as the target visual object, together with the category information.
  • the electronic device receives category information input by the user, detects key video frames, identifies multiple visual objects corresponding to the category information, and displays the multiple visual objects.
  • the visual object selected by the user is determined as the target visual object, thereby obtaining the target visual object and category information of the target visual object.
  • S220 Input the current video frame of the video to be processed into a preset target tracking model, and mark, in the current video frame, the first image area corresponding to the target visual object determined based on the key video frame of the video to be processed.
  • S230 Perform target detection on the current video frame according to the category information, and identify second image areas corresponding to multiple visual objects having the same category information as the target visual object.
  • S240 Using a target classification model that distinguishes specific visual objects from multiple visual objects of the same type, determine, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
  • the target classification model may be obtained by: obtaining, from video frames of the video to be processed that have undergone target tracking, tracking visual target samples corresponding to the target visual object and interfering visual target samples corresponding to visual objects that have the category information other than the target visual object; and using the tracking visual target samples as positive samples and the interfering visual target samples as negative samples, training the pre-built classifier multiple times to obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.
  • the classifier can be a nearest neighbor classifier, a decision tree, a support vector machine classifier, etc.
  • the visual target tracking method includes periodically extracting the current video frame of the video to be processed and performing visual object tracking on the Nth extracted current video frame. In the process of determining the target visual object, visual objects of the same category are added to the interfering visual target sample list, and the target visual object identified in the Nth extracted current video frame is added to the tracking visual target sample list.
  • the elements in the tracking visual target sample list are used as positive samples, and the elements in the interfering visual target sample list are used as negative samples.
  • the pre-built classifier is trained multiple times to obtain a target classification model that can classify similar visual objects during visual object tracking of the (N+1)th extracted current video frame; alternatively, the target classification model used during target tracking of the (N-1)th extracted current video frame is further trained, obtaining the target classification model that classifies similar visual objects during visual object tracking of the (N+1)th extracted current video frame.
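The per-frame cycle described above, tracking, detecting same-category objects, classifying with the current model, updating the two sample lists, and retraining for the next frame, can be sketched as follows (all callables are stand-ins for the patent's models, and `train_classifier` represents any of the classifiers mentioned below):

```python
def tracking_loop(frames, tracker, detector, train_classifier):
    """Orchestrate one pass over the extracted current video frames.

    tracker(frame)                 -> tracker's candidate region
    detector(frame)                -> same-category regions
    train_classifier(pos, neg)     -> scoring function (higher = target)
    """
    positives, negatives = [], []  # tracking / interfering sample lists
    classify = None                # no classifier before the first training
    results = []
    for frame in frames:
        first_area = tracker(frame)
        others = detector(frame)
        if classify is not None:
            # secondary detection with the model trained on earlier frames
            target = max([first_area] + others, key=classify)
        else:
            target = first_area    # fall back to the tracker's output
        results.append(target)
        # update sample lists: target -> positives, the rest -> negatives
        positives.append(target)
        negatives.extend(a for a in [first_area] + others if a != target)
        # (re)train for use on the next extracted frame
        classify = train_classifier(positives, negatives)
    return results
```

Note that the classifier trained on frame N is only applied to frame N+1, matching the description above.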
  • Yet another visual target tracking method includes adding the functional area obtained based on the current video frame to the tracking visual target sample list, and adding the image areas where visual objects with the category information other than the target visual object are located to the interfering visual target sample list. The negative samples for training the target classification model are obtained from the interfering visual target sample list, and the positive samples for training the target classification model are obtained from the tracking visual target sample list.
  • Obtaining, from the target-tracked video frames, the tracking visual target samples corresponding to the target visual object and the interfering visual target samples corresponding to visual objects with the category information other than the target visual object includes: obtaining the tracking visual target samples from the tracking visual target sample list, and obtaining the interfering visual target samples from the interfering visual target sample list.
  • Another visual target tracking method includes steps during the Nth execution of the visual target tracking method:
  • K101 Input the Nth frame image of the video to be processed into the preset target tracking model, and mark, in the Nth frame image of the video to be processed, the first image area A corresponding to the target visual object determined based on the key video frame of the video to be processed.
  • K102 Perform target detection on the Nth frame image of the video to be processed, and identify second image areas B, C, and D corresponding to multiple visual objects of the same type as the target visual object.
  • K103 Using a target classification model that distinguishes specific visual objects from multiple visual objects of the same type, determine, from the first image area and the multiple second image areas, the functional area B of the target visual object in the Nth frame image of the video to be processed.
  • the interfering visual target sample list is D = {D1, D2, ..., DL}.
  • the interfering visual target sample list includes elements D1, D2, ..., DL, which include the image areas of other visual objects of the same type as the target visual object obtained from the Nth frame image of the video to be processed (A, C, and D), and the image areas of other visual objects of the same type as the target visual object obtained from the (N-i)th frame images, where i is less than N and greater than or equal to 1.
  • K105 Add image area B to the tracking visual target sample list.
  • the tracking visual target sample list is T = {T1, T2, ..., TM}.
  • the tracking visual target sample list includes the tracking visual target sample obtained from the Nth frame image of the video to be processed.
  • the tracking visual target samples and interference visual target samples are determined based on the current video frame, and are added to the tracking visual target sample list and the interference visual target sample list respectively.
  • K106 Obtain tracking visual target samples from the tracking visual target sample list as positive samples and interfering visual target samples from the interfering visual target sample list as negative samples; train the pre-built classifier, or further train the target classification model trained during the (N-1)th execution of the visual target tracking method, to obtain the target classification model used in the (N+1)th execution of the visual target tracking method.
  • Methods for training target classification models include:
  • the squared difference of gray values is used to define the distance between two samples, and the nearest neighbor classifier is used for target classification.
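A minimal version of this nearest-neighbor classification, with each sample patch represented as a flat list of gray values (an assumption made for illustration; the patent does not fix a representation):

```python
def sq_gray_distance(a, b):
    """Sum of squared gray-value differences between two equal-size patches."""
    return sum((pa - pb) ** 2 for pa, pb in zip(a, b))

def nearest_neighbor_label(patch, positives, negatives):
    """Nearest-neighbor classification of a candidate patch:
    True  -> nearest training sample is a tracking target sample (positive)
    False -> nearest training sample is an interfering sample (negative)."""
    d_pos = min(sq_gray_distance(patch, p) for p in positives)
    d_neg = min(sq_gray_distance(patch, n) for n in negatives)
    return d_pos <= d_neg
```

Any other classifier mentioned above (decision tree, support vector machine) could replace the nearest-neighbor rule without changing the surrounding pipeline.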
  • the method further includes:
  • Add the functional area of the target visual object in the current video frame to the tracking visual target sample list.
  • When the above method is used, the time information of the current video frame can be obtained so that inapplicable elements can subsequently be removed.
  • When a sample is added, its adding time (image frame number, or absolute time) can be recorded at the same time, and a history forgetting mechanism can be implemented, such as:
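One possible sketch of such a forgetting mechanism (the class and parameter names are illustrative, not from the patent): each sample is stored with the frame number at which it was added, samples stored longer than a preset length are dropped, and the first-ranked (oldest) element is evicted once the list is full.

```python
from collections import deque

class SampleList:
    """Sample list with a history-forgetting mechanism."""

    def __init__(self, max_len=50, max_age=300):
        self.max_len = max_len    # capacity of the list
        self.max_age = max_age    # preset storage-time length, in frames
        self.entries = deque()    # (frame_no, sample) pairs, oldest first

    def add(self, sample, frame_no):
        # forget samples whose storage time exceeds the preset length
        while self.entries and frame_no - self.entries[0][0] > self.max_age:
            self.entries.popleft()
        # forget the first-ranked (oldest) element when the list is full
        if len(self.entries) >= self.max_len:
            self.entries.popleft()
        self.entries.append((frame_no, sample))

    def samples(self):
        return [s for _, s in self.entries]
```

The same structure can back both the tracking visual target sample list and the interfering visual target sample list.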
  • the embodiment of the present invention trains a target classification model on tracking target samples and interfering target samples collected in historical frames of the video to be processed, and uses it to update and correct the output of the tracking algorithm on the current video frame. The algorithm is therefore not easily interfered with by targets of the same category and is highly robust to occlusion and interference by targets of the same category, significantly improving the usability of the tracking algorithm in real scenarios.
  • Figure 3 is a functional module diagram of the visual target tracking device proposed by the embodiment of the present invention.
  • The above device is provided in an electronic device. As shown in Figure 3, the device includes:
  • Marking module 31 configured to input the current video frame of the video to be processed into a preset target tracking model, and mark the target visual object determined based on the key video frame of the video to be processed when the current video frame corresponds to the first image area;
  • Detection module 32, configured to perform target detection on the current video frame and identify the second image areas corresponding to multiple visual objects of the same type as the target visual object;
  • Classification module 33, configured to use a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, and determine, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
  • the visual target tracking device provided by the embodiment shown in Figure 3 can be used to implement the technical solutions of the method embodiments shown in Figures 1 and 2 of this specification. For its implementation principles and technical effects, further reference can be made to the relevant descriptions in the method embodiments.
  • the device also includes:
  • Optimization module, configured to, when the functional area of the target visual object in the current video frame is inconsistent with the first image area, use the functional area of the target visual object in the current video frame to optimize the preset target tracking model.
  • the device also includes:
  • An extraction module configured to respond to a trigger operation for the video to be processed and extract the category information of the target visual object from the key video frame of the video to be processed;
  • the detection module is specifically configured to perform target detection on the current video frame according to the category information, and identify the second image areas corresponding to multiple visual objects of the same category as the target visual object.
  • the device further includes a target classification model setting module, which includes:
  • a sample acquisition submodule, used to obtain, from the video frames of the video to be processed that have undergone target tracking, the tracking visual target samples corresponding to the target visual object, and the interfering visual target samples corresponding to visual objects, other than the target visual object, that have the category information;
  • a training submodule, used to train a pre-built classifier multiple times with the tracking visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.
  • the target classification model setting module also includes:
  • the first adding sub-module is used to add the target visual object in the functional area of the current video frame to the tracking visual target sample list;
  • the second adding sub-module is used to obtain the third image areas that remain after the functional area of the target visual object in the current video frame is removed from the set composed of the first image area and the multiple second image areas, and add them to the interfering visual target sample list;
  • the sample acquisition sub-module is specifically configured to obtain the tracking visual target sample from the tracking visual target sample list, and obtain the interfering visual target sample from the interfering visual target sample list.
  • the target classification model setting module also includes:
  • Time acquisition submodule, used to obtain the time information of the current video frame;
  • the first adding sub-module includes:
  • an adding subunit, used to add the functional area of the target visual object in the current video frame to the last position of the tracking visual target sample list, where the elements of the tracking visual target sample list are arranged from largest to smallest by their corresponding time information;
  • a deletion subunit, used to delete elements in the tracking visual target sample list whose storage time exceeds a preset time length, or the element ranked first in the tracking visual target sample list.
  • the extraction module includes:
  • the response sub-module is used to respond to the trigger operation for the video to be processed and mark the target visual object selected in the key video frame of the video to be processed;
  • a detection submodule is used to perform category detection on the target visual object and obtain category information of the target visual object.
  • the extraction module also includes:
  • An identification submodule configured to identify and display multiple visual objects whose categories are associated with user-input category information in key video frames of the video to be processed;
  • the receiving submodule is configured to, when receiving a trigger signal for a visual object associated with the category information, obtain the visual object corresponding to the trigger signal as the target visual object and the category information.
  • the device provided in the above-described embodiments may be, for example, a chip or a chip module.
  • the devices provided by the above-described embodiments are used to execute the technical solutions of the above-described method embodiments. For its implementation principles and technical effects, further reference can be made to the relevant descriptions in the method embodiments, which will not be described again here.
  • For each module/unit included in each device described in the above embodiments, it may be a software module/unit or a hardware module/unit, or it may be partly a software module/unit and partly a hardware module/unit.
  • For each device applied to or integrated in a chip, each module/unit included in it can be implemented in hardware such as circuits, or at least some of the modules/units can be implemented as a software program running on a processor integrated inside the chip, with the remaining modules/units implemented in hardware such as circuits.
  • For each device applied to or integrated in a chip module, each module/unit included in it can be implemented in hardware such as circuits, and different modules/units can be located in the same component (such as a chip or a circuit module) or in different components of the chip module; alternatively, at least some of the modules/units can be implemented as a software program running on a processor integrated inside the chip module, with the remaining modules/units implemented in hardware such as circuits.
  • For each device applied to or integrated in electronic terminal equipment, each module/unit included in it can be implemented in hardware such as circuits, and different modules/units can be located in the same component (e.g., a chip or a circuit module) or in different components within the electronic terminal equipment; alternatively, at least some of the modules/units can be implemented as a software program running on a processor integrated inside the electronic terminal equipment, with the remaining (if any) modules/units implemented in hardware such as circuits.
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
  • the electronic device 400 includes a processor 410, a memory 411, and a computer program stored in the memory 411 and executable on the processor 410.
  • when the processor 410 executes the program, the steps in the foregoing method embodiments are implemented.
  • the electronic equipment provided by the above embodiments can be used to execute the technical solutions of the method embodiments shown above. For its implementation principles and technical effects, reference can be made to the relevant descriptions in the method embodiments, which will not be repeated here.
  • Embodiments of the present invention provide a non-transitory computer-readable storage medium.
  • the non-transitory computer-readable storage medium stores computer instructions.
  • the computer instructions cause the computer to execute the methods of the embodiments shown in Figures 1 and 2 of this specification.
  • Non-transitory computer-readable storage media may refer to non-volatile computer storage media.
  • the above-mentioned non-transitory computer-readable storage medium may adopt any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such a propagated data signal may take a variety of forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operations described herein may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Therefore, features defined as "first" and "second" may explicitly or implicitly include at least one such feature.
  • "plurality" means at least two, for example two, three, etc., unless otherwise clearly and specifically limited.
  • the word "if" as used herein may be interpreted as "when", "upon", "in response to determining", or "in response to detecting".
  • the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (a stated condition or event) is detected", or "in response to detecting (a stated condition or event)".
  • terminals involved in the embodiments of the present invention may include, but are not limited to, personal computers (PCs), personal digital assistants (PDAs), wireless handheld devices, tablet computers, mobile phones, MP3 players, MP4 players, etc.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • each functional unit in each embodiment of this specification may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the above-mentioned integrated unit implemented in the form of a software functional unit can be stored in a computer-readable storage medium.
  • the above-mentioned software functional unit is stored in a storage medium and includes a number of instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some of the steps of the methods described in the various embodiments of this specification.
  • the aforementioned storage media include: a USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present invention relate to the technical field of computer vision. Provided are a visual-target tracking method and apparatus, and a device and a storage medium, which can accurately track a specific visual target without interference from visual targets of the same category. The method comprises: inputting into a preset target tracking model the current video frame of a video to be processed, and marking a corresponding first image area of a target visual object in the current video frame, which target visual object is determined on the basis of a key video frame of said video; performing target detection on the current video frame, so as to identify second image areas corresponding to a plurality of visual objects of the same type as the target visual object; and by using a target classification model for distinguishing a specific visual object from among the plurality of visual objects of the same type, determining, from the first image area and the plurality of second image areas, a functional area of the target visual object in the current video frame.

Description

Visual target tracking method, device, equipment and storage medium

[Technical Field]

Embodiments of the present invention relate to the field of computer vision technology, and in particular to a visual target tracking method, device, equipment and storage medium.

[Background Art]

Tracking a visual target in a video is a technique that, given the size and position of the visual target in a specific image frame, predicts the size and position of the object corresponding to that visual target in subsequent image frames of the video sequence. It is widely used in many fields such as video surveillance, human-computer interaction and multimedia analysis.

In practical applications in the current field of visual target tracking, whether for traditional algorithms based on DCF technology or deep-learning-based tracking algorithms represented by SiamRPN, robustness is poor in scenarios with interference from targets of the same category (for example, interfering vehicles of similar color, or interfering pedestrians with similar appearance and structural information). In particular, once the currently tracked target is occluded by a similar interfering target, the algorithm easily drifts onto the interfering target.
[Summary of the Invention]

Embodiments of the present invention provide a visual target tracking method, device, equipment and storage medium, which can accurately track a specific visual target without interference from visual targets of the same category.

In a first aspect, embodiments of the present invention provide a visual target tracking method, applied to an electronic device. The method includes: inputting the current video frame of the video to be processed into a preset target tracking model, and marking the first image area corresponding, in the current video frame, to the target visual object determined based on the key video frame of the video to be processed; performing target detection on the current video frame to identify second image areas corresponding to multiple visual objects of the same type as the target visual object; and using a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, determining from the first image area and the multiple second image areas the functional area of the target visual object in the current video frame.

For the current frame being processed, the above visual target tracking method, on the one hand, uses the target tracking model that completed target tracking on the previous frame to detect the target object to be tracked, and on the other hand detects other objects in the current frame of the same type as that target object. A target classification model capable of distinguishing a specific visual object from multiple visual objects of the same type is then applied to the detected candidate and to the other same-type objects in order to single out the target object to be tracked. In other words, the visual object that the tracking model regards as the target, together with the other objects of the same type, is extracted, and a classifier performs a secondary detection on these extracted visual objects to pick out the target visual object among the different visual objects of the same type. This secondary-detection approach avoids interference from similar objects and achieves accurate tracking of a specific visual target without interference from visual targets of the same category.
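The per-frame flow described above can be sketched as follows; `tracker`, `detector`, and `classifier` are hypothetical stand-ins for the preset target tracking model, the same-category detector, and the target classification model, and the scoring interface is an assumption of this sketch, not an interface fixed by the method:

```python
def track_frame(frame, tracker, detector, classifier, category):
    """One iteration of the secondary-detection tracking scheme (a sketch).

    tracker(frame)             -> first image area proposed by the tracking model
    detector(frame, category)  -> second image areas of same-category objects
    classifier.score(frame, r) -> confidence that region r is the tracked target
    """
    first_area = tracker(frame)
    second_areas = detector(frame, category)
    # Secondary detection: score every candidate region and keep the best one.
    candidates = [first_area] + list(second_areas)
    functional_area = max(candidates, key=lambda r: classifier.score(frame, r))
    # If the classifier disagrees with the tracker, correct the tracking model.
    if functional_area != first_area:
        tracker.update(frame, functional_area)
    return functional_area
```

The point of the design is that the tracker's own proposal is treated as just one candidate among the same-category detections, so an occluding interferer that fooled the tracker can still be rejected by the classifier.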
In one possible implementation, after the functional area of the target visual object in the current video frame is determined from the first image area and the multiple second image areas, the method further includes:

when the functional area of the target visual object in the current video frame is inconsistent with the first image area, using the functional area of the target visual object in the current video frame to optimize the preset target tracking model.
In one possible implementation, the method further includes:

in response to a trigger operation for the video to be processed, extracting the category information of the target visual object from the key video frame of the video to be processed.

Performing target detection on the current video frame and identifying the second image areas corresponding to multiple visual objects of the same category as the target visual object includes:

performing target detection on the current video frame according to the category information, and identifying the second image areas corresponding to multiple visual objects of the same category as the target visual object.
In one possible implementation, the process of setting the target classification model includes:

obtaining, from video frames of the video to be processed that have undergone target tracking, tracking visual target samples corresponding to the target visual object, and interfering visual target samples corresponding to visual objects, other than the target visual object, that have the category information;

training a pre-built classifier multiple times with the tracking visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.
In one possible implementation, after the functional area of the target visual object in the current video frame is determined, the method further includes:

adding the functional area of the target visual object in the current video frame to a tracking visual target sample list;

obtaining the third image areas that remain after the functional area of the target visual object in the current video frame is removed from the set composed of the first image area and the multiple second image areas, and adding them to an interfering visual target sample list.

Obtaining, from the video frames that have undergone target tracking, the tracking visual target samples corresponding to the target visual object, and the interfering visual target samples corresponding to visual objects, other than the target visual object, that have the category information, includes:

obtaining the tracking visual target samples from the tracking visual target sample list, and obtaining the interfering visual target samples from the interfering visual target sample list.
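The sample-collection step above can be sketched as follows; `crop` is a hypothetical helper for extracting a region's image patch from the frame, and plain Python lists stand in for the two sample lists:

```python
def collect_samples(frame, functional_area, first_area, second_areas,
                    tracking_list, interfering_list, crop):
    """Update the classifier's sample lists after one tracked frame.

    The chosen functional area becomes a positive (tracking) sample; every
    other candidate region (the "third image areas") becomes a negative
    (interfering) sample. `crop` is a placeholder for patch extraction.
    """
    tracking_list.append(crop(frame, functional_area))
    # Third image areas: all candidates except the chosen functional area.
    candidates = [first_area] + list(second_areas)
    for region in candidates:
        if region != functional_area:
            interfering_list.append(crop(frame, region))
    return tracking_list, interfering_list
```

These lists are exactly the positive and negative sample pools used when the pre-built classifier is retrained.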
In one possible implementation, before the functional area of the target visual object in the current video frame is added to the tracking visual target sample list, the method further includes:

obtaining the time information of the current video frame.

Adding the functional area of the target visual object in the current video frame to the tracking visual target sample list includes:

adding the functional area of the target visual object in the current video frame to the last position of the tracking visual target sample list, where the elements of the tracking visual target sample list are arranged from largest to smallest by their corresponding time information;

deleting elements in the tracking visual target sample list whose storage time exceeds a preset time length, or the element ranked first in the tracking visual target sample list.
In one possible implementation, in response to a trigger operation for the video to be processed, extracting the category information of the target visual object from the key video frame of the video to be processed includes:

in response to the trigger operation for the video to be processed, marking the target visual object selected in the key video frame of the video to be processed;

performing category detection on the target visual object to obtain the category information of the target visual object.

In one possible implementation, in response to a trigger operation for the video to be processed, extracting the category information of the target visual object from the key video frame of the video to be processed includes:

in the key video frame of the video to be processed, identifying and displaying multiple visual objects whose categories are associated with category information input by a user;

when a trigger signal for a visual object associated with the category information is received, taking the visual object corresponding to the trigger signal as the target visual object and obtaining the category information.
In a second aspect, embodiments of the present invention provide a visual target tracking device, provided in an electronic device. The device includes:

a marking module, configured to input the current video frame of the video to be processed into a preset target tracking model, and mark the first image area corresponding, in the current video frame, to the target visual object determined based on the key video frame of the video to be processed;

a detection module, configured to perform target detection on the current video frame and identify the second image areas corresponding to multiple visual objects of the same type as the target visual object;

a classification module, configured to use a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, and determine, from the first image area and the multiple second image areas, the functional area of the target visual object in the current video frame.
In one possible implementation, the device further includes:

an optimization module, configured to, when the functional area of the target visual object in the current video frame is inconsistent with the first image area, use the functional area of the target visual object in the current video frame to optimize the preset target tracking model.

In one possible implementation, the device further includes:

an extraction module, configured to respond to a trigger operation for the video to be processed and extract the category information of the target visual object from the key video frame of the video to be processed;

the detection module is specifically configured to perform target detection on the current video frame according to the category information, and identify the second image areas corresponding to multiple visual objects of the same category as the target visual object.
In one possible implementation, the device further includes a target classification model setting module, which includes:

a sample acquisition submodule, used to obtain, from the video frames of the video to be processed that have undergone target tracking, the tracking visual target samples corresponding to the target visual object, and the interfering visual target samples corresponding to visual objects, other than the target visual object, that have the category information;

a training submodule, used to train a pre-built classifier multiple times with the tracking visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from multiple visual objects with the same category information.

In one possible implementation, the target classification model setting module further includes:

a first adding submodule, used to add the functional area of the target visual object in the current video frame to the tracking visual target sample list;

a second adding submodule, used to obtain the third image areas that remain after the functional area of the target visual object in the current video frame is removed from the set composed of the first image area and the multiple second image areas, and add them to the interfering visual target sample list;

the sample acquisition submodule is specifically used to obtain the tracking visual target samples from the tracking visual target sample list, and obtain the interfering visual target samples from the interfering visual target sample list.
In one possible implementation, the target classification model setting module further includes:

a time acquisition submodule, used to obtain the time information of the current video frame;

the first adding submodule includes:

an adding subunit, used to add the functional area of the target visual object in the current video frame to the last position of the tracking visual target sample list, where the elements of the tracking visual target sample list are arranged from largest to smallest by their corresponding time information;

a deletion subunit, used to delete elements in the tracking visual target sample list whose storage time exceeds a preset time length, or the element ranked first in the tracking visual target sample list.
其中一种可能的实现方式中,所述提取模块包括:In one possible implementation, the extraction module includes:
响应子模块,用于响应针对待处理视频的触发操作,标记所述待处理视频的关键视频帧选定的目标视觉对象;The response sub-module is used to respond to the trigger operation for the video to be processed and mark the target visual object selected in the key video frame of the video to be processed;
检测子模块,用于对所述目标视觉对象进行类别检测,获得所述目标视觉对象的类别信息。A detection submodule is used to perform category detection on the target visual object and obtain category information of the target visual object.
其中一种可能的实现方式中,所述提取模块还包括:In one possible implementation, the extraction module further includes:
识别子模块,用于在所述待处理视频的关键视频帧,识别并显示类别与用户输入类别信息关联的多个视觉对象; An identification submodule, configured to identify and display multiple visual objects whose categories are associated with user-input category information in key video frames of the video to be processed;
接收子模块,用于接收到针对关联所述类别信息的视觉对象的触发信号时,获取触发信号对应视觉对象为所述目标视觉对象和所述类别信息。The receiving submodule is configured to, upon receiving a trigger signal for a visual object associated with the category information, take the visual object corresponding to the trigger signal as the target visual object and obtain the category information.
第三方面,本发明实施例提供一种电子设备,包括:至少一个处理器;以及与所述处理器通信连接的至少一个存储器,其中:所述存储器存储有可被所述处理器执行的程序指令,所述处理器调用所述程序指令能够执行第一方面提供的方法。In a third aspect, embodiments of the present invention provide an electronic device, including: at least one processor; and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method provided in the first aspect.
第四方面,本发明实施例提供一种非暂态计算机可读存储介质,所述非暂态计算机可读存储介质存储计算机指令,所述计算机指令使所述计算机执行第一方面提供的方法。In a fourth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the method provided in the first aspect.
应当理解的是,本发明实施例的第二~四方面与本发明实施例的第一方面的技术方案一致,各方面及对应的可行实施方式所取得的有益效果相似,不再赘述。It should be understood that the second to fourth aspects of the embodiments of the present invention are consistent with the technical solution of the first aspect of the embodiments of the present invention, and the beneficial effects achieved by each aspect and corresponding feasible implementations are similar, and will not be described again.
【附图说明】[Brief Description of the Drawings]
为了更清楚地说明本发明实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本说明书的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其它的附图。In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this specification; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.
图1是本发明实施例提出的视觉目标跟踪方法步骤流程图;Figure 1 is a flow chart of the steps of the visual target tracking method proposed by the embodiment of the present invention;
图2是本发明实施例另一种视觉目标跟踪方法的步骤流程图;Figure 2 is a step flow chart of another visual target tracking method according to an embodiment of the present invention;
图3是本发明实施例提出的视觉目标跟踪装置的功能模块图;Figure 3 is a functional module diagram of the visual target tracking device proposed by the embodiment of the present invention;
图4为本发明实施例提供的一种电子设备的结构示意图。Figure 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
【具体实施方式】[Detailed Description of the Embodiments]
为了更好的理解本说明书的技术方案,下面结合附图对本发明实施例进行详细描述。In order to better understand the technical solutions of this specification, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
应当明确,所描述的实施例仅仅是本说明书一部分实施例,而不是全部的实施例。基于本说明书中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其它实施例,都属于本说明书保护的范围。It should be clear that the described embodiments are only some, rather than all, of the embodiments of this specification. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this specification.
在本发明实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本说明书。在本发明实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。The terminology used in the embodiments of the present invention is for the purpose of describing specific embodiments only and is not intended to limit this specification. The singular forms "a/an", "said", and "the" used in the embodiments of the present invention and in the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
本发明实施例提出的视觉目标跟踪方法可以应用于终端、服务器等电子设备。The visual target tracking method proposed in the embodiment of the present invention can be applied to electronic devices such as terminals and servers.
图1是本发明实施例提出的视觉目标跟踪方法步骤流程图,如图1所示,步骤包括:Figure 1 is a flow chart of the steps of the visual target tracking method proposed by the embodiment of the present invention. As shown in Figure 1, the steps include:
S110:将所述待处理视频的当前视频帧输入预先设定的目标跟踪模型,标记基于所述待处理视频的关键视频帧确定的目标视觉对象在所述当前视频帧对应第一图像区域。S110: Input the current video frame of the video to be processed into a preset target tracking model, and mark the target visual object determined based on the key video frame of the video to be processed in the first image area corresponding to the current video frame.
目标视觉对象是需要跟踪的运动视觉目标,对待处理视频的图像序列中的运动视觉目标进行检测、提取、识别和跟踪,获得运动视觉目标的运动参数,如位置、速度、加速度和运动轨迹等。The target visual object is a moving visual target that needs to be tracked. The moving visual target in the image sequence of the video to be processed is detected, extracted, identified, and tracked, and the motion parameters of the moving visual target, such as position, speed, acceleration, and motion trajectory, are obtained.
视觉目标跟踪中的视觉对象可以是在视频的图像帧中的物体,例如可以是视频图像帧中的人、车、动物以及机器人等。The visual objects in visual target tracking can be objects in the image frames of the video, for example, they can be people, cars, animals, robots, etc. in the video image frames.
视觉对象在图像上对应的图像区域可以表示:图像中显示视觉对象的像素点所在区域。The image area corresponding to the visual object on the image can represent: the area in the image where the pixels displaying the visual object are located.
可以依次将待处理视频中的每帧图像作为当前视频帧,或者每隔预设帧数,从待处理视频的图像序列中抽取图像帧,作为当前视频帧。例如,每隔N(N大于或者等于1)帧抽取待处理视频中的图像,作为当前视频帧,检测指定视觉对象在当前视频帧的位置和大小。Each frame of image in the video to be processed can be used as the current video frame in turn, or an image frame can be extracted from the image sequence of the video to be processed every preset number of frames as the current video frame. For example, extract an image from the video to be processed every N (N is greater than or equal to 1) frames as the current video frame, and detect the position and size of the specified visual object in the current video frame.
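As an illustrative sketch (not part of the claimed method), the periodic frame extraction described above can be expressed as follows; the video is represented only by its frame count, and `interval` stands for the assumed sampling step N:

```python
def sample_frame_indices(num_frames, interval):
    """Return the indices of frames taken as the 'current video frame',
    one every `interval` frames (interval >= 1), starting from frame 0."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return list(range(0, num_frames, interval))
```

With `interval == 1` every frame is processed; larger values trade tracking granularity for computation.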
预先设定的目标跟踪模型可以是判别相关滤波器(Discriminative Correlation Filter,DCF)等相关滤波类的跟踪器,也可以是孪生网络视觉跟踪器(Evolution of Siamese Visual Tracking with Very Deep Networks,SiamRPN)等基于卷积神经网络(CNN)技术的跟踪器。The preset target tracking model may be a correlation-filter tracker such as the Discriminative Correlation Filter (DCF), or a tracker based on convolutional neural network (CNN) technology, such as the Siamese-network visual tracker SiamRPN (Evolution of Siamese Visual Tracking with Very Deep Networks).
设定的目标跟踪模型可以在待处理视频每一帧图像完成视觉对象追踪后,通过追踪结果进行优化,提高设定的目标跟踪模型的鲁棒性。After visual object tracking is completed for each frame of the video to be processed, the preset target tracking model can be optimized using the tracking result, improving its robustness.
标记基于所述待处理视频的关键视频帧确定的目标视觉对象在所述当前视频帧对应第一图像区域的方式可以是框选、突出显示等。The method of marking the target visual object determined based on the key video frame of the video to be processed in the first image area corresponding to the current video frame may be frame selection, highlighting, etc.
待处理视频的关键视频帧可以是待处理视频的第一帧图像,还可以是待处理视频中质量最好的一帧图像。本发明实施例基于关键视频帧响应用户选择指令,确定指定物体对应视觉对象为目标视觉对象,以在待处理视频的其他视频帧中进行追踪。The key video frame of the video to be processed may be the first frame of the video to be processed, or the frame with the best quality in the video to be processed. The embodiment of the present invention responds to a user selection instruction based on the key video frame and determines the visual object corresponding to the specified object as the target visual object, so that it can be tracked in the other video frames of the video to be processed.
本发明一种示例电子设备执行S110:基于待处理视频第一帧图像,确定目标视觉对象,将第一帧图像和第二帧图像输入目标跟踪模型,利用目标跟踪模型在第二帧图像中框选出目标视觉对象的位置,输出目标视觉对象在第二帧图像对应第一图像区域。In an example of the present invention, an electronic device performs S110 as follows: the target visual object is determined based on the first frame image of the video to be processed, the first frame image and the second frame image are input into the target tracking model, the target tracking model box-selects the position of the target visual object in the second frame image, and the first image area corresponding to the target visual object in the second frame image is output.
S120:对所述当前视频帧进行目标检测,识别出多个类型与所述目标视觉对象相同的视觉对象对应第二图像区域。S120: Perform target detection on the current video frame, and identify multiple visual objects of the same type as the target visual object corresponding to the second image area.
对当前视频帧进行目标检测的方式包括:基于手工特征的检测方法(如模板匹配法、关键点匹配法、关键特征法等),也可以是基于卷积神经网络技术的检测方法(如YOLO,SSD,R-CNN,Mask R-CNN等)。Methods for target detection on the current video frame include: detection methods based on manual features (such as template matching method, key point matching method, key feature method, etc.), or detection methods based on convolutional neural network technology (such as YOLO, SSD, R-CNN, Mask R-CNN, etc.).
示例地,目标视觉对象的类型是行人,检测出当前视频帧中所有的行人。For example, the type of the target visual object is pedestrians, and all pedestrians in the current video frame are detected.
S130:利用从多个类型相同视觉对象中区分出特定视觉对象的目标分类模型,从所述第一图像区域和多个所述第二图像区域中确定所述目标视觉对象在所述当前视频帧的功能区域。S130: Using a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, determine, from the first image area and the plurality of second image areas, the functional area of the target visual object in the current video frame.
目标视觉对象在所述当前视频帧的功能区域包括:目标视觉对象在所述当前视频帧的显示区域或触发智能设备执行机械运动的区域,例如移动机器人、无人机、带云台的摄像头,响应计算出目标视觉对象在当前视频帧的功能区域的指令,触发执行移动到相应位置,或者转动云台,对目标视觉对象进行实际的追踪。The functional area of the target visual object in the current video frame includes: the display area of the target visual object in the current video frame, or an area that triggers a smart device to perform a mechanical movement. For example, a mobile robot, a drone, or a camera with a gimbal may, in response to an instruction carrying the computed functional area of the target visual object in the current video frame, move to a corresponding position or rotate the gimbal, thereby physically tracking the target visual object.
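As a hedged illustration of the gimbal case above, the following sketch converts a functional area into pan/tilt angle errors that a controller could act on. The (x, y, w, h) region format and the field-of-view values are assumptions for illustration only, not taken from the source:

```python
def pantilt_error(region, frame_w, frame_h, fov_x_deg=90.0, fov_y_deg=60.0):
    """Return (yaw, pitch) angles in degrees that would re-centre the
    target's functional area `region` = (x, y, w, h) in the frame.
    The FOV defaults are illustrative assumptions."""
    x, y, w, h = region
    cx, cy = x + w / 2.0, y + h / 2.0
    # Normalised offset in [-0.5, 0.5] relative to the frame centre.
    dx = (cx - frame_w / 2.0) / frame_w
    dy = (cy - frame_h / 2.0) / frame_h
    return dx * fov_x_deg, dy * fov_y_deg
```

A gimbal or drone controller would feed these errors into its motion loop each time a new functional area is computed.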
目标分类模型确定所述目标视觉对象在所述当前视频帧的功能区域,可以输出框选出功能区域的当前图像帧。After the target classification model determines the functional area of the target visual object in the current video frame, it can output the current image frame with the functional area box-selected.
目标分类模型可以将多个类型相同的视觉对象分类为目标视觉对象和非目标视觉对象。The target classification model can classify multiple visual objects of the same type into target visual objects and non-target visual objects.
示例地,利用目标跟踪模型识别出目标视觉对象在当前视频帧可能的图像区域为A,对所述当前视频帧进行目标检测,识别出类型与目标视觉对象相同的多个视觉对象在当前视频帧的图像区域:B、C、D、A。利用目标分类模型对图像区域A、B、C、D进行分类,确定图像区域A为当前视频帧上显示目标视觉对象的区域。由于经过了目标跟踪模型和目标分类模型的两次计算,保证在当前视频帧追踪到的目标视觉对象更加准确;其中目标分类模型是针对同类型视觉对象进行分类的,能够有效避免同类别视觉对象的干扰。For example, the target tracking model identifies A as the likely image area of the target visual object in the current video frame, and target detection on the current video frame identifies the image areas of multiple visual objects of the same type as the target visual object: B, C, D, and A. The target classification model then classifies image areas A, B, C, and D, and determines that image area A is the area displaying the target visual object in the current video frame. Because both the target tracking model and the target classification model are applied, the target visual object tracked in the current video frame is more accurate; since the target classification model classifies visual objects of the same type, interference from same-category visual objects can be effectively avoided.
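The two-stage selection in this example (the tracker proposes a region, the detector adds same-type candidates, and the classifier picks the final region) can be sketched as follows; `score_fn` is a hypothetical stand-in for the trained target classification model, returning a higher score the more likely a region is the tracked target:

```python
def select_functional_area(tracker_region, detector_regions, score_fn):
    """Pool the tracker's region with same-category detections and let the
    target classification model pick the most likely tracked target.
    `score_fn(region)` stands in for the trained classifier."""
    candidates = [tracker_region] + list(detector_regions)
    # De-duplicate while keeping order, in case the detector re-finds
    # the tracker's region (as with A in the example above).
    seen, pool = set(), []
    for r in candidates:
        if r not in seen:
            seen.add(r)
            pool.append(r)
    return max(pool, key=score_fn)
```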
本发明实施例还提出,当目标分类模型输出目标视觉对象在所述当前视频帧的功能区域后,判断该功能区域与目标跟踪模型检测出的目标视觉对象在所述当前视频帧的第一图像区域是否一致;当两者一致时,说明目标跟踪模型当前能够准确地在当前视频帧检测出目标视觉对象。Embodiments of the present invention further propose that after the target classification model outputs the functional area of the target visual object in the current video frame, it is determined whether that functional area is consistent with the first image area of the target visual object detected by the target tracking model in the current video frame; when the two are consistent, the target tracking model is currently able to accurately detect the target visual object in the current video frame.
当所述目标视觉对象在所述当前视频帧的功能区域与所述第一图像区域不一致时,利用所述目标视觉对象在所述当前视频帧的功能区域对所述预先设定的目标跟踪模型进行优化。When the functional area of the target visual object in the current video frame is inconsistent with the first image area, the preset target tracking model is optimized using the functional area of the target visual object in the current video frame.
可以使用目标分类模型输出的框选出目标视觉对象的当前视频帧,对目标跟踪模型进行训练,或对目标跟踪模型进行重新初始化,使得目标跟踪模型在对下一帧的图像识别中更加准确。The current video frame with the target visual object box-selected, as output by the target classification model, can be used to train the target tracking model or to re-initialize it, so that the target tracking model is more accurate when recognizing the next frame.
本发明另一种实施例提出另一种视觉目标跟踪方法,图2是本发明实施例另一种视觉目标跟踪方法的步骤流程图,步骤包括:Another embodiment of the present invention provides another visual target tracking method. Figure 2 is a step flow chart of another visual target tracking method according to the embodiment of the present invention. The steps include:
S210:响应针对待处理视频的触发操作,从所述待处理视频的关键视频帧提取目标视觉对象的类别信息。S210: In response to a triggering operation on the video to be processed, extract the category information of the target visual object from the key video frame of the video to be processed.
执行S210可以通过执行子步骤S211或执行子步骤S212实现。Executing S210 may be implemented by executing sub-step S211 or executing sub-step S212.
S211:响应针对待处理视频的触发操作,标记所述待处理视频的关键视频帧选定的目标视觉对象;对所述目标视觉对象进行类别检测,获得所述目标视觉对象的类别信息。S211: In response to the triggering operation on the video to be processed, mark the target visual object selected in the key video frame of the video to be processed; perform category detection on the target visual object to obtain category information of the target visual object.
示例地,电子设备显示关键视频帧,用户选择特定图像区域,电子设备对选择的特定图像区域进行类别检测,获得目标视觉对象的类别信息。电子设备可以计算每个检测结果与用户选择的初始目标边界框的交并比,选取交并比最大的检测结果的类别作为类别信息。电子设备还可以将关键视频帧和对特定图像区域标注的初始目标边界框送入基于CNN的目标分类算法,得到分数最高的类别作为类别信息。For example, the electronic device displays a key video frame, the user selects a specific image area, and the electronic device performs category detection on the selected image area to obtain the category information of the target visual object. The electronic device may compute the intersection-over-union (IoU) between each detection result and the initial target bounding box selected by the user, and take the category of the detection result with the largest IoU as the category information. The electronic device may also feed the key video frame and the initial target bounding box annotated on the specific image area into a CNN-based target classification algorithm, and take the category with the highest score as the category information.
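The intersection-over-union comparison described above can be sketched as follows; boxes are assumed to be (x1, y1, x2, y2) tuples and detections are (box, category) pairs, which are illustrative conventions rather than requirements of the source:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def category_of_selection(initial_box, detections):
    """Return the category of the detection whose box best overlaps the
    user-selected initial target bounding box."""
    best_box, best_cat = max(detections, key=lambda d: iou(initial_box, d[0]))
    return best_cat
```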
S212:在所述待处理视频的关键视频帧,识别并显示类别与用户输入类别信息关联的多个视觉对象;接收到针对关联所述类别信息的视觉对象的触发信号时,获取触发信号对应视觉对象为所述目标视觉对象和所述类别信息。S212: In the key video frame of the video to be processed, identify and display multiple visual objects whose categories are associated with the category information input by the user; when a trigger signal for a visual object associated with the category information is received, take the visual object corresponding to the trigger signal as the target visual object and obtain the category information.
示例地,电子设备接收用户输入类别信息,对关键视频帧进行检测,识别出类别信息对应的多个视觉对象,显示多个视觉对象。将用户选择的视觉对象确定为目标视觉对象,从而获得目标视觉对象和目标视觉对象的类别信息。For example, the electronic device receives category information input by the user, detects key video frames, identifies multiple visual objects corresponding to the category information, and displays the multiple visual objects. The visual object selected by the user is determined as the target visual object, thereby obtaining the target visual object and category information of the target visual object.
S220:将所述待处理视频的当前视频帧输入预先设定的目标跟踪模型,标记基于所述待处理视频的关键视频帧确定的目标视觉对象在所述当前视频帧对应第一图像区域。S220: Input the current video frame of the video to be processed into a preset target tracking model, and mark the target visual object determined based on the key video frame of the video to be processed in the first image area corresponding to the current video frame.
S230:根据所述类别信息,对所述当前视频帧进行目标检测,识别出多个类型与所述目标视觉对象相同的视觉对象对应第二图像区域。S230: Perform target detection on the current video frame according to the category information, and identify the second image areas corresponding to multiple visual objects of the same type as the target visual object.
S240:利用从多个类型相同视觉对象中区分出特定视觉对象的目标分类模型,从所述第一图像区域和多个所述第二图像区域中确定所述目标视觉对象在所述当前视频帧的功能区域。S240: Using a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, determine, from the first image area and the plurality of second image areas, the functional area of the target visual object in the current video frame.
目标分类模型的获得方式可以是:在所述待处理视频经过目标追踪的视频帧中获取所述目标视觉对象对应跟踪视觉目标样本,以及除所述目标视觉对象外具有所述类别信息的视觉对象对应干扰视觉目标样本;将所述跟踪视觉目标样本作为正样本,所述干扰视觉目标样本作为负样本,对预先搭建的分类器进行多次训练,获得从具有相同类别信息的多个视觉对象中区分出特定视觉对象的所述目标分类模型。The target classification model may be obtained as follows: from the video frames of the video to be processed that have undergone target tracking, obtain tracking visual target samples corresponding to the target visual object, and interfering visual target samples corresponding to visual objects other than the target visual object that have the category information; then train a pre-built classifier multiple times with the tracking visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from among multiple visual objects having the same category information.
分类器可以是最近邻分类器,决策树、支持向量机分类器等。The classifier can be a nearest neighbor classifier, a decision tree, a support vector machine classifier, etc.
本发明再一种实施例提出再一种视觉目标跟踪方法,本发明再一种实施例中视觉目标跟踪方法包括周期性提取待处理视频的当前视频帧,对第N次提取出的当前视频帧进行视觉对象追踪,在确定目标视觉对象过程中,获得同类别的视觉对象,添加到干扰视觉目标样本列表,将在第N次提取出的当前视频帧识别出的目标视觉对象添加到跟踪视觉目标样本列表。采用跟踪视觉目标样本列表中的元素作为正样本,采用干扰视觉目标样本列表中的元素作为负样本,对预先搭建的分类器进行多次训练,得到目标分类模型,以在对第N+1次提取出的当前视频帧进行视觉对象追踪过程中,对同类视觉对象进行分类;或者对在第N-1次提取出的当前视频帧进行视觉对象追踪过程中使用的目标分类模型进行训练,得到在对第N+1次提取出的当前视频帧进行视觉对象追踪过程中,对同类视觉对象进行分类的目标分类模型。Yet another embodiment of the present invention provides a further visual target tracking method, which includes periodically extracting the current video frame of the video to be processed and performing visual object tracking on the Nth extracted current video frame. In the process of determining the target visual object, visual objects of the same category are obtained and added to the interfering visual target sample list, and the target visual object identified in the Nth extracted current video frame is added to the tracking visual target sample list. Using the elements of the tracking visual target sample list as positive samples and the elements of the interfering visual target sample list as negative samples, a pre-built classifier is trained multiple times to obtain a target classification model for classifying same-category visual objects during visual object tracking on the (N+1)th extracted current video frame; alternatively, the target classification model used during visual object tracking on the (N-1)th extracted current video frame is further trained, to obtain the target classification model used to classify same-category visual objects during visual object tracking on the (N+1)th extracted current video frame.
再一种视觉目标跟踪方法包括将基于当前视频帧得到的当前视频帧的功能区域添加到跟踪视觉目标样本列表,将除所述目标视觉对象外具有所述类别信息的视觉对象所在图像区域添加到干扰视觉目标样本列表。从干扰视觉目标样本列表中获取训练目标分类模型的负样本,从跟踪视觉目标样本列表中获取训练目标分类模型的正样本。This further visual target tracking method includes adding the functional area obtained from the current video frame to the tracking visual target sample list, and adding the image areas of visual objects other than the target visual object that have the category information to the interfering visual target sample list. Negative samples for training the target classification model are obtained from the interfering visual target sample list, and positive samples are obtained from the tracking visual target sample list.
将所述目标视觉对象在所述当前视频帧的功能区域添加到跟踪视觉目标样本列表;获得在所述第一图像区域和多个所述第二图像区域组成的集合中删除所述目标视觉对象在所述当前视频帧的功能区域后的第三图像区域,添加到干扰视觉目标样本列表;Add the functional area of the target visual object in the current video frame to the tracking visual target sample list; obtain the third image areas remaining after the functional area of the target visual object in the current video frame is removed from the set consisting of the first image area and the plurality of second image areas, and add them to the interfering visual target sample list;
在所述待处理视频经过目标追踪的视频帧中获取所述目标视觉对象对应跟踪视觉目标样本,以及除所述目标视觉对象外具有所述类别信息的视觉对象对应干扰视觉目标样本,包括:从所述跟踪视觉目标样本列表中获取所述跟踪视觉目标样本,从所述干扰视觉目标样本列表中获取所述干扰视觉目标样本。Obtaining, from the video frames of the video to be processed that have undergone target tracking, the tracking visual target samples corresponding to the target visual object and the interfering visual target samples corresponding to visual objects other than the target visual object that have the category information includes: obtaining the tracking visual target samples from the tracking visual target sample list, and obtaining the interfering visual target samples from the interfering visual target sample list.
再一种视觉目标跟踪方法在第N次执行视觉目标跟踪方法过程中,包括步骤:In the Nth execution of this further visual target tracking method, the method includes the following steps:
K101:将所述待处理视频的第N帧图像输入预先设定的目标跟踪模型,标记基于所述待处理视频的关键视频帧确定的目标视觉对象在所述待处理视频的第N帧图像对应第一图像区域A。K101: Input the Nth frame image of the video to be processed into the preset target tracking model, and mark the target visual object, determined based on the key video frame of the video to be processed, as corresponding to first image area A in the Nth frame image of the video to be processed.
K102:对所述待处理视频的第N帧图像进行目标检测,识别出多个类型与所述目标视觉对象相同的视觉对象对应第二图像区域B、C、D。 K102: Perform target detection on the Nth frame image of the video to be processed, and identify multiple visual objects of the same type as the target visual object corresponding to the second image areas B, C, and D.
K103:利用从多个类型相同视觉对象中区分出特定视觉对象的目标分类模型,从所述第一图像区域和多个所述第二图像区域中确定所述目标视觉对象在所述待处理视频的第N帧图像的功能区域B。K103: Using a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, determine, from the first image area and the plurality of second image areas, functional area B of the target visual object in the Nth frame image of the video to be processed.
K104:将图像区域A、C、D添加到干扰视觉目标样本列表,干扰视觉目标样本列表D={D1,D2,...,DL},干扰视觉目标样本列表包括元素D1,D2,...,DL,其中D1,D2,...,DL中包括基于待处理视频的第N帧图像获得的类型与目标视觉对象相同的其他视觉对象所在图像区域:A、C、D,以及基于第N-i帧图像获得的类型与目标视觉对象相同的其他视觉对象所在图像区域,i小于N并大于等于1。K104: Add image areas A, C, and D to the interfering visual target sample list D = {D1, D2, ..., DL}. The elements D1, D2, ..., DL include the image areas of other visual objects of the same type as the target visual object obtained from the Nth frame image of the video to be processed (A, C, and D), as well as the image areas of other visual objects of the same type obtained from the (N-i)th frame images, where i is less than N and greater than or equal to 1.
K105:将图像区域B添加到跟踪视觉目标样本列表,跟踪视觉目标样本列表T={T1,T2,...,TM},跟踪视觉目标样本列表包括基于待处理视频的第N帧图像获得的目标视觉对象所在图像区域:B,以及基于第N-i帧图像获得的目标视觉对象所在图像区域,i小于N并大于等于1。K105: Add image area B to the tracking visual target sample list T = {T1, T2, ..., TM}. The tracking visual target sample list includes the image area of the target visual object obtained from the Nth frame image of the video to be processed (B), as well as the image areas of the target visual object obtained from the (N-i)th frame images, where i is less than N and greater than or equal to 1.
对待处理视频的历史图像帧执行视觉目标跟踪过程中,同样基于当时的当前视频帧确定了跟踪视觉目标样本和干扰视觉目标样本,并分别添加到了跟踪视觉目标样本列表和干扰视觉目标样本列表。While visual target tracking was performed on the historical image frames of the video to be processed, tracking visual target samples and interfering visual target samples were likewise determined from each then-current video frame and added to the tracking visual target sample list and the interfering visual target sample list, respectively.
K106:从跟踪视觉目标样本列表获取跟踪视觉目标样本作为正样本,从干扰视觉目标样本列表获取干扰视觉目标样本作为负样本,训练预先搭建的分类器,或者训练在第N-1次执行视觉目标跟踪过程中训练得到的目标分类模型,得到在第N+1次执行视觉目标跟踪方法过程中使用的目标分类模型。K106: Obtain tracking visual target samples from the tracking visual target sample list as positive samples and interfering visual target samples from the interfering visual target sample list as negative samples, and train the pre-built classifier, or further train the target classification model obtained during the (N-1)th execution of visual target tracking, to obtain the target classification model used during the (N+1)th execution of the visual target tracking method.
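Steps K104 and K105 above amount to per-frame list bookkeeping. A minimal sketch, with image regions represented by labels for illustration:

```python
def update_sample_lists(first_region, second_regions, functional_region,
                        tracking_list, interference_list):
    """Append this frame's functional area to the tracking list T, and the
    remaining candidate regions (e.g. A, C, D when B was chosen) to the
    interference list D, as in steps K104 and K105."""
    candidates = [first_region] + [r for r in second_regions
                                   if r != first_region]
    interference_list.extend(r for r in candidates if r != functional_region)
    tracking_list.append(functional_region)
```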
训练目标分类模型方式包括:Methods for training target classification models include:
对跟踪视觉目标样本和干扰视觉目标样本分别提取灰度、颜色、纹理、梯度直方图等多类特征,使用决策树作为分类器。Extract multiple types of features such as grayscale, color, texture, gradient histogram, etc. from the tracking visual target samples and interference visual target samples respectively, and use decision trees as classifiers.
使用基于CNN方法提取样本特征,使用支持向量机作为分类器。Use CNN-based method to extract sample features, and use support vector machine as a classifier.
使用灰度值平方差来定义两个样本之间的距离,使用最近邻分类器进行目标分类。The squared difference of gray values is used to define the distance between two samples, and the nearest neighbor classifier is used for target classification.
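The nearest-neighbour variant just described (squared grey-value difference as the inter-sample distance) can be sketched as follows; samples are assumed to be equal-length flat lists of grey values, an illustrative simplification of image patches:

```python
def sq_distance(a, b):
    """Sum of squared grey-value differences between two equal-size patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbor_label(query, positives, negatives):
    """Classify `query` as the tracked target (True) or an interferer (False)
    according to its nearest sample under the squared grey-value distance."""
    best_pos = min(sq_distance(query, p) for p in positives)
    best_neg = min(sq_distance(query, n) for n in negatives)
    return best_pos <= best_neg
```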
在将所述目标视觉对象在所述当前视频帧的功能区域添加到跟踪视觉目标样本列表之前,所述方法还包括:Before adding the target visual object in the functional area of the current video frame to the tracking visual target sample list, the method further includes:
获取所述当前视频帧的时间信息; Obtain the time information of the current video frame;
将所述目标视觉对象在所述当前视频帧的功能区域添加到跟踪视觉目标样本列表,包括:Add the functional area of the target visual object in the current video frame to the tracking visual target sample list, including:
将所述目标视觉对象在所述当前视频帧的功能区域添加到所述跟踪视觉目标样本列表的最后一位;所述跟踪视觉目标样本列表中的元素按照对应时间信息从大到小排列;Add the functional area of the target visual object in the current video frame to the last position of the tracking visual target sample list, where the elements in the tracking visual target sample list are arranged from largest to smallest according to their corresponding time information;
删除所述跟踪视觉目标样本列表中存储时间超过预设时间长度的元素或排列在所述跟踪视觉目标样本列表第一位的元素。Delete elements in the tracking visual target sample list whose storage time exceeds a preset time length or elements ranked first in the tracking visual target sample list.
针对干扰视觉目标样本列表的元素添加,也可以采用上述方法,在将除所述目标视觉对象外具有所述类别信息的视觉对象所在图像区域添加到干扰视觉目标样本列表过程中,可以获取时间信息,以在后续删除不适用的元素。The above method can also be used when adding elements to the interfering visual target sample list: in the process of adding the image areas of visual objects other than the target visual object that have the category information, the time information can be recorded so that inapplicable elements can be deleted later.
在添加跟踪视觉目标样本和干扰视觉目标样本过程中,可以同时记录样本的加入时间(图像的帧号,或者绝对时间),可以执行历史遗忘机制,例如:During the process of adding tracking visual target samples and interference visual target samples, the adding time of the sample (image frame number, or absolute time) can be recorded at the same time, and a history forgetting mechanism can be implemented, such as:
在第K帧,丢弃掉第K-J帧以前加入的样本。At the Kth frame, discard the samples added before the K-Jth frame.
设置集合的最大容量为L,在元素个数达到L的时候丢弃最早加入的样本。Set the maximum capacity of the collection to L, and discard the earliest added sample when the number of elements reaches L.
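Both forgetting rules above can be sketched with a time-ordered sample list, where `max_age` plays the role of J (samples from before frame K-J are dropped) and `max_len` the role of the capacity L:

```python
from collections import deque

class SampleList:
    """Time-ordered sample list with the two history-forgetting rules
    described above: an age limit of `max_age` frames and a capacity of
    `max_len` elements (oldest evicted first)."""
    def __init__(self, max_age, max_len):
        self.max_age = max_age
        self.samples = deque(maxlen=max_len)  # capacity rule: oldest evicted

    def add(self, frame_no, sample):
        self.samples.append((frame_no, sample))
        # Age rule: drop samples added before frame (frame_no - max_age).
        while self.samples and self.samples[0][0] < frame_no - self.max_age:
            self.samples.popleft()
```

The same structure can back both the tracking-sample list and the interfering-sample list.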
本发明实施例通过在待处理视频历史帧中采集的跟踪目标样本和干扰目标样本来训练目标分类模型,对当前视频帧的跟踪算法输出结果进行更新校正,使得算法不易被同类别目标干扰,对同类别目标遮挡/干扰具有很强的鲁棒性,显著提升了跟踪算法在实际场景中的可用性。The embodiments of the present invention train the target classification model with tracking target samples and interfering target samples collected from historical frames of the video to be processed, and use it to update and correct the output of the tracking algorithm on the current video frame, so that the algorithm is not easily interfered with by targets of the same category and is highly robust to occlusion/interference from same-category targets, significantly improving the usability of the tracking algorithm in real scenarios.
图3是本发明实施例提出的视觉目标跟踪装置的功能模块图,上述视觉目标跟踪装置设置在电子设备中,如图3所示,所述装置包括:Figure 3 is a functional module diagram of the visual target tracking apparatus proposed by an embodiment of the present invention. The above visual target tracking apparatus is provided in an electronic device. As shown in Figure 3, the apparatus includes:
标记模块31,用于将所述待处理视频的当前视频帧输入预先设定的目标跟踪模型,标记基于所述待处理视频的关键视频帧确定的目标视觉对象在所述当前视频帧对应第一图像区域;Marking module 31, configured to input the current video frame of the video to be processed into a preset target tracking model, and mark the target visual object determined based on the key video frame of the video to be processed when the current video frame corresponds to the first image area;
检测模块32,用于对所述当前视频帧进行目标检测,识别出多个类型与所述目标视觉对象相同的视觉对象对应第二图像区域;The detection module 32 is configured to perform target detection on the current video frame and identify multiple visual objects of the same type as the target visual object corresponding to the second image area;
分类模块33,用于利用从多个类型相同视觉对象中区分出特定视觉对象的目标分类模型,从所述第一图像区域和多个所述第二图像区域中确定所述目标视觉对象在所述当前视频帧的功能区域。The classification module 33 is configured to use a target classification model that distinguishes a specific visual object from multiple visual objects of the same type, to determine, from the first image area and the plurality of second image areas, the functional area of the target visual object in the current video frame.
The visual target tracking apparatus provided by the embodiment shown in Figure 3 can be used to execute the technical solutions of the method embodiments shown in Figures 1 and 2 of this specification; for its implementation principles and technical effects, reference may be made to the relevant descriptions in the method embodiments.
Optionally, the apparatus further includes:
an optimization module, configured to, when the functional area of the target visual object in the current video frame is inconsistent with the first image region, optimize the preset target tracking model using the functional area of the target visual object in the current video frame.
Optionally, the apparatus further includes:
an extraction module, configured to extract, in response to a trigger operation on the video to be processed, category information of the target visual object from a key video frame of the video to be processed;
The detection module is specifically configured to perform object detection on the current video frame according to the category information, and to identify second image regions corresponding to a plurality of visual objects whose recorded type is the same as that of the target visual object.
Optionally, the apparatus further includes a target classification model setting module, which includes:
a sample acquisition submodule, configured to obtain, from video frames of the video to be processed on which target tracking has been performed, tracked visual target samples corresponding to the target visual object, and interfering visual target samples corresponding to visual objects other than the target visual object that have the category information;
a training submodule, configured to train a pre-built classifier multiple times using the tracked visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from a plurality of visual objects having the same category information.
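As a concrete (and deliberately simplified) illustration of this positive/negative training step, the sketch below fits a nearest-centroid scorer on tracked-target features versus interfering-target features. The patent does not specify the classifier; the centroid rule and the plain-tuple feature vectors are assumptions for this example.

```python
# Toy stand-in for training the "pre-built classifier": tracked visual target
# samples serve as positives, interfering visual target samples as negatives.
# A nearest-centroid rule replaces whatever model a real implementation uses.

def train_target_classifier(positive_feats, negative_feats):
    def centroid(feats):
        dim = len(feats[0])
        return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]

    pos_c = centroid(positive_feats)
    neg_c = centroid(negative_feats)

    def score(feat):
        # Positive score: closer to the tracked target's centroid than to the
        # distractors' centroid; negative score: the opposite.
        sq_dist = lambda c: sum((a - b) ** 2 for a, b in zip(feat, c))
        return sq_dist(neg_c) - sq_dist(pos_c)

    return score
```

The returned `score` function plays the role of the target classification model used to re-rank the candidate regions of the current frame.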
Optionally, the target classification model setting module further includes:
a first adding submodule, configured to add the functional area of the target visual object in the current video frame to a tracked visual target sample list;
a second adding submodule, configured to obtain the third image regions that remain after the functional area of the target visual object in the current video frame is removed from the set formed by the first image region and the plurality of second image regions, and to add them to an interfering visual target sample list;
The sample acquisition submodule is specifically configured to obtain the tracked visual target samples from the tracked visual target sample list, and to obtain the interfering visual target samples from the interfering visual target sample list.
Optionally, the target classification model setting module further includes:
a time acquisition submodule, configured to obtain time information of the current video frame;
The first adding submodule includes:
an adding subunit, configured to append the functional area of the target visual object in the current video frame to the end of the tracked visual target sample list, the elements of which are arranged according to their corresponding time information from largest to smallest;
a deletion subunit, configured to delete elements of the tracked visual target sample list whose storage time exceeds a preset length of time, or the element ranked first in the tracked visual target sample list.
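The time-stamped sample-list bookkeeping above can be sketched as follows. The sketch keeps the oldest sample at the head of the list, so that the head-deletion rule removes the stalest entry; the tuple layout, the `max_age` expiry policy, and the capacity bound are assumptions for illustration.

```python
# Sketch of maintaining the tracked visual target sample list: each new sample
# carries the current frame's time information; samples stored longer than a
# preset length of time expire, and when the list exceeds its capacity the
# element at the head (the oldest) is dropped.
from collections import deque

def add_sample(samples, region, timestamp, max_age, max_len):
    samples.append((timestamp, region))
    # Delete elements whose storage time exceeds the preset length of time.
    while samples and timestamp - samples[0][0] > max_age:
        samples.popleft()
    # Delete the element ranked first when the list is over capacity.
    while len(samples) > max_len:
        samples.popleft()
    return samples
```

A `deque` is used here because both ends of the list are touched: new samples enter at the tail and expired or excess samples leave from the head.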
Optionally, the extraction module includes:
a response submodule, configured to mark, in response to a trigger operation on the video to be processed, the target visual object selected in a key video frame of the video to be processed;
a detection submodule, configured to perform category detection on the target visual object to obtain the category information of the target visual object.
Optionally, the extraction module further includes:
an identification submodule, configured to identify and display, in a key video frame of the video to be processed, a plurality of visual objects whose categories are associated with category information input by a user;
a receiving submodule, configured to, upon receiving a trigger signal for a visual object associated with the category information, take the visual object corresponding to the trigger signal as the target visual object and obtain the category information.
The apparatus provided by the embodiments shown above is used to execute the technical solutions of the method embodiments shown above; for its implementation principles and technical effects, reference may be made to the relevant descriptions in the method embodiments, which will not be repeated here.
The apparatus provided by the embodiments shown above may be, for example, a chip or a chip module. It is used to execute the technical solutions of the method embodiments shown above; for its implementation principles and technical effects, reference may be made to the relevant descriptions in the method embodiments, which will not be repeated here.
Each module/unit included in each apparatus described in the above embodiments may be a software module/unit, a hardware module/unit, or partly a software module/unit and partly a hardware module/unit. For example, for an apparatus applied to or integrated in a chip, all of its modules/units may be implemented by hardware such as circuits; alternatively, at least some of the modules/units may be implemented by a software program running on a processor integrated inside the chip, while the remaining modules/units are implemented by hardware such as circuits. For an apparatus applied to or integrated in a chip module, all of its modules/units may be implemented by hardware such as circuits, and different modules/units may be located in the same component (for example, a chip or a circuit module) of the chip module or in different components; alternatively, at least some of the modules/units may be implemented by a software program running on a processor integrated inside the chip module, while the remaining modules/units are implemented by hardware such as circuits. For an apparatus applied to or integrated in an electronic terminal device, all of its modules/units may be implemented by hardware such as circuits, and different modules/units may be located in the same component (for example, a chip or a circuit module) within the electronic terminal device or in different components; alternatively, at least some of the modules/units may be implemented by a software program running on a processor integrated inside the electronic terminal device, while the remaining modules/units (if any) are implemented by hardware such as circuits.
Figure 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. The electronic device 400 includes a processor 410, a memory 411, and a computer program stored in the memory 411 and executable on the processor 410. When the processor 410 executes the program, the steps in the foregoing method embodiments are implemented. The electronic device provided by this embodiment can be used to execute the technical solutions of the method embodiments shown above; for its implementation principles and technical effects, reference may be made to the relevant descriptions in the method embodiments, which will not be repeated here.
An embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the visual target tracking method provided by the embodiments shown in Figures 1 and 2 of this specification. A non-transitory computer-readable storage medium may refer to a non-volatile computer storage medium.
The above non-transitory computer-readable storage medium may be any combination of one or more computer-readable media. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), and the like, or any suitable combination of the above.
Computer program code for performing the operations of this specification may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.
In the description of the embodiments of the present invention, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this specification. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of different embodiments or examples, provided they are not mutually contradictory.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of this specification, "a plurality of" means at least two, for example two or three, unless otherwise expressly and specifically limited.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logical function or process, and the scope of the preferred embodiments of this specification includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of this specification belong.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
It should be noted that the terminals involved in the embodiments of the present invention may include, but are not limited to, personal computers (PCs), personal digital assistants (PDAs), wireless handheld devices, tablet computers, mobile phones, MP3 players, MP4 players, and the like.
In the several embodiments provided in this specification, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a division by logical function, and there may be other ways of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of this specification may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The above software functional unit is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of this specification. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely preferred embodiments of this specification and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this specification shall fall within its scope of protection.

Claims (11)

  1. A visual target tracking method, characterized in that the method comprises:
    inputting a current video frame of a video to be processed into a preset target tracking model, and marking a first image region corresponding, in the current video frame, to a target visual object determined from a key video frame of the video to be processed;
    performing object detection on the current video frame, and identifying second image regions corresponding to a plurality of visual objects of the same type as the target visual object; and
    determining, from the first image region and the plurality of second image regions, a functional area of the target visual object in the current video frame by using a target classification model that distinguishes a specific visual object from a plurality of visual objects of the same type.
  2. The method according to claim 1, characterized in that, after the functional area of the target visual object in the current video frame is determined from the first image region and the plurality of second image regions, the method further comprises:
    when the functional area of the target visual object in the current video frame is inconsistent with the first image region, optimizing the preset target tracking model using the functional area of the target visual object in the current video frame.
  3. The method according to claim 1, characterized in that the method further comprises:
    in response to a trigger operation on the video to be processed, extracting category information of the target visual object from the key video frame of the video to be processed;
    wherein performing object detection on the current video frame and identifying second image regions corresponding to a plurality of visual objects whose recorded type is the same as that of the target visual object comprises:
    performing object detection on the current video frame according to the category information, and identifying the second image regions corresponding to the plurality of visual objects whose recorded type is the same as that of the target visual object.
  4. The method according to claim 3, characterized in that the process of setting the target classification model comprises:
    obtaining, from video frames of the video to be processed on which target tracking has been performed, tracked visual target samples corresponding to the target visual object, and interfering visual target samples corresponding to visual objects other than the target visual object that have the category information; and
    training a pre-built classifier multiple times using the tracked visual target samples as positive samples and the interfering visual target samples as negative samples, to obtain the target classification model that distinguishes a specific visual object from a plurality of visual objects having the same category information.
  5. The method according to claim 4, characterized in that, after the functional area of the target visual object in the current video frame is determined, the method further comprises:
    adding the functional area of the target visual object in the current video frame to a tracked visual target sample list; and
    obtaining the third image regions that remain after the functional area of the target visual object in the current video frame is removed from the set formed by the first image region and the plurality of second image regions, and adding them to an interfering visual target sample list;
    wherein obtaining, from the video frames of the video to be processed on which target tracking has been performed, the tracked visual target samples corresponding to the target visual object and the interfering visual target samples corresponding to the visual objects other than the target visual object that have the category information comprises:
    obtaining the tracked visual target samples from the tracked visual target sample list, and obtaining the interfering visual target samples from the interfering visual target sample list.
  6. The method according to claim 5, characterized in that, before the functional area of the target visual object in the current video frame is added to the tracked visual target sample list, the method further comprises:
    obtaining time information of the current video frame;
    wherein adding the functional area of the target visual object in the current video frame to the tracked visual target sample list comprises:
    appending the functional area of the target visual object in the current video frame to the end of the tracked visual target sample list, the elements of which are arranged according to their corresponding time information from largest to smallest; and
    deleting elements of the tracked visual target sample list whose storage time exceeds a preset length of time, or the element ranked first in the tracked visual target sample list.
  7. The method according to claim 3, characterized in that extracting the category information of the target visual object from the key video frame of the video to be processed in response to a trigger operation on the video to be processed comprises:
    in response to the trigger operation on the video to be processed, marking the target visual object selected in the key video frame of the video to be processed; and
    performing category detection on the target visual object to obtain the category information of the target visual object.
  8. The method according to claim 3, characterized in that extracting the category information of the target visual object from the key video frame of the video to be processed in response to a trigger operation on the video to be processed comprises:
    identifying and displaying, in the key video frame of the video to be processed, a plurality of visual objects whose categories are associated with category information input by a user; and
    upon receiving a trigger signal for a visual object associated with the category information, taking the visual object corresponding to the trigger signal as the target visual object and obtaining the category information.
  9. A visual target tracking apparatus, characterized in that the apparatus comprises:
    a marking module, configured to input a current video frame of a video to be processed into a preset target tracking model, and to mark a first image region corresponding, in the current video frame, to a target visual object determined from a key video frame of the video to be processed;
    a detection module, configured to perform object detection on the current video frame and identify second image regions corresponding to a plurality of visual objects of the same type as the target visual object; and
    a classification module, configured to use a target classification model that distinguishes a specific visual object from a plurality of visual objects of the same type to determine, from the first image region and the plurality of second image regions, a functional area of the target visual object in the current video frame.
  10. An electronic device, comprising:
    at least one processor; and
    at least one memory communicatively connected to the processor, characterized in that
    the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to execute the method according to any one of claims 1 to 8.
  11. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions cause a computer to execute the method according to any one of claims 1 to 8.
PCT/CN2023/106311 2022-07-11 2023-07-07 Visual-target tracking method and apparatus, and device and storage medium WO2024012367A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210813372.0 2022-07-11
CN202210813372.0A CN115393755A (en) 2022-07-11 2022-07-11 Visual target tracking method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2024012367A1 true WO2024012367A1 (en) 2024-01-18

Family

ID=84117176

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/106311 WO2024012367A1 (en) 2022-07-11 2023-07-07 Visual-target tracking method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN115393755A (en)
WO (1) WO2024012367A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115393755A (en) * 2022-07-11 2022-11-25 影石创新科技股份有限公司 Visual target tracking method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104599287A (en) * 2013-11-01 2015-05-06 株式会社理光 Object tracking method and device and object recognition method and device
CN110349176A (en) * 2019-06-28 2019-10-18 华中科技大学 Method for tracking target and system based on triple convolutional networks and perception interference in learning
CN112639815A (en) * 2020-03-27 2021-04-09 深圳市大疆创新科技有限公司 Target tracking method, target tracking apparatus, movable platform, and storage medium
US20220366576A1 (en) * 2020-01-06 2022-11-17 Shanghai Sensetime Lingang Intelligent Technology Co., Ltd. Method for target tracking, electronic device, and storage medium
CN115393755A (en) * 2022-07-11 2022-11-25 影石创新科技股份有限公司 Visual target tracking method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN115393755A (en) 2022-11-25

Similar Documents

Publication Publication Date Title
US20220051061A1 (en) Artificial intelligence-based action recognition method and related apparatus
WO2022127180A1 (en) Target tracking method and apparatus, and electronic device and storage medium
EP3467707A1 (en) System and method for deep learning based hand gesture recognition in first person view
CN109325440B (en) Human body action recognition method and system
WO2020211624A1 (en) Object tracking method, tracking processing method, corresponding apparatus and electronic device
CN111582185B (en) Method and device for recognizing images
WO2021103868A1 (en) Method for structuring pedestrian information, device, apparatus and storage medium
WO2018063608A1 (en) Place recognition algorithm
WO2024012367A1 (en) Visual-target tracking method and apparatus, and device and storage medium
CN113378770B (en) Gesture recognition method, device, equipment and storage medium
CN112527113A (en) Method and apparatus for training gesture recognition and gesture recognition network, medium, and device
CN111767831B (en) Method, apparatus, device and storage medium for processing image
CN103105924A (en) Man-machine interaction method and device
CN111967433A (en) Action identification method based on self-supervision learning network
CN112150457A (en) Video detection method, device and computer readable storage medium
CN111127837A (en) Alarm method, camera and alarm system
CN111340213B (en) Neural network training method, electronic device, and storage medium
CN115527269A (en) Intelligent human body posture image identification method and system
CN111783674A (en) Face recognition method and system based on AR glasses
WO2024012371A1 (en) Target tracking method and apparatus, and device and storage medium
US8498978B2 (en) Slideshow video file detection
CN106934339B (en) Target tracking and tracking target identification feature extraction method and device
CN117372928A (en) Video target detection method and device and related equipment
CN115482436B (en) Training method and device for image screening model and image screening method
CN111571567A (en) Robot translation skill training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23838859

Country of ref document: EP

Kind code of ref document: A1