CN110610510A - Target tracking method and device, electronic equipment and storage medium - Google Patents

Target tracking method and device, electronic equipment and storage medium

Info

Publication number
CN110610510A
CN110610510A (application CN201910810400.1A)
Authority
CN
China
Prior art keywords
frame
target
detection
detection data
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910810400.1A
Other languages
Chinese (zh)
Other versions
CN110610510B (en)
Inventor
许明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910810400.1A priority Critical patent/CN110610510B/en
Publication of CN110610510A publication Critical patent/CN110610510A/en
Application granted granted Critical
Publication of CN110610510B publication Critical patent/CN110610510B/en
Current legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/18: Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20024: Filtering details

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a target tracking method, a target tracking device, an electronic device, and a storage medium, belonging to the technical field of image processing. The method comprises the following steps: acquiring a plurality of frame images of a video to be tracked, and acquiring a plurality of key frames from the plurality of frame images; inputting the plurality of key frames into a detection network to obtain detection data, wherein the detection data comprises position information of a target object; and tracking the target object in the plurality of frame images by using the detection data and a kernel correlation filtering algorithm. When the plurality of frame images are acquired, the plurality of key frames can be acquired from them; the introduction of key frames effectively filters out blurred images from the plurality of frame images, which can improve the accuracy of target tracking to a certain extent.

Description

Target tracking method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a target tracking method and apparatus, an electronic device, and a storage medium.
Background
Target tracking technology has developed rapidly in recent years and remains an important subject and active research topic. Target tracking estimates the position and the area occupied by the shape of a tracked target in a continuous sequence of video images, determines motion information of the target such as its speed, direction, and trajectory, and thereby enables analysis and understanding of the moving target's motion in support of higher-level tasks. Clearly, target tracking is one of the core problems of video analysis, with wide application in fields such as video surveillance, human-computer interaction, and virtual reality. Although research on the technology has advanced greatly over the past decades, because of factors such as scale changes, illumination changes, fast motion, partial occlusion, and complex backgrounds, finding a tracking method that is fast, stable, and adaptable to changes in the object remains a challenge.
Disclosure of Invention
In view of the above, the present application provides a target tracking method, an apparatus, an electronic device, and a storage medium to solve the above problems.
In a first aspect, an embodiment of the present application provides a target tracking method, which is applied to an electronic device, and the method includes: acquiring a plurality of frame images of a video to be tracked, and acquiring a plurality of key frames from the plurality of frame images; inputting the plurality of key frames into a detection network to obtain detection data, wherein the detection data comprises position information of a target object; and tracking the target object in the plurality of frame images by using the detection data and a kernel correlation filtering algorithm.
In a second aspect, an embodiment of the present application provides a target tracking apparatus, which is applied to an electronic device. The apparatus comprises: a key frame acquisition module, a detection data acquisition module, and a target tracking module. The key frame acquisition module is used for acquiring a plurality of frame images of a video to be tracked and acquiring a plurality of key frames from the plurality of frame images. The detection data acquisition module is used for inputting the plurality of key frames into a detection network to obtain detection data, the detection data comprising position information of the target object. The target tracking module is used for tracking the target object in the plurality of frame images by using the detection data and a kernel correlation filtering algorithm.
In a third aspect, an embodiment of the present application further provides an electronic device, including one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the above-described method.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, and the program code can be called by a processor to execute the above method.
Compared with the prior art, the target tracking method, apparatus, electronic device, and storage medium provided by the embodiments of the application can acquire a plurality of frame images of a video to be tracked, acquire a plurality of key frames from the frame images, and input the plurality of key frames into a detection network to obtain detection data, where the detection data can include position information of a target object; finally, the target object in the plurality of frame images is tracked by using the acquired detection data and a kernel correlation filtering algorithm. After the plurality of frame images are acquired, they are not input directly into the detection network; instead, the plurality of key frames are first acquired from the frame images and then input into the detection network, which effectively filters out blurred images and improves the accuracy of target tracking to a certain extent.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 illustrates an overall framework diagram of object tracking provided by embodiments of the present application;
FIG. 2 illustrates a method flow diagram of a target tracking method provided by an embodiment of the present application;
FIG. 3 illustrates a method flow diagram of a target tracking method provided by another embodiment of the present application;
FIG. 4 shows a detailed flowchart of step S302 in the target tracking method provided by another embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a target tracking method according to another embodiment of the present application for obtaining a plurality of keyframes;
FIG. 6 illustrates a method flow diagram of a target tracking method provided by yet another embodiment of the present application;
FIG. 7 illustrates a block diagram of a target tracking device provided by an embodiment of the present application;
FIG. 8 shows a block diagram of an electronic device provided by an embodiment of the present application;
FIG. 9 shows a storage unit for storing or carrying program code that implements the target tracking method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
With the continuous development of computer technology, people's requirements for smart cameras keep rising, and different smart-camera functions have been developed accordingly. For example, with features such as Baidu's Baidu Lens and Huawei's intelligent recognition, a smart camera can display detailed information about the objects on its screen in real time. Generally, multiple objects are displayed on the screen, and in order to display the tag information of each object intuitively, each tag must be able to follow its object in real time, which imposes a tracking requirement on the electronic device. Existing target tracking methods generally use a CNN detection network to give the position of the target object, so during tracking the tag easily jitters around the object or gets stuck. Although some target tracking methods solve these problems, the cost of the improved accuracy is usually reduced real-time performance of the tracking.
Therefore, in order to overcome the above defects, the present application provides the overall framework of target tracking shown in fig. 1. As can be seen from fig. 1, target tracking can be divided into two parts: the first part acquires the data to be filtered, and the second part acquires the detection frame. That is, the application can acquire both the data to be filtered and the detection frame corresponding to the original image, and track the target object by combining the two. Specifically, the original image may be input to an initialization extractor, which performs feature extraction on the original image.
The initialization extractor in the embodiments of the present application may include a target detection network, which may be a MobileNet-SSD (a deep convolutional neural network), GOTURN (Generic Object Tracking Using Regression Networks), or the like. After initialization-time feature extraction is performed, feature data of the tracking target can be acquired and stored in a database (DB) of the electronic device, and the target can then be tracked using this feature data. In addition, as can be seen from fig. 1, during target tracking the application may continuously acquire new detection frames by using the update extractor, and may thereby continuously update the feature data stored in the database.
As shown in fig. 2, an embodiment of the present application provides a target tracking method applied to an electronic device, and the method may include steps S201 to S203.
Step S201: the method comprises the steps of obtaining a plurality of frame images of a video to be tracked, and obtaining a plurality of key frames from the plurality of frame images.
In an embodiment of the present application, the electronic device may be a mobile phone, a notebook computer, a tablet computer, a palmtop computer, a personal digital assistant (PDA), a mobile Internet device (MID), a wearable device (e.g., a smart watch such as an iWatch, a smart bracelet, or a pedometer), or another electronic device on which an instant messaging application client can be installed and deployed.
In an embodiment, after the video to be tracked is acquired, it may be subjected to framing processing to obtain a plurality of frame images, and the related parameters of the video to be tracked are stored. The related parameters may include: the video duration, the video coding format, the total number of video frames, the recording time point, the exposure time, the frame rate, the latitude and longitude of the recording location, and the like. In the embodiment of the present application, the acquisition of the frame images may start from the first frame of the video and proceed frame by frame in order, with the corresponding state and parameters saved after each frame image is acquired. For example, acquiring frame images from a video file requires saving the file name of the video file, the corresponding decoding type, and so on; the specific way in which the frame images of the video to be tracked are acquired is not limited here.
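As a concrete illustration of this framing step, the sketch below uses OpenCV to split a video file into frame images while recording related parameters such as the frame rate and total frame count. The use of OpenCV and the parameter names are assumptions for illustration, not choices specified by the patent.

```python
import cv2

def split_video_into_frames(video_path):
    """Read a video frame by frame and record its related parameters."""
    cap = cv2.VideoCapture(video_path)   # decoding backend chosen by OpenCV
    params = {
        "file_name": video_path,                          # saved for later decoding
        "fps": cap.get(cv2.CAP_PROP_FPS),                 # video frame rate
        "total_frames": int(cap.get(cv2.CAP_PROP_FRAME_COUNT)),
    }
    frames = []
    while True:
        ok, frame = cap.read()   # frame-by-frame, starting from the first frame
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames, params
```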
When the electronic device acquires the plurality of frame images of the video to be tracked, a plurality of key frames can be acquired from them, where the key frames may be any number of images selected at random from the frame images, or images selected according to a certain rule. The embodiment of the application can select the key frames by comparing the pixel values of the frame images, the pixel values of the key frames being higher than those of ordinary frame images; key frames can also be selected by comparing the brightness value of each frame image, i.e., the key frames have higher brightness values; alternatively, the distortion value, equalization value, or resolution of each frame image may be used to acquire the key frames. Clearly, a key frame mainly refers to an image of better quality among the frame images. In the embodiment of the present application, the key frames may be obtained by any one of the above methods or by combining them; a detailed description of how the key frames are obtained is omitted here.
Step S202: and inputting the plurality of key frames into a detection network to obtain detection data, wherein the detection data comprises the position information of the target object.
In an embodiment, the electronic device may input the acquired key frames into a detection network and perform feature extraction with it to obtain detection data, where the detection data may include position information of the target object. In the embodiment of the application, the detection network may be a target detection network, also called a target extraction network, whose main function is to segment the target object from the target image and identify it. Commonly used target detection networks include the GOTURN network, the MobileNet-SSD deep convolutional neural network, the Faster R-CNN neural network, the YOLO neural network, and the SPP-Net (Spatial Pyramid Pooling) neural network; the neural networks commonly used for video tracking are GOTURN and MobileNet-SSD.
The GOTURN neural network is a target tracking algorithm trained offline with a convolutional neural network. It extracts features with a CNN classification network pre-trained on an existing large-scale classification data set and classifies with an observation model to obtain the tracking result. The tracking process obtains the feature representations of the current frame image and the previous frame image through VGG-Net (Very Deep Convolutional Networks for Large-Scale Image Recognition), and then obtains the position of the target in the current frame through an offline-trained decision network. The GOTURN neural network tracks a complete target well in simple scenes and is strongly robust to scale changes, deformation, and the like.
The MobileNet-SSD deep convolutional neural network is based on a streamlined architecture and builds a lightweight deep neural network mainly through depthwise separable convolutions; that is, it is primarily a lightweight deep network model designed for mobile terminals. The MobileNet-SSD deep convolutional neural network includes two simple global hyperparameters that effectively trade off latency against accuracy, allowing the model builder to select a model of appropriate size according to the constraints of the problem, with slightly reduced accuracy but better real-time performance.
In one embodiment, the MobileNet-SSD deep convolutional neural network can adopt a feature pyramid for target detection. During detection, MobileNet (the base network) can be used as a feature extractor to extract several feature maps of different sizes, and Softmax classification and position regression are performed simultaneously on these feature maps. For a feature map with width W and height H, W x H prior boxes (preselected boxes) are set, with the centers of the prior boxes distributed uniformly over the whole map; the accurate positions of the targets are then acquired through Softmax classification and bounding box regression. In addition, the MobileNet-SSD deep convolutional neural network can run on a Neural Compute Stick, a USB deep learning device slightly larger than a USB flash drive; it consumes 1 W of power while reaching about 100 GFLOPS of floating-point performance, giving it the advantages of low power consumption and high performance.
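The uniform distribution of preselected-box centers described above can be sketched as follows. This is a schematic of the general SSD prior-box idea rather than code from the patent; box sizes and aspect ratios are omitted for brevity.

```python
def prior_box_centers(feature_w, feature_h):
    """Place one prior-box center per cell of a W x H feature map,
    uniformly distributed over the whole (normalized) image."""
    centers = []
    for j in range(feature_h):
        for i in range(feature_w):
            cx = (i + 0.5) / feature_w   # normalized x in [0, 1]
            cy = (j + 0.5) / feature_h   # normalized y in [0, 1]
            centers.append((cx, cy))
    return centers   # W * H centers, one per feature-map cell
```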
Compared with the GOTURN neural network, the MobileNet-SSD deep convolutional neural network offers higher detection precision, stronger robustness, and better real-time performance during target tracking; that is, it outputs detection frames with higher precision and at a higher speed. It can also preprocess the image according to the detection result and pass the preprocessed image to a next-stage classification model. Therefore, the embodiment of the application uses the MobileNet-SSD deep convolutional neural network as the detection network, mainly because it balances the speed and the accuracy of target detection well.
In one embodiment, the MobileNet-SSD deep convolutional neural network may include four parts: the first part is the input layer, mainly used for inputting images, i.e., the key frames; the second part is an improved MobileNet convolutional network, mainly used for extracting features of the input key frames; the third part is the SSD meta-structure, used for classification regression and boundary regression; the fourth part is the output layer, mainly used for outputting the detection result. The SSD structure is a regression model that performs classification regression and bounding-box regression using features output by different convolutional layers; it relieves the contradiction between translation invariance and translation variability well and strikes a good compromise between detection precision and speed, improving detection precision while also improving detection speed.
The process of obtaining the detection data with the MobileNet-SSD deep convolutional neural network is as follows: acquire the first key frame, acquire the target detection frame corresponding to the target object in that key frame, and store the position information corresponding to the target detection frame as detection data; then read the next key frame, acquire the target detection frame corresponding to the target object in it, and store the corresponding position information as detection data; repeat these steps until the target detection frame corresponding to the target object in every key frame has been acquired.
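A hedged sketch of this key-frame detection loop is given below, using the publicly available Caffe release of MobileNet-SSD through OpenCV's dnn module. The model file names, the 300x300 input size, and the scale/mean values are the commonly used defaults for that public model and are assumptions here, not values specified by the patent.

```python
import cv2

# Assumed file names of the public Caffe MobileNet-SSD release.
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")

def detect_key_frame(frame):
    """Return (class_id, confidence, box) tuples for one key frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, scalefactor=0.007843,
                                 size=(300, 300), mean=127.5)
    net.setInput(blob)
    out = net.forward()              # shape (1, 1, N, 7)
    results = []
    for det in out[0, 0]:
        class_id, conf = int(det[1]), float(det[2])
        box = det[3:7] * [w, h, w, h]   # scale normalized coords to pixels
        results.append((class_id, conf, box.astype(int)))
    return results

key_frames = []      # filled by the key frame extractor of step S201
detection_db = []    # plays the role of the database (DB)
for key_frame in key_frames:
    detection_db.append(detect_key_frame(key_frame))
```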
Step S203: and tracking the target object in the plurality of frame images by using the detection data and a kernel correlation filtering algorithm.
In one embodiment, once the detection data has been acquired and stored in the database, the target object in the frame images may be tracked using a Kernel Correlation Filter (KCF), a target tracking algorithm based on an online learning model. The KCF tracking process can be divided into four steps: target feature extraction, template training, target position prediction, and template updating. KCF adopts a ridge regression model; ridge regression is a linear regression model with a simple closed-form solution, yet its classification effect is comparable to that of more complex models such as SVMs.
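For reference, the closed-form solution mentioned above can be written out explicitly. This is the standard ridge regression / KCF formulation from the literature, not notation taken from the patent itself.

```latex
% Ridge regression objective and its closed-form solution:
\min_{w}\ \lVert Xw - y \rVert^2 + \lambda \lVert w \rVert^2
\quad\Longrightarrow\quad
w = (X^{\top}X + \lambda I)^{-1} X^{\top} y
% Kernelized (dual) form with kernel matrix K:
\alpha = (K + \lambda I)^{-1} y
% KCF exploits the circulant structure of shifted samples to solve this
% element-wise in the Fourier domain:
\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda}
```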
The specific process of tracking the target object in the frame images using the detection data and the kernel correlation filtering algorithm is as follows: the position information of the target object contained in the detection data is passed to the kernel correlation filtering algorithm, which tracks the detected target object to obtain its motion trajectory. The kernel correlation filtering algorithm is a discriminative tracking method: it trains a target detector during tracking, uses the target detector to check whether the predicted position in the next key frame is the target, and then updates the target detector with the new detection result. When the target detector is trained, the target area is generally selected as a positive sample and the areas around the target as negative samples, with areas closer to the target having a higher probability of being positive samples.
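A minimal sketch of handing the detected position to a kernel correlation filter is shown below, using the KCF tracker from opencv-contrib as a stand-in for the patent's own implementation. Depending on the OpenCV build, the constructor is cv2.TrackerKCF_create() or cv2.legacy.TrackerKCF_create(), and the initial bounding box is assumed to come from the detection data above.

```python
import cv2

def track_with_kcf(frames, init_box):
    """Track one target through a list of frames.

    init_box: (x, y, w, h) from the detection data, used to initialize KCF.
    Returns the per-frame box positions, i.e. the motion trajectory.
    """
    tracker = cv2.TrackerKCF_create()   # cv2.legacy.TrackerKCF_create() on some builds
    tracker.init(frames[0], init_box)
    trajectory = [init_box]
    for frame in frames[1:]:
        ok, box = tracker.update(frame)   # predict the position in this frame
        if ok:
            trajectory.append(tuple(int(v) for v in box))
    return trajectory
```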
According to the target tracking method described above, after the frame images of the video to be tracked are acquired, the key frames can be acquired from them. Since the overall quality of the key frames is better than that of the frame images as a whole, inputting the key frames into the detection network makes the acquired detection data more accurate, i.e., the efficiency and accuracy of target detection can be improved to a certain extent. Meanwhile, tracking the target object in the frame images with the kernel correlation filtering algorithm can meet the requirements of stable and effective tracking.
Another embodiment of the present application provides a target tracking method applied to an electronic device, please refer to fig. 3, where the target tracking method may include steps S301 to S304.
Step S301: the method comprises the steps of obtaining a plurality of frame images of a video to be tracked, and dividing the plurality of frame images into N image groups.
As described above, when a video to be tracked is acquired, the electronic device may perform framing processing on it to obtain a plurality of frame images. To improve the accuracy of target tracking, the embodiment of the present application may acquire a fixed number of key frames from the frame images, so the frame images need to be divided into N image groups, with each image group containing the same number of frame images. For example, a video to be tracked is acquired and framed into 14400 frame images, which can be divided into 720 image groups, i.e., each image group contains 20 frame images.
Step S302: and acquiring a key frame from each image group to obtain a plurality of key frames.
In an embodiment, after dividing a plurality of frame images of a video to be tracked into N image groups, one key frame may be obtained from each of the image groups, so that a plurality of key frames may be obtained. For example, after 14400 frame images can be divided into 720 image groups, one key frame can be obtained from each image group, that is, one key frame is selected from 20 frame images, and finally 720 key frames can be obtained. The specific steps of obtaining a plurality of key frames from each of the image groups are shown in fig. 4, and it can be known from fig. 4 that obtaining a plurality of key frames may include steps S401 to S405.
Step S401: an alternate frame is obtained from the current group of images.
In one embodiment, the first frame image in the current image group may be used as the candidate frame, or one frame image may be randomly acquired from the current image group as the candidate frame. In a specific implementation, each image group includes 20 frame images, in the embodiment of the present application, a first frame of the 20 frame images may be used as a candidate frame, or a tenth frame of the 20 frame images may be used as a candidate frame, and specifically which frame image in the current image group is used as a candidate frame is not specifically limited here.
Step S402: and judging whether the alternative frame is a clear image.
In one embodiment, determining whether the candidate frame is a clear image may include: acquiring the pixel mean value of the candidate frame and judging whether it is greater than a preset threshold; when the pixel mean value is greater than the preset threshold, the candidate frame is a clear image, and the process proceeds to step S403. The pixel mean value is computed by taking the mean of all R-channel, G-channel, and B-channel pixels in the candidate frame and then averaging the three channel means; the resulting value is the pixel mean value. When the pixel mean value is less than or equal to the preset threshold, it may be determined that the candidate frame is not a clear image, i.e., it is a blurred image; in that case the next frame image after the candidate frame may be acquired, i.e., the process proceeds to step S404.
In a specific embodiment, the current image group may include 20 frame images. The first of the 20 frame images is first used as the candidate frame, its pixel mean value is acquired, and whether the pixel mean value is greater than the preset threshold is judged. If so, the first frame image is a clear image and may be used as the key frame; if the pixel mean value is less than or equal to the preset threshold, the second of the 20 frame images is acquired and judged for clarity, and so on until a clear image is obtained. In addition, in the embodiment of the present application, whether the candidate frame is blurred may be determined by Laplacian edge detection, by the Brenner sharpness function, or by the Tenengrad gradient function; which method is used to judge the sharpness of the candidate frame is not limited here.
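The two checks discussed in this step can be sketched as below: the pixel-mean test against a preset threshold, and a Laplacian-based alternative that scores edge strength. The threshold value is an illustrative assumption, not a value given by the patent.

```python
import cv2

def pixel_mean(frame):
    """Average the R, G, and B channel means, as described above."""
    channel_means = frame.reshape(-1, 3).mean(axis=0)   # per-channel mean (3-channel image)
    return float(channel_means.mean())

def is_sharp(frame, threshold=100.0):
    """The candidate frame counts as a clear image if its pixel mean
    exceeds the preset threshold (threshold value is illustrative)."""
    return pixel_mean(frame) > threshold

def laplacian_sharpness(frame):
    """Alternative check: variance of the Laplacian; larger means sharper."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()
```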
Step S403: and taking the alternative frame as a key frame of the current image group.
If the candidate frame is determined to be a clear image, the candidate frame is used as a key frame of the current image group, after which the next image group can be acquired, i.e. the process proceeds to step S405.
Step S404: and acquiring a next frame image of the alternative frame, and taking the next frame image as a new alternative frame.
If the candidate frame is not a clear image, acquiring a next frame image of the candidate frame, taking the next frame image as a new candidate frame, and returning to execute the operation of judging whether the candidate frame is a clear image, namely, entering step S402. For example, after a first frame image of a first image group is acquired, it is determined that the first frame image is a blurred image, at this time, a second frame image of the first image group may be acquired, and the second frame image is used as a candidate frame to determine whether a pixel mean value of the second frame image is greater than a preset threshold, if the pixel mean value of the second frame image is still less than or equal to the preset threshold, a third frame image of the first image group is acquired, and it is determined whether a pixel mean value of the third frame image is greater than the preset threshold, and the steps are sequentially performed until a key frame meeting a condition is acquired.
It should be noted that, if the pixel mean values of all the frame images in the current image group are less than or equal to the preset threshold, the pixel mean values of the frame images may be sorted, and the frame image with the largest pixel mean value is selected as the key frame of the current image group; the last frame image in the current image group may also be directly used as the key frame of the current image group, or the key frame may not be acquired from the current image group, and the key frame may continue to be acquired from the next image group.
Step S405: and acquiring a next image group, taking the next image group as a current image group, and acquiring a key frame in the current image group.
In one embodiment, if the candidate frame is a clear image, the next image group may be acquired after the candidate frame is used as the key frame of the current image group, and, taking the next image group as the current image group, a key frame is acquired in it. To show more clearly how multiple key frames are obtained from the frame images, a specific example is given in which one key frame is obtained from every 20 frames. First, the initial frame (reference numeral 1) of the first image group is used as the candidate frame and judged for blur; if it is not blurred, the initial frame is input into the detection network as a key frame. The second image group is then acquired and its first frame (reference numeral 21) used as the candidate frame; if frame 21 is blurred, the second frame of the group (reference numeral 22) is acquired and used as the candidate frame, and if frame 22 is not blurred it is input into the detection network. The third image group is then acquired and its second frame (reference numeral 42) used as the candidate frame, and so on until the key frames of all image groups are obtained.
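Putting steps S401 to S405 together, the walk-through above corresponds to the following sketch. It reuses is_sharp() and pixel_mean() from the sharpness sketch earlier, assumes groups of 20 frames, and applies the fallback described above of taking the frame with the largest pixel mean when no frame in a group passes the threshold; the carried-over candidate index is an interpretation that mirrors the frame numbering in the walk-through.

```python
def select_key_frames(frames, group_size=20):
    """One key frame per group of `group_size` frames (steps S401 to S405).

    Relies on is_sharp() and pixel_mean() from the sharpness sketch above.
    The candidate index carries over between groups, mirroring the
    walk-through: if frame 22 (index 1) is chosen in group 2, group 3
    starts its search at frame 42 (index 1 of that group).
    """
    key_frames, offset = [], 0
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        chosen = None
        for idx in range(offset, len(group)):       # S401, S402, S404
            if is_sharp(group[idx]):
                chosen, offset = group[idx], idx    # S403
                break
        if chosen is None:
            # Fallback described above: no frame in the group passed the
            # threshold, so take the frame with the largest pixel mean.
            chosen, offset = max(group, key=pixel_mean), 0
        key_frames.append(chosen)                   # S405: move to next group
    return key_frames
```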
As shown in fig. 5, a specific schematic diagram of acquiring multiple key frames is given. As can be seen from fig. 5, the key frames are acquired mainly by a key frame extractor; before key frames are selected, the frame images need to be sent to the key frame extractor as a data stream. The frame rate of the data stream (frames sent per second) is 30 fps, so the time interval covered by every 20 frames is 20/30 s ≈ 0.67 s. This is also the update frequency of the convolution kernel of the kernel correlation filtering algorithm, which fully satisfies the scene-update frequency of actual use and avoids the power-consumption problem caused by running the detection network model too frequently.
In an embodiment, the key frame extractor may select one key frame from every 20 frame images. Before this, a candidate frame is acquired and judged for blur through Laplacian edge detection; if the candidate frame is not blurred, it is taken as the key frame, 20 frames are skipped, and the next candidate frame is acquired and judged for blur in turn. If the current candidate is blurred, the frame following it is selected and the blur judgment is repeated, and so on. The initial frame may be regarded as a special key frame (the first key frame), and the blur judgment may consist of checking whether the pixel mean value of the candidate frame is greater than the preset threshold; if so, the candidate frame is sent to the detection network as a key frame.
It should be noted that in the embodiment of the present application, one, two, or three key frames may be acquired from each image group; how many key frames are acquired from each image group is not specifically limited and may be chosen according to the actual situation. For example, if the overall quality of the video to be tracked is good, a few more key frames can be acquired from each image group; if the overall quality is poor and the sharpness is low, the number of key frames acquired from each image group can be reduced appropriately.
Step S303: and inputting the plurality of key frames into a detection network to obtain detection data, wherein the detection data comprises the position information of the target object.
Step S304: and tracking the target object in the plurality of frame images by using the detection data and a kernel correlation filtering algorithm.
In the target tracking method provided by this embodiment of the application, in order to make target tracking more accurate, the acquired frame images can be divided into N image groups and a key frame acquired from each image group, where the key frames can be selected by judging the pixel mean values of the images.
Still another embodiment of the present application provides a target tracking method applied to an electronic device, please refer to fig. 6, where the target tracking method may include steps S601 to S606.
Step S601: the method comprises the steps of obtaining a plurality of frame images of a video to be tracked, and obtaining a plurality of key frames from the plurality of frame images.
Step S602: and inputting the plurality of key frames into a detection network to obtain detection data, wherein the detection data comprises the position information of the target object.
In an embodiment, in order to make target tracking more accurate, the key frames may be input into the detection network to obtain the target detection frame corresponding to the target object in each key frame; the confidence corresponding to each target detection frame is then obtained, and whether it is greater than a confidence threshold is judged. If the confidence of the target detection frame is greater than the confidence threshold, the position information corresponding to the target detection frame is used as detection data. The confidence of a target detection frame refers to the probability that the candidate box really contains the target, and how the confidence is calculated depends on the specific detection network.
In addition, if the confidence of a target detection frame is less than or equal to the confidence threshold, its position information is not used as detection data; that is, detection frames whose confidence is less than or equal to the confidence threshold are not displayed when the screen of the electronic device displays the targets. For example, suppose the confidence threshold is 0.5 and the target objects in the key frame include a person, a dog, and a cat, with the detection network giving confidences of 0.88 for the person's detection frame, 0.75 for the dog's, and 0.45 for the cat's. The confidences of the person's and the dog's detection frames are both greater than the threshold of 0.5, so their position information can be used as detection data, while the cat's cannot. In the end, only the detection frames for the person and the dog are displayed on the screen of the electronic device, and the cat's detection frame is not displayed.
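Applied to the worked example above (person 0.88, dog 0.75, cat 0.45 against a threshold of 0.5), the confidence filter reduces to a simple comparison. The tuple layout below is an assumption matching the earlier detection sketch.

```python
CONF_THRESHOLD = 0.5   # illustrative value from the worked example

def filter_detections(detections):
    """Keep only detection frames whose confidence exceeds the threshold;
    their position information becomes the detection data."""
    return [(label, conf, box) for (label, conf, box) in detections
            if conf > CONF_THRESHOLD]

# Worked example from the text (boxes omitted):
sample = [("person", 0.88, None), ("dog", 0.75, None), ("cat", 0.45, None)]
print([label for (label, _, _) in filter_detections(sample)])  # ['person', 'dog']
```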
Step S603: and judging whether the detection data is successfully acquired.
As introduced above, after the electronic device acquires the key frames, it can input them into the detection network, perform feature extraction on them, and obtain detection data that may include the position information of the target object; the electronic device can then store the detection data in the database. The detection data may fail to be obtained when a key frame is not input into the detection network, when frames are lost while a key frame is being input, or when the parameters of the detection network raise an error warning at run time. Therefore, it is necessary to determine whether the detection data has been acquired successfully: if it has, the detection data and the kernel correlation filtering algorithm are used to track the target object in the frame images, i.e., step S604 is performed; if it has not, tracking of the target object in the target frame image among the frame images is suspended, i.e., step S605 is performed.
Step S604: and tracking the target object in the plurality of frame images by using the detection data and a kernel correlation filtering algorithm.
Step S605: pausing tracking the target object in a target frame image of the plurality of frame images.
In one embodiment, if the detection data is not successfully acquired, the tracking of the target object in the target frame image of the plurality of frame images is suspended, and then the next frame image of the target frame is taken as a new target frame image, i.e., the process proceeds to step S606.
Step S606: and taking the next frame image of the target frame as a new target frame image.
When it is determined that the electronic device has not successfully acquired the detection data, tracking of the target object in the target frame image may be suspended; the next frame image after the target frame is then used as the new target frame image, and the operation of determining whether the detection data has been successfully acquired is executed again, i.e., the process returns to step S603. In other words, until the initial detection data has been stored in the database (that is, until the initialization extractor has finished), the images passed to the kernel correlation filtering algorithm are discarded without processing, and target tracking starts once the kernel correlation filtering algorithm can access detection data in the database. During target tracking, the embodiment of the application still extracts new features from the position information obtained after each key frame is sent into the detection network and updates them into the database, and this latest data can immediately be used for target tracking as well.
As can be understood from fig. 1, in the embodiment of the present application, the operation of inputting the key frames into the detection network and the operation of tracking the target object in the frame images with the detection data and the kernel correlation filtering algorithm may be performed in parallel. In other words, obtaining the detection data with the detection network is structurally independent of the kernel correlation filtering algorithm: in the software framework design, the detection network and the kernel correlation filtering algorithm run in their own threads; the frame images are split by the key frame extractor and delivered to the threads corresponding to the detection network and to the kernel correlation filtering algorithm respectively; and the two threads interact only through the database, i.e., the detection network obtains detection data and stores it in the database, while the kernel correlation filtering algorithm extracts detection data from the database and uses it to track the target object.
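The two-thread design described here can be sketched schematically as follows. The detection thread and the tracking thread interact only through a lock-protected store that plays the role of the database; the queues and stub comments are illustrative placeholders rather than the patent's actual software framework.

```python
import threading
import queue

db_lock = threading.Lock()
database = {}                  # detection data keyed by frame index (the DB)
key_frame_q = queue.Queue()    # filled by the key frame extractor
frame_q = queue.Queue()        # all frame images, fed to the tracker thread

def detection_worker():
    """Detection network thread: the only writer to the database."""
    while True:
        idx, key_frame = key_frame_q.get()
        detections = detect_key_frame(key_frame)   # from the detection sketch
        with db_lock:
            database[idx] = detections             # sole interaction point

def tracking_worker():
    """Kernel correlation filtering thread: the only reader of the database."""
    while True:
        idx, frame = frame_q.get()
        with db_lock:
            snapshot = dict(database)              # latest detection data
        if not snapshot:
            continue   # no detection data yet: discard this target frame
        # ...update the KCF tracker for `frame` using `snapshot`...

threading.Thread(target=detection_worker, daemon=True).start()
threading.Thread(target=tracking_worker, daemon=True).start()
```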
According to the above description, the kernel correlation filtering algorithm does not need to wait for the detection network's results in order to track, so real-time performance is well preserved. The output precision and the number of detection frames of the detection network can be adjusted independently, which in turn controls the accuracy and the number of targets that the kernel correlation filtering algorithm can track. Meanwhile, the detection results of the detection network can be used both for target tracking and for image preprocessing, so an accurate classification result can be obtained even when the results are passed on to a network with poor real-time performance.
According to the target tracking method provided by the embodiment of the application, more accurate position information can be acquired with the detection network, which improves the tracking accuracy of the kernel correlation filtering algorithm's output; the number of output detection frames can be controlled through the confidence threshold, further improving tracking accuracy; and because the detection network and the kernel correlation filtering algorithm run in parallel, the kernel correlation filtering algorithm never has to wait on the detection network's results for its data updates and can keep running even if the detection network stops, which improves the real-time performance of target tracking.
Referring to fig. 7, a target tracking apparatus 700 according to an embodiment of the present application is applied to an electronic device, and specifically, the target tracking apparatus 700 includes: a key frame acquisition module 701, a detection data acquisition module 702 and a target tracking module 703.
The key frame obtaining module 701 is configured to obtain a plurality of frame images of a video to be tracked, and obtain a plurality of key frames from the plurality of frame images.
Further, the key frame acquiring module 701 is further configured to divide the plurality of frame images into N image groups, and acquire a key frame from each of the image groups to obtain a plurality of key frames.
Further, the key frame acquiring module 701 is further configured to acquire an alternative frame from the current image group, determine whether the alternative frame is a clear image, and if the alternative frame is a clear image, use the alternative frame as the key frame of the current image group.
Further, the key frame acquiring module 701 is further configured to, if the candidate frame is a clear image, acquire a next image group after taking the candidate frame as a key frame of the current image group, and acquire a key frame in the current image group by taking the next image group as the current image group.
Further, the key frame obtaining module 701 is further configured to obtain a pixel mean value of the candidate frame, and determine whether the pixel mean value is greater than a preset threshold, where when the pixel mean value is greater than the preset threshold, the candidate frame is a clear image.
Further, the key frame obtaining module 701 is further configured to, if the candidate frame is not a clear image, obtain a next frame image of the candidate frame, use the next frame image as a new candidate frame, and return to perform the operation of determining whether the candidate frame is a clear image.
A detection data obtaining module 702, configured to input the plurality of key frames to a detection network to obtain detection data, where the detection data includes position information of the target object.
Further, the detection data obtaining module 702 is further configured to input the plurality of key frames into a detection network, so as to obtain a target detection frame corresponding to the target object in each key frame, obtain a confidence level of the target detection frame, and determine whether the confidence level of the target detection frame is greater than a confidence threshold, and if the confidence level of the target detection frame is greater than the confidence threshold, use position information corresponding to the target detection frame as the detection data.
A target tracking module 703, configured to track the target object in the plurality of frame images by using the detection data and a kernel correlation filtering algorithm.
Further, the target tracking module 703 is further configured to determine whether the detection data is successfully acquired, and if so, track the target object in the plurality of frame images by using the detection data and the kernel correlation filtering algorithm.
Further, the target tracking module 703 is further configured to, if the detection data is not successfully acquired, suspend tracking the target object in a target frame image of the plurality of frame images, use a next frame image of the target frame as a new target frame image, and return to perform the operation of determining whether the detection data is successfully acquired.
In addition, the operation of inputting the plurality of key frames into a detection network and the operation of tracking the target object in the plurality of frame images by using the detection data and a kernel correlation filtering algorithm are executed in parallel.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling between the modules may be electrical, mechanical or other type of coupling.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 8, a block diagram of an electronic device according to an embodiment of the present disclosure is shown. The electronic device 100 may be a smart phone, a tablet computer, an electronic book reader, or another electronic device capable of running applications. The electronic device 100 in the present application may include one or more of the following components: a processor 110, a memory 120, a screen 130, and one or more applications, wherein the one or more applications may be stored in the memory 120 and configured to be executed by the one or more processors 110, the one or more applications being configured to perform the methods described in the foregoing method embodiments.
Processor 110 may include one or more processing cores. The processor 110 connects various parts of the electronic device 100 using various interfaces and lines, and performs the functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and by calling data stored in the memory 120. Optionally, the processor 110 may be implemented in hardware in at least one of the forms of digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA). The processor 110 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like, where the CPU mainly handles the operating system, user interface, application programs, and so on; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
The memory 120 may include a Random Access Memory (RAM) or a Read-only memory (Read-only memory). The memory 120 may be used to store instructions, programs, code sets, or instruction sets. The memory 120 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing various method embodiments described below, and the like. The data storage area may also store data created by the electronic device 100 during use (e.g., phone book, audio-video data, chat log data), and the like.
The screen 130 is used to display information input by the user, information provided to the user, and various graphical user interfaces of the electronic device; these graphical user interfaces may be composed of graphics, text, icons, numerals, video, and any combination thereof. In one example, a touch panel may be disposed on the display panel so as to form an integrated whole with it.
Referring to fig. 9, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable medium 1000 has stored therein program code that can be called by a processor to execute the method described in the above-described method embodiments.
The computer-readable storage medium 1000 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 1000 comprises a non-transitory computer-readable storage medium. The computer-readable storage medium 1000 has storage space for program code 1010 that performs any of the method steps described above. The program code can be read from or written into one or more computer program products. The program code 1010 may, for example, be compressed in a suitable form.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications and substitutions do not make the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. A target tracking method is applied to an electronic device, and the method comprises the following steps:
acquiring a plurality of frame images of a video to be tracked, and acquiring a plurality of key frames from the plurality of frame images;
inputting the plurality of key frames into a detection network to obtain detection data, wherein the detection data comprises position information of a target object;
and tracking the target object in the plurality of frame images by using the detection data and a kernel correlation filtering algorithm.
2. The method of claim 1, wherein said obtaining a plurality of keyframes from said plurality of frame images comprises:
dividing the plurality of frame images into N image groups;
and acquiring a key frame from each image group to obtain a plurality of key frames.
3. The method of claim 2, wherein said obtaining a key frame from each of said image groups comprises:
acquiring an alternative frame from a current image group;
judging whether the alternative frame is a clear image;
and if the alternative frame is a clear image, taking the alternative frame as a key frame of the current image group.
4. The method according to claim 3, wherein after taking the alternative frame as a key frame of the current image group if the alternative frame is a clear image, the method further comprises:
and acquiring a next image group, taking the next image group as a current image group, and acquiring a key frame in the current image group.
5. The method of claim 3, wherein the determining whether the candidate frame is a sharp image comprises:
acquiring a pixel mean value of the alternative frame, and judging whether the pixel mean value is greater than a preset threshold value;
and when the pixel mean value is larger than a preset threshold value, the alternative frame is a clear image.
6. The method of claim 3, further comprising:
and if the alternative frame is not a clear image, acquiring a next frame image of the alternative frame, taking the next frame image as a new alternative frame, and returning to execute the operation of judging whether the alternative frame is the clear image.
7. The method of claim 1, wherein the tracking the target object in the plurality of frame images by using the detection data and a kernel correlation filtering algorithm comprises:
judging whether the detection data is successfully acquired or not;
and if the detection data are successfully acquired, tracking the target object in the plurality of frame images by using the detection data and a nuclear correlation filtering algorithm.
8. The method of claim 7, further comprising:
if the detection data is not successfully acquired, suspending tracking of the target object in a target frame image among the plurality of frame images;
and taking a next frame image after the target frame image as a new target frame image, and returning to the operation of determining whether the detection data is successfully acquired.
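A compact sketch of this pause-and-retry logic, with detection_ready and track_frame as hypothetical helpers not named in the claims:

```python
def track_when_ready(frames, detection_ready, track_frame):
    """Claims 7 and 8: track only while detection data has been acquired;
    otherwise pause on the current target frame and re-check on the next."""
    for frame in frames:
        if detection_ready():
            track_frame(frame)  # claim 7: detection data acquired, so track
        # else: tracking is suspended for this target frame image and the
        # acquisition check is repeated with the next frame (claim 8)
```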
9. The method of claim 1, wherein the inputting of the plurality of key frames into the detection network and the tracking of the target object in the plurality of frame images by using the detection data and the kernel correlation filtering algorithm are performed in parallel.
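One plausible realisation of this parallelism, assuming a producer thread for the detection network and a queue that hands detection data to the tracking loop:

```python
import queue
import threading

detections = queue.Queue()  # hands detection data from detector to tracker

def detection_worker(key_frames, run_detector):
    """Producer thread: run the detection network over the key frames."""
    for kf in key_frames:
        box = run_detector(kf)
        if box is not None:
            detections.put(box)  # detection data: target position info

def start_parallel_detection(key_frames, run_detector):
    t = threading.Thread(target=detection_worker,
                         args=(key_frames, run_detector), daemon=True)
    t.start()
    return t
```

The tracking loop can then poll detections with get_nowait() and reinitialise the kernel correlation filter tracker whenever a fresh box arrives, so detection never blocks per-frame tracking.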
10. The method of claim 1, wherein inputting the plurality of key frames into a detection network to obtain detection data comprises:
inputting the plurality of key frames into the detection network to obtain a target detection frame corresponding to the target object in each key frame;
acquiring a confidence of the target detection frame, and determining whether the confidence of the target detection frame is greater than a confidence threshold;
and if the confidence of the target detection frame is greater than the confidence threshold, taking position information corresponding to the target detection frame as the detection data.
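A one-function sketch of this confidence filter, assuming the detection network returns (box, confidence) pairs; the 0.5 threshold is an assumed value:

```python
def filter_detection_frames(raw_detections, conf_threshold=0.5):
    """Claim 10: keep only target detection frames whose confidence
    exceeds the confidence threshold; their positions become detection data."""
    return [box for box, conf in raw_detections if conf > conf_threshold]
```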
11. A target tracking apparatus, applied to an electronic device, the apparatus comprising:
a key frame acquisition module, configured to acquire a plurality of frame images of a video to be tracked and acquire a plurality of key frames from the plurality of frame images;
a detection data acquisition module, configured to input the plurality of key frames into a detection network to obtain detection data, wherein the detection data comprises position information of a target object;
and a target tracking module, configured to track the target object in the plurality of frame images by using the detection data and a kernel correlation filtering algorithm.
12. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method according to any one of claims 1 to 10.
13. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 10.
CN201910810400.1A 2019-08-29 2019-08-29 Target tracking method and device, electronic equipment and storage medium Active CN110610510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910810400.1A CN110610510B (en) 2019-08-29 2019-08-29 Target tracking method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910810400.1A CN110610510B (en) 2019-08-29 2019-08-29 Target tracking method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110610510A 2019-12-24
CN110610510B CN110610510B (en) 2022-12-16

Family

ID=68890436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910810400.1A Active CN110610510B (en) 2019-08-29 2019-08-29 Target tracking method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110610510B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105208377A (en) * 2015-09-14 2015-12-30 北京奇虎科技有限公司 Image processing method and device
US20180137892A1 (en) * 2016-11-16 2018-05-17 Adobe Systems Incorporated Robust tracking of objects in videos
CN108629224A (en) * 2017-03-15 2018-10-09 北京京东尚科信息技术有限公司 Information demonstrating method and device
CN107609497A (en) * 2017-08-31 2018-01-19 武汉世纪金桥安全技术有限公司 The real-time video face identification method and system of view-based access control model tracking technique
CN108053427A (en) * 2017-10-31 2018-05-18 深圳大学 A kind of modified multi-object tracking method, system and device based on KCF and Kalman
CN110008793A (en) * 2018-01-05 2019-07-12 中国移动通信有限公司研究院 Face identification method, device and equipment
CN109035304A (en) * 2018-08-07 2018-12-18 北京清瑞维航技术发展有限公司 Method for tracking target, calculates equipment and device at medium
CN109151501A (en) * 2018-10-09 2019-01-04 北京周同科技有限公司 A kind of video key frame extracting method, device, terminal device and storage medium
CN109872342A (en) * 2019-02-01 2019-06-11 北京清帆科技有限公司 A kind of method for tracking target under special scenes

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
IVÁN GONZÁLEZ-DÍAZ et al.: "Temporal segmentation and keyframe selection methods for user-generated video search-based annotation", Elsevier *
BAO Xiaoan et al.: "Anti-occlusion target tracking algorithm based on KCF and SIFT features", Computer Measurement & Control *
LI Kejing et al.: "A target tracking algorithm based on deep belief networks", Electronic Design Engineering *
CAI Chengtao: "Computer Vision Technology in the Marine Environment", 31 October 2015 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860559A (en) * 2019-12-31 2020-10-30 滴图(北京)科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN113450387A (en) * 2020-03-26 2021-09-28 京东方科技集团股份有限公司 Target tracking method and device, electronic equipment and computer readable storage medium
CN113497889B (en) * 2020-04-08 2023-05-26 杭州萤石软件有限公司 Object tracking method and device under motion shooting condition and storage medium
CN113497889A (en) * 2020-04-08 2021-10-12 杭州萤石软件有限公司 Object tracking method and device under motion shooting condition and storage medium
CN113312949B (en) * 2020-04-13 2023-11-24 阿里巴巴集团控股有限公司 Video data processing method, video data processing device and electronic equipment
CN113312949A (en) * 2020-04-13 2021-08-27 阿里巴巴集团控股有限公司 Video data processing method, video data processing device and electronic equipment
CN111598924A (en) * 2020-05-08 2020-08-28 腾讯科技(深圳)有限公司 Target tracking method and device, computer equipment and storage medium
CN111598923A (en) * 2020-05-08 2020-08-28 腾讯科技(深圳)有限公司 Target tracking method and device, computer equipment and storage medium
CN111598924B (en) * 2020-05-08 2022-09-30 腾讯科技(深圳)有限公司 Target tracking method and device, computer equipment and storage medium
CN111879313A (en) * 2020-07-31 2020-11-03 中国人民解放军国防科技大学 Multi-target continuous positioning method and system based on unmanned aerial vehicle image recognition
CN111915639B (en) * 2020-08-06 2024-05-31 广州市百果园信息技术有限公司 Target detection tracking method, device, electronic equipment and storage medium
CN111915639A (en) * 2020-08-06 2020-11-10 广州市百果园信息技术有限公司 Target detection tracking method and device, electronic equipment and storage medium
WO2022028592A1 (en) * 2020-08-06 2022-02-10 百果园技术(新加坡)有限公司 Target detection and tracking method and apparatus, electronic device, and storage medium
CN112001950B (en) * 2020-08-25 2024-04-19 天地伟业技术有限公司 Multi-target tracking algorithm based on target detection and feature extraction combined model
CN112001950A (en) * 2020-08-25 2020-11-27 天地伟业技术有限公司 Multi-target tracking algorithm based on target detection and feature extraction combined model
CN112070802A (en) * 2020-09-02 2020-12-11 合肥英睿系统技术有限公司 Target tracking method, device, equipment and computer readable storage medium
CN112070802B (en) * 2020-09-02 2024-01-26 合肥英睿系统技术有限公司 Target tracking method, device, equipment and computer readable storage medium
CN112188212B (en) * 2020-10-12 2024-02-13 杭州电子科技大学 Intelligent transcoding method and device for high-definition monitoring video
CN112188212A (en) * 2020-10-12 2021-01-05 杭州电子科技大学 Method and device for intelligent transcoding of high-definition monitoring video
WO2022193990A1 (en) * 2021-03-17 2022-09-22 腾讯科技(深圳)有限公司 Method and apparatus for detection and tracking, device, storage medium, and computer program product
CN113076882A (en) * 2021-04-03 2021-07-06 国家计算机网络与信息安全管理中心 Specific mark detection method based on deep learning
WO2023221770A1 (en) * 2022-05-16 2023-11-23 京东方科技集团股份有限公司 Dynamic target analysis method and apparatus, device, and storage medium
CN115845350B (en) * 2023-03-02 2023-05-09 成都谷帝科技有限公司 Method and system for automatic ranging of standing long jump
CN115845350A (en) * 2023-03-02 2023-03-28 成都谷帝科技有限公司 Method and system for automatically measuring distance of standing long jump

Also Published As

Publication number Publication date
CN110610510B (en) 2022-12-16

Similar Documents

Publication Publication Date Title
CN110610510B (en) Target tracking method and device, electronic equipment and storage medium
EP3961485A1 (en) Image processing method, apparatus and device, and storage medium
CN111488791A (en) On-device classification of fingertip movement patterns as gestures in real time
US20180114071A1 (en) Method for analysing media content
TW202139183A (en) Method of detecting object based on artificial intelligence, device, equipment and computer-readable storage medium
CN111783620A (en) Expression recognition method, device, equipment and storage medium
CN112669344A (en) Method and device for positioning moving object, electronic equipment and storage medium
US20230267142A1 (en) Multimedia content publishing method and apparatus, electronic device, and storage medium
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
Zhang et al. An imbalance compensation framework for background subtraction
US11681409B2 (en) Systems and methods for augmented or mixed reality writing
EP3933674A1 (en) Method, apparatus, device, storage medium and program for processing image
EP3852007A2 (en) Method, apparatus, electronic device, readable storage medium and program for classifying video
CN114117128A (en) Method, system and equipment for video annotation
CN111046746A (en) License plate detection method and device
CN111414910A (en) Small target enhancement detection method and device based on double convolutional neural network
JP2022090633A (en) Method, computer program product and computer system for improving object detection within high-resolution image
Qi et al. A DNN-based object detection system on mobile cloud computing
KR102440198B1 (en) VIDEO SEARCH METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
CN113810765A (en) Video processing method, apparatus, device and medium
Deng et al. Deep learning in crowd counting: A survey
Delibasoglu et al. Motion detection in moving camera videos using background modeling and FlowNet
CN112580584A (en) Method, device and system for detecting standing behavior and storage medium
CN113361519B (en) Target processing method, training method of target processing model and device thereof
CN111914850A (en) Picture feature extraction method, device, server and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant