CN110378264B - Target tracking method and device

Info

Publication number
CN110378264B
CN110378264B (application CN201910611097.2A)
Authority
CN
China
Prior art keywords
key point
image
point data
detected
target
Prior art date
Legal status
Active
Application number
CN201910611097.2A
Other languages
Chinese (zh)
Other versions
CN110378264A (en)
Inventor
卓世杰 (Zhuo Shijie)
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910611097.2A
Publication of CN110378264A
Application granted
Publication of CN110378264B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to the field of image processing technologies, and in particular to a target tracking method, a target tracking apparatus, a computer-readable medium, and an electronic device. The method comprises: acquiring a video to be tracked and performing target detection on it to obtain a key frame image containing a tracking target; performing image recognition on the key frame image to obtain an object region containing the tracking target, and extracting key points from the object region to obtain key point data of the tracking target; extracting the image to be detected in the next frame adjacent to the key frame and performing feature extraction on it to obtain a second feature map of the image to be detected; inputting the second feature map and the key point data into a prediction model as input parameters to obtain predicted key point data for each key point in the image to be detected; and determining the tracking target in the image to be detected according to the predicted key point data.

Description

Target tracking method and device
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a target tracking method, a target tracking apparatus, a computer-readable medium, and an electronic device.
Background
Object tracking is one of the hotspots of computer vision research and has found widespread application in many fields. Generally, target tracking establishes the positional relationship of an object to be tracked across a continuous video sequence, so as to obtain the object's complete motion trajectory.
In the prior art, optical flow methods and deep learning techniques are widely applied to target tracking. In deep-learning-based tracking schemes, a network model is generally trained in advance; when tracking is performed, the features learned by the network model are applied directly to a correlation-filtering tracking framework, which yields better tracking results but also increases the amount of computation, making online real-time tracking difficult. Furthermore, tracking a target with the optical flow method requires three strong assumptions: constant brightness, temporal continuity (small motion displacement), and spatial consistency. In practical applications, a large number of scenes cannot meet these requirements.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a target tracking method, a target tracking apparatus, a computer-readable medium, and an electronic device, so as to provide a target tracking scheme with a small amount of computation and low requirements on the tracking environment, thereby overcoming, at least to some extent, the limitations and drawbacks of the related art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a target tracking method, including:
acquiring a video to be tracked, and carrying out target detection on the video to be tracked to acquire a key frame image containing a tracking target;
performing image recognition on the key frame image to acquire an object region containing a tracking target, and performing key point extraction on the object region to acquire key point data of the tracking target; and
extracting an image to be detected of a next frame adjacent to the key frame, and performing feature extraction on the image to be detected to obtain a second feature map of the image to be detected;
inputting the second feature map and the key point data into a prediction model as input parameters to obtain predicted key point data for each key point in the image to be detected;
and determining the tracking target in the image to be detected according to the predicted key point data.
According to a second aspect of the present disclosure, there is provided a target tracking apparatus comprising:
the key frame identification module is used for acquiring a video to be tracked, and performing target detection on the video to be tracked to acquire a key frame image containing a tracking target;
the key point data calculation module is used for carrying out image recognition on the key frame image so as to obtain an object area containing a tracking target and carrying out feature extraction on the object area so as to obtain key point data of the tracking target;
the second characteristic information calculation module is used for extracting an image to be detected of a next frame adjacent to the key frame and extracting the characteristics of the image to be detected so as to obtain second characteristic information of the image to be detected;
the predicted key point calculation module is used for inputting the second characteristic information and the key point information as input parameters into a trained prediction model so as to obtain predicted key point data of the key point in the image to be detected;
and the tracking target acquisition module is used for determining the tracking target in the image to be detected according to the prediction key point data.
According to a third aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the above-described object tracking method.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus comprising:
one or more processors;
a storage device to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the above-described target tracking method.
In the target tracking method provided by the embodiments of the present disclosure, a key frame image containing the tracking target is first determined in the tracking video; the key frame image is then recognized, the key point data of the tracking target are extracted, and a second feature map is extracted from the adjacent next frame image to be detected. The positions of the key points in the image to be detected are predicted from the key point data and the second feature map, and the precise position of the tracking target in the image to be detected is described by the predicted key point data, realizing continuous tracking of the target. Because the motion trend of the tracking target is predicted from the key point data of the key frame image and the feature map of the image to be detected, tracking is maintained without using the overall features of the tracking target, which effectively reduces the amount of data computation; moreover, adaptive learning of the target's key point data can be realized, effectively improving tracking efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 schematically illustrates a flow diagram of a target tracking method in an exemplary embodiment of the disclosure;
FIG. 2 is a flow diagram schematically illustrating a method of obtaining predicted keypoint data, according to an exemplary embodiment of the disclosure;
FIG. 3 schematically illustrates a stacked hourglass network structure in an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates the composition of a target tracking apparatus in an exemplary embodiment of the disclosure;
FIG. 5 schematically illustrates the structure of a computer system of an electronic device in an exemplary embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In existing tracking schemes based on deep learning, the initial approach was to apply the features learned by the network directly to a correlation-filtering tracking framework, thereby obtaining better tracking results. In essence, the convolution output yields better feature expression, which is one of the advantages of deep learning, but it also increases the amount of computation. Many existing research frameworks and methods tend to compare two kinds of features simultaneously to verify the improvement of a tracking method or framework: traditional hand-crafted features on the one hand, and features learned by a deep network on the other. However, no matter how the method or framework is improved, it still fundamentally accomplishes the tracking task on the basis of object detection capability. When the tracking task is completed in this way, the amount of computation is large, and online real-time tracking is difficult to achieve. The traditional optical flow method, in turn, requires three strong assumptions: constant brightness, temporal continuity (small motion displacement), and spatial consistency. In real situations, a large number of scenes cannot satisfy these three assumptions.
In view of the above drawbacks and deficiencies of the prior art, the exemplary embodiment provides a target tracking method, which can be applied to online real-time tracking of a moving object in a complex scene. Referring to fig. 1, the above-mentioned target tracking method may include the steps of:
s11, acquiring a video to be tracked, and carrying out target detection on the video to be tracked to acquire a key frame image containing a tracking target;
s12, performing image recognition on the key frame image to acquire an object region containing a tracking target, and performing key point extraction on the object region to acquire key point data of the tracking target; and
s13, extracting an image to be detected of a next frame adjacent to the key frame, and performing feature extraction on the image to be detected to obtain a second feature map of the image to be detected;
s14, inputting the second feature map and the key point data into a prediction model as input parameters to obtain corresponding prediction key point data of each key point in the image to be detected;
s15, determining the tracking target in the image to be detected according to the prediction key point data.
In the target tracking method provided by this exemplary embodiment, on one hand, the motion trend of the tracking target in the image to be detected is predicted using the key point data of the key frame image and the feature map of the image to be detected, without needing the overall features of the tracking target to maintain tracking, so the amount of data computation is effectively reduced; on the other hand, the method realizes adaptive learning of the tracking target's key point data, effectively improving tracking efficiency.
Hereinafter, each step of the target tracking method in the present exemplary embodiment will be described in more detail with reference to the drawings and examples.
And S11, acquiring a video to be tracked, and performing target detection on the video to be tracked to acquire a key frame image containing a tracking target.
In this exemplary embodiment, the video to be tracked may be captured directly by an image capturing device such as a monitoring camera or a video camera, or may be video data received over a wired or wireless network. After the video to be tracked is obtained, it can be decomposed into consecutive frames of image data. When tracking is performed for the first time, target detection can be run on each frame of image data to determine a key frame image containing the tracking target. For example, the key frame image may be selected manually. Alternatively, it may be extracted by a target recognition algorithm, for example one based on a deep learning network, or by a target detection algorithm based on the SSD (Single Shot MultiBox Detector) architecture. The tracking target may be, for example, an automobile, a drone, a human, or an animal.
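As an illustration of this detection step, the following is a minimal sketch (not part of the original disclosure) of key frame selection using an off-the-shelf SSD detector; the specific model (torchvision's ssd300_vgg16), the score threshold, and the helper name are assumptions made for the example:

```python
import torch
from torchvision.models.detection import ssd300_vgg16

# Pretrained SSD detector; the disclosure only requires "an SSD-based
# target detection algorithm", so this particular model is an assumption.
detector = ssd300_vgg16(weights="DEFAULT").eval()

@torch.no_grad()
def find_key_frame(frames, target_label, score_thresh=0.5):
    """Return (frame index, object region box) of the first frame containing the target."""
    for i, frame in enumerate(frames):           # frame: float tensor [3, H, W] in [0, 1]
        det = detector([frame])[0]               # dict with 'boxes', 'labels', 'scores'
        keep = (det["labels"] == target_label) & (det["scores"] > score_thresh)
        if keep.any():
            return i, det["boxes"][keep][0]      # rectangular frame [x1, y1, x2, y2]
    return None, None
```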
And S12, performing image recognition on the key frame image to acquire an object area containing a tracking target, and performing key point extraction on the object area to acquire key point data of the tracking target.
In the present exemplary embodiment, after the key frame image is determined, it may be processed further. Specifically, image recognition may be performed on the key frame image, and a rectangular frame may be used to mark out the object region where the tracking target is located. After the object region is extracted, the key point data can be extracted from it. In addition, a corresponding key point heat map can be generated from the key point data and stored for predicting the tracking target in subsequent images to be detected. For example, a convolutional neural network model based on a stacked hourglass network can be used to extract the key point data of the object region, so that the model outputs multiple pieces of key point information usable for target tracking.
In particular, the stacked hourglass network structure may be composed of a plurality of trained hourglass networks connected in series, with the output of the former serving as the input of the latter. In each hourglass network, in order to capture information at every scale, the input is repeatedly downsampled; upon reaching the lowest resolution, the network begins upsampling and combining features across scales, adding in the same-scale features saved before downsampling through residual (skip) connections, thereby capturing multi-scale information.
For example, referring to the stacked hourglass network shown in fig. 3, it includes a first-stage hourglass network 301 and a second-stage hourglass network 302. The object region serves as the input parameter N1 of the first-stage hourglass network 301, whose output parameter N2 may include the feature information extracted from the object region and may further include a heat map corresponding to the object region; the output parameter N2 is convolved and fed as the input parameter of the second-stage hourglass network 302, which finally outputs the feature information corresponding to the key point information together with the heat map. For example, when the tracking target is a drone, its key point information may be key points describing the drone's outline and main features, represented in the form of a heat map.
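As a concrete illustration, the following is a compact, untrained PyTorch sketch of one hourglass stage and a two-stage stack following the pattern described above; the channel width, depth, and number of output heat maps are illustrative assumptions, not values fixed by the disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class Hourglass(nn.Module):
    """One hourglass stage: repeated downsampling, then upsampling with skip additions."""
    def __init__(self, ch, depth=4):
        super().__init__()
        self.down = nn.ModuleList([conv_bn(ch, ch) for _ in range(depth)])
        self.up = nn.ModuleList([conv_bn(ch, ch) for _ in range(depth)])
        self.skip = nn.ModuleList([conv_bn(ch, ch) for _ in range(depth)])
        self.bottom = conv_bn(ch, ch)

    def forward(self, x):
        skips = []
        for d, s in zip(self.down, self.skip):
            skips.append(s(x))                        # same-scale features saved for later
            x = F.max_pool2d(d(x), 2)                 # halve the resolution
        x = self.bottom(x)                            # lowest resolution
        for u, s in zip(self.up, reversed(skips)):
            x = F.interpolate(u(x), scale_factor=2)   # upsample back
            x = x + s                                 # add features of the same scale
        return x

class StackedHourglass(nn.Module):
    """Two stages in series: the output of the former is the input of the latter."""
    def __init__(self, ch=64, num_keypoints=16):
        super().__init__()
        self.stage1, self.stage2 = Hourglass(ch), Hourglass(ch)
        self.mid = conv_bn(ch, ch)                    # convolution between the two stages
        self.head = nn.Conv2d(ch, num_keypoints, 1)   # one heat map per key point

    def forward(self, x):                             # x: [B, ch, H, W], H and W divisible by 16
        n2 = self.stage1(x)                           # output parameter N2 in the text
        return self.head(self.stage2(self.mid(n2)))
```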
In other exemplary embodiments of the present disclosure, after an object region containing the tracking target is obtained, image segmentation may further be applied to the object region to separate the foreground image of the tracking target from the background image within the rectangular frame. The object region then contains only the clean image of the tracking target, so that more accurate key point data can be obtained during key point extraction.
And S13, extracting an image to be detected of the next frame adjacent to the key frame, and performing feature extraction on the image to be detected to obtain a second feature map of the image to be detected.
In the present exemplary embodiment, when extracting the key point data from the key frame image, the to-be-detected image of the next frame adjacent to the key frame image may also be processed. Specifically, feature extraction may be performed on the image to be detected, and a second feature map corresponding to the image to be detected is generated.
For example, a convolutional neural network based on the MobileNetV3 structure may be used to perform feature extraction on the image to be detected. Specifically, after the image to be detected is input into a MobileNetV3-based convolutional neural network model, the initial part of the model first applies, in sequence, a convolution layer, a BN (Batch Normalization) layer, and an h-swish activation layer to the input image to obtain a first intermediate result; the first intermediate result is fed into the middle part of the model, where convolution and channel-expansion layers produce a second intermediate result; the second intermediate result is fed into the final part of the model, where a further convolution yields the second feature map of the image to be detected. In the initial, middle, and final parts described above, each convolution layer may have a different convolution kernel and a specified stride.
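A brief sketch of this feature extraction step is shown below, reusing torchvision's MobileNetV3 backbone, whose `features` stack already contains the initial convolution + BN + h-swish block, the inverted-residual middle part, and the final convolution described above; using the pretrained torchvision model rather than a custom-trained one is an assumption of the example:

```python
import torch
from torchvision.models import mobilenet_v3_small

# Feature backbone; the disclosure trains its own MobileNetV3-based model,
# so reusing torchvision's pretrained weights is only for illustration.
backbone = mobilenet_v3_small(weights="DEFAULT").features.eval()

@torch.no_grad()
def second_feature_map(frame):
    """frame: float tensor [3, H, W] -> second feature map [C, H', W']."""
    return backbone(frame.unsqueeze(0)).squeeze(0)
```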
And S14, inputting the second feature map and the key point data into a prediction model as input parameters to obtain corresponding prediction key point data of each key point in the image to be detected.
In this exemplary embodiment, after obtaining the key point heat map corresponding to the key point data of the key frame image and the second feature map corresponding to the image to be detected, the step S14 may specifically include:
step S1411, merging the key point heat map and the second feature map along the pixel channel dimension to obtain a merged feature image;
step S1412, inputting the merged feature image into a trained prediction model based on the stacked hourglass network structure to obtain the predicted key point data of the image to be detected.
For example, the prediction model based on the stacked hourglass network structure may be a convolutional neural network model built on that structure. In particular, the stacked hourglass network may be composed of a plurality of trained hourglass networks, in which the output and input of a previous hourglass network can serve as the input of the subsequent one; the stacked hourglass network structure shown in fig. 3 in the above embodiment can be used. Additionally, a message-passing layer may be provided between the first-stage hourglass network 301 and the second-stage hourglass network 302. The message-passing layer uses a geometric transformation kernel to convert the positions in the key point heat map of the key frame image output by the first-stage hourglass network 301 into relative positions, so as to obtain the key point positions of the second feature map.
By concatenating the key point heat map and the second feature map along the channel dimension and then applying a convolutional neural network based on the stacked hourglass network, the changed positions of the key points can be located, the motion range of the key points in the next frame can be predicted, and the predicted key point data can be output in the form of a heat map.
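A minimal sketch of steps S1411 and S1412 might look as follows; the resize of the heat map to the feature-map resolution is an assumption added so that the two tensors can be concatenated, and `model` stands for a trained stacked-hourglass predictor whose first convolution accepts K+C input channels:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_keypoints(kpt_heatmap, feat2, model):
    """kpt_heatmap: [K, h, w]; feat2: [C, H, W]; model: trained hourglass predictor."""
    kpt = F.interpolate(kpt_heatmap.unsqueeze(0), size=feat2.shape[-2:])  # match resolution
    merged = torch.cat([kpt, feat2.unsqueeze(0)], dim=1)   # merged image [1, K+C, H, W]
    return model(merged).squeeze(0)                        # predicted heat maps [K, H, W]
```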
Based on the above, in other exemplary embodiments of the present disclosure, after the key frame image is obtained, feature extraction may also be performed on the key frame image using a convolutional neural network model based on the MobileNetV3 structure, so as to obtain a first feature map corresponding to the key frame image, which is used as an additional input parameter of the prediction model. Specifically, referring to fig. 2, the following steps may be included:
step S1421, merging the key point heat map, the first feature map, and the second feature map along the pixel channel dimension to obtain a merged feature image;
step S1422, inputting the merged feature image into a trained prediction model based on a stacked hourglass network structure to obtain the predicted key point data of the image to be detected.
By inputting the first feature map, the second feature map, and the key point heat map into the prediction model together, the motion tracks of the other feature points in the key frame image are effectively taken into account; when key point prediction is performed in the image to be detected, the motion tracks and directions of these other feature points can be consulted, so that the motion ranges of the key points and other feature points in the image to be detected are predicted more accurately, further improving the accuracy of key point prediction.
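The three-input variant of steps S1421 and S1422 differs from the sketch above only in the channel-wise merge; a hypothetical helper (assuming the two feature maps share the same spatial size) might read:

```python
import torch
import torch.nn.functional as F

def merge_three(kpt_heatmap, feat1, feat2):
    """[K, h, w] + [C1, H, W] + [C2, H, W] -> merged image [1, K+C1+C2, H, W]."""
    kpt = F.interpolate(kpt_heatmap.unsqueeze(0), size=feat2.shape[-2:])
    return torch.cat([kpt, feat1.unsqueeze(0), feat2.unsqueeze(0)], dim=1)
```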
Of course, in other exemplary embodiments of the present disclosure, when performing feature recognition on an image, other models or algorithms may also be used to obtain a first feature map corresponding to a key frame image and a second feature map corresponding to an image to be detected, which is not particularly limited in the present disclosure.
And S15, determining the tracking target in the image to be detected according to the prediction key point data.
In this exemplary embodiment, after the heat map corresponding to the predicted key point data is obtained, a bounding box calculation may be performed on the heat map, and the result taken as the tracking target, thereby obtaining the position of the tracking target in the image to be detected. When target tracking is performed on the next frame to be detected, the key point data corresponding to the key frame image can still be used for prediction and calculation. For example, the smallest rectangular area containing all the key points may be taken as the bounding box.
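A sketch of this decoding step, under the assumption that each predicted heat map encodes one key point at its peak, is:

```python
import torch

def heatmaps_to_bbox(heatmaps):
    """heatmaps: [K, H, W] -> smallest box (x1, y1, x2, y2) containing all key points."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1).argmax(dim=1)            # peak index of each heat map
    ys = torch.div(flat, W, rounding_mode="floor")
    xs = flat % W
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()
```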
Based on the above, the method may further include:
s21, when the number of the continuously tracked images to be detected is larger than a preset threshold value, acquiring current key point data of a tracked target in the current images to be monitored;
step S22, matching the current key point data with the key point data; and when the proportion of the changed data in the matching result is larger than a preset threshold value, updating the key point data of the tracking target according to the current key point data.
In the present exemplary embodiment, a detection period for the key frame image may also be set during tracking, for example 10, 20, or 50 frames. Suppose the detection period is 20 frames: target tracking starts once the key frame image is determined, that is, the key frame image is the 1st frame, and tracking succeeds in the subsequent consecutive images. When the current image to be detected is the 21st frame, it can be recognized according to the method in step S12 to obtain the current key point data of the tracking target, and a corresponding heat map can be generated from these data. The current key point data are then compared with the stored key point data, for example via the heat maps; if the key point data have changed substantially, for example the change in the number of key points is larger than a preset threshold, or the displacement of the key point positions is larger than a preset threshold, the current key point data are adopted as the new key point data, that is, the current image to be detected becomes the new key frame image. This realizes the updating of the key point data and ensures their validity throughout the tracking process.
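The refresh decision of steps S21 and S22 could be sketched as follows; the matching rule used here, counting a key point as "changed" when its peak moved more than `max_shift` pixels, is an illustrative assumption, since the disclosure only requires comparing the two key point sets and thresholding the changed proportion:

```python
import torch

def should_update(old_xy, new_xy, max_shift=8.0, ratio_thresh=0.5):
    """old_xy, new_xy: [K, 2] key point coordinates from the two heat map sets."""
    shift = (new_xy.float() - old_xy.float()).norm(dim=1)       # per-key-point displacement
    changed_ratio = (shift > max_shift).float().mean().item()   # proportion of changed data
    return changed_ratio > ratio_thresh                         # True -> adopt new key frame
```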
Based on the above, in other exemplary embodiments of the present disclosure, if the tracking target is not detected in the current image to be detected during tracking, the key point data of the previous key frame image are still used to identify the tracking target in the next frame. If the tracking target is not detected in n consecutive frames to be detected, that is, when tracking is lost, key frame detection is performed again from the (n+1)-th frame, thereby ensuring the continuity of the tracking process, where n is a positive integer.
The method provided by the embodiments of the present disclosure may run on the user's terminal side: for example, the tracking video is acquired through an external camera or over the network, and the method executes on the terminal to track the target in real time. Alternatively, the method may run on the server side: after receiving the tracking video, the server executes the above method to obtain the tracking target and then sends the tracking result to the user terminal.
According to the method provided by the embodiments of the present disclosure, the key frame image containing the tracking target is determined first, and the key frame image is then processed to obtain the key point data of the tracking target and the corresponding feature map; the motion range, direction, and position of the key points in the image to be detected are predicted from the key point data and the feature map and output in the form of a heat map, thereby tracking the target. By predicting the motion trend of the key points, the search range in the next frame is narrowed, which effectively reduces the amount of computation, increases speed, and further improves the efficiency of tracking tasks.
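Tying the pieces together, an end-to-end tracking loop in the spirit of this section might look like the sketch below; every helper named here (find_key_frame, extract_keypoints, second_feature_map, predict_keypoints, heatmaps_to_bbox) refers to the hypothetical functions sketched earlier, not to an API defined by the disclosure:

```python
def track(frames, target_label, extract_keypoints, predictor, period=20):
    """Track the target across frames; returns {frame index: bounding box}."""
    key_idx, region = find_key_frame(frames, target_label)    # S11: locate the key frame
    kpt_heatmap = extract_keypoints(region)                   # S12: key point heat maps [K, h, w]
    boxes = {}
    for i in range(key_idx + 1, len(frames)):                 # S13-S15 for each following frame
        feat2 = second_feature_map(frames[i])
        pred = predict_keypoints(kpt_heatmap, feat2, predictor)
        boxes[i] = heatmaps_to_bbox(pred)
        if (i - key_idx) % period == 0:                       # periodic refresh (S21-S22)
            kpt_heatmap = pred                                # adopt the refreshed key points
    return boxes
```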
It is to be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to an exemplary embodiment of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Further, referring to fig. 4, in an embodiment of the present example, there is also provided a target tracking apparatus 40, including: a key frame identification module 401, a key point data calculation module 402, a second feature information calculation module 403, a predicted key point calculation module 404, and a tracking target acquisition module 405. Wherein:
the key frame identification module 401 may be configured to acquire a video to be tracked, and perform target detection on the video to be tracked to acquire a key frame image including a tracking target.
The key point data calculation module 402 may be configured to perform image recognition on the key frame image to obtain an object region containing a tracking target, and perform feature extraction on the object region to obtain key point data of the tracking target.
The second feature information calculation module 403 may be configured to extract an image to be detected of a next frame adjacent to the key frame, and perform feature extraction on the image to be detected, so as to obtain second feature information of the image to be detected.
The predicted keypoint calculation module 404 may be configured to input the second feature information and the keypoint information as input parameters into a trained prediction model to obtain predicted keypoint data of keypoints in the image to be detected.
The tracking target obtaining module 405 may be configured to determine the tracking target in the image to be detected according to the predicted key point data.
In an example of the present disclosure, the apparatus may further include: a heat map conversion module (not shown).
The heat map conversion module may be configured to generate a corresponding key point heat map according to the key point data, and use the key point heat map as an input parameter.
In an example of the disclosure, when the input parameters are the second feature map and the keypoint heat map, the predicted keypoint calculation module 404 may include: a first merging module and a first calculation module (not shown in the figure). Wherein:
the first merging module may be configured to merge the keypoint heat map and the second feature map based on pixel channels to obtain a merged feature image.
The first calculation module can be used for inputting the merged feature image into a trained prediction model based on a stacked hourglass network structure to obtain the predicted key point data of the image to be detected.
In an example of the present disclosure, the apparatus may further include: a first profile calculation module (not shown).
The first feature map calculation module may be configured to perform feature extraction on the key frame image to obtain a first feature map of the key frame image; and using the first feature map as an input parameter of a prediction model.
In an example of the disclosure, when the input parameters include the keypoint heat map, the first feature map, and the second feature map, the predicted keypoint calculation module 404 may include: a second merging module and a second calculating module (not shown in the figure). Wherein:
the second merging module may be configured to merge the keypoint heat map, the first feature map, and the second feature map based on pixel channels to obtain a merged feature image.
The second calculation module may be configured to input the merged feature image into a trained prediction model based on a stacked hourglass network structure to obtain the predicted key point data of the image to be detected.
In one example of the present disclosure, the tracking target acquisition module 405 includes: a bounding box calculation unit (not shown in the figure).
The bounding box calculation unit may be configured to perform a bounding box calculation based on the predicted key point data and take the bounding box calculation result as the tracking target.
In one example of the present disclosure, the apparatus may further include: an image judgment module and a key point updating module (not shown in the figure). Wherein:
the image judgment module can be used for acquiring the current key point data of the tracked target in the current image to be monitored when the number of the continuously tracked frames of the image to be detected is greater than a preset threshold value.
The key point updating module may be configured to match the current key point data with the key point data; and when the proportion of the changed data in the matching result is larger than a preset threshold value, updating the key point data of the tracking target according to the current key point data.
The specific details of each module in the above target tracking device have been described in detail in the corresponding target tracking method, and therefore are not described herein again.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
FIG. 5 illustrates a schematic structural diagram of a computer system suitable for implementing the electronic device of an embodiment of the present invention.
It should be noted that the computer system 500 of the electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiment of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501, which can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. The RAM 503 also stores various programs and data necessary for system operation. The CPU 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An Input/Output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN (Local Area Network) card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read from it is installed into the storage section 508 as needed.
In particular, according to embodiments of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. When executed by the Central Processing Unit (CPU) 501, the computer program performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiment of the present invention may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the method described in the above embodiments. For example, the electronic device may implement the steps shown in fig. 1.
Furthermore, the above-described drawings are only schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily appreciated that the processes illustrated in the above figures are not intended to indicate or limit the temporal order of the processes. In addition, it is also readily understood that these processes may be performed, for example, synchronously or asynchronously in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (8)

1. A target tracking method, comprising:
acquiring a video to be tracked, and carrying out target detection on the video to be tracked to acquire a key frame image containing a tracking target;
performing image recognition on the key frame image to acquire an object region containing a tracking target, and performing key point extraction on the object region to acquire key point data of the tracking target; generating a corresponding key point heat map according to the key point data; and
extracting an image to be detected of a next frame adjacent to the key frame, and performing feature extraction on the image to be detected to obtain a second feature map of the image to be detected;
inputting the second feature map and the key point heat map corresponding to the key point data into a prediction model as input parameters to obtain predicted key point data for each key point in the image to be detected, comprising: merging the key point heat map and the second feature map along the pixel channel dimension to obtain a merged feature image; and inputting the merged feature image into a trained prediction model based on a stacked hourglass network structure to obtain the predicted key point data of the image to be detected;
and determining the tracking target in the image to be detected according to the predicted key point data.
2. The target tracking method of claim 1, wherein in obtaining the key point data of the tracked target, the method further comprises:
performing feature extraction on the key frame image to obtain a first feature map of the key frame image; and using the first feature map as an input parameter of a prediction model.
3. The target tracking method according to claim 2, wherein when the input parameters include the key point heat map, the first feature map, and the second feature map, obtaining the predicted key point data for each key point in the image to be detected comprises:
merging the key point heat map, the first feature map, and the second feature map along the pixel channel dimension to obtain a merged feature image;
and inputting the merged feature image into a trained prediction model based on a stacked hourglass network structure to obtain the predicted key point data of the image to be detected.
4. The target tracking method according to claim 1, wherein determining the tracking target in the image to be detected according to the predicted key point data comprises:
and carrying out bounding box calculation according to the predicted key point data, and taking a bounding box calculation result as a tracking target.
5. The target tracking method of claim 1, further comprising:
when the number of continuously tracked images to be detected is larger than a preset threshold, acquiring current key point data of the tracked target in the current image to be detected;
matching the current key point data with the key point data; and when the proportion of the changed data in the matching result is larger than a preset threshold value, updating the key point data of the tracking target according to the current key point data.
6. An object tracking device, comprising:
the key frame identification module is used for acquiring a video to be tracked, and carrying out target detection on the video to be tracked so as to acquire a key frame image containing a tracking target;
the key point data calculation module is used for carrying out image identification on the key frame image to obtain an object area containing a tracking target and carrying out feature extraction on the object area to obtain key point data of the tracking target;
the heat map conversion module is used for generating a corresponding key point heat map according to the key point data and taking the key point heat map as an input parameter;
the second characteristic information calculation module is used for extracting an image to be detected of a next frame adjacent to the key frame and extracting the characteristics of the image to be detected so as to obtain second characteristic information of the image to be detected;
the predicted key point calculation module is used for inputting the second characteristic information and the key point data as input parameters into a trained prediction model to obtain predicted key point data of the key points in the image to be detected, and comprises: the first merging module, used for merging the key point heat map and the second feature map along the pixel channel dimension to obtain a merged feature image; and the first calculation module, used for inputting the merged feature image into a trained prediction model based on a stacked hourglass network structure to obtain the predicted key point data of the image to be detected;
and the tracking target acquisition module is used for determining the tracking target in the image to be detected according to the prediction key point data.
7. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the object tracking method according to any one of claims 1 to 5.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the target tracking method of any one of claims 1 to 5.
CN201910611097.2A 2019-07-08 2019-07-08 Target tracking method and device Active CN110378264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910611097.2A CN110378264B (en) 2019-07-08 2019-07-08 Target tracking method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910611097.2A CN110378264B (en) 2019-07-08 2019-07-08 Target tracking method and device

Publications (2)

Publication Number Publication Date
CN110378264A (en) 2019-10-25
CN110378264B (en) 2023-04-18

Family

ID=68252329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910611097.2A Active CN110378264B (en) 2019-07-08 2019-07-08 Target tracking method and device

Country Status (1)

Country Link
CN (1) CN110378264B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796412B (en) * 2019-10-29 2022-09-06 浙江大华技术股份有限公司 Parcel tracking method and related device
CN110909630B (en) * 2019-11-06 2023-04-18 腾讯科技(深圳)有限公司 Abnormal game video detection method and device
CN112926356A (en) * 2019-12-05 2021-06-08 北京沃东天骏信息技术有限公司 Target tracking method and device
CN111161316B (en) * 2019-12-18 2023-08-01 深圳云天励飞技术有限公司 Target object tracking method and device and terminal equipment
CN111127516A (en) * 2019-12-19 2020-05-08 苏州智加科技有限公司 Target detection and tracking method and system without search box
CN111598410B (en) * 2020-04-24 2023-09-29 Oppo(重庆)智能科技有限公司 Product spot inspection method and device, computer readable medium and terminal equipment
CN111709428B (en) 2020-05-29 2023-09-15 北京百度网讯科技有限公司 Method and device for identifying positions of key points in image, electronic equipment and medium
CN111696134B (en) * 2020-06-03 2023-05-23 阿波罗智联(北京)科技有限公司 Target detection method and device and electronic equipment
CN111491180B (en) * 2020-06-24 2021-07-09 腾讯科技(深圳)有限公司 Method and device for determining key frame
CN111898471A (en) * 2020-07-09 2020-11-06 北京捷通华声科技股份有限公司 Pedestrian tracking method and device
CN111914690B (en) * 2020-07-15 2023-11-10 西安米克斯智能技术有限公司 Target object medium-long-term tracking method in video identification
CN111890365B (en) * 2020-07-31 2022-07-12 平安科技(深圳)有限公司 Target tracking method and device, computer equipment and storage medium
CN111950419A (en) * 2020-08-03 2020-11-17 中国民用航空华东地区空中交通管理局 Image information prediction method, image information prediction device, computer equipment and storage medium
CN112232142A (en) * 2020-09-27 2021-01-15 浙江大华技术股份有限公司 Safety belt identification method and device and computer readable storage medium
CN112381858A (en) * 2020-11-13 2021-02-19 成都商汤科技有限公司 Target detection method, device, storage medium and equipment
CN112465868B (en) * 2020-11-30 2024-01-12 浙江华锐捷技术有限公司 Target detection tracking method and device, storage medium and electronic device
CN112800279B (en) * 2020-12-30 2023-04-18 中国电子科技集团公司信息科学研究院 Video-based emergency target information acquisition method, device, equipment and medium
CN112837340B (en) * 2021-02-05 2023-09-29 Oppo广东移动通信有限公司 Attribute tracking method, attribute tracking device, electronic equipment and storage medium
CN113034580B (en) * 2021-03-05 2023-01-17 北京字跳网络技术有限公司 Image information detection method and device and electronic equipment
CN113469041A (en) * 2021-06-30 2021-10-01 北京市商汤科技开发有限公司 Image processing method and device, computer equipment and storage medium
WO2023184197A1 (en) * 2022-03-30 2023-10-05 京东方科技集团股份有限公司 Target tracking method and apparatus, system, and storage medium
CN114973391B (en) * 2022-06-30 2023-03-21 北京万里红科技有限公司 Eyeball tracking method, device and equipment applied to metacarpal space
CN117553756B (en) * 2024-01-10 2024-03-22 中国人民解放军32806部队 Off-target amount calculating method, device, equipment and storage medium based on target tracking

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016034008A1 (en) * 2014-09-04 2016-03-10 华为技术有限公司 Target tracking method and device
CN108960090A (en) * 2018-06-20 2018-12-07 腾讯科技(深圳)有限公司 Method of video image processing and device, computer-readable medium and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016034008A1 (en) * 2014-09-04 2016-03-10 华为技术有限公司 Target tracking method and device
CN108960090A (en) * 2018-06-20 2018-12-07 腾讯科技(深圳)有限公司 Method of video image processing and device, computer-readable medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Detection and tracking algorithm for specific building areas in airborne complex remote sensing scenes; Bi Fukun et al.; Acta Electronica Sinica (电子学报); 2016-06-15 (No. 06); full text *

Also Published As

Publication number Publication date
CN110378264A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
CN110378264B (en) Target tracking method and device
CN108985259B (en) Human body action recognition method and device
US10671855B2 (en) Video object segmentation by reference-guided mask propagation
CN111192292B (en) Target tracking method and related equipment based on attention mechanism and twin network
CN110232330B (en) Pedestrian re-identification method based on video detection
CN111160202B (en) Identity verification method, device, equipment and storage medium based on AR equipment
CN112560827B (en) Model training method, model training device, model prediction method, electronic device, and medium
CN109977832B (en) Image processing method, device and storage medium
CN111079507B (en) Behavior recognition method and device, computer device and readable storage medium
CN111382647B (en) Picture processing method, device, equipment and storage medium
CN114898416A (en) Face recognition method and device, electronic equipment and readable storage medium
CN113256683B (en) Target tracking method and related equipment
Zhang et al. EventMD: High-speed moving object detection based on event-based video frames
CN112579824A (en) Video data classification method and device, electronic equipment and storage medium
CN116934796A (en) Visual target tracking method based on twinning residual error attention aggregation network
US20230394875A1 (en) Method and device for multi-dnn-based face recognition using parallel-processing pipelines
Srilekha et al. A novel approach for detection and tracking of vehicles using Kalman filter
CN115731179A (en) Track component detection method, terminal and storage medium
CN111539420B (en) Panoramic image saliency prediction method and system based on attention perception features
CN112668504A (en) Action recognition method and device and electronic equipment
CN115249215A (en) Image processing method, image processing device, electronic equipment and readable storage medium
CN113033397A (en) Target tracking method, device, equipment, medium and program product
WO2020237674A1 (en) Target tracking method and apparatus, and unmanned aerial vehicle
CN117292338B (en) Vehicle accident identification and analysis method based on video stream analysis
CN116580063B (en) Target tracking method, target tracking device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant