CN115661198A - Target tracking method, device and medium based on single-stage target tracking model - Google Patents

Info

Publication number
CN115661198A
CN115661198A
Authority
CN
China
Prior art keywords
branch
tracking
detection
target
pedestrian
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211287178.XA
Other languages
Chinese (zh)
Inventor
蒋召
黄泽元
祁晓婷
杨战波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202211287178.XA
Publication of CN115661198A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a target tracking method, apparatus and medium based on a single-stage target tracking model. The method comprises the following steps: acquiring video data containing a tracking object and inputting the video data into a predetermined single-stage target tracking model; processing each frame of image with the feature extraction module in the single-stage target tracking model to obtain a feature map; inputting the feature map into the detection branch and regressing the position of the tracking object in the image with the detection branch to obtain the target frame corresponding to the tracking object; inputting the feature map into the pedestrian re-identification branch and extracting a low-dimensional feature vector from the feature map with that branch to obtain the pedestrian re-identification low-dimensional feature vector; and tracking the trajectory of the tracking object through the video data based on the target frame corresponding to the tracking object and the pedestrian re-identification low-dimensional feature vector. The method improves the accuracy, speed and robustness of the tracking algorithm.

Description

Target tracking method, device and medium based on single-stage target tracking model
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, and a medium for tracking a target based on a single-stage target tracking model.
Background
Pedestrian detection uses computer vision techniques to determine whether pedestrians are present in an image or video stream and to locate them accurately. The technology has a wide range of applications: combined with pedestrian tracking, pedestrian re-identification and related techniques, it applies well to practical scenarios such as artificial intelligence systems, driver assistance systems, intelligent video surveillance, human behavior analysis and intelligent transportation.
In current target tracking methods based on a pedestrian re-identification network, each frame of image is extracted from the video stream and input into the pedestrian re-identification network for target detection, and the position of a pedestrian in the image is determined from the detection result. Current target detection approaches are mainly single-stage and two-stage: the single-stage approach has short inference time and high tracking speed but low tracking accuracy, while the two-stage approach achieves a certain accuracy at the cost of long inference time, which reduces the target tracking speed.
Disclosure of Invention
In view of this, embodiments of the present application provide a target tracking method, apparatus, and medium based on a single-stage target tracking model, so as to solve the prior-art problems that target tracking algorithms have long inference time, reduced target tracking speed, and low tracking accuracy.
In a first aspect of the embodiments of the present application, a target tracking method based on a single-stage target tracking model is provided, including: acquiring video data containing a tracking object, and inputting the video data into a predetermined single-stage target tracking model; processing each frame of image in the video data with the feature extraction module in the single-stage target tracking model to obtain a feature map; inputting the feature map into the detection branch of the single-stage target tracking model, and regressing the position of the tracked object in the image with the detection branch to obtain the target frame corresponding to the tracked object, wherein the detection branch adopts an Anchor-Free detection module; inputting the feature map into the pedestrian re-identification branch of the single-stage target tracking model, and extracting a low-dimensional feature vector from the feature map with the pedestrian re-identification branch to obtain the pedestrian re-identification low-dimensional feature vector; and tracking the trajectory generated by the tracking object in the video data based on the target frame corresponding to the tracking object and the pedestrian re-identification low-dimensional feature vector.
In a second aspect of the embodiments of the present application, a target tracking apparatus based on a single-stage target tracking model is provided, including: an input module configured to acquire video data containing a tracking object and input the video data into a predetermined single-stage target tracking model; a processing module configured to process each frame of image in the video data with the feature extraction module in the single-stage target tracking model to obtain a feature map; a regression module configured to input the feature map into the detection branch of the single-stage target tracking model and regress the position of the tracked object in the image with the detection branch to obtain the target frame corresponding to the tracked object, wherein the detection branch adopts an Anchor-Free detection module; an extraction module configured to input the feature map into the pedestrian re-identification branch of the single-stage target tracking model and extract a low-dimensional feature vector from the feature map with the pedestrian re-identification branch to obtain the pedestrian re-identification low-dimensional feature vector; and a tracking module configured to track the trajectory generated by the tracking object in the video data based on the target frame corresponding to the tracking object and the pedestrian re-identification low-dimensional feature vector.
In a third aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method.
In a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps of the above method.
The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:
acquiring video data containing a tracking object and inputting the video data into a predetermined single-stage target tracking model; processing each frame of image in the video data with the feature extraction module in the single-stage target tracking model to obtain a feature map; inputting the feature map into the detection branch of the single-stage target tracking model, and regressing the position of the tracked object in the image with the detection branch to obtain the target frame corresponding to the tracked object, wherein the detection branch adopts an Anchor-Free detection module; inputting the feature map into the pedestrian re-identification branch of the single-stage target tracking model, and extracting a low-dimensional feature vector from the feature map with the pedestrian re-identification branch to obtain the pedestrian re-identification low-dimensional feature vector; and tracking the trajectory generated by the tracking object in the video data based on the target frame corresponding to the tracking object and the pedestrian re-identification low-dimensional feature vector. With this Anchor-Free single-stage target tracking model, the accuracy and robustness of target tracking are improved while the tracking speed is guaranteed.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for the embodiments or the description of the prior art are briefly introduced below. The drawings in the following description are only some embodiments of the present application; those of ordinary skill in the art may obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a single-stage object tracking model according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a target tracking method based on a single-stage target tracking model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating the principle of data enhancement using the MixUp enhancement algorithm according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an Anchor-Free based detection branch provided in the embodiments of the present application;
FIG. 5 is a schematic diagram of the structure of the Re-ID branch provided in the embodiments of the present application;
FIG. 6 is a schematic structural diagram of a target tracking device based on a single-stage target tracking model according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details such as particular system structures and techniques are set forth in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
Multi-object tracking (MOT) is a long-standing goal of computer vision; it aims to estimate the trajectories of multiple targets in a video, and solving this task would benefit many applications, such as action recognition, sports video analysis, elderly care, and human-computer interaction. Most existing SOTA methods adopt a two-stage detection approach (the two-step method). Although the two-step method has brought clear performance gains to target tracking as target detection and Re-ID algorithms have developed in recent years, it does not share the feature map between the detection algorithm and Re-ID, so it is very slow and can hardly run inference at video rate; a fast target tracking algorithm is therefore urgently needed.
Current target detection approaches are mainly single-stage and two-stage. Although the two-stage approach achieves a certain accuracy, its inference time is long, which reduces the target tracking speed. With the maturation of two-step tracking algorithms, more researchers have begun to study one-shot algorithms that detect targets and learn Re-ID features simultaneously: once the feature map is shared between target detection and Re-ID, inference time drops greatly, but accuracy is much lower than that of the two-step method. A single-stage target tracking algorithm that guarantees both tracking speed and tracking accuracy is therefore required.
In view of the above, embodiments of the present application provide an Anchor-Free single-stage target tracking algorithm, which first processes each frame of image with a feature extraction module to obtain a feature map, then inputs the feature map into a detection branch and a pedestrian re-identification branch, regresses the position of the tracked object in the image with the detection branch to determine the target frame of the tracked object, extracts a low-dimensional feature vector from the feature map with the pedestrian re-identification branch, and finally tracks the object based on the target frame corresponding to the tracked object and the pedestrian re-identification low-dimensional feature vector, thereby implementing single-stage target tracking of pedestrians. The following detailed description of the embodiments of the present application is made with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a single-stage target tracking model provided in an embodiment of the present application, and as shown in fig. 1, the single-stage target tracking model may specifically include:
the single-stage target tracking model comprises an Input module, a feature extraction module, a Detection branch Detection Head and a pedestrian Re-identification branch Re-ID Head; the Input module is used for processing video data into a series of image frames, and sequentially inputting each image frame into a single-stage target tracking model for target detection; the feature extraction module Backbone is used for extracting a feature map from the image, and a deformable convolution network DCN is inserted into the feature extraction module Backbone; the Detection branch Detection Head adopts an Anchor-Free-based Detection module and is used for regressing the specific position of a tracking object (such as a pedestrian) from the characteristic diagram to obtain a target frame corresponding to the tracking object; the pedestrian Re-identification branch Re-ID Head is used for extracting a low-dimensional feature vector from the feature map, and Re-ID feature information (i.e., low-dimensional vector information) is input.
The following describes in detail an implementation process of the target tracking method based on the single-stage target tracking model according to the present application, based on the structure of the single-stage target tracking model shown in fig. 1.
Fig. 2 is a schematic flowchart of a target tracking method based on a single-stage target tracking model according to an embodiment of the present disclosure. The single-stage object tracking model-based object tracking method of fig. 2 may be performed by a server. As shown in fig. 2, the target tracking method based on the single-stage target tracking model may specifically include:
s201, acquiring video data containing a tracking object, and inputting the video data into a preset single-stage target tracking model;
s202, processing each frame of image in the video data by using a feature extraction module in the single-stage target tracking model to obtain a feature map;
s203, inputting the feature map into a detection branch in the single-stage target tracking model, and regressing the position of the tracked object in the image by using the detection branch to obtain a target frame corresponding to the tracked object, wherein the detection branch adopts an Anchor-Free-based detection module;
s204, inputting the feature map into a pedestrian re-identification branch in the single-stage target tracking model, and extracting the low-dimensional feature vector of the feature map by using the pedestrian re-identification branch to obtain the pedestrian re-identification low-dimensional feature vector;
and S205, tracking the track generated by the tracking object in the video data based on the target frame corresponding to the tracking object and the pedestrian re-identification low-dimensional feature vector.
Specifically, when tracing the trajectory of a person over a period of time, video streams are usually acquired and input into a model for detection. The single-stage target tracking model, however, operates on images: it actually performs independent target detection on each frame of image in the video stream. Therefore, the embodiments of the present application can acquire video data containing the tracking object, or directly acquire image data and use the image data as the input of the single-stage target tracking model. The type of the acquired input data does not limit the technical solution of the present application. A per-frame processing loop of this kind is sketched below.
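As a hedged illustration of the frame-by-frame flow of steps S201 to S205, a minimal loop might look as follows. The OpenCV decoding, the preprocessing, the assumption that the model returns already-decoded boxes and embeddings, and the greedy cosine-similarity matcher with a 0.6 gate are all illustrative choices, not details fixed by this application.

```python
import cv2
import torch
import torch.nn.functional as F

def track_video(model, video_path: str, sim_gate: float = 0.6):
    """Sketch: run the single-stage model on every frame and link detections
    into tracks by greedy cosine matching of their Re-ID embeddings."""
    cap = cv2.VideoCapture(video_path)
    tracks, next_id = [], 0  # each track: {'id', 'box', 'emb'}
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of the video stream
        # HWC uint8 -> NCHW float; real resizing/normalization is assumed elsewhere
        x = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        with torch.no_grad():
            boxes, embs = model(x)  # assumed decoded outputs: (N, 4) boxes, (N, 128) embeddings
        for box, emb in zip(boxes, embs):
            best, best_sim = None, sim_gate  # accept only matches above the gate
            for t in tracks:
                sim = F.cosine_similarity(emb, t['emb'], dim=0).item()
                if sim > best_sim:
                    best, best_sim = t, sim
            if best is None:  # unmatched detection starts a new trajectory
                tracks.append({'id': next_id, 'box': box, 'emb': emb})
                next_id += 1
            else:             # matched detection extends an existing trajectory
                best['box'], best['emb'] = box, emb
    cap.release()
    return tracks
```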
Furthermore, the single-stage target tracking model in the embodiments of the present application adopts an Anchor-Free detection branch. An Anchor-Based target detection algorithm has to generate a large number of candidate detection frames in the image and select the frame closest to the target position as the detection result. With Anchor-Free, the positions of pedestrians in the image are regressed directly, without generating a large number of detection frames, which speeds up the detection algorithm.
In some embodiments, the pre-training process of the single-stage object tracking model comprises: acquiring pre-configured detection data, performing data enhancement on the detection data by using a MixUp enhancement algorithm to obtain new detection data, and training a detection branch in a single-stage target tracking model by using the new detection data; after the detection branches are trained, a complete single-stage target tracking model is trained by using preconfigured tracking data to obtain the trained single-stage target tracking model, wherein the tracking data comprises target frames and object identifications corresponding to the target frames.
Specifically, before the single-stage target tracking model is used for target tracking, it needs to be pre-trained. In the pre-training process, the detection data is first enhanced with the MixUp enhancement algorithm, and the new detection data formed after enhancement is used to train the Anchor-Free detection branch; while the detection branch is being trained, the pedestrian re-identification branch can be removed from the model. After the detection branch is trained, the tracking data is used to train the complete single-stage target tracking model.
It should be noted that the detection data may contain only target frames, without the object identifiers corresponding to the target frames, whereas the tracking data contains both the target frames and the object identifier of each target frame. The object identifier can be regarded as the label of the pedestrian re-identification branch (Re-ID branch): it indicates which tracking object (for example, which pedestrian) a target frame corresponds to.
In some embodiments, the data enhancement of the detection data by using the MixUp enhancement algorithm to obtain new detection data includes: and fusing any two original images in the detection data so as to fuse the two original images into one image to obtain a fused image, superposing target frames in the original images in the fused image, and generating new detection data according to the original images and the fused image.
Specifically, MixUp is a data augmentation algorithm used in computer vision: it mixes images of different classes to expand the training data set. The implementation and principle of the MixUp enhancement algorithm are described below with reference to the accompanying drawings. FIG. 3 is a schematic diagram of the principle of data enhancement with the MixUp enhancement algorithm according to an embodiment of the present application; as shown in FIG. 3, the data enhancement process may specifically include:
in the pre-training stage of the Detection branch Detection Head, in order to improve the performance of the Detection algorithm, a MixUp enhancement algorithm is used for data enhancement in the embodiment of the application, the principle of the enhancement algorithm is that two groups of pictures are fused into one group, two groups of original pictures in the Detection data are fused in pairs to obtain new pictures, the obtained new pictures are used as new fused pictures, meanwhile, target frames in the original pictures are overlapped in the new pictures, finally, the original pictures and the new fused pictures generate new Detection data together, and the new Detection data are used for training the Detection branch Detection Head.
In some embodiments, a deformable convolutional network is inserted into the feature extraction module. The deformable convolutional network adopts a ResNet34 residual network; 2 3x3 convolutional layers are arranged in the stem stage of the ResNet34 residual network; the strides of the two convolutional layers before the residual branch in the original residual module are exchanged; and the stride-2 1x1 convolutional layer in the original shortcut branch is replaced by a stride-2 average pooling layer and a 1x1 convolutional layer.
Specifically, in the embodiments of the present application a deformable convolutional network (DCN) is inserted into the feature extraction module Backbone. In practical applications, the DCN may adopt a ResNet34 network; compared with the existing ResNet34 network, the present application makes the following improvements to the original ResNet34:
in the feature extraction stage, in order to extract richer features, the original ResNet34 network is improved in the embodiment of the present application, a stem stage in the original ResNet backsbone includes a convolution layer of 7 × 7 and stride (step size) is 2, and since the calculation amount of convolution is quadratic to length and width, and the convolution calculation amount of 7 × 7 is 5.4 times greater than that of 3 × 3, the convolution kernel of 7 × 7 here is replaced by three conventional convolution kernels of 3 × 3, so that the calculation amount of convolution is reduced, and the same receptive field as that of 7 × 7 is maintained; next, in the embodiment of the present application, stride of the two convolution layers before the residual branch in the residual module is exchanged, so as to avoid information loss caused by 1 × 1 convolution with stride of 2, and similarly, 1 × 1 convolution with stride of 2 in the short-circuit branch is exchanged with average pooling layer and 1 × 1 convolution with stride of 2, so that information loss caused to the feature map in the residual module can be further reduced.
In some embodiments, the detection branch comprises a heatmap branch, a center offset branch, and a frame branch, each composed of two 3x3 convolutional layers and one 1x1 convolutional layer; the heatmap branch outputs a heatmap of size (1, H, W), the center offset branch outputs a center offset of size (2, H, W), and the frame branch outputs frame coordinates of size (2, H, W).
Specifically, existing Anchor-Based trackers extract the detection frame and the Re-ID information (low-dimensional vector information) of the object in the frame at the same time. The anchors generated by an Anchor-Based detector are not suitable for learning proper Re-ID information: one object may be covered and detected by several anchors, and since the differences among these anchors are large, the Re-ID features extracted from them also differ greatly, causing severe ambiguity in the network; moreover, the quality of the Re-ID features is strongly affected by the quality of the detection frames, and the detect-first, re-identify-later architecture prevents the network from learning the Re-ID branch well. In view of these problems of conventional Anchor-Based trackers, the embodiments of the present application provide an Anchor-Free detection branch.
The following describes a structure of the Anchor-Free detection branch provided in the embodiment of the present application with reference to the accompanying drawings, where fig. 4 is a schematic structural diagram of the Anchor-Free detection branch provided in the embodiment of the present application, and as shown in fig. 4, the Anchor-Free detection branch may specifically include:
the detection branch structure comprises three branches, namely a heat map branch, a center offset branch and a frame branch, wherein each branch consists of 2 3x3 convolutional layers and 1x1 convolutional layer, the output size of the heat map branch is (1, H, W), the output size of the center offset branch is (2, H, W), the output size of the frame branch is (2, H, W), H is height, and W is width.
In some embodiments, regressing the position of the tracking object in the image with the detection branch to obtain the target frame corresponding to the tracking object includes: regressing the position of the tracking object in the image based on the heat map, the center offset and the frame coordinates to obtain the corresponding target frame of the tracking object in the image, and determining the position of the target frame in the image.
Specifically, the output of the Anchor-Free detection branch contains three kinds of data: the heat map, the center offset, and the frame coordinates. Based on these three outputs, the Anchor-Free detection branch can regress the specific position of the tracked object (such as a pedestrian) in the image, i.e. the position of the target frame, and thus accurately determine where the pedestrian is in the image.
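The application does not spell out the decoding rule, so the sketch below assumes a CenterNet-style decoding in which heatmap peaks are taken as object centers, refined by the offset map and expanded by the predicted box size; the output stride of 4 and the score threshold are likewise assumptions.

```python
import torch
import torch.nn.functional as F

def decode_detections(heatmap, offset, box, score_thr: float = 0.4, stride: int = 4):
    """Sketch: turn (1,1,H,W) heatmap, (1,2,H,W) offset and (1,2,H,W) box-size
    maps into a list of (x1, y1, x2, y2, score) boxes in input-image pixels."""
    # keep only local maxima of the heatmap as candidate object centers
    peaks = (heatmap == F.max_pool2d(heatmap, 3, 1, 1)) & (heatmap > score_thr)
    ys, xs = torch.nonzero(peaks[0, 0], as_tuple=True)
    boxes = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        cx = (x + offset[0, 0, y, x].item()) * stride  # refined center x
        cy = (y + offset[0, 1, y, x].item()) * stride  # refined center y
        w = box[0, 0, y, x].item() * stride            # predicted width
        h = box[0, 1, y, x].item() * stride            # predicted height
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2,
                      heatmap[0, 0, y, x].item()))
    return boxes
```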
In some embodiments, the pedestrian re-identification branch adopts a Re-ID branch, which contains two 3x3 convolutional layers and one 1x1 convolutional layer and outputs a low-dimensional feature vector of size (128, H, W).
Specifically, the pedestrian re-identification branch (Re-ID branch) is used to extract low-dimensional feature vectors efficiently. The feature vectors output in a Re-ID task are usually high-dimensional, and training high-dimensional Re-ID features requires a large amount of training data, which a one-shot tracking algorithm does not have. Previous two-step methods are less affected by this problem because they can exploit rich Re-ID datasets that provide cropped human bodies; a one-shot tracking algorithm cannot use these datasets, since it needs the original, uncropped images. One solution is to reduce the dimensionality of the Re-ID features and thus their dependence on data. Feature vectors of lower dimensionality are therefore friendly to the one-shot method: learning them reduces the risk of over-fitting and improves tracking robustness.
The structure of the Re-ID branch provided in the embodiment of the present application is described below with reference to the accompanying drawings, and fig. 5 is a schematic structural diagram of the Re-ID branch provided in the embodiment of the present application, and as shown in fig. 5, the Re-ID branch may specifically include:
the Re-ID branch comprises a branch consisting of 2 3x3 convolutional layers and 1x1 convolutional layer with an output size of (128,h, W), where H is height and W is width, and in practical applications, the output of the Re-ID branch is a low-dimensional feature vector.
According to the technical solutions provided by the embodiments of the present application, a single-stage Anchor-Free target tracking algorithm is proposed that guarantees both the tracking speed and the tracking accuracy. The conventional feature extraction module Backbone is improved, and DCN is introduced to extract richer information, which benefits the subsequent learning of the Re-ID branch. In addition, an Anchor-Free detection branch and an efficient Re-ID branch are designed within the structure of the single-stage target tracking model, so that tracking accuracy and tracking robustness are improved while the tracking speed is guaranteed.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 6 is a schematic structural diagram of a target tracking device based on a single-stage target tracking model according to an embodiment of the present disclosure. As shown in fig. 6, the target tracking apparatus based on the single-stage target tracking model includes:
an input module 601 configured to acquire video data including a tracking object and input the video data into a predetermined single-stage target tracking model;
the processing module 602 is configured to process each frame of image in the video data by using a feature extraction module in the single-stage target tracking model to obtain a feature map;
the regression module 603 is configured to input the feature map into a detection branch of the single-stage target tracking model, and regress the position of the tracked object in the image by using the detection branch to obtain a target frame corresponding to the tracked object, wherein the detection branch adopts an Anchor-Free-based detection module;
the extraction module 604 is configured to input the feature map into a pedestrian re-identification branch in the single-stage target tracking model, and extract the low-dimensional feature vector of the feature map by using the pedestrian re-identification branch to obtain a pedestrian re-identification low-dimensional feature vector;
and a tracking module 605 configured to track a trajectory generated by the tracking object in the video data based on a target frame corresponding to the tracking object and the pedestrian re-identification low-dimensional feature vector.
In some embodiments, the pre-training module 606 in fig. 6 obtains pre-configured detection data, performs data enhancement on the detection data with the MixUp enhancement algorithm to obtain new detection data, and trains the detection branch in the single-stage target tracking model with the new detection data; after the detection branch is trained, the complete single-stage target tracking model is trained with preconfigured tracking data to obtain the trained single-stage target tracking model, where the tracking data includes target frames and the object identifiers corresponding to the target frames.
In some embodiments, the pre-training module 606 in fig. 6 fuses any two original images in the detection data into one image to obtain a fused image, superimposes the target frames of the original images on the fused image, and generates new detection data from the original images and the fused image.
In some embodiments, a deformable convolutional network is inserted into the feature extraction module. The deformable convolutional network adopts a ResNet34 residual network; 2 3x3 convolutional layers are arranged in the stem stage of the ResNet34 residual network; the strides of the two convolutional layers before the residual branch in the original residual module are exchanged; and the stride-2 1x1 convolutional layer in the original shortcut branch is replaced by a stride-2 average pooling layer and a 1x1 convolutional layer.
In some embodiments, the detection branch comprises a heatmap branch, a center offset branch, and a frame branch, each composed of two 3x3 convolutional layers and one 1x1 convolutional layer; the heatmap branch outputs a heatmap of size (1, H, W), the center offset branch outputs a center offset of size (2, H, W), and the frame branch outputs frame coordinates of size (2, H, W).
In some embodiments, the regression module 603 in fig. 6 regresses the position of the tracking object in the image based on the heat map, the center offset, and the frame coordinates to obtain the corresponding target frame of the tracking object in the image, and determines the position of the target frame in the image.
In some embodiments, the pedestrian re-identification branch adopts a Re-ID branch, which contains two 3x3 convolutional layers and one 1x1 convolutional layer and outputs a low-dimensional feature vector of size (128, H, W).
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by functions and internal logic of the process, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 7 is a schematic structural diagram of an electronic device 7 provided in an embodiment of the present application. As shown in fig. 7, the electronic apparatus 7 of this embodiment includes: a processor 701, a memory 702, and a computer program 703 stored in the memory 702 and executable on the processor 701. The steps in the various method embodiments described above are implemented when the processor 701 executes the computer program 703. Alternatively, the processor 701 implements the functions of each module/unit in each device embodiment described above when executing the computer program 703.
Illustratively, the computer program 703 may be partitioned into one or more modules/units, which are stored in the memory 702 and executed by the processor 701 to accomplish the present application. One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 703 in the electronic device 7.
The electronic device 7 may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another electronic device. The electronic device 7 may include, but is not limited to, the processor 701 and the memory 702. Those skilled in the art will appreciate that fig. 7 is merely an example of the electronic device 7 and does not constitute a limitation of it; the device may include more or fewer components than shown, combine certain components, or use different components; for example, the electronic device may also include input-output devices, network access devices, buses, and the like.
The processor 701 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 702 may be an internal storage unit of the electronic device 7, for example a hard disk or memory of the electronic device 7. The memory 702 may also be an external storage device of the electronic device 7, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device 7. Further, the memory 702 may include both an internal storage unit and an external storage device of the electronic device 7. The memory 702 is used to store the computer program and other programs and data required by the electronic device, and may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/computer device and method may be implemented in other ways. For example, the above-described apparatus/computer device embodiments are merely illustrative, and for example, a module or a unit may be divided into only one logical function, another division may be made in actual implementation, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the above embodiments may be completed by instructing related hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, implements the steps of the above method embodiments. The computer program may comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be added to or subtracted from as required by legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, computer-readable media may not include electrical carrier signals or telecommunication signals.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A target tracking method based on a single-stage target tracking model is characterized by comprising the following steps:
acquiring video data containing a tracking object, and inputting the video data into a preset single-stage target tracking model;
processing each frame image in the video data by utilizing a feature extraction module in the single-stage target tracking model to obtain a feature map;
inputting the feature map into a detection branch in the single-stage target tracking model, and regressing the position of the tracked object in an image by using the detection branch to obtain a target frame corresponding to the tracked object, wherein the detection branch adopts an Anchor-Free-based detection module;
inputting the feature map into a pedestrian re-identification branch in the single-stage target tracking model, and extracting a low-dimensional feature vector of the feature map by using the pedestrian re-identification branch to obtain a pedestrian re-identification low-dimensional feature vector;
and tracking the track generated by the tracking object in the video data based on the target frame corresponding to the tracking object and the low-dimensional feature vector re-identified by the pedestrian.
2. The method of claim 1, wherein the pre-training process of the single-stage object tracking model comprises:
acquiring pre-configured detection data, performing data enhancement on the detection data by using a MixUp enhancement algorithm to obtain new detection data, and training a detection branch in the single-stage target tracking model by using the new detection data;
after the detection branches are trained, training the complete single-stage target tracking model by using preconfigured tracking data to obtain the trained single-stage target tracking model, wherein the tracking data comprises target frames and object identifications corresponding to the target frames.
3. The method of claim 2, wherein the data enhancing the detection data by using the MixUp enhancing algorithm to obtain new detection data comprises:
and fusing any two original images in the detection data so as to fuse the two original images into one image to obtain a fused image, superposing target frames in the original images in the fused image, and generating the new detection data according to the original images and the fused image.
4. The method of claim 1, wherein a deformable convolutional network is inserted into the feature extraction module, the deformable convolutional network adopts a ResNet34 residual network, 2 3x3 convolutional layers are arranged in a stem stage of the ResNet34 residual network, the strides of the two convolutional layers before the residual branch in the original residual module are exchanged, and the stride-2 1x1 convolutional layer in the original shortcut branch is replaced by a stride-2 average pooling layer and a 1x1 convolutional layer.
5. The method of claim 1, wherein the detection branches comprise a heatmap branch, a center offset branch, and a frame branch, and wherein the heatmap branch, the center offset branch, and the frame branch are each comprised of 2 3x3 convolutional layers and 1x1 convolutional layer;
wherein the heatmap branch is for outputting a heatmap of size (1, H, W), the center offset branch is for outputting a center offset of size (2, H, W), and the block branch is for outputting block coordinates of size (2, H, W).
6. The method according to claim 5, wherein the obtaining of the target frame corresponding to the tracking object by performing regression on the position of the tracking object in the image using the detection branch comprises:
and regressing the position of the tracking object in the image based on the heat map, the central offset and the frame coordinates to obtain a corresponding target frame of the tracking object in the image, and determining the position of the target frame in the image.
7. The method of claim 1, wherein the pedestrian Re-identification branch employs a Re-ID branch comprising 2 3x3 convolutional layers and a 1x1 convolutional layer, the Re-ID branch outputting a low-dimensional feature vector of size (128, H, W).
8. A target tracking device based on a single-stage target tracking model is characterized by comprising:
an input module configured to acquire video data containing a tracking object, and input the video data into a predetermined single-stage target tracking model;
the processing module is configured to process each frame of image in the video data by using a feature extraction module in the single-stage target tracking model to obtain a feature map;
the regression module is configured to input the feature map into a detection branch in the single-stage target tracking model, and regress the position of the tracked object in an image by using the detection branch to obtain a target frame corresponding to the tracked object, wherein the detection branch adopts an Anchor-Free-based detection module;
the extraction module is configured to input the feature map into a pedestrian re-identification branch in the single-stage target tracking model, and extract a low-dimensional feature vector of the feature map by using the pedestrian re-identification branch to obtain a pedestrian re-identification low-dimensional feature vector;
a tracking module configured to track a track generated by the tracking object in the video data based on a target frame corresponding to the tracking object and the pedestrian re-identification low-dimensional feature vector.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202211287178.XA (priority date 2022-10-20, filing date 2022-10-20) Target tracking method, device and medium based on single-stage target tracking model, Pending, CN115661198A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211287178.XA CN115661198A (en) 2022-10-20 2022-10-20 Target tracking method, device and medium based on single-stage target tracking model

Publications (1)

Publication Number Publication Date
CN115661198A 2023-01-31

Family

ID=84990275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211287178.XA Pending CN115661198A (en) 2022-10-20 2022-10-20 Target tracking method, device and medium based on single-stage target tracking model

Country Status (1)

Country Link
CN (1) CN115661198A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935167A (en) * 2023-09-12 2023-10-24 深圳须弥云图空间科技有限公司 Training method and device for target tracking model
CN116935167B (en) * 2023-09-12 2024-05-10 深圳须弥云图空间科技有限公司 Training method and device for target tracking model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination