WO2023050678A1 - Multi-target tracking method and apparatus, and electronic device, storage medium and program - Google Patents


Info

Publication number
WO2023050678A1
Authority
WO
WIPO (PCT)
Prior art keywords
target object
target
detection
image
frame
Prior art date
Application number
PCT/CN2022/075415
Other languages
French (fr)
Chinese (zh)
Inventor
李震宇
李昂
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023050678A1 publication Critical patent/WO2023050678A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30248 Vehicle exterior or interior
    • G06T2207/30252 Vehicle exterior; Vicinity of vehicle

Definitions

  • the present disclosure relates to the technical field of image processing, and in particular, to a multi-target tracking method, device, electronic equipment, storage medium and program.
  • Multi-target tracking technology is a research hotspot in the field of computer vision.
  • Multi-target tracking refers to the use of a computer to determine the position, size and complete trajectory of each independent moving target of interest in a video sequence. It has a very wide range of applications in vehicle auxiliary systems, military fields and intelligent security fields.
  • Embodiments of the present disclosure at least provide a multi-target tracking method, device, electronic equipment, storage medium and program.
  • an embodiment of the present disclosure provides a multi-target tracking method applied to an electronic device, including:
  • The target tracking result is used to reflect detection results of the first target object in the current frame image and the at least one frame image.
  • Since the extracted appearance features of the first target object can better represent the identity information of the first target object, using this better feature information together with the similarity makes it possible to handle the reappearance of the target after being occluded.
  • the trajectory convergence can also reduce the probability of tracking instability caused by vehicle bumps, obtain more stable multi-target tracking results, and improve the stability of multi-target tracking.
  • an embodiment of the present disclosure provides a multi-target tracking device, which is applied to an electronic device, including:
  • the target detection module is configured to perform target detection on the current frame image, and obtain a first detection result of at least one detected first target object.
  • a feature extraction module configured to extract an appearance feature vector of the first target object.
  • the similarity calculation module is configured to calculate the similarity between the appearance feature vector of the first target object and the appearance feature vectors of each target object detected in at least one frame image before the current frame image.
  • The tracking result determining module is configured to determine a target tracking result for the first target object based on the similarity; the target tracking result is configured to reflect the detection results of the first target object in the current frame image and the multiple frame images.
  • An embodiment of the present disclosure provides an electronic device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the multi-target tracking method as described in the first aspect is executed.
  • An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the multi-target tracking method as described in the first aspect is executed.
  • An embodiment of the present disclosure provides a computer program, including computer-readable code; when the computer-readable code is run in an electronic device, a processor in the electronic device executes any one of the above multi-target tracking methods.
  • FIG. 1 shows a schematic diagram of an execution body of a multi-target tracking method provided by an embodiment of the present disclosure
  • FIG. 2 shows a flow chart of a multi-target tracking method provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of a target detection result of a current frame image provided by an embodiment of the present disclosure
  • FIG. 4 shows a flow chart of a method for determining a target tracking result for a first target object provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of a tracking effect of a first target object provided by an embodiment of the present disclosure
  • FIG. 6 shows a schematic diagram of another tracking effect of a first target object provided by an embodiment of the present disclosure
  • FIG. 7 shows a flow chart of a method for training a re-identification model provided by an embodiment of the present disclosure
  • FIG. 8 shows a schematic diagram of a radar tracking result provided by an embodiment of the present disclosure
  • FIG. 9 shows a flow chart of a method for acquiring an image sample provided by an embodiment of the present disclosure.
  • FIG. 10 shows a schematic diagram of an image sample set provided by an embodiment of the present disclosure
  • FIG. 11 shows a schematic structural diagram of a multi-target tracking device provided by an embodiment of the present disclosure
  • FIG. 12 shows a schematic structural diagram of another multi-target tracking device provided by an embodiment of the present disclosure
  • FIG. 13 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
  • CNN: Convolutional Neural Network
  • Research has found that occlusions occur frequently during multi-target tracking. When a target is occluded during tracking, the number of detected targets changes, and the track of the occluded target cannot be matched to any detection in the current frame. The tracker must then judge whether the track has disappeared temporarily due to occlusion or has left the detection area and tracking should be stopped, and some tracks of occluded targets are terminated due to misjudgment. After the occlusion ends, the originally tracked target reappears in the detection area; if its original track has already stopped being tracked, a new initial track is generated for the target, resulting in a change of target identity. In addition, when the vehicle is bumping, the distance between detection results of the same target across frames becomes large, which leads to low similarity, data association failure, and target tracking failure.
  • The present disclosure provides a multi-target tracking method, including: performing target detection on the current frame image to obtain a first detection result of at least one detected first target object; extracting the appearance feature vector of the first target object; calculating the similarity between the appearance feature vector of the first target object and the appearance feature vectors of each target object detected in at least one frame image before the current frame image; and, based on the similarity, determining a target tracking result for the first target object, where the target tracking result is used to reflect detection results of the first target object in the current frame image and the at least one frame image.
  • Since the extracted appearance features can better represent the identity information of the first target object, using this better feature information can handle the reappearance of the trajectory after the target is occluded, and can also reduce the probability of unstable tracking caused by vehicle bumps, thereby obtaining a more stable multi-target tracking result and improving the stability of multi-target tracking.
  • the multi-target tracking method provided by the embodiments of the present disclosure can be applied to an automatic driving system to track target objects within the field of view of the vehicle, and accurately obtain object tracking results, which is helpful for developers to design related tracking strategies and alarm strategies.
  • the multi-target tracking method provided by the embodiments of the present disclosure will be introduced in detail.
  • FIG. 1 is a schematic diagram of an execution body of a multi-target tracking method provided by an embodiment of the present disclosure.
  • the execution body of the method is an electronic device, wherein the electronic device may include a terminal and a server.
  • the method can be applied to a terminal.
  • The terminal can be a terminal device as shown in FIG. 1, such as a voice interaction device or a robot, which is not limited here.
  • voice interactive devices include but are not limited to smart speakers and smart home appliances.
  • the method can also be applied to a server, or can be applied to an implementation environment composed of a terminal and a server.
  • The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud storage, big data, and artificial intelligence platforms.
  • the server may communicate with the terminal through a network.
  • a network may include various connection types such as wires, wireless communication links, or fiber optic cables, among others.
  • the multi-target tracking method can also be implemented by software running on a terminal or a server, for example, the multi-target tracking method provided by the embodiments of the present disclosure is implemented by using an application program with a multi-target tracking function.
  • the multi-target tracking method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • FIG. 2 is a flowchart of a multi-target tracking method provided by an embodiment of the present disclosure.
  • the multi-target tracking method is applied to an electronic device and executed by the electronic device, and may include the following steps S101 to S104:
  • Step S101 Perform target detection on the current frame image to obtain a first detection result of at least one detected first target object.
  • In an image (for example, each frame image in a video), a closed area that is distinguished from the surrounding environment is often called an object. The process of giving the location of an object in an image is called detection.
  • A trained target detection model or target detection network may be used to detect the target, and other target detection technologies may also be used. The target detection technology can adopt a variety of methods, such as the frame difference method, background subtraction method, optical flow method, directional gradient features, etc., and the initial position of the target can also be marked manually.
  • any other suitable technology that can be used for target detection can also be used, which is not limited here, as long as the target detection can be realized.
  • FIG. 3 is a schematic diagram of a target detection result of the current frame image provided in the embodiment of the present disclosure.
  • The current frame image T can be input to the target detection model for target detection to obtain the first detection result of at least one detected first target object 10. In this embodiment, the first detection result includes three first target objects 10, and the first target objects are vehicles.
  • the number of the first target objects 10 may also be 2 or 4, and the type of the first target objects 10 may also be other types (such as pedestrians), which are not limited here.
  • The first detection result also includes at least one of: the detection frame information of the first target object 10 (as shown by A in FIG. 3), the type of the first target object 10, and the confidence level of the detection result of the first target object 10.
  • In this way, the first detection result also includes at least one of the detection frame information of the first target object, the type of the first target object, and the confidence level of the detection result of the first target object. This makes the content of the first detection result richer, provides more basis for the subsequent similarity calculation, and can improve the accuracy of the similarity calculation.
  • the video to be detected needs to be acquired, and then the current frame image is acquired from the video to be detected according to a preset time interval or frame number interval.
  • the video to be detected is a video or a sequence of video frames to be detected.
  • the video to be detected may be a video or a video stream with a certain video frame length.
  • After the target detection model acquires the video to be detected, it can acquire multiple frames of images to be detected at intervals from the video to be detected. For example, the video to be detected includes M frames of images to be detected, and the target detection model acquires at least one frame of the image to be detected from the M frames according to a preset time interval or every N frames.
  • The frame rate of the video to be detected is generally more than 25 frames per second. If an electronic device (such as a server) performed detection on every frame of the image to be detected, the amount of calculation would be too large, which would overload the server and affect both the processing speed of multi-target tracking and the number of video channels to be detected that can be accessed. Acquiring multiple frames of images to be detected at intervals from the video to be detected can therefore improve the processing speed of multi-target tracking in the video to be detected and increase the number of video channels that can be processed.
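The interval acquisition described above can be sketched as follows; this is a minimal illustration assuming a fixed frame-number interval, and the helper name is not from the patent:

```python
def sample_frames(frames, frame_step=5):
    """Pick every `frame_step`-th frame from a video's frame sequence.

    `frames` is the full list of M frames to be detected; sampling at an
    interval reduces the detection workload per video channel.
    """
    return frames[::frame_step]

# Example: a 25-frame "video", sampled every 5 frames.
video = list(range(25))
sampled = sample_frames(video, frame_step=5)
print(sampled)  # → [0, 5, 10, 15, 20]
```

A time-based interval works the same way, replacing the index step with a timestamp comparison.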
  • Step S102 Extracting an appearance feature vector of the first target object.
  • the previously obtained image to be detected may be input to a pre-trained re-identification model to extract appearance features of the first target object to obtain an appearance feature vector.
  • the extracted appearance features are all 64-dimensional vectors.
  • the training method of the re-identification model will be described in detail later.
  • other methods may also be used to extract the appearance feature vector of the first target object, which is not limited here.
  • Step S103 Calculate the similarity between the appearance feature vector of the first target object and the appearance feature vectors of each target object detected in at least one frame image before the current frame image.
  • The three first target objects 10 in FIG. 3 are taken as an example for illustration. The three first target objects 10 detected in the current frame image (also called the subsequent frame image in the table) are assigned codes D0, D1 and D2 respectively.
  • For convenience of description, one frame is taken from the multiple frame images before the current frame (also called the previous frame image in the table) as an example to calculate the similarity, and the target objects in the previous frame image are assigned codes Q0, Q1 and Q2 respectively.
  • The calculation results of the similarity are shown in Table 1 below:
  • the first target object D0 in the current frame image has the highest similarity with the target object Q0 in the previous frame image of the current frame image, and the similarity calculation result is 0.82;
  • The first target object D1 in the current frame image has the highest similarity with the target object Q1 in the previous frame image of the current frame image, and the similarity calculation result is 0.91;
  • The first target object D2 in the current frame image has the highest similarity with the target object Q2 in the previous frame image of the current frame image, and the similarity calculation result is 0.92.
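The patent does not fix a particular similarity metric for Step S103; assuming cosine similarity over the appearance feature vectors (shown here with short toy vectors rather than the 64-dimensional ones), the calculation can be sketched as:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional stand-ins for the 64-dimensional appearance vectors
# of a current-frame detection (d0) and a previous-frame track (q0).
d0 = [1.0, 0.0, 0.5, 0.2]
q0 = [0.9, 0.1, 0.4, 0.3]
print(round(cosine_similarity(d0, q0), 2))
```

Each current-frame vector is compared against every previous-frame vector, producing the pairwise similarity matrix summarized in Table 1.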
  • In order to reduce the impact of environmental mutations on the appearance features, a cache can be used to store the appearance features of the same object over the latest multiple frames (such as 100 frames), and the maximum similarity can be used as the calculation result, which can improve the reliability of object tracking. That is, a temporary storage strategy is used to store the appearance features of the same object over the most recent multiple frames. This strategy reduces the probability of similarity jumps caused by appearance mutations, thereby obtaining better multi-target tracking results.
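The temporary storage strategy above can be sketched as follows; the 100-frame cache size follows the text, while the helper names and the toy scalar similarity function are illustrative:

```python
from collections import defaultdict, deque

# Cache of recent appearance features per tracked object ID
# (the latest 100 frames, as suggested in the text).
feature_cache = defaultdict(lambda: deque(maxlen=100))

def update_cache(track_id, feature):
    """Store the newest appearance feature for a track, evicting the oldest."""
    feature_cache[track_id].append(feature)

def max_similarity(track_id, feature, sim_fn):
    """Similarity of a new detection to a track: the maximum over
    all cached appearance features of that track."""
    cached = feature_cache[track_id]
    if not cached:
        return 0.0
    return max(sim_fn(feature, f) for f in cached)

# Toy usage with a trivial similarity function on scalar "features".
sim = lambda a, b: 1.0 - abs(a - b)
update_cache("Q0", 0.5)
update_cache("Q0", 0.8)   # appearance changed between frames
print(max_similarity("Q0", 0.75, sim))  # best match against the cache
```

Taking the maximum over cached features means one appearance mutation in a single frame cannot collapse the similarity for the whole track.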
  • Step S104 Based on the similarity, determine a target tracking result for the first target object; the target tracking result is used to reflect the detection results of the first target object in the current frame image and the at least one frame image.
  • FIG. 4 is a flow chart of a method for determining a target tracking result for a first target object provided by an embodiment of the present disclosure. Determining the target tracking result for the first target object based on the similarity may include the following steps S1041 to S1042:
  • Step S1041 Based on the similarity, match the first detection result with the detection results of each target object in the at least one frame of image, and determine the detection result of the first target object in the at least one frame of image that matches the first target object.
  • matching also known as data association
  • the Hungarian algorithm may be used to match the first detection result with the detection results of each target object in the at least one frame of image, so that the matching accuracy may be improved.
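The data association in Step S1041 amounts to an optimal one-to-one assignment maximizing total similarity. The sketch below brute-forces the assignment for a tiny matrix; in practice the Hungarian algorithm (e.g. scipy.optimize.linear_sum_assignment on the negated matrix) solves it in polynomial time. The diagonal values follow the example above (0.82, 0.91, 0.92); the off-diagonal values are invented for illustration:

```python
from itertools import permutations

def best_assignment(sim):
    """Optimal one-to-one matching between current detections (rows)
    and previous-frame tracks (columns), maximizing total similarity.

    Brute force over permutations, usable only for tiny matrices; the
    Hungarian algorithm gives the same result in polynomial time.
    """
    n = len(sim)
    best, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(sim[i][perm[i]] for i in range(n))
        if total > best_total:
            best, best_total = perm, total
    return list(best), best_total

# Similarity matrix for D0..D2 (rows) vs Q0..Q2 (columns). Only the
# diagonal maxima come from the text; off-diagonals are made up.
sim = [
    [0.82, 0.10, 0.05],
    [0.15, 0.91, 0.20],
    [0.08, 0.12, 0.92],
]
match, total = best_assignment(sim)
print(match)  # → [0, 1, 2]: D0->Q0, D1->Q1, D2->Q2
```

Here the assignment recovers exactly the per-row maxima described above, confirming each current detection inherits the identity of its matched track.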
  • Step S1042 Determine a target tracking result for the first target object according to the detection result of the first target object in the at least one frame of images and the first detection result.
  • In this way, the extracted appearance features can better represent the identity information of the first target object 10, wherein the identifier of the detection frame of one first target object 10 is id1, and the identifier of the detection frame of another first target object 10 is id2.
  • FIG. 7 is a flow chart of a method for training a re-identification model provided by an embodiment of the present disclosure.
  • the above-mentioned re-identification model can be obtained through training using the following method. Specifically, when training the re-identification model, the following steps S1021 to S1023 may be included:
  • Step S1021 Obtain an image sample set, the image sample set includes a plurality of image samples and annotation information of the image samples, and the annotation information is used to indicate the image samples corresponding to the same target object.
  • the image samples are taken in an automatic driving scene.
  • the trained re-identification model can better adapt to the driving scene, improving the accuracy of model recognition and the adaptability in the driving scene.
  • Step S1022 Based on the image sample set, train the re-identification model to be trained to obtain the re-identification model.
  • The re-identification model to be trained is trained to obtain the re-identification model, and the re-identification model is used to extract the appearance feature vector of the first target object, so that the extraction accuracy of the appearance feature vector of the first target object can be improved.
  • the basic network to be trained can be determined according to specific needs.
  • A depthwise-separable convolutional network such as a MobileNetV2 network can be selected, and a lightweight convolutional neural network such as a MobileNetV1 network can also be selected, which is not limited here. In this way, by using a lightweight convolutional neural network as the basic training network, the recognition efficiency of the trained re-identification model can be improved, and real-time performance is stronger.
  • the image samples in the image sample set can be respectively input to the basic network for feature extraction, and then the basic network is trained according to the classification result and loss function to obtain a re-identification model.
  • the specific model training method is similar to the existing model training method.
  • The tracking results of the lidar can be used to build the image sample set. As shown in FIG. 8, multiple detection frames can be obtained through lidar tracking, such as point cloud detection frame 1, point cloud detection frame 2, point cloud detection frame 17, point cloud detection frame 18, and point cloud detection frame 19. However, as shown in FIG. 8, some point cloud detection frames (such as point cloud detection frame 19 or point cloud detection frame 17) are occluded, and the lidar tracking frame does not fit the target object (such as a vehicle) well. The image samples in the image sample set should not contain occluded objects, and the detection frame is required to fit the target object; otherwise unnecessary noise will be introduced, resulting in poor re-identification model training results.
  • the method for acquiring an image sample set includes the following steps S10211 to S10214:
  • Step S10211 Acquire candidate images captured by the camera, perform target detection on the candidate images, and obtain a second detection result indicating at least one detected second target object.
  • Step S10212 Obtain point cloud data collected synchronously with the candidate image for the same scene, detect the point cloud data, and obtain at least one point cloud detection frame.
  • Step S10213 Determine the intersection-over-union ratio IOU between the detection frame of the second target object and the point cloud detection frame in the second detection result.
  • Step S10214 If the IOU between any point cloud detection frame and the detection frame of the second target object is greater than a preset threshold, determine the image sample based on the candidate image.
  • In this way, by filtering the lidar multi-target tracking results to obtain image samples, the acquisition accuracy of the image samples can be improved, thereby improving the recognition accuracy of the trained re-identification model.
  • The candidate image collected by the camera can be obtained, the point cloud data collected synchronously with the candidate image for the same scene can be obtained, and the point cloud data can be detected to obtain at least one point cloud detection frame. Then the second detection result obtained by detecting the candidate image is used to filter the lidar multi-target tracking result. That is, after obtaining the second detection result, the intersection-over-union ratio IOU between the detection frame of the second target object in the second detection result and the point cloud detection frame is determined, and only candidate images with an IOU greater than a preset threshold (such as 0.7) are kept.
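Steps S10213 and S10214 can be sketched as follows, assuming axis-aligned detection frames given as (x1, y1, x2, y2); the function names are illustrative:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_candidate(image_boxes, cloud_boxes, threshold=0.7):
    """Keep a candidate image if any image detection frame overlaps
    any point cloud detection frame with IOU above the threshold."""
    return any(iou(a, b) > threshold
               for a in image_boxes for b in cloud_boxes)

# A well-fitted pair (high IOU) passes; a poorly fitted pair does not.
print(keep_candidate([(0, 0, 10, 10)], [(1, 1, 10, 10)]))   # → True
print(keep_candidate([(0, 0, 10, 10)], [(6, 6, 16, 16)]))   # → False
```

The 0.7 threshold follows the example in the text; tightening it keeps only samples whose lidar frames fit the object closely.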
  • The position information of the detection frame of the second target object in the image sample can also be determined based on the position information of the point cloud detection frame; that is, the ID of the detection frame of the second target object is set to the ID of the lidar point cloud detection frame. Afterwards, the part corresponding to the detection frame of the second target object is cropped out from the candidate image and classified according to the ID.
  • FIG. 10 is a schematic diagram of an image sample set provided by an embodiment of the present disclosure.
  • The target object M corresponds to multiple image samples, but the IDs of the image samples corresponding to the target object M are the same, all 001, that is, the annotation information is the same; the target object N corresponds to multiple image samples, but the IDs of the image samples corresponding to the target object N are the same, all 002, that is, the annotation information is the same.
  • Since the target object M is different from the target object N, the IDs of the image samples corresponding to the target object M and those corresponding to the target object N are different.
  • Determining the image sample based on the candidate image includes: cropping the partial image corresponding to the detection frame of the second target object from the candidate image to obtain the image sample.
  • In this way, the image sample is obtained by cropping the partial image corresponding to the detection frame of the second target object from the candidate image, which can reduce unnecessary noise introduced into the image sample.
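The cropping step can be sketched as follows, assuming a row-major 2-D image array and an (x1, y1, x2, y2) detection frame:

```python
def crop_detection(image, box):
    """Crop the part of the image inside the detection frame.

    `image` is a row-major 2-D array (list of rows) and `box` is
    (x1, y1, x2, y2) in pixel coordinates with the origin at top-left.
    """
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

# 4x4 toy "image" whose pixels encode their own (x, y) coordinates.
image = [[(x, y) for x in range(4)] for y in range(4)]
sample = crop_detection(image, (1, 1, 3, 3))
print(sample)  # → [[(1, 1), (2, 1)], [(1, 2), (2, 2)]]
```

With an actual image library the same operation is a single array slice, e.g. `image[y1:y2, x1:x2]` on a NumPy array.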
  • The embodiment of the present disclosure also provides a multi-target tracking device corresponding to the multi-target tracking method. Since the problem-solving principle of the device in the embodiment of the present disclosure is similar to that of the above-mentioned multi-target tracking method, the implementation of the device may refer to the implementation of the method.
  • FIG. 11 is a schematic diagram of a multi-target tracking device 500 provided by an embodiment of the present disclosure. The device is applied to an electronic device and includes:
  • the target detection module 501 is configured to perform target detection on the current frame image, and obtain a first detection result of at least one detected first target object.
  • the feature extraction module 502 is configured to extract an appearance feature vector of the first target object.
  • the similarity calculation module 503 is configured to calculate the similarity between the appearance feature vector of the first target object and the appearance feature vectors of each target object detected in at least one frame image before the current frame image.
  • The tracking result determining module 504 is configured to determine a target tracking result for the first target object based on the similarity; the target tracking result is configured to reflect the detection results of the first target object in the current frame image and the at least one frame of images.
  • the tracking result determining module 504 is specifically configured as:
  • The first detection result further includes at least one of: the detection frame information of the first target object, the type of the first target object, and the confidence level of the detection result of the first target object.
  • the device further includes a model training module 505, and the model training module 505 is configured to:
  • the image sample set includes a plurality of image samples and annotation information of the image samples, the annotation information is configured to indicate image samples corresponding to the same target object;
  • the re-identification model to be trained is trained to obtain the re-identification model.
  • the image sample is taken in a driving scene.
  • model training module 505 is specifically configured as:
  • the image sample is determined based on the candidate image when there is an IOU between any point cloud detection frame and the detection frame of the second target object that is greater than a preset threshold.
  • model training module 505 is specifically configured as:
  • the target detection module 501 is further configured to:
  • the current frame image is acquired from the video to be detected according to a preset time interval or frame number interval.
  • an embodiment of the present disclosure also provides an electronic device.
  • FIG. 13 is a schematic structural diagram of an electronic device 700 provided by an embodiment of the present disclosure; the electronic device includes a processor 701, a memory 702, and a bus 703.
  • The memory 702 is used to store execution instructions and includes an internal memory 7021 and an external memory 7022. The internal memory 7021, also called memory, is used to temporarily store calculation data in the processor 701 and data exchanged with an external memory 7022 such as a hard disk; the processor 701 exchanges data with the external memory 7022 through the internal memory 7021.
  • the memory 702 is specifically used to store application program codes for executing the solutions of the embodiments of the present disclosure, and the execution is controlled by the processor 701 . That is, when the electronic device 700 is running, the processor 701 communicates with the memory 702 through the bus 703, so that the processor 701 executes the application program code stored in the memory 702, and then executes the method described in any of the foregoing embodiments.
  • the memory 702 may be, but is not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read Only Memory, ROM), a programmable read-only memory (Programmable Read-Only Memory, PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), etc.
  • the processor 701 may be an integrated circuit chip with signal processing capability.
  • the above-mentioned processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the structure shown in the embodiment of the present disclosure does not constitute a specific limitation on the electronic device 700 .
  • the electronic device 700 may include more or fewer components than shown in the illustration, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • Embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the multi-target tracking method in the foregoing method embodiments are executed.
  • the storage medium may be a volatile computer-readable storage medium or a non-volatile computer-readable storage medium.
  • the embodiment of the present disclosure also provides a computer program product. The computer program product carries program code, and the instructions included in the program code can be used to execute the steps of the multi-target tracking method in the above method embodiments; for details, refer to the foregoing method embodiments.
  • the above computer program product may be specifically implemented by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium or a software product, such as a software development kit (Software Development Kit, SDK) and the like.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium executable by a processor.
  • the technical solutions of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disc.
  • Embodiments of the present disclosure provide a multi-target tracking method and apparatus, an electronic device, and a storage medium.
  • the multi-target tracking method includes: performing target detection on a current frame image to obtain a first detection result of at least one detected first target object; extracting an appearance feature vector of the first target object; calculating a similarity between the appearance feature vector of the first target object and appearance feature vectors of target objects detected in at least one frame image before the current frame image; and determining, based on the similarity, a target tracking result for the first target object, where the target tracking result is used to reflect detection results of the first target object in the current frame image and the multiple frames of images.
  • the embodiments of the present disclosure can improve the stability and precision of multi-target tracking.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Provided in the present disclosure are a multi-target tracking method and apparatus, and an electronic device and a storage medium. The multi-target tracking method comprises: performing target detection on the current frame of image, so as to obtain a first detection result of at least one detected first target object; extracting an appearance feature vector of the first target object; calculating the similarity between the appearance feature vector of the first target object and an appearance feature vector of each target object detected in at least one frame of image before the current frame of image; and determining a target tracking result for the first target object on the basis of the similarity, wherein the target tracking result is used for reflecting detection results of the first target object in the current frame of image and a plurality of frames of images. By means of the embodiments of the present disclosure, the stability and precision of multi-target tracking can be improved.

Description

Multi-target tracking method, apparatus, electronic device, storage medium and program
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 202111165457.4, filed on September 30, 2021 by Shanghai Shangtang Lingang Intelligent Technology Co., Ltd. and entitled "Multi-target tracking method, apparatus, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of image processing, and in particular to a multi-target tracking method and apparatus, an electronic device, a storage medium and a program.
Background
Multi-target tracking is a research hotspot in the field of computer vision. Multi-target tracking refers to using a computer to determine, in a video sequence, the position and size of each independent moving target of interest that has certain salient visual features, as well as the complete motion trajectory of each target. It has very wide applications in vehicle-mounted driver-assistance systems, the military field and the intelligent security field.
However, overlapping targets are frequently encountered in multi-target tracking tasks. After a tracked target overlaps with other targets, its tracking trajectory may be matched incorrectly, which leads to poor stability when tracking multiple targets.
Summary
Embodiments of the present disclosure provide at least a multi-target tracking method and apparatus, an electronic device, a storage medium and a program.
In a first aspect, an embodiment of the present disclosure provides a multi-target tracking method, applied to an electronic device, including:
performing target detection on a current frame image to obtain a first detection result of at least one detected first target object;
extracting an appearance feature vector of the first target object;
calculating a similarity between the appearance feature vector of the first target object and appearance feature vectors of target objects detected in at least one frame image before the current frame image;
determining, based on the similarity, a target tracking result for the first target object, where the target tracking result is used to reflect detection results of the first target object in the current frame image and the at least one frame image.
In the embodiments of the present disclosure, since the extracted appearance features better represent the identity information of the first target object, this richer feature information and the similarity can be used to stitch together the trajectory of a target that reappears after being occluded, and can also reduce the probability of unstable tracking caused by vehicle bumps, thereby yielding smoother multi-target tracking results and improving the stability of multi-target tracking.
In a second aspect, an embodiment of the present disclosure provides a multi-target tracking apparatus, applied to an electronic device, including:
a target detection module, configured to perform target detection on a current frame image to obtain a first detection result of at least one detected first target object;
a feature extraction module, configured to extract an appearance feature vector of the first target object;
a similarity calculation module, configured to calculate a similarity between the appearance feature vector of the first target object and appearance feature vectors of target objects detected in at least one frame image before the current frame image;
a tracking result determination module, configured to determine, based on the similarity, a target tracking result for the first target object, where the target tracking result is configured to reflect detection results of the first target object in the current frame image and the multiple frames of images.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor, a memory and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the multi-target tracking method described in the first aspect is performed.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored. When the computer program is run by a processor, the multi-target tracking method described in the first aspect is performed.
In a fifth aspect, an embodiment of the present disclosure provides a computer program including computer-readable code. When the computer-readable code runs in an electronic device, a processor in the electronic device executes steps for implementing any one of the above multi-target tracking methods.
To make the above objects, features and advantages of the present disclosure more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
FIG. 1 shows a schematic diagram of an execution subject of a multi-target tracking method provided by an embodiment of the present disclosure;
FIG. 2 shows a flowchart of a multi-target tracking method provided by an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a result of performing target detection on a current frame image provided by an embodiment of the present disclosure;
FIG. 4 shows a flowchart of a method for determining a target tracking result for a first target object provided by an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a tracking effect of a first target object provided by an embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of another tracking effect of a first target object provided by an embodiment of the present disclosure;
FIG. 7 shows a flowchart of a method for training a re-identification model provided by an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of a radar tracking result provided by an embodiment of the present disclosure;
FIG. 9 shows a flowchart of a method for acquiring image samples provided by an embodiment of the present disclosure;
FIG. 10 shows a schematic diagram of an image sample set provided by an embodiment of the present disclosure;
FIG. 11 shows a schematic structural diagram of a multi-target tracking apparatus provided by an embodiment of the present disclosure;
FIG. 12 shows a schematic structural diagram of another multi-target tracking apparatus provided by an embodiment of the present disclosure;
FIG. 13 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. The components of the embodiments of the present disclosure generally described and illustrated in the figures herein may be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The term "and/or" herein merely describes an association relationship and indicates that three relationships may exist. For example, A and/or B may indicate three cases: A exists alone, both A and B exist, and B exists alone. In addition, the term "at least one" herein indicates any one of multiple items or any combination of at least two of multiple items. For example, including at least one of A, B and C may indicate including any one or more elements selected from the set consisting of A, B and C.
With the frequent occurrence of traffic accidents, the safety problems they cause have received widespread attention from society. Facing the increasingly severe traffic safety situation, developing intelligent driver-assistance systems has become an urgent requirement of the current automotive industry, and the forward collision warning system is the most important part of such systems. Vehicle detection and tracking algorithms play a vital role in intelligent driver-assistance systems. For vehicles driving on expressways, the tracking results are required to be updated accurately in real time; however, traditional machine-vision methods often struggle to meet these requirements in terms of both speed and accuracy.
Owing to the development and application of convolutional neural networks (Convolutional Neural Networks, CNN), many tasks in the field of computer vision have advanced greatly. With the development of deep learning technology, multi-target tracking algorithms based on convolutional neural networks have achieved certain breakthroughs, giving them tracking accuracy far higher than that of traditional multi-target tracking methods.
However, research has found that frequent occlusions occur during multi-target tracking. When a target is occluded during tracking, the number of detected targets changes, and the trajectory of the occluded tracked target cannot be matched to a detected target in the current frame. It cannot be determined whether the trajectory has disappeared temporarily due to occlusion or has left the detection area and tracking should be stopped, so some occluded trajectories are terminated due to misjudgment. After the occlusion ends, the originally tracked target reappears in the detection area; if its original trajectory has already stopped being tracked, a new initial trajectory is generated for the target, causing the target identity to change. In addition, when the vehicle bumps, the distance between detection results of the same target becomes large, which in turn leads to low similarity, data association failure and target tracking failure.
The present disclosure provides a multi-target tracking method, including: performing target detection on a current frame image to obtain a first detection result of at least one detected first target object; extracting an appearance feature vector of the first target object; calculating a similarity between the appearance feature vector of the first target object and appearance feature vectors of target objects detected in at least one frame image before the current frame image; and determining, based on the similarity, a target tracking result for the first target object, where the target tracking result is used to reflect detection results of the first target object in the current frame image and the multiple frames of images.
In the embodiments of the present disclosure, since the extracted appearance features better represent the identity information of the first target object, this richer feature information can be used to stitch together the trajectory of a target that reappears after being occluded, and can also reduce the probability of unstable tracking caused by vehicle bumps, thereby yielding smoother multi-target tracking results and improving the stability of multi-target tracking.
The multi-target tracking method provided by the embodiments of the present disclosure can be applied to an automated driving system to track target objects within the field of view of a vehicle and accurately obtain tracking results of the objects, which helps developers design related tracking strategies and alarm strategies. The multi-target tracking method provided by the embodiments of the present disclosure is introduced in detail below.
Referring to FIG. 1, which is a schematic diagram of an execution subject of the multi-target tracking method provided by an embodiment of the present disclosure, the execution subject of the method is an electronic device, where the electronic device may include a terminal and a server. For example, the method may be applied to a terminal, which may be the terminal device shown in FIG. 1, including but not limited to a tablet computer, a notebook computer, a palmtop computer, a mobile phone, a voice interaction device, a personal computer (PC), a vehicle and a robot, which is not limited here.
The voice interaction device includes but is not limited to a smart speaker, a smart home appliance and the like. The method may also be applied to a server, or to an implementation environment composed of a terminal and a server. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud storage, big data and artificial intelligence platforms.
It should be understood that, in some implementations, the server may communicate with the terminal through a network. The network may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
In addition, the multi-target tracking method may also be implemented by software running on a terminal or a server, for example, by an application program having a multi-target tracking function. In some possible implementations, the multi-target tracking method may be implemented by a processor invoking computer-readable instructions stored in a memory.
Referring to FIG. 2, which is a flowchart of the multi-target tracking method provided by an embodiment of the present disclosure, the multi-target tracking method is applied to an electronic device and executed by the electronic device, and may include the following steps S101 to S104:
Step S101: perform target detection on a current frame image to obtain a first detection result of at least one detected first target object.
In an example, in an image (for example, each frame image in a video), a closed region distinguished from the surrounding environment is often called a target. The process of giving the position of a target in an image is called detection. For example, a trained target detection model (or target detection network) can be used to detect the positions and category information of multiple tracked targets in the current frame image.
In some implementations, a target detection technique may also be used to detect targets. The target detection technique may use a variety of methods, such as the frame difference method, background subtraction, optical flow, or directional gradient features, and the initial position of a target may also be annotated manually. Of course, any other suitable technique for target detection may be used, which is not limited here, as long as target detection can be achieved.
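The frame difference method mentioned above can be sketched as follows. This is a minimal, illustrative example and not part of the disclosed embodiments; the function names, the threshold value and the synthetic frames are assumptions chosen purely for the sketch:

```python
import numpy as np

def frame_difference_detect(prev_frame: np.ndarray, curr_frame: np.ndarray,
                            threshold: int = 25) -> np.ndarray:
    """Return a binary mask of pixels that changed between two grayscale frames."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

def bounding_box(mask: np.ndarray):
    """Return (x_min, y_min, x_max, y_max) of the changed region, or None if empty."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Synthetic example: a 10x10 bright square "moves" into view on a dark background.
prev = np.zeros((100, 100), dtype=np.uint8)
curr = prev.copy()
curr[40:50, 60:70] = 200  # the moving target appears here

mask = frame_difference_detect(prev, curr)
print(bounding_box(mask))  # → (60, 40, 69, 49)
```

In practice this simple differencing is sensitive to camera motion, which is one reason the embodiments prefer a trained detection model.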
Referring to FIG. 3, which is a schematic diagram of a result of performing target detection on the current frame image provided in an embodiment of the present disclosure: in an example, the current frame image T may be input into the target detection model for target detection to obtain a first detection result of at least one detected first target object 10. In this implementation, the first detection result includes three first target objects 10, and the first target objects are vehicles. In other implementations, the number of first target objects 10 may also be 2 or 4, and the type of the first target object 10 may also be another type (such as a pedestrian), which is not limited here.
In some implementations, the first detection result further includes at least one of detection frame information of the first target object 10 (as shown by A in FIG. 3), the type of the first target object 10, and the confidence of the detection result of the first target object 10.
In the embodiments of the present disclosure, since the first detection result further includes at least one of the detection frame information of the first target object, the type of the first target object and the confidence of the detection result of the first target object, the content output in the first detection result is richer, which provides more basis for the subsequent similarity calculation and can thus improve the accuracy of the similarity calculation.
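As an illustrative sketch of the richer first detection result described above (detection frame information, type and confidence), one possible data structure is shown below; the field names, coordinate convention and sample values are hypothetical and not taken from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class DetectionResult:
    """One first detection result, as described in the embodiment:
    detection frame coordinates, object category, and detection confidence."""
    box: tuple          # (x_min, y_min, x_max, y_max) in pixels
    category: str       # e.g. "vehicle" or "pedestrian"
    confidence: float   # detector score in [0, 1]

# Three detected vehicles, matching the example of FIG. 3 (values are made up).
detections = [
    DetectionResult((60, 40, 120, 90), "vehicle", 0.97),
    DetectionResult((200, 50, 260, 100), "vehicle", 0.91),
    DetectionResult((330, 45, 395, 95), "vehicle", 0.88),
]
print(len(detections), detections[0].category)  # → 3 vehicle
```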
It should be understood that before target detection is performed on the current frame image, the video to be detected needs to be acquired, and then the current frame image is acquired from the video to be detected according to a preset time interval or frame number interval. The video to be detected is a video or a sequence of video frames to be detected; for example, it may be a video or a video stream of a certain number of frames.
In some implementations, after acquiring the video to be detected, the target detection model may acquire multiple frames of images to be detected from the video at intervals. For example, the video to be detected includes M frames of images to be detected, and the target detection model acquires at least one frame of the image to be detected from the M frames according to a preset time interval or every N frames.
In the embodiments of the present disclosure, after acquiring the video to be detected, the electronic device acquires multiple frames of images to be detected from the video at intervals, which can improve the processing speed of multi-target tracking in the video to be detected and increase the number of channels of videos to be detected that can be processed.
It should be understood that the frame rate of the video to be detected is generally more than 25 frames per second. If an electronic device (such as a server) performs detection on every frame of the image to be detected, the amount of computation is too large, which overloads the server and affects the processing speed of multi-target tracking and the number of channels of videos to be detected that can be accessed. In this embodiment, after acquiring the video to be detected, the electronic device acquires multiple frames of images to be detected from the video at intervals, which can improve the processing speed of multi-target tracking in the video to be detected and increase the number of channels of videos to be detected that can be processed.
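The interval-based frame sampling described above can be sketched as follows; the helper name and the example interval are illustrative assumptions, not part of the disclosure:

```python
def sample_frames(num_frames: int, frame_interval: int):
    """Return the indices of frames selected from a video,
    taking one frame every `frame_interval` frames."""
    return list(range(0, num_frames, frame_interval))

# A 25 fps stream processed every 5th frame reduces the per-second
# detection workload from 25 images to 5 images.
print(sample_frames(25, 5))  # → [0, 5, 10, 15, 20]
```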
Step S102: extract an appearance feature vector of the first target object.
In an example, the image to be detected (the current frame image) obtained above may be input into a pre-trained re-identification model to extract appearance features of the first target object, obtaining an appearance feature vector. In this implementation, the extracted appearance features are all 64-dimensional vectors. The training method of the re-identification model will be described in detail later. Of course, other methods may also be used to extract the appearance feature vector of the first target object, which is not limited here.
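The appearance feature extraction step can be sketched as follows. The embodiment uses a trained re-identification model; here a fixed random projection stands in for that model purely to illustrate producing a 64-dimensional, L2-normalised vector per detected target, so the function names and the projection itself are assumptions:

```python
import numpy as np

EMBED_DIM = 64  # the embodiment above extracts 64-dimensional appearance features

def crop_detection(image: np.ndarray, box: tuple) -> np.ndarray:
    """Crop a detected target from the frame; box = (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]

def embed_appearance(crop: np.ndarray) -> np.ndarray:
    """Stand-in for the trained re-identification model: map a crop to a
    64-dimensional, L2-normalised appearance feature vector. A fixed random
    projection of the flattened crop is used purely as a placeholder."""
    rng = np.random.default_rng(0)  # fixed seed, so the "weights" are shared across calls
    flat = np.resize(crop.astype(np.float32) / 255.0, 256)  # flatten/pad to 256 values
    w = rng.standard_normal((EMBED_DIM, 256))
    vec = w @ flat
    return vec / (np.linalg.norm(vec) + 1e-12)

frame = np.random.default_rng(1).integers(0, 255, (100, 100), dtype=np.uint8)
feature = embed_appearance(crop_detection(frame, (60, 40, 70, 50)))
print(feature.shape)  # → (64,)
```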
Step S103: Calculate the similarity between the appearance feature vector of the first target object and the appearance feature vector of each target object detected in at least one frame image preceding the current frame image.
In an example, the three first target objects 10 in FIG. 3 are used for illustration. To show the similarity calculation process clearly, in this embodiment, the three first target objects 10 detected in the current frame image (also called the subsequent frame image in the table) are assigned the codes D0, D1 and D2, respectively. For ease of description, one frame (also called the previous frame image in the table) is taken from the multiple frames preceding the current frame as an example for the similarity calculation, and the target objects in that previous frame image are assigned the codes Q0, Q1 and Q2, respectively. The similarity calculation results are shown in Table 1 below:
Table 1
Figure PCTCN2022075415-appb-000001
As can be seen from the table above, after the similarity calculation, the first target object D0 in the current frame image has the highest similarity with the target object Q0 in the previous frame image, with a calculated similarity of 0.82; the first target object D1 in the current frame image has the highest similarity with the target object Q1 in the previous frame image, with a calculated similarity of 0.91; and the first target object D2 in the current frame image has the highest similarity with the target object Q2 in the previous frame image, with a calculated similarity of 0.92.
It should be understood that, for brevity, only two frames of images are used for illustration. In actual operation, to reduce the impact of abrupt environmental changes on the appearance features, a cache may be used to store the appearance features of the same object over the most recent multiple frames (for example, 100 frames), and the maximum similarity may be used as the calculation result, which improves the reliability of target tracking. That is, a caching strategy stores the appearance features of the same object over the most recent frames; this strategy reduces the probability of similarity jumps caused by abrupt appearance changes, thereby yielding better multi-target tracking results.
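A minimal sketch of this caching strategy follows. The cosine-similarity metric, the 2-dimensional toy vectors, and the class layout are assumptions for illustration; the disclosure does not fix a specific similarity measure, only that the maximum similarity over the cached recent frames is used.

```python
import collections
import math

def cosine_similarity(a, b):
    """Cosine similarity between two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class FeatureCache:
    """Keep the appearance features of each tracked object over the most
    recent frames, and score a new detection by the MAXIMUM similarity
    against any cached feature of that object, so that one abrupt
    appearance change does not cause a similarity jump."""

    def __init__(self, max_frames=100):
        self._cache = collections.defaultdict(
            lambda: collections.deque(maxlen=max_frames))

    def add(self, track_id, feature):
        self._cache[track_id].append(feature)

    def max_similarity(self, track_id, feature):
        feats = self._cache[track_id]
        if not feats:
            return 0.0
        return max(cosine_similarity(f, feature) for f in feats)

cache = FeatureCache(max_frames=100)
cache.add("Q0", [1.0, 0.0])  # older cached appearance
cache.add("Q0", [0.6, 0.8])  # appearance after a lighting change
# The new detection matches the second cached feature exactly, so the
# maximum over the cache is taken rather than the (lower) first match.
score = cache.max_similarity("Q0", [0.6, 0.8])
```

The bounded `deque` automatically evicts features older than `max_frames`, matching the "most recent multiple frames" behavior described above.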
Step S104: Based on the similarity, determine a target tracking result for the first target object; the target tracking result is used to reflect the detection results of the first target object in the current frame image and in the at least one frame image.
In some implementations, referring to FIG. 4, which is a flowchart of a method for determining a target tracking result for the first target object provided by an embodiment of the present disclosure, determining the target tracking result for the first target object based on the similarity may include the following steps S1041 to S1042:
Step S1041: Based on the similarity, match the first detection result with the detection result of each target object in the at least one frame image, and determine the detection result of the first target object in the at least one frame image that matches the first target object.
In an example, matching, also called data association, is a typical processing method frequently used in multi-target tracking tasks to solve the matching problem between targets. In the embodiments of the present disclosure, the Hungarian algorithm may be used to match the first detection result with the detection result of each target object in the at least one frame image, which improves matching accuracy.
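The association step above can be sketched on the 3x3 example from Table 1. Note the hedges: for clarity the sketch enumerates all permutations, whereas the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment` on a negated similarity matrix) solves the same problem in polynomial time; the off-diagonal similarity values below are made up for illustration, only the diagonal follows Table 1.

```python
import itertools

def best_assignment(similarity):
    """Find the one-to-one assignment of detections to tracks that
    maximizes total similarity, by exhaustive enumeration.

    Illustration only: on large inputs the Hungarian algorithm solves
    this assignment problem in polynomial time instead.
    """
    n = len(similarity)
    best_perm, best_score = None, float("-inf")
    for perm in itertools.permutations(range(n)):
        score = sum(similarity[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm  # best_perm[i] = track index matched to detection i

# Rows: detections D0..D2 in the current frame; columns: tracks Q0..Q2.
sim = [
    [0.82, 0.10, 0.05],  # D0
    [0.15, 0.91, 0.20],  # D1
    [0.05, 0.12, 0.92],  # D2
]
assignment = best_assignment(sim)  # expected: D0->Q0, D1->Q1, D2->Q2
```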
Step S1042: Determine the target tracking result for the first target object according to the detection result of the first target object in the at least one frame image and the first detection result.
In an example, referring to FIG. 5 and FIG. 6 together, in the embodiments of the present disclosure, because the appearance feature vector of the first target object 10 is extracted based on the pre-trained re-identification model, the extracted appearance features better represent the identity information of the first target object 10, wherein the identifier of the detection frame of one first target object 10 is id1, and the identifier of the detection frame of another first target object 10 is id2.
As can be seen in FIG. 5 and FIG. 6, even when occlusion occurs, the identity information of the different first target objects can still be recognized well. Using this better feature information can not only handle the track-reconnection problem when a first target object 10 reappears after being occluded, but also reduce the probability of tracking instability caused by vehicle jolting, thereby producing smoother multi-target tracking results and improving the stability of multi-target tracking.
In the embodiments of the present disclosure, by matching the first detection result one-to-one with the detection result of each target object in the at least one frame image, not only can the tracking result of the first target object be obtained, but the accuracy of determining the tracking result can also be improved.
In some implementations, referring to FIG. 7, which is a flowchart of a method for training a re-identification model provided by an embodiment of the present disclosure, the re-identification model mentioned above may be obtained by training as follows. Specifically, training the re-identification model may include the following steps S1021 to S1023:
Step S1021: Obtain an image sample set, where the image sample set includes a plurality of image samples and annotation information of the image samples, and the annotation information is used to indicate image samples corresponding to the same target object.
It should be understood that because most existing images are captured from the overhead perspective of surveillance cameras, the captured images do not match autonomous driving scenes. Therefore, to realize the re-identification capability of the re-identification model, in some implementations, the image samples are captured in an autonomous driving scene.
In the embodiments of the present disclosure, because the image samples are captured in a driving scene, the trained re-identification model can adapt well to driving scenes, which improves the recognition accuracy of the model and its adaptability in driving scenes.
Step S1022: Train the re-identification model to be trained based on the image sample set to obtain the re-identification model.
In the embodiments of the present disclosure, the re-identification model to be trained is trained based on the image sample set to obtain the re-identification model, and the re-identification model is used to extract the appearance feature vector of the first target object, which improves the extraction accuracy of the appearance feature vector of the first target object.
In some implementations, the base network to be trained may be determined according to specific requirements. In this embodiment, a depthwise separable convolutional network (such as the MobileNetV2 network) is selected as the backbone network; of course, a lightweight convolutional neural network such as the MobileNetV1 network may also be selected, which is not limited here. By adopting a lightweight convolutional neural network as the base training network, the recognition efficiency of the trained re-identification model can be improved, with better real-time performance.
In an example, after the base network is determined, the image samples in the image sample set may be separately input into the base network for feature extraction, and the base network is then trained according to the classification results and a loss function to obtain the re-identification model. The specific model training method is similar to existing model training methods.
It should be understood that, to improve the recognition accuracy of the re-identification model, the tracking results of a lidar may be used to build the image sample set. Referring to FIG. 8, multiple detection frames can be obtained through lidar tracking, such as point cloud detection frame 1, point cloud detection frame 2, point cloud detection frame 17, point cloud detection frame 18 and point cloud detection frame 19. However, as shown in FIG. 8, some point cloud detection frames (such as point cloud detection frame 19 or point cloud detection frame 17) may be occluded, and the lidar tracking frame may not fit the target object (such as a vehicle) closely, whereas the image samples in the image sample set should not contain occluded objects and require detection frames that fit the target object closely; otherwise, unnecessary noise will be introduced and the training results of the re-identification model will deteriorate.
In some implementations, referring to FIG. 9, the method for obtaining the image sample set includes the following steps S10211 to S10214:
Step S10211: Obtain a candidate image captured by a camera, and perform target detection on the candidate image to obtain a second detection result indicating at least one detected second target object.
Step S10212: Obtain point cloud data collected synchronously with the candidate image for the same scene, and detect the point cloud data to obtain at least one point cloud detection frame.
Step S10213: Determine the intersection over union (IOU) between the detection frame of the second target object in the second detection result and the point cloud detection frame.
Step S10214: In a case where the IOU between any point cloud detection frame and the detection frame of the second target object is greater than a preset threshold, determine the image sample based on the candidate image.
In the embodiments of the present disclosure, by combining the candidate image captured by the camera with the point cloud data collected by the lidar, that is, by using the second detection result obtained by detecting the candidate image to filter the lidar multi-target tracking results to obtain the image samples, the acquisition accuracy of the image samples can be improved, thereby improving the recognition accuracy of the trained re-identification model.
In an example, a candidate image captured by the camera may be obtained, point cloud data collected synchronously with the candidate image for the same scene may be obtained, and the point cloud data may be detected to obtain at least one point cloud detection frame; then the second detection result obtained by detecting the candidate image is used to filter the lidar multi-target tracking results. That is, after the second detection result is obtained, the IOU between the detection frame of the second target object in the second detection result and the point cloud detection frame is determined, and only candidate images whose IOU is greater than a preset threshold (such as 0.7) are retained.
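The IOU filtering in steps S10213 and S10214 can be sketched as follows. The sketch assumes the point cloud detection frames have already been projected into the image plane and that boxes use the (x1, y1, x2, y2) corner format; both are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def keep_candidate(camera_box, cloud_boxes, threshold=0.7):
    """Retain the candidate image only if some point cloud detection frame
    overlaps the camera detection frame with IOU above the threshold."""
    return any(iou(camera_box, b) > threshold for b in cloud_boxes)

# One point cloud frame coincides with the camera frame, so the
# candidate passes the 0.7 threshold and is kept.
keep = keep_candidate((0, 0, 10, 10), [(0, 0, 10, 10), (20, 20, 30, 30)])
```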
In addition, the position information of the detection frame of the second target object in the image sample may also be determined based on the position information of the point cloud detection frame; that is, the ID of the detection frame of the second target object is set to the ID of the lidar point cloud detection frame. Afterwards, the part corresponding to the detection frame of the second target object is cropped from the candidate image and classified according to the ID.
In an example, referring to FIG. 10, which is a schematic diagram of an image sample set provided by an embodiment of the present disclosure, the target object M corresponds to multiple image samples, but the image samples corresponding to the target object M share the same ID, 001, that is, the same annotation information. Similarly, the target object N corresponds to multiple image samples, and the image samples corresponding to the target object N share the same ID, 002, that is, the same annotation information. However, because the target object M and the target object N are different objects, the IDs of the image samples corresponding to the target object M differ from those corresponding to the target object N.
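The grouping of cropped samples by their inherited lidar track ID can be sketched as follows; the file names are hypothetical placeholders, and the IDs 001 and 002 follow the FIG. 10 example.

```python
from collections import defaultdict

def group_samples_by_id(crops):
    """Group cropped image samples by the track ID inherited from the
    lidar point cloud detection frame, so that all crops of one
    physical object share one label (annotation information)."""
    groups = defaultdict(list)
    for track_id, crop in crops:
        groups[track_id].append(crop)
    return dict(groups)

# Two crops of object M (ID 001) and one crop of object N (ID 002);
# the file names are made up for illustration.
dataset = group_samples_by_id([
    ("001", "M_frame12.png"),
    ("001", "M_frame37.png"),
    ("002", "N_frame12.png"),
])
```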
It should be understood that, in the method steps of the above implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
In some implementations, determining the image sample based on the candidate image includes: cropping, from the candidate image, the partial image corresponding to the detection frame of the second target object to obtain the image sample.
In the embodiments of the present disclosure, the image sample is obtained by cropping the partial image corresponding to the detection frame of the second target object from the candidate image, which reduces the unnecessary noise introduced into the image sample.
Based on the same technical concept, the embodiments of the present disclosure further provide a multi-target tracking apparatus corresponding to the multi-target tracking method. Because the problem-solving principle of the apparatus in the embodiments of the present disclosure is similar to that of the above multi-target tracking method, the implementation of the apparatus may refer to the implementation of the method.
Referring to FIG. 11, which is a schematic diagram of a multi-target tracking apparatus 500 provided by an embodiment of the present disclosure, the apparatus is applied to an electronic device and includes:
a target detection module 501, configured to perform target detection on a current frame image to obtain a first detection result of at least one detected first target object;
a feature extraction module 502, configured to extract an appearance feature vector of the first target object;
a similarity calculation module 503, configured to calculate the similarity between the appearance feature vector of the first target object and the appearance feature vector of each target object detected in at least one frame image preceding the current frame image; and
a tracking result determination module 504, configured to determine, based on the similarity, a target tracking result for the first target object, where the target tracking result is configured to reflect the detection results of the first target object in the current frame image and in the at least one frame image.
In a possible implementation, the tracking result determination module 504 is specifically configured to:
based on the similarity, match the first detection result with the detection result of each target object in the at least one frame image, and determine the detection result of the first target object in the at least one frame image that matches the first target object; and
determine the target tracking result for the first target object according to the detection result of the first target object in the at least one frame image and the first detection result.
In a possible implementation, the first detection result further includes at least one of detection frame information of the first target object, a category of the first target object, and a confidence of the detection result of the first target object.
In a possible implementation, referring to FIG. 12, the apparatus further includes a model training module 505, and the model training module 505 is configured to:
obtain an image sample set, where the image sample set includes a plurality of image samples and annotation information of the image samples, and the annotation information is configured to indicate image samples corresponding to the same target object; and
train the re-identification model to be trained based on the image sample set to obtain the re-identification model.
In a possible implementation, the image samples are captured in a driving scene.
In a possible implementation, the model training module 505 is specifically configured to:
obtain a candidate image captured by a camera, and perform target detection on the candidate image to obtain a second detection result indicating at least one detected second target object;
obtain point cloud data collected synchronously with the candidate image for the same scene, and detect the point cloud data to obtain at least one point cloud detection frame;
determine the intersection over union (IOU) between the detection frame of the second target object in the second detection result and the point cloud detection frame; and
in a case where the IOU between any point cloud detection frame and the detection frame of the second target object is greater than a preset threshold, determine the image sample based on the candidate image.
In a possible implementation, the model training module 505 is specifically configured to:
crop, from the candidate image, the partial image corresponding to the detection frame of the second target object to obtain the image sample.
In a possible implementation, the target detection module 501 is further configured to:
acquire the current frame image from the video to be detected at a preset time interval or frame number interval.
For descriptions of the processing flow of each module in the apparatus and the interaction flow between the modules, reference may be made to the relevant descriptions in the above method embodiments, and details are not repeated here.
Based on the same technical concept, the embodiments of the present disclosure further provide an electronic device. Referring to FIG. 13, which is a schematic structural diagram of an electronic device 700 provided by an embodiment of the present disclosure, the electronic device includes a processor 701, a memory 702 and a bus 703. The memory 702 is used to store execution instructions and includes an internal memory 7021 and an external memory 7022; the internal memory 7021, also called internal storage, is used to temporarily store operation data in the processor 701 and data exchanged with the external memory 7022 such as a hard disk, and the processor 701 exchanges data with the external memory 7022 through the internal memory 7021.
In the embodiments of the present disclosure, the memory 702 is specifically used to store the application program code for executing the solutions of the embodiments of the present disclosure, and execution is controlled by the processor 701. That is, when the electronic device 700 runs, the processor 701 communicates with the memory 702 through the bus 703, so that the processor 701 executes the application program code stored in the memory 702, thereby executing the method described in any of the foregoing embodiments.
The memory 702 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or the like.
The processor 701 may be an integrated circuit chip with signal processing capability. The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
It should be understood that the structure shown in the embodiments of the present disclosure does not constitute a specific limitation on the electronic device 700. In other embodiments of the present disclosure, the electronic device 700 may include more or fewer components than shown, or combine certain components, or split certain components, or have a different component arrangement. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored, where the computer program, when run by a processor, executes the steps of the multi-target tracking method in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure further provide a computer program product carrying program code, where the instructions included in the program code may be used to execute the steps of the multi-target tracking method in the above method embodiments; for details, refer to the above method embodiments.
In a possible implementation, the above computer program product may be specifically implemented by hardware, software, or a combination thereof. For example, the computer program product is embodied as a computer storage medium or a software product, such as a software development kit (SDK).
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system and apparatus described above, reference may be made to the corresponding processes in the foregoing method embodiments. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other division manners in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, apparatuses or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit the technical solutions of the present disclosure, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person familiar with the technical field may, within the technical scope disclosed by the present disclosure, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features therein; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure and shall all be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Industrial Applicability
Embodiments of the present disclosure provide a multi-target tracking method and apparatus, an electronic device, and a storage medium. The multi-target tracking method includes: performing target detection on a current frame image to obtain a first detection result of at least one detected first target object; extracting an appearance feature vector of the first target object; calculating the similarity between the appearance feature vector of the first target object and the appearance feature vectors of target objects detected in at least one frame image preceding the current frame image; and determining, based on the similarity, a target tracking result for the first target object, where the target tracking result reflects the detection results of the first target object in the current frame image and the aforementioned frames. Embodiments of the present disclosure can improve the stability and precision of multi-target tracking.
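As a non-authoritative sketch (not part of the disclosure; `detect`, `extract`, and `match` are illustrative stand-ins for the detector, re-identification model, and similarity matcher), one frame of the summarized pipeline could look like:

```python
def track_frame(frame, detect, extract, match, tracks):
    """One step of the summarized pipeline: detect objects in the current
    frame, extract an appearance feature vector per detection, and
    associate each detection with earlier tracks by feature similarity.

    `tracks` maps track id -> last known appearance feature vector."""
    results = []
    for detection in detect(frame):           # first detection results
        feature = extract(frame, detection)   # appearance feature vector
        track_id = match(feature, tracks)     # similarity-based association
        if track_id is None:                  # no match: start a new track
            track_id = max(tracks, default=0) + 1
        tracks[track_id] = feature            # keep the newest feature
        results.append((track_id, detection))
    return results
```

Unmatched detections open new tracks, so the same function covers both continued and newly appearing objects.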

Claims (11)

  1. A multi-target tracking method, applied to an electronic device, comprising:
    performing target detection on a current frame image to obtain a first detection result of at least one detected first target object;
    extracting an appearance feature vector of the first target object;
    calculating the similarity between the appearance feature vector of the first target object and the appearance feature vectors of target objects detected in at least one frame image preceding the current frame image; and
    determining, based on the similarity, a target tracking result for the first target object, wherein the target tracking result reflects the detection results of the first target object in the current frame image and the at least one frame image.
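The claim leaves the similarity measure open; a common choice for comparing appearance feature vectors is cosine similarity. A minimal illustrative sketch (not from the disclosure):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two appearance feature vectors:
    1.0 for identical directions, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```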
  2. The method according to claim 1, wherein determining the target tracking result for the first target object based on the similarity comprises:
    matching, based on the similarity, the first detection result with the detection results of the target objects in the at least one frame image, and determining the detection result in the at least one frame image that matches the first target object; and
    determining the target tracking result for the first target object according to the detection result of the first target object in the at least one frame image and the first detection result.
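The matching step in claim 2 amounts to an assignment problem between current detections and detections from earlier frames. The disclosure does not specify the matcher; one simple greedy association over a precomputed similarity matrix (illustrative only) could be:

```python
def greedy_match(similarity_matrix, threshold=0.5):
    """Greedily associate current detections (rows) with detections from
    earlier frames (columns), highest similarity first.

    Returns a dict {row_index: column_index} for pairs whose similarity
    is at least `threshold`; each row and column is used at most once."""
    pairs = sorted(
        ((s, r, c)
         for r, row in enumerate(similarity_matrix)
         for c, s in enumerate(row)),
        reverse=True)
    matched, used_rows, used_cols = {}, set(), set()
    for s, r, c in pairs:
        if s < threshold or r in used_rows or c in used_cols:
            continue
        matched[r] = c
        used_rows.add(r)
        used_cols.add(c)
    return matched
```

An optimal alternative to the greedy pass would be the Hungarian algorithm, but the greedy version is enough to show the shape of the step.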
  3. The method according to claim 1 or 2, wherein the first detection result further includes at least one of: detection frame information of the first target object, a type of the first target object, and a confidence of the detection result of the first target object.
  4. The method according to any one of claims 1 to 3, wherein a re-identification model is used to extract the appearance feature vector of the first target object, the re-identification model being trained as follows:
    obtaining an image sample set, the image sample set including a plurality of image samples and annotation information of the image samples, the annotation information indicating image samples corresponding to a same target object; and
    training, based on the image sample set, a re-identification model to be trained, to obtain the re-identification model.
  5. The method according to claim 4, wherein the image samples are captured in a driving scene.
  6. The method according to claim 4 or 5, wherein the image samples are obtained according to the following steps:
    acquiring a candidate image collected by a camera, and performing target detection on the candidate image to obtain a second detection result indicating at least one detected second target object;
    acquiring point cloud data collected, synchronously with the candidate image, for the same scene, and detecting the point cloud data to obtain at least one point cloud detection frame;
    determining an intersection-over-union (IOU) between the detection frame of the second target object in the second detection result and the point cloud detection frame; and
    determining the image sample based on the candidate image in a case where the IOU between any point cloud detection frame and the detection frame of the second target object is greater than a preset threshold.
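For axis-aligned 2D boxes in (x1, y1, x2, y2) form, the intersection-over-union in claim 6 is conventionally computed as below (illustrative sketch, assuming the point cloud detection frame has already been projected onto the image plane):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)  # 0 when boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A candidate image would be kept whenever `iou(...)` for some pair of frames exceeds the preset threshold.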
  7. The method according to claim 6, wherein determining the image sample based on the candidate image comprises:
    cropping, from the candidate image, the partial image corresponding to the detection frame of the second target object, to obtain the image sample.
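For an image stored as rows of pixels and an integer (x1, y1, x2, y2) box, the cropping step of claim 7 reduces to slicing (illustrative sketch, not from the disclosure):

```python
def crop_detection(image, box):
    """Return the sub-image covered by a detection box (x1, y1, x2, y2),
    where `image` is a list of pixel rows indexed as image[y][x]."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]
```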
  8. The method according to any one of claims 1 to 7, wherein, before performing target detection on the current frame image based on a target detection model, the method further comprises:
    acquiring the current frame image from a video to be detected at a preset time interval or frame-number interval.
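Sampling by frame-number interval, as in claim 8, is plain stride slicing over the frame sequence (illustrative):

```python
def sample_frames(frames, step):
    """Keep every `step`-th frame of the video to be detected."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return frames[::step]
```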
  9. A multi-target tracking apparatus, applied to an electronic device, comprising:
    a target detection module, configured to perform target detection on a current frame image to obtain a first detection result of at least one detected first target object;
    a feature extraction module, configured to extract an appearance feature vector of the first target object;
    a similarity calculation module, configured to calculate the similarity between the appearance feature vector of the first target object and the appearance feature vectors of target objects detected in at least one frame image preceding the current frame image; and
    a tracking result determination module, configured to determine, based on the similarity, a target tracking result for the first target object, wherein the target tracking result reflects the detection results of the first target object in the current frame image and the at least one frame image.
  10. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor communicates with the memory via the bus, and the machine-readable instructions, when executed by the processor, perform the multi-target tracking method according to any one of claims 1 to 8.
  11. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when run by a processor, performs the multi-target tracking method according to any one of claims 1 to 8.
PCT/CN2022/075415 2021-09-30 2022-02-07 Multi-target tracking method and apparatus, and electronic device, storage medium and program WO2023050678A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111165457.4A CN113822910A (en) 2021-09-30 2021-09-30 Multi-target tracking method and device, electronic equipment and storage medium
CN202111165457.4 2021-09-30

Publications (1)

Publication Number Publication Date
WO2023050678A1 true WO2023050678A1 (en) 2023-04-06

Family

ID=78919964

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/075415 WO2023050678A1 (en) 2021-09-30 2022-02-07 Multi-target tracking method and apparatus, and electronic device, storage medium and program

Country Status (2)

Country Link
CN (1) CN113822910A (en)
WO (1) WO2023050678A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665177A (en) * 2023-07-31 2023-08-29 福思(杭州)智能科技有限公司 Data processing method, device, electronic device and storage medium
CN116935446A (en) * 2023-09-12 2023-10-24 深圳须弥云图空间科技有限公司 Pedestrian re-recognition method and device, electronic equipment and storage medium
CN116958203A (en) * 2023-08-01 2023-10-27 北京知存科技有限公司 Image processing method and device, electronic equipment and storage medium
CN117557599A (en) * 2024-01-12 2024-02-13 上海仙工智能科技有限公司 3D moving object tracking method and system and storage medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822910A (en) * 2021-09-30 2021-12-21 上海商汤临港智能科技有限公司 Multi-target tracking method and device, electronic equipment and storage medium
CN114549584A (en) * 2022-01-28 2022-05-27 北京百度网讯科技有限公司 Information processing method and device, electronic equipment and storage medium
TWI790957B (en) * 2022-04-06 2023-01-21 淡江大學學校財團法人淡江大學 A high-speed data association method for multi-object tracking
CN114937265B (en) * 2022-07-25 2022-10-28 深圳市商汤科技有限公司 Point cloud detection method, model training method, device, equipment and storage medium
CN116052062B (en) * 2023-03-07 2023-06-16 深圳爱莫科技有限公司 Robust tobacco display image processing method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038423A (en) * 2017-04-20 2017-08-11 常州智行科技有限公司 Real-time vehicle detection and tracking method
CN109325967A (en) * 2018-09-14 2019-02-12 腾讯科技(深圳)有限公司 Method for tracking target, device, medium and equipment
CN109859245A (en) * 2019-01-22 2019-06-07 深圳大学 Multi-object tracking method and apparatus for video objects, and storage medium
CN113822910A (en) * 2021-09-30 2021-12-21 上海商汤临港智能科技有限公司 Multi-target tracking method and device, electronic equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945198B (en) * 2016-10-13 2021-02-23 北京百度网讯科技有限公司 Method and device for marking point cloud data
CN111291714A (en) * 2020-02-27 2020-06-16 同济大学 Vehicle detection method based on monocular vision and laser radar fusion
CN111369590A (en) * 2020-02-27 2020-07-03 北京三快在线科技有限公司 Multi-target tracking method and device, storage medium and electronic equipment
CN112102364A (en) * 2020-09-22 2020-12-18 广州华多网络科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN112561963A (en) * 2020-12-18 2021-03-26 北京百度网讯科技有限公司 Target tracking method and device, road side equipment and storage medium
CN113158909B (en) * 2021-04-25 2023-06-27 中国科学院自动化研究所 Behavior recognition light-weight method, system and equipment based on multi-target tracking
CN113449632B (en) * 2021-06-28 2023-04-07 重庆长安汽车股份有限公司 Vision and radar perception algorithm optimization method and system based on fusion perception and automobile

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665177A (en) * 2023-07-31 2023-08-29 福思(杭州)智能科技有限公司 Data processing method, device, electronic device and storage medium
CN116665177B (en) * 2023-07-31 2023-10-13 福思(杭州)智能科技有限公司 Data processing method, device, electronic device and storage medium
CN116958203A (en) * 2023-08-01 2023-10-27 北京知存科技有限公司 Image processing method and device, electronic equipment and storage medium
CN116935446A (en) * 2023-09-12 2023-10-24 深圳须弥云图空间科技有限公司 Pedestrian re-recognition method and device, electronic equipment and storage medium
CN116935446B (en) * 2023-09-12 2024-02-20 深圳须弥云图空间科技有限公司 Pedestrian re-recognition method and device, electronic equipment and storage medium
CN117557599A (en) * 2024-01-12 2024-02-13 上海仙工智能科技有限公司 3D moving object tracking method and system and storage medium
CN117557599B (en) * 2024-01-12 2024-04-09 上海仙工智能科技有限公司 3D moving object tracking method and system and storage medium

Also Published As

Publication number Publication date
CN113822910A (en) 2021-12-21

Similar Documents

Publication Publication Date Title
WO2023050678A1 (en) Multi-target tracking method and apparatus, and electronic device, storage medium and program
US10360436B2 (en) Object detection device
Son et al. Robust multi-lane detection and tracking using adaptive threshold and lane classification
CN110619658B (en) Object tracking method, object tracking device and electronic equipment
JP2018523877A (en) System and method for object tracking
CN112997190B (en) License plate recognition method and device and electronic equipment
CN112614187A (en) Loop detection method, device, terminal equipment and readable storage medium
CN115641359B (en) Method, device, electronic equipment and medium for determining movement track of object
WO2023273344A1 (en) Vehicle line crossing recognition method and apparatus, electronic device, and storage medium
CN114049383A (en) Multi-target tracking method and device and readable storage medium
Specker et al. Improving multi-target multi-camera tracking by track refinement and completion
CN117593685B (en) Method and device for constructing true value data and storage medium
CN113674317B (en) Vehicle tracking method and device for high-level video
Al Mamun et al. Efficient lane marking detection using deep learning technique with differential and cross-entropy loss.
JP2023539643A (en) Identification of critical scenarios for vehicle confirmation and validation
CN112163521A (en) Vehicle driving behavior identification method, device and equipment
CN115908498B (en) Multi-target tracking method and device based on category optimal matching
CN110543818A (en) Traffic light tracking method, device, medium and equipment based on weight graph matching
Wu et al. Camera-based clear path detection
US20230008015A1 (en) Sensor fusion architecture for low-latency accurate road user detection
Sadik et al. Vehicles detection and tracking in advanced & automated driving systems: Limitations and challenges
KR101936108B1 (en) Method and apparatus for detecting traffic sign
Tseng et al. Efficient vehicle counting based on time-spatial images by neural networks
KR102172849B1 (en) Detecting system for approaching vehicle in video and method thereof
CN117456407B (en) Multi-target image tracking method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874081

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE