CN113326719A - Method, equipment and system for target tracking - Google Patents

Method, equipment and system for target tracking

Info

Publication number
CN113326719A
CN113326719A (application CN202010427448.7A)
Authority
CN
China
Prior art keywords
target
neural network
motion
tracking
sensors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010427448.7A
Other languages
Chinese (zh)
Inventor
吕跃强
李冬虎
冷继南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to PCT/CN2021/077845 (published as WO2021170030A1)
Publication of CN113326719A
Legal status: Pending

Classifications

    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/04 Neural network architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/33 Determination of transform parameters for the alignment of images, i.e. image registration, using feature-based methods
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/10024 Color image
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30232 Surveillance
    • G06T 2207/30241 Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Radar Systems Or Details Thereof (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method, device and system for target tracking, applied mainly to fields such as person tracking. The method comprises: acquiring, through sensors, the motion trajectory of an initial tracking target (such as a pedestrian) and the motion trajectories of other targets (such as vehicles) in the scene where the initial tracking target is located; and determining the switched tracking target according to the motion trajectory of the initial tracking target and the motion trajectories of the other targets in the scene. The method helps monitoring personnel keep hold of the tracking target's trajectory and can switch to and continue tracking the new target when the tracked target changes vehicles (i.e., the tracking target is switched), greatly improving target tracking efficiency compared with the prior art, which relies on manual analysis and judgment.

Description

Method, equipment and system for target tracking
Technical Field
The application relates to the technical field of intelligent security, and in particular to a method, device and system for target tracking, more specifically to a method, device and system suitable for switching the tracked target (for example, switching the tracking to a vehicle) after the tracked target (for example, a suspicious person) changes transportation (for example, boards a vehicle).
Background
With the development of video surveillance technology, high-precision intelligent cameras have been deployed along major roads in many cities in China and abroad to monitor road information. Surveillance cameras deter criminals and are also one of the main tools for tracking suspicious persons and assisting in case investigation.
Current automatic target tracking methods mainly include: (1) single-camera target tracking: a pedestrian or vehicle is tracked within the same camera, and when the target disappears (for example, due to occlusion), it is re-identified within that camera through a person re-identification (Person Re-ID) algorithm; (2) cross-camera target tracking: when the target leaves the field of view of the current camera, it is re-identified in another camera through the Re-ID algorithm and tracking resumes. Current automatic tracking methods are therefore limited to tracking the same target across cameras, which makes the tracking process discontinuous, i.e., the target leaves the monitor's view during certain periods. In addition, the environment of cross-camera scenes (lighting, occlusion, target pose, etc.) is very complex, and the accuracy and computational efficiency of Re-ID algorithms remain a concern.
Beyond tracking the same target, real scenarios also involve cross-target tracking, in which the tracked target switches during monitoring (for example, a suspicious person who is the initial tracking target boards a vehicle) and the switched-to target must then be tracked. At present, in such cross-target scenarios the target is mainly switched manually by monitoring personnel; for example, after a suspicious person boards a bus, the monitoring personnel switch the tracked target from the person to the vehicle. A method that relies on manual intervention cannot guarantee real-time tracking and occupies considerable human resources.
In summary, there is a need for a method for continuously and accurately tracking a target, thereby improving the efficiency of target tracking.
Disclosure of Invention
The application provides a method, device and system for target tracking that can switch to and continue tracking the new target after the tracked target changes vehicles (i.e., target switching occurs), so that monitoring personnel have a complete view of the tracked target's trajectory and target tracking efficiency is greatly improved.
A first aspect of the present application provides a method for target tracking, the method comprising: acquiring, through a sensor, the motion trajectories of the targets contained in the scene where a first target is located, wherein those targets comprise the first target and at least one other target, the first target being the initial tracking target; and then determining a second target according to the motion trajectories of the first target and of the at least one other target, and taking the second target as the new tracking target. It should be noted that the motion trajectory of a target over a period of time is formed by the target's position at each instant within that period. The position of a target may be represented by coordinates in an east-north-up (ENU) or north-east-down (NED) coordinate system; the embodiments of the present application do not specifically limit the type of coordinate system. With this method, when the original tracking target is switched (for example, when it boards a bus), the video surveillance system automatically determines the updated tracking target, which keeps the tracked trajectory continuous, ensures that the tracking target's position is not missed in any time period, and improves tracking efficiency. In addition, because the method judges whether the original tracking target has switched from trajectory data, it reduces the amount of computation and the demand on computing power compared with performing intelligent behaviour analysis directly on video data. A sketch of one possible trajectory representation is given below.
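As an illustration of the trajectory representation described above, the following is a minimal sketch in Python; the class and field names are illustrative assumptions rather than terms from the application, and an east-north-up (ENU) world frame is assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrackPoint:
    t: float        # timestamp in seconds
    x: float        # east coordinate in metres (ENU frame, assumed)
    y: float        # north coordinate in metres

@dataclass
class Trajectory:
    target_id: str
    points: List[TrackPoint]

    def position_at(self, t: float) -> Tuple[float, float]:
        # return the recorded position closest in time to t
        p = min(self.points, key=lambda pt: abs(pt.t - t))
        return (p.x, p.y)
```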
In the above method, the "scene where the first target is located" refers to the real scene surrounding the first target; for example, the scene may be an area of 100 meters radius centred on the first target. The embodiments of the present application do not limit the specific extent of the scene, which depends on the actual situation. It should be noted that the scene moves along with the first target, and the trajectories of the first target and the surrounding targets begin to be acquired once tracking of the first target starts.
In one possible implementation, determining the second target according to the motion trajectory of the first target and the motion trajectory of the at least one other target includes: determining a set of candidate targets, where the candidate targets are either the at least one other target, or those of the at least one other target whose distance from the first target is less than a preset threshold; for each candidate target, inputting the motion trajectory of the candidate target and the motion trajectory of the first target into a pre-trained first neural network to obtain the probability that the candidate target is the second target; and determining the second target based on the probabilities of the candidate targets. The "candidate targets" thus cover two cases: (1) all targets in the scene other than the original tracking target; (2) those targets in the scene whose distance from the original tracking target is less than a preset threshold. In this method, the motion trajectory of the first target and that of each candidate target are input into the pre-trained neural network, which outputs the probability that each candidate target is the second target, ensuring both accuracy and real-time performance.
Illustratively, the first neural network may be a long short-term memory (LSTM) network. The first neural network must be trained before use. For example, historical videos of people boarding vehicles can be screened out manually, and the video data of the person and the vehicle during the period before boarding can be used to generate trajectory data for both. The trajectory data is converted into trajectory feature pairs and labelled to form the training set for the first neural network. At inference time, the input is the trajectory feature pairs of the targets contained in the scene where the first target is located, and the output is the probability that each candidate target is the second target; a sketch of such a network is given below.
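The following is a minimal sketch of what such a first neural network could look like, assuming PyTorch; the layer sizes, the 7-dimensional feature layout and all names are illustrative assumptions rather than the application's actual model.

```python
import torch
import torch.nn as nn

class SwitchProbabilityLSTM(nn.Module):
    """Scores one candidate target: the input is a sequence of trajectory
    feature pairs (first target + candidate), the output is the probability
    that the candidate is the switched-to (second) target."""
    def __init__(self, feature_dim: int = 7, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, feature_dim), e.g. 10 consecutive instants
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1]).squeeze(-1)   # (batch,) probabilities

# usage: one sequence of 10 feature pairs for a single candidate target
model = SwitchProbabilityLSTM()
prob = model(torch.randn(1, 10, 7))   # one probability in [0, 1]
```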
It should be noted that, besides inputting the trajectory data into a neural network, other artificial intelligence algorithms may be used to determine the second target; for example, a conventional classification model such as a support vector machine (SVM) may be used. Instead of an artificial intelligence algorithm, hand-crafted rules may also be used: for example, the distance between the first target and another target and their speeds can be derived from the trajectory data, and the changes in distance and speed used as reference indicators to determine the second target. In short, the embodiments of the present application do not specifically limit how the trajectory data is used to determine the second target.
In another possible implementation, determining the second target according to the obtained probabilities includes: when a first probability among the obtained probabilities is higher than a preset threshold, determining that the target corresponding to the first probability is the second target. During tracking, the probability that each candidate target is the second target is computed continuously; when, at some instant, the probability of a candidate target exceeds the preset threshold, that candidate target can be judged to be the second target.
In another implementation, for each candidate target, inputting the motion trajectory of the first target and that of the candidate target into the pre-trained first neural network to obtain the probability that the candidate target is the second target includes: for each candidate target, building at least one group of trajectory feature pairs from its motion trajectory and the motion trajectory of the first target, where each group comprises the trajectory feature pairs of at least two consecutive instants, and the trajectory feature pair at each instant comprises the position and speed of the first target at that instant, the position and speed of the candidate target, and the angle between the two targets' directions of motion. Building trajectory feature pairs with the first target in this way for each candidate target yields at least one group of trajectory feature pairs, and inputting them into the first neural network outputs the probability that each candidate target is the second target. For example, to obtain the probability that a candidate target is the second target at the current instant, one group of feature pairs is input; this group may contain 10 feature pairs, corresponding to the positions, speeds and angle of the candidate target and the first target at the 10 instants preceding the current instant. Inputting this group of trajectory feature pairs into the neural network gives the probability that the candidate target is the second target at the current instant. The method thus builds trajectory feature pairs from the trajectory data of the first target and of the other targets as neural network input, and determines the second target from the at least one other target by analysing the relation between each other target's trajectory features and those of the first target; a sketch of the feature-pair construction follows.
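The following sketch shows one way the trajectory feature pairs described above could be built with NumPy; the exact feature layout (position, scalar speed, angle between motion directions) and the 10-step window are assumptions consistent with the example, not a definitive encoding.

```python
import numpy as np

def build_feature_pairs(first_traj, cand_traj, window=10):
    """Builds one group of trajectory feature pairs over the last `window`
    instants. Each trajectory is an (N, 3) array of rows (t, x, y) sampled at
    the same timestamps, with N >= window + 1. Feature layout per instant:
    [x1, y1, v1, x2, y2, v2, angle]."""
    feats = []
    for k in range(-window, 0):
        p1, p1_prev = first_traj[k, 1:], first_traj[k - 1, 1:]
        p2, p2_prev = cand_traj[k, 1:], cand_traj[k - 1, 1:]
        dt1 = first_traj[k, 0] - first_traj[k - 1, 0]
        dt2 = cand_traj[k, 0] - cand_traj[k - 1, 0]
        d1, d2 = (p1 - p1_prev) / dt1, (p2 - p2_prev) / dt2   # velocity vectors
        v1, v2 = np.linalg.norm(d1), np.linalg.norm(d2)       # speeds
        # angle between the two directions of motion
        cos = np.dot(d1, d2) / max(v1 * v2, 1e-6)
        angle = np.arccos(np.clip(cos, -1.0, 1.0))
        feats.append([*p1, v1, *p2, v2, angle])
    return np.asarray(feats)   # shape (window, 7)
```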
In another possible implementation, with the instant at which the first probability exceeds the preset threshold taken as the first moment, the method further includes: acquiring video frames from a period before and after the first moment, the video frames containing the first target; then inputting the video frames into a pre-trained second neural network and, according to the output, determining a third target as the switched tracking target. A video frame "containing the first target" means that the original tracking target appears in the picture, for example a frame showing, from the side, a person boarding a vehicle, or a frame showing the vehicle door from the front. This implementation adds a further verification on top of the trajectory-based judgment: when the third target and the second target are the same target, that target is taken directly as the switched target; when they are not the same target, the third target is taken as the switched target. The approach mainly uses computer vision to perform intelligent behaviour analysis on the video data and judge whether the first target has switched to the second target. Extracting the relevant video data for a second judgment on top of the trajectory-based one improves the precision of the final decision.
In another possible implementation, the second neural network includes a convolutional neural network and a graph convolutional neural network, and inputting the video data into the second neural network to determine the third target includes: inputting the video data into a pre-trained convolutional neural network and outputting the features and bounding boxes of all targets in the video data; constructing a graph model from the features and bounding boxes of the targets contained in the video data; and inputting the graph model into a pre-trained graph convolutional neural network and determining, according to the output, the third target as the switched tracking target. The second neural network extracts the features of the targets in the video frames and generates their bounding boxes, a dedicated graph model is built, and the graph convolutional neural network judges the targets' behaviour to determine the new, switched-to target. Using a graph convolutional network brings in more spatio-temporal information and improves judgment precision.
Within the second neural network, the convolutional neural network is mainly used to extract the features of the targets in the video and to generate their bounding boxes, which are used to construct the graph model; the graph convolutional neural network is mainly used to judge, from the constructed graph model, whether the original tracking target exhibits cross-target behaviour. Each neural network must be pre-trained before use. For example, for the convolutional neural network, the training set may be images of vehicles with manually labelled bounding boxes. For the graph convolutional neural network, boarding videos are selected manually, passed through the convolutional neural network to extract target features and bounding boxes, converted into graph models as described above, labelled as boarding behaviour, and used as the training set. It should be noted that, besides a neural network, a conventional machine learning model such as a support vector machine (SVM) may also be used to judge whether the targets in the video exhibit cross-target behaviour. A sketch of one possible graph-convolution design is given below.
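As a sketch of the graph-convolution stage, the code below assumes PyTorch and a simple mean-aggregation GCN layer; the graph construction rule (nodes are per-frame detections carrying CNN features, edges encode spatial proximity within a frame and identity links across frames) is one plausible reading of the graph model, not the application's exact definition.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution layer: X' = ReLU(D^-1 (A + I) X W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        adj = adj + torch.eye(adj.size(0))          # add self-loops
        deg = adj.sum(dim=1, keepdim=True)          # node degrees
        return torch.relu(self.lin(adj @ x / deg))  # mean aggregation

class BehaviourGCN(nn.Module):
    """Classifies whether the clip contains a boarding (target-switch) event.
    `node_feats` are per-detection CNN features; `adj` encodes spatial
    proximity within a frame and same-identity links across frames."""
    def __init__(self, feat_dim=256, hidden=128, num_classes=2):
        super().__init__()
        self.g1 = SimpleGCNLayer(feat_dim, hidden)
        self.g2 = SimpleGCNLayer(hidden, hidden)
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, node_feats, adj):
        h = self.g2(self.g1(node_feats, adj), adj)
        return self.cls(h.mean(dim=0))   # graph-level logits
```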
In another possible implementation, the sensors include at least two groups of sensors located at different orientations. For each of the at least two groups, a motion trajectory of each target is generated from the sensing data acquired by that group, yielding at least two motion trajectories per target; these are then fused into the target's final motion trajectory. Here "orientation" refers to the position or direction of an object in real space, and "different orientations" means that the physical locations of the sensor groups are separated, illustratively by at least two meters. Before the motion trajectories obtained from different orientations are fused, the same target must be associated across the different sensing ranges, where "sensing range" is the spatial range a sensor can perceive; sensors at different orientations perceive different scene ranges. For example, a clustering algorithm may be used to associate the trajectories (position sequences) of the first target across sensing ranges, i.e., to establish that they belong to the same target, after which they are fused. Kalman fusion may be used to fuse the trajectories of the different sensor groups: illustratively, at each instant the same target has several positions from the different sensing ranges; a preferred one is selected as the measured position, an estimated position is obtained by fitting or similar methods, and Kalman fusion of the estimated and measured positions gives the target's final position at that instant (see the sketch below). The method fuses the target's motion trajectories from different sensing ranges into its final motion trajectory. Cameras at different positions have different viewing angles, and radars at different positions have different sensing ranges; when the target is occluded (for example by a billboard) in one sensing range (viewing angle), the sensors at other orientations continue to provide its position, ensuring trajectory continuity. Moreover, because uncontrollable factors such as environment and lighting often cause the target to be lost when only one sensor group is used, fusing the trajectory data of several groups at different orientations improves both the accuracy of the final motion trajectory and the tracking efficiency.
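The per-instant fusion of the estimated and measured positions could look like the following sketch; the measurement and prediction noise values are illustrative assumptions.

```python
import numpy as np

def fuse_multiview_position(measured, predicted, r_meas=1.0, p_pred=2.0):
    """One-step fusion of a measured position (chosen from the per-view
    positions at this instant) with a predicted position (e.g. fitted from
    the recent fused track). The weight follows the Kalman gain for a direct
    position measurement."""
    k = p_pred / (p_pred + r_meas)   # Kalman gain
    return np.asarray(predicted) + k * (np.asarray(measured) - np.asarray(predicted))

# usage: two views report slightly different positions for the same target
predicted = np.array([12.0, 5.0])                        # fitted from recent fused track
views = [np.array([12.1, 4.8]), np.array([12.5, 5.1])]   # per-view positions at this instant
measured = min(views, key=lambda p: np.linalg.norm(p - predicted))  # preferred measurement
fused = fuse_multiview_position(measured, predicted)
```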
In another possible implementation, each group of sensors includes at least two types of sensors: a camera plus at least one of a millimetre-wave radar and a lidar, with the sensors of a group located at the same orientation. "A group of sensors at the same orientation" means the different sensor types in the group are physically close to each other; for example, a group consisting of a camera and a millimetre-wave radar mounted on the same utility pole. The embodiments of the present application do not specifically limit the distance between the sensor types within a group, as long as their sensing ranges are approximately the same. Note that "orientation" refers to the sensor's physical position in real space, while "sensing range" refers to the spatial range it can perceive: for a camera, the scene range it can capture; for a radar, the spatial range it can detect. For each sensor type in a group, a monitored trajectory of the target is generated from that type's sensing data, yielding at least two monitored trajectories per target, which are then fused into the target's motion trajectory. For the same target, the trajectory data collected by the different sensors can be fused with a sequential Kalman fusion method: the shared prior estimate (the optimal estimate of the previous instant) is Kalman-fused in turn with the measurements of the different sensors at the current instant to form the optimal estimate at the current instant, and the optimal estimates at all instants form the target's final motion trajectory (a sketch follows). Fusing at least two sensor types at adjacent positions improves the accuracy of the target's trajectory. Moreover, whereas tracking with a camera alone is affected by external weather, fusing the trajectory data provided by the radar helps ensure the tracked target is not lost and improves tracking efficiency.
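The sequential Kalman fusion of the different sensor types within a group could be sketched as follows; the scalar covariance, the constant-position prior and the noise values are simplifying assumptions.

```python
import numpy as np

def sequential_kalman_update(x_prior, p_prior, measurements):
    """Sequentially fuses the measurements of the different sensor types at
    the current instant (e.g. camera position, then radar position) with the
    prior estimate (last instant's best estimate). Each measurement is
    (z, r): a measured 2-D position and its measurement variance."""
    x, p = np.asarray(x_prior, dtype=float), float(p_prior)
    for z, r in measurements:
        k = p / (p + r)                  # scalar Kalman gain
        x = x + k * (np.asarray(z) - x)  # corrected estimate
        p = (1.0 - k) * p                # reduced uncertainty
    return x, p

# usage: fuse camera and radar readings for one instant
x_best, p_best = sequential_kalman_update(
    x_prior=[10.0, 3.0], p_prior=4.0,
    measurements=[(np.array([10.4, 3.1]), 1.0),    # camera, noisier
                  (np.array([10.2, 3.0]), 0.25)])  # millimetre-wave radar
```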
As described above, the target tracking method provided by the application automatically updates the tracking target to the new, switched-to target when the original tracking target switches, so that monitoring personnel always have hold of the tracked object's trajectory. Furthermore, the application uses the targets' motion trajectory data to determine the new target after the switch, greatly reducing the amount of computation while maintaining accuracy. In addition, a segment of video frames is selected, a dedicated graph model is built, and a graph convolutional neural network performs a second judgment on the original tracking target's switching behaviour, bringing in more spatio-temporal information and improving the reliability of target tracking.
In a second aspect, the present application provides another method for target tracking, comprising: acquiring, through a sensor, the motion trajectory of a first target, the first target being the initial tracking target; determining the instant at which the first target's motion trajectory disappears as the first moment, and acquiring video frames from a period before and after the first moment, the video frames containing pictures in which the first target may have been switched to a second target; and inputting the video frames into a pre-trained neural network to determine the second target, taking the second target as the updated tracking target. By analysing the trajectory's characteristics, the method locates the point in time at which the original tracking target may have switched (for example, boarded a vehicle), and retrieves the relevant video data containing the original tracking target for behaviour analysis to determine the new tracking target; compared with performing behaviour analysis on the original tracking target throughout, this greatly reduces the amount of computation.
In another possible implementation, determining the initial instant at which the first target's trajectory disappears as the first moment according to the first target's motion trajectory includes: judging that no motion trajectory of the first target exists after the initial instant, and determining that initial instant to be the first moment; a sketch follows.
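A minimal sketch of locating the first moment and the corresponding video window, assuming timestamped trajectory points; the 2-second gap and the 5-second window are illustrative thresholds.

```python
def find_disappearance_time(track_times, now, gap=2.0):
    """Returns the initial instant after which the first target's trajectory
    no longer exists: the last recorded timestamp, once no new position has
    arrived for `gap` seconds; otherwise None (the target is still tracked)."""
    last_seen = max(track_times)
    return last_seen if now - last_seen >= gap else None

def video_window(t1, before=5.0, after=5.0):
    # time range of video frames to extract around the first moment t1
    return (t1 - before, t1 + after)
```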
In another possible implementation, determining the second target from the video frames and taking the second target as the updated tracking target includes: inputting the video frames into a pre-trained second neural network to determine the second target and taking it as the updated tracking target. Using a neural network to detect behaviour in the relevant video improves judgment precision.
In another possible implementation, the second neural network includes a convolutional neural network and a graph convolutional neural network, and inputting the video data into the second neural network to determine the second target includes: inputting the video data into a pre-trained convolutional neural network and outputting the features and bounding boxes of all targets in the video data; constructing a graph model from those features and bounding boxes; and inputting the graph model into a pre-trained graph convolutional neural network and determining, according to the output, the second target as the switched tracking target. The second neural network extracts the features of the targets in the video frames and generates their bounding boxes, a dedicated graph model is built, and the graph convolutional neural network judges the targets' behaviour to determine the new, switched-to target. Using a graph convolutional network brings in more spatio-temporal information and improves judgment precision.
In another possible implementation, the sensors include at least two groups of sensors located at different orientations. For each of the at least two groups, a motion trajectory of the first target is generated from the sensing data acquired by that group, yielding at least two motion trajectories of the first target, which are then fused into the first target's motion trajectory. Here "orientation" refers to the position or direction of an object in real space, and "different orientations" means that the physical locations of the sensor groups are separated, illustratively by at least two meters. The "sensing range" is the spatial range a sensor can perceive, and sensors at different orientations perceive different scene ranges; for cameras, "different sensing ranges" means capturing the targets in the scene from different viewing angles, and for radars, detecting them from different orientations. Kalman fusion may be used to fuse the motion trajectories of the different sensor groups: illustratively, at each instant the first target has several position readings from the different sensing ranges; a preferred one is selected as the measured position, an estimated position is obtained by fitting or similar methods, and Kalman fusion of the estimated and measured positions gives the first target's final position at that instant. The method fuses the first target's motion trajectories from multiple viewing angles into its final motion trajectory. Cameras at different positions have different viewing angles, and radars at different positions have different sensing ranges; when the first target is occluded (for example by a billboard) in one sensing range (viewing angle), the sensors at other orientations continue to provide its position, ensuring the continuity of the first target's trajectory.
In another possible implementation, each group of sensors includes at least two types of sensors: a camera plus at least one of a millimetre-wave radar and a lidar, with the sensors of a group located at the same orientation. "A group of sensors at the same orientation" means the different sensor types in the group are physically close to each other, for example a camera and a millimetre-wave radar mounted on the same utility pole. The embodiments of the present application do not specifically limit the distance between the sensor types within a group, as long as their sensing ranges are approximately the same. Note that "orientation" refers to the sensor's physical position in real space, while "sensing range" refers to the spatial range it can perceive: for a camera, the scene range it can capture; for a radar, the spatial range it can detect. For each sensor type in a group, a monitored trajectory of the first target is generated from that type's sensing data, yielding at least two monitored trajectories, which are then fused into the first target's motion trajectory. Illustratively, the fusion may use a sequential Kalman fusion method: the shared prior estimate (the optimal estimate of the previous instant) is Kalman-fused in turn with the measurements of the different sensors at the current instant to form the optimal estimate at the current instant, and the optimal estimates at all instants form the first target's motion trajectory. Fusing at least two sensor types at adjacent positions improves the accuracy of the first target's trajectory. Moreover, whereas tracking with a camera alone is affected by external weather, fusing the trajectory data provided by the radar helps ensure the tracked target is not lost and improves tracking efficiency.
In a third aspect, the present application provides a device for target tracking, comprising an acquisition module and a processing module. The acquisition module is configured to acquire sensing data of the targets contained in the scene where a first target is located, those targets comprising the first target and at least one other target, the first target being the initial tracking target. The processing module is configured to generate the motion trajectories of the first target and of the at least one other target from the sensing data, and is further configured to determine a second target according to those motion trajectories and take the second target as the updated tracking target.
In another possible implementation manner, the processing module is further configured to determine a set of candidate targets, where the candidate targets are: the at least one other target, or other targets in the at least one other target whose distance from the first target is less than a preset threshold; for each candidate target, inputting the motion trail of the first target and the motion trail of the candidate target into a pre-trained first neural network, and obtaining the probability that the candidate target is the second target; determining the second target according to the probability that the at least one candidate target is the second target.
In another possible implementation manner, the processing module is further configured to detect that a first probability of the probabilities that the at least one candidate target is the second target is higher than a preset threshold, and determine that a target corresponding to the first probability is the second target.
In another possible implementation, the processing module is further configured to: for each candidate target, build at least one group of trajectory feature pairs from the candidate target's motion trajectory and the first target's motion trajectory, each group containing the trajectory feature pairs of at least two consecutive instants, with each pair comprising the position and speed of the first target, the position and speed of the candidate target, and the angle between their directions of motion; and input the at least one group of trajectory feature pairs into the first neural network and output the probability that the candidate target is the second target.
In another possible implementation, with the instant at which the first probability exceeds the preset threshold taken as the first moment, the processing module is further configured to select video frames from before and after the first moment, the video frames containing pictures in which the first target may have switched to a third target; and to input the video frames into a pre-trained second neural network and determine, according to the output, the third target as the updated tracking target.
In another possible implementation, the second neural network includes a convolutional neural network and a graph convolutional neural network, and the processing module is specifically configured to input the video frames into the pre-trained convolutional neural network and output the features and bounding boxes of the targets contained in the video frames; construct a graph model from those features and bounding boxes; and input the graph model into a pre-trained graph convolutional neural network and determine, according to the output, the third target as the switched tracking target.
In another possible implementation, the sensors include at least two groups of sensors located at different orientations, and for each of the first target and the other targets the processing module is specifically configured to: generate at least two motion trajectories of the target from the sensing data acquired by the at least two groups of sensors respectively; and fuse the at least two motion trajectories into the target's motion trajectory.
In another possible implementation, each group of sensors includes at least two types of sensors: a camera plus at least one of a millimetre-wave radar and a lidar, with the at least two types located at the same orientation, and for each target contained in the scene where the first target is located the processing module is specifically configured to: generate at least two monitored trajectories of the target from the sensing data acquired by the at least two types of sensors respectively; and fuse the at least two monitored trajectories into the target's motion trajectory.
In a fourth aspect, the present application provides another device for target tracking, comprising an acquisition module and a processing module. The acquisition module is configured to acquire sensing data of a first target through a sensor. The processing module is configured to generate the motion trajectory of the first target from the sensing data; determine the initial instant at which the first target's trajectory disappears as the first moment; acquire video data from a period before and after the first moment, the video data containing pictures in which the first target may have switched to a second target; and determine the second target from the video data, taking the second target as the updated tracking target.
In another possible implementation manner, the processing module is further configured to determine that a motion trajectory of the first target after an initial time does not exist, and determine that the initial time is the first time.
In another possible implementation manner, the processing module is further configured to input the video data to a pre-trained second neural network to determine a second target, and use the second target as an updated tracking target.
In another possible implementation, the second neural network comprises a convolutional neural network and a graph convolutional neural network, and the processing module is further configured to input the video frames into the pre-trained convolutional neural network and output the features and bounding boxes of the targets contained in the video data; construct a graph model from those features and bounding boxes; and input the graph model into a pre-trained graph convolutional neural network, determine the second target according to the output, and take the second target as the updated tracking target.
In another possible implementation manner, the sensors include at least two groups of sensors, the positions of the sensors in different groups are different, and the processing module is specifically configured to: generating at least two motion tracks of a first target according to the sensing data acquired by the at least two groups of sensor modules respectively; and fusing at least two motion tracks of the first target to form a motion track of the first target.
In another possible implementation, each group of sensors includes at least two types of sensors: a camera plus at least one of a millimetre-wave radar and a lidar, with the at least two types located at the same orientation, and the processing module is specifically configured to: generate at least two monitored trajectories of the first target from the sensing data acquired by the at least two types of sensors respectively; and fuse the at least two monitored trajectories into the first target's motion trajectory.
In a fifth aspect, the present application provides an apparatus for target tracking, the apparatus comprising a processor and a memory, wherein: the memory has stored therein computer instructions; the processor executes the computer instructions to implement the method of any of the first aspect and possible implementations described above.
In a sixth aspect, the present application provides an apparatus for target tracking, the apparatus comprising a processor and a memory, wherein: the memory has stored therein computer instructions; the processor executes the computer instructions to implement the method of any of the second aspect and possible implementations described above.
In a seventh aspect, the present application provides a computer-readable storage medium storing computer program code which, when run on a computer, causes the computer to perform the method of any of the first aspect and its possible implementations. Such computer-readable storage includes, but is not limited to, one or more of: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), flash memory, electrically erasable PROM (EEPROM), and hard disk drives.
In an eighth aspect, the present application provides a computer-readable storage medium storing computer program code which, when run on a computer, causes the computer to perform the method of any of the second aspect and its possible implementations. Such computer-readable storage includes, but is not limited to, one or more of: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), flash memory, electrically erasable PROM (EEPROM), and hard disk drives.
In a ninth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of the first aspect and possible implementations described above.
In a tenth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of the second and possible implementations described above.
Drawings
Fig. 1 is a schematic application scenario diagram of a method for target tracking according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a system architecture for target tracking according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating a method for target tracking according to an embodiment of the present disclosure.
Fig. 4 is another schematic flowchart of a method for target tracking according to an embodiment of the present disclosure.
Fig. 5 is a schematic flowchart of a method for fusing a video track and a radar track according to an embodiment of the present application.
Fig. 6 is a table of positions of a target at different times and at different viewing angles according to an embodiment of the present application.
Fig. 7 is a schematic diagram of two-dimensional motion trajectories of an original tracked target and other targets according to an embodiment of the present application.
Fig. 8 is a schematic diagram of the probability, output by the LSTM neural network at each instant, that each other target is the new target after switching, according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of an LSTM unit provided in an embodiment of the present application.
Fig. 10(a) is a schematic structural diagram of an LSTM neural network provided in an embodiment of the present application.
Fig. 10(b) is a schematic diagram of a feature pair provided in an embodiment of the present application.
Fig. 11 is a time-interval distribution diagram of the training samples used to train the LSTM neural network according to an embodiment of the present application.
Fig. 12 is a schematic flowchart of a method for performing secondary determination by computer vision according to an embodiment of the present application.
Fig. 13 is a schematic structural flow diagram of a second neural network provided in an embodiment of the present application.
Fig. 14 is a schematic diagram of an enclosure of an object provided in an embodiment of the present application.
Fig. 15 is a diagram model schematic diagram provided in the embodiment of the present application.
Fig. 16 is a schematic hardware structure diagram of a target tracking device according to an embodiment of the present application.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.
Application scenario introduction
Fig. 1 is a schematic view of an application scenario of the target tracking method provided in the present application. Illustratively, in Fig. 1 there is one set of monitoring equipment in front of the tracked target 16 and another diagonally behind it. The front monitoring equipment 10 comprises a surveillance camera 11 and a radar 12, and the rear monitoring equipment 13 comprises a surveillance camera 14 and a radar 15. The two sets have different sensing ranges (viewing angles), and each set can monitor and track the targets within its sensing range and form their motion trajectories in the world coordinate system. For a surveillance camera, the sensing range is the scene range it can capture; for a radar, the sensing range is the spatial range it can detect; the viewing angle is the direction from which the camera captures its images, and different viewing angles correspond to different sensing ranges. A motion trajectory is the target's position over a period of time and may be represented by coordinates: based on the calibrated positions of the camera and the radar, the target position captured or detected at each instant is projected into a global coordinate system, forming the target's motion trajectory. To improve trajectory accuracy within a given sensing range, besides collecting video data with the surveillance camera, a millimetre-wave radar (or lidar) is also installed at a physical position adjacent to the camera; in the following, "radar" is used as a general term for millimetre-wave radar or lidar. The radar data and the video data within the same sensing range are fused into a motion trajectory of higher accuracy. In addition, to reduce the influence of occlusion on tracking and to keep the trajectory continuous, the motion trajectories from multiple sensing ranges must be fused, i.e., the trajectory data acquired by monitoring equipment 10 and monitoring equipment 13 are fused to obtain the target's motion trajectory. During tracking, the tracked target 16 boards the bus 17 and leaves. Because the tracked target 16 disappears once it is inside the vehicle, even fusing its motion trajectories from the different sensing ranges (multiple viewing angles) cannot keep real-time tracking going, since the target 16 is no longer visible to any camera. The solution proposed by the application is therefore: first, judge whether the tracked target 16 (the original tracking target) has switched (changed vehicles) according to the motion trajectories of the tracked target 16 and of the surrounding targets (such as the bus 17 and possibly more targets); if it is confirmed that the tracked target 16 has switched (for example, it has boarded the bus 17), the tracking target is switched to the bus 17 until the original tracking target reappears in some monitored scene. Further, after the new tracking target has been determined from the trajectories, to improve judgment accuracy the video data containing the pictures in which the switching is suspected to have occurred is selected for behaviour analysis, to verify whether the original tracking target's switching behaviour really occurred.
The automatic tracking method above can track the target continuously: even if the target changes vehicles, the switched-to target is monitored and tracked, no time period is missed, and the convenience and reliability of tracking and controlling persons are improved to a great extent.
Introduction to System architecture
Fig. 2 is a schematic system architecture diagram of an automatic tracking system according to the present application. As shown, the system includes an end-side node 21 (TNode), an edge-side node 22 (Edge Node), and a server-side node 23 (SNode). Each node can independently execute computing tasks, and the nodes communicate with each other over a network to issue tasks and upload results. Network transmission may be wired or wireless: wired transmission includes Ethernet, optical fibre and the like, and wireless transmission includes broadband cellular networks such as 3G (third generation), 4G (fourth generation) or 5G (fifth generation).
The end side nodes 21(TNode) may be used to fuse video tracks as well as radar tracks in the same sensing range. The end-side node 21 may be a camera itself or various processor devices with computing capabilities. Data collected by the cameras and the radars which are adjacent to each other in physical position can be directly subjected to fusion calculation at the end-side node without network transmission, so that the occupation of bandwidth is reduced, and the time delay is reduced. The edge side nodes 22 (ENodes) may be used to fuse the trajectory data at different sensing ranges (perspectives). The edge side nodes 22 may be edge computing boxes including switches, storage units, power distribution units, computing units, and the like. The server side node 23(SNode) is mainly used to perform cross-target behavior determination on a tracking target. The server side node 23 may be a cloud server, and stores and calculates data uploaded by the end side node 21 and the edge side node 22 to implement allocation of tasks. The method and the device do not limit the type and the virtualization management mode of the cloud server equipment.
It should be noted that the tasks executed by the computing nodes are not fixed; for example, the edge-side node 22 may also fuse video tracks and radar tracks within the same sensing range, and the server-side node 23 may also fuse motion trajectories from different sensing ranges (multiple viewing angles). Specific implementations include, but are not limited to, the following three: (1) The different groups of sensors (cameras and/or radars) transmit the acquired sensing data directly to the server-side node 23, which performs all computation and judgment. (2) The different groups of sensors transmit the acquired sensing data directly to the edge-side node 22, which fuses the sensing data of each group and then fuses the trajectory data of the different groups to obtain the target's continuous motion trajectory, and transmits the result to the server-side node 23; the server-side node 23 determines the second target from the trajectory data (this may also be done by the edge-side node 22) and performs the secondary verification from the video data. (3) The camera and radar data of each group are transmitted to the nearest end-side node 21, which fuses the video and radar tracks within the same sensing range; the end-side nodes 21 then transmit the fused trajectory data to the edge-side node 22, which fuses the trajectories of the same target across different sensing ranges (viewing angles) to obtain the target's continuous motion trajectory; the edge-side node 22 then transmits it to the server-side node 23, which determines the second target from the trajectory data and performs the secondary verification from the video data. In short, the application does not restrict which node executes which computing function; this is mainly determined by user habits, network bandwidth, or the computing power of the hardware itself.
Overall solution
The overall process of the scheme of the present invention will be described with reference to fig. 3, and the method includes the following steps:
s31: and acquiring sensing data of the target contained in the scene where the first target is located, wherein the sensing data is acquired by the sensor. The target comprises a first target and other targets, and the first target refers to an original tracking target. For example, the scene in which the first target is located may be an area with a radius of 20 meters centered on the original tracked target. For example, if the first target is a tracking target 16, then the other targets may be vehicles that are less than 20 meters away from the tracking target 16. If the sensor is a monitoring camera, the sensing data is the video data shot by the monitoring camera; if the sensor is a millimeter wave radar, the induction data is the distance data between the target detected by the millimeter wave radar and the radar; if the sensor is a laser radar, the sensing data is the position, the speed and other data of the target detected by the laser radar.
S32: and generating a motion track of each target according to the sensing data acquired by the sensor. Each object here refers to the first object and the other objects in the scene in which the first object is located. For example, when the sensor is a camera, the camera is calibrated in advance, and the position of a target in the camera under a world coordinate system can be directly obtained through video data; when the sensor is a radar, the position of the radar is fixed and known, and the position of the target under a world coordinate system can be obtained by measuring the distance between the target and the radar; when the sensor is a laser radar, the coordinate of the laser radar is also known, and the three-dimensional lattice data of the target can be acquired by emitting laser beams, so that the information such as the position, the speed and the like of the target can be directly acquired. The position of the target at each moment in the world coordinate system constitutes the motion trail of the target.
S33: and determining a second target according to the motion tracks of the first target and other targets, and taking the second target as a switched target. When the first target is switched (cross-target), determining a second target by using the track of the original tracking target and the tracks of other tracking targets, and continuously tracking the second target as a new tracking target.
This scheme takes into account the real motion of the target during tracking and switches the tracking target flexibly, forming tracking without blind spots or time gaps and improving tracking efficiency. In addition, the method determines the new tracking target from trajectory data; compared with the prior art, in which behavior determination is performed throughout by computer-vision methods, the amount of calculation is greatly reduced and real-time requirements can be better met.
Implementation details of the scheme
After the overall flow of the scheme is introduced, a specific implementation scheme of the present invention will be shown in detail below. Illustratively, the original tracked target is the tracked target 16, and as shown in fig. 4, the complete solution mainly includes five steps (some of the steps are optional):
step S41: and for each group of sensors, fusing the sensing data of the video and the radar in the group of sensors to obtain the motion track of the target in the sensing range of the group of sensors.
At least two types of sensors may be included in the same group, and the two types of sensors are in the same orientation, as illustrated by the camera 11 and the radar 12 in fig. 1. "Same orientation" means physical proximity; for example, the sensors may be mounted on the same pole. In the figure, orientation 1 refers to position 1, where the camera 11 and the radar 12 are physically adjacent; Kalman fusion of the sensing data they acquire for the tracking target 16 yields the trajectory data of the tracking target 16 as sensed from position 1 (i.e., the motion track within sensing range 1). Similarly, the camera 13 and the radar 14 are also physically adjacent (both in orientation 2), and fusing their sensing data yields the trajectory data of the target as sensed from orientation 2 (i.e., the motion track within sensing range 2). It should be noted that, for a camera, the sensing range refers to the scene range the camera can capture; for a radar, it refers to the scene range the radar can detect. Sensors in the same orientation (e.g., camera 11 and radar 12) have approximately the same sensing range.
The targets include the original tracking target and the other targets in the scene where it is located, and generating the motion track of the targets means generating a corresponding track for each target. Optionally, when video data is collected, a 3D detection method may be used to determine the centroid of the target in order to improve the precision of the video track. In addition, when the target is a vehicle, near-ground features of the vehicle, such as the front and rear wheels, can be used to estimate the centroid and further improve the accuracy of the video track.
In order to improve the accuracy of the target motion track sensed in a certain direction, track data obtained by multiple types of sensors can be fused. For example, track data acquired by a camera and track data acquired by a millimeter wave radar are fused, in addition, the track data of the camera and the track data of a laser radar can be fused, and even data of the camera, the millimeter wave radar and the laser radar are fused.
Taking the acquisition of the motion track of the tracked target 16 within sensing range 1 as an example, the trajectory data of the camera 11 and the radar 12 are fused; the fusion method provided by the embodiment of the present application is shown in detail below. The method applies the computational idea of Kalman fusion. The track of the tracked target 16 is formed by its position at each moment; fusing the camera data and the radar data at each moment gives the position of the tracked target 16 at that moment, and these fused positions form the motion track of the tracked target 16 within the sensing range. Before the fusion, the radar data (position) and the video data (position) of the tracking target 16 must be put into correspondence. For example, an echo image of the radar may be analyzed to generate a contour image of the targets contained in the scene where the tracking target 16 is located, and, combined with the calibration information, it can be determined which target in the video each radar-monitored target position corresponds to.
For convenience in describing the fusion process, the position of the tracking target 16 obtained from the camera data is called the video measurement position, and the position obtained from the radar data is called the radar measurement position. Fig. 5 shows how the video data and radar data of the tracking target 16 are fused at time t. First, there is an optimal estimated position F_{t-1} at time t-1; from this optimal estimated position at the previous moment, the predicted position E_t at time t can be predicted, where the specific prediction formula may be empirically derived. At time t, the video measurement position V_t is obtained from the video data collected by the camera. Kalman fusion of the predicted position E_t and the video measurement position V_t yields an intermediate optimal estimated position M_t. Meanwhile, the radar measurement position R_t of the target is obtained from the data acquired by the radar at time t; Kalman fusion of the intermediate optimal estimated position M_t and the radar measurement position R_t yields the final optimal estimated position F_t at time t. The optimal estimated position at each moment is calculated from the optimal estimated position at the previous moment; the optimal estimated position at the initial moment may be the video measurement position, the radar measurement position, or a position obtained by fusing the two. The optimal estimated positions obtained at all moments then constitute the motion track of the tracking target 16 within sensing range 1. It should be noted that the fusion order may be reversed, fusing the radar data first and then the camera data; in short, the embodiment of the present application does not limit the type or number of fused sensors or the fusion order. In addition to the tracking target 16 (the first target), the motion tracks of the other targets, such as the bus 17, are obtained in the same manner.
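The sequential structure of this fusion can be sketched as follows. This is a minimal illustrative sketch assuming a constant-velocity prediction and scalar per-axis noise parameters (Q, R_video, R_radar); these modelling choices and names are assumptions, not values from the disclosure.

```python
import numpy as np

def kalman_update(x_pred, P_pred, z, R):
    """Standard Kalman measurement update for a position-only state."""
    K = P_pred / (P_pred + R)                # scalar gain applied per axis
    x_new = x_pred + K * (z - x_pred)
    P_new = (1.0 - K) * P_pred
    return x_new, P_new

def fuse_step(F_prev, P_prev, velocity, dt, z_video, R_video, z_radar, R_radar, Q):
    # 1) predict E_t from the previous optimal estimate F_{t-1}
    E_t = F_prev + velocity * dt
    P_t = P_prev + Q
    # 2) fuse the prediction with the video measurement -> intermediate M_t
    M_t, P_m = kalman_update(E_t, P_t, z_video, R_video)
    # 3) fuse the intermediate estimate with the radar measurement -> final F_t
    F_t, P_f = kalman_update(M_t, P_m, z_radar, R_radar)
    return F_t, P_f
```

Note that a single prediction E_t is reused for both measurement updates at each step, which is the point distinguishing this scheme from fusing separately filtered per-sensor tracks.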
It should be noted that, besides applying the Kalman idea to fuse the video and radar tracks, the simplest weighted-average algorithm may also be used for the fusion. For example, position A of the tracking target 16 may be obtained from the video data at time t-1 and position B from the radar data, and the final position of the tracking target 16 within sensing range 1 at time t-1 may be obtained by directly computing a weighted average of position A and position B. Optionally, the fusion policy may specify that trajectory data of targets more than 60 meters from the sensor is radar-dominant (radar weighted higher), while trajectory data of targets within 60 meters is video-dominant (video weighted higher).
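A weighted-average variant with such a distance-dependent policy could look like the sketch below; the 60-meter threshold follows the example above, while the 0.7/0.3 weights are purely assumed for illustration.

```python
def weighted_fuse(pos_video, pos_radar, distance_to_sensor, threshold_m=60.0,
                  near_video_weight=0.7):
    """Distance-dependent weighted average: video-dominant near the sensor,
    radar-dominant far away. Positions may be floats or numpy arrays."""
    w_video = near_video_weight if distance_to_sensor < threshold_m else 1.0 - near_video_weight
    w_radar = 1.0 - w_video
    return w_video * pos_video + w_radar * pos_radar
```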
According to the embodiment of the application, measurement position data provided by different sensors within the same sensing range are fused through an innovative Kalman filtering method, which improves the precision of the target motion track. It should be noted that the fusion method of this embodiment does not simply Kalman-fuse the video predicted position with the video measurement position and, separately, the radar predicted position with the radar measurement position, and then fuse the two resulting trajectories. The key point of the embodiment is that a single, unified predicted position is used at each moment, against which the measurement positions of the different sensors are fused in turn, producing an intermediate optimal estimated position in the process. This fusion method lets the final position of the target at each moment draw on the sensing data of multiple sensors, improving the precision of the track.
It should be noted that step S41 is not indispensable to the overall scheme; in practice a camera may have no radar (such as a millimeter-wave radar) installed nearby, in which case the trajectory data within a given sensing range can be obtained directly from the video data collected by the camera, without fusion. In short, step S41 is optional and is mainly determined by the actual situation of the application scenario, the requirements of the monitoring personnel, and the like.
Step S42: and fusing the motion tracks of the target in different induction ranges.
The targets include the original tracking target and the other targets in the scene where it is located. After the motion track of a target within a certain sensing range is obtained, the track may be interrupted because, as the target keeps moving, it becomes occluded by obstacles (such as a billboard or a bus) or simply leaves the monitoring range. In order to obtain the continuous motion track of the target, the same target under different viewing angles needs to be associated, and the tracks of that target from the different viewing angles then need to be fused.
Because there are multiple targets (the tracking target 16 and other targets such as the bus 17) in each sensing range, the tracks of the same target in different sensing ranges need to be associated before fusion; that is, it must be ensured that the motion tracks fused across multiple sensing ranges all belong to the same target. Illustratively, the association may proceed as follows. First, the sensing data of each target in the different sensing ranges is acquired and target features are extracted (a person's hair color and clothes color, a vehicle's color and shape, and so on). Then, for time t, the target track position P and the target feature C are combined into a pair (P_{n,t}^k, C_{n,t}^k), where P_{n,t}^k denotes the track position of target n at time t under viewing angle k, and C_{n,t}^k denotes the feature information of target n at time t under viewing angle k, such as the position of the vehicle, the direction of the vehicle head, and the contour of the vehicle body. The feature and track-position pairs detected for each target in the different sensing ranges are then clustered by a clustering algorithm; pairs that fall into the same cluster are judged to correspond to the same object. The clustering algorithm may be a density-based clustering algorithm such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise); the embodiment of the present application does not specifically limit the clustering algorithm used to associate the same target. After clustering, if a cluster contains the pairs (P_{16,t}^1, C_{16,t}^1), (P_{16,t}^2, C_{16,t}^2), and (P_{16,t}^4, C_{16,t}^4), these pairs belong to the same target object, namely the track positions of the tracking target 16 measured under viewing angles 1, 2, and 4. Once the track positions of the same target at different viewing angles are associated at time t, the track positions measured at the multiple viewing angles need to be fused. It should be noted that "viewing angle" refers to the direction captured by a camera; each viewing angle corresponds to a sensing range, and different viewing angles correspond to different sensing ranges. Cameras at different orientations have different viewing angles and thus correspond to different sensing ranges.
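As a rough illustration of this association step, the sketch below clusters per-view detections whose position and appearance features have been packed into numeric vectors. It assumes scikit-learn's DBSCAN; the eps value, field names, and feature encoding are illustrative assumptions only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def associate_views(detections):
    """detections: list of dicts {'view': k, 'target': local_id, 'vec': np.ndarray},
    where 'vec' is [x, y, feat_1, ..., feat_m] (position plus appearance features).
    Returns groups of (view, local_id) judged to be the same physical target."""
    X = np.stack([d['vec'] for d in detections])
    labels = DBSCAN(eps=1.5, min_samples=1).fit_predict(X)   # eps is an assumed value
    groups = {}
    for det, lab in zip(detections, labels):
        groups.setdefault(lab, []).append((det['view'], det['target']))
    return list(groups.values())
```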
Fig. 6 shows, after association by the clustering algorithm, the positions of the tracking target 16 at different times and within different sensing ranges: each row corresponds to the same time, and each column corresponds to the same sensing range (viewing angle). The method of fusing multi-view trajectory data is presented here taking, as an example, sensors that include only cameras. The embodiment of the application applies the idea of Kalman fusion to track fusion; the specific fusion method is as follows:
When t = 0, that is, at the initial time, the measured position from one of the viewing angles may be selected (even at random) as the estimated position of the initial time. At time t = 1, there are measured positions of the tracked target 16 at viewing angle 1, viewing angle 2, and so on; the measurement from the better viewing angle is selected as the target measured position at time t = 1, according to criteria such as whether the viewing angle covers an entrance or exit passage and the distance between the device corresponding to the viewing angle and the target. The position at time t = 1 is predicted from the final (optimal) position at time t = 0, yielding the target predicted position at time t = 1. There are various ways to predict the position at the current time from the final position at the previous time; for example, the position may be estimated by fitting, or directly from the velocity (magnitude and direction) and position at the previous time. Kalman fusion of the target predicted position and the selected target measured position at time t = 1 then yields the optimal estimated position at time t = 1. Similarly, at time t = 2 the target measured position at the optimal viewing angle and the target predicted position at time t = 2 (predicted from the optimal estimated position at time t = 1) are Kalman-fused to obtain the optimal estimated position at time t = 2. Repeating this process at each moment yields the continuous motion track of the tracking target 16 after fusing the data of the different viewing angles. The continuous motion tracks of the other surrounding targets can be obtained in the same way. Besides the above Kalman-based fusion, the most direct weighted-average method may also be used to obtain the final position of the target at each moment. For example, if viewing angles 1, 2, 3, ..., k exist at time t = 1, the tracking target 16 has k measured positions at time t = 1, and the final position of the tracking target 16 at time t = 1 can be obtained by directly computing a weighted average of the k positions.
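A compact sketch of this best-view-then-fuse loop is given below, assuming positions are numpy arrays, scalar process/measurement noise, and a crude finite-difference velocity estimate; the quality scoring of viewing angles is left abstract because the disclosure describes it only qualitatively (entrance coverage, distance to the target).

```python
import numpy as np

def best_view_measurement(measurements):
    """measurements: list of (position, quality_score) per viewing angle at one time.
    The quality score is assumed to be computed elsewhere."""
    return max(measurements, key=lambda m: m[1])[0]

def fuse_multi_view(track_measurements, Q=1.0, R=1.0):
    """track_measurements: list over time of per-view (position, score) lists.
    Returns the fused continuous trajectory."""
    est = best_view_measurement(track_measurements[0])   # t = 0: pick one view
    P, vel = 1.0, np.zeros_like(est)
    traj = [est]
    for frame in track_measurements[1:]:
        pred = est + vel                                  # predict from last optimum
        P_pred = P + Q
        z = best_view_measurement(frame)                  # measurement at best view
        K = P_pred / (P_pred + R)
        new_est = pred + K * (z - pred)                   # Kalman fusion
        P = (1.0 - K) * P_pred
        vel = new_est - est                               # crude velocity estimate
        est = new_est
        traj.append(est)
    return traj
```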
The embodiment of the application applies the Kalman fusion idea, for the first time, to the fusion of target motion tracks under multiple viewing angles, and the measurement at each moment is taken from the optimal viewing angle rather than a random one. This improves the accuracy of track fusion, effectively mitigates occlusion or loss of the target during tracking, and ensures the continuity of each target's motion track.
Step S43: and determining the target after the original tracking target is switched according to the motion tracks of the original tracking target and other targets.
According to step S42, the continuous motion trajectories of the original tracking target and the other targets can be obtained. These continuous motion trajectories are input into a pre-trained neural network model to determine whether the original tracking target exhibits switching behavior, thereby determining the new tracking target. Illustratively, a two-dimensional schematic of the trajectories is shown in fig. 7, where solid line 1 is the motion trajectory of the original tracking target 1 (the tracking target 16), and dashed lines 2, 3, and 4 correspond to the motion trajectories of other targets 2, 3, and 4; the other target 2 corresponding to dashed line 2 may be the bus 17. Optionally, before determining the post-switch target from the trajectories, trajectories that are too far from the original tracking target may be filtered out. Illustratively, at time t, when the distance between the tracking target 16 and another target is greater than 7 meters, that target is filtered out. Assume the original tracking target is at position (x1, y1) and another target is at position (x2, y2); the distance L between the two can be calculated with the Euclidean distance formula:

L = sqrt((x1 - x2)^2 + (y1 - y2)^2)

The other targets remaining after filtering may be referred to as candidate targets.
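A trivial sketch of this pre-filtering step follows; the 7-meter radius is the example value above, and the function name is an assumption.

```python
import numpy as np

def filter_candidates(target_pos, other_positions, max_dist=7.0):
    """Keep only the indices of targets within max_dist meters of the
    tracked target at the current moment."""
    return [i for i, p in enumerate(other_positions)
            if np.linalg.norm(np.asarray(target_pos) - np.asarray(p)) <= max_dist]
```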
The process of determining the new target after the original tracking target switches, based on the trajectories, essentially comprises the following steps: first, the distant target 4 is filtered out according to the method above, leaving candidate targets 2 and 3; then a trajectory feature space of the original tracking target and the candidate targets is established and input into a pre-trained first neural network model, which determines the new target after the original tracking target switches. In the embodiment of the present application, an LSTM neural network (the first neural network) is taken as an example to describe how the trajectory data is analyzed to determine the new post-switch target.
Before describing the use of an LSTM neural network for trajectory analysis, we first describe the LSTM neural network.
The LSTM network is a recurrent neural network over time. The LSTM network (shown in fig. 10(a)) is built from LSTM units (shown in fig. 9), and each LSTM unit contains three gates: a forget gate (Forget gate), an input gate (Input gate), and an output gate (Output gate). The input gate of the LSTM unit determines which information is used to update the state, and the output gate determines the output value. Illustratively, fig. 9 shows LSTM units at three moments: the input of the first LSTM unit is the input at time t-1, the input of the second is the input at time t, and the input of the third is the input at time t+1; the three units have the same structure. The core of the LSTM network is the cell state: each large box in fig. 9 is a cell, and the cell state (C_{t-1} and C_t in fig. 9) runs along the horizontal line across the cell like a conveyor belt, passing through the entire cell with only a small amount of linear manipulation. This allows information to pass through the whole cell essentially unchanged, which enables long-term memory. After the LSTM unit produces an output at each moment, if there remain trajectory feature pairs that have not yet been input, the output of the current moment together with the feature pair of the next not-yet-input moment is fed into the LSTM unit; this repeats until no trajectory feature pairs remain to be input. It should be noted that fig. 9 only shows some units of the network; the whole neural network is actually formed by combining multiple LSTM units. As shown in fig. 10(a), the outputs of the LSTM units at all moments are collectively input to a fully connected layer (full connection), the classification result (score) is then output through a softmax layer, and a regression loss (regression loss) is used to find which of the input moments is the switching moment. X_{t-1}, X_t, and X_{t+1} in fig. 9 and fig. 10(a) correspond to each other. Illustratively, X_{t-1}, X_t, and X_{t+1} in fig. 10(a) are the feature pairs of targets 1 and 2 at times t1, t2, and t3, respectively, as shown in fig. 10(b), where P11 denotes the position of target 1 at time t1 and P12 the position of target 2 at time t1; V11 denotes the velocity of target 1 at time t1 and V12 the velocity of target 2 at time t1; and θ1 denotes the angle between the motion directions of targets 1 and 2 at time t1. Inputting the group of feature pairs shown in fig. 10(b) into the LSTM neural network model of fig. 10(a) determines whether the group contains switching behavior (the softmax classification is yes or no, with the score as the probability value) and the time at which it occurs (the index).
In the above method, the first neural network needs to be pre-trained before use. For example, video data containing scenes of people boarding taxis may be collected manually. From the video data, the trajectory data of a person and of the taxi the person boards are obtained, and trajectory feature pairs of the person and the taxi are built according to the method above (as shown in fig. 10(b)), i.e., the positions, speeds, and velocity-direction angles of the person and the taxi at each moment. Samples formed by the feature pairs over different time periods are labeled separately and input into the neural network for training. For example, assume the track sampling interval is 1 second and a video segment is found manually in which the person boards the taxi at 11:01:25 (01'25" in fig. 11); a, b, c, d, and e in fig. 11 are five staggered time periods, each 10 seconds long. Taking time period a as an example, its start time is 01 minutes 10 seconds and its length is 10 seconds. The video data of the person and the taxi within time period a is selected and converted into the corresponding trajectory data, giving a group of trajectory feature pairs of the person and the taxi at 10 moments (time period a covers 10 seconds). These 10 trajectory feature pairs are taken as one group of trajectory features and labeled with classification result NO (because the person does not board the taxi within time period a) and switching time "none". Similarly, the group of feature pairs obtained in time period b is also labeled NO with switching time "none". Time period e starts at 01'20" and is 10 s long, and it contains the moment the person boards the taxi; therefore, the group of trajectory feature pairs (10 pairs) obtained in time period e is labeled YES with switching time 01'25". The five groups of feature pairs obtained from the five time periods can be used as five training samples to train the neural network; the actual number of training samples is determined by the practical situation or the desired model accuracy. After the neural network is trained, real-time trajectory data can be converted into trajectory features and input into the network, which outputs the probability that each other target is the post-switch target at each moment.
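One way to mechanize this window labelling is sketched below; the 10-second windows, 1 Hz sampling, and staggered starts follow the example above, while the function name, stride value, and data layout are assumptions for illustration.

```python
def make_training_windows(feature_pairs, switch_index, window_len=10, stride=5):
    """feature_pairs: list indexed by second, each entry the person/taxi
    trajectory-feature pair for that second. switch_index: index of the
    boarding second within feature_pairs (or None if no switch occurs).
    Returns (window, label, index_in_window) samples."""
    samples = []
    for start in range(0, len(feature_pairs) - window_len + 1, stride):
        window = feature_pairs[start:start + window_len]
        contains_switch = (switch_index is not None
                           and start <= switch_index < start + window_len)
        label = 1 if contains_switch else 0          # YES / NO classification
        index = switch_index - start if contains_switch else -1
        samples.append((window, label, index))
    return samples
```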
The use of the LSTM neural network is described below. Taking targets 1, 2, 3 as an example, the input to the LSTM neural network is the pair of track features of target 1 and target 2, and the pair of track features of target 1 and target 3. The above track characteristics can be understood as some attributes of the target track, such as position, speed and angle, which are specifically shown as follows:
[O1Vt, O2Vt, O1Pt(x, y), O2Pt(x, y), θ1t],  t = 0, 1, 2, …, m   (1)
[O1Vt, O3Vt, O1Pt(x, y), O3Pt(x, y), θ2t],  t = 0, 1, 2, …, m   (2)
wherein O1Pt(x, y) denotes the position of the original target 1 at time t and O1Vt its velocity at time t; O2Pt(x, y) denotes the position of candidate target 2 at time t and O2Vt its velocity; O3Pt(x, y) denotes the position of candidate target 3 at time t and O3Vt its velocity. The angle between two targets can also be computed from their velocity directions: θ1t denotes the angle between the motion directions of target 1 and target 2 at time t, and θ2t denotes the angle between the motion directions of target 1 and target 3 at time t.
Formula (1) establishes the trajectory feature pair of the original tracking target 1 and candidate target 2 at each moment, and formula (2) establishes the trajectory feature pair of the original tracking target 1 and candidate target 3 at each moment. Each trajectory feature pair contains the position and velocity of each target and the angle (determined by the velocity directions) between the targets (targets 1 and 2, or targets 1 and 3). The established feature pairs are input separately into a pre-trained neural network, such as a Long Short-Term Memory network (LSTM), which can output in real time the probability that each candidate target is the new target after the original target switches. As shown in fig. 8, the LSTM neural network can directly output, at each moment, the probability that each candidate target is the post-switch new target. For example, taking the analysis of candidate target 2, assume target 1 starts being tracked at t = 0 and the current time is t = 4: the feature pairs of targets 1 and 2 from t = 0 to t = 4 established by formula (1) (one group of feature pairs) are selected and input into the LSTM neural network of fig. 10(a), and softmax outputs the probability that switching occurs, taken as the probability value at time t = 4. When the current time becomes t = 5, the feature pairs of targets 1 and 2 from t = 1 to t = 5 established by formula (1) (another group of feature pairs) are selected and input into the LSTM network of fig. 10(a), and softmax outputs the probability of switching (the probability value at time t = 5). Repeating this process, a probability-over-time curve of candidate target 2 being the post-switch new target can be drawn (dashed line 2 in fig. 8). In fig. 8, Ft denotes the probability, at each moment, that a candidate target is the post-switch new target, and the horizontal axis (the time axis) is the current time. A probability threshold c is preset; at time t0 the probability corresponding to candidate target 2 exceeds the preset threshold, which indicates that the original target switches to candidate target 2 at time t0.
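The sliding-window evaluation described above could be organized as in the sketch below. PyTorch is assumed; the 7-dimensional feature vector follows formula (1), while the hidden size, window length, and class layout are illustrative assumptions rather than values from the disclosure.

```python
import torch
import torch.nn as nn

class SwitchClassifier(nn.Module):
    """Minimal LSTM classifier over trajectory-feature pairs."""
    def __init__(self, feat_dim=7, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)

    def forward(self, x):                 # x: (batch, time, feat_dim)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])        # logits for "switch" / "no switch"

def switch_probabilities(model, feature_pairs, window=4):
    """Slide a fixed-length window over the feature-pair sequence and record
    the switch probability at each current time t."""
    probs = {}
    model.eval()
    with torch.no_grad():
        for t in range(window, len(feature_pairs) + 1):
            x = torch.tensor(feature_pairs[t - window:t],
                             dtype=torch.float32).unsqueeze(0)
            p = torch.softmax(model(x), dim=-1)[0, 1].item()
            probs[t - 1] = p
    return probs   # switching is declared when probs[t] exceeds the preset threshold c
```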
After the trajectory data is obtained, besides analysis with a neural network, the trajectory data may be processed with a conventional machine-learning classification model such as a support vector machine, or judged directly against manually specified rules. Existing research mainly performs intelligent behavior analysis on a target using video data; the embodiment of the present application instead judges whether the target switches using the target's motion trajectory, which avoids heavy computation on video data and only needs to consider changes in the target's position. This saves computing power, increases computation speed, and keeps tracking real-time; and because a pre-trained neural network is used for the calculation, the accuracy of the result is ensured.
Step S44: and performing secondary judgment by using a computer vision method.
An optimal video segment is selected before the switching behavior is judged further. The time t0 at which a target switches can be obtained through the above embodiment, and a better viewing angle is selected at each moment in step S42; therefore, the optimal video may be the video captured at that optimal viewing angle around time t0. It should be noted that the selection is not limited to the optimal viewing angle chosen in step S42: several video segments containing the picture in which the original tracking target switches may be selected. The time span of the video is centered on t0; for example, a video of about 6 seconds may be used whose middle moment is t0. In order to reduce the amount of calculation and improve accuracy, a region of interest (ROI) in the video frames may first be extracted, and the extracted video data is then input into the pre-trained neural network. Illustratively, the region of interest may contain only the tracking target 16 and the bus 17. The process of the secondary judgment by computer vision is shown in fig. 12 and fig. 13 and includes the following steps:
step S121: the selected video frame data is input to a pre-trained Convolutional Neural network, which may be a Convolutional Neural Network (CNN), a three-dimensional Convolutional Neural network (3D CNN, 3D Convolutional Neural network), or the like. The convolutional neural network is mainly used for extracting features of an object in a video image and generating a bounding box (bbox) corresponding to the object. Pre-training is required before use, and the training set may be an image of a bounding box that has been manually labeled. The convolutional neural network mainly comprises an input layer, a convolutional layer, a pooling layer and a full-connection layer. The input layer is the input to the entire neural network, which in a convolutional neural network that processes images generally represents the pixel matrix of a picture; the input of each node in the convolutional layer is only a small block in the neural network of the previous layer, the size of the small block is 3 x 3, 5 x 5 or other sizes, and the convolutional layer is used for carrying out deeper analysis on each small block in the neural network so as to obtain characteristics with higher abstraction degree; the pooling layer can further reduce the number of nodes in the final full-connection layer, so that the aim of reducing parameters in the whole neural network is fulfilled; the full connection layer is mainly used for completing classification tasks.
Step S122: the convolutional neural network extracts features of objects in the video and generates a bounding box. Wherein the features of the target are represented by a series of numerical matrices; the position of the object determines the coordinates of the four corners of the bounding box, as shown in fig. 14, and both the tracked object 16 and the bus 17 generate a bbox, respectively, and generate a corresponding object classification.
Step S123: and establishing a graph model according to the characteristics of the target and the bounding box corresponding to the target. When the graph model is constructed, the objects (which can be understood as an ID in the field of object identification) identified in the video frames are used as nodes. Edges connecting between nodes are mainly classified into two types. The first class of edges has two components: 1) the first part shows the similarity of objects between the previous and following frames. The higher the similarity, the higher the value, the value range [0,1 ]. Obviously, the same person and the same vehicle in the continuous frames have higher similarity, different vehicles and people have lower similarity, and people and vehicles basically have no similarity; 2) the second part shows the coincidence degree between the bbox of the current frame target and the bbox of the next frame target, i.e. the size of the iou (interaction of units). If they completely coincide, the value is 1. The first part and the second part are combined and added according to preset weight to form the value of the first class edge. The second class of edges represents the distance of two objects in real space, the distance being approximately close, the larger the value of the edge. For example, the graph model constructed according to the above method is shown in fig. 15, where the first kind of edges can be understood as the lateral edges in the graph, and the second kind of edges can be understood as the longitudinal edges in the graph.
Step S124: and inputting the constructed graph model into a graph convolution neural network. A typical graph convolution neural network includes a graph convolution layer, a pooling layer, a fully connected layer, and an output layer. The graph convolution layer is similar to image convolution, performs information transfer in the graph, and can fully mine the characteristics of the graph; the pooling layer is used for reducing the dimension; the full connection layer is used for executing classification; the output layer outputs the result of the classification. Illustratively, the classification result may include getting-on behavior, getting-off behavior, people-vehicle-close behavior, people-short behavior, vehicle door opening and closing behavior, and the like. It should be noted that the output of the neural network is not only a behavior recognition (getting on or off) but also includes a behavior detection, for example, which can determine the characteristics of the vehicle on which the person is. The target after handover determined in step S44 is generally the same as the target after handover determined in step S43, and in this case, the target may be directly used as the target after handover. When the switched targets determined at steps S43 and S44 are different, the switched target determined at step S44 is taken as a new tracking target.
The above-described graph-convolution neural network also requires pre-training before use. Illustratively, a video of getting on the car of a person is manually acquired, a graph model of each target in the video is generated according to the steps S121-S123, and the graph model is marked as a getting on behavior and is used as a training set of a graph convolution neural network. Therefore, after the graph model generated by real-time video data is input, the graph convolution neural network can automatically output classification and judge whether the target-crossing behavior occurs to the tracked target.
It should be noted that step S44 is not essential, and a final new target after switching can be directly determined only by step S43, which is mainly determined by the preference or actual situation of the monitoring personnel. In addition, in step S44, the use of the convolutional neural network for determination is described, and it should be noted that other artificial intelligence classification models besides the convolutional neural network may also perform behavior recognition on the selected better video.
According to the method and the device, whether the original target is switched or not is judged for the second time by using a computer vision method, a special graph model is built, more time and space information is brought in, and the judgment precision is improved.
Step S45: and determining the target after switching.
Step S44 is optional, so there are two cases: (1) step S45 is executed directly after step S43, and the target determined in step S43 is taken as the post-switch target; (2) step S44 is executed after step S43, and in step S45 the post-switch target determined in step S44 is taken as the final post-switch target. In case (2) (i.e., when the post-switch targets determined in steps S43 and S44 differ), besides directly using the target determined in step S44, another strategy may be adopted: for example, one of the two targets may be selected as the final post-switch target by taking into account the accuracy of the models used in the two steps and the confidence of their output results.
The steps S41 to S45 mainly take a scene in which a person gets on the bus as an example, and switch the tracking target from the tracking target 16 to the bus 17. Then, the bus 17 can be tracked all the time, and the video picture of the passenger getting off the bus is obtained in each parking time period to find the tracking target 16. When the tracking target 16 appears in a certain frame, the tracking target is switched to the tracking target 16 again, so that the continuous tracking of the target is realized.
In summary, the method for automatically tracking a cross-target provided by the embodiment of the present application can be summarized as follows: firstly, an improved Kalman filtering method is adopted to fuse video and radar data under the same visual angle, the target motion track precision under a single visual angle is improved, then the same target motion track under a plurality of visual angles is fused, the shielding problem is effectively solved, and the continuous motion track of each target can be obtained. And then, inputting the motion tracks of the tracking target and other nearby targets into a pre-trained neural network, judging whether the original tracking target is switched, and if so, replacing the tracking target with a new target. In addition, in order to further improve the accuracy of judgment, the optimal video capable of capturing the picture of the behavior occurring switching can be selected for behavior analysis on the basis of the previous step. Inputting the optimal video into a neural network to extract features, then constructing a specific graph model, and carrying out secondary judgment on whether a target in the video has a target-crossing behavior or not by adopting a pre-trained graph convolution neural network. The method can be applied to the tracing of suspicious persons, and can also be applied to the help of finding children or old lost people, and the technical scheme provided by the invention can be applied to the tracing of the tracked objects regardless of the types of the tracked objects.
From the perspective of a tracking strategy, the automatic cross-target tracking method provided by the application can automatically switch the tracking target when the original target has a cross-target behavior, so that the real-time effectiveness of tracking can be ensured, and tracking interruption caused by the fact that the target takes a vehicle can be avoided; from the specific implementation mode of tracking, the method and the device have the advantages that the cross-target behavior analysis and judgment are carried out by utilizing the motion track of the target in the world coordinate system for the first time, and the accuracy of behavior judgment is improved. In addition, in order to improve the motion trail precision of the target, the invention also provides a series of measures, including: (1) fusing the video and the radar track under a single view angle by adopting an original improved Kalman filtering method; (2) and fusing the motion trail of the same target under a plurality of visual angles by adopting a Kalman filtering method. On the basis, in order to further improve the judgment precision, the method also constructs a special graph model according to the selected optimal video, and then adopts a graph convolution neural network to carry out secondary judgment on the cross-target behavior, so as to incorporate more space-time information. In a word, the automatic cross-target tracking method provided by the invention can realize real-time, continuous and accurate tracking of the target, does not omit information of any time period, and improves the convenience and reliability of control and tracking.
In addition to the above implementation, the present application provides another, modified embodiment of target tracking based on the above embodiment. The method comprises the following steps: (1) The motion track of the tracking target 16 (the original tracking target) is acquired. The track of the original tracking target may be obtained as described in steps S41 (optional) and S42 of the above embodiments. The track consists of the positions of the target over a period of time, which may be represented by coordinates in a global coordinate system such as an east-north-up coordinate system. (2) The initial moment at which the track disappears is determined as the first moment according to the motion track of the tracking target 16. That is, when the track of the tracking target 16 no longer appears, the initial moment at which the target disappears is determined as the first moment. (3) Video data from a period before and after the first moment is acquired; this video data includes the picture in which the tracking target 16 may be switched to a second target. Illustratively, the tracking target 16 is complete and clear in the frames of this video data. (4) The video data acquired in the previous step is analyzed and the post-switch target is determined. Illustratively, the relevant video data may be input into a pre-trained neural network for analysis. The pre-trained neural network may be the neural network adopted in step S44; the video content is analyzed by the neural network to determine the post-switch target.
This method mainly describes a scene in which a person (the tracking target 16) boards a vehicle (the post-switch tracking target): the disappearance of the person's trajectory is detected, the relevant nearby videos are then retrieved for analysis, and the post-switch target, such as a bus, is obtained and then tracked. When passengers get off, the tracking target 16 still needs to be searched for in the monitoring pictures along the bus, after which tracking continues. In this embodiment, only one target's track needs to be obtained at each moment throughout the process. Retrieving the relevant videos for behavior analysis directly based on the trajectory of the original tracking target reduces the computing power occupied to a certain extent while keeping the process real-time and accurate.
The method for tracking the target provided by the embodiment of the present application is described above with reference to fig. 3 to fig. 15, and the target tracking device provided by the embodiment of the present application will be described below. The device can comprise an acquisition module and a processing module;
the system comprises an acquisition module, a tracking module and a tracking module, wherein the acquisition module is used for acquiring sensing data of targets contained in a scene where a first target is located, the targets contained in the scene where the first target is located comprise the first target and at least one other target except the first target, and the first target is an initial tracking target;
the processing module is used for generating motion tracks of the first target and the at least one other target according to the sensing data; and the processing module is further used for determining a second target according to the motion trail of the first target and the motion trails of the other targets, and taking the second target as a switched tracking target.
Optionally, the processing module is specifically configured to determine a group of candidate targets, where the candidate targets are: the at least one other target, or other targets in the at least one other target whose distance from the first target is less than a preset threshold; for each candidate target, inputting the motion trail of the first target and the motion trail of the candidate target into a pre-trained first neural network, and obtaining the probability that the candidate target is the second target; determining the second target according to the probability that the set of candidate targets is the second target.
Optionally, the processing module is further configured to detect that a first probability of the probabilities that the at least one candidate target is the second target is higher than a preset threshold, and determine that a target corresponding to the first probability is the second target.
Optionally, the processing module is further configured to, for each candidate target, establish at least one group of track feature pairs according to the candidate target and the motion track of the first target, where each group of track feature pairs includes at least two track feature pairs at consecutive times, and the track feature pair at each time includes a position and a velocity of the first target at that time, a position and a velocity of the candidate target, and an included angle between the first target and the motion direction of the candidate target; inputting the at least one group of track feature pairs into a first neural network, and obtaining the probability that the candidate target is the second target.
Optionally, the processing module is further configured to detect that a first probability of the probabilities of the at least one candidate target is higher than a preset threshold, and determine that a target corresponding to the first probability is the second target.
Optionally, a time at which the first probability is higher than a preset threshold is set as a first time, and the processing module is further configured to acquire video data before and after the first time, where the video data includes a picture in which the first target may be switched; and inputting the video data into a pre-trained second neural network, and determining the third target as a switched tracking target according to an output result.
Optionally, the second neural network includes a convolutional neural network and a graph convolutional neural network, and the processing module is specifically configured to input the selected video frame data to the pre-trained convolutional neural network, and output the features and bounding boxes of all targets in the video frame data; constructing a graph model according to the characteristics and the bounding box; and inputting the graph model into a pre-trained graph convolution neural network, and determining a third target as a switched tracking target according to an output result.
Optionally, the sensors include at least two groups of sensors located at different orientations, the different groups of sensor modules are located at different orientations, and for each target in a scene where the first target is located, the processing module is specifically configured to generate at least two motion trajectories of the target according to the sensing data acquired by the at least two groups of sensor modules, respectively; and fusing at least two motion tracks of the target to form the motion track of the target.
Optionally, each group of sensors includes at least two types of sensors, that is, includes a camera and at least one of the following two types of sensors: millimeter wave radar and laser radar, and two kinds of sensors are in same position, and to every target that contains in the scene that first target is located, processing module specifically is used for: generating at least two monitoring tracks of a target according to the sensing data acquired by the at least two sensor modules respectively; and fusing at least two monitoring tracks of the target to form a motion track of the target.
The application also provides another target tracking device, which comprises an acquisition module and a processing module, wherein the acquisition module is used for acquiring the induction data of a first target through a sensor, and the first target is an initial tracking target; the processing module is used for generating a motion track of the first target according to the sensing data; determining the initial moment when the first target track disappears as a first moment according to the motion track of the first target; acquiring a video frame of a period of time before and after the first time, wherein the video frame comprises the first target; and determining the second target according to the video frame, and taking the second target as an updated tracking target.
Optionally, the processing module is further configured to determine that a motion trajectory of the first target after the initial time does not exist, and determine that the initial time is the first time.
Optionally, the processing module is further configured to input the video frame to a pre-trained second neural network to determine a second target, and use the second target as an updated tracking target.
Optionally, the second neural network includes a convolutional neural network and a graph convolutional neural network, and the processing module is further configured to: inputting the video frame into a pre-trained convolutional neural network, and outputting the characteristics of the target and a bounding box contained in the video data; constructing a graph model according to the characteristics of the target contained in the video frame and the bounding box; and inputting the graph model into a pre-trained graph convolution neural network, determining the second target according to an output result, and taking the second target as an updated tracking target.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Fig. 16 is a hardware structural diagram of a computing device for target tracking according to an embodiment of the present disclosure. As shown in fig. 16, the device 160 may include a processor 1601, a communication interface 1602, a memory 1603, and a system bus 1604. The memory 1603 and the communication interface 1602 are connected to the processor 1601 via a system bus 1604 and communicate with each other. The memory 1603 is used for storing computer executable instructions, the communication interface 1602 is used for communicating with other devices, and the processor 1601 is used for executing the computer instructions to realize the scheme shown in all the embodiments.
The system bus mentioned in fig. 16 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication interface is used for realizing communication between the database access device and other equipment (such as a client, a read-write library and a read-only library). The memory may comprise Random Access Memory (RAM) and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor may be a general-purpose processor, including a central processing unit CPU, a Network Processor (NP), and the like; but may also be a digital signal processor DSP, an application specific integrated circuit ASIC, a field programmable gate array FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc.
Optionally, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, the computer is caused to execute the method as shown in the foregoing method embodiment.
Optionally, an embodiment of the present application further provides a chip, where the chip is configured to perform the method shown in the foregoing method embodiment.
It is to be understood that the various numerical references referred to in the embodiments of the present application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of the present application.
It should be understood that, in the embodiment of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (28)

1. A method for target tracking, the method comprising:
acquiring, through a sensor, a motion trajectory of each target contained in a scene where a first target is located, wherein the targets contained in the scene where the first target is located comprise the first target and at least one other target other than the first target, and the first target is an initial tracking target;
and determining a second target according to the motion trajectory of the first target and the motion trajectory of the at least one other target, and taking the second target as an updated tracking target.
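As a non-limiting illustration of the data flow recited in claim 1 only, the following Python sketch selects a second target by a simple trajectory-proximity rule. The claimed embodiments determine the second target with a neural network (see claim 2); the function name, the dictionary representation of trajectories, and the mean-distance criterion are assumptions introduced purely for illustration.

import numpy as np

def pick_second_target(first_traj, other_trajs):
    """first_traj: (T, 2) positions of the initial tracking target.
    other_trajs: {target_id: (T, 2)} trajectories of the other targets in the same scene.
    Returns the id of the candidate second target (closest mean distance, as a toy rule)."""
    def mean_gap(traj):
        return float(np.linalg.norm(traj - first_traj, axis=1).mean())
    return min(other_trajs, key=lambda tid: mean_gap(other_trajs[tid]))

# usage with dummy trajectories from the same scene
T = 15
first = np.cumsum(np.random.rand(T, 2), axis=0)
others = {"a": first + np.random.rand(T, 2) * 0.2, "b": first + 5.0}
second_target = pick_second_target(first, others)   # -> "a", the closer candidate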
2. The method of claim 1, wherein determining a second target based on the motion trajectory of the first target and the motion trajectory of the at least one other target comprises:
determining a set of candidate targets, wherein the candidate targets are: the at least one other target, or targets, among the at least one other target, whose distance from the first target is less than a preset threshold;
for each candidate target, inputting the motion trajectory of the first target and the motion trajectory of the candidate target into a pre-trained first neural network, and obtaining a probability that the candidate target is the second target;
determining the second target according to the probabilities that the candidate targets in the set are the second target.
3. The method of claim 2, wherein determining the second target based on the probabilities that the candidate targets in the set are the second target comprises: detecting that a first probability, among the probabilities, is higher than a preset threshold, and determining that the target corresponding to the first probability is the second target.
4. The method of claim 2 or 3, wherein, for each of the candidate targets, the inputting the motion trajectory of the first target and the motion trajectory of the candidate target into the pre-trained first neural network and obtaining the probability that the candidate target is the second target comprises:
establishing at least one group of trajectory feature pairs according to the motion trajectory of the candidate target and the motion trajectory of the first target, wherein each group of trajectory feature pairs comprises at least two trajectory feature pairs at consecutive moments, and the trajectory feature pair at each moment comprises the position and the speed of the first target at the moment, the position and the speed of the candidate target at the moment, and an included angle between the motion directions of the first target and the candidate target;
inputting the at least one group of trajectory feature pairs into the first neural network, and obtaining the probability that the candidate target is the second target.
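As an illustrative sketch only: claim 4 fixes what each trajectory feature pair must contain (positions, speeds, and the included angle between motion directions) but not how the first neural network is built. The Python sketch below assumes 2-D positions sampled at consecutive moments and uses a small LSTM scorer as a stand-in for the pre-trained first neural network; the feature layout and the network architecture are assumptions, not the application's implementation.

import numpy as np
import torch
import torch.nn as nn

def feature_pairs(traj_a, traj_b):
    """traj_*: (T, 2) positions at consecutive moments.
    Returns (T-1, 9): position and velocity of both targets plus the included angle
    between their motion directions at each moment."""
    va = np.diff(traj_a, axis=0)               # per-step velocity of the first target
    vb = np.diff(traj_b, axis=0)               # per-step velocity of the candidate target
    cos = np.sum(va * vb, axis=1) / (np.linalg.norm(va, axis=1) * np.linalg.norm(vb, axis=1) + 1e-9)
    angle = np.arccos(np.clip(cos, -1.0, 1.0))  # included angle between motion directions
    return np.hstack([traj_a[1:], va, traj_b[1:], vb, angle[:, None]])

class FirstNet(nn.Module):
    """Toy stand-in for the 'pre-trained first neural network'."""
    def __init__(self, feat_dim=9, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, T, feat_dim)
        _, (h, _) = self.rnn(x)
        return torch.sigmoid(self.head(h[-1]))  # probability the candidate is the second target

# usage: probability that one candidate trajectory corresponds to the second target
net = FirstNet()
feats = torch.tensor(feature_pairs(np.random.rand(10, 2), np.random.rand(10, 2)), dtype=torch.float32)
prob = net(feats.unsqueeze(0)).item()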
5. The method according to claim 3 or 4, wherein a moment at which the first probability is higher than the preset threshold is determined to be a first moment, the method further comprising:
acquiring a video frame from a period of time before and after the first moment, wherein the video frame comprises the first target;
and inputting the video frame into a pre-trained second neural network, determining a third target according to an output result, and taking the third target as an updated tracking target.
6. The method of claim 5, wherein the second neural network comprises a convolutional neural network and a graph convolutional neural network, and wherein inputting the video frame to the pre-trained second neural network, determining the third target according to the output result, and using the third target as an updated tracking target comprises:
inputting the video frame into a pre-trained convolutional neural network, and outputting features and bounding boxes of targets contained in the video frame;
constructing a graph model according to the features and bounding boxes of the targets contained in the video frame;
and inputting the graph model into a pre-trained graph convolution neural network, determining the third target according to an output result, and taking the third target as an updated tracking target.
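The second neural network of claim 6 is only constrained to be a convolutional neural network followed by a graph convolutional neural network. The Python sketch below takes per-target features and bounding boxes as given (the convolutional stage is assumed to have run already), connects spatially close targets into a graph, and applies one hand-rolled graph-convolution step; the adjacency rule, the distance threshold, and the per-node scoring are illustrative assumptions rather than the application's method.

import numpy as np

def build_graph(features, boxes, dist_thresh=100.0):
    """features: (N, D) per-target features; boxes: (N, 4) as [x1, y1, x2, y2].
    Returns a row-normalised adjacency matrix linking spatially close targets."""
    centres = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    dist = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1)
    adj = (dist < dist_thresh).astype(float)     # connect spatially close targets (self-loops included)
    adj /= adj.sum(axis=1, keepdims=True)        # simple GCN-style propagation weights
    return adj

def gcn_scores(adj, feats, weight):
    """One graph-convolution layer, ReLU(A X W), followed by a toy per-target score."""
    hidden = np.maximum(adj @ feats @ weight, 0.0)
    return hidden.mean(axis=1)

# usage with dummy detections: pick the highest-scoring node as the third target
feats, boxes = np.random.rand(5, 16), np.random.rand(5, 4) * 200
adj = build_graph(feats, boxes)
scores = gcn_scores(adj, feats, np.random.rand(16, 8))
third_target = int(np.argmax(scores))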
7. The method according to any one of claims 1 to 6, wherein the sensors comprise at least two groups of sensors at different orientations, and, for each of the targets contained in the scene where the first target is located, the acquiring, through the sensor, the motion trajectory of the target comprises:
for each group of sensors in the at least two groups of sensors, generating a motion trajectory of the target corresponding to the group of sensors according to sensing data acquired by the group of sensors, thereby obtaining at least two motion trajectories of the target, wherein the at least two motion trajectories of the target are captured from different orientations;
and fusing the at least two motion trajectories of the target to obtain a fused motion trajectory of the target.
8. The method of claim 7, wherein each group of sensors comprises at least two types of sensors, the at least two types of sensors comprising a camera and at least one of a millimeter-wave radar and a laser radar, and the at least two types of sensors are located at the same orientation; and, for each target among the targets contained in the scene where the first target is located, the generating, for each group of sensors in the at least two groups of sensors, the motion trajectory of the target corresponding to the group of sensors according to the sensing data acquired by the group of sensors comprises:
for each type of sensor in the group of sensors, generating a monitoring trajectory of the target corresponding to the type of sensor according to the sensing data acquired by the type of sensor, so as to obtain at least two monitoring trajectories of the target;
and fusing the at least two monitoring trajectories of the target to obtain the motion trajectory of the target.
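Claims 7 and 8 describe two levels of fusion (across sensor types within one group, then across groups at different orientations) without prescribing a fusion rule. A minimal Python sketch follows, assuming time-aligned 2-D trajectories and a confidence-weighted average as the fusion operator; this operator and the example weights are assumptions, and a Kalman-style fusion would be equally compatible with the claims.

import numpy as np

def fuse_trajectories(trajs, weights=None):
    """trajs: list of (T, 2) time-aligned trajectories from different sensors or orientations.
    Returns one (T, 2) fused trajectory (confidence-weighted average per time step)."""
    stack = np.stack(trajs)                        # (num_sources, T, 2)
    w = np.ones(len(trajs)) if weights is None else np.asarray(weights, dtype=float)
    w /= w.sum()
    return np.tensordot(w, stack, axes=1)          # weighted average over sources

# dummy time-aligned tracks of the same target from three sensor types at one orientation
T = 20
cam, radar, lidar = (np.cumsum(np.random.rand(T, 2), axis=0) for _ in range(3))

# claim 8: fuse the monitoring trajectories within one sensor group
group_a = fuse_trajectories([cam, radar, lidar], weights=[0.5, 0.3, 0.2])
group_b = fuse_trajectories([cam + 0.1, radar])    # a second group at another orientation

# claim 7: fuse the per-group motion trajectories captured from different orientations
fused = fuse_trajectories([group_a, group_b])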
9. A method for target tracking, the method comprising:
acquiring, through a sensor, a motion trajectory of a first target, wherein the first target is an initial tracking target;
determining, according to the motion trajectory of the first target, an initial moment at which the trajectory of the first target disappears as a first moment;
acquiring a video frame from a period of time before and after the first moment, wherein the video frame comprises the first target;
and determining a second target according to the video frame, and taking the second target as an updated tracking target.
10. The method of claim 9, wherein the determining, according to the motion trajectory of the first target, the initial moment at which the trajectory of the first target disappears as the first moment comprises:
determining that no motion trajectory of the first target exists after the initial moment, and determining the initial moment to be the first moment.
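Claim 10 turns trajectory disappearance into a first moment. A minimal Python sketch follows, assuming the trajectory is stored as a mapping from discrete time stamps to positions (a representation the claims do not specify): the first moment is the last observed time stamp when no later point exists within the observation horizon.

def first_disappearance_moment(trajectory, horizon):
    """trajectory: {t: (x, y)} with integer time stamps; horizon: last time stamp observed in the scene.
    Returns the first moment, or None if the trajectory never disappears."""
    observed = sorted(trajectory)
    if not observed:
        return None
    last = observed[-1]
    # no motion trajectory of the target exists after `last`, so `last` is the first moment
    return last if last < horizon else None

# usage: the track ends at t=4 even though the scene was observed up to t=9
traj = {0: (1.0, 2.0), 1: (1.5, 2.2), 2: (2.1, 2.4), 3: (2.8, 2.7), 4: (3.3, 3.0)}
print(first_disappearance_moment(traj, horizon=9))   # -> 4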
11. The method according to claim 9 or 10, wherein the determining a second target from the video frame and using the second target as an updated tracking target comprises:
and inputting the video frame into a pre-trained second neural network, determining the second target according to an output result, and taking the second target as an updated tracking target.
12. The method of claim 11, wherein the second neural network comprises a convolutional neural network and a graph convolutional neural network, and wherein inputting the video frame to the pre-trained second neural network, determining the second target according to the output result, and using the second target as an updated tracking target comprises:
inputting the video frame into a pre-trained convolutional neural network, and outputting features and bounding boxes of targets contained in the video frame;
constructing a graph model according to the features and bounding boxes of the targets contained in the video frame;
and inputting the graph model into a pre-trained graph convolution neural network, determining the second target according to an output result, and taking the second target as an updated tracking target.
13. An apparatus for target tracking, comprising an acquisition module and a processing module, wherein:
the acquisition module is configured to acquire sensing data of targets contained in a scene where a first target is located, wherein the targets contained in the scene where the first target is located comprise the first target and at least one other target other than the first target, and the first target is an initial tracking target;
the processing module is configured to generate a motion trajectory of the first target and the at least one other target according to the sensing data, determine a second target according to the motion trajectory of the first target and the motion trajectory of the at least one other target, and use the second target as an updated tracking target.
14. The apparatus of claim 13, wherein the processing module is specifically configured to,
determining a set of candidate targets, wherein the candidate targets are: the at least one other target, or targets, among the at least one other target, whose distance from the first target is less than a preset threshold;
for each candidate target, inputting the motion trajectory of the first target and the motion trajectory of the candidate target into a pre-trained first neural network, and obtaining a probability that the candidate target is the second target;
determining the second target according to the probabilities that the candidate targets in the set are the second target.
15. The apparatus of claim 14, wherein the processing module is further configured to detect that a first probability, among the probabilities that the candidate targets in the set are the second target, is higher than a preset threshold, and determine that the target corresponding to the first probability is the second target.
16. The apparatus of claim 14 or 15, wherein the processing module is further configured to:
for each candidate target, establishing at least one group of trajectory feature pairs according to the motion trajectories of the candidate target and the first target, wherein each group of trajectory feature pairs comprises at least two trajectory feature pairs at consecutive moments, and the trajectory feature pair at each moment comprises the position and the speed of the first target at the moment, the position and the speed of the candidate target at the moment, and an included angle between the motion directions of the first target and the candidate target;
inputting the at least one set of trajectory feature pairs into the first neural network, and obtaining a probability that the candidate target is the second target.
17. The apparatus according to claim 15 or 16, wherein a moment at which the first probability is higher than the preset threshold is a first moment, and the processing module is further configured to,
acquiring a video frame from a period of time before and after the first moment, wherein the video frame comprises the first target;
and inputting the video frame into a pre-trained second neural network, determining a third target according to an output result, and taking the third target as an updated tracking target.
18. The apparatus of claim 17, wherein the second neural network comprises a convolutional neural network and a graph convolutional neural network, wherein the processing module is specifically configured to,
inputting the video frame into a pre-trained convolutional neural network, and outputting features and bounding boxes of targets contained in the video frame;
constructing a graph model according to the features and bounding boxes of the targets contained in the video frame;
and inputting the graph model into a pre-trained graph convolution neural network, determining the third target according to an output result, and taking the third target as an updated tracking target.
19. The apparatus according to any one of claims 13 to 18, wherein the sensors comprise at least two groups of sensors at different orientations, and the processing module is specifically configured to, for each of the targets contained in the scene where the first target is located,
for each group of sensors in the at least two groups of sensors, generating a motion trajectory of the target corresponding to the group of sensors according to sensing data acquired by the group of sensors, thereby obtaining at least two motion trajectories of the target, wherein the at least two motion trajectories of the target are captured from different orientations;
and fusing the at least two motion trajectories of the target to obtain a fused motion trajectory of the target.
20. The apparatus of claim 19, wherein each group of sensors comprises at least two types of sensors, the at least two types of sensors comprising a camera and at least one of a millimeter-wave radar and a laser radar, and the at least two types of sensors are located at the same orientation, and the processing module is specifically configured to, for each target among the targets contained in the scene where the first target is located,
for each type of sensor in the group of sensors, generating a monitoring trajectory of the target corresponding to the type of sensor according to the sensing data acquired by the type of sensor, so as to obtain at least two monitoring trajectories of the target;
and fusing the at least two monitoring trajectories of the target to obtain the motion trajectory of the target.
21. An apparatus for target tracking, the apparatus comprising an acquisition module and a processing module,
the acquisition module is used for acquiring sensing data of a first target through a sensor, wherein the first target is an initial tracking target;
the processing module is configured to generate a motion trajectory of the first target according to the sensing data; determine, according to the motion trajectory of the first target, an initial moment at which the trajectory of the first target disappears as a first moment; acquire a video frame from a period of time before and after the first moment, wherein the video frame comprises the first target; and determine a second target according to the video frame, and use the second target as an updated tracking target.
22. The apparatus of claim 21,
the processing module is further configured to determine that no motion trajectory of the first target exists after the initial moment, and determine the initial moment to be the first moment.
23. The apparatus of claim 21 or 22, wherein the processing module is further configured to,
and inputting the video frame into a pre-trained second neural network, determining the second target according to an output result, and taking the second target as an updated tracking target.
24. The apparatus of claim 23, wherein the second neural network comprises a convolutional neural network and a graph convolutional neural network, and wherein the processing module is further configured to:
inputting the video frame into a pre-trained convolutional neural network, and outputting features and bounding boxes of targets contained in the video frame;
constructing a graph model according to the features and bounding boxes of the targets contained in the video frame;
and inputting the graph model into a pre-trained graph convolution neural network, determining the second target according to an output result, and taking the second target as an updated tracking target.
25. A computing device for target tracking, the computing device comprising a processor and a memory, wherein:
the memory having stored therein computer instructions;
the processor executes the computer instructions stored by the memory to implement the method of any of claims 1-8.
26. A computing device for target tracking, the computing device comprising a processor and a memory, wherein: the memory having stored therein computer instructions;
the processor executes the computer instructions stored by the memory to implement the method of any of claims 9-12.
27. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer program code which, when executed by a computer, causes the computer to perform the method according to any of claims 1-8.
28. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer program code which, when executed by a computer, causes the computer to perform the method according to any of claims 9-12.
CN202010427448.7A 2020-02-28 2020-05-19 Method, equipment and system for target tracking Pending CN113326719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/077845 WO2021170030A1 (en) 2020-02-28 2021-02-25 Method, device, and system for target tracking

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010129282 2020-02-28
CN2020101292820 2020-02-28

Publications (1)

Publication Number Publication Date
CN113326719A true CN113326719A (en) 2021-08-31

Family

ID=77413020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010427448.7A Pending CN113326719A (en) 2020-02-28 2020-05-19 Method, equipment and system for target tracking

Country Status (2)

Country Link
CN (1) CN113326719A (en)
WO (1) WO2021170030A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113777994A (en) * 2021-09-17 2021-12-10 延边国泰新能源汽车有限公司 Intelligent monitoring service system and method for bus in alpine region
CN114463385A (en) * 2022-01-12 2022-05-10 平安科技(深圳)有限公司 Target tracking method, device, equipment and medium based on gun-ball linkage system
CN114648870B (en) * 2022-02-11 2023-07-28 行云新能科技(深圳)有限公司 Edge computing system, edge computing decision prediction method, and computer-readable storage medium
CN117197182B (en) * 2023-11-07 2024-02-27 华诺星空技术股份有限公司 Lei Shibiao method, apparatus and storage medium
CN117237418B (en) * 2023-11-15 2024-01-23 成都航空职业技术学院 Moving object detection method and system based on deep learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156982B (en) * 2014-07-31 2017-06-13 华为技术有限公司 Motion target tracking method and device
US9373036B1 (en) * 2015-01-16 2016-06-21 Toyota Motor Engineering & Manufacturing North America, Inc. Collaborative distance metric learning for method and apparatus visual tracking
CN108986151B (en) * 2017-05-31 2021-12-03 华为技术有限公司 Multi-target tracking processing method and equipment
CN110400347B (en) * 2019-06-25 2022-10-28 哈尔滨工程大学 Target tracking method for judging occlusion and target relocation

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628251A (en) * 2021-10-11 2021-11-09 北京中科金马科技股份有限公司 Smart hotel terminal monitoring method
CN113628251B (en) * 2021-10-11 2022-02-01 北京中科金马科技股份有限公司 Smart hotel terminal monitoring method
CN114049767A (en) * 2021-11-10 2022-02-15 刘鹏 Edge calculation method and device and readable storage medium
CN114049767B (en) * 2021-11-10 2023-05-12 刘鹏 Edge computing method and device and readable storage medium
CN114333330A (en) * 2022-01-27 2022-04-12 浙江嘉兴数字城市实验室有限公司 Intersection event detection system and method based on roadside edge holographic sensing
CN114399537A (en) * 2022-03-23 2022-04-26 东莞先知大数据有限公司 Vehicle tracking method and system for target personnel

Also Published As

Publication number Publication date
WO2021170030A1 (en) 2021-09-02

Similar Documents

Publication Publication Date Title
WO2021170030A1 (en) Method, device, and system for target tracking
Lee et al. On-road pedestrian tracking across multiple driving recorders
KR101995107B1 (en) Method and system for artificial intelligence based video surveillance using deep learning
KR102197946B1 (en) object recognition and counting method using deep learning artificial intelligence technology
Lin et al. A real-time vehicle counting, speed estimation, and classification system based on virtual detection zone and YOLO
CN104378582A (en) Intelligent video analysis system and method based on PTZ video camera cruising
Kumar et al. Study of robust and intelligent surveillance in visible and multi-modal framework
Huang et al. Automatic moving object extraction through a real-world variable-bandwidth network for traffic monitoring systems
Bloisi et al. Argos—A video surveillance system for boat traffic monitoring in Venice
CN112434566B (en) Passenger flow statistics method and device, electronic equipment and storage medium
JP6789876B2 (en) Devices, programs and methods for tracking objects using pixel change processed images
Gomaa et al. Real-time algorithm for simultaneous vehicle detection and tracking in aerial view videos
Zheng et al. Detection, localization, and tracking of multiple MAVs with panoramic stereo camera networks
Liu et al. Temporal shift and spatial attention-based two-stream network for traffic risk assessment
Huang et al. Railway intrusion detection based on refined spatial and temporal features for UAV surveillance scene
Mannion Vulnerable road user detection: state-of-the-art and open challenges
Dinh et al. Development of a tracking-based system for automated traffic data collection for roundabouts
Liu et al. Multi-view vehicle detection and tracking in crossroads
CN116311166A (en) Traffic obstacle recognition method and device and electronic equipment
Mehboob et al. Automated vehicle density estimation from raw surveillance videos
Wickramasinghe et al. Pedestrian Detection, Tracking, Counting, Waiting Time Calculation and Trajectory Detection for Pedestrian Crossings Traffic light systems
Shinde et al. Traffic optimization algorithms in optical networks for real time traffic analysis
Zhang et al. Video Surveillance Using a Multi-Camera Tracking and Fusion System.
Tian et al. Pedestrian multi-target tracking based on YOLOv3
Ata et al. A fine tuned tracking of vehicles under different video degradations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination