Multi-view image fusion space target tracking system and method
Technical Field
The invention relates to the field of target tracking, in particular to a multi-view image fused space target tracking system and method.
Background
With people's growing requirements for security, particularly in the field of target tracking, there is a desire to monitor a target's actions through video streams. Deep learning performs highly efficiently at image feature extraction, and with the continuous improvement of GPU graphics computing capability, frameworks for image classification and target detection based on deep learning have developed further and are rapidly replacing the related traditional methods.
However, current target tracking technology is mainly based on the traditional image retrieval mode: the similarity distance between images is calculated directly, and that distance is used to judge whether the images belong to the same object. If a target is occluded, image matching is likely to fail, so the performance of this mode is poor.
Frameworks that use a convolutional neural network to extract effective, well-preserved target features from the image feature information and then calculate the similarity distance between images greatly improve tracking performance, but they require the target frames in the video to be annotated in advance, so they fall short of practical use.
Disclosure of Invention
The invention provides a space target tracking method based on multi-view image fusion, aimed at the above technical problems in the prior art, to solve the problem of poor space target tracking performance in the prior art.
The technical scheme for solving the above technical problems is as follows: a multi-view image fused space target tracking system comprises a camera shooting unit, a target frame extraction module, a spatial feature extraction module and a tracker module;
the camera shooting unit comprises at least two cameras symmetrically arranged around the scene in which space target tracking is performed; the cameras shoot the scene in an obliquely downward posture to obtain multi-shot same-frame images, namely at least two frame images of the same scene shot by the at least two cameras at the same moment;
the target frame extraction module marks the targets detected in the multi-shot same-frame images in the form of target frames;
the spatial feature extraction module extracts the feature information in each target frame and clusters the feature information to obtain clustering features;
and the tracker module tracks the target according to the clustering features.
A tracking method of the multi-view image fused space target tracking system comprises the following steps:
step 1, symmetrically arranging at least two cameras around the scene in which space target tracking is performed, the cameras shooting the scene in an obliquely downward posture to obtain multi-shot same-frame images, namely at least two frame images of the same scene shot by the at least two cameras at the same moment;
step 2, labeling the targets detected in the multi-shot same-frame images in the form of target frames;
step 3, extracting the feature information in each target frame and clustering the feature information to obtain clustering features;
and step 4, tracking the target according to the clustering features.
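As a minimal illustration, the four steps above can be sketched as a pipeline. The functions `detect_boxes`, `extract_features`, `cluster_centers` and `track` are hypothetical stand-ins for the YoloV3 detector, CNN feature extractor, spatial clustering and tracker modules described in the embodiments, with dummy data in place of real camera frames:

```python
# Sketch of the four-step pipeline; all components are hypothetical
# stand-ins for the YoloV3 / CNN / clustering / tracker modules.

def detect_boxes(frame):
    # Step 2: return target frames (boxes) found in one camera frame.
    return [(10, 10, 50, 50)]          # dummy box for illustration

def extract_features(frame, boxes):
    # Step 3a: one feature vector per target frame.
    return [[float(sum(b))] for b in boxes]

def cluster_centers(features):
    # Step 3b: here a single cluster whose center is the feature mean.
    n = len(features)
    dim = len(features[0])
    return [[sum(f[d] for f in features) / n for d in range(dim)]]

def track(centers, state):
    # Step 4: associate each cluster center with a tracked ID.
    for c in centers:
        state.setdefault(len(state), c)
    return state

def run_pipeline(same_frame_images):
    # Step 1 supplies same-frame images from the symmetric cameras.
    feats = []
    for frame in same_frame_images:
        boxes = detect_boxes(frame)
        feats.extend(extract_features(frame, boxes))
    centers = cluster_centers(feats)
    return track(centers, {})

tracks = run_pipeline(["cam0_frame", "cam1_frame"])
print(tracks)  # → {0: [120.0]}
```

The same target seen from two cameras yields one cluster center and therefore one tracked ID, which is the intended behavior of the fusion.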
The invention has the beneficial effects that: the invention provides a multi-view image fused space target tracking system and method. In the prior art, once occlusion occurs between targets during space target tracking, identification confusion easily arises when only the similarity distance between image features is calculated. For special scenes, such as rectangular spaces like sports fields and other places with high requirements on target tracking accuracy, at least two cameras are symmetrically arranged around the scene and shoot it in an obliquely downward posture, which solves the identification errors caused by occlusion. The spatial feature extraction module combines the features extracted from all viewing angles, extracts top-view spatial features according to the specific target frames, and clusters them; because a target's position in space is unique, scenes of confused target tracking are greatly reduced, and the system is suitable for target tracking in a single scene with high accuracy requirements.
On the basis of the technical scheme, the invention can be further improved as follows.
Furthermore, the scene is a rectangular space, the number of cameras is four, the cameras are arranged at the four corners of the rectangular scene, and all cameras face the center of the scene simultaneously.
Further, the target frame extraction module predicts the target frame based on the YoloV3 detection network.
Further, the spatial feature extraction module comprises a convolutional neural network module and a spatial clustering module;
the convolutional neural network module extracts the feature information of each target frame, and the feature information is spliced into a spatial information feature matrix;
the spatial clustering module clusters the spatial information feature matrix to obtain the cluster center of each category, and outputs the cluster center matrix formed by these centers to the tracker module.
Further, the tracker module includes a tracker class corresponding to each cluster center;
when the continuous occurrence time of the target exceeds a set target occurrence time threshold, initializing the corresponding tracker class;
and when the target loss time exceeds a set target loss time threshold, discarding the corresponding tracker class.
Further, the tracker class includes a tracking target class attribute, a live index attribute, a true live index attribute, and a global live index attribute;
the tracking target class attribute is used for storing the tracked targets;
the live index attribute is used for recording the indexes of the tracked targets;
the true live index attribute is used for recording the tracking target classes that are not affected by the time threshold and have not been removed;
the global live index attribute is used for recording the indexes of the live tracking target classes.
Further, the initialization process of the tracker class includes:
initializing a global category serial number, the size of which is related to the number of categories in the current cluster center matrix; and indexing the categories in the multi-shot same-frame images in an index-tracking manner.
Further, the maintenance process of the tracker class in the tracking process includes:
calculating the similarity distance between each cluster center in the current cluster center matrix and each cluster center in the previous cluster center matrix, and judging, according to an optimal assignment principle, whether each category in the current cluster center matrix already exists; if so, the category is matched with the previous category and tracked; otherwise, the non-existent category is added to the tracking target class for storage.
The beneficial effects of adopting the further schemes are as follows: the method integrates the YoloV3 detection network, which currently performs well among deep-learning-based detectors, to fully automatically detect the targets in a video frame and label them in the form of target frames; the target frames detected by the detection network are combined with the global image information extraction network to obtain a feature map containing only the target frames, removing most of the useless feature information. A time threshold is set so that targets that have not appeared for a period of time are removed from the tracking target class; this also keeps the number of target tracking classes in the cache within the memory safety range and avoids cache overflow. Real-time target tracking is achieved based on deep learning: combining the highly efficient target detection of YoloV3, the strong capability of the image spatial feature extraction network, and the judgment rules for identifying the same object, real-time target tracking without manual labeling is realized with multiple cameras in the same scene, and the effect meets the requirement of practicability.
Drawings
Fig. 1 is a block diagram of a multi-view image fusion space target tracking system according to the present invention;
fig. 2 is a diagram illustrating the detection effect of YoloV3 target in a multi-shot frame image according to an embodiment of the present invention;
fig. 3 is a flowchart of spatial feature extraction according to an embodiment of the present invention;
fig. 4 is a flowchart of a multi-view image fused space target tracking method according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a block diagram of the multi-view image fused space target tracking system according to the present invention. As shown in fig. 1, the system comprises a camera shooting unit, a target frame extraction module, a spatial feature extraction module and a tracker module.
The camera shooting unit comprises at least two cameras symmetrically arranged around the scene in which space target tracking is performed; the cameras shoot the scene in an obliquely downward posture to obtain multi-shot same-frame images, namely at least two frame images of the same scene shot by the at least two cameras at the same moment. Because the cameras shoot in an obliquely downward posture, the images they capture are taken from a top-view angle.
The target frame extraction module marks the targets detected in each of the multi-shot same-frame images in the form of target frames. The target frames detected by the target detection network are combined with the global image information extraction network to obtain a feature map containing only the target frames, removing most of the useless feature information.
And the spatial feature extraction module extracts feature information in each target frame and clusters each feature information to obtain each clustering feature.
And the tracker module tracks the target according to the clustering characteristics.
The invention provides a multi-view image fused space target tracking system. In the prior art, once occlusion occurs between targets during space target tracking, identification confusion easily arises when only the similarity distance between image features is calculated. For special scenes, such as rectangular spaces like sports fields and other places with high requirements on target tracking accuracy, at least two cameras are symmetrically arranged around the scene and shoot it in an obliquely downward posture, which solves the identification errors caused by occlusion. The spatial feature extraction module combines the features extracted from all viewing angles, extracts top-view spatial features according to the specific target frames, and clusters them; because a target's position in space is unique, scenes of confused target tracking are greatly reduced, and the system is suitable for target tracking in a single scene with high accuracy requirements.
Example 1
Embodiment 1 of the present invention is an embodiment of the multi-view image fused space target tracking system. In this embodiment, the target in the video stream is tracked in real time, that is, each frame of image is processed, to realize a real-time target tracking framework based on deep learning. First, YoloV3 detects the targets in the same-frame images under the four cameras; at the same time, the four image frames are input into a Resnet18 network and the image features are extracted using transfer learning. The YoloV3 detection boxes and the image feature extraction network are then combined into a spatial feature network, and the different spatial features of the multiple targets under the four cameras are clustered to obtain top-view features. By exploiting the uniqueness of a target's position in the top-view space, the method greatly remedies the previous defect of confused target tracking caused by occlusion. Finally, the cluster matrix is passed to the tracker module to track the targets.
Specifically, the embodiment of the system comprises: the system comprises a camera shooting unit, a target frame extraction module, a spatial feature extraction module and a tracker module.
The camera shooting unit comprises at least two cameras symmetrically arranged around the scene in which space target tracking is performed; the cameras shoot the scene in an obliquely downward posture to obtain multi-shot same-frame images, namely at least two frame images of the same scene shot by the at least two cameras at the same moment. Because the cameras shoot in an obliquely downward posture, the images they capture are taken from a top-view angle.
At present, both traditional target tracking algorithms and deep-learning-based ones calculate the similarity distance between images. The difference is that the traditional methods calculate the similarity distance directly from pixel values, while deep learning first extracts the effective features of the images and then calculates the similarity distance from those features, which makes it faster and more precise than the traditional methods. However, when occlusion occurs between targets, identification confusion easily arises if only the similarity distance between image features is calculated. Partial occlusion between targets can be mitigated by continuously adjusting the camera poses or by increasing the number of cameras, but the following problems arise: adjusting the cameras increases labor cost, the pose cannot be adjusted every time occlusion occurs, and this approach is not applicable to most scenes; when the number of cameras is increased, only some cameras observe unoccluded pictures, and for the cameras whose pictures are occluded, the identification errors caused by occlusion remain unsolved. These problems can be well solved by the top-view spatial feature extraction network, which mainly exploits the uniqueness of the target object in the top-view two-dimensional plane to extract features. Since top-view spatial features cannot be extracted from a single camera image, the number of cameras needs to be increased.
Preferably, the test scene of the invention is a rectangular space; four cameras are distributed at its four corners and simultaneously face the center of the floor of the rectangular space in an inclined posture. In this way, when the features extracted from the same-frame images of the four cameras are clustered, they are converted into top-view features, and the uniqueness of the target's position in space can be used to improve the accuracy of target re-identification.
The target frame extraction module marks targets detected from each of the multiple shot frame images in the form of target frames.
Preferably, the target frame extraction module predicts the target frames based on the YoloV3 detection network. Fig. 2 is a diagram illustrating the YoloV3 target detection effect in a multi-shot same-frame image according to an embodiment of the present invention.
Aiming at the problem that manually labeling target frames in the early stage of current target tracking frameworks is time-consuming and labor-intensive, the invention integrates the YoloV3 detection network, which currently performs well in deep learning, to fully automatically detect the targets in a video frame and label them in the form of target frames; the output multi-object target frames and the image information extraction network are then combined into the subsequent spatial information feature extraction network.
YoloV3 is a target detection network with high precision and good detection speed. It takes Darknet-53 as its backbone, takes 256x256 pictures as input, stacks large numbers of 1x1 and 3x3 convolutional layers, and uses a residual network to pass shallow information to deep layers, increasing network depth without causing problems such as gradient explosion. For detection, YoloV3 adopts a multi-scale strategy: three feature maps of different sizes, namely 32x32, 16x16 and 8x8, are used for detection output. Each point of a detection feature layer is mapped back to the image, and each point carries 3 prediction boxes, so the three feature layers yield 4032 prediction boxes in total, a number that amply meets the requirement of detecting multiple classes of objects. Finally, logistic regression gives each prediction box an objectness score, and the target boxes that meet the requirement are selected according to this score and predicted.
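The figure of 4032 prediction boxes follows directly from the three detection scales; assuming the 256x256 input and the three feature map sizes described above, a quick check:

```python
# YoloV3 prediction-box count for a 256x256 input: three detection
# feature maps (32x32, 16x16, 8x8), each grid point carrying 3 boxes.
feature_map_sizes = [32, 16, 8]
boxes_per_point = 3

total = sum(s * s for s in feature_map_sizes) * boxes_per_point
print(total)  # → 4032
```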
And the spatial feature extraction module extracts feature information in each target frame and clusters each feature information to obtain each clustering feature.
Preferably, as shown in fig. 3, a flowchart of spatial feature extraction provided in the embodiment of the present invention is provided, and the spatial feature extraction module includes a convolutional neural network module and a spatial clustering module.
The feature information of the multi-shot same-frame images is extracted by an image feature extraction network pre-trained on the ImageNet database; this network can be a classical convolutional neural network. The convolutional neural network module extracts the feature information of each target frame, and the feature information is spliced into a large spatial information feature matrix.
The spatial clustering module clusters the spatial information feature matrix to obtain the cluster center of each category, and outputs the cluster center matrix formed by these centers to the tracker module. Since a cluster center is equivalent to the mean of the features of one category, the cluster centers obtained from the images of the symmetrically distributed cameras are the top-view features of those images.
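As an illustrative sketch of this clustering step, a toy k-means in plain Python stands in for the spatial clustering module; the feature matrix, its dimensions, and the deterministic initialization are assumptions for illustration, not the module's actual implementation:

```python
def kmeans(features, k, iters=20):
    """Toy k-means: cluster the rows of a feature matrix, return centers."""
    # deterministic, evenly spread initial centers (illustrative choice)
    centers = [features[i * len(features) // k] for i in range(k)]
    for _ in range(iters):
        # assign each feature row to its nearest center
        groups = [[] for _ in range(k)]
        for f in features:
            d = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centers]
            groups[d.index(min(d))].append(f)
        # a cluster center is the mean of the features of one category
        for i, g in enumerate(groups):
            if g:
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return centers

# Toy "spatial information feature matrix": two targets, each seen from
# two symmetric views; nearby rows belong to the same target.
matrix = [[0.0, 0.1], [0.1, 0.0],    # target A from view 1 / view 2
          [5.0, 5.1], [5.1, 5.0]]    # target B from view 1 / view 2
centers = sorted(kmeans(matrix, k=2))
print(centers)  # two cluster centers, one per target
```

Each center is the mean of one target's per-view features, matching the statement above that the cluster center plays the role of the target's fused top-view feature.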
And the tracker module tracks the target according to the clustering characteristics. The tracker module mainly combines a clustering center extracted by a spatial characteristic information network with a dynamic increase and decrease tracking target to realize a relatively accurate target tracking effect.
Preferably, the tracker module comprises respective tracker classes corresponding to respective cluster centers.
When a target's continuous appearance time does not exceed the set target appearance time threshold, the target does not need to be tracked; this avoids tracking targets that appear only briefly and saves storage space. When the continuous appearance time exceeds the set target appearance time threshold, the target needs to be tracked, so its cluster center is stored in a tracking target class and a corresponding tracker class is initialized.
And when the target loss time exceeds a set target loss time threshold, discarding the corresponding tracker class.
Setting the time thresholds of the tracker class is very important in its initialization, as it relates to the process of dynamically adding and removing targets. The target loss time threshold is set mainly to remove, from the tracking target class, targets that have not appeared for a period of time. This also keeps the number of target tracking classes in the cache within the memory safety range, avoiding cache overflow.
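The threshold-driven lifecycle described above can be sketched as a small bookkeeping class; measuring time in frames and the concrete threshold values are illustrative assumptions, not the patent's actual parameters:

```python
class TrackerPool:
    """Sketch of the tracker-class lifecycle driven by two time thresholds."""

    def __init__(self, appear_thresh=3, loss_thresh=5):
        self.appear_thresh = appear_thresh  # frames before a tracker is created
        self.loss_thresh = loss_thresh      # frames of absence before removal
        self.pending = {}    # target id -> consecutive appearance count
        self.trackers = {}   # target id -> frames since last seen

    def update(self, seen_ids):
        seen = set(seen_ids)
        # targets seen this frame: count up, promote once the appearance
        # threshold is exceeded (initialize the tracker class)
        for t in seen:
            if t in self.trackers:
                self.trackers[t] = 0
            else:
                self.pending[t] = self.pending.get(t, 0) + 1
                if self.pending[t] > self.appear_thresh:
                    self.trackers[t] = 0
                    del self.pending[t]
        # targets not seen: age them, discard once the loss threshold is
        # exceeded (keeps the cache within the memory safety range)
        for t in list(self.trackers):
            if t not in seen:
                self.trackers[t] += 1
                if self.trackers[t] > self.loss_thresh:
                    del self.trackers[t]
        # appearance counts must be consecutive: reset unseen candidates
        self.pending = {t: c for t, c in self.pending.items() if t in seen}
        return sorted(self.trackers)

pool = TrackerPool(appear_thresh=2, loss_thresh=2)
for _ in range(3):
    pool.update(["A"])       # "A" appears 3 frames: exceeds appear_thresh
print(pool.update([]))       # → ['A']  (still alive just after disappearing)
```

A briefly appearing target never leaves `pending`, and a long-lost target is dropped from `trackers`, which is exactly the two-threshold behavior described above.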
In particular, the tracker class includes a tracking target class attribute, a live index attribute, a true live index attribute, and a global live index attribute.
The tracking target class attribute is used for storing the tracked targets.
The live index attribute is used for recording the indexes of the tracked targets and corresponds to the tracking target class attribute.
The true live index attribute is used for recording the tracking target classes that are not affected by the time threshold and have not been removed.
The global live index attribute is used for recording the indexes of the live tracking target classes.
Further, the initialization process of the tracker class includes:
initializing a global category serial number, the size of which is related to the number of categories in the current cluster center matrix; and indexing the categories in the multi-shot same-frame images in an index-tracking manner.
Further, the maintenance process of the tracker class in the tracking process includes:
calculating the similarity distance between each cluster center in the current cluster center matrix and each cluster center in the previous cluster center matrix, and judging, according to an optimal assignment principle, whether each category in the current cluster center matrix already exists; if so, the category is matched with the previous category and tracked; otherwise, the non-existent category is added to the tracking target class for storage, which is equivalent to storing the data in a database.
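The maintenance step above can be sketched as follows. A greedy nearest-pair matching stands in for the optimal assignment principle (a Hungarian-style solver could replace it in practice), and the distance threshold is an illustrative assumption:

```python
def match_centers(prev, curr, max_dist=1.0):
    """Match current cluster centers to previous ones by similarity
    distance; greedy nearest-pair assignment stands in for the optimal
    assignment principle, and max_dist is an illustrative threshold."""
    pairs = []
    for i, c in enumerate(curr):
        for j, p in enumerate(prev):
            d = sum((a - b) ** 2 for a, b in zip(c, p)) ** 0.5
            pairs.append((d, i, j))
    pairs.sort()  # closest pairs matched first
    matches, used_c, used_p = {}, set(), set()
    for d, i, j in pairs:
        if d <= max_dist and i not in used_c and j not in used_p:
            matches[i] = j       # current category i continues category j
            used_c.add(i)
            used_p.add(j)
    # unmatched current categories are new and are added for storage
    new = [i for i in range(len(curr)) if i not in used_c]
    return matches, new

prev = [[0.0, 0.0], [5.0, 5.0]]
curr = [[0.1, 0.0], [5.0, 5.1], [9.0, 9.0]]   # third center: a new target
print(match_centers(prev, curr))  # → ({0: 0, 1: 1}, [2])
```

The two persisting targets keep their previous identities, and the unmatched third cluster center is reported as a new category to be stored in the tracking target class.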
In the specific implementation process, more cameras theoretically give a better test effect. Here a test environment is built with two camera paths; the test space is a leisure space as shown in fig. 4. Two mobile phones are set up at opposite corners of this environment to shoot video simultaneously, forming a rectangular space, with each phone facing the center of the rectangle in a top-view posture. A person is made to walk gradually into the center of the rectangle and move within the rectangular space, and the number of persons is then increased. The two mobile phone videos thus shot are same-frame video images of the same scene.
The two videos are input into the target tracking system in a certain order for tracking and identification. The IDs of the same target under the two cameras are accurately identified as consistent; if the number of cameras is increased, more collected spatial features can be used for accurate tracking and re-identification of multiple persons.
Example 2
Embodiment 2 of the present invention is an embodiment of the multi-view image fused space target tracking method. Fig. 4 is a flowchart of this embodiment, and as can be seen from fig. 4, the method embodiment includes:
step 1, symmetrically arranging at least two cameras around the scene in which space target tracking is performed, the cameras shooting the scene in an obliquely downward posture to obtain multi-shot same-frame images, namely at least two frame images of the same scene shot by the at least two cameras at the same moment.
Preferably, the scene is a rectangular space, the number of the cameras is four, the cameras are arranged at four corners of the rectangular scene, and each camera faces to the center of the scene at the same time.
And step 2, labeling the targets detected in the multi-shot same-frame images in the form of target frames.
Preferably, the target frame is predicted based on the YoloV3 detection network.
And 3, extracting the characteristic information in each target frame, and clustering each characteristic information to obtain each clustering characteristic.
Preferably, the feature information of each target frame is extracted, and the feature information is spliced to form a large spatial information feature matrix.
Clustering the spatial information characteristic matrix to obtain the clustering center of each category, and outputting the clustering center matrix formed by the clustering centers as the clustering characteristic.
And 4, tracking the target according to the clustering characteristics.
Preferably, step 4 comprises:
and establishing each tracker class corresponding to each clustering center.
When a target's continuous appearance time exceeds the set target appearance time threshold, the target needs to be tracked, so its cluster center is stored in a tracking target class and a corresponding tracker class is initialized.
And when the target loss time exceeds a set target loss time threshold, discarding the corresponding tracker class.
Setting the time thresholds of the tracker class is very important in its initialization, as it relates to the process of dynamically adding and removing targets. The target loss time threshold is set mainly to remove, from the tracking target class, targets that have not appeared for a period of time. This also keeps the number of target tracking classes in the cache within the memory safety range, avoiding cache overflow.
In particular, the tracker class includes a tracking target class attribute, a live index attribute, a true live index attribute, and a global live index attribute.
The tracking target class attribute is used for storing the tracked targets.
The live index attribute is used for recording the indexes of the tracked targets and corresponds to the tracking target class attribute.
The true live index attribute is used for recording the tracking target classes that are not affected by the time threshold and have not been removed.
The global live index attribute is used for recording the indexes of the live tracking target classes.
Further, the initialization process of the tracker class includes:
A global category serial number is initialized, the size of which is related to the number of categories in the current cluster center matrix; and the categories in the multi-shot same-frame images are indexed in an index-tracking manner.
Further, the maintenance process of the tracker class in the tracking process includes:
calculating the similarity distance between each cluster center in the current cluster center matrix and each cluster center in the previous cluster center matrix, and judging, according to an optimal assignment principle, whether each category in the current cluster center matrix already exists; if so, the category is matched with the previous category and tracked; otherwise, the non-existent category is added to the tracking target class for storage, which is equivalent to storing the data in a database.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the multi-view image fused space target tracking method provided in the foregoing embodiments, the method comprising, for example: step 1, symmetrically arranging at least two cameras around the scene in which space target tracking is performed and shooting the scene in an obliquely downward posture to obtain multi-shot same-frame images. Step 2, labeling the targets detected in the multi-shot same-frame images in the form of target frames. Step 3, extracting the feature information in each target frame and clustering the feature information to obtain clustering features. And step 4, tracking the target according to the clustering features.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.