
Target tracking method, apparatus, computer device, storage medium and program product

Info

Publication number
CN118115927B
CN118115927B
Authority
CN
China
Prior art keywords
target
video
feature
features
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410533582.3A
Other languages
Chinese (zh)
Other versions
CN118115927A
Inventor
李南君
李拓
邹晓峰
王长红
李国庆
展永政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Original Assignee
Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Filing date
Publication date
Application filed by Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority to CN202410533582.3A
Publication of CN118115927A
Application granted
Publication of CN118115927B


Abstract

The invention relates to the technical field of artificial intelligence and discloses a target tracking method, apparatus, computer device, storage medium and program product. The target tracking method comprises the following steps: acquiring a first video captured by a first camera device and determining a first identification object in the first video; identifying a target type event in the first video and determining a target object matched with the target type event based on object features of the first identification object; generating a target feature template based on the object features of the target object in the first video, and determining the target object whose object features match the target feature template among the second identification objects of a second video, where the second video is captured by at least one second camera device associated with the first camera device. The invention can efficiently realize multi-camera cooperative detection of abnormal events and relay detection of a target object, can improve the intelligence level of existing monitoring systems, and improves the video monitoring system.

Description

Target tracking method, apparatus, computer device, storage medium and program product
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a target tracking method, apparatus, computer device, storage medium, and program product.
Background
With rapid socioeconomic development and accelerating urbanization, population density has increased rapidly and public safety problems have become increasingly prominent. Therefore, monitoring important institutions and public places around the clock in real time and automatically discovering suspicious situations is essential to guaranteeing public safety.
However, in related video monitoring schemes, the intelligence level of the monitoring system remains low, offering only basic video analysis and processing functions. Each camera operates relatively independently, lacks comprehensive understanding capability when facing complex events in dynamic scenes, and cannot effectively perceive violation events with potential safety hazards or identify the objects participating in such events.
Disclosure of Invention
In view of the foregoing, the present invention provides a target tracking method, apparatus, computer device, storage medium and program product to solve the problem that related video monitoring schemes lack comprehensive understanding capability when facing complex events in dynamic scenes and cannot effectively perceive violation events with potential safety hazards or identify the objects participating in such violation events.
In a first aspect, the present invention provides a target tracking method, which is characterized in that the method includes:
acquiring a first video acquired by a first camera device, and determining a first identification object in the first video;
identifying a target type event in the first video, and determining a target object matched with the target type event based on object features of the first identified object;
And generating a target feature template based on the object features of the target object in the first video, and determining a target object with the object features matched with the target feature template in a second identification object of a second video, wherein the second video is acquired by at least one second camera associated with the first camera.
In an alternative embodiment, generating a target feature template based on the target object includes:
intercepting a sub-image corresponding to a target object in a video frame of a first video;
Object features of the target object are extracted based on the sub-images, and a target feature template of the target object is generated according to the object features.
In the embodiment of the disclosure, it is considered that the conventional target handover method uses an appearance model to perform target matching, and the appearance model degrades when the target object moves along a disordered trajectory, is frequently occluded, and so on. Therefore, the disclosure establishes a target feature template of the target object so as to track the target object in the second video according to the target feature template, thereby realizing robust target perception of the monitored scene.
In an alternative embodiment, extracting object features of the target object based on the sub-image, and generating a target feature template of the target object according to the object features, includes:
Extracting motion track characteristics of a target object in a plurality of sub-images, and extracting visual characteristics of the sub-images, wherein the visual characteristics comprise: color features and/or texture features;
and determining a target feature template of the target object according to the motion trail features and the object features.
In the embodiment of the disclosure, it is considered that the conventional target handover method uses an appearance model to perform target matching, and the appearance model degrades when the target object moves along a disordered trajectory, is frequently occluded, and so on. Therefore, the disclosure jointly characterizes the appearance features and the motion trajectory features through a Transformer-LSTM network so that the appearance and trajectory models complement each other, thereby overcoming occlusion and similar problems and realizing robust target perception of the monitored scene.
In an alternative embodiment, determining a target feature template of the target object according to the motion trail feature and the object feature includes:
fusing the motion trail features and the object features to obtain target features;
And generating the target feature template according to the target features.
In the embodiment of the disclosure, it is considered that the conventional target handover method uses an appearance model to perform target matching, and the appearance model degrades when the target object moves along a disordered trajectory, is frequently occluded, and so on. Therefore, the disclosure jointly characterizes the object features and the motion trajectory features through a Transformer-LSTM network so that the appearance and trajectory models complement each other, thereby overcoming occlusion and similar problems and realizing robust target perception of the monitored scene.
In an alternative embodiment, determining a target object in which the feature vector matches the target feature template in the second recognition object of the second video includes:
Acquiring target features corresponding to the target object based on the target feature template, and extracting object features of a second identification object based on a second video;
and determining a third recognition object with the similarity of the object feature and the target feature meeting the similarity condition in the second recognition object, and determining the third recognition object as the target object.
In the embodiment of the disclosure, it is considered that the conventional target handover method uses an appearance model to perform target matching, and the appearance model degrades when the target object moves along a disordered trajectory, is frequently occluded, and so on. Therefore, the disclosure jointly characterizes the appearance features and the motion trajectory features through a Transformer-LSTM network so that the appearance and trajectory models complement each other, thereby overcoming occlusion and similar problems and realizing robust target perception of the monitored scene.
In an alternative embodiment, determining a third recognition object whose similarity between the object feature and the target feature satisfies a similarity condition in the second recognition object, and determining the third recognition object as the target object includes:
performing feature space conversion on the object features through a first sub-network in the twin network to obtain object feature vectors, and performing feature space conversion on the target features through a second sub-network in the twin network to obtain target feature vectors;
Calculating the absolute value of the difference between the object feature vector and the target feature vector, and mapping the absolute value of the difference to obtain a mapping value;
and determining a third identification object with the mapping value meeting the similar condition in the second identification objects, and determining the third identification object as a target object.
In the embodiment of the disclosure, it is considered that the conventional target handover method uses an appearance model to perform target matching, and the appearance model degrades when the target object moves along a disordered trajectory, is frequently occluded, and so on. Therefore, the disclosure jointly characterizes the appearance features and the motion trajectory features through a Transformer-LSTM network so that the appearance and trajectory models complement each other, thereby overcoming occlusion and similar problems and realizing robust target perception of the monitored scene.
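As a rough illustration only (the patent gives no concrete network definition), the twin-network matching described above might be sketched as follows; the fully connected layers, the feature dimensions and the sigmoid mapping are assumptions:

```python
import torch
import torch.nn as nn

class TwinMatcher(nn.Module):
    """Minimal sketch: two sub-networks map the object feature and the target
    feature into a common space, and the absolute value of their difference is
    mapped to a similarity score in [0, 1]."""
    def __init__(self, in_dim=256, embed_dim=128):
        super().__init__()
        # first sub-network: feature space conversion of the second recognition object's features
        self.first_subnet = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU(),
                                          nn.Linear(embed_dim, embed_dim))
        # second sub-network: feature space conversion of the target features from the template
        self.second_subnet = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU(),
                                           nn.Linear(embed_dim, embed_dim))
        # maps the absolute difference to a single mapping value
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, object_feature, target_feature):
        obj_vec = self.first_subnet(object_feature)    # object feature vector
        tgt_vec = self.second_subnet(target_feature)   # target feature vector
        diff = torch.abs(obj_vec - tgt_vec)            # absolute value of the difference
        return torch.sigmoid(self.head(diff))          # mapping value in [0, 1]

# a second recognition object whose mapping value satisfies the similarity
# condition (e.g. exceeds an assumed threshold such as 0.5) is taken as the target object
```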
In an alternative embodiment, acquiring a first video acquired by a first camera device, and determining a first identification object in the first video includes:
Acquiring point cloud data corresponding to a current video frame in a first video, wherein the point cloud data are acquired through a point cloud device associated with a first camera device;
Fusing the point cloud data with the current video frame to obtain a current fused feature map, wherein the current fused feature map is used for indicating a fusion result of the feature map aiming at the first video and the corresponding point cloud data;
identifying based on the current fusion feature map to obtain a target area of a first identification object in the current video frame;
and identifying a first identification object matched with the target area in a subsequent video frame of the first video, wherein the subsequent video frame is a video frame whose corresponding time node in the first video is after the current video frame.
In the embodiment of the disclosure, considering that the video frames in the first video are RGB (red green blue) images, the RGB images and the point cloud data can be fused, and the first identification object can be identified according to the fusion result, thereby realizing accurate three-dimensional detection of multi-scale objects against cluttered backgrounds in complex illumination scenes and improving the accuracy of object identification.
In an alternative embodiment, fusing the point cloud data with the current video frame to obtain a current fused feature map includes:
respectively converting the point cloud data and the first video into corresponding two-dimensional feature graphs;
And carrying out weighted summation on the pixel points of the two-dimensional feature images to obtain the current fusion feature image.
In the embodiment of the disclosure, considering that target detection algorithms based only on RGB images tend to degrade under complex scene illumination, variable target scales, and similar conditions, the disclosure fuses the RGB image with the point cloud data to construct a three-dimensional detection model applicable to multi-scale targets in complex illumination scenes, thereby realizing accurate three-dimensional detection of multi-scale objects against cluttered backgrounds and improving the accuracy of object identification.
In an alternative embodiment, the identifying based on the current fusion feature map, to obtain the target area of the first identification object in the current video frame includes:
Acquiring a history fusion feature map corresponding to a preset history video frame, and predicting a target object position of a first identification object in a current fusion feature map according to the object position of the first identification object in the history fusion feature map, wherein the preset history video frame is a video frame at a moment before the current video frame in the first video;
performing object identification based on the current fusion feature map to obtain a region to be selected;
And correcting the area to be selected based on the target object position to obtain a target area.
In the embodiment of the disclosure, the object recognition can be performed on the first recognition object by combining the historical fusion feature map with the current fusion feature map, so that the confidence of the recognition result is improved.
In an alternative embodiment, the method further comprises:
after a history fusion feature map of a preset history moment is obtained, feature points in the history fusion feature map are extracted through a feature extraction network to obtain history image features, wherein the feature points of different history fusion feature maps are extracted with different precision;
and constructing a feature group based on the historical image features to perform the step of predicting the target object position of the first recognition object in the current fusion feature map from the object position of the first recognition object in the historical fusion feature map according to the feature group.
In the embodiment of the disclosure, a strong-weak network alternative processing history fusion feature map can be pre-established to construct a feature group according to a processing result, so that image features are rapidly processed on the basis of ensuring the recognition accuracy of the image features in the feature group.
In an alternative embodiment, identifying a first identification object matching the target region in a subsequent video frame of the first video includes:
analyzing position coordinates of the first recognition object in the current video frame and semantic information based on the target area, wherein the semantic information is used for indicating information for carrying out semantic description on pixel point characteristics of the target area;
And determining the identification object of which the located area and the target area meet the matching degree condition in the subsequent video frame to obtain a first identification object, wherein the matching degree condition comprises position matching degree and semantic matching degree.
In the embodiment of the disclosure, the identification object of which the located area and the target area meet the matching degree condition can be determined in the subsequent video frames of the first video, so that the first identification object is obtained, the track tracking of the first identification object is realized, and a technical basis is provided for determining the target object based on the first identification object.
In an alternative embodiment, the method further comprises:
after the target areas are obtained, determining the spatial overlapping degree based on the overlapping areas between the target areas in the current video frame;
The associated ones of the first recognition objects are analyzed based on the spatial overlap.
In the embodiment of the disclosure, the associated objects in the current video frame can be determined, so that events occurring in the first video can be better summarized when the first video is subjected to semantic analysis, and the confidence of the determined abnormal events is improved.
In an alternative embodiment, identifying a target type event in a first video includes:
acquiring a semantic reasoning network of a source domain, and migrating according to the semantic reasoning network to obtain a target reasoning network, wherein the source domain is used for indicating the domain for carrying out semantic description on events in video frames;
And carrying out semantic recognition on the first video through the target reasoning network to obtain event sentences, and summarizing the target type event based on the event sentences.
In the embodiment of the disclosure, compared with the working mode that the events occurring in the video are required to be manually analyzed to determine the abnormal events, the method and the device can automatically summarize the abnormal events in the first video through the target inference network, have higher reliability, effectively reduce the dependence of the monitoring system on manpower and the running cost, and improve the efficiency of summarizing the abnormal events.
In an alternative embodiment, the migration is performed according to a semantic reasoning network to obtain a target reasoning network, including:
acquiring at least part of parameters in a semantic reasoning network, and initializing model parameters of the reasoning network to be trained based on the at least part of parameters;
and carrying out parameter training on the inference network to be trained through sample data to obtain a target inference network, wherein the sample data comprises video sequences which are arranged in time sequence.
In the embodiment of the disclosure, the semantic reasoning network in the source field with sufficient samples can be directly applied to abnormal event identification in the target field through migration learning, manual definition of monitoring scenes is not needed in advance, event samples are collected and marked, and generalization capability of the disclosure for different scenes is enhanced. In addition, the migrated target inference network can provide detailed semantic description of the abnormal event, so that the detection effect on the abnormal event is improved.
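A minimal sketch of this migration step is given below, assuming the source-domain semantic reasoning network and the to-be-trained inference network share some layer names (the "encoder." prefix and the matching rule are hypothetical):

```python
import torch.nn as nn

def init_from_source(source_state: dict, target_model: nn.Module,
                     keep_prefixes: tuple = ("encoder.",)):
    """Copy only those source-domain parameters whose names and shapes match
    layers of the to-be-trained inference network; all other layers keep their
    original initialization. The prefix rule is an assumed matching policy."""
    target_state = target_model.state_dict()
    migrated = {name: weight for name, weight in source_state.items()
                if name.startswith(keep_prefixes)
                and name in target_state
                and weight.shape == target_state[name].shape}
    target_state.update(migrated)
    target_model.load_state_dict(target_state)
    return target_model

# the initialized network is then trained on sample data consisting of
# time-ordered video sequences from the monitoring (target) domain
```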
In an alternative embodiment, determining a target object that matches the target type event based on object characteristics of the first recognition object includes:
acquiring multi-modal data of the first identification object in the first video, wherein the multi-modal data comprise data of the appearance and the motion track of the target object in the first video;
extracting features based on the multi-mode data to obtain high-dimensional features of the object;
and determining a target high-dimensional feature matched with the target type event from the object high-dimensional features, and determining the first identification object corresponding to the target high-dimensional feature as the target object, wherein the motion features and appearance features indicated by the target high-dimensional feature match the semantics corresponding to the target type event.
In the embodiment of the disclosure, the multi-mode data input of the first identification object can be utilized to establish a self-training target object positioning framework based on the object high-dimensional feature combination, so that the unsupervised positioning of local target objects with different behavior categories in the first video is realized.
In an alternative embodiment, the multimodal data includes: a sub-image corresponding to the target object in a video frame of the first video, a human skeleton sequence corresponding to the sub-image, visual feature information of the sub-image, and an optical flow sequence describing the change of the target object's position between sub-images;
feature extraction is performed based on multi-modal data to obtain high-dimensional features of the object, including:
respectively extracting action features corresponding to the human skeleton sequences and color features corresponding to the visual feature information based on the first convolution network, and extracting motion features corresponding to the optical flow sequences based on the second convolution network;
adding the color features and the motion features to obtain a first fusion feature;
and carrying out weighted summation on the first fusion feature and the action feature through a feature fusion network to obtain a second fusion feature, and splicing the first fusion feature and the second fusion feature to obtain the high-dimensional feature of the object.
In the embodiment of the disclosure, the multi-mode data input of the first identification object can be utilized to establish a self-training target object positioning framework based on the object high-dimensional feature combination, so that the unsupervised positioning of local target objects with different behavior categories in the first video is realized.
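A hedged sketch of this fusion order follows; linear layers stand in for the first and second convolutional networks, and the dimensions and the weighting scheme of the feature fusion network are assumptions:

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Sketch of the fusion order described above; the per-modality encoders are
    placeholders and all dimensions are hypothetical."""
    def __init__(self, dim=256, skel_dim=75, visual_dim=512, flow_dim=1024):
        super().__init__()
        self.action_branch = nn.Linear(skel_dim, dim)    # human skeleton sequence -> action features
        self.color_branch  = nn.Linear(visual_dim, dim)  # visual feature information -> color features
        self.motion_branch = nn.Linear(flow_dim, dim)    # optical flow sequence -> motion features
        # feature fusion network producing the weights of the weighted summation
        self.fusion_weights = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, skeleton_seq, visual_info, optical_flow):
        action = self.action_branch(skeleton_seq)
        color  = self.color_branch(visual_info)
        motion = self.motion_branch(optical_flow)
        first_fusion = color + motion                    # element-wise addition
        w = self.fusion_weights(torch.cat([first_fusion, action], dim=-1))
        second_fusion = w[..., 0:1] * first_fusion + w[..., 1:2] * action  # weighted summation
        return torch.cat([first_fusion, second_fusion], dim=-1)            # object high-dimensional feature
```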
In a second aspect, the present invention provides a target tracking apparatus comprising:
The first determining module is used for acquiring a first video acquired by the first camera device and determining a first identification object in the first video;
the identification module is used for identifying a target type event in the first video and determining a target object matched with the target type event based on the object characteristics of the first identification object;
And the second determining module is used for generating a target feature template based on the object features of the target object in the first video and determining a target object with the object features matched with the target feature template in a second identification object of a second video, wherein the second video is acquired through at least one second camera associated with the first camera.
In a third aspect, the present invention provides a computer device comprising: the object tracking system comprises a memory and a processor, wherein the memory and the processor are in communication connection, the memory stores computer instructions, and the processor executes the computer instructions, so that the object tracking method of the first aspect or any corresponding implementation mode of the first aspect is executed.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the object tracking method of the first aspect or any of its corresponding embodiments.
In a fifth aspect, the present invention provides a computer program product comprising computer instructions for causing a computer to perform the object tracking method of the first aspect or any of its corresponding embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a target tracking method according to an embodiment of the invention;
FIG. 2 is a flow chart of another object tracking method according to an embodiment of the invention;
FIG. 3 is a flow diagram of determining a first recognition object in a first video;
FIG. 4 is a flow chart of yet another object tracking method according to an embodiment of the present invention;
FIG. 5 is a flow diagram of identifying a target type event in a first video and determining a target object in a first identified object whose object characteristics match the target type event;
FIG. 6 is a flow chart of yet another object tracking method according to an embodiment of the present invention;
FIG. 7 is a flow chart of generating a target feature template and tracking a target object according to the target feature template;
FIG. 8 is a network architecture diagram of a target tracking subsystem according to an embodiment of the invention;
FIG. 9 is a schematic diagram of a tracking process corresponding to the target tracking subsystem;
FIG. 10 is a block diagram of a target tracking apparatus according to an embodiment of the invention;
fig. 11 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The application scenario on which execution of the target tracking method depends is described below.
With rapid socioeconomic development and accelerating urbanization, population density has increased rapidly and public safety problems have become increasingly prominent. Therefore, monitoring important institutions and public places around the clock in real time and automatically discovering suspicious situations is essential to guaranteeing public safety.
However, in related video monitoring schemes, the intelligence level of the monitoring system remains low, offering only basic video analysis and processing functions. Each camera operates relatively independently, lacks comprehensive understanding capability when facing complex events in dynamic scenes, and cannot effectively perceive abnormal events with potential safety hazards or identify the objects participating in such events.
Specifically, when detecting video abnormal events through the above-mentioned monitoring system, it is generally necessary to perceive the abnormal event in the monitored scene and to identify the objects participating in the abnormal event in the video. However, in a real monitored scene, many challenging factors, such as changeable object conditions (e.g., posture changes, scale changes, appearance changes), complicated object behavior patterns (e.g., escaping, fighting, theft), and changeable object motion tracks (e.g., continuous movement across multiple cameras), affect the performance of abnormal event and object detection.
Based on this, in the present disclosure, a first video acquired by a first image capturing apparatus may first be acquired and a first recognition object in the first video determined; then a target type event in the first video is recognized, and a target object matching the target type event is determined based on the object features of the first recognition object, so that a target feature template is generated based on the target object, and a target object whose object features match the target feature template is determined among the second recognition objects of a second video, where the target type event may be an abnormal event and the second video is a video acquired by at least one second image capturing apparatus associated with the first image capturing apparatus. Therefore, the present disclosure can efficiently realize multi-camera cooperative detection of abnormal events and relay detection of target objects, improve the intelligence level of existing monitoring systems, and improve the video monitoring system.
In accordance with an embodiment of the present invention, there is provided an embodiment of a target tracking method, it being noted that the steps shown in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order other than that shown or described herein.
In this embodiment, a target tracking method is provided, which may be used in the above-mentioned monitoring system, and fig. 1 is a flowchart of a target tracking method according to an embodiment of the present invention, as shown in fig. 1, where the flowchart includes the following steps:
step S101, a first video acquired by a first camera device is acquired, and a first identification object in the first video is determined.
In an embodiment of the present disclosure, the monitoring system may include a plurality of image capturing devices, where the image capturing devices may be cameras for capturing video data in a monitored scene, and the number of image capturing devices may be set according to an area of the monitored scene.
Here, the installation position of the first camera device can be set according to the people flow direction in the monitoring scene, so that the target type event can be found as early as possible, and other camera devices in the monitoring system can conveniently track the target type event and the corresponding target object. Thus, the installation location may be a starting location for the flow direction of the human flow, such as a starting location in a monitoring scene.
It should be understood that the first image capturing device may collect the first video in the monitoring area in real time, and perform real-time object recognition on the video frames in the first video, so as to obtain the first recognition object in the first video. Here, the first recognition object may be a preset type of recognition object, for example, a person, a vehicle, or the like.
Step S102, identifying a target type event in the first video, and determining a target object matched with the target type event based on the object characteristics of the first identified object.
In the embodiment of the present disclosure, the target type event is an abnormal event, where the abnormal event may be a violation event in the preset field corresponding to the monitored scene; for example, when the monitored scene is a traffic scene, the violation event may be a pedestrian crossing an intersection against the traffic signal, a vehicle not traveling along the specified route, and so on.
Specifically, when identifying an abnormal event in the first video, a target inference network applicable to the preset field corresponding to the monitored scene can be trained in advance. The target inference network can perform semantic analysis on a video frame sequence in the preset field and analyze the behaviors of each first identification object in the video frame sequence to obtain semantic descriptions of multiple scenes, for example, "pedestrian X rides a bicycle through a crowd walking on a motor vehicle lane".
Then, the target inference network may analyze the scene semantic description piece by piece to obtain an abnormal event occurring in the first video. Then, the first recognition object can be classified through a classification algorithm to obtain a classification result, wherein the classification result comprises a set M and a set N, the first recognition object in the set M is a target object, and the first recognition object in the set N is a non-target object.
Step S103, generating a target feature template based on the target object, and determining a target object with a feature vector matching the target feature template in a second recognition object of a second video, where the second video is a video acquired by at least one second camera associated with the first camera.
In an embodiment of the disclosure, the second image capturing device is another image capturing device that performs handover with respect to the first image capturing device in the monitoring system, for example, if the first image capturing device is disposed at an intersection, the second image capturing device may be disposed at an adjacent intersection, so as to facilitate relay tracking of the target object.
Here, the target feature template can be used to indicate the appearance features and track features of the target object. When matching the target recognition object among the second recognition objects of the second video based on the target feature template, spatio-temporal matching of feature vectors can be performed through a gated recurrent convolutional twin network, which improves robustness to the class-imbalance problem; the similarity between the second recognition object and the target object is estimated by learning from semantic similarity, effectively overcoming drawbacks of related monitoring systems such as difficult deployment in actual scenes and a low semantic level of anomaly detection.
As can be seen from the foregoing description, in the embodiments of the present disclosure, a first video acquired by a first image capturing device may first be acquired and a first identification object in the first video determined; then a target type event in the first video is identified, and a target object matched with the target type event is determined based on the object features of the first identification object, so that a target feature template is generated based on the target object, and a target object whose object features match the target feature template is determined among the second identification objects of a second video, where the target type event may be an abnormal event and the second video is acquired by at least one second image capturing device associated with the first image capturing device. Therefore, the disclosure can efficiently realize multi-camera cooperative detection of abnormal events and relay detection of target objects, improve the intelligence level of existing monitoring systems, and improve the video monitoring system.
In this embodiment, another object tracking method is provided, which may be used in the above-mentioned monitoring system, and fig. 2 is a flowchart of another object tracking method according to an embodiment of the present invention, as shown in fig. 2, where the flowchart includes the following steps:
step S201, a first video acquired by a first camera device is acquired, and a first identification object in the first video is determined.
Specifically, the step S201 includes:
In step S2011, point cloud data corresponding to a current video frame in the first video is acquired, where the point cloud data is data acquired by a point cloud device associated with the first camera device.
In an embodiment of the present disclosure, the point cloud device may be a device for performing point cloud data acquisition, such as a laser radar, where a data acquisition area of the point cloud device is not smaller than a video acquisition area of the first image capturing device. When the point cloud data corresponding to the current video frame in the first video is acquired, matching can be performed in the point cloud data returned by the point cloud device based on the time node corresponding to the current video frame, so as to obtain the point cloud data corresponding to the current video frame.
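As a simple hedged illustration of this time-node matching (the data layout and the tolerance value are assumptions, not taken from the patent):

```python
def match_point_cloud(frame_time, sweeps, tolerance=0.05):
    """Pick the point-cloud sweep whose acquisition time is closest to the time
    node of the current video frame; `sweeps` is a list of dicts with a
    'timestamp' key and the tolerance (in seconds) is an assumed parameter."""
    best = min(sweeps, key=lambda s: abs(s["timestamp"] - frame_time))
    return best if abs(best["timestamp"] - frame_time) <= tolerance else None
```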
Step 2012, fusing the point cloud data with the current video frame to obtain a current fused feature map, wherein the current fused feature map is used for indicating a fusion result of the feature map for the first video and the corresponding point cloud data.
Step S2013, identifying based on the current fusion feature map to obtain a target area of the first identification object in the current video frame.
Step S2014, identifying a first identification object matched with the target area in a subsequent video frame of the first video, where the subsequent video frame is a video frame whose corresponding time node in the first video is after the current video frame.
In this embodiment of the present disclosure, the point cloud device may be a device for collecting point cloud data, such as a laser radar, and the fused feature map is a two-dimensional feature map, and the target area may be a position of an identification frame of the first identification object in a current video frame. It should be understood that the data acquisition area of the point cloud device is not smaller than the video acquisition area of the first camera device.
And in the step of acquiring the point cloud data corresponding to the current video frame in the first video, matching can be performed in the point cloud data returned by the point cloud device based on the time node corresponding to the current video frame so as to acquire the point cloud data corresponding to the current video frame.
Step S202, an abnormal event in the first video is identified, and a target object with object characteristics matched with the abnormal event is determined in the first identification object. Please refer to step S102 in the embodiment shown in fig. 1 in detail, which is not described herein.
Step S203, generating a target feature template based on the target object, and determining a target object with a feature vector matching the target feature template in a second recognition object of a second video, where the second video is a video acquired by at least one second camera associated with the first camera. Please refer to step S103 in the embodiment shown in fig. 1 in detail, which is not described herein.
In the embodiment of the disclosure, considering that the video frames in the first video are RGB images, the RGB images and the point cloud data can be fused, and the first identification object can be identified according to the fusion result, thereby realizing accurate three-dimensional detection of multi-scale objects against cluttered backgrounds in complex illumination scenes and improving the accuracy of object identification.
In some alternative embodiments, step S2012 includes:
And a step a1, respectively converting the point cloud data and the first video into corresponding two-dimensional feature maps.
And a2, carrying out weighted summation on the pixel points of the two-dimensional feature map to obtain the current fusion feature map.
In the embodiments of the present disclosure, a cascade attention network for RGB images may be designed to suppress background information in the current video frame, where the cascade attention network may first perform preprocessing on the current video frame, specifically, may process background interference in the current video frame in a coarse-to-fine two-stage manner through a self-attention model and a spatial attention model.
Next, the current video frame may be input into a 2D convolutional network to extract a two-dimensional feature map F_I, and automatic calibration projection is adopted to convert F_I into a two-dimensional feature map in the BEV (Bird's-Eye-View) view; it is understood that the two-dimensional feature map in the BEV view is a smooth spatial feature map that corresponds most closely to the features of the point cloud data.
When converting the point cloud data, the point cloud data is first divided into voxels, and a 3D convolution network constructed from sparse convolution and submanifold convolution converts the voxels into a two-dimensional feature map F_P.
After determining the two-dimensional feature maps F_I and F_P, an adaptive gating unit may be employed to perform pixel-by-pixel weighted fusion of F_I and F_P to obtain the current fusion feature map F, and the fusion process can be expressed as F = W_I ⊙ F_I + W_P ⊙ F_P (element-wise product),
where the adaptive gating unit comprises two 3×3 Conv (convolution gates) whose parameters produce the fusion weights, W_I is the fusion weight corresponding to F_I, and W_P is the fusion weight corresponding to F_P.
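The gating formula itself is not reproduced in the text, so the following sketch assumes a sigmoid over each 3×3 convolution output as the per-pixel fusion weight:

```python
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    """Sketch of the adaptive gating unit: two 3x3 convolutions produce per-pixel
    fusion weights for the image-branch and point-cloud-branch BEV feature maps,
    which are then summed pixel by pixel. The sigmoid weighting is an assumption."""
    def __init__(self, channels):
        super().__init__()
        self.gate_img = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.gate_pts = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat_img, feat_pts):
        w_img = torch.sigmoid(self.gate_img(feat_img))   # fusion weight for F_I
        w_pts = torch.sigmoid(self.gate_pts(feat_pts))   # fusion weight for F_P
        return w_img * feat_img + w_pts * feat_pts       # current fusion feature map F
```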
In the embodiment of the disclosure, considering that the detection effect is easy to be reduced under the conditions of complex field Jing Guangzhao, variable target scale and the like aiming at the target detection algorithm based on the RGB image, the method fuses the RGB image and the point cloud data to be used for constructing a three-dimensional detection model applicable to the multi-scale target in the complex illumination condition scene, thereby realizing three-dimensional accurate detection of the multi-scale object with disordered background in the complex illumination condition scene and improving the accuracy of object identification.
In some alternative embodiments, the step S2013 includes:
Step b1, acquiring a history fusion feature map corresponding to a preset history video frame, and predicting a target object position of a first identification object in a current fusion feature map according to the object position of the first identification object in the history fusion feature map, wherein the preset history video frame is a video frame at a moment before the current video frame in the first video.
And b2, carrying out object recognition based on the current fusion feature map to obtain a region to be selected.
And b3, correcting the area to be selected based on the position of the target object to obtain the target area.
In the embodiment of the present disclosure, the preset historical video frames may be the historical video frames corresponding to the three historical time nodes adjacent to the current video frame. For example, if the current time is t, the time nodes corresponding to the preset historical video frames are t-1, t-2 and t-3, respectively.
It should be understood that, in the present disclosure, the detection of the first video is performed in real time, that is, each frame of video frame acquired in real time is fused with corresponding point cloud data, so as to obtain a corresponding fusion feature map, and the fusion feature map is stored in the server. Based on the history fusion feature map, a history fusion feature map corresponding to the preset history video frame can be obtained from the server.
After the history fusion feature map is obtained, a target area where the first recognition object is located in the current fusion feature map can be determined based on the history fusion feature map. Here, the target area where the first recognition object is located in the current fusion feature map may be predicted based on the historical fusion feature map, and the area to be selected where the first recognition object is located in the current fusion feature map is corrected according to the prediction result, and a specific correction process is described below, which is not described herein.
In the embodiment of the disclosure, the object recognition can be performed on the first recognition object by combining the historical fusion feature map with the current fusion feature map, so that the confidence of the recognition result is improved.
In some optional embodiments, the step S2013 further includes:
Step c1, after a history fusion feature map of a preset history moment is obtained, feature points in the history fusion feature map are extracted through a feature extraction network, and history image features are obtained, wherein the extraction precision of the feature points in the history fusion feature map is different.
And c2, constructing a feature group based on the historical image features, and executing the step of predicting the target object position of the first identification object in the current fusion feature map according to the object position of the first identification object in the historical fusion feature map according to the feature group.
In the embodiment of the disclosure, a spatio-temporal 'strong-weak' target detection network that alternately processes the history fusion feature maps can be established in advance. The spatio-temporal 'strong-weak' target detection network includes a 'strong' network and a 'weak' network, where the 'strong' network comprises separable convolution layers with a larger channel multiplier factor, which can expand the number of channels of the input feature to extract accurate information, while the 'weak' network comprises separable convolution layers with a small channel multiplier factor, so that image features can be processed quickly.
As described above, the preset history times include t-1, t-2 and t-3. Thus, the 'strong' network may be used to extract features from the history fusion feature map corresponding to time t-1 to obtain the history image feature H(t-1), and the 'weak' network may be used to extract features from the history fusion feature maps corresponding to times t-2 and t-3 respectively, to obtain the history image features H(t-2) and H(t-3), thereby constructing the feature group {H(t-1), H(t-2), H(t-3)}.
Next, the feature group can be fed into a ConvLSTM network to determine the current image feature H(t) corresponding to the current video frame, where H(t) may be used to indicate the target region in the fused feature map.
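A minimal sketch of the 'strong'/'weak' alternation is given below, assuming depthwise-separable blocks whose channel multiplier differs between the two networks; the channel counts, the multipliers, and the assignment of the 'strong' network to the most recent history frame are assumptions, and the ConvLSTM stage is omitted:

```python
import torch.nn as nn

def separable_block(in_ch, channel_multiplier):
    """Depthwise-separable convolution whose depthwise stage expands the channel
    count by `channel_multiplier` (large for the 'strong' network, small for the
    'weak' network); channel counts and multipliers are hypothetical."""
    mid = in_ch * channel_multiplier
    return nn.Sequential(
        nn.Conv2d(in_ch, mid, kernel_size=3, padding=1, groups=in_ch),  # depthwise expansion
        nn.Conv2d(mid, in_ch, kernel_size=1),                           # pointwise projection
        nn.ReLU(),
    )

strong_net = separable_block(in_ch=64, channel_multiplier=4)  # accurate branch
weak_net   = separable_block(in_ch=64, channel_multiplier=1)  # fast branch

def build_feature_group(history_maps):
    """history_maps: dict of history fusion feature maps keyed by 't-1', 't-2', 't-3';
    assigning the 'strong' network to the most recent frame is an assumption."""
    return [strong_net(history_maps["t-1"]),
            weak_net(history_maps["t-2"]),
            weak_net(history_maps["t-3"])]   # feature group passed on to the ConvLSTM
```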
In the embodiment of the disclosure, a strong-weak network alternative processing history fusion feature map can be pre-established to construct a feature group according to a processing result, so that image features are rapidly processed on the basis of ensuring the recognition accuracy of the image features in the feature group.
In some alternative embodiments, step S2014 includes:
And d1, analyzing the position coordinates of the first recognition object in the current video frame and semantic information based on the target area, wherein the semantic information is used for indicating information for carrying out semantic description on the pixel point characteristics of the target area.
And d2, determining the identification object of which the located area and the target area meet the matching degree condition in the subsequent video frame to obtain a first identification object, wherein the matching degree condition comprises position matching degree and semantic matching degree.
In an embodiment of the present disclosure, the above-mentioned position coordinates may be used to indicate the center point coordinates, length, width, height and angle of the target area in the current video frame. The semantic information is used to describe the visual characteristics of the target area based on the pixel points in the target area, for example, the distribution of the pixel values corresponding to the pixel points in the target area. Therefore, when determining, in the subsequent video frames, the identification object whose located area and the target area satisfy the matching degree condition, matching can be performed based on the position coordinates and the semantic information respectively.
Here, the above-mentioned position coordinates can be determined first. Specifically, the current image feature H(t) can be input into an SSD (Single Shot Detection, single-shot object detection) network to obtain the position coordinates, which can be expressed as b = f_SSD(H(t)), where f_SSD represents the SSD network and b denotes the position coordinates.
Next, the candidate recognition objects satisfying the semantic matching degree may be determined in the subsequent video frame through the above-described matching condition. Specifically, a plurality of recognition objects may first be recognized in the subsequent video frame, and the candidate recognition objects that may be the first recognition object are determined. Here, a PD-CAE (Pyramid Dilated Convolutional Autoencoder) can be used to perform coarse detection of human skeleton key points; a spatial pyramid mechanism based on dilated convolution is designed in the PD-CAE, dilated convolution kernels with different receptive fields are applied in parallel to the subsequent video frame to extract multi-scale features, and feature maps of different scales are fused to ensure the detection of human skeleton key points at different scales.
It will be appreciated that the PD-CAE outputs a regressed heat map M; therefore, a maximum (argmax) function may be used to decode the key points of the recognition objects from M, where the key points are the skeleton key points of the first recognition object. Here, the recognition object whose overall probability value over all pixel points satisfies the semantic matching degree may be determined as a candidate recognition object.
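For illustration, decoding skeleton key points from the regressed heat map M with a maximum function might look like the following sketch; the heat-map layout and the threshold are assumptions:

```python
import numpy as np

def decode_keypoints(heatmaps, threshold=0.3):
    """heatmaps: array of shape (K, H, W), one regressed heat map per skeleton
    key point. The maximum response in each map gives the key-point location;
    points whose peak value is below the assumed threshold are discarded."""
    keypoints = []
    for k in range(heatmaps.shape[0]):
        flat_index = np.argmax(heatmaps[k])              # maximum function over the heat map
        y, x = np.unravel_index(flat_index, heatmaps[k].shape)
        score = heatmaps[k, y, x]
        keypoints.append((int(x), int(y), float(score)) if score >= threshold else None)
    return keypoints
```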
After the candidate recognition objects are determined, the first recognition object may be determined based on the position matching degree. Here, the motion continuity of the same candidate object across multiple video frames can be identified based on the position matching degree to obtain the timing edges between image frames, so that the first recognition object is determined according to the timing edges. A timing edge reflects the motion continuity of a recognition object based on the change of its position coordinates between video frames, that is, the change of each recognition object's position over time, and this information is particularly important for recognizing target actions and understanding event content.
Specifically, the target area in the current video frame may be used as a template T. The matching degree between T and the candidate area where any recognition object is located in the subsequent video frame can then be calculated, and this matching degree can be used to indicate the similarity of the motion tracks of different objects. Taking the j-th candidate area R_j in the subsequent video frame as an example, the matching degree combines the following two terms:
the semantic correlation between T and the candidate area R_j, which may be determined by the Euclidean distance between the visual features of T and R_j;
and the position correlation between T and the candidate area R_j, which can be represented by the spatial overlap ratio between the two (calculated using the intersection-over-union formula).
After the matching degrees between T and the N candidate areas in the corresponding subsequent video frame are determined, the maximum value among them can be selected as the inter-frame timing edge between the current video frame and that subsequent video frame. The recognition object located in the candidate area that attains this maximum value is then considered to satisfy the position matching degree, so that track tracking of the first recognition object is realized; specifically, the track tracking result of the first recognition object in the first video can be expressed as a timing graph whose nodes carry the local features extracted from the recognition areas.
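A hedged sketch of the two matching terms and the selection of the timing edge follows; how the patent combines the two terms is not recoverable from the text, so an equal weighting and a distance-to-similarity conversion are assumed:

```python
import numpy as np

def semantic_correlation(feat_t, feat_r):
    """Semantic relatedness via the Euclidean distance between visual features,
    converted to a similarity so that larger means more related (assumption)."""
    return 1.0 / (1.0 + np.linalg.norm(np.asarray(feat_t) - np.asarray(feat_r)))

def position_correlation(box_t, box_r):
    """Spatial overlap ratio (intersection over union) of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_t[0], box_r[0]), max(box_t[1], box_r[1])
    x2, y2 = min(box_t[2], box_r[2]), min(box_t[3], box_r[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    area_r = (box_r[2] - box_r[0]) * (box_r[3] - box_r[1])
    return inter / (area_t + area_r - inter + 1e-9)

def timing_edge(template, candidates):
    """Select the candidate area with the largest combined matching degree as the
    inter-frame timing edge; equal weighting of the two terms is an assumption."""
    scores = [semantic_correlation(template["feat"], c["feat"]) +
              position_correlation(template["box"], c["box"]) for c in candidates]
    best = int(np.argmax(scores))
    return best, scores[best]
```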
In the embodiment of the disclosure, the identification object of which the located area and the target area meet the matching degree condition can be determined in the subsequent video frames of the first video, so that the first identification object is obtained, the track tracking of the first identification object is realized, and a technical basis is provided for determining the target object based on the first identification object.
In some optional embodiments, the step S201 further includes:
and e1, after the target areas are obtained, determining the spatial overlapping degree based on the overlapping areas between the target areas in the current video frame.
And e2, analyzing the associated objects in the first identification objects based on the spatial overlapping degree.
In the embodiment of the present disclosure, the spatial overlapping degree of the target areas may be used to characterize the social interaction relationship between the corresponding first recognition objects, and the social interaction relationship may be used as a spatial edge; for example, the social interaction relationship between recognition object A and recognition object B may be sharing the same carrier (e.g., riding the same vehicle).
When the associated objects among the first recognition objects are analyzed based on the spatial overlapping degree, the spatial overlapping degree between each pair of first recognition objects can be determined, and the corresponding spatial edges are determined according to the spatial overlapping degree, thereby constructing the spatial graph corresponding to the current video frame from the spatial edges. If the number of target areas in the current video frame is N, the spatial graph can be expressed as an N x N adjacency matrix, and the spatial graphs of the video frames form the spatial graph sequence corresponding to the first video.
Next, the associated objects among the first recognition objects may be analyzed according to the above-mentioned spatial graph, where an associated object is an object with close social interaction in the current video frame, for example, first recognition objects sharing the same carrier.
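As an illustration, building the N x N adjacency matrix of the spatial graph from pairwise overlap could be sketched as follows; the box format and the use of IoU as the overlap measure follow the description above, while the exact thresholding of edges is left out:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def spatial_graph(boxes):
    """Adjacency matrix of the spatial graph for one video frame: entry (i, j) is
    the spatial overlap of target areas i and j; a non-zero entry acts as a
    spatial edge marking the two first recognition objects as associated."""
    n = len(boxes)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            adj[i, j] = adj[j, i] = iou(boxes[i], boxes[j])
    return adj
```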
In the embodiment of the disclosure, the associated object in the current video frame can be determined, so that events occurring in the first video frame can be better summarized when the first video is subjected to semantic analysis, and the confidence of the determined abnormal events is improved.
In some optional embodiments, as shown in fig. 3, which is a flowchart of determining the first recognition object in the first video corresponding to step S201, the feature group may be output by the target detection process based on saliency fusion of the RGB image and the point cloud data, and the current image feature corresponding to the current video frame may be determined according to the feature group; the specific execution flow is as described above and is not repeated here.
In addition, track tracking may be performed on the first recognition object in the first video based on the human body pose estimation procedure of the timing-corrected pyramid dilated convolution autoencoder, where the coarse detection result is the heat map M, and the heat map M is a sequence containing multiple maps, as shown in fig. 3. After the heat map M is determined, the process of determining the first recognition object according to the position matching degree may be performed based on a residual-connected ConvLSTM (Convolutional Long Short-Term Memory) network, so as to obtain the fine-detection key points corresponding to the heat map M, and track tracking is performed on the first recognition object in the first video according to these key points; the specific track tracking process has been described above and is not repeated here.
In this embodiment, there is provided another target tracking method, which may be used in the above-mentioned monitoring system, and fig. 4 is a flowchart of another target tracking method according to an embodiment of the present invention; as shown in fig. 4, the flow includes the following steps:
in step S401, a first video acquired by a first camera device is acquired, and a first identification object in the first video is determined. Please refer to step S101 in the embodiment shown in fig. 1 in detail, which is not described herein.
In step S402, a target type event in the first video is identified, and a target object matching the target type event is determined based on the object characteristics of the first identified object.
Specifically, the step S402 includes:
step S4021, a semantic reasoning network of a source domain is obtained, migration is carried out according to the semantic reasoning network, and a target reasoning network is obtained, wherein the source domain is used for indicating the domain for carrying out semantic description on events in video frames.
Step S4022, carrying out semantic recognition on the first video through the target inference network to obtain event sentences, and summarizing target type events based on the event sentences.
In the embodiment of the present disclosure, the target inference network corresponds to a target domain; for example, if the source domain is used to indicate that semantic description is performed for events in video frames in general, the target domain corresponding to the target inference network may be used to indicate that semantic description is performed for traffic-related events in video frames.
When the first video is semantically recognized through the target inference network, the obtained event sentence may be a description text; the description text can then be characterized to obtain a text feature, and the text feature is input into a classification head to obtain a target type event (hereinafter, the target type event is simply referred to as an abnormal event), where the classification head is pre-trained based on a corpus.
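The classification step can be pictured as a small linear head over the text feature; the sketch below is an assumed concrete form, where the feature dimension, class count, and softmax output are illustrative choices not stated in the original.

import torch
import torch.nn as nn

class EventClassificationHead(nn.Module):
    # Linear classification head that maps the text feature of an event sentence
    # to event-type scores (e.g. abnormal vs. normal, or finer event classes).
    def __init__(self, text_dim=768, num_event_types=2):
        super().__init__()
        self.fc = nn.Linear(text_dim, num_event_types)

    def forward(self, text_features):  # text_features: (batch, text_dim)
        return self.fc(text_features).softmax(dim=-1)

head = EventClassificationHead()
text_features = torch.randn(1, 768)   # characterization of the description text
event_probs = head(text_features)     # the arg-max class gives the target type event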
Step S403 generates a target feature template based on the object features of the target object in the first video, and determines the target object whose object feature matches the target feature template in a second recognition object of a second video, where the second video is a video captured by at least one second camera associated with the first camera. Please refer to step S103 in the embodiment shown in fig. 1 in detail, which is not described herein.
In the embodiment of the disclosure, compared with the working mode that the events occurring in the video are required to be manually analyzed to determine the abnormal events, the method and the device can automatically summarize the abnormal events in the first video through the target inference network, have higher reliability, effectively reduce the dependence of the monitoring system on manpower and the running cost, and improve the efficiency of summarizing the abnormal events.
In some optional embodiments, step S4021 described above further includes:
And f1, acquiring at least part of parameters in the semantic reasoning network, and initializing model parameters of the reasoning network to be trained based on the at least part of parameters.
And f2, carrying out parameter training on the inference network to be trained through sample data to obtain a target inference network, wherein the sample data comprises video sequences which are arranged in time sequence.
In the embodiment of the disclosure, model parameter setting can be performed on the to-be-trained inference network based on at least part of the parameters in the semantic reasoning network to obtain the initialized to-be-trained inference network. Specifically, the semantic reasoning network may be an ST-SIN (Spatial-Temporal Semantic Inference Network), which includes an encoder and a decoder. The encoder comprises a spatial-temporal relation encoder (hereinafter referred to as STRE) and a Global Context Encoder (GCE): the STRE processes a spatial-temporal scene graph with a spatial-temporal graph convolution network to mine the spatial-temporal relation features among targets, while the GCE takes the video sequence as input and captures global context features. In the decoder, a prediction network is built based on gated recurrent units to output a fine-grained semantic description of the video; the network trained in this way is the semantic reasoning network of the source domain.
Then, model parameters of the inference network to be trained can be set based on at least part of the parameters of the trained semantic reasoning network to obtain an initialized inference network to be trained, and parameter training is carried out on the inference network to be trained through sample data to obtain the target inference network.
Specifically, to enhance cross-domain adaptability, a DAP (Domain Adaptive Module) can be constructed so that the semantic reasoning network is migrated to the above-mentioned target domain. During migration, sample data can be constructed based on a small number of normal video sequences, and the parameters are updated with a self-supervised contrastive loss; it should be understood that the sample data does not include the first video and the second video in the monitoring scene. The network obtained after this training is the target inference network.
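A minimal sketch of the migration step follows. It assumes a PyTorch-style implementation in which shared parameters are copied from the source-domain network into the network to be trained, and the self-supervised contrastive loss takes an NT-Xent-like form over two views of the same normal video sequence; both of these concrete choices go beyond what the original specifies.

import torch
import torch.nn as nn
import torch.nn.functional as F

def init_from_source(source_net: nn.Module, target_net: nn.Module):
    # Initialize the network to be trained with (part of) the source-domain
    # semantic reasoning network's parameters; unmatched layers keep their init.
    src_state = source_net.state_dict()
    tgt_state = target_net.state_dict()
    shared = {k: v for k, v in src_state.items()
              if k in tgt_state and v.shape == tgt_state[k].shape}
    tgt_state.update(shared)
    target_net.load_state_dict(tgt_state)

def contrastive_loss(z1, z2, temperature=0.07):
    # Self-supervised contrastive loss between two views of the same normal
    # video sequence (an assumed NT-Xent-style form of the loss).
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))         # matching views lie on the diagonal
    return F.cross_entropy(logits, labels)

After initialization, only this loss and a small set of normal video sequences are needed to adapt the network to the target domain.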
In the embodiment of the disclosure, the semantic reasoning network of the source domain, where samples are sufficient, can be directly applied to abnormal event identification in the target domain through transfer learning; there is no need to manually define monitoring scenes in advance or to collect and annotate event samples, which enhances the generalization capability of the disclosure for different scenes. In addition, the migrated target inference network can provide a detailed semantic description of the abnormal event, so that the detection effect on the abnormal event is improved.
In some optional embodiments, the step S402 further includes:
In step S4023, multi-mode data of the first identification object in the first video is acquired, where the multi-mode data includes data of an appearance and a motion trail of the target object in the first video.
And step S4024, extracting the characteristics based on the multi-mode data to obtain the high-dimensional characteristics of the object.
In step S4025, a target high-dimensional feature matched with the target type event is determined in the high-dimensional features of the object, and the target object is determined by the first recognition object corresponding to the target high-dimensional feature, where the motion feature and the appearance feature indicated by the target high-dimensional feature are matched with the semantics corresponding to the target type event.
In an embodiment of the disclosure, the object high-dimensional feature can be obtained by splicing a plurality of low-dimensional features, and the low-dimensional features are obtained by performing feature extraction on the appearance data and the motion trail data; the specific extraction process is described below and is not repeated here.
When determining the target high-dimensional feature matched with the target type event among the object high-dimensional features, an isolation forest (iForest) algorithm can be used to perform initial anomaly detection based on the object high-dimensional features, so as to obtain an initial target object set M and an initial non-target object set N. Then, M and N are used as training data, and the initial detection result is optimized and updated in a self-training learning mode, so that accurate detection of the target object is realized.
Specifically, a label may be set for each target object in the set M and another label for each non-target object in the set N, and the sets M and N are taken as input to train an anomaly score network composed of a convolutional layer and a fully connected layer. Then, the learned anomaly score module is used to generate new anomaly scores for all objects so as to update the sets M and N. The above process may be iterated in a loop until the best scoring module is found.
During each iteration of the training process, it is desirable that the anomaly scores of the objects in M and N be as close as possible to their set labels. To achieve this, the anomaly score module is trained by minimizing an objective function that penalizes the deviation of each object's anomaly score from its label and includes a regularization term, where the anomaly score network consists of a convolutional layer and a fully connected layer, and the regularization parameter is used to avoid overfitting.
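A minimal sketch of this self-training flow follows. It assumes scikit-learn's isolation forest for the initial detection, 1/0-style labels for M and N, and a binary cross-entropy data term with L2 regularization as a stand-in for the objective function; these concrete choices are assumptions and not the exact form used in the original.

import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import IsolationForest

def initial_split(features: np.ndarray):
    # Initial anomaly detection with iForest: -1 -> target (abnormal) set M, +1 -> non-target set N.
    labels = IsolationForest(random_state=0).fit_predict(features)
    return features[labels == -1], features[labels == 1]

class AnomalyScoreNet(nn.Module):
    # Anomaly score network: a 1-D convolution layer over the feature vector
    # followed by a fully connected layer (an assumed concrete shape).
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(1, 8, kernel_size=3, padding=1)
        self.fc = nn.Linear(8 * dim, 1)

    def forward(self, x):  # x: (batch, dim)
        h = torch.relu(self.conv(x.unsqueeze(1)))
        return torch.sigmoid(self.fc(h.flatten(1))).squeeze(-1)

def self_training_round(net, optimizer, M, N, lam=1e-4):
    # One iteration: push scores of objects in M toward their label (1) and of
    # objects in N toward theirs (0), with L2 regularization to avoid overfitting.
    x = torch.tensor(np.concatenate([M, N]), dtype=torch.float32)
    y = torch.tensor([1.0] * len(M) + [0.0] * len(N))
    loss = nn.functional.binary_cross_entropy(net(x), y)
    loss = loss + lam * sum(p.pow(2).sum() for p in net.parameters())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Re-score all objects with the updated net to refresh M and N, then repeat.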
In the embodiment of the disclosure, the multi-mode data input of the first identification object can be utilized to establish a self-training target object positioning framework based on the object high-dimensional feature combination, so that the unsupervised positioning of local target objects with different behavior categories in the first video is realized.
In some alternative embodiments, the multi-modal data includes: a sub-image corresponding to the target object in a video frame of the first video, a human skeleton sequence corresponding to the sub-image, visual characteristic information of the sub-image, and an optical flow sequence when the position of the target object changes between the sub-images; the step S4024 further includes:
And g1, respectively extracting action features corresponding to the human skeleton sequences and color features corresponding to the visual feature information based on the first convolution network, and extracting movement features corresponding to the optical flow sequences based on the second convolution network.
And g2, adding the color features and the motion features to obtain a first fusion feature.
And g3, carrying out weighted summation on the first fusion feature and the action feature through a feature fusion network to obtain a second fusion feature, and splicing the first fusion feature and the second fusion feature to obtain the high-dimensional feature of the object.
In embodiments of the present disclosure, the human skeleton sequence can be obtained through the recognition of the pyramid dilated convolution auto-encoder described above, the visual characteristic information can be the RGB sequence of the sub-images, and the optical flow sequence records the position change of the target object between the sub-images.
The action features corresponding to the human skeleton sequence and the color features corresponding to the visual characteristic information are extracted respectively based on the first convolution network, where the first convolution network may be a spatial-temporal graph convolution network. In addition, the movement features corresponding to the optical flow sequence are extracted based on the second convolution network, where the second convolution network may be a 3D convolution network.
Next, the color features and the movement features may be added to obtain a first fusion feature, the first fusion feature and the action features are weighted and summed through a feature fusion network to obtain a second fusion feature, and the first fusion feature and the second fusion feature are spliced to obtain the object high-dimensional feature.
Specifically, when obtaining the second fusion feature, the first fusion feature and the action features may be input into the feature fusion network, and a channel attention weight is calculated. The channel attention weight is then dot-multiplied with the first fusion feature and with the action features respectively, and the results are spliced to output the fused object high-dimensional feature. The channel attention fusion network here is the network used to perform the weighted summation of the first fusion feature and the action features.
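The channel-attention fusion can be sketched as follows, assuming equally sized feature vectors and a sigmoid-gated weighted summation; the exact attention form is not specified in the original, so this is only one plausible realization with assumed layer sizes.

import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    # Feature fusion network: add color and movement features into a first fusion
    # feature, compute a channel attention weight, take a weighted sum of the first
    # fusion feature and the action feature as the second fusion feature, and splice
    # the two fusion features into the object high-dimensional feature.
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, action_feat, color_feat, motion_feat):  # each (batch, dim)
        first_fusion = color_feat + motion_feat                           # element-wise addition
        w = self.attn(torch.cat([first_fusion, action_feat], dim=-1))     # channel attention weight
        second_fusion = w * first_fusion + (1.0 - w) * action_feat        # weighted summation
        return torch.cat([first_fusion, second_fusion], dim=-1)           # object high-dimensional feature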
In the embodiment of the disclosure, the multi-mode data input of the first identification object can be utilized to establish a self-training target object positioning framework based on the object high-dimensional feature combination, so that the unsupervised positioning of local target objects with different behavior categories in the first video is realized.
In some alternative embodiments, as shown in fig. 5, a flow chart of identifying a target type event in the first video and determining, among the first identification objects, a target object whose object features match the target type event, corresponding to the step S402, is shown. First, a flow of identifying an abnormal event based on semantic migration may be executed: a spatial-temporal scene semantic reasoning model is trained based on the source domain, and the trained model is the semantic reasoning network corresponding to the source domain.
Next, the semantic reasoning network may be migrated to the target domain, so that the training of the reasoning network to be trained is performed to obtain the target reasoning network, and the specific training process is described in the embodiment of step S402, which is not repeated here, where the video sequence to be tested in fig. 5 is the sample data.
In addition, as shown in fig. 5, abnormal target positioning based on multi-modal feature association may be performed to determine a target object in the first identified object, and a specific target object determining process is described above and will not be described herein.
In addition, the feature set is output through the target detection flow based on salient fusion of the RGB image and the point cloud data, and the current image feature corresponding to the current video frame is determined according to the feature set; the specific execution flow is as described above and is not repeated here.
In this embodiment, there is provided another target tracking method, which may be used in the above-mentioned monitoring system, and fig. 6 is a flowchart of another target tracking method according to an embodiment of the present invention; as shown in fig. 6, the flow includes the following steps:
In step S601, a first video acquired by a first image capturing device is acquired, and a first recognition object in the first video is determined. Please refer to step S101 in the embodiment shown in fig. 1 in detail, which is not described herein.
In step S602, a target type event in the first video is identified, and a target object matching the target type event is determined based on the object characteristics of the first identified object. Please refer to step S102 in the embodiment shown in fig. 1 in detail, which is not described herein.
Step S603 generates a target feature template based on the object features of the target object in the first video, and determines a target object whose object features match the target feature template in a second identified object of a second video, where the second video is a video captured by at least one second camera associated with the first camera.
Specifically, the step S603 includes:
in step S6031, a sub-image corresponding to the target object is captured in the video frame of the first video.
Step S6032, extracting object features of the target object based on the sub-image, and generating a target feature template of the target object according to the object features.
In the embodiment of the present disclosure, the sub-image may be an image obtained by intercepting the target area of the target object in the corresponding video frame; when the sub-images are intercepted, frame-by-frame interception may be performed on the video frames in the first video, so as to obtain a sub-image sequence.
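For illustration, the frame-by-frame interception can be sketched as below, assuming the target areas are available as per-frame bounding boxes; the use of OpenCV and the dictionary format are assumptions for the sake of the example.

import cv2

def crop_target_sequence(video_path, target_areas):
    # Intercept, frame by frame, the sub-image where the target object's target
    # area is located, producing a sub-image sequence.
    # target_areas: dict mapping frame index -> (x1, y1, x2, y2).
    cap = cv2.VideoCapture(video_path)
    sub_images, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in target_areas:
            x1, y1, x2, y2 = target_areas[idx]
            sub_images.append(frame[y1:y2, x1:x2].copy())
        idx += 1
    cap.release()
    return sub_images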
Specifically, as shown in fig. 7, a flow chart of generating the target feature template and tracking the target object according to the target feature template, corresponding to the step S603, is shown. The target template may be generated based on a flow of joint characterization of the target appearance and the trajectory model by the Transformer-LSTM network (where the LSTM is a ConvLSTM, Convolutional Long Short-Term Memory, network), and the camera C1 is the first image capturing device.
In the embodiment of the disclosure, it is considered that the conventional target handover method usually uses only an appearance model to perform target matching, and the appearance model degrades when the target object has a disordered movement track, is frequently occluded, and so on. Therefore, the present disclosure establishes the target feature template of the target object so as to track the target object in the second video according to the target feature template, thereby realizing robust target perception of the monitoring scene.
In some alternative embodiments, step S6032 includes:
Step h1, extracting motion track features of a target object in a plurality of sub-images, and extracting visual features of the sub-images, wherein the visual features comprise: color features and/or texture features.
And h2, determining a target feature template of the target object according to the motion trail features and the visual features.
Specifically, the step h2 includes:
(1) Fusing the motion trail features with the visual features to obtain target features;
(2) And generating the target feature template according to the target features.
In the embodiment of the disclosure, the motion trail features of the target object can be determined according to the position change of the target object in the sub-image sequence. Specifically, a trajectory model can be built according to the position change of the target object in the sub-image sequence, and the trajectory model is input into the above Transformer-LSTM network, so that the corresponding deep trajectory characterization is extracted through the multi-layer LSTM network to obtain the motion trail features.
Meanwhile, the visual features of the target object in the sub-images can be extracted through the Transformer-LSTM network. Specifically, the RGB image and the gray image of each sub-image are first recognized to obtain the visual characteristic information of the sub-image, and the visual characteristic information is input into the above Transformer-LSTM network, so that the corresponding color features and/or texture features are extracted through the vision Transformer network to obtain the visual features.
Next, the motion trail features and the visual features can be fused to obtain the target features; the specific feature fusion manner only needs to be realizable, and this disclosure does not limit it in detail.
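A minimal sketch of the joint characterization follows, assuming box coordinates as the trajectory input, a generic Transformer encoder standing in for the vision Transformer branch, and a simple concatenation-plus-linear fusion; the dimensions and pooling choices are assumptions.

import torch
import torch.nn as nn

class JointTemplateEncoder(nn.Module):
    # Joint characterization of trajectory and appearance: a multi-layer LSTM encodes
    # the position sequence into a motion-trail feature, a Transformer encoder encodes
    # per-frame visual features, and the two are fused into the target feature template.
    def __init__(self, traj_dim=4, vis_dim=256, hid=128):
        super().__init__()
        self.traj_lstm = nn.LSTM(traj_dim, hid, num_layers=2, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model=vis_dim, nhead=4, batch_first=True)
        self.vis_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fuse = nn.Linear(hid + vis_dim, hid)

    def forward(self, traj_seq, vis_seq):
        # traj_seq: (batch, T, 4) box positions; vis_seq: (batch, T, vis_dim) per-frame visual features.
        _, (h_traj, _) = self.traj_lstm(traj_seq)             # deep trajectory characterization
        f_traj = h_traj[-1]                                   # (batch, hid)
        f_vis = self.vis_encoder(vis_seq).mean(dim=1)         # pooled visual feature
        return self.fuse(torch.cat([f_traj, f_vis], dim=-1))  # target feature template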
In the embodiment of the disclosure, it is considered that the conventional target handover method usually uses only an appearance model to perform target matching, and the appearance model degrades when the target object has a disordered movement track, is frequently occluded, and so on. Therefore, the present disclosure exploits the complementary advantages of the appearance and trajectory models based on the joint characterization of appearance features and motion trail features by the Transformer-LSTM network, thereby overcoming problems such as occlusion and realizing robust target perception of the monitored scene.
In some optional embodiments, the step S603 further includes:
step S6033, obtaining target characteristics corresponding to the target object based on the target characteristic template, and extracting object characteristics of the second identification object based on the second video.
In step S6034, a third recognition object whose similarity between the object feature and the target feature satisfies the similarity condition is determined among the second recognition objects, and the third recognition object is determined as the target object.
Specifically, the step S6034 includes:
And i1, performing feature space conversion on the object features through a first sub-network in the twin network to obtain object feature vectors, and performing feature space conversion on the target features through a second sub-network in the twin network to obtain the target feature vectors.
And i2, calculating the absolute value of the difference value of the object feature vector and the target feature vector, and mapping the absolute value of the difference value to obtain a mapping value.
And i3, determining a third identification object with the mapping value meeting the similar condition in the second identification objects, and determining the third identification object as the target object.
In the embodiment of the present disclosure, as shown in fig. 7, a second video may first be acquired by the camera C2 (i.e., the second image capturing device), the second identification objects in the second video are identified, and the target area where each second identification object is located is intercepted frame by frame to obtain the corresponding sub-image sequence; the specific process of determining the sub-image sequence is described in step S603 and is not repeated here.
Next, the motion trail features and the appearance features of the second identification object can be extracted through the Transformer-LSTM network based on the sub-image sequence corresponding to the second identification object, and the motion trail features and the appearance features are fused to obtain a fusion feature; the specific feature extraction and fusion process is described in the embodiment corresponding to step S6032 and is not repeated here.
After the target feature and the fusion feature of the second identification object are determined, a gated recurrent convolution twin network (i.e., the twin network described above) may be established to perform spatial-temporal feature matching between them. Specifically, as shown in fig. 7, sub-network 1 (the first sub-network) in the twin network performs characterization processing on the fusion feature of the second identification object to obtain the object feature vector, and sub-network 2 (the second sub-network) performs characterization processing on the target feature to obtain the target feature vector. Sub-network 1 and sub-network 2 contain the same number of ConvGRU units, which can simultaneously take the appearance and trajectory information of the feature map into account when vectorizing the features.
Next, the decision layer may calculate the matching degree between the object feature vector and the target feature vector, so as to determine, among the second identification objects, a third identification object whose mapping value satisfies the similarity condition. Specifically, the decision layer performs a similarity measurement on the two feature vectors through a fully connected layer and a sigmoid function: the absolute value of the difference between the object feature vector and the target feature vector is mapped to a mapping value in a similarity interval, the third identification object satisfying the similarity condition is determined according to the mapping value, and the third identification object is determined as the target object. For example, if the similarity interval is [0,1], a second identification object whose corresponding mapping value is greater than 0.8 may be determined as the third identification object.
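The decision layer can be sketched as follows, assuming the two sub-networks output equally sized vectors and the mapping is a single fully connected layer followed by a sigmoid; the dimension and the 0.8 threshold simply mirror the example above.

import torch
import torch.nn as nn

class SiameseDecision(nn.Module):
    # Decision layer of the twin network: map the absolute difference of the two
    # feature vectors through a fully connected layer and a sigmoid into [0, 1].
    def __init__(self, dim=128):
        super().__init__()
        self.fc = nn.Linear(dim, 1)

    def forward(self, target_vec, object_vec):
        diff = torch.abs(target_vec - object_vec)        # absolute value of the difference
        return torch.sigmoid(self.fc(diff)).squeeze(-1)  # mapping value in the similarity interval

# Example: keep second identification objects whose mapping value exceeds 0.8.
decision = SiameseDecision()
score = decision(torch.randn(1, 128), torch.randn(1, 128))
is_target = bool(score.item() > 0.8)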
In the embodiment of the disclosure, it is considered that the conventional target handover method usually uses only an appearance model to perform target matching, and the appearance model degrades when the target object has a disordered movement track, is frequently occluded, and so on. Therefore, the present disclosure exploits the complementary advantages of the appearance and trajectory models based on the joint characterization of appearance features and motion trail features by the Transformer-LSTM network, thereby overcoming problems such as occlusion and realizing robust target perception of the monitored scene.
In this embodiment, a target tracking subsystem is provided, which may be used in the above monitoring system, and fig. 8 is a network architecture diagram of the target tracking subsystem according to an embodiment of the present invention, as shown in fig. 8, where the system includes: the system comprises a video data acquisition and preprocessing module, a video intelligent processing module, a management module and a storage module;
the video data acquisition and preprocessing module is used for acquiring the first video through the initial camera C1, acquiring the second video through the relay camera C2, preprocessing the original videos to obtain video data, and storing the video data in the server storage.
In the embodiment of the present disclosure, the initial camera C1 is the first imaging device, and the relay camera C2 is the second imaging device. When the video data acquisition and preprocessing module performs preprocessing, the original video stream can be encoded and decoded to obtain an mp4 video file, and operations such as distortion correction and blur processing can be performed on the original video to optimize the video image quality.
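As an illustrative sketch of such preprocessing, assuming OpenCV is used and the camera's intrinsic calibration is available, distortion correction and mild denoising could look like this; the specific operations and parameters are assumptions, not the claimed preprocessing steps.

import cv2

def preprocess_stream(src_path, dst_path, camera_matrix, dist_coeffs):
    # Decode the original stream, correct lens distortion, lightly denoise each
    # frame, and re-encode the result to an mp4 file.
    # camera_matrix / dist_coeffs: assumed to come from a prior camera calibration.
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.undistort(frame, camera_matrix, dist_coeffs)            # distortion correction
        frame = cv2.fastNlMeansDenoisingColored(frame, None, 3, 3, 7, 21)   # mild denoising
        out.write(frame)
    cap.release()
    out.release()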
The video intelligent processing module is used for dividing the first video and the second video to obtain video fragments, analyzing the target type event and the target object in the first video according to the video fragments, and tracking the target object in the video fragments of the second video.
In the embodiment of the disclosure, the process that the video intelligent processing module analyzes the target type event and the target object in the first video and tracks the target object in the video segment of the second video is described in the embodiment corresponding to fig. 1, which is not repeated herein.
And the management module is used for providing bottom-layer logic support for the video data acquisition and preprocessing module, the video intelligent processing module and the storage module.
In the embodiment of the disclosure, the management module may be responsible for providing an organization management service, such as process management, and coordinating the priority of different module operations to avoid process errors.
The storage module is used for storing the video data and transmitting the video data to the video data acquisition and preprocessing module, the video intelligent processing module and the management module through the network.
Fig. 9 is a schematic diagram of a tracking flow corresponding to the target tracking subsystem, and the specific tracking flow is described above and will not be repeated here.
In summary, in the embodiments of the present disclosure, a first video acquired by a first image capturing device may first be acquired and a first identification object in the first video determined; then, a target type event in the first video is identified, and a target object matched with the target type event is determined based on the object features of the first identification object, so that a target feature template is generated based on the target object, and the target object whose object features match the target feature template is determined among the second identification objects of a second video, where the target type event may be an abnormal event and the second video is acquired through at least one second image capturing device associated with the first image capturing device. Therefore, the present disclosure can efficiently realize multi-camera cooperative abnormal event detection and relay detection of the target object, can improve the intelligence level of the existing monitoring system, and improves the video monitoring system.
The present embodiment also provides a target tracking device, which is used to implement the foregoing embodiments and preferred embodiments, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a target tracking apparatus, as shown in fig. 10, including:
A first determining module 1001, configured to acquire a first video acquired by a first camera device, and determine a first recognition object in the first video;
An identifying module 1002, configured to identify a target type event in the first video, and determine a target object that matches the target type event based on an object feature of the first identified object;
a second determining module 1003, configured to generate a target feature template based on an object feature of a target object in the first video, and determine, in a second identified object of a second video, the target object whose object feature matches the target feature template, where the second video is a video acquired by at least one second camera associated with the first camera.
In some alternative embodiments, the second determining module 1003 includes:
The intercepting unit is used for intercepting a sub-image corresponding to the target object in a video frame of the first video;
And the first extraction unit is used for extracting the object characteristics of the target object based on the sub-image and generating a target characteristic template of the target object according to the object characteristics.
In some alternative embodiments, the first extraction unit comprises:
The first extraction subunit is used for extracting motion track features of the target object in the multiple sub-images and extracting visual features of the sub-images, wherein the visual features comprise: color features and/or texture features;
And the first determination subunit is used for determining a target feature template of the target object according to the motion trail features and the visual features.
In some alternative embodiments, the first determining subunit is configured to:
Fusing the motion trail features with the visual features to obtain target features;
and generating the target feature template according to the target features.
In some alternative embodiments, the second determining module 1003 includes:
The second extraction unit is used for acquiring target characteristics corresponding to the target object based on the target characteristic template and extracting object characteristics of a second identification object based on a second video;
and the first determining unit is used for determining a third identification object with the similarity of the object characteristics and the target characteristics meeting the similarity condition in the second identification object, and determining the third identification object as the target object.
In some alternative embodiments, the first determining unit includes:
The conversion subunit is used for carrying out feature space conversion on the object features through a first sub-network in the twin network to obtain object feature vectors, and carrying out feature space conversion on the target features through a second sub-network in the twin network to obtain target feature vectors;
the second determining subunit is used for calculating the absolute value of the difference value of the object feature vector and the target feature vector, and mapping the absolute value of the difference value to obtain a mapping value;
And a third determination subunit configured to determine, among the second recognition objects, a third recognition object whose mapping value satisfies the similarity condition, and determine the third recognition object as the target object.
In some alternative embodiments, the first determining module 1001 includes:
The first acquisition unit is used for acquiring point cloud data corresponding to a current video frame in the first video, wherein the point cloud data are acquired through a point cloud device associated with a first camera device;
The fusion unit is used for fusing the point cloud data with the current video frame to obtain a current fusion feature map, wherein the current fusion feature map is used for indicating a fusion result of the feature map aiming at the first video and the corresponding point cloud data;
The first identification unit is used for carrying out identification based on the current fusion feature map to obtain a target area of a first identification object in the current video frame;
the second recognition unit is used for recognizing a first recognition object matched with the target area in a subsequent video frame of the first video, wherein the subsequent video frame is a video frame whose corresponding time node in the first video is located after the current video frame.
In some alternative embodiments, the fusion unit comprises:
The conversion subunit is used for respectively converting the point cloud data and the first video into corresponding two-dimensional feature graphs;
And the weighted summation subunit is used for weighted summation of the pixel points of the two-dimensional feature image to obtain the current fusion feature image.
In some alternative embodiments, the first identification unit comprises:
The prediction subunit is used for acquiring a history fusion feature map corresponding to a preset history video frame and predicting the target object position of a first identification object in the current fusion feature map according to the object position of the first identification object in the history fusion feature map, wherein the preset history video frame is a video frame at the moment before the current video frame in the first video;
the identification subunit is used for carrying out object identification based on the current fusion feature map to obtain a region to be selected;
And the correction subunit is used for correcting the area to be selected based on the position of the target object to obtain the target area.
In some alternative embodiments, the first identification unit further comprises:
The second extraction subunit is used for, after the history fusion feature map at the preset history moment is obtained, extracting the feature points in the history fusion feature map through the feature extraction network to obtain the history image features, wherein the feature points in the history fusion feature map are extracted with different extraction precisions;
and a construction subunit for constructing a feature set based on the historical image features to perform the step of predicting the target object position of the first recognition object in the current fusion feature map from the object position of the first recognition object in the historical fusion feature map according to the feature set.
In some alternative embodiments, the second identification unit further comprises:
an analysis subunit, configured to analyze, based on the target area, a position coordinate of the first recognition object in the current video frame and semantic information, where the semantic information is information indicating semantic description for a pixel feature of the target area;
and the fourth determining subunit is used for determining the identification object of which the area and the target area meet the matching degree condition in the subsequent video frame to obtain the first identification object, wherein the matching degree condition comprises the position matching degree and the semantic matching degree.
In some alternative embodiments, the first determining module 1001 further includes:
The second determining unit is used for determining the spatial overlapping degree based on the overlapping region of the target region between the current video frames after the target region is obtained;
and the analysis unit is used for analyzing the associated objects in the first identification objects based on the spatial overlapping degree.
In some alternative embodiments, the identification module 1002 further comprises:
The migration unit is used for acquiring a semantic reasoning network of a source domain and migrating according to the semantic reasoning network to obtain a target reasoning network, wherein the source domain is used for indicating the domain for carrying out semantic description on the event in the video frame;
the third recognition unit is used for carrying out semantic recognition on the first video through the target reasoning network to obtain event sentences, and summarizing target type events based on the event sentences.
In some alternative embodiments, the migration unit further comprises:
The initialization subunit is used for acquiring at least part of parameters in the semantic reasoning network and initializing model parameters of the reasoning network to be trained based on the at least part of parameters;
And the training sub-unit is used for carrying out parameter training on the inference network to be trained through sample data to obtain a target inference network, wherein the sample data comprises video sequences which are arranged in time sequence.
In some alternative embodiments, the identification module 1002 further comprises:
The second acquisition unit is used for acquiring multi-mode data of the first identification object in the first video, wherein the multi-mode data comprises the appearance of the target object in the first video and the data of the motion trail;
the third extraction unit is used for extracting the characteristics based on the multi-mode data to obtain the high-dimensional characteristics of the object;
And the third determining unit is used for determining a target high-dimensional feature matched with the target type event in the object high-dimensional features, and determining a first identification object corresponding to the target high-dimensional feature as a target object, wherein the motion feature and the appearance feature indicated by the target high-dimensional feature are matched with the semantics corresponding to the target type event.
In some alternative embodiments, the multimodal data includes: a sub-image corresponding to the target object in the video frame of the first video, a human skeleton sequence corresponding to the sub-image, visual characteristic information of the sub-image, and an optical flow sequence when the position of the target object is changed between the sub-images; the third extraction unit includes:
The third extraction subunit is used for respectively extracting action features corresponding to the human skeleton sequences and color features corresponding to the visual feature information based on the first convolution network, and extracting motion features corresponding to the optical flow sequences based on the second convolution network;
The adding subunit is used for adding the color characteristics and the motion characteristics to obtain a first fusion characteristic;
and the splicing subunit is used for carrying out weighted summation on the first fusion feature and the action feature through the feature fusion network to obtain a second fusion feature, and splicing the first fusion feature and the second fusion feature to obtain the high-dimensional feature of the object.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The target tracking device in this embodiment is presented in the form of functional units, where a unit may be an ASIC (Application Specific Integrated Circuit), a processor and a memory executing one or more pieces of software or firmware, and/or other devices that can provide the above functions.
The embodiment of the invention also provides computer equipment, which is provided with the target tracking device shown in the figure 10.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention, as shown in fig. 11, the computer device includes: one or more processors 10, memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the computer device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 11.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods shown in implementing the above embodiments.
The memory 20 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device further comprises input means 30 and output means 40. The processor 10, memory 20, input device 30, and output device 40 may be connected by a bus or other means, for example in fig. 11.
The input device 30 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus, such as a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointer stick, one or more mouse buttons, a trackball, a joystick, and the like. The output means 40 may include a display device, auxiliary lighting means (e.g., LEDs), tactile feedback means (e.g., vibration motors), and the like. Such display devices include, but are not limited to, liquid crystal displays, light emitting diode displays, and plasma displays. In some alternative implementations, the display device may be a touch screen.
The embodiments of the present invention also provide a computer-readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code originally stored in a remote storage medium or a non-transitory machine-readable storage medium and downloaded through a network to be stored in a local storage medium, so that the method described herein can be processed by software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or special-purpose hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk, or the like; further, the storage medium may also comprise a combination of memories of the kinds described above. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.
Portions of the present invention may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or aspects in accordance with the present invention by way of operation of the computer. Those skilled in the art will appreciate that the form of computer program instructions present in a computer readable medium includes, but is not limited to, source files, executable files, installation package files, etc., and accordingly, the manner in which the computer program instructions are executed by a computer includes, but is not limited to: the computer directly executes the instruction, or the computer compiles the instruction and then executes the corresponding compiled program, or the computer reads and executes the instruction, or the computer reads and installs the instruction and then executes the corresponding installed program. Herein, a computer-readable medium may be any available computer-readable storage medium or communication medium that can be accessed by a computer.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (17)

1. A method of target tracking, the method comprising:
Acquiring a first video acquired by a first camera device and determining a first identification object in the first video, wherein the acquiring the first video acquired by the first camera device and determining the first identification object in the first video comprises:
acquiring point cloud data corresponding to a current video frame in the first video, wherein the point cloud data are data acquired through a point cloud device associated with the first camera device;
fusing the point cloud data with the current video frame to obtain a current fused feature map, wherein the current fused feature map is used for indicating a fusion result of the feature map aiming at the first video and the corresponding point cloud data;
Identifying based on the current fusion feature map to obtain a target area of a first identification object in the current video frame, wherein the identifying based on the current fusion feature map to obtain the target area of the first identification object in the current video frame comprises the following steps:
Acquiring a history fusion feature map corresponding to a preset history video frame;
Extracting features of a history fusion feature map corresponding to a first preset history time through a first network in a target detection network to obtain a first history image feature, extracting features of a history fusion feature map corresponding to a second preset history time through a second network in the target detection network to obtain a second history image feature, wherein a channel multiplier factor contained in a separable convolution layer of the first network is larger than a channel multiplier factor contained in a separable convolution layer of the second network, and the first preset history time is before the second preset history time;
constructing a feature set based on the first historical image feature and the second historical image feature;
Predicting the target object position of a first identification object in the current fusion feature map according to the object position of the first identification object in the feature set and the history fusion feature map, wherein the preset history video frame is a video frame at a moment before the current video frame in the first video;
performing object recognition based on the current fusion feature map to obtain a region to be selected;
Correcting the area to be selected based on the target object position to obtain a target area;
identifying a first identification object matched with the target area in a subsequent video frame of the first video, wherein the subsequent video frame is that a corresponding time node in the first video is positioned behind the current video frame;
Identifying a target type event in the first video, and determining a target object matched with the target type event based on object features of the first identified object;
Generating a target feature template based on object features of the target object in the first video, and determining the target object with object features matched with the target feature template in a second identification object of a second video, wherein the second video is acquired by at least one second camera associated with the first camera.
2. The method of claim 1, wherein the generating a target feature template based on object features of the target object in the first video comprises:
Intercepting a sub-image corresponding to the target object in a video frame of the first video;
And extracting object characteristics of the target object based on the sub-image, and generating a target characteristic template of the target object according to the object characteristics.
3. The method of claim 2, wherein the extracting object features of the target object based on the sub-image and generating a target feature template of the target object from the object features comprises:
Extracting motion track features of the target object in a plurality of sub-images, and extracting visual features of the sub-images, wherein the visual features comprise: color features and/or texture features;
and determining a target feature template of the target object according to the motion trail features and the visual features.
4. A method according to claim 3, wherein said determining a target feature template for the target object from the motion profile features and the visual features comprises:
Fusing the motion trail features and the visual features to obtain target features;
and generating the target feature template according to the target feature.
5. The method of claim 1, wherein said determining the target object in the second identified object of the second video for which object features match the target feature template comprises:
Acquiring target features corresponding to the target object based on the target feature template, and extracting object features of the second identification object based on the second video;
And determining a third recognition object with the similarity of object characteristics and the target characteristics meeting the similarity condition in the second recognition object, and determining the third recognition object as the target object.
6. The method of claim 5, wherein determining a third recognition object of the second recognition object whose similarity between object features and the target features satisfies a similarity condition, and determining the third recognition object as the target object, comprises:
Performing feature space conversion on the object feature through a first sub-network in the twin network to obtain an object feature vector, and performing feature space conversion on the target feature through a second sub-network in the twin network to obtain a target feature vector;
calculating the absolute value of the difference value of the object feature vector and the target feature vector, and mapping the absolute value of the difference value to obtain a mapping value;
And determining a third identification object with the mapping value meeting the similar condition in the second identification objects, and determining the third identification object as the target object.
7. The method according to claim 1, wherein the fusing the current video frame with the point cloud data to obtain a current fused feature map includes:
respectively converting the point cloud data and the first video into corresponding two-dimensional feature maps;
and carrying out weighted summation on the pixel points of the two-dimensional feature map to obtain the current fusion feature map.
8. The method of claim 1, wherein the identifying a first identification object in a subsequent video frame of the first video that matches the target region comprises:
analyzing position coordinates of the first identification object in the current video frame and semantic information based on the target area, wherein the semantic information is used for indicating information for carrying out semantic description on pixel point characteristics of the target area;
And determining the identification object of which the located area and the target area meet the matching degree condition in the subsequent video frame to obtain a first identification object, wherein the matching degree condition comprises position matching degree and semantic matching degree.
9. The method according to claim 1, wherein the method further comprises:
after obtaining a target region, determining a spatial overlap degree based on an overlap region of the target region between the current video frames;
And analyzing the associated objects in the first identification objects based on the spatial overlapping degree.
10. The method of claim 1, wherein the identifying the object type event in the first video comprises:
Acquiring a semantic reasoning network of a source domain, and migrating according to the semantic reasoning network to obtain a target reasoning network, wherein the source domain is used for indicating the domain for carrying out semantic description on events in video frames;
And carrying out semantic recognition on the first video through the target reasoning network to obtain event sentences, and summarizing target type events based on the event sentences.
11. The method of claim 10, wherein said migrating according to the semantic reasoning network results in a target reasoning network, comprising:
acquiring at least part of parameters in the semantic reasoning network, and initializing model parameters of the reasoning network to be trained based on the at least part of parameters;
and carrying out parameter training on the to-be-trained inference network through sample data to obtain a target inference network, wherein the sample data comprises video sequences which are arranged in time sequence.
12. The method of claim 1, wherein the determining a target object that matches the target type event based on the object characteristics of the first recognition object comprises:
Acquiring multi-modal data of the first identification object in the first video, wherein the multi-modal data comprises the appearance of the target object in the first video and the data of a motion trail;
extracting features based on the multi-mode data to obtain high-dimensional features of the object;
and determining a target high-dimensional feature matched with the target type event from the target high-dimensional features, and determining a first identification object corresponding to the target high-dimensional feature as a target object, wherein the motion feature indicated by the target high-dimensional feature and the object feature are matched with the semantics corresponding to the target type event.
13. The method of claim 12, wherein the multi-modal data comprises: sub-images corresponding to the target object in video frames of the first video, a human skeleton sequence corresponding to the sub-images, visual feature information of the sub-images, and an optical flow sequence describing the position changes of the target object between the sub-images;
the extracting features based on the multi-modal data to obtain the object high-dimensional features comprises:
extracting, based on a first convolution network, action features corresponding to the human skeleton sequence and color features corresponding to the visual feature information, and extracting, based on a second convolution network, motion features corresponding to the optical flow sequence;
adding the color features and the motion features to obtain a first fusion feature;
and performing weighted summation on the first fusion feature and the action features through a feature fusion network to obtain a second fusion feature, and concatenating the first fusion feature and the second fusion feature to obtain the object high-dimensional features.
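Claim 13's fusion pipeline (action and color features from a first convolution network, motion features from a second, element-wise addition, a learned weighted summation, then concatenation) might be sketched as follows; the layer sizes, pooling choices and module names are assumptions, not the claimed architecture.

    import torch
    import torch.nn as nn

    class MultiModalFusion(nn.Module):
        """Rough sketch of claim 13: fuse action, color and motion features
        into one high-dimensional object feature."""
        def __init__(self, dim=256):
            super().__init__()
            # First convolution network branches: skeleton -> action, sub-image -> color.
            self.action_branch = nn.Sequential(nn.Conv1d(3, dim, 3, padding=1),
                                               nn.AdaptiveAvgPool1d(1), nn.Flatten())
            self.color_branch = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1),
                                              nn.AdaptiveAvgPool2d(1), nn.Flatten())
            # Second convolution network: optical flow sequence -> motion.
            self.motion_branch = nn.Sequential(nn.Conv2d(2, dim, 3, padding=1),
                                               nn.AdaptiveAvgPool2d(1), nn.Flatten())
            # Feature fusion network producing the weights of the weighted summation.
            self.fusion_weights = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

        def forward(self, skeleton, sub_image, optical_flow):
            action = self.action_branch(skeleton)        # from human skeleton sequence
            color = self.color_branch(sub_image)         # from visual feature information
            motion = self.motion_branch(optical_flow)    # from optical flow sequence
            first_fusion = color + motion                 # element-wise addition
            w = self.fusion_weights(torch.cat([first_fusion, action], dim=-1))
            second_fusion = w[:, :1] * first_fusion + w[:, 1:] * action  # weighted sum
            return torch.cat([first_fusion, second_fusion], dim=-1)      # high-dim feature

The output dimension is twice the branch dimension because the first fusion feature and the second fusion feature are concatenated at the end.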
14. A target tracking device, the device comprising:
The first determining module is configured to acquire a first video acquired by a first camera device and determine a first identification object in the first video, wherein acquiring the first video acquired by the first camera device and determining the first identification object in the first video comprises: acquiring point cloud data corresponding to a current video frame in the first video, wherein the point cloud data are acquired through a point cloud device associated with the first camera device; fusing the point cloud data with the current video frame to obtain a current fusion feature map, wherein the current fusion feature map indicates a fusion result of the feature map of the first video and the corresponding point cloud data; performing identification based on the current fusion feature map to obtain a target region of the first identification object in the current video frame, wherein the performing identification based on the current fusion feature map to obtain the target region of the first identification object in the current video frame comprises: acquiring history fusion feature maps corresponding to preset history video frames; extracting features of the history fusion feature map corresponding to a first preset history time through a first network in a target detection network to obtain a first history image feature, and extracting features of the history fusion feature map corresponding to a second preset history time through a second network in the target detection network to obtain a second history image feature, wherein a channel multiplier factor of a separable convolution layer of the first network is larger than a channel multiplier factor of a separable convolution layer of the second network, and the first preset history time is before the second preset history time; constructing a feature set based on the first history image feature and the second history image feature; predicting a target object position of the first identification object in the current fusion feature map according to the feature set and the object position of the first identification object in the history fusion feature maps, wherein the preset history video frames are video frames at moments before the current video frame in the first video; performing object identification based on the current fusion feature map to obtain a region to be selected; correcting the region to be selected based on the target object position to obtain the target region; and identifying, in a subsequent video frame of the first video, a first identification object that matches the target region, wherein the subsequent video frame is a video frame whose corresponding time node in the first video is after the current video frame;
the identification module is configured to identify a target type event in the first video and determine a target object matching the target type event based on object features of the first identification object;
and the second determining module is configured to generate a target feature template based on the object features of the target object in the first video and to determine, among second identification objects of a second video, a target object whose object features match the target feature template, wherein the second video is acquired through at least one second camera device associated with the first camera device.
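The first determining module of claim 14 extracts history image features through two networks whose separable convolution layers use different channel multiplier factors, the larger multiplier being applied to the earlier preset history time. A minimal, hypothetical sketch of such a pair of branches in PyTorch:

    import torch
    import torch.nn as nn

    class SeparableConv(nn.Module):
        """Depthwise separable convolution with an explicit channel multiplier factor."""
        def __init__(self, in_ch, out_ch, channel_multiplier):
            super().__init__()
            self.depthwise = nn.Conv2d(in_ch, in_ch * channel_multiplier, kernel_size=3,
                                       padding=1, groups=in_ch)
            self.pointwise = nn.Conv2d(in_ch * channel_multiplier, out_ch, kernel_size=1)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))

    # Two branches of the target detection network: the first network (for the history
    # fusion feature map at the earlier preset history time) uses a larger channel
    # multiplier factor than the second network, as recited in claim 14.
    first_network = SeparableConv(in_ch=64, out_ch=128, channel_multiplier=4)
    second_network = SeparableConv(in_ch=64, out_ch=128, channel_multiplier=2)

    fused_map_t1 = torch.randn(1, 64, 32, 32)   # history fusion feature map, earlier time
    fused_map_t2 = torch.randn(1, 64, 32, 32)   # history fusion feature map, later time
    feature_set = [first_network(fused_map_t1), second_network(fused_map_t2)]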
15. A computer device, comprising:
a memory and a processor in communication with each other, the memory having stored therein computer instructions, and the processor executing the computer instructions to perform the target tracking method of any one of claims 1 to 13.
16. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the target tracking method of any one of claims 1 to 13.
17. A computer program product comprising computer instructions for causing a computer to perform the target tracking method of any one of claims 1 to 13.
CN202410533582.3A 2024-04-30 Target tracking method, apparatus, computer device, storage medium and program product Active CN118115927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410533582.3A CN118115927B (en) 2024-04-30 Target tracking method, apparatus, computer device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410533582.3A CN118115927B (en) 2024-04-30 Target tracking method, apparatus, computer device, storage medium and program product

Publications (2)

Publication Number Publication Date
CN118115927A (en) 2024-05-31
CN118115927B (en) 2024-07-09

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105979210A (en) * 2016-06-06 2016-09-28 深圳市深网视界科技有限公司 Pedestrian identification system based on multi-ball multi-gun camera array
CN114693746A (en) * 2022-03-31 2022-07-01 西安交通大学 Intelligent monitoring system and method based on identity recognition and cross-camera target tracking

Similar Documents

Publication Publication Date Title
Lian et al. Road extraction methods in high-resolution remote sensing images: A comprehensive review
CN107545582B (en) Video multi-target tracking method and device based on fuzzy logic
CN109800689B (en) Target tracking method based on space-time feature fusion learning
Mei et al. Semantic segmentation of 3D LiDAR data in dynamic scene using semi-supervised learning
CN110147743A (en) Real-time online pedestrian analysis and number system and method under a kind of complex scene
KR102462934B1 (en) Video analysis system for digital twin technology
Bešić et al. Dynamic object removal and spatio-temporal RGB-D inpainting via geometry-aware adversarial learning
CN107133569A (en) The many granularity mask methods of monitor video based on extensive Multi-label learning
CN103988232A (en) IMAGE MATCHING by USING MOTION MANIFOLDS
CN110705412A (en) Video target detection method based on motion history image
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN111652181B (en) Target tracking method and device and electronic equipment
US20230095533A1 (en) Enriched and discriminative convolutional neural network features for pedestrian re-identification and trajectory modeling
Jemilda et al. Moving object detection and tracking using genetic algorithm enabled extreme learning machine
Xu et al. Segment as points for efficient and effective online multi-object tracking and segmentation
Lin et al. Pedestrian detection by exemplar-guided contrastive learning
Vainstein et al. Modeling video activity with dynamic phrases and its application to action recognition in tennis videos
CN103500456A (en) Object tracking method and equipment based on dynamic Bayes model network
CN110688512A (en) Pedestrian image search algorithm based on PTGAN region gap and depth neural network
Wu et al. Self-learning and explainable deep learning network toward the security of artificial intelligence of things
Lu et al. Hybrid deep learning based moving object detection via motion prediction
CN115294176B (en) Double-light multi-model long-time target tracking method and system and storage medium
CN118115927B (en) Target tracking method, apparatus, computer device, storage medium and program product
Jiang et al. An optimized higher order CRF for automated labeling and segmentation of video objects
Al Najjar et al. A hybrid adaptive scheme based on selective Gaussian modeling for real-time object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant