CN110428448B - Target detection tracking method, device, equipment and storage medium

Target detection tracking method, device, equipment and storage medium

Info

Publication number
CN110428448B
Authority
CN
China
Prior art keywords
camera
global
target object
similarity
information
Prior art date
Legal status
Active
Application number
CN201910703381.2A
Other languages
Chinese (zh)
Other versions
CN110428448A (en)
Inventor
黄湘琦
周文
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910703381.2A
Publication of CN110428448A
Application granted
Publication of CN110428448B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of this application disclose a target detection and tracking method, apparatus, device, and storage medium, belonging to the technical field of computer vision. The method comprises the following steps: acquiring a video image acquired by a first camera; detecting and tracking a target object in the video image acquired by the first camera to obtain local tracking information of the target object; calculating the similarity between the target object and the objects in a global tracking queue according to the local tracking information of the target object; and assigning a global identifier to the target object according to the similarity. With this technical solution, when target detection and tracking is performed in a multi-camera scene, not only the spatial information of the target object but also its temporal and appearance information is considered, which reduces the probability of data mismatching and improves the accuracy of target detection and tracking in a multi-camera scene.

Description

Target detection tracking method, device, equipment and storage medium
Technical Field
The embodiments of this application relate to the technical field of computer vision, and in particular to a target detection and tracking method, apparatus, device, and storage medium.
Background
As public-safety requirements keep rising, the area covered by video surveillance keeps expanding. A large surveillance area is usually covered by multiple cameras, and wide-area target detection and tracking can be achieved by processing and analyzing the video images acquired by these cameras.
At present, target detection and tracking schemes for multi-camera scenes in the related art usually rely on camera-based three-dimensional world-coordinate calibration to obtain the spatial position relationship of a tracked target object across cameras. Each camera is calibrated to a three-dimensional world coordinate system in advance; when the system processes and analyzes the data of each camera, the target object under each camera is first converted into the same three-dimensional world coordinate system and is then processed and analyzed in that coordinate system.
In such schemes, whether target objects appearing at different times or under different cameras match is judged mainly from the spatial positions of the target objects, which easily leads to data mismatching; that is, the accuracy of target detection and tracking in a multi-camera scene is poor.
Disclosure of Invention
The embodiments of this application provide a target detection and tracking method, apparatus, device, and storage medium, which can be used to solve the problem in the related art of poor accuracy of target detection and tracking in a multi-camera scene. The technical solutions are as follows:
in one aspect, an embodiment of the present application provides a target detection and tracking method, where the method includes:
acquiring a video image acquired by a first camera;
detecting and tracking a target object in a video image acquired by the first camera to obtain local tracking information of the target object, wherein the local tracking information comprises local appearance information and local spatiotemporal information, the local appearance information is used for indicating appearance characteristics of the target object in the video image acquired by the first camera, and the local spatiotemporal information is used for indicating spatiotemporal characteristics of the target object in the video image acquired by the first camera;
calculating the similarity between the target object and the objects in the global tracking queue according to the local tracking information of the target object; the global tracking queue is used for storing global tracking information of at least one object obtained by tracking based on video images collected by at least one camera, the at least one camera comprises the first camera, the global tracking information comprises global appearance information and global space-time information, the global appearance information is used for indicating global appearance characteristics of the object under the at least one camera, and the global space-time information is used for indicating the latest space-time characteristics of the object under the at least one camera;
and assigning a global identifier to the target object according to the similarity.
In another aspect, an embodiment of the present application provides an apparatus for detecting and tracking a target, where the apparatus includes:
the information acquisition module is used for acquiring a video image acquired by the first camera;
the detection tracking module is used for detecting and tracking a target object in a video image acquired by the first camera to obtain local tracking information of the target object, wherein the local tracking information comprises local appearance information and local spatiotemporal information, the local appearance information is used for indicating appearance characteristics of the target object in the video image acquired by the first camera, and the local spatiotemporal information is used for indicating spatiotemporal characteristics of the target object in the video image acquired by the first camera;
the calculation comparison module is used for calculating the similarity between the target object and the objects in the global tracking queue according to the local tracking information of the target object; the global tracking queue is used for storing global tracking information of at least one object obtained by tracking based on video images collected by at least one camera, the at least one camera comprises the first camera, the global tracking information comprises global appearance information and global space-time information, the global appearance information is used for indicating global appearance characteristics of the object under the at least one camera, and the global space-time information is used for indicating the latest space-time characteristics of the object under the at least one camera;
and the identifier assigning module is used for assigning a global identifier to the target object according to the similarity.
In a possible design, the space-time comparison unit is further configured to, when the spatial position relationship is that the two cameras are adjacent and their fields of view overlap, calculate the global position coordinate of the target object according to the position coordinate of the target object in the image coordinate system corresponding to the first camera and the conversion relationship between the image coordinate system corresponding to the first camera and the global coordinate system; calculate the global position coordinate of the ith object according to the position coordinate of the ith object in the image coordinate system corresponding to the second camera and the conversion relationship between the image coordinate system corresponding to the second camera and the global coordinate system; and calculate the space-time similarity between the target object and the ith object according to the global position coordinate of the target object and the global position coordinate of the ith object.
In a possible design, the space-time comparison unit is further configured to, when the spatial position relationship is that the two cameras are adjacent and their fields of view overlap, convert the position coordinate of the target object in the image coordinate system corresponding to the first camera and the position coordinate of the ith object in the image coordinate system corresponding to the second camera into the image coordinate system corresponding to the same target camera, where the target camera is the first camera or the second camera; and calculate the space-time similarity between the target object and the ith object according to the position coordinate of the target object in the image coordinate system corresponding to the target camera and the position coordinate of the ith object in the image coordinate system corresponding to the target camera.
In a possible design, the appearance comparison sub-module is configured to calculate a distance value between the k-dimensional appearance feature included in the local appearance information of the target object and the k-dimensional appearance feature included in the global appearance information of the ith object, where k is a positive integer, and to determine the appearance similarity between the target object and the ith object according to the distance value.
In a possible design, the similarity calculation sub-module is configured to perform weighted summation of the appearance similarity and the spatio-temporal similarity to obtain the similarity between the target object and the ith object.
In a possible design, the identifier assigning module is configured to, for each target object, search the global tracking queue, according to a similarity matrix, for a specific object matching the target object, where the similarity matrix comprises the pairwise similarities between the m target objects and the n objects; if the specific object exists in the global tracking queue, assign the global identifier of the specific object to the target object; and if the specific object does not exist in the global tracking queue, assign a new global identifier to the target object.
In one possible design, the apparatus further includes a coordinate calculation module, configured to calculate a global position coordinate of the target object according to a position coordinate of the target object in a coordinate system corresponding to the first camera and a transformation relationship between the coordinate system corresponding to the first camera and a global coordinate system; and generating a motion track of the target object according to the global position coordinates of the target object at each moment.
In yet another aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the above target detection and tracking method.
In yet another aspect, an embodiment of the present application provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the above target detection and tracking method.
In a further aspect, an embodiment of the present application provides a computer program product, which is configured to, when executed by a processor, implement the above target detection and tracking method.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
the method comprises the steps of calculating the similarity between a target object and objects in a global tracking queue by obtaining local appearance information and local space-time information of the target object under a single camera, and distributing a global identifier to the target object according to the calculated similarity, namely, when target detection tracking is carried out under a multi-camera scene, considering time information and appearance information in addition to space information of the target object, which is beneficial to reducing probability of data mismatching and improving accuracy of target detection tracking under the multi-camera scene.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description only illustrate some embodiments of the present application, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a possible implementation environment of a target detection and tracking method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of the core steps of a target detection and tracking method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of the processing flow of a possible computer device in a target detection and tracking method provided in an embodiment of the present application;
FIG. 4 is a flowchart of a target detection and tracking method provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a method for computing spatio-temporal similarity provided in an embodiment of the present application;
FIG. 6 is a block diagram of an object detection and tracking apparatus provided in an embodiment of the present application;
FIG. 7 is a complete block diagram of an object detection and tracking apparatus provided in the embodiments of the present application;
fig. 8 is a block diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
Artificial Intelligence (AI) refers to the theory, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, and the like.
Computer Vision (CV) is the science of studying how to make machines "see": it uses cameras and computers in place of human eyes to identify, track, and measure targets, and further processes the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
The scheme provided by the embodiment of the application relates to the computer vision technology of artificial intelligence, and is specifically explained by the following embodiment.
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Referring to FIG. 1, a schematic diagram of an implementation environment according to an embodiment of the present application is shown. The implementation environment may include: a camera 10 and a computer device 20.
The camera 10 is used for capturing images within its field of view and generating a video stream. In this embodiment, there are a plurality of cameras 10. For example, as shown in FIG. 1, a plurality of cameras 10 are arranged at different positions of a real scene 30, and each camera 10 is used for monitoring a partial region of the real scene 30 to obtain a corresponding video stream.
The Computer device 20 is a device having a function of processing and storing data, such as a PC (Personal Computer), a server, or other electronic devices having a computing capability, and is not limited in this embodiment of the present application. The computer device 20 may receive the video streams of the multiple cameras 10 and may decode the video streams into images, and then perform subsequent processing, such as multi-target detection tracking in a multi-camera scenario.
FIG. 2 shows the flow of multi-target detection and tracking in a multi-camera scene performed by the computer device 20. The computer device 20 receives the video streams of the plurality of cameras 10 and decodes the video streams into images (as shown in part (a) of FIG. 2). For each camera 10, the computer device 20 performs target detection on the video image of the single camera 10 and obtains a target detection result under that single camera 10. The computer device 20 can perform multi-object detection under a single camera 10, that is, detect a plurality of target objects from the video image of the single camera 10 (as shown in part (b) of FIG. 2). The computer device 20 may further perform multi-target tracking based on the target detection result of the single camera 10, that is, respectively track the plurality of target objects under the single camera 10, so as to obtain a multi-target tracking result under the single camera 10 (as shown in part (c) of FIG. 2). Finally, the computer device 20 integrates the multi-target tracking results under the multiple cameras 10 and obtains the tracking results of the multiple target objects under the multiple cameras 10, i.e., the global tracking result (as shown in part (d) of FIG. 2).
The camera 10 and the computer device 20 can communicate in a wired or wireless manner. For example, data transmission between the camera 10 and the computer device 20 may be performed in an Ad-Hoc manner, or may be performed under the coordination of a base station or a wireless Access Point (AP), which is not limited in this embodiment of the present application.
In an exemplary embodiment, as shown in FIG. 3, computer device 20 includes a single screen tracking module 21 and a cross-screen tracking module 22. The single-screen tracking module 21 is mainly responsible for target detection and tracking under a single camera to obtain a multi-target tracking result under the single camera. The cross-screen tracking module 22 is mainly responsible for integrating multi-target tracking results under multiple cameras to obtain a global tracking result.
With reference to fig. 3, the single-screen tracking module 21 performs target detection on a video image under a single camera, starts target tracking under the single camera after a target object is detected, and obtains local tracking information under the single camera, where the local tracking information includes local appearance information and local spatiotemporal information. The single screen tracking module 21 pushes the local tracking information of the target object to the cross-screen tracking module 22. After receiving the local tracking information of the target object, the cross-screen tracking module 22 calculates the similarity between the target object and the objects in the global tracking queue, and allocates a global identifier to the target object according to the similarity. As shown in fig. 3, the global tracking queue is configured to store global tracking information of at least one object tracked based on video images acquired by a plurality of cameras, including global appearance information and global spatiotemporal information. The local tracking information of the single camera can be represented by a local tracking queue, and the local tracking queue comprises local tracking information of at least one target object obtained based on video images acquired by the single camera. The cross-screen tracking module 22 calculates a similarity between the target object and the objects in the global tracking queue according to the local tracking information of the target object and the global tracking information of the objects in the global tracking queue, where the similarity may be a weighted sum of an appearance similarity and a spatiotemporal similarity. Finally, the cross-screen tracking module 22 assigns a global identifier to the target object according to the similarity calculation result. In addition, the computer device may also implement some business functions according to the global tracking result, such as trajectory drawing and displaying of the object, or other business logic based on the global tracking result, such as region management and control, loitering detection, video structuring, and the like.
In the embodiments of the present application, an object refers to a person or thing that can be detected and tracked from a video image; optionally, the object is a movable real-world entity such as a pedestrian, an animal, or a vehicle.
Referring to fig. 4, a flowchart of a target detection and tracking method according to an embodiment of the present application is shown. The method may be applied in a computer device implementing the environment shown in fig. 1. The method comprises the following steps (401-404):
step 401, acquiring a video image acquired by a first camera.
The first camera may be any one of a plurality of cameras. The multiple cameras can be used for monitoring a certain real scene, and different cameras can be used for monitoring different areas in the real scene. The monitoring areas of any two cameras may or may not have an overlapping area.
In addition, the computer device can decode the video stream collected by the first camera to obtain a plurality of frames of video images.
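By way of illustration only (this sketch is not part of the patent's disclosure), the decoding step could be realized with OpenCV as follows; the stream URL and the sampling interval are hypothetical assumptions.

import cv2

# A minimal sketch of decoding one camera's stream into frames, assuming OpenCV;
# the RTSP URL and the sampling interval are illustrative assumptions.
def read_frames(stream_url, sample_interval=1):
    """Decode the video stream of one camera and yield (timestamp_ms, frame) pairs."""
    capture = cv2.VideoCapture(stream_url)
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        # Optionally process only every Nth frame (e.g. every 5th frame, as described later).
        if frame_index % sample_interval == 0:
            timestamp_ms = capture.get(cv2.CAP_PROP_POS_MSEC)  # acquisition time of this frame
            yield timestamp_ms, frame
        frame_index += 1
    capture.release()

# Example usage (hypothetical URL):
# for t, img in read_frames("rtsp://camera-1.example/stream", sample_interval=5):
#     ...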
Step 402, detecting and tracking a target object in a video image acquired by a first camera to obtain local tracking information of the target object.
The target object refers to an object detected and tracked by the computer device in the video image acquired by the first camera, and the target object may include one object or a plurality of objects. As described above, in this embodiment, an object may be a movable real-world entity such as a pedestrian, an animal, or a vehicle. In one example, the computer device detects and tracks all objects (such as all pedestrians) in the video image acquired by the first camera and obtains local tracking information of each object. In another example, the computer device detects and tracks one or more specific objects (such as one or more specific pedestrians) in the video image acquired by the first camera and obtains local tracking information of each specific object.
The computer device first detects the target object using a target detection technique. Optionally, the computer device may detect the target object by using methods such as SSD (Single Shot MultiBox Detector) or YOLO (You Only Look Once), which is not limited in this embodiment of the application.
After the computer device detects the target object, it tracks the target object using a target tracking algorithm and assigns a local identifier to the target object. The local identifier uniquely identifies the target object under a single camera: different objects under the same camera have different local identifiers, and the local identifier may be denoted as a local ID. Optionally, the computer device may track the target object by using a correlation filtering algorithm such as KCF (Kernelized Correlation Filter), or a tracking algorithm based on a deep neural network (e.g., a Siamese network), which is not limited in this embodiment of the application.
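As a simplified illustration only (the patent itself names KCF and Siamese-network trackers), the sketch below maintains local identifiers under one camera by associating detections across frames with an IoU-based rule as a stand-in tracker; the function names and threshold are assumptions.

def iou(box_a, box_b):
    """Intersection-over-union of two detection frames given as (x, y, width, height)."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def assign_local_ids(prev_tracks, detections, next_id, iou_threshold=0.3):
    """prev_tracks: {local_id: box} from the previous frame; detections: list of boxes.
    Returns the updated {local_id: box} mapping and the next unused local ID."""
    updated = {}
    for box in detections:
        # Match the detection to the previous track with the highest IoU above the threshold.
        best_id, best_iou = None, iou_threshold
        for local_id, prev_box in prev_tracks.items():
            overlap = iou(box, prev_box)
            if overlap > best_iou and local_id not in updated:
                best_id, best_iou = local_id, overlap
        if best_id is None:          # unmatched detection: start a new local track
            best_id, next_id = next_id, next_id + 1
        updated[best_id] = box
    return updated, next_id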
In the embodiment of the application, the computer device may obtain local tracking information of the target object based on the detection tracking result of the target object. The local tracking information may include local appearance information and local spatiotemporal information, among others.
The local appearance information is used to indicate appearance features of the target object in a video image captured by a single camera (e.g., the first camera in this embodiment). The appearance characteristics reflect the characteristics of the color, shape, texture, etc. of the target object. For example, the local appearance information of the target object is obtained by performing feature extraction on the corresponding image area of the target object in the video image. Taking the target object as a pedestrian as an example, the local appearance information of the target object may be obtained by using a pedestrian re-identification (person re-identification) technology and/or a face recognition technology, and the specific acquisition means of the local appearance information is not limited in the embodiment of the present application.
The local spatio-temporal information is used to indicate the spatio-temporal features of the target object in the video images acquired by a single camera (e.g., the first camera in this embodiment). The spatio-temporal features reflect the temporal and spatial characteristics of the target object. The local spatio-temporal information may include information such as the position and size of the detection frame of the target object, key point information of the target object, and the timestamp corresponding to the target object. The detection frame of the target object refers to the smallest frame of a preset shape that contains the target object, for example the smallest rectangular frame containing the target object. The local appearance information described above can be obtained by performing feature extraction on the image region inside the detection frame of the target object. The key point information of the target object refers to the positions of the key points of the target object; for example, when the target object is a pedestrian, the key points may include foot key points or key points of other body parts. The key point information may be obtained by using a key point localization algorithm such as OpenPose or Mask RCNN, which is not limited in this embodiment of the application. The timestamp corresponding to the target object refers to the acquisition time of the currently processed frame of video image.
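Taken together, the local tracking information described above could be carried in a record like the following sketch; the field names and types are illustrative assumptions rather than the patent's own data layout.

from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np

# A sketch of a per-camera local tracking record; field names are illustrative assumptions.
@dataclass
class LocalTrack:
    local_id: int                                  # local ID, unique under a single camera
    camera_id: int                                 # camera that produced this record
    box: Tuple[float, float, float, float]         # detection frame: (x, y, width, height)
    keypoints: Optional[np.ndarray]                # e.g. foot key points, shape (num_points, 2); may be None
    timestamp: float                               # acquisition time of the processed video frame
    appearance: Optional[np.ndarray] = None        # k-dimensional appearance feature (local appearance info)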
After the computer device detects and tracks the target object in a frame of video image acquired by the first camera, a local tracking queue can be obtained. The local tracking queue includes the local identifier of at least one object detected and tracked from that frame of video image and the local tracking information corresponding to each local identifier.
In addition, for the video stream captured by the first camera, the computer device may perform detection and tracking on the target object in each frame of video image in the video stream, or may perform detection and tracking on the target object once every several frames of video images, for example, perform detection and tracking on the target object once every 5 frames of video images, that is, perform detection and tracking on the target object in the video images such as the 1 st frame, the 6 th frame, the 11 th frame, and the 16 th frame.
In the above steps 401 and 402, only the first camera is taken as an example, and the target detection and tracking process under the first camera is described. For a multi-camera scene, the target detection and tracking process under each camera is the same as or similar to the above-described target detection and tracking process under the first camera, and details are not repeated here.
Step 403, calculating the similarity between the target object and the objects in the global tracking queue according to the local tracking information of the target object.
The global tracking queue is used for storing the global identifier of at least one object obtained by tracking based on the video images acquired by at least one camera, and the global tracking information corresponding to each global identifier. The at least one camera comprises the first camera. Optionally, when the technical solution provided by this embodiment is applied to target detection and tracking in a multi-camera scene, the number of cameras is at least two, and the at least two cameras include the first camera and at least one other camera. The global tracking queue comprises the global identifiers of all objects and the global tracking information corresponding to each global identifier. The global identifier uniquely identifies an object across the multiple cameras: different objects have different global identifiers under the multiple cameras, and the global identifier may be denoted as a global ID.
In an embodiment of the present application, the global tracking information includes global appearance information and global spatiotemporal information.
The global appearance information is used to indicate global appearance features of the object under the at least one camera. In one possible implementation, for an object, the global appearance information of the object may be integrated according to the local appearance information of the object under each camera. For example, assuming that there are 3 cameras in total, denoted as camera 1, camera 2, and camera 3, if an object is detected from 2 cameras (such as camera 1 and camera 3) and local appearance information of the object under camera 1 and local appearance information of the object under camera 3 are obtained, global appearance information of the object can be calculated according to the local appearance information of the object under camera 1 and the local appearance information of the object under camera 3. For example, an average value or a weighted average value of the local appearance information of the object under each camera is calculated to obtain the global appearance information of the object. In another possible embodiment, for an object, the global appearance information of the object includes local appearance information of the object under all cameras. For example, assuming that there are 3 cameras in total, denoted as camera 1, camera 2, and camera 3, for a certain object, if the object is detected from 2 cameras (such as camera 1 and camera 3) and the local appearance information of the object under camera 1 and the local appearance information of the object under camera 3 are obtained, the global appearance information of the object includes the local appearance information under camera 1 and the local appearance information under camera 3.
The global spatiotemporal information is used to indicate global spatiotemporal features of the object under the at least one camera. In one possible embodiment, for an object, the global spatiotemporal information of the object refers to the latest local spatiotemporal information of the object under each camera. For example, assuming that there are 3 cameras in total, denoted as camera 1, camera 2, and camera 3, for an object, if the object is detected from 2 cameras (such as camera 1 and camera 3), local spatio-temporal information of the object under camera 1 and local spatio-temporal information of the object under camera 3 are obtained. Then, if the timestamp contained in the local spatio-temporal information of the object under the camera 1 is later than the timestamp contained in the local spatio-temporal information of the object under the camera 3, determining the local spatio-temporal information of the object under the camera 1 as the global spatio-temporal information of the object; on the contrary, if the timestamp contained in the local spatio-temporal information of the object under the camera 3 is later than the timestamp contained in the local spatio-temporal information of the object under the camera 1, the local spatio-temporal information of the object under the camera 3 is determined as the global spatio-temporal information of the object. In another possible implementation, for an object, the global spatiotemporal information of the object includes the latest spatiotemporal information of the object under all cameras. For example, assuming that there are 3 cameras in total, denoted as camera 1, camera 2, and camera 3, for an object, if the object is detected from 2 cameras (e.g., camera 1 and camera 3) and the latest local spatio-temporal information of the object under camera 1 and the latest local spatio-temporal information under camera 3 are obtained, the global spatio-temporal information of the object includes the latest local spatio-temporal information under camera 1 and the latest local spatio-temporal information under camera 3.
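One of the aggregation options described in the two paragraphs above (averaging the per-camera appearance features and keeping the spatio-temporal record with the latest timestamp) could look like the sketch below; this is a simplifying assumption, since the patent equally allows keeping all per-camera entries.

import numpy as np

# A sketch of one aggregation option: average the per-camera appearance features and keep
# the spatio-temporal record carrying the latest timestamp.
def update_global_track(global_track, local_track):
    """global_track: dict holding one object's global tracking information; local_track: LocalTrack."""
    # Global appearance information: mean of the local appearance features observed so far.
    history = global_track.setdefault("appearance_history", [])
    history.append(local_track.appearance)
    global_track["global_appearance"] = np.mean(history, axis=0)

    # Global spatio-temporal information: keep the local record with the latest timestamp.
    latest = global_track.get("global_spatiotemporal")
    if latest is None or local_track.timestamp > latest["timestamp"]:
        global_track["global_spatiotemporal"] = {
            "camera_id": local_track.camera_id,
            "box": local_track.box,
            "keypoints": local_track.keypoints,
            "timestamp": local_track.timestamp,
        }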
Optionally, for each object in the global tracking queue, the computer device calculates the similarity between the target object and that object according to the local tracking information of the target object and the global tracking information of that object. The purpose of calculating the similarity between the target object and each object in the global tracking queue is to determine whether the target object is one of the objects recorded in the global tracking queue. For example, if the similarity between the target object and an object in the global tracking queue is high, e.g., greater than a preset threshold, it may be determined that the target object is that object. As another example, the computer device may take the maximum similarity between the target object and the objects in the global tracking queue, and if the maximum similarity is greater than a preset threshold, it may be determined that the target object is the object corresponding to that maximum similarity.
In one possible implementation, the similarity between the target object and an object in the global tracking queue may be calculated according to the appearance similarity and the space-time similarity between the target object and that object. The appearance similarity reflects how similar the appearance features of the target object and the object in the global tracking queue are, and can be calculated according to the local appearance information of the target object and the global appearance information of the object. The space-time similarity reflects how similar the spatio-temporal features of the target object and the object in the global tracking queue are, and can be calculated according to the local spatio-temporal information of the target object and the global spatio-temporal information of the object.
Illustratively, for the ith object in the global tracking queue, the computer device calculates the appearance similarity between the target object and the ith object according to the local appearance information of the target object and the global appearance information of the ith object; calculating the space-time similarity between the target object and the ith object according to the local space-time information of the target object and the global space-time information of the ith object; and calculating the similarity between the target object and the ith object according to the appearance similarity and the space-time similarity. Wherein i is a positive integer. The two steps of calculating the appearance similarity and calculating the space-time similarity may be executed simultaneously or sequentially, and the embodiment of the present application does not limit this. In addition, please refer to the description in the following embodiments for a specific calculation process of the above similarity.
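A sketch of this per-object computation as a weighted sum is given below; appearance_similarity is sketched later in this description, spatiotemporal_similarity stands for the spatio-temporal comparison detailed in the following embodiments, and the weights W1 and W2 are assumed, deployment-specific parameters.

# A sketch of the weighted fusion of appearance similarity and spatio-temporal similarity;
# the two helper functions and the default weights are assumptions.
def combined_similarity(local_track, global_track, w1=0.5, w2=0.5):
    app_sim = appearance_similarity(local_track.appearance,
                                    global_track["global_appearance"])
    st_sim = spatiotemporal_similarity(local_track,
                                       global_track["global_spatiotemporal"])
    return w1 * app_sim + w2 * st_sim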
It should be noted that the global tracking queue maintained by the computer device is initially an empty queue. When the computer device obtains the first local tracking queue, the local tracking information of each object in the first local tracking queue may be stored directly into the global tracking queue as the global tracking information of each object, completing the first update of the global tracking queue. In addition, during this first update, the global identifier of each object in the global tracking queue may directly reuse the local identifier of that object in the first local tracking queue, or a new global identifier may be assigned, which is not limited in this embodiment of the application.
Step 404, determining the global identifier of the target object according to the similarity.
After the computer device calculates the similarity between the target object and each object in the global tracking queue, it may determine, according to the similarity calculation results, whether a specific object matching the target object exists in the global tracking queue. If the specific object exists in the global tracking queue, the computer device assigns the global identifier of the specific object to the target object. If the specific object does not exist in the global tracking queue, a new global identifier is assigned to the target object. For example, suppose the global tracking queue includes 3 objects, namely object 1, object 2, and object 3, whose global identifiers are 001, 002, and 003, respectively. If object 1 is determined to match the target object according to the similarity calculation results, the global identifier assigned to the target object is 001; if it is determined according to the similarity calculation results that no specific object matching the target object exists in the global tracking queue, a new global identifier is assigned to the target object, for example 004.
In addition, the computer device needs to update the global tracking queue, and when a specific object matching the target object exists in the global tracking queue, the computer device updates the global tracking information of the specific object according to the local tracking information of the target object, including updating the global appearance information of the specific object according to the local appearance information of the target object, and updating the global spatio-temporal information of the specific object according to the local spatio-temporal information of the target object. When a specific object matching with the target object does not exist in the global tracking queue, the target object may be added to the global tracking queue as a new object, and the local tracking information of the target object is determined as the global tracking information of the target object.
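The matching and queue update described in the two paragraphs above could be realized roughly as follows; the best-match-above-threshold rule is an assumed simplification (a similarity matrix with, for example, Hungarian assignment would also fit), and update_global_track is the aggregation sketch given earlier.

import math

# A sketch of assigning a global ID by matching against the global tracking queue and then
# updating the queue; the threshold value and the greedy strategy are assumptions.
def assign_global_id(local_track, global_queue, next_global_id, threshold=0.5):
    """global_queue: {global_id: global_track dict}; returns (global_id, next_global_id)."""
    best_id, best_sim = None, threshold
    for global_id, global_track in global_queue.items():
        sim = combined_similarity(local_track, global_track)
        if sim > best_sim:
            best_id, best_sim = global_id, sim

    if best_id is None:
        # No matching object: add the target object to the queue with a new global ID.
        best_id, next_global_id = next_global_id, next_global_id + 1
        global_queue[best_id] = {}
    # Matched (or newly created) object: refresh its global tracking information.
    update_global_track(global_queue[best_id], local_track)
    return best_id, next_global_id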
It should be noted that, if the computer device includes the single-screen tracking module and the cross-screen tracking module described above, the above steps 401 and 402 may be executed by the single-screen tracking module, and the single-screen tracking module sends the local tracking information of the target object to the cross-screen tracking module after obtaining the local tracking information; the steps 403 and 404 may be executed by a cross-screen tracking module, and the cross-screen tracking module allocates a global identifier to the target object according to the local tracking information of the target object and by combining with the global tracking queue.
In a possible implementation, the above step 404 may further include: calculating the global position coordinate of the target object according to the position coordinate of the target object in the image coordinate system of the camera where the target object is located and the conversion relationship between that image coordinate system and the global coordinate system. The global coordinate system refers to the actual physical coordinate system of the ground plane. When the camera has been calibrated with global coordinates in advance, the conversion relationship between the global coordinate system and the image coordinate system can be modeled using the property that the projection of a plane in the physical world into the camera image satisfies an affine transformation, so that an affine transformation model is obtained; the target object can then be converted from the image coordinate system of the camera into the global coordinate system using the affine transformation model, and its global position coordinate can thus be calculated. A motion trajectory of the target object can be generated from the global position coordinates calculated at each moment, so that real-time positioning and tracking of persons in a large-scale multi-camera video surveillance scene can be achieved.
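A sketch of this conversion with OpenCV is given below; the calibrated point pairs are assumed to be available in advance, and the trajectory accumulation at the end is only indicative.

import cv2
import numpy as np

# A sketch of converting image coordinates into the global (ground-plane) coordinate system
# with an affine model estimated from pre-calibrated point pairs; assumes OpenCV.
def fit_affine_model(image_points, global_points):
    """Estimate a 2x3 affine matrix from calibrated point pairs (typically no fewer than four)."""
    m, _ = cv2.estimateAffine2D(np.asarray(image_points, dtype=np.float32),
                                np.asarray(global_points, dtype=np.float32))
    return m

def to_global(affine_m, image_point):
    """Map one image-coordinate point (e.g. a foot key point) into the global coordinate system."""
    x, y = image_point
    gx = affine_m[0, 0] * x + affine_m[0, 1] * y + affine_m[0, 2]
    gy = affine_m[1, 0] * x + affine_m[1, 1] * y + affine_m[1, 2]
    return gx, gy

# Trajectory: collect the global position at each timestamp for one global ID, e.g.
# trajectory.setdefault(global_id, []).append((timestamp, to_global(affine_m, foot_point)))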
In summary, in the technical solution provided by this embodiment of the application, the local appearance information and local spatio-temporal information of the target object under a single camera are obtained, the similarity between the target object and the objects in the global tracking queue is calculated, and a global identifier is assigned to the target object according to the calculated similarity. In other words, when target detection and tracking is performed in a multi-camera scene, temporal and appearance information is considered in addition to the spatial information of the target object, which helps reduce the probability of data mismatching and improves the accuracy of target detection and tracking in a multi-camera scene.
In addition, the technical solution provided by the embodiments of the application can realize multi-target detection and tracking in a multi-camera scene. In practical applications, real-time positioning and tracking of persons in a large-scale multi-camera video surveillance scene can be realized, which can be applied to business scenarios such as personnel authorization, area management and control, real-time search and tracing, and statistical analysis of personnel activity trajectories; in addition, it can help realize unmanned or low-staff security monitoring, reduce security costs, and improve security efficiency.
Next, a description will be given of a procedure of calculating the similarity.
In one possible implementation, the similarity between the target object and the objects in the global tracking queue may be calculated according to the appearance similarity and the space-time similarity between the target object and the objects in the global tracking queue. Taking the calculation of the similarity between the target object and the ith object in the global tracking queue as an example, the appearance similarity and the space-time similarity of the target object and the ith object can be subjected to weighted summation to calculate the similarity between the target object and the ith object.
Exemplarily, the similarity between the target object and the ith object is W1 × the appearance similarity between the target object and the ith object + W2 × the spatio-temporal similarity between the target object and the ith object, where W1 represents a weight corresponding to the appearance similarity, and W2 represents a weight corresponding to the spatio-temporal similarity. Specific values of W1 and W2 may be determined according to an actual application scenario, and this is not limited in this embodiment of the present application.
Optionally, the appearance similarity between the target object and the ith object is calculated by the following steps:
1. calculating a distance value between k-dimensional appearance features included in the local appearance information of the target object and k-dimensional appearance features included in the global appearance information of the ith object, wherein k is a positive integer;
2. determining the appearance similarity between the target object and the ith object according to the distance value.
The appearance similarity between the target object and the ith object in the global tracking queue is determined according to the distance value between the k-dimensional appearance features, and the distance value can be represented by a cosine distance or a Euclidean distance. Optionally, the distance value is represented by a non-normalized Euclidean distance, which represents the appearance similarity more intuitively. In addition, the computer device may directly use the distance value as the appearance similarity, or may convert the distance value into the appearance similarity based on a preset conversion rule, which is not limited in this embodiment of the application.
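A sketch of this appearance comparison, using the non-normalized Euclidean distance mentioned above and one illustrative (assumed) rule for converting a distance into a similarity:

import numpy as np

# A sketch of the appearance comparison: Euclidean distance between the k-dimensional
# features, then an assumed conversion rule turning a distance into a similarity.
def appearance_similarity(local_feature, global_feature):
    distance = float(np.linalg.norm(np.asarray(local_feature) - np.asarray(global_feature)))
    return 1.0 / (1.0 + distance)  # illustrative conversion: smaller distance, higher similarity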
Optionally, the spatio-temporal similarity between the target object and the ith object is calculated by the following steps:
1. if the global space-time information of the ith object comes from a second camera in the at least one camera, determining the time difference of the target object in the first camera and the second camera according to the local space-time information of the target object and the global space-time information of the ith object, and determining the spatial position relationship between the first camera and the second camera;
as can be seen from the above description of the global spatiotemporal information in the embodiment, the global spatiotemporal information of the ith object is the latest local spatiotemporal information of the ith object under the at least one camera. The latest local spatio-temporal information is extracted from the video image of which camera, and the camera is the second camera. And the computer equipment calculates the difference value of the two timestamps according to the timestamp contained in the local space-time information of the target object and the timestamp contained in the global space-time information of the ith object, so as to obtain the time difference of the target object in the first camera and the second camera.
In addition, the spatial position relation between the first camera and the second camera is known, and the spatial position relation can be determined according to the arrangement position of each camera in a real scene. Wherein, the spatial position relation is any one of the following: the two cameras are the same camera, the two cameras are adjacent and have overlapped vision, the two cameras are adjacent and have no overlapped vision, and the two cameras are not adjacent. By two cameras being adjacent is meant that the two cameras are arranged adjacently, i.e. one camera is arranged beside the other camera. When two cameras are adjacent, the visual fields of the two cameras may overlap, that is, the shooting areas of the two cameras have a part of overlapping area; alternatively, the fields of view of the two cameras may not overlap, that is, the shooting areas of the two cameras do not have an overlapping area.
Optionally, the overlapping of the fields of view means that the ground plane shooting has an overlapping region, and the non-overlapping of the fields of view means that the ground plane shooting has no overlapping region.
2. determining the space-time similarity between the target object and the ith object according to the time difference and the spatial position relationship.
In this embodiment of the application, a preset correspondence may be configured, which maps the value interval to which the time difference belongs and the spatial position relationship to a spatio-temporal similarity. The computer device queries this preset correspondence and determines the spatio-temporal similarity corresponding to the time difference and the spatial position relationship as the spatio-temporal similarity between the target object and the ith object.
Referring to FIG. 5, a schematic diagram of the preset correspondence is shown by way of example. In FIG. 5, the magnitude of the spatio-temporal similarity is represented by its corresponding distance: the greater the distance, the smaller the spatio-temporal similarity, and the smaller the distance, the greater the spatio-temporal similarity. As shown in FIG. 5, the time difference is divided into 4 value intervals: a time difference of 0, 0 < time difference ≤ T1, T1 < time difference ≤ T2, and time difference > T2, where T1 and T2 are preset thresholds and T1 is less than T2.
When the spatial position relationship is that two cameras are the same camera, the calculation of the distance corresponding to the spatial-temporal similarity can be divided into the following cases:
(1) if the time difference is 0, the target object and the ith object are not considered to be the same object, and the distance corresponding to the space-time similarity between the target object and the ith object is infinite.
(2) If the time difference is greater than 0 and less than or equal to T1, the distance corresponding to the space-time similarity between the target object and the ith object is determined according to the distance between the detection frame of the target object and the detection frame of the ith object.
In one example, the computer device uses a euclidean distance between a center point of a detection frame of a target object and a center point of a detection frame of an ith object as a distance between the detection frame of the target object and the detection frame of the ith object.
In another example, in addition to the distance between the center points of the two detection frames, the computer device also takes the sizes of the two detection frames into account and combines the two pieces of information to calculate the distance between the two detection frames. Optionally, the computer device divides the Euclidean distance between the two detection frames by the size similarity of the two detection frames to obtain the distance between the two detection frames. The size similarity of the two detection frames can be obtained by dividing the smaller detection-frame area by the larger detection-frame area, which is not limited in this embodiment of the application. A code sketch of the same-camera rules in cases (1) to (3) follows case (3) below.
After the computer device calculates the distance between the detection frame of the target object and the detection frame of the ith object, the distance may be directly used as the distance corresponding to the space-time similarity between the target object and the ith object, or the distance corresponding to the space-time similarity may be further calculated according to a preset conversion relationship.
(3) If the time difference is > T1, the distance corresponding to the spatio-temporal similarity between the target object and the ith object is determined to be a preset constant C1.
The value of the preset constant C1 may be preset in combination with the actual situation, which is not limited in the embodiment of the present application.
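The same-camera rules in cases (1) to (3) above can be summarized as follows; the detection-frame distance divides the Euclidean distance between box centers by the size similarity of the boxes, as in the second example of case (2), and T1 and C1 are the preset values mentioned above.

import math

# A sketch of the same-camera branch of the preset correspondence: "infinite" distance for a
# zero time difference, a detection-frame distance for small time differences, and the preset
# constant C1 otherwise. T1 and C1 are deployment-chosen values.
def box_distance(box_a, box_b):
    """Center-point Euclidean distance divided by the size similarity of the two boxes."""
    cax, cay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    cbx, cby = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    center_dist = math.hypot(cax - cbx, cay - cby)
    area_a, area_b = box_a[2] * box_a[3], box_b[2] * box_b[3]
    size_similarity = min(area_a, area_b) / max(area_a, area_b)  # smaller area over larger area
    return center_dist / size_similarity

def same_camera_distance(time_diff, box_target, box_i, t1, c1):
    if time_diff == 0:
        return math.inf          # same frame under the same camera: not the same object
    if time_diff <= t1:
        return box_distance(box_target, box_i)
    return c1                    # time difference greater than T1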
When the spatial position relationship is that two cameras are adjacent and the visual fields overlap, the calculation of the distance corresponding to the spatial-temporal similarity can be divided into the following cases:
(1) If the time difference is 0, the distance corresponding to the space-time similarity between the target object and the ith object is determined according to the distance between the target object and the ith object after they are converted into the same coordinate system.
In one example, when the camera has been calibrated with global coordinates in advance, the conversion relationship between the global coordinate system and the image coordinate system can be modeled using the property that the projection of a plane in the physical world into the camera image satisfies an affine transformation; several sets of feature points (optionally, no fewer than four sets) can be calibrated in advance to compute the affine transformation model. Taking a pedestrian as the target object, the pedestrian is assumed to stand on the ground, i.e., the pedestrian's feet are on the ground plane; if the feet are visible, the positions of the foot feature points in the image coordinate system can be converted into the global coordinate system, and the distance between the target object and the ith object is then calculated in the global coordinate system.
In another example, when the camera has not been calibrated with global coordinates in advance, the target object and the ith object may be converted into the same image coordinate system by using the mutual conversion relationship between the image coordinate systems of the two cameras whose captured ground areas overlap, and the distance between the target object and the ith object is then calculated in that image coordinate system; which camera's image coordinate system is used is not limited in this embodiment of the application.
In another embodiment, when the cameras are calibrated by global coordinates in advance, the target object and the ith object may also be converted into the same image coordinate system by using a mutual conversion relationship between image coordinate systems of two cameras capturing overlapping regions, and then the distance between the target object and the ith object in the same image coordinate system is taken as the distance corresponding to the spatio-temporal similarity between the target object and the ith object.
(2) If 0 < time difference ≦ T2, the calculation is divided into the following cases:
in one example, in the case where the camera is calibrated in advance by global coordinates, the target object and the ith object may be converted into a global coordinate system, and then the distance between the target object and the ith object in the global coordinate system is used as the distance corresponding to the spatio-temporal similarity between the target object and the ith object.
In another example, when the camera is not calibrated by global coordinates in advance, the distance corresponding to the spatio-temporal similarity between the target object and the ith object is determined to be a preset constant C2, where a value of the preset constant C2 may be preset in combination with an actual situation, which is not limited in this embodiment of the present application.
In yet another example, in the case where the camera is calibrated in advance by global coordinates, a preset constant C2 may also be used as the distance corresponding to the spatio-temporal similarity between the target object and the ith object.
(3) If the time difference is greater than T2, it is determined that the distance corresponding to the spatio-temporal similarity between the target object and the ith object is a preset constant C3, where a value of the preset constant C3 may be preset in combination with an actual situation, which is not limited in the embodiment of the present application.
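Purely for illustration, the image-to-global conversion described in the overlapping-fields case above can be sketched as follows, assuming the cameras have been calibrated with no fewer than four feature point pairs; the least-squares fitting approach, the point formats, and the function names are assumptions for this sketch rather than a fixed implementation of this application.

import numpy as np

def fit_affine(image_pts, global_pts):
    # Least-squares fit of a 2x3 affine transform mapping image coordinates to
    # global ground-plane coordinates from the calibrated feature point pairs.
    image_pts = np.asarray(image_pts, dtype=float)
    global_pts = np.asarray(global_pts, dtype=float)
    design = np.hstack([image_pts, np.ones((len(image_pts), 1))])
    params, *_ = np.linalg.lstsq(design, global_pts, rcond=None)
    return params.T  # shape (2, 3)

def to_global(affine, image_pt):
    # Map an image point (for example, a visible foot feature point) into the global system.
    x, y = image_pt
    return affine @ np.array([x, y, 1.0])

# The distance corresponding to the spatio-temporal similarity can then be taken as the
# Euclidean distance between the two mapped points, for example:
# np.linalg.norm(to_global(affine_cam1, foot_target) - to_global(affine_cam2, foot_ith))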
When the spatial position relationship is that two cameras are adjacent and the visual fields are not overlapped, the calculation of the distance corresponding to the spatial-temporal similarity can be divided into the following cases:
(1) If the time difference is 0, the target object and the ith object are not considered to be the same object, and the distance corresponding to the space-time similarity between the target object and the ith object is determined to be infinite.
(2) If 0 < time difference ≦ T2, the calculation is divided into the following cases:
in one example, if the camera is calibrated in advance by global coordinates, the distance between the target object and the ith object in the global coordinate system can be used as the distance corresponding to the spatio-temporal similarity between the target object and the ith object.
In another example, if the camera is not calibrated by global coordinates in advance, the distance corresponding to the spatio-temporal similarity between the target object and the ith object is determined to be a preset constant C4, where a value of the preset constant C4 may be preset in combination with an actual situation, which is not limited in this embodiment of the present application.
(3) If the time difference is greater than T2, it is determined that the distance corresponding to the spatio-temporal similarity between the target object and the ith object is a preset constant C5, where a value of the preset constant C5 may be preset in combination with an actual situation, which is not limited in the embodiment of the present application.
And when the spatial position relation is that the two cameras are not adjacent, determining that the distance corresponding to the space-time similarity between the target object and the ith object is infinite.
The preset constants C1, C2, C3, C4, and C5 may be constants having the same value, or constants having different values, and may be preset in combination with actual situations, which is not limited in this embodiment of the present application.
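Purely as an illustration, the following Python sketch gathers the case analysis above into a single lookup function; the threshold T2, the constants C2 to C5, the relation labels, and the function name are assumptions for the sketch, not values fixed by this application (the same-camera case handled earlier in the description is omitted here).

import math

# Assumed illustrative values; the application leaves T2 and C2 to C5 to be preset
# according to the actual situation.
T2 = 10.0
C2, C3, C4, C5 = 80.0, 120.0, 100.0, 150.0

def spatiotemporal_distance(relation, time_diff, coord_distance=None, globally_calibrated=False):
    # Distance used for the spatio-temporal similarity, following the case analysis above.
    if relation == "adjacent_overlapping":
        if time_diff == 0:
            return coord_distance  # distance after conversion into one coordinate system
        if time_diff <= T2:
            return coord_distance if globally_calibrated else C2
        return C3
    if relation == "adjacent_non_overlapping":
        if time_diff == 0:
            return math.inf  # the same object cannot appear in both non-overlapping views at once
        if time_diff <= T2:
            return coord_distance if globally_calibrated else C4
        return C5
    return math.inf  # the two cameras are not adjacent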
In combination with the specific process for determining spatiotemporal similarity described in the above embodiments, the embodiments of the present application mainly determine the spatiotemporal similarity between two objects according to the following points:
1. the same object cannot appear at different positions at the same time;
2. the longer an object has been out of view, the less reliable its previous position information is;
3. when two cameras are adjacent and their fields of view overlap, the distance between a target object and an object in the global tracking queue can be determined by using the property that the image of a plane in the physical world projected onto the camera frame satisfies an affine transformation;
4. when calculating the distance between detection frames within the same camera, the influence of the detection frame sizes on the space-time similarity can also be taken into account.
It should be noted that, in the embodiment of the present application, description is mainly given to a process of calculating a similarity between a target object and an ith object in a global tracking queue, and a similarity between any one target object and any one object in the global tracking queue may be calculated by using the same or similar method, which is not limited in the embodiment of the present application.
In summary, the method for calculating the spatio-temporal similarity provided in the embodiment of the application comprehensively considers the time information of the target object, the size of the detection frame, the position information, and the like, so that the calculation result of the spatio-temporal similarity is more reliable, which further reduces the probability of data mismatching and improves the accuracy of target detection and tracking.
In an exemplary embodiment, after calculating the similarity between the target object and the objects in the global tracking queue, the computer device may allocate a global identifier to the target object according to the similarity in any of the following possible implementations. Assume that the number of target objects detected and tracked in the video image of the first camera is m, the number of objects included in the global tracking queue is n, and both m and n are positive integers.
In a possible embodiment, the step 404 includes the following sub-steps:
1. for each target object, searching whether a specific object matched with the target object exists in the global tracking queue according to the similarity matrix;
2. if the specific object exists in the global tracking queue, distributing a global identifier of the specific object for the target object;
3. and if no specific object exists in the global tracking queue, distributing a new global identifier for the target object.
The similarity matrix includes the pairwise similarities between the m target objects and the n objects; the calculation process for these similarities is described above and is not repeated here. The specific object may be the object in the global tracking queue that has the greatest similarity to the target object, where that greatest similarity is greater than a preset threshold. The value of the preset threshold may be set according to the actual application scenario; for example, in a scenario with a high requirement on target detection and tracking accuracy, a higher value may be set, which is not limited in the embodiment of the present application.
In an exemplary embodiment, a certain algorithm, such as the weighted Hungarian algorithm, may be adopted to search the similarity matrix for a specific object matching the target object, which is not limited in the embodiment of the present application.
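Purely as an illustration, under the assumption that the matching is implemented with the assignment solver of SciPy, a minimal sketch follows; the threshold value, the use of linear_sum_assignment, and the function name are assumptions for the sketch rather than a fixed implementation of this application.

import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_global_ids(similarity, queue_ids, next_id, threshold=0.5):
    # similarity: m x n matrix of similarities between target objects and queue objects.
    # queue_ids: global identifiers of the n objects in the global tracking queue.
    similarity = np.asarray(similarity, dtype=float)
    m, n = similarity.shape
    assigned = [None] * m
    if n > 0:
        rows, cols = linear_sum_assignment(-similarity)  # maximize the total similarity
        for r, c in zip(rows, cols):
            if similarity[r, c] > threshold:  # accept only a sufficiently similar specific object
                assigned[r] = queue_ids[c]
    for r in range(m):
        if assigned[r] is None:  # no matching specific object: allocate a new global identifier
            assigned[r] = next_id
            next_id += 1
    return assigned, next_id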
In addition, in some other possible examples, the computer device may traverse the objects in the global tracking queue one by one, calculate the similarity between the target object and each object in turn, stop the calculation as soon as the similarity of a certain object is greater than a preset threshold and directly allocate the global identifier of that object to the target object, and allocate a new global identifier to the target object when no object in the global tracking queue has a similarity greater than the preset threshold. Alternatively, the computer device may calculate the similarity between the target object and every object in the global tracking queue and then directly allocate the global identifier of the object with the largest similarity to the target object. These two modes have high calculation efficiency and help reduce the calculation amount, but their matching accuracy is lower.
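A minimal sketch of the first greedy alternative described above follows; the dictionary-based object representation, the threshold, and the function names are assumptions for the sketch.

def assign_global_ids_greedy(targets, queue, similarity_fn, next_id, threshold=0.5):
    # Traverse the global tracking queue object by object and stop at the first object
    # whose similarity to the target exceeds the threshold; otherwise allocate a new identifier.
    assigned = []
    for target in targets:
        matched_id = None
        for obj in queue:
            if similarity_fn(target, obj) > threshold:
                matched_id = obj["global_id"]  # reuse the matched object's global identifier
                break
        if matched_id is None:
            matched_id, next_id = next_id, next_id + 1
        assigned.append(matched_id)
    return assigned, next_id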
In summary, in the technical scheme provided in the embodiment of the present application, after the similarity between the target object and the objects in the global tracking queue is obtained through calculation, the global identifier is allocated to the target object according to the similarity, so that each detected target object has a corresponding global identifier, thereby implementing cross-screen tracking on the target object.
In addition, the object which has the maximum similarity with the target object in the global tracking queue and is larger than the preset threshold is used as the specific object matched with the target object, so that the matching accuracy is higher.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 6, a block diagram of an object detecting and tracking apparatus provided in an embodiment of the present application is shown. The apparatus 6000 has the functions of implementing the above method embodiments, and the functions may be implemented by hardware or by hardware executing corresponding software. The apparatus 6000 may be the computer device described above, or may be provided in a computer device. The apparatus 6000 may include: an information obtaining module 6100, a detection tracking module 6200, a calculation comparison module 6300, and an identifier assigning module 6400.
An information obtaining module 6100, configured to obtain the video image collected by the first camera.
A detecting and tracking module 6200, configured to perform detecting and tracking on a target object in the video image acquired by the first camera to obtain local tracking information of the target object, where the local tracking information includes local appearance information and local spatio-temporal information, the local appearance information is used to indicate an appearance feature of the target object in the video image acquired by the first camera, and the local spatio-temporal information is used to indicate a spatio-temporal feature of the target object in the video image acquired by the first camera.
A calculation and comparison module 6300, configured to calculate, according to the local tracking information of the target object, a similarity between the target object and an object in the global tracking queue; the global tracking queue is used for storing global tracking information of at least one object obtained by tracking based on video images collected by at least one camera, the at least one camera comprises the first camera, the global tracking information comprises global appearance information and global space-time information, the global appearance information is used for indicating global appearance characteristics of the object, and the global space-time information is used for indicating latest space-time characteristics of the object.
An identifier assigning module 6400, configured to assign a global identifier to the target object according to the similarity.
In this embodiment, the information obtaining module 6100 and the detection tracking module 6200 may be combined to form the single-screen tracking module described above, and the calculation comparison module 6300 and the identifier assigning module 6400 may be combined to form the cross-screen tracking module described above.
In an exemplary embodiment, as shown in fig. 7, the calculation comparison module 6300 includes: an appearance comparison submodule 6310, a spatio-temporal comparison submodule 6320 and a similarity operator module 6330.
The appearance comparison submodule 6310 is configured to, for the ith object in the global tracking queue, calculate an appearance similarity between the target object and the ith object according to the local appearance information of the target object and the global appearance information of the ith object.
A spatiotemporal comparison submodule 6320, configured to calculate a spatiotemporal similarity between the target object and the ith object according to the local spatiotemporal information of the target object and the global spatiotemporal information of the ith object.
A similarity operator module 6330, configured to calculate a similarity between the target object and the ith object according to the appearance similarity and the spatio-temporal similarity. Wherein i is a positive integer.
In an exemplary embodiment, as shown in fig. 7, the spatiotemporal comparison submodule 6320 includes: an information determination unit 6321 and a spatiotemporal comparison unit 6322.
An information determining unit 6321, configured to determine, when the global spatio-temporal information of the i-th object is from a second camera of the at least one camera, a time difference of occurrence of the target object in the first camera and the second camera according to the local spatio-temporal information of the target object and the global spatio-temporal information of the i-th object, and determine a spatial position relationship between the first camera and the second camera.
A spatio-temporal comparing unit 6322, configured to determine a spatio-temporal similarity between the target object and the i-th object according to the time difference and the spatial position relationship.
In an exemplary embodiment, as shown in fig. 7, the spatio-temporal comparing unit 6322 is configured to query a preset correspondence, and determine the spatio-temporal similarity corresponding to the time difference and the spatial position relationship as the spatio-temporal similarity between the target object and the ith object; the preset correspondence includes the value interval to which the time difference belongs, and a correspondence between the spatial position relationship and the spatio-temporal similarity, wherein the spatial position relationship includes at least one of the following: the two cameras are the same camera; the two cameras are adjacent and their fields of view overlap; the two cameras are adjacent and their fields of view do not overlap; and the two cameras are not adjacent.
In an exemplary embodiment, as shown in fig. 7, the spatio-temporal comparison unit 6322 is further configured to, when the spatial position relationship is that the two cameras are adjacent and their fields of view overlap, calculate the global position coordinate of the target object according to the position coordinate of the target object in the image coordinate system corresponding to the first camera and the conversion relationship between the image coordinate system corresponding to the first camera and the global coordinate system; calculate the global position coordinate of the ith object according to the position coordinate of the ith object in the image coordinate system corresponding to the second camera and the conversion relationship between the image coordinate system corresponding to the second camera and the global coordinate system; and calculate the spatio-temporal similarity between the target object and the ith object according to the global position coordinate of the target object and the global position coordinate of the ith object.
In an exemplary embodiment, as shown in fig. 7, the spatio-temporal comparison unit 6322 is further configured to, when the spatial position relationship is that two cameras are adjacent and the fields of view overlap, convert the position coordinates of the target object in the image coordinate system corresponding to the first camera and the position coordinates of the ith object in the image coordinate system corresponding to the second camera into the image coordinate system corresponding to the same target camera, which is the first camera or the second camera; and calculating the space-time similarity between the target object and the ith object according to the position coordinates of the target object in the image coordinate system corresponding to the target camera and the position coordinates of the ith object in the image coordinate system corresponding to the target camera.
In an exemplary embodiment, as shown in fig. 7, the appearance comparison sub-module 6310 is configured to calculate a distance value between the k-dimensional appearance feature included in the local appearance information of the target object and the k-dimensional appearance feature included in the global appearance information of the ith object, where k is a positive integer, and to determine the appearance similarity between the target object and the ith object according to the distance value.
In an exemplary implementation, as shown in fig. 7, the similarity operator module 6330 is configured to perform weighted summation on the appearance similarity and the spatio-temporal similarity, and calculate the similarity between the target object and the i-th object.
In an exemplary embodiment, as shown in fig. 7, the identifier assigning module 6400 is configured to, for each target object, find whether a specific object matching the target object exists in the global tracking queue according to a similarity matrix; wherein the similarity matrix comprises the similarity between the m target objects and the n objects pairwise; if the specific object exists in the global tracking queue, distributing a global identifier of the specific object to the target object; and if the specific object does not exist in the global tracking queue, distributing a new global identifier for the target object.
In an exemplary embodiment, as shown in fig. 7, the apparatus 6000 further includes a coordinate calculating module 6500, configured to calculate a global position coordinate of the target object according to the position coordinate of the target object in the coordinate system corresponding to the first camera and a transformation relationship between the coordinate system corresponding to the first camera and the global coordinate system; and generating a motion track of the target object according to the global position coordinates of the target object at each moment.
In summary, according to the technical scheme provided by the embodiment of the application, the similarity between the target object and the objects in the global tracking queue is calculated by obtaining the local appearance information and the local space-time information of the target object under a single camera, and the global identifier is allocated to the target object according to the calculated similarity, that is, when the target detection tracking is performed under a multi-camera scene, the probability of data mismatching is favorably reduced by considering the time information and the appearance information in addition to the spatial information of the target object, and the accuracy of the target detection tracking is improved under the multi-camera scene.
It should be noted that, in the apparatus provided in the embodiment of the present application, the division into the above functional modules is merely used as an example for illustration; in practical applications, the functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for their specific implementation processes, reference is made to the method embodiments, and details are not described herein again.
Referring to fig. 8, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be used to implement the target detection tracking method provided in the above embodiments. The computer device may be, for example, the computer device 20 in the implementation environment shown in FIG. 1. Specifically:
The computer device 800 includes a processing unit (e.g., a central processing unit CPU, a graphics processor GPU, a field programmable gate array FPGA, etc.) 801, a system memory 804 including a Random Access Memory (RAM) 802 and a Read Only Memory (ROM) 803, and a system bus 805 connecting the system memory 804 and the processing unit 801. The computer device 800 also includes a basic input/output system (I/O system) 806, which facilitates transfer of information between devices within the computer device, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 812.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809 such as a mouse, keyboard, etc. for user input of information. Wherein the display 808 and the input device 809 are connected to the central processing unit 801 through an input output controller 810 connected to the system bus 805. The basic input/output system 806 may also include an input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 810 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the computer device 800. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.
Without loss of generality, the computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 804 and mass storage 807 described above may be collectively referred to as memory.
According to embodiments of the present application, the computer device 800 may also be connected to a remote computer on a network, such as the Internet, for operation. That is, the computer device 800 may be connected to the network 812 through the network interface unit 811 coupled to the system bus 805, or may be connected to other types of networks or remote computer systems (not shown) by using the network interface unit 811.
The memory further stores at least one instruction, at least one program, a code set, or an instruction set, which is configured to be executed by one or more processors to implement the above target detection tracking method.
In an embodiment of the present application, a computer-readable storage medium is further provided, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the storage medium, and when executed by a processor, the at least one instruction, the at least one program, the code set, or the set of instructions implements the target detection and tracking method.
In an exemplary embodiment, a computer program product is also provided; when the computer program product is executed by a processor, it implements the above target detection tracking method.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. A target detection tracking method, the method comprising:
acquiring a video image acquired by a first camera;
detecting and tracking a target object in a video image acquired by the first camera to obtain local tracking information of the target object, wherein the local tracking information comprises local appearance information and local spatiotemporal information, the local appearance information is used for indicating appearance characteristics of the target object in the video image acquired by the first camera, and the local spatiotemporal information is used for indicating spatiotemporal characteristics of the target object in the video image acquired by the first camera;
calculating the similarity between the target object and the objects in the global tracking queue according to the local tracking information of the target object; the global tracking queue is used for storing global identifiers of at least one object obtained by tracking based on video images collected by at least one camera and global tracking information corresponding to each global identifier, wherein the global tracking information comprises global appearance information and global spatiotemporal information, the global appearance information is used for indicating global appearance characteristics of the object under the at least one camera, and the global spatiotemporal information is used for indicating global spatiotemporal characteristics of the object under the at least one camera;
determining the global identifier of the target object according to the similarity;
wherein the calculating the similarity between the target object and the objects in the global tracking queue comprises: calculating the space-time similarity between the target object and the ith object according to the local space-time information of the target object and the global space-time information of the ith object in the global tracking queue;
the calculating the spatiotemporal similarity between the target object and the ith object comprises: if the global space-time information of the ith object is from a second camera in the at least one camera, determining the time difference of the target object in the first camera and the second camera according to the local space-time information of the target object and the global space-time information of the ith object, and determining the spatial position relationship between the first camera and the second camera; determining the space-time similarity between the target object and the ith object according to the time difference and the spatial position relation;
wherein, when the spatial position relationship is that two cameras are adjacent and the visual fields are overlapped, the determining the space-time similarity between the target object and the ith object comprises: calculating the global position coordinate of the target object according to the position coordinate of the target object in the image coordinate system corresponding to the first camera and the conversion relation between the image coordinate system corresponding to the first camera and the global coordinate system; calculating the global position coordinate of the ith object according to the position coordinate of the ith object in the image coordinate system corresponding to the second camera and the conversion relation between the image coordinate system corresponding to the second camera and the global coordinate system; calculating the space-time similarity between the target object and the ith object according to the global position coordinate of the target object and the global position coordinate of the ith object;
or, when the spatial position relationship is that two cameras are adjacent and the visual fields are overlapped, the determining the space-time similarity between the target object and the ith object comprises: converting the position coordinates of the target object in the image coordinate system corresponding to the first camera and the position coordinates of the ith object in the image coordinate system corresponding to the second camera into the image coordinate system corresponding to the same target camera; wherein the target camera is the first camera or the second camera; and calculating the space-time similarity between the target object and the ith object according to the position coordinate of the target object in the image coordinate system corresponding to the target camera and the position coordinate of the ith object in the image coordinate system corresponding to the target camera.
2. The method of claim 1, wherein the calculating the similarity between the target object and the objects in the global tracking queue according to the local tracking information of the target object further comprises:
for the ith object in the global tracking queue, calculating the appearance similarity between the target object and the ith object according to the local appearance information of the target object and the global appearance information of the ith object;
and calculating the similarity between the target object and the ith object according to the appearance similarity and the space-time similarity, wherein i is a positive integer.
3. The method of claim 1, wherein determining the spatiotemporal similarity between the target object and the i-th object according to the time difference and the spatial position relationship further comprises:
inquiring a preset corresponding relation, and determining the space-time similarity corresponding to the time difference and the space position relation as the space-time similarity between the target object and the ith object;
the preset corresponding relationship comprises a value interval to which the time difference belongs, and a corresponding relationship between the spatial position relationship and the space-time similarity, wherein the spatial position relationship is any one of the following: the two cameras are the same camera; the two cameras are adjacent and their fields of view overlap; the two cameras are adjacent and their fields of view do not overlap; and the two cameras are not adjacent.
4. The method according to claim 2, wherein the calculating the appearance similarity between the target object and the ith object according to the local appearance information of the target object and the global appearance information of the ith object comprises:
calculating a distance value between the k-dimensional appearance feature included in the local appearance information of the target object and the k-dimensional appearance feature included in the global appearance information of the ith object, wherein k is a positive integer;
and determining the appearance similarity between the target object and the ith object according to the distance value.
5. The method of claim 2, wherein said calculating a similarity between said target object and said ith object based on said appearance similarity and said spatio-temporal similarity comprises:
and carrying out weighted summation on the appearance similarity and the space-time similarity, and calculating to obtain the similarity between the target object and the ith object.
6. The method according to any one of claims 1 to 5, wherein the number of the target objects is m, the number of the objects contained in the global tracking queue is n, and m and n are both positive integers;
the determining the global identifier of the target object according to the similarity includes:
for each target object, searching whether a specific object matched with the target object exists in the global tracking queue according to a similarity matrix; wherein the similarity matrix comprises similarities between the m target objects and the n objects pairwise;
if the specific object exists in the global tracking queue, distributing a global identifier of the specific object to the target object;
and if the specific object does not exist in the global tracking queue, distributing a new global identifier for the target object.
7. An object detection tracking apparatus, characterized in that the apparatus comprises:
the video acquisition module is used for acquiring a video image acquired by the first camera;
the detection tracking module is used for detecting and tracking a target object in a video image acquired by the first camera to obtain local tracking information of the target object, wherein the local tracking information comprises local appearance information and local spatiotemporal information, the local appearance information is used for indicating appearance characteristics of the target object in the video image acquired by the first camera, and the local spatiotemporal information is used for indicating spatiotemporal characteristics of the target object in the video image acquired by the first camera;
the calculation comparison module is used for calculating the similarity between the target object and the objects in the global tracking queue according to the local tracking information of the target object; the global tracking queue is used for storing global tracking information of at least one object obtained by tracking based on video images collected by at least one camera, the at least one camera comprises the first camera, the global tracking information comprises global appearance information and global space-time information, the global appearance information is used for indicating global appearance characteristics of the object under the at least one camera, and the global space-time information is used for indicating the latest space-time characteristics of the object under the at least one camera;
the identification distribution module is used for distributing global identification to the target object according to the similarity;
the calculation comparison module comprises a space-time comparison submodule and a comparison module, wherein the space-time comparison submodule is used for calculating the space-time similarity between the target object and the ith object according to the local space-time information of the target object and the global space-time information of the ith object in the global tracking queue;
the space-time comparison submodule comprises: an information determining unit, configured to determine, when the global spatiotemporal information of the ith object is from a second camera of the at least one camera, the time difference of occurrence of the target object in the first camera and the second camera according to the local spatiotemporal information of the target object and the global spatiotemporal information of the ith object, and to determine the spatial position relationship between the first camera and the second camera; and a space-time comparison unit, configured to determine the space-time similarity between the target object and the ith object according to the time difference and the spatial position relationship;
when the spatial position relationship is that two cameras are adjacent and the visual fields are overlapped, the space-time comparison unit is further used for calculating the global position coordinate of the target object according to the position coordinate of the target object in the image coordinate system corresponding to the first camera and the conversion relationship between the image coordinate system corresponding to the first camera and the global coordinate system; calculating the global position coordinate of the ith object according to the position coordinate of the ith object in the image coordinate system corresponding to the second camera and the conversion relation between the image coordinate system corresponding to the second camera and the global coordinate system; calculating the space-time similarity between the target object and the ith object according to the global position coordinate of the target object and the global position coordinate of the ith object;
or, when the spatial position relationship is that two cameras are adjacent and the fields of vision overlap, converting the position coordinates of the target object in the image coordinate system corresponding to the first camera and the position coordinates of the ith object in the image coordinate system corresponding to the second camera into the image coordinate system corresponding to the same target camera; wherein the target camera is the first camera or the second camera; and calculating the space-time similarity between the target object and the ith object according to the position coordinate of the target object in the image coordinate system corresponding to the target camera and the position coordinate of the ith object in the image coordinate system corresponding to the target camera.
8. The apparatus of claim 7, wherein the computation comparison module further comprises:
the appearance comparison submodule is used for calculating the appearance similarity between the target object and the ith object according to the local appearance information of the target object and the global appearance information of the ith object for the ith object in the global tracking queue;
and the similarity operator module is used for calculating the similarity between the target object and the ith object according to the appearance similarity and the space-time similarity, wherein i is a positive integer.
9. The apparatus of claim 7,
the space-time comparison unit is further configured to query a preset corresponding relationship, and determine a space-time similarity corresponding to the time difference and the spatial position relationship as a space-time similarity between the target object and the ith object;
the preset corresponding relationship comprises a value interval to which the time difference belongs, and a corresponding relationship between the spatial position relationship and the space-time similarity, and the spatial position relationship comprises at least one of the following: the two cameras are the same camera; the two cameras are adjacent and their fields of view overlap; the two cameras are adjacent and their fields of view do not overlap; and the two cameras are not adjacent.
10. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method of object detection tracking as claimed in any one of claims 1 to 6.
11. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method of object detection tracking according to any one of claims 1 to 6.