CN114821430A - Cross-camera target object tracking method, device, equipment and storage medium - Google Patents

Cross-camera target object tracking method, device, equipment and storage medium

Info

Publication number
CN114821430A
Authority
CN
China
Prior art keywords
face
camera
face feature
target object
tracked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210478177.7A
Other languages
Chinese (zh)
Inventor
管国权
陈伟明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XI'AN FUTURE INTERNATIONAL INFORMATION CO Ltd
Original Assignee
XI'AN FUTURE INTERNATIONAL INFORMATION CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XI'AN FUTURE INTERNATIONAL INFORMATION CO Ltd filed Critical XI'AN FUTURE INTERNATIONAL INFORMATION CO Ltd
Priority to CN202210478177.7A priority Critical patent/CN114821430A/en
Publication of CN114821430A publication Critical patent/CN114821430A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a cross-camera target object tracking method, device, equipment and storage medium, and relates to the technical field of image processing. The method comprises the following steps: acquiring video streams of a plurality of camera shooting areas in a target place; performing face detection on each video stream to obtain a detection result of each object, the detection result comprising face feature information, the appearance time of the object in a camera shooting area and the number of the camera that captured the object; constructing a dynamic face feature library according to the detection result of each object, the dynamic face feature library being used for storing face feature information of a plurality of objects, the appearance time of the objects in each camera shooting area and the number of each camera; and determining the global track of the target object to be tracked according to the dynamic face feature library. The method and the device can quickly and effectively locate information such as the position and time at which the target object to be tracked appears in the target place, achieve accurate positioning of the target object to be tracked, and solve the problem of cross-camera object tracking.

Description

Cross-camera target object tracking method, device, equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for tracking a target object across cameras.
Background
In certain fields, there is a need to track target objects. At present, tracking the moving track of a target object generally requires acquiring images captured by a camera, performing target detection on the images to identify a target, and tracking the identified target object, so as to obtain a complete track of the target object.
However, in some larger-scale scenes, when the target object appears in different areas, the wide coverage, heavy flow of people and large number of cameras make phenomena such as track confusion and track loss likely to occur, so that the target object cannot be located and tracked quickly and accurately.
Disclosure of Invention
In view of the above disadvantages in the prior art, an object of the present application is to provide a cross-camera target object tracking method, device, equipment and storage medium, so as to quickly and accurately locate a target object to be tracked.
In order to achieve the above purpose, the technical solutions adopted in the embodiments of the present application are as follows:
in a first aspect, an embodiment of the present application provides a target object tracking method across cameras, where the method includes:
acquiring video streams of a plurality of camera shooting areas in a target place;
performing face detection on each video stream to obtain a detection result of each object, wherein the detection result comprises face feature information of the object, the appearance time of the object in the camera shooting area and the camera number for shooting the object;
according to the detection result of each object, constructing a dynamic face feature library, wherein the dynamic face feature library is used for storing face feature information of a plurality of objects, the appearance time of the objects in each camera shooting area and the serial number of each camera for shooting the objects;
and determining the global track of the target object to be tracked according to the dynamic human face feature library.
Optionally, the performing face detection on each video stream to obtain a detection result of each object includes:
extracting a key frame in the video stream, wherein the key frame is a frame image containing the human face;
performing face detection on the key frame to obtain a face detection result, wherein the face detection result comprises a face detection frame and a plurality of face key points;
performing affine transformation processing on the key frame according to the face detection frame and the plurality of face key points to obtain a corrected intermediate face image;
coding the intermediate face image to obtain a coded face feature vector, and taking the face feature vector as face feature information;
and determining the occurrence time of the object in the camera shooting area and the camera number for shooting the object according to the key frame.
Optionally, the determining, according to the key frame, the occurrence time of the object in the camera shooting area and a camera number for shooting the object includes:
acquiring the first time when the key frame appears in the video stream shot by the camera and the serial number of the camera shooting the video stream of the key frame;
and taking the first time as the appearance time of the object in the camera shooting area, and taking the serial number of the camera shooting the video stream of the key frame as the serial number of the camera shooting the object.
Optionally, the performing face detection on the key frame to obtain a face detection result includes:
and performing face detection on the key frame by adopting a face detection model obtained by pre-training to obtain a face detection frame containing the target object and a plurality of face key points.
Optionally, the determining a global trajectory of a target object to be tracked according to the dynamic face feature library includes:
acquiring a face image of the target object to be tracked;
extracting a first face characteristic vector of a face image of the target object to be tracked;
comparing the first face feature vector with a plurality of face feature vectors prestored in the dynamic face feature library to obtain a feature list of the face image of the target object to be tracked, wherein the feature list comprises at least one face feature vector to be selected, and the similarity between each face feature vector to be selected and the first face feature vector is smaller than a preset threshold;
and determining the global track of the target object to be tracked according to the face feature vectors to be selected.
Optionally, the determining a global trajectory of the target object to be tracked according to each of the face feature vectors to be selected includes:
obtaining a detection result associated with each face feature vector to be selected from the dynamic face feature library;
extracting a plurality of track segments from the video stream of each camera shooting area according to the detection result associated with each to-be-selected face feature vector;
and combining the track segments to obtain the global track of the target object to be tracked.
Optionally, the method further comprises:
and drawing the three-dimensional track of each object in the target place according to the global track of each object and the number of each camera.
In a second aspect, an embodiment of the present application further provides a target object tracking apparatus across cameras, where the apparatus includes:
the acquisition module is used for acquiring video streams of a plurality of camera shooting areas in a target place;
the detection module is used for carrying out human face detection on each video stream to obtain a detection result of each object, wherein the detection result comprises human face feature information of the object, the appearance time of the object in the camera shooting area and the camera number for shooting the object;
the construction module is used for constructing a dynamic face feature library according to the detection result of each object, and the dynamic face feature library is used for storing face feature information of a plurality of objects, the appearance time of the objects in each camera shooting area and the serial number of each camera for shooting the objects;
and the determining module is used for determining the global track of the target object to be tracked according to the dynamic human face feature library.
Optionally, the detection module is further configured to:
extracting a key frame in the video stream, wherein the key frame is a frame image containing the human face;
performing face detection on the key frame to obtain a face detection result, wherein the face detection result comprises a face detection frame and a plurality of face key points;
performing affine transformation processing on the key frame according to the face detection frame and the plurality of face key points to obtain a corrected intermediate face image;
coding the intermediate face image to obtain a coded face feature vector, and taking the face feature vector as face feature information;
and determining the occurrence time of the object in the camera shooting area and the camera number for shooting the object according to the key frame.
Optionally, the detection module is further configured to:
acquiring the first time when the key frame appears in the video stream shot by the camera and the serial number of the camera shooting the video stream of the key frame;
and taking the first time as the appearance time of the object in the camera shooting area, and taking the serial number of the camera shooting the video stream of the key frame as the serial number of the camera shooting the object.
Optionally, the detection module is further configured to:
and performing face detection on the key frame by adopting a face detection model obtained by pre-training to obtain a face detection frame containing the target object and a plurality of face key points.
Optionally, the determining module is further configured to:
acquiring a face image of the target object to be tracked;
extracting a first face characteristic vector of a face image of the target object to be tracked;
comparing the first face feature vector with a plurality of face feature vectors prestored in the dynamic face feature library to obtain a feature list of the face image of the target object to be tracked, wherein the feature list comprises at least one face feature vector to be selected, and the similarity between each face feature vector to be selected and the first face feature vector is smaller than a preset threshold;
and determining the global track of the target object to be tracked according to the face feature vectors to be selected.
Optionally, the determining module is further configured to:
obtaining a detection result associated with each face feature vector to be selected from the dynamic face feature library;
extracting a plurality of track segments from the video stream of each camera shooting area according to the detection result associated with each to-be-selected face feature vector;
and combining the track segments to obtain the global track of the target object to be tracked.
Optionally, the apparatus further comprises:
and the drawing module is used for drawing the three-dimensional track of each object in the target place according to the global track of each object and the serial number of each camera.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the method as provided by the first aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the method as provided in the first aspect.
The beneficial effect of this application is:
the embodiment of the application provides a target object tracking method, a device, equipment and a storage medium for crossing cameras, wherein the method comprises the following steps: acquiring video streams of a plurality of camera shooting areas in a target place; carrying out face detection on each video stream to obtain a detection result of each object; the detection result comprises face feature information, the appearance time of the object in a camera shooting area and the camera number of the shot object; constructing a dynamic human face feature library according to the detection result of each object; the dynamic face feature library is used for storing face feature information of a plurality of objects, the appearance time of the objects in each camera shooting area and the serial number of each camera; and determining the global track of the target object to be tracked according to the dynamic human face feature library. In the scheme, face detection processing is mainly carried out on video streams in a plurality of camera shooting areas, so that face characteristic information of each object, the appearance time of each object in the camera shooting areas and the camera numbers for shooting each object are obtained; then, a dynamic human face feature library is constructed in real time based on the detection result of each object; and finally, searching a detection result of the target object to be tracked (namely, the face characteristic information of the target object to be tracked, the appearance time of the target object to be tracked in a camera shooting area and the camera number for shooting the target object to be tracked) in the constructed dynamic face characteristic library according to the face characteristic information of the target object to be tracked, and splicing the detection results of the target object to be tracked to obtain the global track of the target object to be tracked in a target place, so that the information of the appearance position, the appearance time and the like of the target object to be tracked in the target place is quickly and effectively positioned, the target object to be tracked is accurately positioned, and the problem of object tracking across cameras is solved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic structural diagram of a cross-camera target object tracking system according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a target object tracking method across cameras according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of another cross-camera target object tracking method according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of another cross-camera target object tracking method according to an embodiment of the present disclosure;
fig. 6 is a schematic flowchart of another cross-camera target object tracking method according to an embodiment of the present disclosure;
fig. 7 is a schematic flowchart of another cross-camera target object tracking method according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a target object tracking apparatus across cameras according to an embodiment of the present application.
Reference numerals: 100 - cross-camera target object tracking system; 101 - camera; 102 - server; 103 - network.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Additionally, it should be understood that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the term "comprising" will be used in the embodiments of the present application to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
The method, the apparatus, the electronic device, or the computer-readable storage medium described in the embodiments of the present application may be applied to any scenario that requires tracking a target object in a cross-camera, and the embodiments of the present application do not limit a specific application scenario.
First, before the technical solutions provided in the present application are explained in detail, the related background related to the present application will be briefly explained.
In certain fields, there is a need to track target objects. At present, tracking the moving track of a target object generally requires acquiring images captured by a camera, performing target detection on the images to identify a target, and tracking the identified target object, so as to obtain a complete track of the target object.
At present, in some larger-scale scenes, when a target object appears in different areas, the wide coverage, heavy flow of people and large number of cameras make phenomena such as track confusion and track loss likely to occur, and the target object cannot be located and tracked quickly and accurately. Therefore, it is necessary to provide a cross-camera target object tracking method to improve the accuracy of tracking the target object.
In order to solve the technical problems in the prior art, the application provides a target object tracking method across cameras, which mainly performs face detection processing on video streams in a plurality of camera shooting areas to obtain face feature information of each object, the occurrence time of each object in the camera shooting areas and the camera numbers for shooting each object; then, a dynamic human face feature library is constructed in real time based on the detection result of each object; and finally, searching a detection result of the target object to be tracked (namely, the face characteristic information of the target object to be tracked, the appearance time of the target object to be tracked in a camera shooting area and the camera number for shooting the target object to be tracked) in the constructed dynamic face characteristic library according to the face image of the target object to be tracked, and splicing the detection results of the target object to be tracked to obtain the global track of the target object to be tracked in a target place, so that the information of the appearance position, the appearance time and the like of the target object to be tracked in the target place can be quickly and effectively positioned, the target object to be tracked can be accurately positioned, and the problem of object tracking across cameras can be solved.
The structural schematic diagram of the cross-camera target object tracking system provided by the present application is briefly described below through a plurality of embodiments.
Fig. 1 is a schematic structural diagram of a cross-camera target object tracking system provided in an embodiment of the present application. As shown in fig. 1, the cross-camera target object tracking system 100 includes one or more cameras 101, a server 102 and a network 103; the cameras 101 and the server 102 may include processors to perform instruction operations. The cameras 101 and the server 102 may be communicatively connected via the network 103.
In some embodiments, for example, the camera 101 may photograph any object within the shooting area to obtain video streams of multiple objects over different time periods.
The server 102 may be a terminal device or a computing device such as a server having a data processing function. For example, the server 102 may analyze and process a video stream in a shooting area of the camera 101 to obtain a global track of the target object to be tracked.
The network 103 may be used for the exchange of information and/or data, for example, the network 103 may be any type of wired or wireless network, or any combination thereof.
It will be appreciated that the configuration shown in FIG. 1 is merely illustrative, and that target object tracking system 100 across cameras may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
Fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device may be a processing device with a display function and an operation function, such as a computer, a mobile internet device, a tablet, a mobile phone terminal or a server, and is configured to deploy and operate the server 102 in fig. 1, so as to implement the cross-camera target object tracking method provided by the present application.
As shown in fig. 2, the server 102 includes a memory 201, a processor 202, and a communication unit 203. The memory 201, the processor 202 and the communication unit 203 are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
The memory 201 stores software functional modules in the form of software or firmware, and the processor 202 executes various functional applications and data processing by running the software programs and modules stored in the memory 201, thereby implementing the cross-camera target object tracking method in the embodiments of the present application.
The Memory 201 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), and the like. The memory 201 is used for storing a program, and the processor 202 executes the program after receiving an execution instruction.
Processor 202 may be an integrated circuit chip having signal processing capabilities. The Processor 202 may be a general-purpose Processor including a Central Processing Unit (CPU), a Network Processor (NP), and the like.
The communication unit 203 is used for establishing communication connection between the camera 101 and the server 102 through the network 103, and implementing transceiving operation of network signals and data information.
The following describes, through a plurality of specific embodiments, the implementation principle of the cross-camera target object tracking method as applied to the server, and the beneficial effects produced by the method.
Fig. 3 is a schematic flowchart of a target object tracking method across cameras according to an embodiment of the present disclosure; alternatively, the execution subject of the method may be the server shown in fig. 1.
It should be understood that in other embodiments, the order of some steps in the cross-camera target object tracking method may be interchanged according to actual needs, or some steps may be omitted or deleted. As shown in fig. 3, the method includes:
s301, video streams of shooting areas of a plurality of cameras in a target place are obtained.
Illustratively, the target place may be a place where a target object needs to be tracked, such as a restaurant, an airport, a train station or another public place. The target place is provided with a plurality of cameras, and the cameras may be installed at any points in the target place; for example, camera 1 may be installed at the entrance of a railway station, camera 2 may be installed in the hall of the railway station, camera 3 may be installed at a ticket gate of the railway station, and so on. That is, the installation positions of the cameras in the target place can be determined according to the actual tracking requirements.
In this embodiment, each camera starts a video capture function in real time to capture a video stream in a shooting area in real time, where the video stream includes a face image of any object passing through the shooting area, and each camera sends the captured video stream to a server for analysis processing.
And S302, carrying out face detection on each video stream to obtain a detection result of each object.
The detection result comprises the face feature information of the object, the appearance time of the object in a camera shooting area and the number of the camera that captured the object. For example, the face feature information of object 1 may be a face feature code of the object, the appearance time of object 1 in the shooting area of camera 1 is 15:00, and the number of the camera that captured object 1 is camera 1. Further, the appearance position of the object within the target place can be determined from each camera number associated with the object. For example, if the camera number of object 1 is camera 1, the appearance position of object 1 in the target place is the railway station entrance.
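As an illustrative sketch only (the patent does not specify a data structure), a detection result of the kind described above could be held in a record like the following; the class and field names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DetectionResult:
    face_feature: List[float]   # face feature information, e.g. a 128-dimensional encoding
    appearance_time: str        # first time the object appears in the camera shooting area
    camera_id: int              # number of the camera that captured the object

# example record for "object 1" from the paragraph above
record = DetectionResult(face_feature=[0.0] * 128,
                         appearance_time="15:00",
                         camera_id=1)
```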
By way of example, an object captured in a camera shooting area may be a target object to be tracked, or an object that does not need to be tracked.
In this embodiment, for example, a face detection model trained in advance may be used to perform face detection processing on each video stream, so as to obtain face feature information of each object, an appearance time of each object in a camera shooting area, and a camera number for shooting each object.
And S303, constructing a dynamic human face feature library according to the detection result of each object.
The dynamic face feature library is used for storing face feature information of a plurality of objects, appearance time of the objects in each camera shooting area and each camera number of the shot objects.
It should be noted that the dynamic face feature library is a feature library updated according to the detection result of new objects continuously entering (or leaving) in the shooting area of each camera in the target site. Therefore, the detection results of all objects shot by all cameras in the video stream can be stored in the dynamic human face feature library, all the cameras do not need to be associated, and the moving path of all the objects appearing in the target place can be ensured to be tracked.
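A minimal in-memory sketch of such a dynamic face feature library is shown below; the list-based storage and method names are assumptions chosen for illustration, and the actual library may be backed by any store that can be updated as objects enter or leave the camera areas.

```python
class DynamicFaceFeatureLibrary:
    """Minimal sketch: stores (face feature, appearance time, camera number) entries."""

    def __init__(self):
        self.entries = []

    def add(self, face_feature, appearance_time, camera_id):
        # append the detection result of a newly observed object; the library
        # grows as new objects appear in any camera shooting area
        self.entries.append((face_feature, appearance_time, camera_id))

    def all_entries(self):
        return list(self.entries)
```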
In the embodiment, based on the dynamic human face feature library, the phenomena of track disorder, track loss and the like easily occurring when the target object appears in different areas can be avoided, and the accuracy of positioning the appearance time and the appearance position of the target object to be tracked in the target place is effectively improved.
And S304, determining the global track of the target object to be tracked according to the dynamic human face feature library.
In this embodiment, the face feature information of the target object to be tracked may be compared with the features in the dynamic face feature library constructed above, so as to find the appearance time of the target object to be tracked in each camera shooting area and the number of each camera that captured it. Then, the global track of the target object to be tracked in the target place is obtained according to that appearance time and those camera numbers. Further, the appearance time and appearance position of the target object to be tracked in the target place can be quickly located from its global track, thereby completing the tracking and observation of its moving path, providing the capability of dynamic trajectory tracking analysis of the target object to be tracked, and solving the problem of cross-camera object tracking.
To sum up, the embodiment of the present application provides a cross-camera target object tracking method, the method comprising: acquiring video streams of a plurality of camera shooting areas in a target place; performing face detection on each video stream to obtain a detection result of each object, the detection result comprising face feature information, the appearance time of the object in a camera shooting area and the number of the camera that captured the object; constructing a dynamic face feature library according to the detection result of each object, the dynamic face feature library being used for storing face feature information of a plurality of objects, the appearance time of the objects in each camera shooting area and the number of each camera; and determining the global track of the target object to be tracked according to the dynamic face feature library. In this scheme, face detection processing is mainly performed on the video streams of a plurality of camera shooting areas to obtain the face feature information of each object, the appearance time of each object in the camera shooting areas and the number of the camera that captured each object. A dynamic face feature library is then constructed in real time based on the detection result of each object. Finally, according to the face feature information of the target object to be tracked, the detection results of the target object to be tracked (namely its face feature information, its appearance time in each camera shooting area and the numbers of the cameras that captured it) are retrieved from the constructed dynamic face feature library and spliced to obtain the global track of the target object to be tracked in the target place, so that information such as the position and time at which the target object to be tracked appears in the target place is located quickly and effectively, accurate positioning of the target object to be tracked is achieved, and the problem of cross-camera object tracking is solved.
How to perform face detection on each video stream in step S302 described above will be specifically explained by the following embodiments, and a detection result of each object is obtained.
Alternatively, referring to fig. 4, the step S302 includes:
s401, extracting key frames in the video stream.
Wherein, the key frame is a frame image containing a human face.
In this embodiment, for example, 10 frames may be extracted from each second of the video stream, and whether each extracted frame contains a human face is determined; if so, the frame is taken as a key frame, and if not, the frame is discarded.
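The following Python sketch illustrates this sampling step under stated assumptions: OpenCV is used, a Haar-cascade detector stands in for the face detector described later, and the function name and sampling rate are illustrative rather than taken from the application.

```python
import cv2

def extract_key_frames(video_path, frames_per_second=10):
    """Sample frames from the video stream and keep only frames containing a face."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(fps // frames_per_second), 1)
    key_frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            if len(faces) > 0:  # keep the frame only if it contains a face
                key_frames.append((index, frame))
        index += 1
    cap.release()
    return key_frames
```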
S402, carrying out face detection on the key frame to obtain a face detection result.
The face detection result comprises a face detection frame and a plurality of face key points. The plurality of face key points may be two face key points or five face key points. For example, the five face key points are the left and right eyes, the nose, and the left and right mouth corners, respectively.
In this embodiment, a network model may be used to perform face detection on the key frame to obtain a face detection frame and a plurality of face key points. In addition, the face detection frame and the five face key points can be fused, so that the face detection can be rapidly and accurately realized.
And S403, performing affine transformation processing on the key frame according to the face detection frame and the plurality of face key points to obtain a corrected intermediate face image.
The affine transformation processing is linear transformation from two-dimensional coordinates to two-dimensional coordinates, and maintains the straightness and parallelism of the two-dimensional graph (the relative position relationship between straight lines is kept unchanged, the parallel lines are still parallel lines after affine transformation, and the position sequence of points on the straight lines is not changed).
In this embodiment, considering that the face in the key frame may be captured at a tilted angle, the face in the key frame needs to be aligned. Therefore, the tilt angle of the eye connecting line relative to the horizontal line can be calculated from the face detection frame and the five face key points, and affine transformation processing is then performed on the key frame according to the calculated tilt angle to obtain a corrected intermediate face image; that is, the intermediate face image is a cropped and corrected face image.
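A hedged sketch of this alignment step is given below; it assumes the face detection frame is an (x1, y1, x2, y2) box and that the two eye key points are available, and it uses a plain rotation computed from the eye-line tilt rather than the full affine transform the application may employ.

```python
import cv2
import numpy as np

def align_face(key_frame, box, left_eye, right_eye):
    """Crop the face and rotate it so the eye connecting line becomes horizontal."""
    x1, y1, x2, y2 = box
    face = key_frame[y1:y2, x1:x2]
    # tilt angle of the eye connecting line relative to the horizontal line
    dy = right_eye[1] - left_eye[1]
    dx = right_eye[0] - left_eye[0]
    angle = np.degrees(np.arctan2(dy, dx))
    center = (face.shape[1] / 2.0, face.shape[0] / 2.0)
    rotation = cv2.getRotationMatrix2D(center, angle, 1.0)
    # corrected intermediate face image
    return cv2.warpAffine(face, rotation, (face.shape[1], face.shape[0]))
```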
S404, coding the intermediate face image to obtain a coded face feature vector, and taking the face feature vector as face feature information.
On the basis of the above embodiment, for example, a Facenet deep learning model may be used to perform face coding processing on the intermediate face image to obtain a coded face feature vector, and the face feature vector is used as face feature information. The coded face feature information can be used for face recognition.
Specifically, a Facenet deep learning model may be used to downsample the intermediate face image to extract features, obtain a 128-dimensional feature vector, and perform L2 standardization on the 128-dimensional feature vector to construct the final face feature vector.
The backbone of the Facenet deep learning model likewise uses depthwise separable convolutions, similar to the Retinaface backbone, so feature extraction is fast and efficient. The downsampled feature map is flattened and passed through a fully connected layer with 128 neurons, so that the input picture is represented by a feature vector of length 128.
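The post-processing applied to the 128-dimensional embedding can be sketched as follows; the embedding itself is only simulated here with a random vector, since the trained Facenet model is not part of this text.

```python
import numpy as np

def l2_normalize(feature_vector, eps=1e-10):
    """L2-standardize a feature vector, as described for the 128-dimensional embedding."""
    feature_vector = np.asarray(feature_vector, dtype=np.float32)
    return feature_vector / (np.linalg.norm(feature_vector) + eps)

# placeholder for the Facenet output; a real pipeline would call the trained model
raw_embedding = np.random.randn(128).astype(np.float32)
face_feature = l2_normalize(raw_embedding)   # final face feature vector
```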
And S405, determining the appearance time of the object in the camera shooting area and the camera number of the shooting object according to the key frame.
Optionally, after the face feature information of each object is obtained, the appearance time of the object in the camera shooting area and the camera number of the shooting object may also be extracted from the key frame.
How the appearance time of the subject in the camera shooting area and the camera number of the shooting subject are determined from the key frame in the above-described step S405 will be specifically explained by the following embodiments.
Alternatively, referring to fig. 5, the step S405 includes:
s501, acquiring the first time when the key frame appears in the video stream shot by the camera and the number of the camera shooting the video stream where the key frame is located.
And S502, taking the first time as the appearance time of the object in the shooting area of the camera, and taking the number of the camera for shooting the video stream of the key frame as the number of the camera for shooting the object.
Generally, each frame of image in the video stream carries time information for the camera to acquire the frame of image and a camera number for shooting each frame of image.
It should be noted that, if an object appears in both the key frame 1 and the key frame 10 in the video stream captured by the camera 1, and the capturing time of the key frame 1 is earlier than that of the key frame 10, only the capturing time of the key frame 1 is taken as the appearance time of the object in the capturing area of the camera 1.
In this embodiment, the first time when the key frame appears in the video stream and the camera number for shooting the key frame may be obtained from the video stream shot by the camera, and the first time is taken as the appearance time of the object in the camera shooting area, and the number of the camera shooting the key frame in the video stream is taken as the number of the camera shooting the object.
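A small sketch of this bookkeeping, under the assumption that each detected key frame yields a (timestamp, camera number) pair, is shown below; it keeps only the earliest timestamp per camera, matching the key frame 1 / key frame 10 example above.

```python
def first_appearance(key_frame_meta):
    """Keep the earliest key-frame timestamp per camera for one object.

    key_frame_meta: iterable of (timestamp, camera_id) pairs taken from the
    key frames in which the object was detected.
    """
    earliest = {}
    for timestamp, camera_id in key_frame_meta:
        if camera_id not in earliest or timestamp < earliest[camera_id]:
            earliest[camera_id] = timestamp
    # one (appearance_time, camera_id) record per camera shooting area
    return [(t, cam) for cam, t in earliest.items()]
```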
How to perform face detection on the key frame in step S402 is specifically explained through the following embodiments, so as to obtain a face detection result.
Optionally, performing face detection on the key frame to obtain a face detection result, including:
and performing face detection on the key frame by adopting a face detection model obtained by pre-training to obtain a face detection frame containing the target object and a plurality of face key points.
In this embodiment, for example, the face detection model trained in advance may be trained by using a deep neural network. Due to the requirement of the deep neural network for the input size of the input picture, the size of the key frame needs to be adjusted to a specified size without distortion.
When a deep neural network is trained, small weight values are generally used for fitting, and if the training data take large integer values the model training process may be slowed down. Therefore, the pixels of the image generally need to be normalized.
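A sketch of this preprocessing, assuming an input size of 640x640 and simple [0, 1] pixel scaling (both assumptions, since the application does not fix these values), could look like this:

```python
import cv2
import numpy as np

def preprocess(key_frame, target_size=640):
    """Resize to the network input size without distortion, then normalize pixels."""
    h, w = key_frame.shape[:2]
    scale = target_size / max(h, w)
    resized = cv2.resize(key_frame, (int(w * scale), int(h * scale)))
    # pad to a square canvas so the aspect ratio of the key frame is preserved
    canvas = np.zeros((target_size, target_size, 3), dtype=np.uint8)
    canvas[:resized.shape[0], :resized.shape[1]] = resized
    # scale pixel values to [0, 1] so large integer values do not slow training
    return canvas.astype(np.float32) / 255.0
```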
In this embodiment, a one-stage Retinaface deep learning network model, which performs very well on a mainstream face detection data set, may be used as the face detection model. The backbone part of Retinaface adopts a MobileNet network, which uses depthwise separable convolution as its main structure, to extract the face features. Depthwise separable convolution only changes the conventional convolution slightly, but it can greatly reduce the number of parameters, improve the lightweight performance of the face detection network and greatly improve face detection efficiency.
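For illustration, a depthwise separable convolution block of the kind used in MobileNet can be written as follows in PyTorch; this is a generic sketch, not the application's exact network definition.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```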
Retinaface constructs a feature pyramid on the last three effective feature layers of the backbone and performs feature extraction on the image at each scale, generating multi-scale feature representations in which the feature maps at all levels, including some high-resolution feature maps, carry strong semantic information.
In addition, in order to further enhance face feature extraction, a multi-receptive-field feature fusion technique can also be used. For example, three parallel branches are used, and the effect of the 5x5 and 7x7 convolutions is obtained with stacks of 3x3 convolutions: the left branch is a 3x3 convolution, the middle branch replaces a 5x5 convolution with two 3x3 convolutions, and the right branch replaces a 7x7 convolution with three 3x3 convolutions. Through multi-receptive-field feature fusion, the network learns features over a larger receptive field.
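A PyTorch sketch of such a three-branch module is given below; the channel split and layer names are assumptions made for illustration, and the middle and right branches stack 3x3 convolutions to approximate 5x5 and 7x7 receptive fields as described above.

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldFusion(nn.Module):
    """Three parallel branches: 3x3, two stacked 3x3 (~5x5), three stacked 3x3 (~7x7)."""

    def __init__(self, channels):
        super().__init__()
        c = channels // 4
        self.branch3 = nn.Conv2d(channels, 2 * c, 3, padding=1)   # plain 3x3
        self.branch5_1 = nn.Conv2d(channels, c, 3, padding=1)     # first 3x3 of the 5x5 branch
        self.branch5_2 = nn.Conv2d(c, c, 3, padding=1)            # second 3x3 of the 5x5 branch
        self.branch7_3 = nn.Conv2d(c, c, 3, padding=1)            # third 3x3, giving a ~7x7 field

    def forward(self, x):
        out3 = self.branch3(x)
        out5 = self.branch5_2(torch.relu(self.branch5_1(x)))
        out7 = self.branch7_3(torch.relu(out5))
        # concatenate the three receptive-field scales along the channel axis
        return torch.cat([out3, out5, out7], dim=1)
```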
How to determine the global trajectory of the target object to be tracked according to the dynamic face feature library in step S304 will be specifically explained through the following embodiments.
Alternatively, referring to fig. 6, the step S304 includes:
s601, obtaining a face image of the target object to be tracked.
S602, extracting a first face feature vector of a face image of the target object to be tracked.
The face image of the target object to be tracked can be an image shot by any camera in the target place or can also be an image shot by other camera equipment.
In this embodiment, the face image of the target object to be tracked needs to be obtained first, and the first face feature vector of the face image of the target object to be tracked can be obtained by using the above-mentioned face detection model and face encoding processing.
S603, comparing the first face feature vector with a plurality of face feature vectors stored in a dynamic face feature library in advance to obtain a feature list of the face image of the target object to be tracked.
The feature list comprises at least one face feature vector to be selected, and the similarity between each face feature vector to be selected and the first face feature vector is smaller than a preset threshold value. For example, if the similarity of the face feature vectors of the face images is smaller than a preset threshold, it is determined that the input faces are the same person.
It should be understood that the target object to be tracked may be any object that passes through one or more camera shooting areas within the target site.
For example, the target object to be tracked passes through the shooting area of the camera 1 and the shooting area of the camera 2, the video stream of the shooting area of the camera 1 and the video stream of the shooting area of the camera 2 both contain the target object to be tracked, and correspondingly, the pre-constructed dynamic face feature library contains the face feature information of the target object to be tracked in the video stream of the shooting area of the camera 1 and the face feature information of the video stream of the shooting area of the camera 2, the appearance time of the target object to be tracked in the shooting area of the camera 1, the appearance time of the target object to be tracked in the shooting area of the camera 2, and the camera numbers for shooting the target object to be tracked.
In this embodiment, the first face feature vector may be compared with a plurality of face feature vectors pre-stored in the dynamic face feature library to obtain a feature list of a face image of the target object to be tracked, where the feature list of the face image of the target object to be tracked includes: the face feature information of the video stream of the area shot by the camera 1 and the face feature information of the video stream of the area shot by the camera 2.
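A sketch of this comparison step is shown below. The text describes the criterion as a similarity score smaller than a preset threshold, so the score here is taken to be the Euclidean distance between L2-normalized feature vectors; that choice, the threshold value and the function name are assumptions for illustration.

```python
import numpy as np

def build_candidate_list(query_feature, library_entries, threshold=0.8):
    """Compare the query face feature with every entry in the dynamic library.

    library_entries: iterable of (feature_vector, appearance_time, camera_id).
    """
    candidates = []
    query = np.asarray(query_feature, dtype=np.float32)
    for feature, appearance_time, camera_id in library_entries:
        score = np.linalg.norm(query - np.asarray(feature, dtype=np.float32))
        if score < threshold:   # kept as a face feature vector to be selected
            candidates.append((feature, appearance_time, camera_id))
    return candidates
```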
And S604, determining the global track of the target object to be tracked according to the face feature vectors to be selected.
On the basis of the above embodiment, the global trajectory of the target object to be tracked can be found from the dynamic face feature library according to the obtained face feature vectors to be selected, so as to realize the tracking of the moving path of the target object to be tracked.
How to determine the global trajectory of the target object to be tracked according to the face feature vectors to be selected in step S604 will be specifically explained through the following embodiments.
Alternatively, referring to fig. 7, the step S604 includes:
s701, obtaining detection results associated with the face feature vectors to be selected from the dynamic face feature library.
The detection result associated with each candidate face feature vector comprises the following steps: the time of appearance of the object in each camera shooting area, and each camera number of the shooting object.
For example, the first face feature of the video stream of the target object to be tracked in the shooting area of the camera 1 and the second face feature of the video stream of the target object to be tracked in the shooting area of the camera 2, the detection result associated with the first face feature is the occurrence time of the target object to be tracked in the shooting area of the camera 1 and the number of the camera 1 (i.e. the camera 1) shooting the target object to be tracked, and the detection result associated with the second face feature is the occurrence time of the target object to be tracked in the shooting area of the camera 2 and the number of the camera 2 (i.e. the camera 2) shooting the target object to be tracked.
Therefore, the detection result associated with each face feature vector to be selected can be searched and obtained from the dynamic face feature library obtained by the construction.
S702, extracting a plurality of track segments from the video stream of each camera shooting area according to the detection result associated with each face feature vector to be selected.
And S703, combining the track segments to obtain the global track of the target object to be tracked.
In this embodiment, the appearance time of the target object to be tracked in each camera shooting area and the number of each camera that captured it can be found from the detection results associated with the face feature vectors to be selected. Then, according to that appearance time and those camera numbers, a plurality of track segments are extracted from the video streams of the camera shooting areas, and the track segments are combined according to the camera numbers to obtain the global track of the target object to be tracked. In this way, information such as the position and time at which the target object to be tracked appears in the target place can be located quickly and effectively, the target object to be tracked can be accurately positioned, and the problem of cross-camera object tracking can be solved.
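A minimal sketch of the combination step, assuming each candidate detection reduces to an (appearance time, camera number) pair, is given below; extracting the actual per-camera video segments is omitted.

```python
def build_global_trajectory(candidate_detections):
    """Order candidate detections by appearance time to form the global track.

    candidate_detections: list of (appearance_time, camera_id) pairs associated
    with the face feature vectors selected from the dynamic face feature library.
    """
    ordered = sorted(candidate_detections, key=lambda d: d[0])
    # each entry is one track segment: where and when the object was seen
    return [{"camera": camera_id, "time": appearance_time}
            for appearance_time, camera_id in ordered]
```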
Optionally, the method further comprises: and drawing the three-dimensional track of each object in the target place according to the global track of each object and the number of each camera.
According to the number of each camera, the installation position of each camera in the target place, namely the spatial data of each camera can be determined.
In this embodiment, the moving trajectory of each object, that is, the three-dimensional trajectory of each object in the target place, can be drawn according to the global trajectory of each object and the spatial data corresponding to each camera, thereby realizing a three-dimensional simulated display of the action trajectory of each object in the target place, improving the efficiency of viewing the action trajectory of each object in the target place, and completing the tracking and observation of the path and behavior of each object.
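As a hedged sketch of this drawing step, the following assumes a dictionary mapping each camera number to its installation position (the spatial data mentioned above) and uses matplotlib's 3D plotting; the data format and function name are illustrative.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3d projection)

def draw_3d_trajectory(global_trajectory, camera_positions):
    """Plot the path through the installation positions of the visited cameras.

    camera_positions: dict mapping camera_id -> (x, y, z) spatial data.
    """
    points = [camera_positions[seg["camera"]] for seg in global_trajectory]
    xs, ys, zs = zip(*points)
    ax = plt.figure().add_subplot(111, projection="3d")
    ax.plot(xs, ys, zs, marker="o")
    for seg, (x, y, z) in zip(global_trajectory, points):
        ax.text(x, y, z, str(seg["time"]))   # annotate each point with its appearance time
    plt.show()
```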
Based on the same inventive concept, the embodiment of the present application further provides a cross-camera target object tracking device corresponding to the cross-camera target object tracking method, and as the principle of solving the problem of the device in the embodiment of the present application is similar to that of the cross-camera target object tracking method in the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and repeated details are omitted.
Referring to fig. 8, the apparatus includes:
an obtaining module 801, configured to obtain video streams of multiple camera shooting areas in a target site;
the detection module 802 is configured to perform face detection on each video stream to obtain a detection result of each object, where the detection result includes face feature information of the object, an appearance time of the object in a camera shooting area, and a camera number of the shooting object;
a constructing module 803, configured to construct a dynamic face feature library according to the detection result of each object, where the dynamic face feature library is configured to store face feature information of multiple objects, the occurrence time of each object in each camera shooting area, and the number of each camera shooting each object;
and the determining module 804 is configured to determine a global trajectory of the target object to be tracked according to the dynamic face feature library.
Optionally, the detecting module 802 is further configured to:
extracting key frames in the video stream, wherein the key frames are frame images containing human faces;
and performing face detection on the key frame to obtain a face detection result, wherein the face detection result comprises a face detection frame and a plurality of face key points;
performing affine transformation processing on the key frame according to the face detection frame and the plurality of face key points to obtain a corrected intermediate face image;
coding the intermediate face image to obtain a coded face feature vector, and taking the face feature vector as face feature information;
and determining the occurrence time of the object in the camera shooting area and the camera number of the shooting object according to the key frame.
Optionally, the detecting module 802 is further configured to:
acquiring the first time when a key frame appears in a video stream shot by a camera and the serial number of the camera shooting the video stream where the key frame is located;
and taking the first time as the appearance time of the object in the shooting area of the camera, and taking the number of the camera for shooting the video stream of the key frame as the number of the camera for shooting the object.
Optionally, the detecting module 802 is further configured to:
and performing face detection on the key frame by adopting a face detection model obtained by pre-training to obtain a face detection frame containing the target object and a plurality of face key points.
Optionally, the determining module 804 is further configured to:
acquiring a face image of a target object to be tracked;
extracting a first face characteristic vector of a face image of a target object to be tracked;
comparing the first face feature vector with a plurality of face feature vectors prestored in a dynamic face feature library to obtain a feature list of a face image of a target object to be tracked, wherein the feature list comprises at least one face feature vector to be selected, and the similarity between each face feature vector to be selected and the first face feature vector is smaller than a preset threshold value;
and determining the global track of the target object to be tracked according to the face feature vectors to be selected.
Optionally, the determining module 804 is further configured to:
acquiring a detection result associated with each face feature vector to be selected from a dynamic face feature library;
extracting a plurality of track segments from the video stream of each camera shooting area according to the detection result associated with each face feature vector to be selected;
and combining the track segments to obtain the global track of the target object to be tracked.
Optionally, the apparatus further comprises:
and the drawing module is used for drawing the three-dimensional track of each object in the target place according to the global track of each object and the serial number of each camera.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
These above modules may be one or more integrated circuits configured to implement the above methods, such as: one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
Optionally, the present application also provides a program product, such as a computer readable storage medium, comprising a program which, when being executed by a processor, is adapted to carry out the above-mentioned method embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only one logical division, and other divisions are possible in practice; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to perform some of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.

Claims (10)

1. A target object tracking method across cameras, the method comprising:
acquiring video streams of a plurality of camera shooting areas in a target place;
performing face detection on each video stream to obtain a detection result of each object, wherein the detection result comprises face feature information of the object, the appearance time of the object in the camera shooting area and the camera number for shooting the object;
according to the detection result of each object, constructing a dynamic face feature library, wherein the dynamic face feature library is used for storing face feature information of a plurality of objects, the appearance time of the objects in each camera shooting area and the serial number of each camera for shooting the objects;
and determining the global track of the target object to be tracked according to the dynamic human face feature library.
2. The method of claim 1, wherein the performing face detection on each video stream to obtain a detection result of each object comprises:
extracting a key frame in the video stream, wherein the key frame is a frame image containing the human face;
performing face detection on the key frame to obtain a face detection result, wherein the face detection result comprises: the system comprises a face detection frame and a plurality of face key points;
performing affine transformation processing on the key frame according to the face detection frame and the plurality of face key points to obtain a corrected intermediate face image;
encoding the intermediate face image to obtain an encoded face feature vector, and taking the face feature vector as the face feature information;
and determining, according to the key frame, the appearance time of the object in the camera shooting area and the number of the camera shooting the object.
3. The method of claim 2, wherein the determining, according to the key frame, the appearance time of the object in the camera shooting area and the number of the camera shooting the object comprises:
acquiring the first time when the key frame appears in the video stream shot by the camera and the serial number of the camera shooting the video stream of the key frame;
and taking the first time as the appearance time of the object in the camera shooting area, and taking the serial number of the camera shooting the video stream of the key frame as the serial number of the camera shooting the object.
4. The method of claim 2, wherein the performing face detection on the key frame to obtain a face detection result comprises:
and performing face detection on the key frame by adopting a face detection model obtained by pre-training to obtain a face detection frame containing the target object and a plurality of face key points.
5. The method according to any one of claims 1 to 4, wherein the determining a global trajectory of a target object to be tracked according to the dynamic face feature library comprises:
acquiring a face image of the target object to be tracked;
extracting a first face characteristic vector of a face image of the target object to be tracked;
comparing the first face feature vector with a plurality of face feature vectors prestored in the dynamic face feature library to obtain a feature list of the face image of the target object to be tracked, wherein the feature list comprises at least one face feature vector to be selected, and the similarity between each face feature vector to be selected and the first face feature vector is smaller than a preset threshold;
and determining the global track of the target object to be tracked according to the face feature vectors to be selected.
6. The method according to claim 5, wherein the determining a global trajectory of the target object to be tracked according to each of the face feature vectors to be selected comprises:
obtaining a detection result associated with each face feature vector to be selected from the dynamic face feature library;
extracting a plurality of track segments from the video stream of each camera shooting area according to the detection result associated with each to-be-selected face feature vector;
and combining the track segments to obtain the global track of the target object to be tracked.
7. The method according to any one of claims 1-6, further comprising:
and drawing the three-dimensional track of each object in the target place according to the global track of each object and the number of each camera.
8. An apparatus for tracking a target object across cameras, the apparatus comprising:
the acquisition module is used for acquiring video streams of a plurality of camera shooting areas in a target place;
the detection module is used for carrying out human face detection on each video stream to obtain a detection result of each object, wherein the detection result comprises human face feature information of the object, the appearance time of the object in the camera shooting area and the camera number for shooting the object;
the construction module is used for constructing a dynamic face feature library according to the detection result of each object, and the dynamic face feature library is used for storing face feature information of a plurality of objects, the appearance time of the objects in each camera shooting area and the serial number of each camera for shooting the objects;
and the determining module is used for determining the global track of the target object to be tracked according to the dynamic human face feature library.
9. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210478177.7A 2022-05-05 2022-05-05 Cross-camera target object tracking method, device, equipment and storage medium Pending CN114821430A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210478177.7A CN114821430A (en) 2022-05-05 2022-05-05 Cross-camera target object tracking method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210478177.7A CN114821430A (en) 2022-05-05 2022-05-05 Cross-camera target object tracking method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114821430A true CN114821430A (en) 2022-07-29

Family

ID=82512132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210478177.7A Pending CN114821430A (en) 2022-05-05 2022-05-05 Cross-camera target object tracking method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114821430A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071686A (en) * 2023-02-27 2023-05-05 中国信息通信研究院 Correlation analysis method, device and system for cameras in industrial Internet
CN116582653A (en) * 2023-07-14 2023-08-11 广东天亿马信息产业股份有限公司 Intelligent video monitoring method and system based on multi-camera data fusion
CN116582653B (en) * 2023-07-14 2023-10-27 广东天亿马信息产业股份有限公司 Intelligent video monitoring method and system based on multi-camera data fusion

Similar Documents

Publication Publication Date Title
Jia et al. Segment, magnify and reiterate: Detecting camouflaged objects the hard way
JP7282851B2 (en) Apparatus, method and program
CN111402294A (en) Target tracking method, target tracking device, computer-readable storage medium and computer equipment
CN110427905A (en) Pedestrian tracting method, device and terminal
Feng et al. Cityflow-nl: Tracking and retrieval of vehicles at city scale by natural language descriptions
Xue et al. Panoramic Gaussian Mixture Model and large-scale range background substraction method for PTZ camera-based surveillance systems
CN111652035B (en) Pedestrian re-identification method and system based on ST-SSCA-Net
CN114821430A (en) Cross-camera target object tracking method, device, equipment and storage medium
Han et al. Dr. vic: Decomposition and reasoning for video individual counting
CN107766864B (en) Method and device for extracting features and method and device for object recognition
CN109902550A (en) The recognition methods of pedestrian's attribute and device
CN111325107A (en) Detection model training method and device, electronic equipment and readable storage medium
Guo et al. Monocular 3D multi-person pose estimation via predicting factorized correction factors
CN113557546B (en) Method, device, equipment and storage medium for detecting associated objects in image
CN115272967A (en) Cross-camera pedestrian real-time tracking and identifying method, device and medium
CN111767839A (en) Vehicle driving track determining method, device, equipment and medium
CN114038067B (en) Coal mine personnel behavior detection method, equipment and storage medium
CN113450457B (en) Road reconstruction method, apparatus, computer device and storage medium
Yan et al. Pig face detection method based on improved CenterNet algorithm
Sun et al. Light-YOLOv3: License plate detection in multi-vehicle scenario
CN112381024B (en) Multi-mode-fused unsupervised pedestrian re-identification rearrangement method
CN110322471B (en) Method, device and equipment for concentrating panoramic video and storage medium
CN114913470A (en) Event detection method and device
JP2023512359A (en) Associated object detection method and apparatus
Huu et al. Proposing Smart System for Detecting and Monitoring Vehicle Using Multiobject Multicamera Tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination