CN111383252A - Multi-camera target tracking method, system, device and storage medium - Google Patents


Info

Publication number
CN111383252A
CN111383252A (application CN201811637626.8A)
Authority
CN
China
Prior art keywords
camera
image
tracking
target
frame
Prior art date
Legal status
Granted
Application number
CN201811637626.8A
Other languages
Chinese (zh)
Other versions
CN111383252B (en)
Inventor
吴旻烨
毕凝
Current Assignee
Yaoke Intelligent Technology Shanghai Co ltd
Original Assignee
Yaoke Intelligent Technology Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Yaoke Intelligent Technology Shanghai Co ltd filed Critical Yaoke Intelligent Technology Shanghai Co ltd
Priority to CN201811637626.8A
Publication of CN111383252A
Application granted
Publication of CN111383252B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/292Multi-camera tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

For each set of image frames captured synchronously by multiple cameras, the method selects a plurality of original image regions in every image frame, generates response maps by passing the extracted feature maps through a filter, and extracts the tracking result in the corresponding image frame using a tracking-result bounding box determined from the response map that contains the highest-scoring target point. When the tracked target is judged to be occluded under some cameras, the tracking result of the occluded image frame is obtained by an inter-camera constraint method, which effectively removes tracking failures caused by occlusion. The multiple cameras also provide information about the tracked target from several viewing angles simultaneously, and this information can be used as input so that the correlation filter learns multi-angle features, making the tracker more robust to viewpoint changes.

Description

Multi-camera target tracking method, system, device and storage medium
Technical Field
The present application relates to the field of target tracking technologies, and in particular, to a multi-camera target tracking method, system, device, and storage medium.
Background
General visual tracking (generic visual tracking) is a fundamental task in the field of computer vision. The task gives a bounding box of the tracked object for the first frame in the video sequence, and the tracker predicts the position and size of the tracked object for each frame thereafter. With the rapid development of visual tracking, the technology is increasingly applied to specific tracking in places such as crowded areas and security checkpoints. On the other hand, visual tracking is also a key technology in automatic driving. The visual tracking task can be divided into four categories, i.e., single-camera single-target tracking, single-camera multi-target tracking, multi-camera single-target tracking, and multi-camera multi-target tracking, according to the number of cameras and the number of objects to be tracked. The key to the tracking task is accurate target location and the efficiency of the algorithm.
At present, the main problems faced by the tracking task include occlusion, illumination change, deformation of the tracked object, and motion blur. Tracking methods based on a single camera are limited by the physical properties of a single viewpoint and are poorly robust when the tracked object is occluded.
Existing multi-camera tracking algorithms are essentially multi-target trackers. They first detect people in each frame and then use a ReID (re-identification) network to extract features and match detections across frames, thereby achieving tracking. However, because of the limitations of the detection algorithms they rely on, such methods can only track people and cannot track arbitrary objects.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, it is an object of the present application to provide a multi-camera object tracking method, system, device and storage medium, which solve various problems of the object tracking technology in the prior art.
In order to achieve the above and other related objects, the present application provides a multi-camera target tracking method applied to an electronic device associated with a camera array, wherein the cameras in the camera array shoot synchronously and the image frames captured by the cameras at each moment form one image sequence; the method comprises the following steps: when an image sequence is correspondingly configured with an initial bounding box for framing the tracking target in its image frames, extracting a plurality of original image regions from each image frame in the image sequence using the initial bounding box and a plurality of candidate bounding boxes obtained by keeping its center unchanged and varying its scaling scale; inputting the plurality of original image regions of each image frame into a feature extractor to obtain the corresponding plurality of feature maps of that image frame; filtering the plurality of feature maps of each image frame with a filter to obtain the corresponding plurality of response maps; among the plurality of response maps obtained for the image frame corresponding to each camera in each image sequence, taking the response map containing the highest-scoring target point as the basis for acquiring the tracking result, and taking the highest score as the basis for generating the score of the corresponding camera on that image frame; taking the position of the pixel corresponding to the target point in the image frame as the bounding-box reference point and the scale of the candidate bounding box used to obtain the tracking result as the bounding-box scale, and combining the two to construct a tracking-result bounding box for extracting the tracking result from the image frame; comparing the score of each camera on its image frame in the image sequence with a preset threshold, so as to judge whether the tracking target is occluded under that camera; for image frames of the first type, belonging to cameras judged to have the tracking target occluded, correcting the corresponding tracking-result bounding box by an inter-camera constraint method and extracting the tracking result from the first-type image frames; and for image frames of the second type, belonging to cameras judged to have the tracking target not occluded, extracting the tracking result with the corresponding tracking-result bounding box.
In one embodiment, the filter and the feature extractor are pre-trained, the pre-training comprising one or more iterations, each iteration comprising: in a randomly selected video of a tracked target, randomly selecting a predetermined number of image frames for generating a plurality of training sample pairs, wherein each training sample pair comprises an original image region extracted from a randomly selected image frame according to the reference standard and a response map obtained from that original image region, or, alternatively, an image region obtained by offsetting the original image region and a response map generated from the offset image region; and training the filter with one portion of the plurality of training sample pairs and the feature extractor with the other portion.
In one embodiment, when the filter is trained, the parameters of the feature extractor are fixed and a first objective function is minimized with respect to the filter; and/or, when the feature extractor is trained, the parameters of the filter are fixed and a second objective function is minimized with respect to the feature extractor.
In one embodiment, the video comes from a target tracking data set comprising: one or more combinations of an OTB Dataset, a VOT Dataset, a Temple Color 128 Dataset, a VIVID Tracking Dataset, and a UAV123 Dataset.
In one embodiment, the filter is trained online with a target training set so that it can be updated.
In one embodiment, the online training of the filter comprises: generating training samples to be added to the target training set from the image frames collected by each camera, each training sample containing an original image region extracted from an image frame; and inputting the training samples into a third objective function of the filter, which is a weighted correlation-filtering loss of the form

E(f) = Σ_i Σ_j s_i^j · || S_f(x_i^j) − y_i^j ||²

wherein S_f(x_i^j) denotes the response map obtained by filtering the j-th training sample of the i-th camera, y_i^j denotes the response map corresponding to the reference standard of the j-th training sample of the i-th camera, and s_i^j denotes the score obtained when the tracking result was acquired from the j-th training sample of the i-th camera. The objective function is transformed into the frequency domain, in which conj() denotes the complex conjugate and the hat symbol ^ denotes the Fourier transform; the gradient ∇E(f) is computed iteratively and the objective function E(f) is optimized with a conjugate gradient method to train the filter.
In an embodiment, the multi-camera target tracking method further comprises performing an update action on the target training set, comprising: whenever each camera has collected a predetermined number of image frames, adding the tracking results of the cameras currently judged to have the tracking target not occluded to the target training set as training samples for the update.
In one embodiment, the response map is expressed as a gaussian distribution centered on a reference point in the original image region extracted by the reference standard.
In an embodiment, the method for multi-camera constraint includes: selecting a first camera from a first camera set containing cameras judged that the tracking target is occluded, and acquiring a second camera from a second camera set containing cameras judged that the tracking target is not occluded in the image sequence, wherein each camera in the first camera set and the second camera set corresponds to each image frame in the same image sequence; calculating a homography matrix for estimating a transformation relation of a motion plane of a tracking target under a first camera and a second camera according to a plurality of pairs of tracking results obtained by the first camera and the second camera in a plurality of image sequences; mapping a predetermined point of a tracking target in each second type image frame in each image sequence by using the homography matrix to obtain a mapping point of a first type image frame in the same image sequence; and carrying out constraint correction on the obtained tracking result bounding box according to the obtained mapping points of each first type image frame so as to obtain a corrected tracking result.
In an embodiment, the homography matrix is obtained by calculating a position relation between each pair of matching points in a plurality of pairs of tracking results obtained by a first tracking track of a first camera and a second tracking track of a second camera at a plurality of same time; the tracking trajectory refers to a time-sequentially ordered set of tracking results of the image frames acquired by each camera along the time sequence.
In an embodiment, before each feature map is filtered, the method further comprises smoothing the feature map.
To achieve the above and other related objects, the present application provides an electronic device, related to a camera array, including: at least one transceiver coupled to the camera array; at least one memory storing a computer program; at least one processor, coupled to the transceiver and the memory, for executing the computer program to perform the multi-camera target tracking method.
To achieve the above and other related objects, the present application provides a computer storage medium storing a computer program which, when executed, performs the multi-camera object tracking method.
In order to achieve the above and other related objects, the present application provides a multi-camera target tracking system applied to an electronic device associated with a camera array, wherein the cameras in the camera array shoot synchronously and the image frames captured by the cameras at each moment form one image sequence; the system comprises: an image processing module, configured to, when an image sequence is correspondingly configured with an initial bounding box for framing the tracking target in its image frames, extract a plurality of original image regions from each image frame in the image sequence using the initial bounding box and a plurality of candidate bounding boxes obtained by keeping its center unchanged and varying its scaling scale; a feature extractor, configured to extract features from the plurality of original image regions of each image frame to obtain the corresponding plurality of feature maps of that image frame; a filter, configured to filter the plurality of feature maps of each image frame to obtain the corresponding plurality of response maps; and a tracking calculation module, configured to take, among the plurality of response maps obtained for the image frame corresponding to each camera in each image sequence, the response map containing the highest-scoring target point as the basis for acquiring the tracking result, and the highest score as the basis for generating the score of the corresponding camera on that image frame; take the position of the pixel corresponding to the target point in the image frame as the bounding-box reference point and the scale of the candidate bounding box used to obtain the tracking result as the bounding-box scale, and combine the two to construct a tracking-result bounding box for extracting the tracking result from the image frame; compare the score of each camera on its image frame in the image sequence with a preset threshold, so as to judge whether the tracking target is occluded under that camera; for image frames of the first type, belonging to cameras judged to have the tracking target occluded, correct the corresponding tracking-result bounding box by an inter-camera constraint method and extract the tracking result from the first-type image frames; and for image frames of the second type, belonging to cameras judged to have the tracking target not occluded, extract the tracking result with the corresponding tracking-result bounding box.
As described above, the multi-camera target tracking method, system, device and storage medium of the present application select a plurality of original image regions from each of the image frames captured synchronously by the multiple cameras at each moment, generate response maps by applying a filter to the extracted feature maps, and extract the tracking result in the corresponding image frame through a tracking-result bounding box determined from the response map containing the highest-scoring target point; when the tracked target is judged to be occluded under some cameras, the tracking result of the occluded image frame is obtained by an inter-camera constraint method, which effectively removes tracking failures caused by occlusion; moreover, the multiple cameras simultaneously provide information about the tracked target from several viewing angles, which can be used as input so that the correlation filter learns multi-angle features, making the tracker more robust to viewpoint changes.
Drawings
Fig. 1 is a schematic flowchart of a multi-camera target tracking method according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of a multi-camera constraint method in the embodiment of the present application.
Fig. 3 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Fig. 4 is a block diagram of a multi-camera object tracking system according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. The present application is capable of other and different embodiments and of being practiced or being carried out in various ways, and it is capable of other various modifications and changes without departing from the spirit of the present application. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Aiming at the shortcomings of existing target tracking, the present application provides a scheme that realizes target tracking by analyzing the multi-angle image frames captured by a camera array; for the tracking of a single target in particular, a more precise result can be achieved.
The camera array refers to a shooting device that combines a plurality of cameras to shoot the same scene or the same object, and the structure of the camera array may be, for example, a row of cameras, a column of cameras, or M rows by N columns of cameras, however, the camera array is not necessarily in the form of a square matrix, and may also be in various shapes such as a circle, a triangle, or other shapes.
The plurality of cameras in the camera array shoot images of the tracked target from different angles, the tracked target is positioned through image analysis, information of the tracked target at each angle can be well presented, and the problem that the tracked target is lost due to the fact that the tracked target is shielded under certain camera view angles can be avoided.
In some embodiments, the tracking target may be, for example, a person, an animal, or other moving object such as a car.
Fig. 1 shows a schematic flow chart of a multi-camera target tracking method provided in the embodiment of the present application.
The method is applied to an electronic device related to a camera array. In some embodiments, the electronic device may be a processing terminal, such as a desktop computer, laptop computer, smart phone, tablet computer, or other terminal with processing capabilities, that is independent of the camera array and is coupled to the camera array; in some embodiments, the camera array and electronics may also be integrated as components together as a single product, such as a light field camera, and the electronics may be implemented as circuitry in the light field camera that is attached to one or more circuit boards in the light field camera; in some embodiments, the cameras may be coupled to each other, and the electronic device may be implemented by a circuit in each of the cameras.
Each camera in the camera array shoots synchronously, and each image frame shot by each camera at each time is used as an image sequence.
For example, the N cameras in the camera array shoot synchronously at time t to obtain I_1 to I_N, the image frames acquired by the N cameras respectively; the corresponding image sequence is denoted I_i, i = 1, …, N.
The method specifically comprises the following steps:
Step S101: when an image sequence is correspondingly configured with an initial bounding box for framing the tracking target in its image frames, a plurality of original image regions are extracted from each image frame in the image sequence using the initial bounding box and a plurality of candidate bounding boxes obtained by keeping its center unchanged and varying its scaling scale.
In one embodiment, let the tracking target correspond to an initial bounding box B_i under each camera i in the image sequence I, i.e. one initial bounding box is required per camera for each image sequence. A bounding box is a geometric box that frames, by manual annotation or machine recognition, the image region where the tracking target is located in the image; the region framed by the initial bounding box can be used as the reference standard, i.e. the ground truth.
In this embodiment, the initial bounding box is a rectangular bounding box, and the framed image area is represented as:
B_i = (x, y, w, h);
where (x, y) represents the coordinates of a reference point of the bounding box, such as a center point, but also other feature points.
In this embodiment, the reference point (x, y) is the coordinate of the top-left corner of the bounding box, and (w, h) represents the width and height of the bounding box, in pixels.
Taking the initial bounding box B_i used to select the original image region as the center, original image regions P_i^d are extracted from the image frame corresponding to camera i in the image sequence I through candidate bounding boxes at different scaling scales d, where d = 1, …, N_d indexes the different scaling scales.
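As a minimal sketch of this multi-scale region extraction (the geometric scale step, the helper names, and the default values below are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def candidate_boxes(box, num_scales=5, scale_step=1.05):
    """Generate candidate bounding boxes that share the center of `box`
    (x, y, w, h, with (x, y) the top-left corner) but vary in scale."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    boxes = []
    for d in range(num_scales):
        s = scale_step ** (d - num_scales // 2)   # scales distributed around 1.0
        bw, bh = w * s, h * s
        boxes.append((cx - bw / 2.0, cy - bh / 2.0, bw, bh))
    return boxes

def crop_region(frame, box):
    """Cut the original image region framed by `box` out of `frame`,
    clipping to the image borders."""
    x, y, w, h = [int(round(v)) for v in box]
    H, W = frame.shape[:2]
    x0, y0 = max(x, 0), max(y, 0)
    x1, y1 = min(x + w, W), min(y + h, H)
    return frame[y0:y1, x0:x1]

# one image frame per camera -> a list of original image regions per frame:
# regions = [crop_region(frame_i, b) for b in candidate_boxes(initial_box_i)]
```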
Step S102: and inputting a plurality of original image areas of each image frame into a feature extractor to obtain a plurality of corresponding feature maps of each image frame.
In an embodiment, for the image feature extraction, the feature extractor may preferably be implemented with a commonly used CNN network model, for example VGG, ResNet, AlexNet, Inception, and the like.
In an embodiment, following the above example, the original image regions P_i, i = 1, …, N, at the multiple scales are used as the input of the pre-trained feature extractor g to obtain the feature maps under the current image frame, x_i^d = g(P_i^d), where d indexes the different scaling scales and i = 1, …, N; the set of feature maps is denoted {x_i^d}. Optionally, after the feature map set is obtained, it may be smoothed with a smoothing function: the feature map of the image region at each scaling scale is multiplied element-wise by a two-dimensional cosine window function w, which performs a weighted smoothing of the region and yields the smoothed feature map w ⊙ x_i^d, giving better continuity in the frequency domain.
it should be noted that the smoothing process is only optional, and not necessary; the smoothing method is not limited to the two-dimensional cosine window function, and a similar smoothing method, such as gaussian filtering, may be used.
Step S103: and filtering the plurality of characteristic maps of each image frame by using a filter to obtain a plurality of corresponding response maps.
In an embodiment, in accordance with the foregoing example, it is preferable to filter the feature map after the smoothing process; of course, in other embodiments, the feature map that is not smoothed may be filtered.
The filter f performs correlation filtering on the feature map of each camera i at each scale d, giving the response map of camera i at scale d under the current frame, denoted S_f(x_i^d) = f ⋆ x_i^d, where ⋆ denotes the correlation-filtering operation.
In one embodiment, the response map is expressed as a gaussian distribution centered on a reference point in an original image region extracted by a reference standard, each point in the response map corresponds to a score, and the highest score is 1; each response map is used for describing the correlation degree of each pixel in the original image area and the tracking target.
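A minimal sketch of the correlation-filtering step in the Fourier domain is given below; a single multi-channel filter of the same spatial size as the feature map, filtered per channel and summed over channels, is an assumption for illustration, and a practical correlation filter would also add regularization, interpolation, and explicit scale handling:

```python
import numpy as np

def correlation_response(feat, filt):
    """Correlate a smoothed feature map `feat` (H, W, C) with a filter `filt` of the same
    shape and return the (H, W) response map, computed per channel in the Fourier domain
    and summed over channels."""
    F = np.fft.fft2(feat, axes=(0, 1))
    Hf = np.fft.fft2(filt, axes=(0, 1))
    # cross-correlation corresponds to conjugation of the filter spectrum
    resp = np.fft.ifft2(np.conj(Hf) * F, axes=(0, 1)).real
    return resp.sum(axis=2)
```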
Step S104: among the plurality of response maps obtained for the image frame corresponding to each camera in each image sequence, the response map containing the highest-scoring target point is taken as the basis for acquiring the tracking result, and the highest score as the basis for generating the score of the corresponding camera on that image frame; the position of the pixel corresponding to the target point in the image frame is taken as the bounding-box reference point, the scale of the candidate bounding box used to obtain the tracking result as the bounding-box scale, and the two are combined to construct a tracking-result bounding box for extracting the tracking result from the image frame.
In an embodiment, following the above example, each image sequence contains several image frames, each corresponding to one camera in the camera array; among the response maps corresponding to the original image regions of the various scales extracted from an image frame, the response map containing the highest-scoring target point is used as the basis for acquiring the tracking result, and from it a tracking-result bounding box is constructed to obtain the tracking result in that image frame.
Preferably, the scale d* at which camera i attains its highest score s_i is taken as the scale of the bounding box for the current image frame, and the center of the moved target box is determined by the position (x_max, y_max) of the highest-scoring pixel in the corresponding response map; the highest score s_i is the maximum value over all scales and pixel positions of the response maps of camera i. The tracking result is then the region cut out of the current image frame by the tracking-result bounding box constructed from (x_max, y_max) and the scale d*.
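A sketch of how the tracking-result bounding box of one camera could be assembled from its response maps; the mapping from the response-map peak back to image coordinates is simplified here and is an assumption:

```python
import numpy as np

def pick_result(responses, boxes):
    """responses: list of (H, W) response maps, one per candidate box/scale;
    boxes: the matching candidate boxes (x, y, w, h).
    Returns the camera score and the tracking-result bounding box."""
    best_scale = int(np.argmax([r.max() for r in responses]))
    r = responses[best_scale]
    score = float(r.max())
    y_max, x_max = np.unravel_index(np.argmax(r), r.shape)
    x, y, w, h = boxes[best_scale]
    # re-center the chosen candidate box on the highest-scoring pixel
    result_box = (x + x_max - w / 2.0, y + y_max - h / 2.0, w, h)
    return score, result_box
```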
Step S105: the score of each camera on its corresponding image frame in the image sequence is compared with a preset threshold, so as to judge whether the tracking target is occluded under that camera.
The threshold value may be set empirically.
Step S106: for image frames of the first type, belonging to cameras judged to have the tracking target occluded, the corresponding tracking-result bounding box is corrected by an inter-camera constraint method and the tracking result is extracted from the first-type image frames.
In one embodiment, following the above example, if the score s_i of camera i is less than a threshold th_occ, the tracking target is considered to be occluded under camera i, and the tracking result of this camera in the current frame is determined by the inter-camera constraint method.
As shown in fig. 2, a flow chart of the multi-camera constraint method is shown.
The multi-camera constraint method comprises the following steps:
step S201: selecting a first camera from a first camera set containing cameras judged that the tracking target is occluded, and acquiring a second camera from a second camera set containing cameras judged that the tracking target is not occluded in the image sequence, wherein each camera in the first camera set and the second camera set corresponds to each image frame in the same image sequence.
In an embodiment, following the above example, for an uncalibrated array of synchronized cameras, the tracking trajectory T_i of camera i obtained by the tracking algorithm is formed by the center points of the tracking results of the image frames acquired by that camera in temporal order. When it is judged that, in the current image sequence, the cameras of a set O (o ∈ O) are occluded and the cameras of a set K (k ∈ K) are not occluded, cameras are taken from the occluded camera set and from the non-occluded camera set, and the highest scores of all cameras in the current image frame are acquired.
Step S202: and calculating a homography matrix for estimating a transformation relation of a motion plane of the tracking target under the first camera and the second camera according to a plurality of pairs of tracking results obtained by the first camera and the second camera in a plurality of image sequences.
In an embodiment, the homography matrix is obtained by calculating a position relation between each pair of matching points in a plurality of pairs of tracking results obtained by a first tracking track of a first camera and a second tracking track of a second camera at a plurality of same time; the tracking trajectory refers to a time-sequentially ordered set of tracking results of the image frames acquired by each camera along the time sequence.
Following the foregoing example, the homography matrix H_ji is calculated from the center points of the last n tracking results (n ≥ 4) of the trajectories T_j and T_i, and is used to estimate the transformation relation of the motion plane of the tracking target under the two cameras; H_ji represents the coordinate transformation from camera j to camera i, and is obtained by setting up linear equations from the corresponding points of T_j and T_i and solving them with SVD.
For the center points (x_1, y_1) and (x_2, y_2) of the target boxes of the two trajectories at the same time t, which form a pair of matched points, and the homography matrix H, we have:
λ (x_2, y_2, 1)^T = H (x_1, y_1, 1)^T
Unfolding this gives:
x_2 = (H_11 x_1 + H_12 y_1 + H_13) / (H_31 x_1 + H_32 y_1 + H_33)
y_2 = (H_21 x_1 + H_22 y_1 + H_23) / (H_31 x_1 + H_32 y_1 + H_33)
For ease of solution, the above equations can be rearranged into the form Ah = 0:
x_2 (H_31 x_1 + H_32 y_1 + H_33) − (H_11 x_1 + H_12 y_1 + H_13) = 0
y_2 (H_31 x_1 + H_32 y_1 + H_33) − (H_21 x_1 + H_22 y_1 + H_23) = 0
Rewriting these as vector products and normalizing the last element of H to 1, let h = (H_11, H_12, H_13, H_21, H_22, H_23, H_31, H_32, 1)^T; then the two equations become:
a_x^T h = 0
a_y^T h = 0
where a_x = (−x_1, −y_1, −1, 0, 0, 0, x_2 x_1, x_2 y_1, x_2)^T and a_y = (0, 0, 0, −x_1, −y_1, −1, y_2 x_1, y_2 y_1, y_2)^T.
Each pair of matched points yields such a pair of equations, and H has 8 unknowns; using the center points of the last n tracking results (n ≥ 4) of the two trajectories, the following system is obtained:
A h = 0
where A is the matrix formed by stacking the row vectors a_x^T and a_y^T of all matched point pairs.
For such an over-determined system, a least-squares solution can be obtained by singular value decomposition of the coefficient matrix A:
U Σ V^T = SVD(A^T A)
The column of V corresponding to the smallest singular value in Σ is the least-squares solution of A h = 0 (subject to ||h|| = 1), from which H is obtained.
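This direct-linear-transform solution can be sketched as follows (the helper names are illustrative; the solution is taken as the right singular vector associated with the smallest singular value):

```python
import numpy as np

def estimate_homography(pts_src, pts_dst):
    """Estimate H such that [x2, y2, 1]^T ~ H [x1, y1, 1]^T from n >= 4 matches.
    pts_src, pts_dst: arrays of shape (n, 2) with corresponding center points."""
    rows = []
    for (x1, y1), (x2, y2) in zip(pts_src, pts_dst):
        rows.append([-x1, -y1, -1, 0, 0, 0, x2 * x1, x2 * y1, x2])
        rows.append([0, 0, 0, -x1, -y1, -1, y2 * x1, y2 * y1, y2])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    h = Vt[-1]                 # right singular vector of the smallest singular value
    H = h.reshape(3, 3)
    return H / H[2, 2]         # normalize so that H33 = 1

def map_point(H, pt):
    """Apply the homography to a 2-D point (lambda * x' = H x)."""
    x, y = pt
    v = H @ np.array([x, y, 1.0])
    return v[:2] / v[2]
```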
Step S203: and mapping preset points of the tracking target in each second type image frame in each image sequence by using the homography matrix to obtain mapping points of the first type image frame in the same image sequence.
Following the above example, the center x_k of the tracking target in every camera k that is not occluded in the current image frame is transformed into the coordinates of the occluded camera o:
λ x′_ko = H_ko x_k
step S204: and carrying out constraint correction on the obtained tracking result bounding box according to the obtained mapping points of each first type image frame so as to obtain a corrected tracking result.
Following the above example, for the occluded camera o, the coordinates x′_ko of the occluded target under camera o are calculated from each of the N_k non-occluded cameras and used to apply a constraint correction to the position x_o of the tracked target predicted in the camera where the occlusion occurs; here N_k denotes the number of cameras in the set K, x_o is the position information (x, y, W, H) of the tracking-result bounding box in the current image frame, and x′_o = (x′, y′, W′, H′) is the position information of the reference point (for example the center point) of the tracking-result bounding box obtained after the inter-camera constraint has been applied, giving the corrected tracking result.
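A sketch of this correction step follows. The exact combination formulas appear only as formula images in the original publication, so the simple averaging of the mapped centers used below is an assumption, as are the helper names:

```python
import numpy as np

def correct_occluded_box(box_o, centers_k, homographies_ko):
    """box_o: predicted (x, y, w, h) in the occluded camera o;
    centers_k: target centers (x, y) in the non-occluded cameras k;
    homographies_ko: 3x3 matrices H_ko mapping camera k to camera o."""
    mapped = []
    for c, H in zip(centers_k, homographies_ko):
        v = H @ np.array([c[0], c[1], 1.0])
        mapped.append(v[:2] / v[2])            # x'_ko: the center mapped into camera o
    cx, cy = np.mean(mapped, axis=0)           # assumed: average over the N_k cameras
    x, y, w, h = box_o
    return (cx - w / 2.0, cy - h / 2.0, w, h)  # keep the predicted size, move the center
```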
Step S107: for image frames of the second type, belonging to cameras judged to have the tracking target not occluded, the tracking result is extracted with the corresponding tracking-result bounding box.
According to the above description, in the embodiment of the present application, in the case that the score certainty is low, the final tracking result is obtained by using the inter-camera constraint method rather than the filter result.
In one embodiment, the filter and feature extractor are pre-trained, the pre-training comprising: one or more iterative calculations.
Each iterative calculation includes:
in a randomly selected video of a tracked target, randomly selecting a predetermined number of image frames for generating a plurality of training sample pairs, wherein each training sample pair comprises an original image region extracted from a randomly selected image frame according to the reference standard and a response map obtained from that original image region, or, alternatively, an image region obtained by offsetting the original image region and a response map generated from the offset image region; and training the filter with one portion of the plurality of training sample pairs and the feature extractor with the other portion.
In one embodiment, when the filter is trained, the parameters of the feature extractor are fixed and a first objective function is minimized with respect to the filter; and/or, when the feature extractor is trained, the parameters of the filter are fixed and a second objective function is minimized with respect to the feature extractor.
Specifically, the video comes from a target tracking data set, which includes one or more combinations of an OTB Dataset, a VOT Dataset, a Temple Color 128 Dataset, a VIVID Tracking Dataset, and a UAV123 Dataset.
For example, in each iteration a tracking target may be randomly selected and 16 pictures randomly extracted from the corresponding video for generating training sample pairs (p_i, y_i), where p_i is a raw image region (patch) cut out of a raw image frame according to the reference standard (ground truth). The tracking target is not necessarily at the center of the cropped patch; it is randomly shifted by an offset δ to increase data diversity, and accordingly the center of the response map y_i, generated with a Gaussian kernel w_G and centered on the target, is also shifted by δ:
p_i = (x, y, W, H)
p′_i = (x ± δ, y ± δ, W, H)
y_i = w_G, a two-dimensional Gaussian whose means μ_1, μ_2 characterize the centers in the horizontal and vertical directions, whose variances σ_1, σ_2 characterize the spread of the Gaussian in the two directions, and whose correlation coefficient ρ couples the two directions; w_G(x, y) denotes a two-dimensional Gaussian distribution centered on (x, y).
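A sketch of generating such a training label; an axis-aligned Gaussian (ρ = 0) with a peak value of 1 is assumed for simplicity, and the sigma values and patch size are illustrative:

```python
import numpy as np

def gaussian_response(height, width, center, sigma=(2.0, 2.0)):
    """Response map y_i: a 2-D Gaussian of shape (height, width) whose peak (score 1)
    sits at `center` = (cx, cy); when a random offset delta is applied to the crop,
    the same delta is applied to `center`."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center
    g = np.exp(-0.5 * (((xs - cx) / sigma[0]) ** 2 + ((ys - cy) / sigma[1]) ** 2))
    return g  # maximum value 1 at the center

# example: shift both the patch and the label center by the same random delta
# delta = np.random.randint(-8, 9, size=2)
# label = gaussian_response(64, 64, (32 + delta[0], 32 + delta[1]))
```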
Each iteration extracts 16 sample pairs (p, y) from the dataset, representing the image regions and the corresponding response maps, and the training is divided into two phases:
The filter f is trained first, using for example the first 10 samples of each batch of inputs, to learn the correlation filtering. In this phase the parameters of the feature-extraction network are fixed and the loss of the filtering result is minimized:
f* = argmin_f || S_f(F(p; ω)) − y ||²
wherein y is the response map corresponding to the reference standard, F is the deep-learning feature-extraction operation, p is the input image patch, ω is the parameter of the neural network (the same below), and f* is the filter obtained by optimizing this loss function. The optimization uses a conjugate gradient algorithm.
The feature extractor F is then trained, using the last 6 samples. In this phase the filter parameter f* is fixed and the loss function is minimized with respect to the feature-extraction network:
ω* = argmin_ω Σ_i || s_i − y_i ||²
wherein y_i is the response map corresponding to the reference standard, s_i is the result of correlation filtering with x_i as input, and ∇ denotes the gradient; ω* is updated using a gradient descent method.
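The alternation between the two phases can be sketched as follows. This is not the patent's implementation: PyTorch, the Adam optimizer, and the toy network and filter sizes are assumptions made purely for illustration, whereas the patent uses a conjugate-gradient solver for the filter and gradient descent for the feature extractor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                            nn.Conv2d(32, 32, 3, padding=1))
filt = nn.Parameter(torch.randn(1, 32, 15, 15) * 0.01)   # correlation filter as a conv kernel

opt_filter = torch.optim.Adam([filt], lr=1e-3)
opt_features = torch.optim.Adam(feature_net.parameters(), lr=1e-4)

def response(patches):
    """Feature extraction followed by correlation filtering (implemented as convolution)."""
    feats = feature_net(patches)
    return F.conv2d(feats, filt, padding=7)

def train_iteration(patches, labels):
    """patches: (16, 3, H, W) image crops; labels: (16, 1, H, W) Gaussian response maps."""
    # phase 1: train the filter on the first 10 samples, feature extractor frozen
    for p in feature_net.parameters():
        p.requires_grad_(False)
    loss_f = F.mse_loss(response(patches[:10]), labels[:10])
    opt_filter.zero_grad()
    loss_f.backward()
    opt_filter.step()

    # phase 2: train the feature extractor on the last 6 samples, filter frozen
    for p in feature_net.parameters():
        p.requires_grad_(True)
    filt.requires_grad_(False)
    loss_g = F.mse_loss(response(patches[10:16]), labels[10:16])
    opt_features.zero_grad()
    loss_g.backward()
    opt_features.step()
    filt.requires_grad_(True)
```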
In one embodiment, the on-line training of the filter comprises:
generating training samples to be added to the target training set from the image frames collected by each camera, each training sample containing an original image region extracted from an image frame; and inputting the training samples into a third objective function of the filter, which is a weighted correlation-filtering loss of the form
E(f) = Σ_i Σ_j s_i^j · || S_f(x_i^j) − y_i^j ||²
wherein S_f(x_i^j) denotes the response map obtained by filtering the j-th training sample of the i-th camera, y_i^j denotes the response map corresponding to the reference standard of the j-th training sample of the i-th camera, and s_i^j denotes the score obtained when the tracking result was acquired from the j-th training sample of the i-th camera.
The objective function is transformed into the frequency domain, in which conj() denotes the complex conjugate and the hat symbol ^ denotes the Fourier transform; the gradient ∇E(f) is computed iteratively and the objective function E(f) is optimized with a conjugate gradient method to train the filter.
In an embodiment, the multi-camera target tracking method further comprises performing an update action on the target training set, comprising: whenever each camera has collected a predetermined number of image frames, adding the tracking results of the cameras currently judged to have the tracking target not occluded to the target training set as training samples for the update.
For example, every 7 frames, the cameras whose score computed from the current image frame is greater than a threshold th_sp are identified; their tracking results are regarded as new training data and added as training samples to the target training set D, and a new filter f′ is trained using the updated target training set D.
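A small sketch of this periodic update of the target training set; the frame interval of 7 and the threshold th_sp follow the example above, while the function and variable names are illustrative:

```python
def maybe_update_training_set(frame_idx, camera_scores, tracking_results,
                              training_set, th_sp, interval=7):
    """Every `interval` frames, add the tracking results of cameras whose score
    exceeds th_sp (i.e. cameras judged not occluded) as new training samples."""
    if frame_idx % interval != 0:
        return training_set
    for cam_id, score in camera_scores.items():
        if score > th_sp:
            training_set.append(tracking_results[cam_id])
    return training_set
```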
Fig. 3 is a schematic structural diagram of an electronic device 300 according to an embodiment of the present disclosure.
In some embodiments, the electronic device 300 may be a processing terminal, such as a desktop computer, a laptop computer, a smart phone, a tablet computer, or other terminal with processing capabilities, coupled to the camera array 304 independently of the camera array 304; in some embodiments, the camera array 304 and the electronic device 300 may also be integrated together as a component as a single product, such as a light field camera, and the electronic device 300 may be implemented as circuitry in the light field camera that is attached to one or more circuit boards in the light field camera; in some embodiments, the cameras may be coupled to each other, and the electronic device 300 may be implemented by the cooperation of circuits in each of the cameras.
The electronic device 300 includes:
at least one transceiver 301 coupled to the camera array.
In one embodiment, the transceiver 301 comprises, for example, one or more of CVBS, VGA, DVI, HDMI, SDI, GigE, USB 3.0, Camera Link, HSLink, or CoaXPress interfaces.
At least one memory 302 storing computer programs;
at least one processor 303, coupled to the transceiver 301 and the memory 302, is configured to run the computer program to perform the multi-camera object tracking method.
In some embodiments, the memory 302 may include, but is not limited to, high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.
The processor 303 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In addition, the various computer programs referred to in the foregoing multi-camera object tracking method embodiments (e.g., the embodiments of fig. 1 and 2) may be loaded onto a computer-readable storage medium, which may include, but is not limited to, floppy disks, optical discs, CD-ROMs (compact disc read-only memories), magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read-only memories), EEPROMs (electrically erasable programmable read-only memories), magnetic or optical cards, flash memory, or other types of media/machine-readable media suitable for storing machine-executable instructions. The computer-readable storage medium may be a stand-alone product not yet connected to a computer device, or a component already installed in a computer device for use.
In particular implementations, the computer programs are routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
As shown in fig. 4, the multi-camera object tracking system in the embodiment of the present application is applied to an electronic device associated with a camera array, wherein each camera in the camera array takes pictures synchronously and each image frame taken by each camera at a time is taken as an image sequence. In this embodiment, technical features of specific implementation of the system are substantially the same as those of the multi-camera target tracking method in the foregoing embodiment, and technical contents that can be commonly used between embodiments are not repeated.
The system comprises:
an image processing module 401, configured to, in a case that an image sequence is configured with an initial bounding box for framing a tracking target from image frames therein, extract a plurality of original image regions in each image frame in the image sequence respectively by using the initial bounding box and a plurality of candidate bounding boxes obtained when the center of the initial bounding box is unchanged and the scaling scale of the initial bounding box is changed;
a feature extractor 402, configured to perform feature extraction on a plurality of original image regions of each image frame to obtain a plurality of corresponding feature maps of each image frame;
a filter 403, configured to filter the plurality of feature maps of each image frame to obtain a corresponding plurality of response maps;
a tracking calculation module 404, configured to take, among the plurality of response maps obtained for the image frame corresponding to each camera in each image sequence, the response map containing the highest-scoring target point as the basis for acquiring the tracking result, and the highest score as the basis for generating the score of the corresponding camera on that image frame; take the position of the pixel corresponding to the target point in the image frame as the bounding-box reference point and the scale of the candidate bounding box used to obtain the tracking result as the bounding-box scale, and combine the two to construct a tracking-result bounding box for extracting the tracking result from the image frame; compare the score of each camera on its corresponding image frame in the image sequence with a preset threshold, so as to judge whether the tracking target is occluded under that camera; for image frames of the first type, belonging to cameras judged to have the tracking target occluded, correct the corresponding tracking-result bounding box by an inter-camera constraint method and extract the tracking result from the first-type image frames; and for image frames of the second type, belonging to cameras judged to have the tracking target not occluded, extract the tracking result with the corresponding tracking-result bounding box.
In one embodiment, the filter and the feature extractor are pre-trained, the pre-training comprising one or more iterations, each iteration comprising: in a randomly selected video of a tracked target, randomly selecting a predetermined number of image frames for generating a plurality of training sample pairs, wherein each training sample pair comprises an original image region extracted from a randomly selected image frame according to the reference standard and a response map obtained from that original image region, or, alternatively, an image region obtained by offsetting the original image region and a response map generated from the offset image region; and training the filter with one portion of the plurality of training sample pairs and the feature extractor with the other portion.
In one embodiment, when the filter is trained, the parameters of the feature extractor are fixed and a first objective function is minimized with respect to the filter; and/or, when the feature extractor is trained, the parameters of the filter are fixed and a second objective function is minimized with respect to the feature extractor.
In one embodiment, the video comes from a target tracking data set comprising: one or more combinations of an OTB Dataset, a VOT Dataset, a Temple Color 128 Dataset, a VIVID Tracking Dataset, and a UAV123 Dataset.
In one embodiment, the filter is trained online with a target training set to obtain updates.
In one embodiment, the online training of the filter comprises: generating training samples to be added to the target training set from the image frames collected by each camera, each training sample containing an original image region extracted from an image frame; and inputting the training samples into a third objective function of the filter, which is a weighted correlation-filtering loss of the form
E(f) = Σ_i Σ_j s_i^j · || S_f(x_i^j) − y_i^j ||²
wherein S_f(x_i^j) denotes the response map obtained by filtering the j-th training sample of the i-th camera, y_i^j denotes the response map corresponding to the reference standard of the j-th training sample of the i-th camera, and s_i^j denotes the score obtained when the tracking result was acquired from the j-th training sample of the i-th camera. The objective function is transformed into the frequency domain, in which conj() denotes the complex conjugate and the hat symbol ^ denotes the Fourier transform; the gradient ∇E(f) is computed iteratively and the objective function E(f) is optimized with a conjugate gradient method to train the filter.
In one embodiment, the system further comprises a training set updating module, configured to perform an update action on the target training set, comprising: whenever each camera has collected a predetermined number of image frames, adding the tracking results of the cameras currently judged to have the tracking target not occluded to the target training set as training samples for the update.
In one embodiment, the response map is expressed as a gaussian distribution centered on a reference point in the original image region extracted by the reference standard.
In an embodiment, the method for multi-camera constraint includes: selecting a first camera from a first camera set containing cameras judged that the tracking target is occluded, and acquiring a second camera from a second camera set containing cameras judged that the tracking target is not occluded in the image sequence, wherein each camera in the first camera set and the second camera set corresponds to each image frame in the same image sequence; calculating a homography matrix for estimating a transformation relation of a motion plane of a tracking target under a first camera and a second camera according to a plurality of pairs of tracking results obtained by the first camera and the second camera in a plurality of image sequences; mapping a predetermined point of a tracking target in each second type image frame in each image sequence by using the homography matrix to obtain a mapping point of a first type image frame in the same image sequence; and carrying out constraint correction on the obtained tracking result bounding box according to the obtained mapping points of each first type image frame so as to obtain a corrected tracking result.
In an embodiment, the homography matrix is obtained by calculating a position relation between each pair of matching points in a plurality of pairs of tracking results obtained by a first tracking track of a first camera and a second tracking track of a second camera at a plurality of same time; the tracking trajectory refers to a time-sequentially ordered set of tracking results of the image frames acquired by each camera along the time sequence.
In one embodiment, the system further comprises: and the smoothing module is arranged between the feature extractor and the filter and used for smoothing the feature image and outputting the smoothed feature image to the filter.
It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these modules may all be implemented in the form of software invoked by a processing element, for example, the feature extractor may be implemented by a CNN network model; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the tracking calculation module may be a processing element separately set up, or may be implemented by being integrated into a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and the function of the tracking calculation module may be called and executed by a processing element of the apparatus. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, the steps of the method or the modules may be implemented by hardware integrated logic circuits in a processor element or instructions in software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
To sum up, the multi-camera target tracking method, system, device and storage medium of the present application select a plurality of original image regions from each of the image frames captured synchronously by the multiple cameras at each moment, generate response maps by applying a filter to the extracted feature maps, and extract the tracking result in the corresponding image frame through a tracking-result bounding box determined from the response map containing the highest-scoring target point; when the tracked target is judged to be occluded under some cameras, the tracking result of the occluded image frame is obtained by an inter-camera constraint method, which effectively removes tracking failures caused by occlusion; moreover, the multiple cameras simultaneously provide information about the tracked target from several viewing angles, which can be used as input so that the correlation filter learns multi-angle features, making the tracker more robust to viewpoint changes.
The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the application. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which may be accomplished by those skilled in the art without departing from the spirit and scope of the present disclosure be covered by the claims which follow.

Claims (14)

1. A multi-camera target tracking method is applied to an electronic device related to a camera array, each camera in the camera array synchronously shoots, and each image frame shot by each camera at each time is taken as an image sequence; the method comprises the following steps:
when an image sequence is correspondingly configured with an initial bounding box for framing the tracking target in its image frames, extracting a plurality of original image regions from each image frame in the image sequence using the initial bounding box and a plurality of candidate bounding boxes obtained by keeping its center unchanged and varying its scaling scale;
inputting the plurality of original image regions of each image frame into a feature extractor to obtain the corresponding plurality of feature maps of that image frame;
filtering the plurality of feature maps of each image frame with a filter to obtain the corresponding plurality of response maps;
among the plurality of response maps obtained for the image frame corresponding to each camera in each image sequence, taking the response map containing the highest-scoring target point as the basis for acquiring the tracking result, and taking the highest score as the basis for generating the score of the corresponding camera on that image frame; taking the position of the pixel corresponding to the target point in the image frame as the bounding-box reference point and the scale of the candidate bounding box used to obtain the tracking result as the bounding-box scale, and combining the two to construct a tracking-result bounding box for extracting the tracking result from the image frame;
comparing the score of each camera on its corresponding image frame in the image sequence with a preset threshold, so as to judge whether the tracking target is occluded under that camera;
for image frames of the first type, belonging to cameras judged to have the tracking target occluded, correcting the corresponding tracking-result bounding box by an inter-camera constraint method and extracting the tracking result from the first-type image frames; and for image frames of the second type, belonging to cameras judged to have the tracking target not occluded, extracting the tracking result with the corresponding tracking-result bounding box.
2. The multi-camera target tracking method of claim 1, wherein the filter and the feature extractor are pre-trained, the pre-training comprising:
one or more iterative computations, each iterative computation comprising:
randomly selecting, in a randomly selected video of a tracked target, a predetermined number of image frames for generating a plurality of training sample pairs, wherein each training sample pair comprises: an original image area extracted from a randomly selected image frame according to the reference standard (ground truth), and a response map obtained from that original image area; or, each training sample pair comprises: an image area obtained by offsetting the original image area, and a response map generated according to the offset image area;
training the filter and the feature extractor by using one portion and the other portion of the plurality of training sample pairs, respectively.
3. The multi-camera target tracking method of claim 2, wherein, when the filter is trained, parameters of the feature extractor are fixed so as to minimize a first objective function for the filter; and/or, when the feature extractor is trained, parameters of the filter are fixed so as to minimize a second objective function for the feature extractor.
4. The multi-camera target tracking method of claim 2, wherein the video is from a target tracking data set, the target tracking data set comprising one or a combination of more of: the OTB Dataset, the VOT Dataset, the Temple Color 128 Dataset, the VIVID Tracking Dataset, and the UAV123 Dataset.
5. The multi-camera target tracking method of claim 1, wherein the filter is trained online on a target training set that is updated.
6. The multi-camera target tracking method of claim 5, wherein the on-line training of the filter comprises:
generating training samples to be added into the target training set according to the image frames acquired by each camera, wherein each training sample comprises an original image area extracted from the image frame;
inputting the training samples into a third objective function of the filter, the third objective function E(f) being of the form shown in formula FDA0001930412910000021, wherein the quantity shown in FDA0001930412910000022 denotes the response map obtained by filtering the jth training sample of the ith camera, the quantity shown in FDA0001930412910000023 denotes the response map corresponding to the reference standard of the jth training sample of the ith camera, and the quantity shown in FDA0001930412910000024 denotes the score obtained when the tracking result is acquired from the jth training sample of the ith camera;
transforming the objective function into the frequency domain, as shown in formulas FDA0001930412910000025 and FDA0001930412910000026, wherein conj() denotes the complex conjugate and the superscript ^ denotes the Fourier transform;
and iteratively solving the gradient shown in FDA0001930412910000027 and optimizing the objective function E(f) by a conjugate gradient method so as to train the filter.
7. The multi-camera target tracking method according to claim 5 or 6, further comprising performing an update on the target training set, the update comprising: each time the cameras have acquired a preset number of image frames, adding the current tracking results of the cameras under which the tracking target is judged not to be occluded into the target training set as training samples for updating.
8. The multi-camera target tracking method according to claim 1, 2 or 6, wherein the response map is expressed as a Gaussian distribution centered on a reference point of the original image area extracted according to the reference standard.
9. The multi-camera target tracking method of claim 1, wherein the inter-camera constraint method comprises:
selecting a first camera from a first camera set containing the cameras under which the tracking target is judged to be occluded, and selecting a second camera from a second camera set containing the cameras under which the tracking target is judged not to be occluded, wherein the cameras in the first camera set and the second camera set correspond to image frames in the same image sequence;
calculating, according to a plurality of pairs of tracking results obtained by the first camera and the second camera over a plurality of image sequences, a homography matrix for estimating the transformation relation of the motion plane of the tracking target between the first camera and the second camera;
mapping, by using the homography matrix, a predetermined point of the tracking target in each second-type image frame of each image sequence to obtain a mapping point in the first-type image frame of the same image sequence;
and performing constraint correction on the obtained tracking-result bounding box according to the mapping point obtained for each first-type image frame, so as to obtain the corrected tracking result.
10. The multi-camera target tracking method of claim 9, wherein the homography matrix is calculated from the positional relationship between each pair of matching points in a plurality of pairs of tracking results obtained at a plurality of common time instants from a first tracking trajectory of the first camera and a second tracking trajectory of the second camera; a tracking trajectory refers to the set of tracking results of the image frames acquired by a camera, ordered along the time sequence.
11. The multi-camera target tracking method of claim 1, further comprising, before filtering each feature map: smoothing the feature map.
12. An electronic device, associated with a camera array, comprising:
at least one transceiver coupled to the camera array;
at least one memory storing a computer program;
at least one processor, coupled to the transceiver and the memory, and configured to execute the computer program so as to perform the multi-camera target tracking method according to any one of claims 1 to 11.
13. A computer storage medium, characterized in that a computer program is stored thereon which, when run, performs the multi-camera target tracking method according to any one of claims 1 to 11.
14. A multi-camera target tracking system, applied to an electronic device associated with a camera array, wherein the cameras in the camera array shoot synchronously and the image frames shot by all of the cameras at each time instant are taken as one image sequence; the system comprises:
an image processing module, configured to, when an image sequence is correspondingly configured with an initial bounding box for selecting a tracking target from the image frames of the image sequence, extract a plurality of original image areas from each image frame in the image sequence by using the initial bounding box and a plurality of candidate bounding boxes obtained by keeping the center of the initial bounding box unchanged and varying its scaling scale;
a feature extractor, configured to perform feature extraction on the plurality of original image areas of each image frame to obtain a plurality of corresponding feature maps of each image frame;
a filter, configured to filter the plurality of feature maps of each image frame to obtain a plurality of corresponding response maps;
a tracking calculation module, configured to: obtain, from the plurality of response maps obtained from the image frame corresponding to each camera in each image sequence, the response map containing the target point with the highest score as the basis for acquiring the tracking result, and use the highest score as the basis for generating the score of the corresponding camera in that image frame; take the position, in the image frame, of the pixel point corresponding to the target point as a bounding-box reference point, take the scale of the candidate bounding box used for obtaining the tracking result as the bounding-box scale, and combine the bounding-box reference point and the bounding-box scale to construct a tracking-result bounding box for extracting the tracking result from the image frame; compare the score of each camera in the corresponding image frame of the image sequence with a preset threshold value, so as to judge whether the tracking target is occluded under that camera; for a first-type image frame of a camera under which the tracking target is judged to be occluded, correct the corresponding tracking-result bounding box by an inter-camera constraint method and extract the tracking result from the first-type image frame; and for a second-type image frame of a camera under which the tracking target is judged not to be occluded, extract the tracking result by using the corresponding tracking-result bounding box.
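As an editorial illustration of the online training recited in claim 6 (the third objective function itself appears only as formula images FDA0001930412910000021 to FDA0001930412910000027), the following sketch substitutes a simplified, score-weighted, single-channel least-squares correlation filter solved in closed form in the Fourier domain in place of the conjugate-gradient iteration; function and variable names are assumptions.

```python
# Hypothetical stand-in for the score-weighted online filter update of claim 6.
import numpy as np

def train_filter_online(samples, target_responses, scores, reg=1e-2):
    """samples[j]: feature map of the j-th training sample (2-D array);
    target_responses[j]: its reference-standard (Gaussian) response map;
    scores[j]: weight derived from the tracking score of that sample."""
    num = np.zeros(samples[0].shape, dtype=np.complex128)
    den = np.zeros(samples[0].shape, dtype=np.complex128)
    for x, y, s in zip(samples, target_responses, scores):
        X, Y = np.fft.fft2(x), np.fft.fft2(y)
        num += s * np.conj(X) * Y      # weighted cross-power spectrum
        den += s * np.conj(X) * X      # weighted auto-power spectrum
    return num / (den + reg)           # correlation filter in the Fourier domain
```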
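The reference-standard response map of claim 8, a Gaussian distribution centered on a reference point of the extracted image area, might be generated as in the short sketch below; the sigma value and the function name are assumptions.

```python
# Hypothetical sketch for claim 8: a 2-D Gaussian response map centred on the reference point.
import numpy as np

def gaussian_response(height, width, center, sigma=2.0):
    cy, cx = center
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
```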
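The inter-camera constraint of claims 9 and 10 could, for instance, be realized with OpenCV's homography estimation, as in the hedged sketch below; the matched points are taken at common time instants from the two tracking trajectories (for example, bounding-box bottom centres on the motion plane), and all names are assumptions.

```python
# Hypothetical sketch of the inter-camera constraint of claims 9 and 10 using OpenCV.
import numpy as np
import cv2

def correct_occluded_box(history_pts_unoccluded, history_pts_occluded,
                         current_pt_unoccluded, occluded_box):
    """Estimate the homography between the two views from matched trajectory points,
    map the predetermined point seen by the unoccluded camera into the occluded
    camera's image, and recentre that camera's tracking-result bounding box on it."""
    src = np.asarray(history_pts_unoccluded, dtype=np.float32).reshape(-1, 1, 2)
    dst = np.asarray(history_pts_occluded, dtype=np.float32).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    pt = np.asarray(current_pt_unoccluded, dtype=np.float32).reshape(1, 1, 2)
    mapped = cv2.perspectiveTransform(pt, H).reshape(2)

    _, _, w, h = occluded_box
    return (float(mapped[0]), float(mapped[1]), w, h)   # constraint-corrected box
```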
CN201811637626.8A 2018-12-29 2018-12-29 Multi-camera target tracking method, system, device and storage medium Active CN111383252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811637626.8A CN111383252B (en) 2018-12-29 2018-12-29 Multi-camera target tracking method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811637626.8A CN111383252B (en) 2018-12-29 2018-12-29 Multi-camera target tracking method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN111383252A true CN111383252A (en) 2020-07-07
CN111383252B CN111383252B (en) 2023-03-24

Family

ID=71218038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811637626.8A Active CN111383252B (en) 2018-12-29 2018-12-29 Multi-camera target tracking method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN111383252B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015163830A1 (en) * 2014-04-22 2015-10-29 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Target localization and size estimation via multiple model learning in visual tracking
WO2016026370A1 (en) * 2014-08-22 2016-02-25 Zhejiang Shenghui Lighting Co., Ltd. High-speed automatic multi-object tracking method and system with kernelized correlation filters
US20170286774A1 (en) * 2016-04-04 2017-10-05 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN106981071A (en) * 2017-03-21 2017-07-25 广东华中科技大学工业技术研究院 A kind of method for tracking target applied based on unmanned boat
CN107730536A (en) * 2017-09-15 2018-02-23 北京飞搜科技有限公司 A kind of high speed correlation filtering object tracking method based on depth characteristic
CN108776975A (en) * 2018-05-29 2018-11-09 安徽大学 A kind of visual tracking method based on semi-supervised feature and filter combination learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU Wanjun et al.: "Multi-scale correlation filter tracking algorithm with occlusion discrimination", Journal of Image and Graphics *
BAO Xiao'an et al.: "Anti-occlusion target tracking algorithm based on KCF and SIFT features", Computer Measurement & Control *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833379A (en) * 2020-07-16 2020-10-27 西安电子科技大学 Method for tracking target position in moving object by monocular camera
CN111833379B (en) * 2020-07-16 2023-07-28 西安电子科技大学 Method for tracking target position in moving object by monocular camera
CN111815682A (en) * 2020-09-07 2020-10-23 长沙鹏阳信息技术有限公司 Multi-target tracking method based on multi-track fusion
CN113012194A (en) * 2020-12-25 2021-06-22 深圳市铂岩科技有限公司 Target tracking method, device, medium and equipment
CN113012194B (en) * 2020-12-25 2024-04-09 深圳市铂岩科技有限公司 Target tracking method, device, medium and equipment
CN115424187A (en) * 2022-11-07 2022-12-02 松立控股集团股份有限公司 Auxiliary driving method for multi-angle camera collaborative importance ranking constraint
CN115619832A (en) * 2022-12-20 2023-01-17 浙江莲荷科技有限公司 Multi-camera collaborative multi-target track confirmation method, system and related device

Also Published As

Publication number Publication date
CN111383252B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN111383252B (en) Multi-camera target tracking method, system, device and storage medium
KR101333871B1 (en) Method and arrangement for multi-camera calibration
JP6095018B2 (en) Detection and tracking of moving objects
US8290212B2 (en) Super-resolving moving vehicles in an unregistered set of video frames
CN107358623B (en) Relevant filtering tracking method based on significance detection and robustness scale estimation
US10311595B2 (en) Image processing device and its control method, imaging apparatus, and storage medium
US8446468B1 (en) Moving object detection using a mobile infrared camera
CN109685045B (en) Moving target video tracking method and system
Lee et al. Simultaneous localization, mapping and deblurring
US20150205997A1 (en) Method, apparatus and computer program product for human-face features extraction
CN108229475B (en) Vehicle tracking method, system, computer device and readable storage medium
EP3798975B1 (en) Method and apparatus for detecting subject, electronic device, and computer readable storage medium
EP3093822B1 (en) Displaying a target object imaged in a moving picture
CN111340749B (en) Image quality detection method, device, equipment and storage medium
CN108875500B (en) Pedestrian re-identification method, device and system and storage medium
Seo Image denoising and refinement based on an iteratively reweighted least squares filter
CN111507340B (en) Target point cloud data extraction method based on three-dimensional point cloud data
Hua et al. Removing atmospheric turbulence effects via geometric distortion and blur representation
CN116883897A (en) Low-resolution target identification method
CN114757984A (en) Scene depth estimation method and device of light field camera
Carbajal et al. Single image non-uniform blur kernel estimation via adaptive basis decomposition.
CN108062741B (en) Binocular image processing method, imaging device and electronic equipment
Halperin et al. Clear Skies Ahead: Towards Real‐Time Automatic Sky Replacement in Video
JP2018010359A (en) Information processor, information processing method, and program
CN111144441A (en) DSO luminosity parameter estimation method and device based on feature matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant