CN118057479A - Multi-camera target tracking and re-identification algorithm and system in public places - Google Patents
- Publication number: CN118057479A
- Application number: CN202211449912.8A
- Authority: CN (China)
- Legal status: Pending
Abstract
The application discloses a multi-camera target tracking and re-identification algorithm. The algorithm mainly comprises the following steps: arranging a plurality of cameras whose shooting areas do not overlap; having each camera record the monitorable area in its current environment and acquire a data set; tracking a target of interest within each camera's monitoring area; transforming the camera positions and coverage-area coordinates into a unified world coordinate system; transforming the target motion trajectory into the same world coordinate system; predicting, by an optical flow method applied to the motion trajectory, the camera whose monitoring area the target is likely to enter; and acquiring data from the 2-3 cameras whose shooting areas the target may enter and performing target re-identification on the captured video. When the target is re-identified, the first frame is taken as the template frame, completing multi-camera target tracking and re-identification; if the target is not found, the search area is expanded according to the optical flow method, cameras at other positions where the target may appear are selected, and re-identification tracking is performed again to complete multi-camera target tracking and re-identification.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a multi-camera target tracking and re-identification system.
Background
A large public place is usually equipped with many surveillance cameras, used for locating people, dynamic monitoring, and the like. The monitored areas are generally not contiguous, and conventional tracking algorithms in video surveillance cannot effectively track targets across cameras. Visual target tracking is a fundamental task in computer vision: given a designated region of interest in an initial frame, the target is continuously located in subsequent frames, providing a basis for understanding and analyzing the target's motion behavior and its regularities. Pedestrian re-identification matches the same target at different positions and different moments in video captured by multiple cameras. The position and angle layout of cameras in surveillance sites is complex and variable; targets at different locations are affected by illumination changes, shooting viewpoints, and pose changes, so their appearance differs from that of the initial target.
Pedestrian re-identification algorithms can be broadly divided into feature-description-based methods and distance-metric-based methods. The resolution of ordinary surveillance video is low, so the target to be detected cannot be confirmed by face recognition or similar means; where face recognition fails, pedestrian re-identification technology is irreplaceable. Different pedestrians may also have similar body shapes and clothing, which challenges re-identification accuracy.
With growing security awareness, public safety is receiving ever more attention. Conventional person-search requires manually browsing the image data of cameras near the target's area to determine its position, consuming a great deal of manpower and time. Pedestrian re-identification technology, combined with target tracking, reduces this labor and time cost by automatically recognizing the designated target.
Disclosure of Invention
The invention aims to provide a multi-camera target tracking and re-identification system that improves recognition accuracy and reliability over the prior art.
The multi-camera pedestrian re-identification method provided by the invention comprises the following steps:
S1, arranging a plurality of cameras whose imaging areas do not overlap;
S2, each camera records the monitorable area in its current environment and acquires a data set;
S3, each camera tracks a target of interest in its monitoring area;
S31, obtaining the features of the input picture using a twin neural network and obtaining the deformation offsets of the deformable convolution.
Further, step S31 includes:
(1) The network consists of an offline-pretrained AlexNet; the AlexNet network model has five layers in total, and each convolution layer contains a ReLU excitation function;
(2) The fourth convolution layer is a deformable convolution layer: it takes the feature map produced by the preceding convolution layer as input, learns offsets for it, and applies them to the convolution kernel to achieve deformable convolution, adding the offsets $\Delta p_k$ to the regular grid $R$:

$y(p) = \sum_{k=1}^{K} w(p_k)\, x(p + p_k + \Delta p_k)$

where $K = |R|$, $p$ denotes the pixel location, $w$ is the weight, $x$ is the input feature map, $p_k$ enumerates the locations of grid $R$, and $\Delta p_k$ is the learned offset of grid $R$;
(3) The initial frame of the video sequence is the template frame and the current frame is the detection frame; both are fed into the twin neural network to obtain their feature maps. The template frame input size is 127×127×3, giving a 6×6×256 feature map; the detection frame input size is 256×256×3, giving a 22×22×256 feature map;
S32, inputting the feature map into an RPN network to generate a candidate region, wherein the process is as follows:
(1) The candidate area network consists of two parts, one part is a classification branch for distinguishing a foreground from a background, and the other part is a regression branch for fine tuning a candidate area;
(2) For the classification branch, the candidate region network receives the template-frame and detection-frame feature maps generated in S31 and applies a further convolution to each, reducing the feature maps while producing template-frame features and detection-frame features; the template-frame features are then used as a convolution kernel correlated over the detection-frame features. The output feature map contains 2k channels, representing the foreground and background scores of the k anchors, and a response map is generated through region-of-interest pooling and offset pooling;
(3) For the regression branch, the same operation is performed to obtain a position regression value for each sample, comprising the dx, dy, dw and dh values; that is, the output feature map contains 4k channels, representing the coordinate-offset predictions of the k anchors;
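The template-as-kernel correlation used by both branches can be illustrated with a minimal single-channel NumPy sketch (sizes and the planted-target setup are illustrative, not taken from the patent; a real SiamRPN-style head first lifts the template features into 2k/4k channel groups):

```python
import numpy as np

def cross_correlate(detection_feat, template_feat):
    """Slide the template feature map over the detection feature map
    (valid mode), as the RPN does when the template features act as a
    convolution kernel; returns the response map."""
    c, th, tw = template_feat.shape
    _, dh, dw = detection_feat.shape
    out = np.zeros((dh - th + 1, dw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = detection_feat[:, i:i + th, j:j + tw]
            out[i, j] = np.sum(patch * template_feat)
    return out

# Toy sizes mirroring the 6x6 template / 22x22 detection feature maps
# (channel count reduced from 256 to 8 to keep the example small).
rng = np.random.default_rng(0)
template = rng.standard_normal((8, 6, 6))
detection = np.zeros((8, 22, 22))
detection[:, 10:16, 4:10] = template   # plant the template at (10, 4)

response = cross_correlate(detection, template)
peak = np.unravel_index(np.argmax(response), response.shape)
print(peak)  # the response peaks where the template was planted
```

The response map plays the role of the classification branch's foreground score: its maximum marks the most likely target location.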
S33, determining the tracking position by the following steps:
(1) Similarity measurement is performed between the candidate boxes of the template branch and those of the detection branch to obtain the bounding box of the tracking result;
(2) The final predicted bounding boxes are screened using non-maximum suppression (NMS) to obtain the final tracked target bounding box;
(3) Non-maximum suppression retains the optimal candidate box by computing the intersection-over-union between boxes, $\mathrm{IoU}(A,B) = \frac{|A \cap B|}{|A \cup B|}$, and discarding boxes that overlap a higher-scoring box beyond a threshold;
S4, transforming the camera positions and coverage-area coordinates into a unified world coordinate system;
S5, transforming the target motion trajectory into the unified world coordinate system;
The specific method for transforming the coordinate systems into the unified world coordinate system is as follows:

$P_c = R\, P_w + T \qquad (1)$

where $P_c$ is the camera-coordinate-system coordinate, $P_w$ is the world-coordinate-system coordinate, $R$ is the rotation matrix, and $T$ is the translation vector. In particular, the pedestrian motion area can be treated as a plane, so the z-axis data can be omitted to simplify calculation.
S6, predicting, by an optical flow method and from the target motion trajectory, the camera whose monitoring area the target is likely to enter;
The optical flow method calculates the offset of each pixel between adjacent frames over the whole image, forming an optical-flow displacement field that represents the pedestrian's direction of motion. It is assumed that the video exhibits brightness constancy, temporal continuity, and spatial coherence during shooting. An optical-flow vector is obtained for each pixel in the region of interest in each frame, and the position change of the target relative to the camera is derived from these vectors:

$I(x, y, t) = I(x + dx,\; y + dy,\; t + dt) \qquad (2)$

where $I(x, y, t)$ is the brightness of the pixel at space-time coordinates $(x, y, t)$ and $I(x+dx, y+dy, t+dt)$ is its brightness after moving; $x, y$ are image coordinates, $t$ is time, $dx, dy$ are the displacements, and $dt$ is the elapsed time. Expanding the right-hand side as a first-order Taylor series and eliminating $I(x, y, t)$ yields

$\frac{\partial I}{\partial x}\, dx + \frac{\partial I}{\partial y}\, dy + \frac{\partial I}{\partial t}\, dt = 0 \qquad (3)$

Dividing both sides by $dt$ gives

$I_x u + I_y v + I_t = 0 \qquad (4)$

where $u = dx/dt$ and $v = dy/dt$ are the components of the optical flow and $I_x, I_y, I_t$ are the partial derivatives of the brightness with respect to $x$, $y$ and $t$. The optical-flow energy minimization function $E(u, v)$ is

$E(u, v) = \iint \left[ (I_x u + I_y v + I_t)^2 + \alpha \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \right] dx\, dy \qquad (5)$

where $\alpha$ is a parameter adjusting the weight of the smoothness term.
S7, acquiring data from the 2-3 cameras whose shooting areas the target may enter and performing target re-identification on the captured video; when the target is re-identified, the first frame is taken as the template frame and S3-S6 are repeated, completing multi-camera target tracking and re-identification; if the target is not found, jump to S8;
S8, expanding the search area according to the optical flow method of S6, selecting cameras at other positions where the target may appear, and re-running S7;
S81. The scheme for selecting other cameras in S8 is as follows:
(1) Converting the target optical-flow direction obtained in S6 into the camera world coordinates obtained in S5;
(2) Obtaining the general direction of the target's motion and judging, by the K-nearest-neighbor method, the 2-3 scenes in which the target is most likely to appear;
(3) Completing re-identification and tracking in those 2-3 scenes.
S82. The K-nearest-neighbor metric is the Euclidean distance. Let $x = (x_1, x_2, x_3)$ denote the 3-D coordinates of the tracked target, where $x_1$, $x_2$, $x_3$ are its 1st-, 2nd- and 3rd-dimension coordinates, and let $C = \{c_1, \dots, c_n\}$ be the set of coordinates of the $n$ cameras to be examined, with $c_i = (c_{i1}, c_{i2}, c_{i3})$ the 3-D coordinates of the $i$-th camera. First, the distance from the current target to each camera is calculated:

$d_i = \sqrt{(x_1 - c_{i1})^2 + (x_2 - c_{i2})^2 + (x_3 - c_{i3})^2}, \quad i = 1, \dots, n$

where $d_i$ is the distance between the tracked target and the $i$-th camera. Then the $k$ smallest distances are selected ($k = 2$ or $3$, depending on the specific situation) and target re-identification is carried out on the corresponding $k$ cameras. When a target box matching the template frame is found in some frame of one of these cameras, tracking switches to that camera and the remaining $k-1$ cameras are discarded.
S9, repeating the steps S3-S7 to finish multi-camera target tracking and re-identification.
Advantageous effects
The invention aims to provide a multi-camera target tracking and re-identification algorithm. First, the features of the template frame and the detection frame, together with the deformable-convolution offsets, are extracted through a twin neural network; foreground/background classification and anchor coordinate-offset prediction are performed through an RPN network, followed by region-of-interest pooling and offset pooling; similarity measurement is performed on the candidate regions obtained from the template branch and the detection branch to obtain the predicted target boxes, and non-maximum suppression is used to screen the predicted boxes to obtain the final target position. The invention realizes cross-camera target tracking and re-identification, and can also achieve re-identification tracking on video sequences captured at lower resolution.
Drawings
Fig. 1 shows the AlexNet network architecture used in the twin network of the present invention.
Fig. 2 is a schematic diagram of a multi-camera target tracking and re-recognition algorithm according to the present invention.
Fig. 3 is a schematic diagram of the overall network structure of the target tracking and re-identification algorithm according to the present invention.
Fig. 4 is a block diagram of a computer architecture implementation in accordance with one embodiment of the present application.
Detailed Description
The application will be described in further detail with reference to the drawings and examples of embodiments. The specific embodiments of the application described are intended to be illustrative of the application and are not intended to be limiting. It should be further noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the disclosure. The appearances of the phrase "an embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
In the present application, an AlexNet network is adopted as the backbone, considering both running efficiency and ease of implementation; an exemplary offline-pretrained AlexNet structure is shown in fig. 1. As shown in fig. 1, the AlexNet network has 5 layers in total, consisting of 5 convolution layers. Taking the template-frame branch as an example: in convolution layer 1, the input is 227×227×3; a convolution with 11×11 filters and stride 4 followed by a ReLU activation yields 55×55×96, and 3×3 max pooling with stride 2 yields 27×27×96. In convolution layer 2, the input is 27×27×96; a convolution with 5×5 filters (stride 1, padding 2) followed by ReLU yields 27×27×256, and 3×3 max pooling with stride 2 yields 13×13×256. In convolution layer 3, the input is 13×13×256; a convolution with 3×3 filters and stride 1 followed by ReLU yields 13×13×384. Convolution layer 4 is the deformable convolution layer: the input is 13×13×384; a convolution with 3×3 filters and stride 1 is constructed, and an offset $\Delta p_k$ is added to the regular grid $R$ so that the sampling position $p$ changes:

$y(p) = \sum_{k=1}^{K} w(p_k)\, x(p + p_k + \Delta p_k) \qquad (1)$

where $K = |R|$, $p$ denotes the pixel position, $w$ is the weight, $x$ is the input feature map, and $\Delta p_k$ is the learned offset of grid $R$; the resulting data is 13×13×384. In convolution layer 5, the input is 13×13×384; a convolution with 3×3 filters and stride 1 followed by ReLU yields 13×13×256, and 3×3 max pooling with stride 2 yields 6×6×256. The detection-frame branch is identical, differing only in its input size.
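Equation (1) can be made concrete with a small sketch: single-channel deformable sampling with bilinear interpolation (a simplified stand-in for the patent's layer-4 deformable convolution, not its actual implementation). With all offsets zero it reduces exactly to an ordinary valid cross-correlation, which the last lines verify:

```python
import numpy as np

def bilinear(img, y, x):
    """Bilinearly sample img at fractional position (y, x); zero outside."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                val += wy * wx * img[yy, xx]
    return val

def deformable_conv2d(x, weight, offsets):
    """y(p) = sum_k w(p_k) * x(p + p_k + delta_p_k) over the regular grid R.
    offsets has shape (out_h, out_w, K, 2), one (dy, dx) per grid point."""
    kh, kw = weight.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    grid = [(i, j) for i in range(kh) for j in range(kw)]  # the grid R
    y = np.zeros((out_h, out_w))
    for p0 in range(out_h):
        for p1 in range(out_w):
            for k, (gi, gj) in enumerate(grid):
                dy, dx = offsets[p0, p1, k]
                y[p0, p1] += weight[gi, gj] * bilinear(
                    x, p0 + gi + dy, p1 + gj + dx)
    return y

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
w = rng.standard_normal((3, 3))
zero_off = np.zeros((6, 6, 9, 2))
y0 = deformable_conv2d(img, w, zero_off)

# With zero offsets the result equals an ordinary valid cross-correlation.
ref = np.array([[np.sum(img[i:i+3, j:j+3] * w) for j in range(6)]
                for i in range(6)])
print(np.allclose(y0, ref))  # True
```

In the learned layer, `offsets` would come from a separate convolution over the same input, letting the kernel's sampling grid bend toward the target's deformation.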
As shown in fig. 2, the multi-camera target tracking and re-recognition algorithm according to one embodiment of the present application includes:
S1, arranging a plurality of cameras whose imaging areas do not overlap;
S2, each camera records the monitorable area in its current environment and acquires a data set;
S3, each camera tracks a target of interest in its monitoring area;
S4, transforming the camera positions and coverage-area coordinates into a unified world coordinate system;
S5, transforming the target motion trajectory into the unified world coordinate system;
S6, predicting, by an optical flow method and from the target motion trajectory, the camera whose monitoring area the target is likely to enter;
S7, acquiring data from the 2-3 cameras whose shooting areas the target may enter and performing target re-identification on the captured video; when the target is re-identified, the first frame is taken as the template frame and S3-S6 are repeated, completing multi-camera target tracking and re-identification; if the target is not found, jump to S8;
S8, expanding the search area according to the optical flow method of S6, selecting cameras at other positions where the target may appear, and re-running S7;
S9, repeating steps S3-S7 to complete multi-camera target tracking and re-identification.
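Taken together, S1-S9 amount to the control loop sketched below; every function here is a hypothetical stub standing in for one step of the pipeline (none of these names or return values come from the patent):

```python
# Hypothetical stubs: each stands in for one step of the patent's pipeline.
def track_in_camera(camera, template):           # S3: single-camera tracking
    return {"trajectory": [(0.0, 0.0), (1.0, 0.0)], "lost": camera != "cam2"}

def to_world(trajectory):                        # S4/S5: coordinate transform
    return trajectory

def predict_exit_direction(world_trajectory):    # S6: optical-flow prediction
    return (1.0, 0.0)

def nearest_cameras(direction, k):               # S8: K-nearest-neighbor choice
    return ["cam2", "cam3", "cam4"][:k]

def reidentify(camera, template):                # S7: re-identification
    return camera == "cam2"    # pretend the target reappears in cam2

def multi_camera_track(start_camera, template, max_hops=10):
    """Follow the target across cameras until it is tracked without loss."""
    camera = start_camera
    for _ in range(max_hops):
        result = track_in_camera(camera, template)        # S3
        if not result["lost"]:
            return camera                                 # still in view
        direction = predict_exit_direction(
            to_world(result["trajectory"]))               # S4-S6
        # S7: try the 2-3 most plausible cameras first ...
        found = next((c for c in nearest_cameras(direction, k=2)
                      if reidentify(c, template)), None)
        if found is None:
            # ... S8: widen the search if the target was not found
            found = next((c for c in nearest_cameras(direction, k=3)
                          if reidentify(c, template)), None)
        if found is not None:
            camera = found                                # S9: resume tracking
    return camera

print(multi_camera_track("cam1", template="first-frame box"))  # → cam2
```

With these stubs the loop hops from cam1 to cam2 and settles there, mirroring the S7-S9 hand-off described above.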
Specific details of S4, S6, S8 are described below:
Specifically, the transformation in S4 into the unified world coordinate system is performed as follows:

$P_c = R\, P_w + T \qquad (2)$

where $P_c$ is the camera-coordinate-system coordinate, $P_w$ is the world-coordinate-system coordinate, $R$ is the rotation matrix, and $T$ is the translation vector. In particular, the pedestrian motion area can be treated as a plane, so the z-axis data can be omitted to simplify calculation.
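A minimal NumPy sketch of the rigid transform in equation (2) and the planar simplification; the rotation angle and translation here are illustrative values, not calibration data from the patent:

```python
import numpy as np

def camera_to_world(p_cam, R, T):
    """Invert P_c = R @ P_w + T to map a camera-frame point to world frame."""
    return np.linalg.inv(R) @ (p_cam - T)

theta = np.deg2rad(90)  # example: camera yawed 90 degrees about the z axis
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
T = np.array([2.0, 1.0, 0.0])    # example camera translation

p_world = np.array([3.0, 4.0, 0.0])
p_cam = R @ p_world + T          # forward transform P_c = R P_w + T
recovered = camera_to_world(p_cam, R, T)
print(np.allclose(recovered, p_world))  # round-trips exactly

# Planar simplification: pedestrians move on the ground plane, so the
# z coordinate can be dropped after transforming into world coordinates.
ground_xy = recovered[:2]
```

In practice $R$ and $T$ per camera would come from extrinsic calibration; the inversion above then places every camera's tracks in the shared world frame.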
Specifically, the optical flow method of S6 calculates the offset of each pixel between adjacent frames over the whole image, forming an optical-flow displacement field that represents the pedestrian's direction of motion. It is assumed that the video exhibits brightness constancy, temporal continuity, and spatial coherence during shooting. An optical-flow vector is obtained for each pixel in the region of interest in each frame, and the position change of the target relative to the camera is derived from these vectors:

$I(x, y, t) = I(x + dx,\; y + dy,\; t + dt) \qquad (3)$

where $I(x, y, t)$ is the brightness of the pixel at space-time coordinates $(x, y, t)$ and $I(x+dx, y+dy, t+dt)$ is its brightness after moving; $x, y$ are image coordinates, $t$ is time, $dx, dy$ are the displacements, and $dt$ is the elapsed time. Expanding the right-hand side as a first-order Taylor series and eliminating $I(x, y, t)$ yields

$\frac{\partial I}{\partial x}\, dx + \frac{\partial I}{\partial y}\, dy + \frac{\partial I}{\partial t}\, dt = 0 \qquad (4)$

Dividing both sides by $dt$ gives

$I_x u + I_y v + I_t = 0 \qquad (5)$

where $u = dx/dt$ and $v = dy/dt$ are the components of the optical flow and $I_x, I_y, I_t$ are the partial derivatives of the brightness with respect to $x$, $y$ and $t$. The optical-flow energy minimization function $E(u, v)$ is

$E(u, v) = \iint \left[ (I_x u + I_y v + I_t)^2 + \alpha \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \right] dx\, dy \qquad (6)$

where $\alpha$ is a parameter adjusting the weight of the smoothness term.
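The energy function (6) is the classical Horn-Schunck functional and can be minimized with the standard iterative scheme; the sketch below (an illustration, not the patent's implementation) recovers the flow of a blob shifted one pixel to the right:

```python
import numpy as np

def horn_schunck(im1, im2, alpha=1.0, n_iter=100):
    """Minimize E = sum (Ix*u + Iy*v + It)^2 + alpha*(|grad u|^2 + |grad v|^2)
    by the classical Horn-Schunck fixed-point iteration."""
    Ix = np.gradient(im1, axis=1)
    Iy = np.gradient(im1, axis=0)
    It = im2 - im1
    u = np.zeros_like(im1)
    v = np.zeros_like(im1)

    def neighbor_avg(f):  # 4-neighbor average used by the smoothness term
        return (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                np.roll(f, 1, 1) + np.roll(f, -1, 1)) / 4.0

    for _ in range(n_iter):
        u_bar, v_bar = neighbor_avg(u), neighbor_avg(v)
        num = Ix * u_bar + Iy * v_bar + It
        den = alpha**2 + Ix**2 + Iy**2
        u = u_bar - Ix * num / den
        v = v_bar - Iy * num / den
    return u, v

# A bright Gaussian blob shifted one pixel to the right between frames.
yy, xx = np.mgrid[0:32, 0:32]
im1 = np.exp(-((xx - 15.0)**2 + (yy - 16.0)**2) / 20.0)
im2 = np.exp(-((xx - 16.0)**2 + (yy - 16.0)**2) / 20.0)

u, v = horn_schunck(im1, im2, alpha=0.5, n_iter=200)
# Flow inside the blob should point right (u > 0), i.e. the direction in
# which the pedestrian — here the blob — is leaving the camera's view.
center_u = u[14:19, 13:19].mean()
print(center_u > 0)
```

The dominant direction of the flow field over the target region is what S6 uses to pick the cameras the target is about to enter.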
Specifically, the scheme for selecting other cameras in S8 is as follows:
(1) Converting the target optical flow direction obtained in the step S6 into world coordinates of a camera obtained in the step S5;
(2) Obtaining the general direction of the target's motion and judging, by the K-nearest-neighbor method, the 2-3 scenes in which the target is most likely to appear. The K-nearest-neighbor metric is the Euclidean distance. Let $x = (x_1, x_2, x_3)$ denote the 3-D coordinates of the target tracked in the current frame, where $x_1$, $x_2$, $x_3$ are its 1st-, 2nd- and 3rd-dimension coordinates, and let $C = \{c_1, \dots, c_n\}$ be the coordinates of the $n$ cameras, with $c_i = (c_{i1}, c_{i2}, c_{i3})$ the 3-D coordinates of the $i$-th camera. Then

$d_i = \sqrt{(x_1 - c_{i1})^2 + (x_2 - c_{i2})^2 + (x_3 - c_{i3})^2}, \quad i = 1, \dots, n \qquad (7)$

where $d_i$ is the distance between the tracked target and the $i$-th camera and $n$ is the total number of cameras. First, the distance from the current target to each camera is calculated; then the $k$ smallest distances are selected ($k = 2$ or $3$, depending on the specific situation) and target re-identification is carried out on the corresponding $k$ cameras. When a target box matching the template frame is found in some frame of one of these cameras, tracking switches to that camera and the remaining $k-1$ cameras are discarded;
(3) Re-identification and tracking are completed in those 2-3 scenes.
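The camera-selection step can be sketched as follows (the target and camera world coordinates are made-up example values; `math.dist` computes the Euclidean distance of equation (7)):

```python
import math

def nearest_cameras(target, cameras, k=2):
    """Return indices of the k cameras closest (Euclidean distance,
    equation (7)) to the tracked target's world coordinates."""
    dists = [(math.dist(target, cam), idx) for idx, cam in enumerate(cameras)]
    dists.sort()
    return [idx for _, idx in dists[:k]]

# Hypothetical world coordinates (metres) of the target and four cameras.
target = (10.0, 5.0, 0.0)
cameras = [(0.0, 0.0, 3.0),    # camera 0
           (12.0, 6.0, 3.0),   # camera 1
           (30.0, 5.0, 3.0),   # camera 2
           (9.0, 20.0, 3.0)]   # camera 3

print(nearest_cameras(target, cameras, k=2))  # the two most plausible cameras
```

Re-identification then runs only on these k cameras; as soon as one of them yields a box matching the template frame, the others are dropped.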
The algorithm network structure is shown in fig. 3. The algorithm consists of two parts: a twin neural network and a candidate region network. The twin network receives two inputs: the upper input, called the template frame, is the manually annotated box of the target in the first frame of the video; the lower input, called the detection frame, is any frame of the video segment other than the first. The twin network maps the two images into a 6×6×256 feature map and a 22×22×256 feature map, respectively. The candidate region network likewise consists of two branches, a classification branch for foreground/background discrimination and a regression branch for adjusting the position of the prior boxes; each branch receives the two outputs of the preceding twin neural network. The specific flow of the algorithm is as follows:
(1) The network consists of an offline-pretrained AlexNet; the AlexNet network model has five layers in total, and each convolution layer contains a ReLU excitation function;
(2) The fourth convolution layer is a deformable convolution layer: it takes the feature map produced by the preceding convolution layer as input, learns offsets for it, and applies them to the convolution kernel to achieve deformable convolution, adding the offsets $\Delta p_k$ to the regular grid $R$: $y(p) = \sum_{k=1}^{K} w(p_k)\, x(p + p_k + \Delta p_k)$, where $K = |R|$, $p$ denotes the pixel location, $w$ is the weight, $x$ is the input feature map, and $\Delta p_k$ is the learned offset of grid $R$;
(3) The initial frame of the video sequence is the template frame and the current frame is the detection frame; both are fed into the twin neural network to obtain their feature maps. The template frame input size is 127×127×3, giving a 6×6×256 feature map; the detection frame input size is 256×256×3, giving a 22×22×256 feature map;
(4) Inputting the feature map into a candidate area network to generate a candidate area, wherein the candidate area network consists of two parts, one part is a classification branch for distinguishing a foreground from a background, and the other part is a regression branch for fine tuning the candidate area;
(5) For the classification branch, the candidate region network receives the template-frame and detection-frame feature maps generated above and applies a further convolution to each, reducing the feature maps while producing template-frame features and detection-frame features; the template-frame features are then used as a convolution kernel correlated over the detection-frame features. The output feature map contains 2k channels, representing the foreground and background scores of the k anchors, and a response map is generated through region-of-interest pooling and offset pooling;
(6) For the regression branch, the same operation is performed to obtain a position regression value for each sample, comprising the dx, dy, dw and dh values; that is, the output feature map contains 4k channels, representing the coordinate-offset predictions of the k anchors;
(7) To determine the tracking position, similarity measurement is performed between the candidate boxes of the template branch and those of the detection branch, yielding the bounding box of the tracking result;
(8) The final predicted bounding boxes are screened using non-maximum suppression (NMS) to obtain the final tracked target bounding box;
(9) Non-maximum suppression retains the optimal candidate box by computing the intersection-over-union between boxes.
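Steps (8)-(9) — screening the predicted boxes by non-maximum suppression with the intersection-over-union criterion — can be sketched as:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it above the
    threshold, and repeat; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the near-duplicate of box 0 is suppressed
```

The surviving highest-scoring box is the final tracked target bounding box of step (8).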
Fig. 4 illustrates a multi-camera target tracking and re-recognition system in accordance with another aspect of the present application. Referring now to fig. 4, a schematic diagram of an electronic system 400 suitable for use in implementing embodiments of the present disclosure is shown. The electronic system shown in fig. 4 is merely an example, and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 4, the electronic system 400 may include a processing device (e.g., central processing unit, graphics processor, etc.) 401 that may perform various suitable actions and processes in accordance with programs stored in a Read Only Memory (ROM) 402 or loaded from a storage device 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic system 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
In general, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the electronic system 400 to communicate wirelessly or by wire with other devices to exchange data. While fig. 4 shows an electronic system 400 having various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 4 may represent one device or a plurality of devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communications device 409, or from storage 408, or from ROM 402. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 401.
It should be noted that, the computer readable medium according to the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Whereas in embodiments of the present disclosure, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic system (also referred to herein as a "deformation target tracking system"), or may exist separately without being assembled into the electronic system. The computer readable medium carries one or more programs which, when executed by the electronic system, cause the electronic system to:
1) Recording data from the current cameras and acquiring a data set;
2) Each camera tracking a target of interest in its monitoring area using the tracking algorithm;
3) Transforming the camera positions and the position coordinates of their coverage areas into a unified world coordinate system;
4) Transforming the target motion trajectory into the unified world coordinate system;
5) Predicting, by the optical flow method and from the target motion trajectory, the camera whose monitoring area the target is likely to enter;
6) Acquiring data from the 2-3 cameras whose shooting areas the target may enter, performing target re-identification on the captured video, and, once the target is re-identified, taking the first frame as the template frame and repeating steps 3)-6), thereby completing multi-camera target tracking and re-identification.
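The hand-off loop in the steps above can be sketched as a small toy program; `Camera`, `to_world`, and `predict_next_camera` below are hypothetical stand-ins (straight-line extrapolation in place of the optical-flow prediction, a stored trajectory in place of the tracker), not the patented implementation:

```python
from dataclasses import dataclass, field

# Toy sketch of the multi-camera hand-off: track in one camera, unify
# coordinates, predict which camera the target will enter next.

@dataclass
class Camera:
    cam_id: int
    offset: tuple                                # camera position in world coordinates
    frames: list = field(default_factory=list)   # per-frame target positions (camera-local)

    def track(self):
        # stand-in for the single-camera tracking of step 2)
        return self.frames

def to_world(cam, traj):
    # steps 3)-4): shift camera-local coordinates into the world frame
    ox, oy = cam.offset
    return [(x + ox, y + oy) for x, y in traj]

def predict_next_camera(track, cameras):
    # stand-in for step 5): extrapolate the last displacement and pick the
    # camera whose position is closest to the predicted world position
    (x0, y0), (x1, y1) = track[-2], track[-1]
    px, py = 2 * x1 - x0, 2 * y1 - y0
    return min(cameras,
               key=lambda c: (c.offset[0] - px) ** 2 + (c.offset[1] - py) ** 2)

cams = [Camera(0, (0, 0), [(1, 1), (3, 1)]), Camera(1, (8, 2)), Camera(2, (0, 9))]
track = to_world(cams[0], cams[0].track())
nxt = predict_next_camera(track, cams[1:])   # camera most likely to see the target next
```

Re-identification (step 6) would then run only on `nxt` and its nearest neighbours rather than on every camera.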
The algorithms and details of the deformation target tracking method according to the first aspect of the present application apply equally to the target tracking and re-identification system 400 described above, so a substantial portion of their description is omitted here for brevity.
The foregoing description covers only the preferred embodiments of the present disclosure and the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the technical features above, and also encompasses other technical solutions formed by any combination of those features or their equivalents without departing from the spirit of the invention, for example solutions in which the above features are interchanged with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.
Claims (5)
1. A camera target tracking and re-identification algorithm, characterized by comprising the following specific steps:
S1, arranging a plurality of cameras whose imaging areas do not overlap;
S2, each camera recording its monitorable area in the current environment and acquiring a data set;
S3, each camera tracking a target of interest in its monitoring area;
S4, transforming the camera positions and the position coordinates of their coverage areas into a unified world coordinate system;
S5, transforming the target motion trajectory into the unified world coordinate system;
S6, predicting, by the optical flow method and from the target motion trajectory, the camera whose monitoring area the target is likely to enter;
S7, acquiring data from the 2-3 cameras whose shooting areas the target may enter and performing target re-identification on the captured video; when the target is re-identified, taking the first frame as the template frame and repeating S3-S6 to complete multi-camera target tracking and re-identification; if the target is not found, jumping to S8;
S8, expanding the search area according to the optical flow method of S6, selecting other cameras where the target may appear, and performing S7 again;
S9, repeating S3-S7 to complete multi-camera target tracking and re-identification.
2. The camera target tracking and re-identification algorithm according to claim 1, wherein, for step S3, the network structure is as follows:
(1) The network consists of an offline pre-trained AlexNet; the AlexNet network model is divided into five layers in total, and each convolution layer is followed by a ReLU excitation function;
(2) The fourth convolution layer is a deformable convolution layer: it takes the feature map obtained from the previous convolution layer as input, learns an offset field from it, and applies the offsets to the convolution kernel to achieve deformable convolution, augmenting the regular sampling grid R according to y(p0) = Σ_{pn ∈ R} w(pn) · x(p0 + pn + Δpn), where K = |R|, p0 denotes the pixel position, w is the weight, x is the template frame feature map, and Δpn is the learned offset for the n-th position of grid R;
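The deformable convolution of (2) can be illustrated with a minimal NumPy sketch of y(p0) = Σ_{pn∈R} w(pn)·x(p0 + pn + Δpn); bilinear interpolation handles the fractional sampling positions, and the offsets here are supplied by hand for illustration rather than learned:

```python
import numpy as np

# One deformable-convolution output value: a 3x3 grid R around p0, with
# each tap shifted by its offset and sampled by bilinear interpolation.

def bilinear(x, py, px):
    """Sample 2-D map x at fractional (py, px) with zero padding."""
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    dy, dx = py - y0, px - x0
    val = 0.0
    for i, wy in ((y0, 1 - dy), (y0 + 1, dy)):
        for j, wx in ((x0, 1 - dx), (x0 + 1, dx)):
            if 0 <= i < x.shape[0] and 0 <= j < x.shape[1]:
                val += wy * wx * x[i, j]
    return val

def deform_conv_at(x, w, offsets, p0):
    """y(p0) = sum over pn in R of w(pn) * x(p0 + pn + delta_pn)."""
    grid = [(r, c) for r in (-1, 0, 1) for c in (-1, 0, 1)]  # regular grid R, K = |R| = 9
    return sum(w[k] * bilinear(x, p0[0] + pn[0] + offsets[k][0],
                               p0[1] + pn[1] + offsets[k][1])
               for k, pn in enumerate(grid))

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.full(9, 1.0 / 9)                             # averaging kernel
y = deform_conv_at(x, w, np.zeros((9, 2)), (2, 2))  # zero offsets: plain 3x3 mean
y2 = deform_conv_at(x, w, np.tile([0.0, 1.0], (9, 1)), (2, 2))  # all taps shifted right
```

With zero offsets the operation reduces to an ordinary convolution tap; non-zero offsets let the kernel deform to follow the target's shape.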
(3) The initial frame of the video sequence is the template frame and the current frame is the detection frame; the two are respectively input to the twin neural network to obtain the feature maps of the template frame and the detection frame, where the input size of the template frame is 127×127×3 and the obtained feature map size is 6×6×256, and the input size of the detection frame is 256×256×3 and the obtained feature map size is 6×6×256;
(4) The feature maps are input into the candidate region network (region proposal network) to generate candidate regions; this network consists of two parts: a classification branch that distinguishes foreground from background, and a regression branch that fine-tunes the candidate regions;
(5) For the classification branch, the candidate region network receives the template frame and detection frame feature maps generated in (3), performs a convolution operation on each with a new convolution kernel, reducing the feature maps while obtaining template frame features and detection frame features; the template frame features are then used as the convolution kernel to convolve the detection frame features, and the output feature map contains 2k channels, representing the foreground and background scores of the k anchors respectively; a response map is generated through region-of-interest pooling and offset pooling;
(6) For the regression branch, the same operations are carried out to obtain a position regression value for each sample, consisting of the values dx, dy, dw, and dh; that is, the output feature map contains 4k channels, representing the coordinate offset predictions of the k anchors respectively;
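A minimal sketch of how the (dx, dy, dw, dh) values in (6) can be applied to an anchor, assuming the usual region-proposal-network parameterization (the claim does not spell out the exact encoding):

```python
import math

# Decode a regression output onto an anchor box: centre offsets are scaled
# by the anchor size, and width/height corrections are log-scale.

def decode(anchor, reg):
    xa, ya, wa, ha = anchor           # anchor centre (xa, ya) and size (wa, ha)
    dx, dy, dw, dh = reg              # predicted regression values
    return (xa + dx * wa,             # x = xa + dx * wa
            ya + dy * ha,             # y = ya + dy * ha
            wa * math.exp(dw),        # w = wa * exp(dw)
            ha * math.exp(dh))        # h = ha * exp(dh)

box = decode((100.0, 50.0, 40.0, 20.0), (0.1, -0.5, 0.0, math.log(2.0)))
```

The log-scale w/h terms keep predicted sizes positive regardless of the regression value's sign.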
(7) To determine the tracking position, similarity measurement is performed between the candidate boxes of the template branch and the candidate boxes of the detection branch to obtain the bounding box of the tracking result;
(8) Non-maximum suppression (NMS) is used to screen the bounding boxes of the final predicted output and obtain the final tracked target bounding box;
(9) Non-maximum suppression means retaining the optimal candidate box by computing the intersection-over-union (IoU) between candidate boxes.
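The screening in (8)-(9) can be sketched as follows: keep the highest-scoring box, discard boxes whose intersection-over-union with it exceeds a threshold, and repeat on the remainder (boxes are (x1, y1, x2, y2) corners):

```python
# Intersection-over-union of two axis-aligned boxes.
def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Greedy non-maximum suppression: returns indices of the kept boxes.
def nms(boxes, scores, thresh=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # the second box overlaps the first and is dropped
```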
3. The camera target tracking and re-identification algorithm according to claim 1, wherein, for step S4, the transformation into the unified world coordinate system is specifically

P_c = R · P_w + T,    (1)

where P_c is the coordinate in the camera coordinate system, P_w is the coordinate in the world coordinate system, R is the rotation matrix, and T is the translation vector.
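A minimal NumPy illustration of equation (1), assuming the standard extrinsic form P_c = R·P_w + T; since rotation matrices satisfy R⁻¹ = Rᵀ, world coordinates are recovered as P_w = Rᵀ·(P_c − T):

```python
import numpy as np

# Example extrinsics: a 90-degree rotation about the z axis plus a translation.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
T = np.array([1.0, 2.0, 0.0])

def world_to_camera(p_w):
    return R @ p_w + T           # equation (1)

def camera_to_world(p_c):
    return R.T @ (p_c - T)       # inverse transform, using R^-1 = R^T

p_w = np.array([3.0, 4.0, 5.0])
p_c = world_to_camera(p_w)       # -> array([-3., 5., 5.])
back = camera_to_world(p_c)      # round-trips to p_w
```

In the multi-camera setting each camera has its own (R, T) pair, so per-camera trajectories all land in the same world frame.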
4. The camera target tracking and re-identification algorithm according to claim 1, wherein, for step S6, the optical flow method obtains the optical flow vector of each pixel in the region of interest in each frame of image, and the position change of the target to be detected relative to the camera is obtained from the resulting optical flow vectors;

I(x, y, t) = I(x + dx, y + dy, t + dt),    (2)

where I(x, y, t) is the brightness of the pixel at space-time coordinates (x, y, t), I(x + dx, y + dy, t + dt) is the brightness after the pixel has moved, x and y are camera coordinates, t is time, dx and dy are the displacements, and dt is the elapsed time.
5. Expanding the right-hand side of equation (2) by Taylor's formula and eliminating I(x, y, t) yields the equation

I_x dx + I_y dy + I_t dt = 0,    (3)

where I_x, I_y, and I_t are the partial derivatives of the brightness with respect to x, y, and t; dividing both sides by dt gives

I_x u + I_y v + I_t = 0,    (4)
where u = dx/dt and v = dy/dt are the horizontal and vertical components of the optical flow respectively; the optical flow energy minimization function E(u, v) is specifically

E(u, v) = ∬ [ (I_x u + I_y v + I_t)² + α² ( |∇u|² + |∇v|² ) ] dx dy,    (5)

where α is a parameter for adjusting the weight of the smoothness term.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211449912.8A CN118057479A (en) | 2022-11-19 | 2022-11-19 | Multi-camera target tracking and re-identification algorithm and system in public places |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118057479A true CN118057479A (en) | 2024-05-21 |
Family
ID=91068786
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211449912.8A Pending CN118057479A (en) | 2022-11-19 | 2022-11-19 | Multi-camera target tracking and re-identification algorithm and system in public places |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118057479A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||