CN115760923A - Passive non-vision field target real-time positioning and tracking method and system - Google Patents

Passive non-vision field target real-time positioning and tracking method and system

Info

Publication number
CN115760923A
Authority
CN
China
Prior art keywords
frame
vector
image feature
position coding
real
Prior art date
Legal status
Granted
Application number
CN202211570111.7A
Other languages
Chinese (zh)
Other versions
CN115760923B (en)
Inventor
李学龙
赵斌
王奕豪
Current Assignee
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202211570111.7A
Publication of CN115760923A
Application granted
Publication of CN115760923B
Legal status: Active
Anticipated expiration

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a passive non-visual field target real-time positioning and tracking method and system, wherein the method comprises the following steps: acquiring, in real time with a camera unit, a real-time video stream that is reflected by a relay medium and contains the action track of a non-visual field target; initializing a position coding vector and setting it as an all-zero vector; inputting frames of the real-time video stream into a tracking unit frame by frame in real time to perform a tracking operation, obtaining the image feature vector contained in each frame one by one, updating the position coding vector with each image feature vector as it is obtained, and inputting the position coding vector into a decoder after each update; and decoding each position coding vector as it is received by the decoder to obtain the real-time coordinate information corresponding to it. The invention adopts a purely passive scheme, which reduces layout cost and solves the difficulties of deployment and application caused by the high cost and demanding experimental conditions of active methods in the non-visual field tracking problem.

Description

Passive non-vision field target real-time positioning and tracking method and system
Technical Field
The invention relates to the technical field of electronics, in particular to a passive non-visual field target real-time positioning and tracking method and system.
Background
The field of non-visual field imaging focuses on imaging, sensing and detecting invisible areas. In a typical setting, the invisible area is an area separated from the detector by a wall: an optical signal in that area cannot propagate directly to the detector, but can reach it through reflection off a relay wall, so the invisible area can also be called an area outside the direct line of sight. Mainstream non-visual field imaging technology has so far focused on three-dimensional scene reconstruction of the invisible area using actively emitted signals (such as ultrafast pulsed laser or sound waves) and information such as the time of flight of the returned signal.
For non-visual field tracking tasks, most of the prior art uses active schemes, but their deployment and application are limited by high cost and demanding experimental conditions; a few technologies use passive schemes and convert the tracking task into a position regression task by means of deep neural networks, but the results are often unsatisfactory. In addition, most existing methods do not exploit the information generated by object motion or prior knowledge of motion continuity, so the tracking accuracy is not ideal and the stability is poor.
Disclosure of Invention
The present invention is directed to solving one of the problems set forth above.
The invention mainly aims to provide a passive non-visual field target real-time positioning and tracking method.
Another objective of the present invention is to provide a passive non-visual field target real-time positioning and tracking system.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the invention provides a passive non-visual field target real-time positioning and tracking method, which comprises the following steps: acquiring a real-time video stream which is reflected by a relay medium and contains a non-visual field target action track in real time by using a camera unit; initializing a position coding vector, and setting the position coding vector as an all-zero vector; inputting frames in the real-time video stream into a tracking unit in real time frame by frame to perform tracking operation, acquiring image feature vectors contained in each frame one by one, updating the position coding vector by using the image feature vectors after the image feature vectors contained in each frame are acquired, and inputting the position coding vector into a decoder after each update; and after the decoder receives one position coding vector every time, decoding the received position coding vector to obtain real-time coordinate information corresponding to each position coding vector.
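For illustration only, the processing loop described above can be sketched in Python (PyTorch) roughly as follows; the camera, tracker and decoder objects and the hidden dimension of 128 are assumptions introduced for this sketch, not elements of the claimed method.

```python
import torch

def track_realtime(camera, tracker, decoder, hidden_dim: int = 128):
    """Minimal sketch of the claimed loop: acquire frames, update the
    position coding vector per frame, decode it into coordinates."""
    h = torch.zeros(1, hidden_dim)     # position coding vector initialized as an all-zero vector
    for frame in camera.stream():      # real-time video stream reflected by the relay medium
        h = tracker.step(frame, h)     # tracking operation updates the position coding vector
        yield decoder(h)               # decoder outputs real-time coordinate information
```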
Another aspect of the present invention provides a passive non-visual field target real-time positioning and tracking system, including: the camera shooting unit is used for acquiring a real-time video stream which is reflected by the relay medium and contains the non-visual field target action track in real time; the device comprises an initialization unit, a position coding unit and a processing unit, wherein the initialization unit is used for initializing a position coding vector and setting the position coding vector as an all-zero vector; the tracking unit is used for receiving frames in the real-time video stream input frame by frame in real time, executing tracking operation, acquiring image feature vectors contained in each frame one by one, updating the position coding vector by using the image feature vectors after the image feature vectors contained in the frames are acquired, and inputting the position coding vector to a decoder after each update; and the decoder is used for decoding the received position coding vector after receiving one position coding vector every time to obtain the real-time coordinate information corresponding to each position coding vector.
According to the technical scheme provided by the invention, the passive non-visual field target real-time positioning and tracking method and system use only a camera unit to capture video in real time, update the position coding vector contained in each frame of the video in real time through the tracking operation, and then use a decoder to decode the position coding vector in real time into the position coordinates corresponding to that frame, thereby achieving real-time tracking. The passive non-visual field target real-time positioning and tracking method of the invention adopts a purely passive scheme, which reduces layout cost and solves the difficulties of deployment and application caused by the high cost and demanding experimental conditions of active methods in the non-visual field tracking problem. In addition, by introducing a differential frame and a specially designed propagation and calibration network, the problems of unsatisfactory tracking accuracy and poor stability caused by neglecting motion information and the motion continuity prior in non-visual field real-time tracking are solved, and the tracking accuracy and trajectory stability are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a passive non-visual field target real-time positioning and tracking method according to embodiment 1 of the present invention;
fig. 2 is a schematic view of a scenario setup provided in embodiment 1 of the present invention;
fig. 3 is a flowchart of tracking and decoding according to embodiment 1 of the present invention;
FIG. 4 is a flowchart of performing pre-heating and tracking using a propagation and calibration network according to embodiment 1 of the present invention;
fig. 5 is a flowchart of the preheating phase, the tracking phase and the decoding provided in embodiment 1 of the present invention;
fig. 6 is a schematic structural diagram of a passive non-visual field target real-time positioning and tracking system according to embodiment 1 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or quantity or location.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Example 1
The present embodiment provides a passive non-visual field target real-time positioning and tracking method, as shown in fig. 1, the method includes:
step S101, a real-time video stream which is reflected by a relay medium and contains a non-visual field target action track is obtained in real time by using an image pickup unit; specifically, the non-visual-field target generally refers to a living body (e.g., a person, an animal, etc.) or a non-living body (e.g., a vehicle, etc.) that can move freely, and is also a target for which the movement trajectory needs to be tracked in this embodiment. The relay medium may be a planar object or a non-planar object that can reflect light, such as a relay wall, a metal plate, or a plastic plate, as long as the light can be reflected. The camera unit may be a conventional consumer grade RGB camera, which should be able to capture video in real time. Fig. 2 is a schematic diagram illustrating a scene setting according to the present embodiment, where the scene includes a walking person, a general camera, a relay wall, and an obstacle. When a person walks in a room, the light of the person is isolated by the existence of the barrier and is directly acquired by the camera, so that the walking track of the person can be captured only by shooting the light reflected to the relay wall by the person in the walking process.
Step S102, initializing a position coding vector, and setting the position coding vector as an all-zero vector; specifically, a position coding vector refers to a high-dimensional vector that implicitly carries position semantic information and can be decoded by the decoder into actual position coordinates. Before entering the tracking phase, the position coding vector needs to be zeroed to prevent a non-zero initial value from affecting later calculations. Initializing the position coding vector provides a starting point from which later updates gradually turn it into a vector that truly carries position information. Step S102 may be completed before or after step S101, as long as it is completed before step S103.
Step S103, inputting frames in the real-time video stream into a tracking unit frame by frame in real time to perform tracking operation, acquiring image feature vectors contained in each frame one by one, updating position coding vectors by using the image feature vectors after each image feature vector contained in each frame is acquired, and inputting the position coding vectors into a decoder after each update; specifically, in the tracking stage, the received real-time frames need to be subjected to position conversion, and a real-time position encoding vector contained in each frame is input to a decoder for decoding so as to obtain real position coordinates.
In a specific embodiment, inputting frames of the real-time video stream into the tracking unit frame by frame in real time to perform the tracking operation, obtaining the image feature vector contained in each frame one by one, and updating the position coding vector with each image feature vector as it is obtained includes: acquiring a current frame, wherein the current frame refers to the frame of the real-time video stream that is currently input; if the current frame is not the first frame, calculating a differential frame from the current frame and the previous frame; extracting a differential-frame image feature vector from the differential frame and a current-frame image feature vector from the current frame, wherein the differential-frame image feature vector contains dynamic information and the current-frame image feature vector contains static information; using the propagation unit to compute and update the position coding vector from the differential-frame image feature vector; and using the calibration unit to compute and update the position coding vector again from the current-frame image feature vector. A minimal sketch of one such single-step update is given below.
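In this sketch the feature extractors and the propagation/calibration cells are passed in as callables; all names and shapes are illustrative assumptions rather than the patent's implementation.

```python
import torch

def tracking_step(curr_frame: torch.Tensor, prev_frame: torch.Tensor,
                  feat_dyn, feat_stat, propagate, calibrate,
                  h: torch.Tensor) -> torch.Tensor:
    """One single-step tracking update following the steps listed above."""
    d = curr_frame - prev_frame      # differential frame: carrier of dynamic (motion) information
    f_d = feat_dyn(d)                # differential-frame image feature vector
    f_c = feat_stat(curr_frame)      # current-frame image feature vector (static information)
    h = propagate(f_d, h)            # propagation unit updates the position coding vector
    h = calibrate(f_c, h)            # calibration unit updates it again
    return h                         # updated position coding vector, ready for the decoder
```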
In particular, in the tracking phase the position coding vector is updated using the current frame, the differential frame, and the propagation and calibration network. The frames of the tracking phase can be acquired in real time, i.e. a single-step tracking is performed as soon as the camera unit acquires a frame. Since the frame images acquired in the tracking phase all require decoding later, every acquired frame must be processed. When the first frame is received the difference operation cannot yet be performed, so processing starts from the second frame; however, if other stages (e.g. a preceding warm-up stage) are included before the tracking stage, the last frame received before the tracking stage can be treated as the previous frame when the first frame of the tracking stage is processed. A differential frame (Difference Frame) is the "difference image" obtained by subtracting the previous frame from each frame of the real-time video stream. The differential frame has the same data size as the current frame (Raw Frame), but the former reflects the motion information at that moment while the current frame reflects the static information at that moment. As shown in Fig. 3, during the tracking phase the T-th frame f_T (current frame) and the (T-1)-th frame f_{T-1} (previous frame) are differenced to obtain the differential frame d_{T-1}; propagation and calibration are then carried out from the image feature vectors of the differential frame and the current frame respectively, and the position coding is updated. An image feature vector is a high-dimensional vector that implicitly carries the semantic information of an image and can be extracted from a frame image with a feature extractor; the current-frame image feature vector and the differential-frame image feature vector are extracted from the current frame and the differential frame respectively. The propagation and calibration units are the basic components of the propagation and calibration network (PAC-Net) and comprise two sub-modules with the same structure but without shared weights, called the propagation unit (Propagate-Cell) and the calibration unit (Calibrate-Cell), which propagate and calibrate the position coding vector respectively. Here, "not sharing weights" means that the sub-modules are independent of each other and have different internal parameters, and can therefore perform different functions. With the propagation unit, the position coding vector can be updated using the feature vector containing dynamic information extracted from the differential frame; with the calibration unit, the position coding vector can be updated using the feature vector containing static information extracted from the current frame. A flow chart of the execution of the propagation and calibration network in the tracking phase is shown in Fig. 4.
In an optional embodiment, in the tracking operation, extracting the differential-frame image feature vector from the differential frame and extracting the current-frame image feature vector from the current frame includes: extracting the differential-frame image feature vector from the differential frame with a first residual neural network, and extracting the current-frame image feature vector from the current frame with a second residual neural network, wherein the first and second residual neural networks do not share weights. Using the propagation unit to compute and update the position coding vector from the differential-frame image feature vector includes: the propagation unit uses a first recurrent neural network to operate on the dynamic information contained in the differential frame image and the current position coding vector, and updates the position coding vector with the result. Using the calibration unit to compute and update the position coding vector again from the current-frame image feature vector includes: the calibration unit uses a second recurrent neural network to operate on the static information contained in the current frame image and the current position coding vector, and updates the position coding vector with the result, wherein the first and second recurrent neural networks do not share weights. Specifically, as shown in the execution flow of the propagation and calibration network in Fig. 4, in the feature-extraction step of the tracking phase this embodiment uses the backbone parts of two residual neural networks (ResNet-18) that do not share weights as feature extractors, for extracting the feature vectors of the differential frame and the current frame respectively; in the propagation and calibration steps, two gated recurrent units (GRU) that do not share weights are used to perform the propagation and calibration operations on the position coding vector. ResNet-18 is a convolutional neural network (CNN) and the GRU is a recurrent neural network (RNN) unit; their operation can be formally described as follows:
F = CNN(I)
h_{t+1} = RNN(h_t, F)
where I denotes the frame image from which features are extracted, F denotes the image feature vector of the frame image I, h_t denotes the position coding vector before each update, and h_{t+1} denotes the position coding vector after each update.
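As a non-authoritative sketch, the two formulas above can be realized in PyTorch roughly as follows; the hidden dimension of 128 and the use of torchvision's ResNet-18 with its classification head removed are assumptions made for this example only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PacCell(nn.Module):
    """One propagate/calibrate cell: F = CNN(I), h_{t+1} = RNN(h_t, F)."""
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()             # keep only the backbone; output is a 512-d feature vector
        self.cnn = backbone
        self.rnn = nn.GRUCell(512, hidden_dim)  # gated recurrent unit

    def forward(self, frame: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        f = self.cnn(frame)                     # F = CNN(I): image feature vector
        return self.rnn(f, h)                   # h_{t+1} = RNN(h_t, F): updated position coding vector

# Propagate-Cell and Calibrate-Cell have the same structure but do not share weights.
propagate_cell = PacCell()
calibrate_cell = PacCell()
```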
In the tracking operation of this embodiment, by introducing the differential frame and the specially designed propagation and calibration network, the differential frame serves as the carrier of the motion information that is crucial to the tracking task, so that motion information is supplemented explicitly. The propagation and calibration network alternately extracts information from the differential frame for propagation and information from the current frame for calibration, and the recurrent neural network explicitly models continuous motion; this resolves the problems of unsatisfactory tracking accuracy and poor stability caused by neglecting motion information and the motion continuity prior in non-visual field real-time tracking, and improves the tracking accuracy and trajectory stability.
Step S104, each time the decoder receives a position coding vector, it decodes the received position coding vector to obtain the real-time coordinate information corresponding to that position coding vector. Specifically, in the decoding step, a multi-layer perceptron (MLP) is used as the decoder to decode the position coding vector into position coordinates. In an optional embodiment, after the decoder completes decoding, the motion trajectory of the non-visual field target is dynamically restored from the real-time coordinate information corresponding to each position coding vector: the position coordinates corresponding to the successive frames are connected in order to form a real-time tracking track, reconstructing the trajectory of the non-visual field target in real time.
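A minimal sketch of such a decoder and of the trajectory reconstruction is given below; the layer sizes and the 2-D output are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

# MLP decoder: maps a position coding vector to planar position coordinates.
decoder = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 2),   # (x, y) real-time coordinates for the current frame
)

def decode_step(h: torch.Tensor, trajectory: list) -> list:
    """Decode one position coding vector and append the coordinates to the track."""
    xy = decoder(h)                            # real-time coordinate information
    trajectory.append(xy.squeeze(0).tolist())  # connect coordinates in order to restore the trajectory
    return trajectory
```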
The passive non-visual field target real-time positioning and tracking method provided by this embodiment only uses the camera unit to shoot a real-time video in real time, updates the position encoding vector included in each frame of image in the video in real time through tracking operation, and then uses a decoder to decode the position encoding vector into the position coordinate corresponding to the frame in real time, thereby achieving the purpose of real-time tracking. The passive non-visual field target real-time positioning and tracking method of the embodiment reduces layout cost by adopting a pure passive scheme, and solves the problems of difficult deployment and application caused by high cost and harsh experimental conditions of an active method in the non-visual field tracking problem. In a specific implementation mode, the problems of unsatisfactory tracking precision and poor stability caused by neglecting motion information and motion continuity prior in the non-visual field real-time tracking problem can be solved by introducing a differential frame and a specially designed propagation and calibration network, and the tracking precision and the track stability are improved.
The technical framework designed by the invention, which takes temporally dense high-dimensional features as input and real-time low-dimensional reconstruction as the task objective, can be applied to the non-visual field real-time tracking problem of the invention as well as to other tasks. When solving the passive non-visual field real-time tracking problem, the differential frame is used as the carrier of motion information, the backbone part of a residual neural network (ResNet-18) is used as the feature extractor, the gated recurrent unit (GRU) is used as the basic unit of the recurrent neural network, and a multi-layer perceptron (MLP) is used as the decoder; when solving other specific tasks, different motion information carriers, feature extractors, recurrent neural network basic units and decoders may be chosen. Therefore, all solutions that can be realized with the technical framework of the present invention should fall within the protection scope of the present invention.
In an alternative embodiment, a warm-up operation may be performed before the tracking operation; the warm-up operation provides an accurate current position coding vector for the tracking operation. Specifically, before inputting frames of the real-time video stream into the tracking unit frame by frame in real time, the method further comprises performing a warm-up operation, which includes: inputting the first W frames of the real-time video stream into a preheating unit frame by frame, obtaining the image feature vector contained in each of the first W frames one by one, updating the position coding vector with the image feature vector contained in each of the first W frames, and obtaining the warmed-up position coding vector, wherein the warmed-up position coding vector is the position coding vector updated last in the warm-up operation, and W is a positive integer with W ≥ 1. The warm-up operation works in a manner similar to the tracking operation, but its purpose is to provide an accurate position coding vector before tracking starts. Fig. 5 illustrates the two-stage execution flow when warm-up and tracking are employed in this embodiment. As shown in Fig. 5, the warm-up stage and the tracking stage each perform a single-step tracking on every frame of the real-time video stream, and the position coding vector is updated in each single-step tracking. The warm-up stage and the tracking stage can use the same mode of operation, but the two stages are independent of each other and do not share weights. The frames used in the warm-up operation may be acquired one by one or in real time. Frames 1 to W of the real-time video stream are used for warm-up and do not participate in the decoding of the subsequent tracking operation: because the position information represented by the position coding vector may be inaccurate at the beginning, the position coding vector is calibrated step by step in the warm-up stage and becomes gradually more accurate, until it is essentially close to the true position information. The number of frames required in the warm-up stage, i.e. the value of W, is affected by factors such as the complexity of the tracking scene and of the room environment, so the value of W differs for different tracking environments. In practice, a suitable value of W can be found in advance through training and preset for the formal application scenario; typically W may be 32 or 48, i.e. 32 or 48 frames are used for the warm-up stage. The warm-up operation provides a more accurate position coding vector for the subsequent tracking stage and improves tracking accuracy.
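The warm-up loop described above could look roughly like the following sketch; W = 32 is one of the example values, and the warm-up cells are assumed to be separate instances that do not share weights with the tracking cells.

```python
import torch

def warm_up(frames, feat_dyn, feat_stat, propagate, calibrate,
            hidden_dim: int = 128, W: int = 32) -> torch.Tensor:
    """Run single-step updates on the first W frames without decoding and
    return the warmed-up position coding vector."""
    h = torch.zeros(1, hidden_dim)         # all-zero initialization of the position coding vector
    prev = frames[0]                       # the first frame itself is not processed
    for curr in frames[1:W]:
        d = curr - prev                    # differential frame
        h = propagate(feat_dyn(d), h)      # propagate with dynamic information
        h = calibrate(feat_stat(curr), h)  # calibrate with static information
        prev = curr
    return h                               # warmed-up position coding vector handed to the tracking stage
```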
In a specific embodiment, inputting the first W frames of the real-time video stream into the preheating unit frame by frame, obtaining the image feature vector contained in each of the first W frames one by one, updating the position coding vector with the image feature vector contained in each of the first W frames, and obtaining the warmed-up position coding vector includes the following steps: acquiring a current frame, wherein the current frame refers to a frame among the first W frames of the currently input real-time video stream; if the current frame is not the first of the first W frames, calculating a differential frame from the current frame and the previous frame; extracting a differential-frame image feature vector from the differential frame and a current-frame image feature vector from the current frame, wherein the differential-frame image feature vector contains dynamic information and the current-frame image feature vector contains static information; using the propagation unit to compute and update the position coding vector from the differential-frame image feature vector; using the calibration unit to compute and update the position coding vector again from the current-frame image feature vector; and, if the current frame is the last of the first W frames, outputting the warmed-up position coding vector after the calibration unit has computed and updated the position coding vector again from the current-frame image feature vector. Specifically, in the warm-up stage the first acquired frame is not processed; processing starts from the second frame, and the warmed-up position coding vector is output after the W-th frame has been acquired and processed. The differential frame (Difference Frame) is the "difference image" obtained by subtracting the previous frame from each frame of the real-time video stream; it has the same data size as the current frame (Raw Frame), but the former reflects the motion information at that moment while the current frame reflects the static information. As shown in Fig. 5, the 2nd frame f_2 (current frame) and the 1st frame f_1 (previous frame) are differenced to obtain the differential frame d_1. The image feature vector is a high-dimensional vector that implicitly carries the semantic information of an image and can be extracted from a frame image with a feature extractor; the current-frame image feature vector and the differential-frame image feature vector are extracted from the current frame and the differential frame respectively. The propagation and calibration units are the basic components of the propagation and calibration network (PAC-Net) and comprise two sub-modules with the same structure but without shared weights, called the propagation unit (Propagate-Cell) and the calibration unit (Calibrate-Cell), which propagate and calibrate the position coding vector respectively. Here, "not sharing weights" means that the sub-modules are independent of each other and have different internal parameters, and can therefore perform different functions.
Using a propagation unit, the position encoding vector can be updated by means of the feature vector containing the dynamic information extracted from the differential frame; the use of a calibration unit makes it possible to update the position-coding vector by means of the feature vectors extracted from the current frame, which contain static information. The flow chart for the propagation and calibration network execution used in the warm-up phase is also shown in fig. 4.
In an alternative embodiment, in the warm-up operation, extracting the differential-frame image feature vector from the differential frame and extracting the current-frame image feature vector from the current frame includes: extracting the differential-frame image feature vector from the differential frame with a first residual neural network, and extracting the current-frame image feature vector from the current frame with a second residual neural network, wherein the first and second residual neural networks do not share weights. Using the propagation unit to compute and update the position coding vector from the differential-frame image feature vector includes: the propagation unit uses a first recurrent neural network to operate on the dynamic information contained in the differential frame image and the current position coding vector, and updates the position coding vector with the result. Using the calibration unit to compute and update the position coding vector again from the current-frame image feature vector includes: the calibration unit uses a second recurrent neural network to operate on the static information contained in the current frame image and the current position coding vector, and updates the position coding vector with the result, wherein the first and second recurrent neural networks do not share weights. Specifically, as shown in the execution flow of the propagation and calibration network in Fig. 4, in the feature-extraction step this embodiment uses the backbone parts of two residual neural networks (ResNet-18) that do not share weights as feature extractors, for extracting the feature vectors of the differential frame and the current frame respectively; in the propagation and calibration steps, two gated recurrent units (GRUs) that do not share weights are used to perform the propagation and calibration operations on the position coding vector. ResNet-18 is a convolutional neural network (CNN), and the GRU is a recurrent neural network (RNN) unit; their operation can be formally described as follows:
F = CNN(I)
h_{t+1} = RNN(h_t, F)
where I denotes the frame image from which features are extracted, F denotes the image feature vector of the frame image I, h_t denotes the position coding vector before each update, and h_{t+1} denotes the position coding vector after each update.
The present embodiment further provides a passive non-visual field target real-time positioning and tracking system, as shown in fig. 6, the passive non-visual field target real-time positioning and tracking system includes: an image capturing unit 601, an initialization unit 602, a tracking unit 603, and a decoder 604.
An image capturing unit 601, configured to acquire in real time a real-time video stream that is reflected by the relay medium and contains the action track of the non-visual field target; specifically, the camera unit 601 may be an ordinary consumer-grade RGB camera and should be able to capture video in real time. The non-visual field target generally refers to a living body (e.g., a person, an animal, etc.) or a non-living body (e.g., a vehicle, etc.) that can move freely, and is the target whose movement trajectory needs to be tracked in this embodiment. The relay medium may be a planar or non-planar object that can reflect light, such as a relay wall, a metal plate, or a plastic plate. Fig. 2 is a schematic diagram of the scene setting of this embodiment, in which the scene includes a walking person, an ordinary camera, a relay wall, and an obstacle. When the person walks in the room, the light from the person is blocked by the obstacle and cannot be acquired directly by the camera, so the walking track of the person can only be captured by shooting the light that the person reflects onto the relay wall while walking.
An initializing unit 602, configured to initialize the position coding vector and set it as an all-zero vector; specifically, the initialization unit 602 assigns an initial value to the position coding vector, providing a starting point from which later updates turn it into a vector that truly carries position information. A position coding vector is a high-dimensional vector that implicitly carries position semantic information and can be decoded by the decoder into actual position coordinates. Before entering the tracking phase, the position coding vector needs to be zeroed to prevent a non-zero initial value from affecting later calculations. The work of the initialization unit 602 may be completed before or after the image capturing unit 601 starts capturing, as long as it is completed before the tracking unit 603 operates.
A tracking unit 603, configured to receive frames in a real-time video stream input frame by frame in real time and perform tracking operation, obtain image feature vectors included in each frame one by one, update a position encoding vector by using the image feature vectors after each image feature vector included in a frame is obtained, and input the position encoding vector to a decoder after each update; specifically, the tracking unit 603 needs to perform position conversion on the received real-time frames, and the real-time position encoding vector included in each frame is input to the decoder 604 for decoding to obtain the real position coordinates.
In a specific embodiment, the tracking unit receiving frames of the real-time video stream input frame by frame in real time and performing the tracking operation, obtaining the image feature vector contained in each frame one by one, and updating the position coding vector with each image feature vector as it is obtained specifically includes: acquiring a current frame, wherein the current frame refers to a frame of the currently input real-time video stream; if the current frame is not the first frame, calculating a differential frame from the current frame and the previous frame; extracting a differential-frame image feature vector from the differential frame and a current-frame image feature vector from the current frame, wherein the differential-frame image feature vector contains dynamic information and the current-frame image feature vector contains static information; using the propagation unit to compute and update the position coding vector from the differential-frame image feature vector; and using the calibration unit to compute and update the position coding vector again from the current-frame image feature vector.
In particular, in the tracking phase the current frame, the differential frame, and the propagation and calibration network are used to update the position coding vector. The frames of the tracking phase can be acquired in real time, i.e. a single-step tracking is performed as soon as the camera unit acquires a frame. Since the frame images acquired in the tracking phase all require decoding later, every acquired frame must be processed. The differential frame (Difference Frame) is the "difference image" obtained by subtracting the previous frame from each frame of the real-time video stream; it has the same data size as the current frame (Raw Frame), but the former reflects the motion information at that moment while the current frame reflects the static information. When the first frame is received the difference operation cannot yet be performed, so processing generally starts from the second frame; however, if other stages (e.g. a preceding warm-up stage) are included before the tracking stage, the last frame received before the tracking stage can be treated as the previous frame. As shown in Fig. 3, during the tracking phase the T-th frame f_T (current frame) and the (T-1)-th frame f_{T-1} (previous frame) are differenced to obtain the differential frame d_{T-1}; propagation and calibration are then carried out from the image feature vectors of the differential frame and the current frame respectively, and the position coding is updated. The image feature vector is a high-dimensional vector that implicitly carries the semantic information of an image and can be extracted from the frame image with a feature extractor; the current-frame image feature vector and the differential-frame image feature vector are extracted from the current frame and the differential frame respectively. The propagation and calibration units are the basic components of the propagation and calibration network (PAC-Net) and comprise two sub-modules with the same structure but without shared weights, called the propagation unit (Propagate-Cell) and the calibration unit (Calibrate-Cell), which propagate and calibrate the position coding vector respectively. Here, "not sharing weights" means that the sub-modules are independent of each other and have different internal parameters, and can therefore perform different functions. With the propagation unit, the position coding vector can be updated using the feature vector containing dynamic information extracted from the differential frame; with the calibration unit, the position coding vector can be updated using the feature vector containing static information extracted from the current frame. A flow chart of the execution of the propagation and calibration network in the tracking phase is shown in Fig. 4.
In an optional embodiment, in the tracking operation, extracting the differential-frame image feature vector from the differential frame and extracting the current-frame image feature vector from the current frame includes: extracting the differential-frame image feature vector from the differential frame with a first residual neural network, and extracting the current-frame image feature vector from the current frame with a second residual neural network, wherein the first and second residual neural networks do not share weights. Using the propagation unit to compute and update the position coding vector from the differential-frame image feature vector includes: the propagation unit uses a first recurrent neural network to operate on the dynamic information contained in the differential frame image and the current position coding vector, and updates the position coding vector with the result. Using the calibration unit to compute and update the position coding vector again from the current-frame image feature vector includes: the calibration unit uses a second recurrent neural network to operate on the static information contained in the current frame image and the current position coding vector, and updates the position coding vector with the result, wherein the first and second recurrent neural networks do not share weights. Specifically, as shown in the execution flow of the propagation and calibration network in Fig. 4, in the feature-extraction step of the tracking phase this embodiment likewise uses the backbone parts of two residual neural networks (ResNet-18) that do not share weights as feature extractors, for extracting the feature vectors of the differential frame and the current frame respectively; in the propagation and calibration steps, two gated recurrent units (GRU) that do not share weights are likewise used to perform the propagation and calibration operations on the position coding vector. ResNet-18 is a convolutional neural network (CNN) and the GRU is a recurrent neural network (RNN) unit; their operation can be formally described as follows:
F = CNN(I)
h_{t+1} = RNN(h_t, F)
where I denotes the frame image from which features are extracted, F denotes the image feature vector of the frame image I, h_t denotes the position coding vector before each update, and h_{t+1} denotes the position coding vector after each update.
In the tracking unit 603 of this embodiment, by introducing the differential frame and the specially designed propagation and calibration network, the differential frame serves as the carrier of the motion information that is crucial to the tracking task, so that motion information is supplemented explicitly. The propagation and calibration network alternately extracts information from the differential frame for propagation and information from the current frame for calibration, and the recurrent neural network explicitly models continuous motion; this resolves the problems of unsatisfactory tracking accuracy and poor stability caused by neglecting motion information and the motion continuity prior in non-visual field real-time tracking, and improves the tracking accuracy and trajectory stability.
The decoder 604 is configured to decode each received position coding vector to obtain the real-time coordinate information corresponding to it. Specifically, the decoder 604 may use a multi-layer perceptron (MLP) to decode the position coding vectors into position coordinates. In an optional embodiment, after the decoder 604 completes decoding, the motion trajectory of the non-visual field target is dynamically restored from the real-time coordinate information corresponding to each position coding vector: the position coordinates corresponding to the successive frames are connected in order to form a real-time tracking track, reconstructing the trajectory of the non-visual field target in real time.
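Purely as an illustration of how the units of Fig. 6 might be wired together, the following sketch composes a camera unit, a tracking unit and a decoder; the per-frame update itself follows the single-step sketch given earlier, and all class and method names here are assumptions rather than the patent's implementation.

```python
import torch

class NlosTrackingSystem:
    """Sketch of the system of Fig. 6: camera unit, initialization, tracking unit, decoder."""
    def __init__(self, camera_unit, tracking_unit, decoder, hidden_dim: int = 128):
        self.camera = camera_unit     # image capturing unit 601
        self.tracker = tracking_unit  # tracking unit 603 (propagation and calibration network)
        self.decoder = decoder        # decoder 604 (MLP)
        self.hidden_dim = hidden_dim

    def run(self):
        h = torch.zeros(1, self.hidden_dim)  # initialization unit 602: all-zero position coding vector
        for frame in self.camera.stream():   # real-time video stream reflected by the relay medium
            h = self.tracker.step(frame, h)  # update the position coding vector
            yield self.decoder(h)            # real-time coordinate information per frame
```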
The passive non-visual field target real-time positioning and tracking system provided by this embodiment captures a real-time video in real time only by using the image capturing unit 601, updates a position encoding vector included in each frame of image in the video in real time by the tracking unit 603, and then decodes the position encoding vector into a position coordinate corresponding to the frame in real time by using the decoder 604, thereby achieving the purpose of real-time tracking. The passive non-visual field target real-time positioning and tracking system of the embodiment reduces layout cost by adopting a pure passive scheme, and solves the problems of difficult deployment and application caused by high cost and harsh experimental conditions of an active method in the non-visual field tracking problem. In a specific implementation manner, a differential frame and a specially designed propagation and calibration network can be introduced into the tracking unit 603, so that the problems of unsatisfactory tracking accuracy and poor stability caused by neglecting motion information and motion continuity prior in the non-visual field real-time tracking problem are solved, and the tracking accuracy and the track stability are improved.
In an optional embodiment, the passive non-visual field target real-time positioning and tracking system of this embodiment may further include a preheating unit for performing the warm-up operation, which provides an accurate current position coding vector for the tracking unit 603. Specifically, the preheating unit performing the warm-up operation specifically includes: receiving the first W frames of the real-time video stream input frame by frame, obtaining the image feature vector contained in each of the first W frames one by one, updating the position coding vector with the image feature vector contained in each of the first W frames, and obtaining the warmed-up position coding vector, wherein the warmed-up position coding vector is the position coding vector updated last in the warm-up operation, and W is a positive integer with W ≥ 1.
In particular, the warm-up operation works in a manner similar to the tracking operation, but its purpose is to provide an accurate position coding vector before tracking starts. Fig. 5 illustrates the two-stage execution flow when warm-up and tracking are employed in this embodiment. As shown in Fig. 5, the warm-up stage and the tracking stage each perform a single-step tracking on every frame of the real-time video stream, and the position coding vector is updated in each single-step tracking. The warm-up stage and the tracking stage can use the same mode of operation, but the two stages are independent of each other and do not share weights. The frames used in the warm-up operation may be acquired one by one or in real time. Frames 1 to W of the real-time video stream are used for warm-up and do not participate in the decoding of the subsequent tracking operation: because the position information represented by the position coding vector may be inaccurate at the beginning, the position coding vector is calibrated step by step in the warm-up stage and becomes gradually more accurate, until it is essentially close to the true position information. The number of frames required in the warm-up stage, i.e. the value of W, is affected by factors such as the complexity of the tracking scene and of the room environment, so the value of W differs for different tracking environments. In practice, a suitable value of W can be found in advance through training and preset for the formal application scenario; typically W may be 32 or 48, i.e. 32 or 48 frames are used for the warm-up stage. The warm-up operation provides a more accurate position coding vector for the subsequent tracking stage and improves tracking accuracy.
In a specific embodiment, receiving the first W frames in the real-time video stream input frame by frame, obtaining the image feature vector included in each of the first W frames one by one, updating the position encoding vector by using the image feature vector included in each of the first W frames, and obtaining the position encoding vector after the preheating is completed specifically includes: acquiring a current frame, wherein the current frame refers to a frame in the previous W frames in the currently input real-time video stream; if the current frame is not the first frame of the previous W frames, calculating a difference frame according to the current frame and the previous frame; extracting a differential frame image feature vector from a differential frame, and extracting a current frame image feature vector from a current frame, wherein the differential frame image feature vector contains dynamic information, and the current frame image feature vector contains static information; calculating and updating a position coding vector according to the differential frame image feature vector by using a propagation unit; calculating and updating the position coding vector again according to the characteristic vector of the current frame image by using a calibration unit; if the current frame is the last frame of the previous W frames, after the position coding vector is calculated and updated again according to the current frame image feature vector by using the calibration unit, the position coding vector after preheating is output.
Specifically, in the warm-up stage the first acquired frame is not processed; processing starts from the second frame, and the warmed-up position coding vector is output after the W-th frame has been acquired and processed. The differential frame (Difference Frame) is the "difference image" obtained by subtracting the previous frame from each frame of the real-time video stream; it has the same data size as the current frame (Raw Frame), but the former reflects the motion information at that moment while the current frame reflects the static information. As shown in Fig. 5, the 2nd frame f_2 (current frame) and the 1st frame f_1 (previous frame) are differenced to obtain the differential frame d_1. The image feature vector is a high-dimensional vector that implicitly carries the semantic information of an image and can be extracted from a frame image with a feature extractor; the current-frame image feature vector and the differential-frame image feature vector are extracted from the current frame and the differential frame respectively. The propagation and calibration units are the basic components of the propagation and calibration network (PAC-Net) and comprise two sub-modules with the same structure but without shared weights, called the propagation unit (Propagate-Cell) and the calibration unit (Calibrate-Cell), which propagate and calibrate the position coding vector respectively. Here, "not sharing weights" means that the sub-modules are independent of each other and have different internal parameters, and can therefore perform different functions. With the propagation unit, the position coding vector can be updated using the feature vector containing dynamic information extracted from the differential frame; with the calibration unit, the position coding vector can be updated using the feature vector containing static information extracted from the current frame. The flow chart of the execution of the propagation and calibration network used in the warm-up stage is likewise shown in Fig. 4.
In an alternative embodiment, in the warm-up operation, extracting the differential-frame image feature vector from the differential frame and extracting the current-frame image feature vector from the current frame includes: extracting the differential-frame image feature vector from the differential frame with a first residual neural network, and extracting the current-frame image feature vector from the current frame with a second residual neural network, wherein the first and second residual neural networks do not share weights. Using the propagation unit to compute and update the position coding vector from the differential-frame image feature vector includes: the propagation unit uses a first recurrent neural network to operate on the dynamic information contained in the differential frame image and the current position coding vector, and updates the position coding vector with the result. Using the calibration unit to compute and update the position coding vector again from the current-frame image feature vector includes: the calibration unit uses a second recurrent neural network to operate on the static information contained in the current frame image and the current position coding vector, and updates the position coding vector with the result, wherein the first and second recurrent neural networks do not share weights. Specifically, as shown in the execution flow of the propagation and calibration network in Fig. 4, in the feature-extraction step this embodiment uses the backbone parts of two residual neural networks (ResNet-18) that do not share weights as feature extractors, for extracting the feature vectors of the differential frame and the current frame respectively; in the propagation and calibration steps, two gated recurrent units (GRUs) that do not share weights are used to perform the propagation and calibration operations on the position coding vector. ResNet-18 is a convolutional neural network (CNN), and the GRU is a recurrent neural network (RNN) unit; their operation can be formally described as follows:
F = CNN(I)
h_{t+1} = RNN(h_t, F)
where I denotes the frame image from which features are extracted, F denotes the image feature vector of the frame image I, h_t denotes the position coding vector before the update, and h_{t+1} denotes the position coding vector after the update.
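As a hedged sketch of the feature-extraction and propagation/calibration steps above, the following PyTorch code pairs two ResNet-18 backbones (unshared weights) with two GRU cells (unshared weights). The class name PACCell, the hidden size of 128, and the use of torchvision and nn.GRUCell are assumptions made for illustration; the patent does not disclose code or exact dimensions.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class PACCell(nn.Module):
    # One propagate-then-calibrate update of the position coding vector.
    def __init__(self, hid_dim=128):
        super().__init__()
        # Two ResNet-18 backbones with independent weights (classification head removed).
        self.diff_backbone = nn.Sequential(*list(resnet18(weights=None).children())[:-1])
        self.curr_backbone = nn.Sequential(*list(resnet18(weights=None).children())[:-1])
        # Two GRU cells with independent weights: one propagates, one calibrates.
        self.propagate = nn.GRUCell(input_size=512, hidden_size=hid_dim)
        self.calibrate = nn.GRUCell(input_size=512, hidden_size=hid_dim)

    def forward(self, diff_frame, curr_frame, h):
        # F = CNN(I): extract feature vectors from the difference frame and the current frame.
        f_diff = self.diff_backbone(diff_frame).flatten(1)   # dynamic information
        f_curr = self.curr_backbone(curr_frame).flatten(1)   # static information
        # h_{t+1} = RNN(h_t, F): propagate with dynamic features, then calibrate with static ones.
        h = self.propagate(f_diff, h)
        h = self.calibrate(f_curr, h)
        return h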
Any process or method description in the flow charts, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. Alternative implementations are included within the scope of the preferred embodiments of the present invention, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art to which the present invention pertains.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by program instructions executed by related hardware; the program may be stored in a computer-readable storage medium and, when executed, performs one of or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. The integrated module, if implemented as a software functional module and sold or used as an independent product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; those skilled in the art may make variations, modifications, substitutions and alterations to the above embodiments without departing from the spirit and scope of the present invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (10)

1. A passive non-visual field target real-time positioning and tracking method is characterized by comprising the following steps:
acquiring a real-time video stream containing a non-visual field target action track reflected by a relay medium in real time by using a camera unit;
initializing a position coding vector, and setting the position coding vector as an all-zero vector;
inputting frames in the real-time video stream into a tracking unit in real time frame by frame to perform tracking operation, acquiring image feature vectors contained in each frame one by one, updating the position coding vector by using the image feature vectors after the image feature vectors contained in each frame are acquired, and inputting the position coding vector into a decoder after each update;
and each time the decoder receives a position coding vector, decoding the received position coding vector to obtain the real-time coordinate information corresponding to that position coding vector.
2. The method according to claim 1, wherein inputting frames in the real-time video stream into a tracking unit frame by frame in real time to perform a tracking operation, acquiring the image feature vectors contained in each frame one by one, and updating the position coding vector with the image feature vectors after the image feature vectors contained in each frame are acquired comprises:
acquiring a current frame, wherein the current frame refers to a frame of the currently input real-time video stream;
if the current frame is not the first frame, calculating a difference frame according to the current frame and the previous frame;
extracting a differential frame image feature vector from the differential frame, and extracting a current frame image feature vector from the current frame, wherein the differential frame image feature vector contains dynamic information, and the current frame image feature vector contains static information;
calculating and updating the position coding vector according to the differential frame image feature vector by using a propagation unit;
and calculating and updating the position coding vector again according to the current frame image feature vector by using a calibration unit.
3. The method according to claim 1, wherein, before inputting frames of the real-time video stream into a tracking unit frame by frame in real time, the method further comprises: executing a preheating operation;
performing the preheating operation comprises:
inputting the first W frames in the real-time video stream into a preheating unit frame by frame, obtaining image feature vectors contained in each frame in the first W frames one by one, updating the position coding vector by using the image feature vectors contained in each frame in the first W frames, and obtaining the position coding vector after preheating is completed, wherein the position coding vector after preheating is the position coding vector updated last time in the preheating operation, and W is not less than 1 and is a positive integer.
4. The method according to claim 3, wherein inputting the first W frames of the real-time video stream into a preheating unit frame by frame, obtaining the image feature vectors contained in each of the first W frames one by one, updating the position coding vector by using the image feature vectors contained in each of the first W frames, and obtaining the position coding vector after preheating comprises:
acquiring a current frame, wherein the current frame refers to a frame in the first W frames in the currently input real-time video stream;
if the current frame is not the first frame of the first W frames, calculating a difference frame according to the current frame and the previous frame;
extracting a differential frame image feature vector from the differential frame, and extracting a current frame image feature vector from the current frame, wherein the differential frame image feature vector contains dynamic information, and the current frame image feature vector contains static information;
calculating and updating the position coding vector according to the differential frame image feature vector by using a propagation unit;
calculating and updating the position coding vector again according to the current frame image feature vector by using a calibration unit;
and if the current frame is the last frame of the first W frames, outputting the preheated position coding vector after calculating and updating the position coding vector again according to the current frame image feature vector by using the calibration unit.
5. The method according to claim 2 or 4,
the extracting the differential frame image feature vector from the differential frame and the extracting the current frame image feature vector from the current frame include:
extracting a difference frame image feature vector from the difference frame by adopting a first residual neural network, and extracting a current frame image feature vector from the current frame by adopting a second residual neural network, wherein the first residual neural network and the second residual neural network do not share weight;
the utilizing the propagation unit to calculate and update the position coding vector according to the differential frame image feature vector comprises:
the propagation unit utilizes a first recurrent neural network to operate on the dynamic information contained in the differential frame image feature vector and the current position coding vector, and updates the position coding vector with the operation result;
the utilizing the calibration unit to calculate and update the position encoding vector again according to the feature vector of the current frame image comprises:
the calibration unit utilizes a second recurrent neural network to operate on the static information contained in the current frame image feature vector and the current position coding vector, and updates the position coding vector with the operation result, wherein the first recurrent neural network and the second recurrent neural network do not share weights.
6. A passive non-visual field target real-time positioning and tracking system, comprising:
the camera shooting unit is used for acquiring a real-time video stream which is reflected by the relay medium and contains the non-visual field target action track in real time;
the device comprises an initialization unit, a position coding unit and a processing unit, wherein the initialization unit is used for initializing a position coding vector and setting the position coding vector as an all-zero vector;
the tracking unit is used for receiving frames in the real-time video stream input frame by frame in real time, executing tracking operation, acquiring image feature vectors contained in each frame one by one, updating the position coding vector by using the image feature vectors after the image feature vectors contained in the frames are acquired, and inputting the position coding vector to a decoder after each update;
and the decoder is used for decoding the received position coding vector after receiving one position coding vector every time to obtain the real-time coordinate information corresponding to each position coding vector.
7. The passive non-visual field target real-time positioning and tracking system according to claim 6, wherein the operation in which the tracking unit receives frames of the real-time video stream input frame by frame in real time, performs a tracking operation, acquires the image feature vectors contained in each frame one by one, and updates the position coding vector with the image feature vectors after the image feature vectors contained in each frame are acquired specifically comprises:
acquiring a current frame, wherein the current frame refers to a frame of the currently input real-time video stream;
if the current frame is not the first frame, calculating a difference frame according to the current frame and the previous frame;
extracting a differential frame image feature vector from the differential frame, and extracting a current frame image feature vector from the current frame, wherein the differential frame image feature vector contains dynamic information, and the current frame image feature vector contains static information;
calculating and updating the position coding vector according to the differential frame image feature vector by using a propagation unit;
and calculating and updating the position coding vector again according to the current frame image feature vector by using a calibration unit.
8. The passive non-visual field target real-time positioning and tracking system according to claim 6, further comprising: a preheating unit for performing a preheating operation;
the preheating unit executing the preheating operation specifically includes:
receiving the first W frames in the real-time video stream input frame by frame, obtaining the image feature vectors contained in each of the first W frames one by one, updating the position coding vectors by using the image feature vectors contained in each of the first W frames, and obtaining the position coding vectors after preheating is completed, wherein the position coding vectors after preheating are the position coding vectors updated last time in the preheating operation, and W is more than or equal to 1 and is a positive integer.
9. The passive non-visual field target real-time positioning and tracking system according to claim 8, wherein receiving the first W frames of the real-time video stream input frame by frame, obtaining the image feature vectors contained in each of the first W frames one by one, updating the position coding vector by using the image feature vectors contained in each of the first W frames, and obtaining the position coding vector after preheating is completed specifically comprises:
acquiring a current frame, wherein the current frame refers to a frame in the first W frames in the currently input real-time video stream;
if the current frame is not the first frame of the first W frames, calculating a difference frame according to the current frame and the previous frame;
extracting a differential frame image feature vector from the differential frame, and extracting a current frame image feature vector from the current frame, wherein the differential frame image feature vector contains dynamic information, and the current frame image feature vector contains static information;
calculating and updating the position coding vector according to the differential frame image feature vector by using a propagation unit;
calculating and updating the position coding vector again according to the current frame image feature vector by using a calibration unit;
and if the current frame is the last frame of the first W frames, outputting the preheated position coding vector after calculating and updating the position coding vector again according to the current frame image feature vector by using the calibration unit.
10. The system according to claim 7 or 9,
the extracting the differential frame image feature vector from the differential frame and the extracting the current frame image feature vector from the current frame include:
extracting a difference frame image feature vector from the difference frame by adopting a first residual neural network, and extracting a current frame image feature vector from the current frame by adopting a second residual neural network, wherein the first residual neural network and the second residual neural network do not share weight;
the utilizing the propagation unit to calculate and update the position coding vector according to the differential frame image feature vector comprises:
the propagation unit utilizes a first recurrent neural network to operate on the dynamic information contained in the differential frame image feature vector and the current position coding vector, and updates the position coding vector with the operation result;
the utilizing the calibration unit to calculate and update the position encoding vector again according to the feature vector of the current frame image comprises:
the calibration unit utilizes a second recurrent neural network to operate on the static information contained in the current frame image feature vector and the current position coding vector, and updates the position coding vector with the operation result, wherein the first recurrent neural network and the second recurrent neural network do not share weights.
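To make the method of claims 1-5 concrete, the sketch below chains an initial all-zero position coding vector, a warm-up over the first W frames, frame-by-frame tracking, and a decoder that maps each updated position coding vector to coordinates. The MLP decoder, W = 8, and all names are illustrative assumptions (the claims do not fix the decoder architecture or the value of W); cell stands for a PAC-style propagation-and-calibration module such as the one sketched earlier.

import torch
import torch.nn as nn

class CoordinateDecoder(nn.Module):
    # Hypothetical decoder: the claims only require mapping a position coding
    # vector to real-time coordinate information; a small MLP is one plausible choice.
    def __init__(self, hid_dim=128, coord_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hid_dim, 64), nn.ReLU(), nn.Linear(64, coord_dim))

    def forward(self, h):
        return self.net(h)

def track(frames, cell, decoder, W=8, hid_dim=128):
    # frames: iterable of preprocessed tensors of shape (1, 3, H, W).
    # Returns one coordinate estimate per frame processed after the warm-up.
    h = torch.zeros(1, hid_dim)          # position coding vector initialized to all zeros
    prev, coords = None, []
    for t, curr in enumerate(frames):
        if prev is not None:             # the very first frame is only buffered
            diff = curr - prev           # difference frame carrying dynamic information
            h = cell(diff, curr, h)      # propagate with diff, then calibrate with curr
            if t >= W:                   # frames beyond the first W are the tracking phase
                coords.append(decoder(h))    # decode to real-time coordinates
        prev = curr
    return coords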
CN202211570111.7A 2022-12-08 2022-12-08 Passive non-visual field target real-time positioning tracking method and system Active CN115760923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211570111.7A CN115760923B (en) 2022-12-08 2022-12-08 Passive non-visual field target real-time positioning tracking method and system

Publications (2)

Publication Number Publication Date
CN115760923A true CN115760923A (en) 2023-03-07
CN115760923B CN115760923B (en) 2024-05-28

Family

ID=85344495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211570111.7A Active CN115760923B (en) 2022-12-08 2022-12-08 Passive non-visual field target real-time positioning tracking method and system

Country Status (1)

Country Link
CN (1) CN115760923B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104601964A (en) * 2015-02-06 2015-05-06 武汉大学 Non-overlap vision field trans-camera indoor pedestrian target tracking method and non-overlap vision field trans-camera indoor pedestrian target tracking system
CN110516620A (en) * 2019-08-29 2019-11-29 腾讯科技(深圳)有限公司 Method for tracking target, device, storage medium and electronic equipment
CN110515037A (en) * 2019-07-05 2019-11-29 西安邮电大学 It can the united Passive Location of time-frequency multiple domain under nlos environment
CN113949882A (en) * 2021-09-17 2022-01-18 镕铭微电子(济南)有限公司 Video coding and decoding method and device based on convolutional neural network
WO2022217610A1 (en) * 2021-04-16 2022-10-20 Oppo广东移动通信有限公司 Residual coding method and device, video coding method and device, and storage medium
CN115359108A (en) * 2022-09-15 2022-11-18 上海人工智能创新中心 Depth prediction method and system based on defocusing under guidance of focal stack reconstruction
US20220417447A1 (en) * 2020-03-06 2022-12-29 Huawei Technologies Co., Ltd. Imaging Method for Non-Line-of-Sight Object and Electronic Device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yinming Dai et al.: "Structural Design of a 9.4 T Whole-Body MRI Superconducting Magnet", IEEE Transactions on Applied Superconductivity, 5 May 2012 (2012-05-05), pages 1-4 *
Jiang Yu et al.: "A human motion sequence tracking model combining deep features", Software Guide (软件导刊), no. 01, 31 December 2020 (2020-12-31), pages 95-100 *

Also Published As

Publication number Publication date
CN115760923B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
CN109887032B (en) Monocular vision SLAM-based vehicle positioning method and system
Mehta et al. XNect: Real-time multi-person 3D motion capture with a single RGB camera
Mehta et al. Xnect: Real-time multi-person 3d human pose estimation with a single rgb camera
Vo et al. Spatiotemporal bundle adjustment for dynamic 3d reconstruction
US20180082435A1 (en) Modelling a three-dimensional space
KR20140109901A (en) Object tracking and processing
US9542753B2 (en) 3D reconstruction of trajectory
CN113706699B (en) Data processing method and device, electronic equipment and computer readable storage medium
JP7227969B2 (en) Three-dimensional reconstruction method and three-dimensional reconstruction apparatus
Solin et al. PIVO: Probabilistic inertial-visual odometry for occlusion-robust navigation
CN110942006A (en) Motion gesture recognition method, motion gesture recognition apparatus, terminal device, and medium
JP2009244929A (en) Tracking processing apparatus, tracking processing method, and program
Piatkowska et al. Improved cooperative stereo matching for dynamic vision sensors with ground truth evaluation
CN112511859B (en) Video processing method, device and storage medium
CN112288816B (en) Pose optimization method, pose optimization device, storage medium and electronic equipment
CN110807833A (en) Mesh topology obtaining method and device, electronic equipment and storage medium
US7580546B2 (en) Marker-free motion capture apparatus and method for correcting tracking error
CN115908720A (en) Three-dimensional reconstruction method, device, equipment and storage medium
Spurr et al. Adversarial motion modelling helps semi-supervised hand pose estimation
CN115760923A (en) Passive non-vision field target real-time positioning and tracking method and system
US11394870B2 (en) Main subject determining apparatus, image capturing apparatus, main subject determining method, and storage medium
CN111738092A (en) Method for recovering shielded human body posture sequence based on deep learning
KR20230093191A (en) Method for recognizing joint by error type, server
CN114399648A (en) Behavior recognition method and apparatus, storage medium, and electronic device
Chen Capturing fast motion with consumer grade unsynchronized rolling-shutter cameras

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant