CN110910449B - Method and system for identifying three-dimensional position of object - Google Patents

Method and system for identifying three-dimensional position of object

Info

Publication number
CN110910449B
CN110910449B (application CN201911223409.9A)
Authority
CN
China
Prior art keywords
dimensional
neural network
parameters
dimensional position
videos
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911223409.9A
Other languages
Chinese (zh)
Other versions
CN110910449A (en)
Inventor
陈健生
薛有泽
万纬韬
张馨予
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201911223409.9A priority Critical patent/CN110910449B/en
Publication of CN110910449A publication Critical patent/CN110910449A/en
Application granted granted Critical
Publication of CN110910449B publication Critical patent/CN110910449B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention provides a method and a system for identifying the three-dimensional position of an object. The method comprises the following steps: acquiring a plurality of videos of the same object, each shot by one of a plurality of cameras; determining the two-dimensional positions of key points of the object in each of the videos; predicting the three-dimensional positions of the key points from the two-dimensional positions using a neural network; determining the projection positions of the key points on the imaging planes of the cameras from the three-dimensional positions and the parameters of the cameras; and calculating a loss function for the neural network from the difference between the projection positions and the two-dimensional positions, and optimizing the parameters of the neural network according to the loss function.

Description

Method and system for identifying three-dimensional position of object
Technical Field
The invention relates to the field of image recognition, in particular to a method and a system for recognizing a three-dimensional position of an object.
Background
Neural networks are already used to estimate three-dimensional positions from two-dimensional images of objects: existing algorithms take the two-dimensional keypoint coordinates of a single view as input and directly infer three-dimensional coordinates. Testing these existing estimation algorithms on several videos shows that their generalization capability is poor.
The poor generalization of the prior art has two main causes. First, a single view cannot provide enough three-dimensional information, so the three-dimensional structure inferred by a neural network depends on the statistical characteristics of its training data and does not transfer correctly to new scenes and different camera configurations. Second, actual usage environments differ substantially from commonly used public datasets such as Human3.6M, so a model trained on such a dataset cannot generalize to the actual application scene.
Disclosure of Invention
In view of this, the present invention provides a method of identifying a three-dimensional position of an object, comprising:
acquiring a plurality of videos respectively shot by a plurality of shooting devices on the same object;
respectively determining two-dimensional positions of key points of the object in the videos;
predicting the three-dimensional position of the key point according to the two-dimensional position by using a neural network;
determining projection positions of the key points in the imaging surfaces of the imaging devices according to the three-dimensional positions and parameters of the imaging devices;
and calculating a loss function of the neural network according to the difference between the projection position and the two-dimensional position, and optimizing parameters of the neural network according to the loss function.
Optionally, in the step of predicting the three-dimensional position of the key point according to the two-dimensional position by using a neural network, the two-dimensional position of the key point in one video is used as input data of the neural network, so that the neural network outputs the three-dimensional position.
Optionally, the plurality of videos are shot by an odd number of imaging devices that are close in height and separated by a certain horizontal interval, and the input data is taken from the video shot by the imaging device positioned in the middle in the horizontal direction.
Optionally, determining the two-dimensional positions of the keypoints of the object in the plurality of videos respectively includes:
determining the region where the object is located in the plurality of videos by utilizing a trained object detection network;
two-dimensional locations of the keypoints are determined within the region using a trained keypoint detection network, respectively.
Optionally, before acquiring the plurality of videos respectively shot by the plurality of image capturing devices on the same object, the method further includes: initializing the parameters of the neural network with training data, the training data being a plurality of videos of the same object taken by a plurality of cameras and capturing the object both moving away from and approaching the camera positions.
Optionally, the initialization is divided into two phases, in which the loss functions used are not identical.
Optionally, the loss function used in the first stage updates parameters of the neural network with a first optimization objective, where the first optimization objective is to make the depth coordinates of the three-dimensional positions of the object keypoints in the training data output by the neural network positive;
the loss function used in the second stage updates the parameters of the neural network with a second optimization objective that is to agree the projected position and the two-dimensional position of the object keypoints in the training data on the basis of the first optimization objective.
Optionally, the neural network is a long short-term memory (LSTM) network.
Optionally, the object is a human body, and the key points include a plurality of parts of the human body.
The invention also provides a system for identifying the three-dimensional position of an object, comprising:
a plurality of image pickup devices for picking up videos of the same object, respectively;
and the terminal is used for identifying the three-dimensional position of the object according to the method for identifying the three-dimensional position of the object.
The method for identifying the three-dimensional position of an object provided by the invention combines a data-driven neural network with traditional hand-crafted optimization. The neural network converts the sequence of two-dimensional keypoint coordinates into a sequence of three-dimensional coordinates, turning the optimization of the three-dimensional keypoint coordinates into an optimization of the neural network's parameters, which constrains the temporal relationships of the coordinates better than optimizing the three-dimensional coordinates directly. In addition, the three-dimensional position is estimated by optimization rather than by direct inference, which makes full use of the video shot from multiple views and imposes explicit geometric constraints on the three-dimensional keypoint coordinates; the recognition process is therefore more efficient, the recognition result more accurate, and the weak generalization capability common in the prior art is overcome.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of identifying a three-dimensional position of an object in an embodiment of the invention;
FIGS. 2 and 3 are schematic views of a system for identifying three-dimensional positions of objects in accordance with embodiments of the present invention;
fig. 4 is a schematic diagram of a process for identifying a three-dimensional position of a human body according to an embodiment of the present invention.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and fully with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
In the description of the present invention, it should be noted that orientations or positional relationships indicated by terms such as "left", "right", "vertical", "horizontal", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings; they are used merely for convenience and simplicity of description and do not indicate or imply that the apparatus or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the invention. Furthermore, terms such as "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The invention provides a method for identifying the three-dimensional position of an object, which can be executed by electronic equipment such as a computer or a server, and which comprises the following steps, as shown in fig. 1:
s1, acquiring a plurality of videos respectively shot by a plurality of shooting devices on the same object. The plurality of image pickup devices may be specifically 2, 3 or more, and in a practical use scene, the heights of the respective image pickup devices should be made substantially uniform, with a certain interval in the horizontal direction, and all oriented toward the object to be photographed.
These cameras capture a single object simultaneously, yielding video from i views.
S2, respectively determining the two-dimensional positions of the keypoints of the object in the plurality of videos. A video consists of a sequence of images (frames). Taking one frame at time t as an example, time t corresponds to i images, and the obtained two-dimensional position can be expressed as $x^t_{m,i}$, meaning the two-dimensional position of keypoint m at time t in the image of view i. There are various methods for determining the two-dimensional coordinates of a point in a two-dimensional image, and any existing technique can be used here.
S3, predicting the three-dimensional positions of the keypoints from the two-dimensional positions using the neural network. With its current (or initialized) parameters, the neural network takes the two-dimensional positions $x^t_{m,i}$ from one, several, or all of the i images at time t and outputs a three-dimensional coordinate, i.e. the three-dimensional position of keypoint m at time t, expressed as $X^t_m$.
S4, determining the projection positions of the keypoints on the imaging planes of the cameras from the three-dimensional positions and the camera parameters. The camera parameters, comprising intrinsic and extrinsic parameters, can be calibrated when the hardware environment is built. First, the intrinsic parameters of the three cameras are calibrated with black-and-white checkerboard pictures, using the camera calibration toolbox of MATLAB. The extrinsic parameters are then calibrated with COLMAP: a group of pictures shot by the cameras at the same moment is selected, and COLMAP's sparse reconstruction yields the extrinsic parameters alongside the sparse model. Once calibrated, the cameras are no longer moved, so no recalibration is needed when later videos are shot; the parameters calibrated when the environment was set up are reused.
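For illustration, the intrinsic-calibration step can be sketched with OpenCV's checkerboard calibration; the embodiment itself uses MATLAB's toolbox and COLMAP, so this is only an equivalent sketch, and the board size and file names are assumptions:

import cv2
import numpy as np

# Hypothetical sketch of intrinsic calibration from black-and-white
# checkerboard pictures (OpenCV equivalent of the MATLAB toolbox step).
pattern = (9, 6)                                   # assumed inner-corner grid
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_pts, img_pts = [], []
for fname in ["calib_01.png", "calib_02.png"]:     # illustrative file names
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

# K is the 3x3 intrinsic matrix; dist holds the distortion coefficients.
err, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)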
Projecting the three-dimensional coordinates into each view with the pre-calibrated parameters gives the projection coordinates $\hat{x}^t_{m,i}$, i.e. the projection position in view i of the three-dimensional coordinate of keypoint m at time t.
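As an illustration, a minimal pinhole-projection sketch of this step, where K, R and t are the calibrated intrinsic and extrinsic parameters of one view (function and variable names are illustrative):

import numpy as np

def project(X, K, R, t):
    # Project a 3D world point X (3,) into one view: world -> camera
    # coordinates, then homogeneous pixel coordinates, then division by depth.
    Xc = R @ X + t
    xh = K @ Xc
    return xh[:2] / xh[2]   # the 2D projection position x_hat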
S5, calculating the loss function of the neural network from the difference between the projection positions and the two-dimensional positions, and optimizing the parameters of the neural network according to the loss function. The three-dimensional coordinates of keypoint m should be consistent across the views: after projection they should coincide with the two-dimensional keypoint coordinates of each view, i.e. $\hat{x}^t_{m,i}$ should be identical (equal or substantially equal) to $x^t_{m,i}$. The optimization objective can therefore be defined on the difference between the two, giving the loss function $L = \sum_{t}\sum_{m}\sum_{i} \left\| \hat{x}^t_{m,i} - x^t_{m,i} \right\|$, which should be as small as possible.
For example, the three-dimensional positions of the keypoints at time t are estimated from the images at time t; after the loss function is calculated and the parameters of the neural network are optimized, the network uses the optimized parameters at time t+1 to estimate the three-dimensional positions of the keypoints at time t+1. The optimization objective L defined above can be minimized by gradient descent, continuously updating the parameters of the neural network until the three-dimensional keypoints estimated by the network are consistent with the two-dimensional keypoints of all views. By combining the structure of the neural network with this optimization idea, three-dimensional keypoint coordinates with cross-view consistency are obtained indirectly by optimizing the parameters of the neural network.
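A minimal PyTorch-style sketch of one such optimization step (network, optimizer, and variable names are illustrative assumptions, not the disclosed implementation):

import torch

def optimization_step(net, optimizer, x2d_input, x2d_views, cams):
    # x2d_input: 2D keypoint sequence fed to the network;
    # x2d_views[i]: (T, M, 2) observed 2D keypoints of view i;
    # cams[i]: (K, R, t) calibrated parameter tensors of view i.
    X3d = net(x2d_input)                         # (T, M, 3) predicted 3D keypoints
    loss = 0.0
    for (K, R, t), x2d in zip(cams, x2d_views):
        Xc = X3d @ R.T + t                       # world -> camera frame
        xh = Xc @ K.T                            # homogeneous pixel coordinates
        proj = xh[..., :2] / xh[..., 2:3]        # perspective division
        loss = loss + ((proj - x2d) ** 2).sum()  # reprojection error L
    optimizer.zero_grad()
    loss.backward()                              # gradients w.r.t. network parameters
    optimizer.step()                             # one gradient-descent update on L
    return loss.item()

Repeating this step until the projections agree with the observed two-dimensional keypoints of all views realizes the optimization of L described above.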
The method can determine the two-dimensional positions of the object keypoints frame by frame, estimate their three-dimensional positions frame by frame, and optimize the network frame by frame, or it can perform this processing once every several frames; times t and t+1 are therefore used only to indicate the temporal order of two moments and do not restrict them to two adjacent frames.
The method for identifying the three-dimensional position of an object provided by the invention combines a data-driven neural network with traditional hand-crafted optimization. The neural network converts the sequence of two-dimensional keypoint coordinates into a sequence of three-dimensional coordinates, turning the optimization of the three-dimensional keypoint coordinates into an optimization of the neural network's parameters, which constrains the temporal relationships of the coordinates better than optimizing the three-dimensional coordinates directly. In addition, the three-dimensional position is estimated by optimization rather than by direct inference, which makes full use of the video shot from multiple views and imposes explicit geometric constraints on the three-dimensional keypoint coordinates; the recognition process is therefore more efficient, the recognition result more accurate, and the weak generalization capability common in the prior art is overcome.
Because an image sequence (video) is input to the neural network and the keypoints have temporal relationships within the sequence, the neural network is preferably a recurrent neural network with a multi-layer LSTM (Long Short-Term Memory) structure. The LSTM addresses the long-term dependency problem of sequential input and can effectively exploit the temporal relationships of the input, making the prediction of the three-dimensional keypoint positions more efficient and accurate.
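A minimal sketch of such a multi-layer LSTM lifter (layer count, hidden size, and names are illustrative assumptions, not the disclosed architecture):

import torch.nn as nn

class LSTMLifter(nn.Module):
    # Lifts a 2D keypoint sequence (T, num_kpts*2) to a 3D sequence
    # (T, num_kpts*3), keeping the time dimension so the temporal
    # relationships of the input can be exploited.
    def __init__(self, num_kpts=17, hidden=256, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(num_kpts * 2, hidden, num_layers=layers)
        self.head = nn.Linear(hidden, num_kpts * 3)

    def forward(self, x2d):                 # x2d: (T, num_kpts*2)
        h, _ = self.lstm(x2d.unsqueeze(1))  # add a batch dimension
        return self.head(h.squeeze(1))      # (T, num_kpts*3)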
The above steps constitute both the identification process and the training process: the parameters of the neural network can be initialized randomly and then corrected by running the identification on the input video, and the identification result is output to the user only once the set convergence condition is reached.
To improve identification efficiency, the parameters of the neural network may also be initialized in a specific manner before identification, i.e. the neural network may be trained with specific training data beforehand. The training data are videos shot by the cameras of the i views, capturing the object both moving away from and approaching the cameras, and the keypoints of the object in the training data are likewise predefined.
During training, the operations of determining the two-dimensional positions, predicting the three-dimensional positions, and calculating the projection positions are performed as in steps S2 to S5 above, but the loss function differs from that of the identification process. In this embodiment the training process is divided into two stages that use different loss functions, i.e. different optimization objectives. Specifically, the loss function used in the first stage updates the parameters of the neural network with a first optimization objective, namely making the depth coordinates of the three-dimensional positions of the object keypoints output by the neural network positive, e.g. training with the loss function $Q = \sum_{t}\sum_{m} \max\left(0, \tau - z^t_m\right)$, where $z^t_m$ denotes the z-coordinate of the three-dimensional coordinate of keypoint m at time t and $\tau$ is a constant greater than 0. The parameters of the neural network are updated continuously until Q = 0, which ends the first stage of training.
The loss function used in the second stage updates the parameters of the neural network with a second optimization objective, namely making the projection positions and the two-dimensional positions of the object keypoints in the training data agree while maintaining the first optimization objective. The second stage trains with Q + L as the loss function, i.e. consistency is required while the network is still prevented from diverging to a state where Q > 0.
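The two stage losses can be sketched as follows; the hinge form of Q is reconstructed from the stated objective (Q reaches 0 once every depth coordinate is at least τ > 0), and the value of τ here is an assumption:

import torch

def loss_Q(z, tau=0.1):
    # Stage 1: z is a tensor of depth coordinates z^t_m; the loss is zero
    # exactly when every depth is >= tau > 0.
    return torch.clamp(tau - z, min=0.0).sum()

def loss_stage2(z, proj, x2d, tau=0.1):
    # Stage 2: Q + L, requiring reprojection consistency while still
    # preventing the network from diverging to Q > 0.
    L = ((proj - x2d) ** 2).sum()
    return loss_Q(z, tau) + L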
After the two training stages end, the network parameters are saved and then used as initial parameters in the identification process, where the optimization of L continues. Tests show that this initialization guarantees convergence of the optimization: in actual application, a network initialized in this way converges within 10 minutes of optimization, which meets the requirements of practical application scenes.
In a preferred embodiment, the step S2 includes:
s21, determining the area where the object is located in the plurality of videos by utilizing the trained object detection network. For example, an object detection network MASK-RCNN may be used to perform object detection on each frame of image of each video segment, so as to obtain a frame of the target object. In order to suppress the occurrence of the false detection phenomenon, it may be required to retain only the detection frame with the highest confidence level if a plurality of target objects are detected in one image. The object position of each frame image is determined by a quadruple (x 1 ,y 1 ,x 2 ,y 2 ) Representing the pixel coordinates of the upper left and lower right corners of the detection frame. If the target object cannot be detected correctly for the partial image, the output (0, 1) is expressed.
S22, respectively determining the two-dimensional positions of the keypoints within the region using a trained keypoint detection network. For example, when the identified object is a human body, a CPN (Cascaded Pyramid Network) may be used to mark the positions of the human keypoints in the image sequence: each frame and its corresponding human detection box are fed to the CPN to obtain the pixel coordinates of each keypoint. The original CPN is trained on the COCO dataset, which has only two-dimensional keypoint labels and uses a different definition of human keypoints; to obtain a keypoint representation consistent with the three-dimensional dataset Human3.6M, this embodiment trains the CPN with the two-dimensional keypoint labels of the Human3.6M dataset.
In a specific embodiment, the method is used in a medical scenario: the identified object is a human body, and multiple parts of the human body are defined as keypoints. As shown in fig. 2, this embodiment provides a system for identifying the three-dimensional position of an object, which includes three cameras and a terminal for data processing.
The three cameras are placed in front of a channel about 6 meters long, kept close in height, and shoot into the channel from the front-left, the front, and the front-right respectively. They shoot synchronously at the same frame rate (25 frames/second), each frame being 1920 pixels high and 1080 pixels wide; the video acquisition scene is shown in fig. 2 and 3.
The three cameras are calibrated after the setup is built. First, their intrinsic parameters are calibrated with black-and-white checkerboard pictures, directly using the MATLAB camera calibration toolbox. The extrinsic parameters are then calibrated with COLMAP: a group of pictures shot by the three cameras at the same moment is selected, and COLMAP's sparse reconstruction yields the camera parameters alongside the sparse model. After calibration the three cameras are no longer moved, so new videos are shot without recalibration, using the parameters calibrated when the environment was built.
A chair is placed at the far end of the roughly 6-meter channel, a transverse line is marked on the channel every 60 cm, and a red line is marked about 2.5 meters from the cameras; the patient must complete the turning motion before the red line. Partition boards outside the channel shield external interference, and sufficient illumination is provided above the channel. The three cameras are connected to a terminal (PC) via USB, and the terminal runs dedicated software for controlling the cameras and for storing, processing, and analyzing the videos.
When collecting video with the three cameras, the patient first sits on the chair at the far end of the channel. After shooting starts, the patient gradually stands up from the sitting position and moves toward the near end of the channel; after walking about 3.5 meters, the patient turns around, returns to the chair at the far end, and sits down, which completes one shot. Throughout the shoot only the one patient may appear in the pictures of the three cameras, and no unrelated person may enter the channel. Depending on the patient's speed, one shot lasts roughly 10 to 20 seconds; for a patient with a serious walking disorder it may exceed one minute.
After the three video segments are acquired, the terminal identifies the three-dimensional positions of the keypoints of the human body. The human keypoints comprise 17 parts: top of head, nose, neck, left shoulder, left elbow, left wrist, right shoulder, right elbow, right wrist, middle of the spine, center of the two hips, left hip, left knee, left ankle, right hip, right knee, and right ankle.
The terminal identifies the positions of these keypoints using the method shown in fig. 1; in this embodiment, the time-sequenced two-dimensional keypoints of the front view are used as input to obtain the three-dimensional keypoint coordinates at each moment. The recognition and optimization process of the neural network is shown in fig. 4. For example, the two-dimensional position of keypoint m at time t is $x^t_{m,1}$ in the front-left view video1, $x^t_{m,2}$ in the front view video2, and $x^t_{m,3}$ in the front-right view video3; the input data of the neural network is $x^t_{m,2}$.
From $x^t_{m,2}$ the neural network outputs the three-dimensional position $X^t_m$ of keypoint m at time t. From $X^t_m$ and the camera parameters, the projection positions $\hat{x}^t_{m,1}$, $\hat{x}^t_{m,2}$ and $\hat{x}^t_{m,3}$ of the keypoint in the three views are then calculated, yielding the differences between $\hat{x}^t_{m,1}$ and $x^t_{m,1}$, between $\hat{x}^t_{m,2}$ and $x^t_{m,2}$, and between $\hat{x}^t_{m,3}$ and $x^t_{m,3}$, on which the optimization of the neural network parameters is based.
The initialization of the parameters of the neural network may refer to the training scheme in the above embodiment, and will not be described herein.
After the three-dimensional positions of all keypoints of the human body are obtained, the posture of the human body can be analyzed. The three-dimensional keypoint coordinates represent the posture of the human body, and their variation over time represents its motion state; these data can be used to diagnose or analyze related diseases.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art; it is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (7)

1. A method of identifying a three-dimensional position of an object, comprising:
acquiring a plurality of videos respectively shot by a plurality of camera devices on the same object, wherein the videos are shot by an odd number of camera devices which are close in height and have a certain horizontal interval;
respectively determining two-dimensional positions of key points of the object in the videos;
predicting the three-dimensional position of the key point according to the two-dimensional position by using a neural network, wherein the two-dimensional position of the key point in the video shot by a shooting device positioned in the middle in the horizontal direction is used as input data of the neural network, so that the neural network outputs the three-dimensional position;
determining projection positions of the key points in the imaging surfaces of the imaging devices according to the three-dimensional positions and parameters of the imaging devices;
calculating a loss function of the neural network based on the difference between the projection positions and the two-dimensional positions, and optimizing parameters of the neural network based on the loss function, the loss function including $L = \sum_{t}\sum_{m}\sum_{i} \left\| \hat{x}^t_{m,i} - x^t_{m,i} \right\|$, wherein $x^t_{m,i}$ is the two-dimensional position of keypoint m at time t in the image of view i, and $\hat{x}^t_{m,i}$ is the projection position in view i of the three-dimensional coordinate of keypoint m at time t; the loss function further including $Q = \sum_{t}\sum_{m} \max\left(0, \tau - z^t_m\right)$, wherein $z^t_m$ represents the z-coordinate of the three-dimensional coordinate of keypoint m at time t, and $\tau$ is a constant greater than 0; the training process of the neural network being divided into two stages: in the first stage the loss function Q is used, and the parameters of the neural network are updated until Q = 0, which ends the first stage of training; in the second stage the loss functions Q and L are used, the loss function Q preventing the network from diverging to Q > 0; after the two training stages end, the network parameters are saved, and the saved network parameters are then used as initial values when identifying the three-dimensional position, with the network parameters optimized using the loss function L.
2. The method of claim 1, wherein determining two-dimensional locations of keypoints of the object in the plurality of videos, respectively, comprises:
determining the region where the object is located in the plurality of videos by utilizing a trained object detection network;
two-dimensional locations of the keypoints are determined within the region using a trained keypoint detection network, respectively.
3. The method according to claim 1, further comprising, before acquiring the plurality of videos respectively taken by the plurality of image pickup devices on the same object: initializing parameters of the neural network with training data, the training data being a plurality of videos of the same object taken by a plurality of cameras and capturing the object both moving away from and approaching the camera positions.
4. A method according to claim 3, characterized in that the initialization is divided into two phases, in which the loss functions used are not identical.
5. The method according to claim 4, wherein the loss function used in the first stage updates parameters of the neural network with a first optimization objective that is to make positive the depth coordinates of three-dimensional positions of object keypoints in training data output by the neural network;
the loss function used in the second stage updates the parameters of the neural network with a second optimization objective that is to agree the projected position and the two-dimensional position of the object keypoints in the training data on the basis of the first optimization objective.
6. The method of any one of claims 1-5, wherein the neural network is a long short-term memory network.
7. A system for identifying a three-dimensional position of an object, comprising:
a plurality of image pickup devices for picking up videos of the same object, respectively;
a terminal for identifying the three-dimensional position of a key point of an object according to the method of any one of claims 1-6.
CN201911223409.9A 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object Active CN110910449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911223409.9A CN110910449B (en) 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911223409.9A CN110910449B (en) 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object

Publications (2)

Publication Number Publication Date
CN110910449A CN110910449A (en) 2020-03-24
CN110910449B (en) 2023-10-13

Family

ID=69821806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911223409.9A Active CN110910449B (en) 2019-12-03 2019-12-03 Method and system for identifying three-dimensional position of object

Country Status (1)

Country Link
CN (1) CN110910449B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113916906B (en) * 2021-09-03 2024-01-09 江苏理工学院 LED light source illumination optimization method of visual detection system and experimental equipment used by method
CN114972958B (en) * 2022-07-27 2022-10-04 北京百度网讯科技有限公司 Key point detection method, neural network training method, device and equipment
CN115620094B (en) * 2022-12-19 2023-03-21 南昌虚拟现实研究院股份有限公司 Key point marking method and device, electronic equipment and storage medium


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018121737A1 (en) * 2016-12-30 2018-07-05 北京市商汤科技开发有限公司 Keypoint prediction, network training, and image processing methods, device, and electronic device
CN108038420A (en) * 2017-11-21 2018-05-15 华中科技大学 A kind of Human bodys' response method based on deep video
CN107945269A (en) * 2017-12-26 2018-04-20 清华大学 Complicated dynamic human body object three-dimensional rebuilding method and system based on multi-view point video
CN109064549A (en) * 2018-07-16 2018-12-21 中南大学 Index point detection model generation method and mark point detecting method
CN109087329A (en) * 2018-07-27 2018-12-25 中山大学 Human body three-dimensional joint point estimation frame and its localization method based on depth network
CN110348371A (en) * 2019-07-08 2019-10-18 叠境数字科技(上海)有限公司 Human body three-dimensional acts extraction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
3D human pose estimation in video with temporal convolutions and semi-supervised training;Dario Pavllo等;《arXiv》;20190329;全文 *
Recognition and simulation of dynamic three-dimensional arm postures based on Kinect; 吴誉兰 et al.; 计算机仿真 (Computer Simulation) (07); full text *

Also Published As

Publication number Publication date
CN110910449A (en) 2020-03-24

Similar Documents

Publication Publication Date Title
CN110910449B (en) Method and system for identifying three-dimensional position of object
US11928800B2 (en) Image coordinate system transformation method and apparatus, device, and storage medium
RU2426172C1 (en) Method and system for isolating foreground object image proceeding from colour and depth data
US11335456B2 (en) Sensing device for medical facilities
US10855932B2 (en) Processing holographic videos
JP6793151B2 (en) Object tracking device, object tracking method and object tracking program
CN110264493A (en) A kind of multiple target object tracking method and device under motion state
US20190051036A1 (en) Three-dimensional reconstruction method
WO2018161289A1 (en) Depth-based control method, depth-based control device and electronic device
CN110598590A (en) Close interaction human body posture estimation method and device based on multi-view camera
CN105868574A (en) Human face tracking optimization method for camera and intelligent health monitoring system based on videos
JP2017097577A (en) Posture estimation method and posture estimation device
CN109255783B (en) Method for detecting position arrangement of human skeleton key points on multi-person image
CN109769326B (en) Light following method, device and equipment
CN108921881A (en) A kind of across camera method for tracking target based on homography constraint
WO2022127181A1 (en) Passenger flow monitoring method and apparatus, and electronic device and storage medium
CN114120168A (en) Target running distance measuring and calculating method, system, equipment and storage medium
CN106909904B (en) Human face obverse method based on learnable deformation field
CN115457176A (en) Image generation method and device, electronic equipment and storage medium
KR20190071341A (en) Device and method for making body model
CN111582036A (en) Cross-view-angle person identification method based on shape and posture under wearable device
WO2022193516A1 (en) Depth camera-based pedestrian flow analysis method and apparatus
WO2019137186A1 (en) Food identification method and apparatus, storage medium and computer device
CN108597004A (en) Occlusal surfaces of teeth Panorama Mosaic method based on Local Optimization Algorithm
JP6548306B2 (en) Image analysis apparatus, program and method for tracking a person appearing in a captured image of a camera

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant