WO2022003963A1 - Data generation method, data generation program, and information processing device

Data generation method, data generation program, and information processing device

Info

Publication number
WO2022003963A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
dimensional
moving image
image data
skeleton
Prior art date
Application number
PCT/JP2020/026232
Other languages
English (en)
Japanese (ja)
Inventor
創輔 山尾
Original Assignee
富士通株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 富士通株式会社 filed Critical 富士通株式会社
Priority to JP2022533003A priority Critical patent/JP7318814B2/ja
Priority to PCT/JP2020/026232 priority patent/WO2022003963A1/fr
Publication of WO2022003963A1 publication Critical patent/WO2022003963A1/fr

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis

Definitions

  • the present invention relates to a data generation method, a data generation program, and an information processing apparatus.
  • Skeleton recognition is performed to detect three-dimensional human movements in various sports. For example, a method of calculating three-dimensional joint coordinates from a plurality of two-dimensional joint coordinates by triangulation is used. In recent years, in order to improve the accuracy of skeleton recognition, a method of estimating three-dimensional joint coordinates from the two-dimensional joint coordinates of a plurality of viewpoints, using an estimation model generated by machine learning, is also known.
  • The above estimation model is generated by machine learning using learning data that includes two-dimensional images and three-dimensional skeleton positions, but applying it to skeleton recognition of complicated movements such as gymnastics is very difficult, and improving the estimation accuracy requires a large amount of learning data. However, such learning data is generally generated manually, which is inaccurate and inefficient. As a result, the accuracy of three-dimensional skeleton recognition using machine learning deteriorates, and the cost increases.
  • In one aspect, an object is to provide a data generation method, a data generation program, and an information processing apparatus capable of automatically generating a learning data set including two-dimensional images and three-dimensional skeleton positions.
  • In one embodiment, a computer executes a process of acquiring the two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data of a subject performing a predetermined motion.
  • The computer also executes a process of identifying, from three-dimensional series data including three-dimensional skeleton data on a plurality of joint positions of the subject performing the predetermined motion, each of a plurality of three-dimensional skeleton data corresponding to each of the plurality of frames.
  • Using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of three-dimensional skeleton data, the computer executes a process of optimizing an adjustment amount relating to the time synchronization between the moving image data and the three-dimensional series data, and a projection parameter used when projecting the three-dimensional series data onto the moving image data.
  • Then, the computer executes a process of generating data in which the moving image data and the three-dimensional series data are associated with each other, using the optimized adjustment amount and projection parameter.
  • According to one embodiment, a learning data set including two-dimensional images and three-dimensional skeleton positions can be automatically generated.
  • FIG. 1 is a diagram illustrating an example of an overall configuration of a system using skeleton recognition.
  • FIG. 2 is a diagram illustrating the manual generation of learning data.
  • FIG. 3 is a functional block diagram showing a functional configuration of the generator according to the first embodiment.
  • FIG. 4 is a diagram showing an example of moving image data.
  • FIG. 5 is a diagram illustrating an example of three-dimensional skeleton series data.
  • FIG. 6 is a diagram illustrating acquisition of two-dimensional joint coordinates.
  • FIG. 7 is a diagram illustrating camera calibration.
  • FIG. 8 is a diagram illustrating time synchronization and resampling.
  • FIG. 9 is a diagram illustrating the estimation of the initial parameters.
  • FIG. 10 is a diagram illustrating parameter optimization.
  • FIG. 11 is a diagram illustrating the result of optimization.
  • FIG. 12 is a diagram showing an example of the generated learning data.
  • FIG. 13 is a flowchart showing the flow of the learning data generation process.
  • FIG. 14 is a flowchart showing the flow of processing from the estimation of the initial value to the optimization.
  • FIG. 15 is a diagram illustrating a processing example of skeleton recognition.
  • FIG. 16 is a diagram illustrating a processing example of skeleton recognition.
  • FIG. 17 is a diagram illustrating a processing example of skeleton recognition.
  • FIG. 18 is a diagram illustrating a hardware configuration example.
  • FIG. 1 is a diagram illustrating an example of an overall configuration of a system using skeleton recognition.
  • As shown in FIG. 1, this system includes a 3D (Three-Dimensional) laser sensor 5, a generation device 10, a learning device 40, a recognition device 50, and a scoring device 90, and captures three-dimensional data of the performer 1, who is the subject, recognizes the skeleton, and scores techniques accurately.
  • In this embodiment, a gymnastics competition will be described as an example, but the present invention is not limited to this; it can also be applied to other competitions in which an athlete performs a series of techniques and a referee scores them, and to the actions and movements of various people. Further, in this embodiment, two dimensions may be written as 2D, and three dimensions as 3D.
  • Currently, scoring in gymnastics is performed visually by a plurality of judges, but with the increasing sophistication of techniques, it is becoming more difficult for the judges to score visually.
  • an automatic scoring system and a scoring support system for scoring competitions using a 3D laser sensor 5 are known.
  • In the automatic scoring system, a distance image, which is three-dimensional data of the athlete, is acquired by the 3D laser sensor 5, and the skeleton, such as the orientation and angle of each of the athlete's joints, is recognized from the distance image.
  • In the scoring support system, the result of skeleton recognition is displayed as a 3D model, supporting the judges in scoring more accurately by letting them confirm the detailed state of the performer.
  • In the automatic scoring system, the technique performed is recognized from the result of skeleton recognition, and scoring is performed according to the scoring rules.
  • In the scoring support system and the automatic scoring system, it is required to support scoring, or to score automatically, in a timely manner for performances performed at any time.
  • However, a method of recognizing the performer's three-dimensional skeleton from a distance image or a color image can incur long processing times and a decrease in skeleton recognition accuracy due to insufficient memory or the like.
  • For example, in the automatic scoring system, the result of automatic scoring is provided to the judge, and the judge compares it with his or her own scoring result; if the skeleton recognition takes a long time, the provision of this information to the judge is delayed.
  • Further, if the accuracy of skeleton recognition decreases, the subsequent technique recognition may be erroneous, and as a result, the score determined from the technique will also be erroneous.
  • Similarly, in the scoring support system, when the angles and positions of the performer's joints are displayed using a 3D model, the display may be delayed or the displayed angles may be incorrect. In this case, scoring by a judge using the scoring support system may be erroneous.
  • The 3D laser sensor 5 is an example of a sensor device that measures (senses) the distance to an object for each pixel using an infrared laser or the like.
  • The distance image includes the distance to the subject for each pixel; that is, the distance image is a depth image showing the depth of the subject as seen from the 3D laser sensor (depth sensor) 5.
  • the learning device 40 is an example of a computer device that learns a machine learning model for skeleton recognition. Specifically, the learning device 40 generates a machine learning model by executing machine learning such as deep learning using two-dimensional skeletal position information, three-dimensional skeletal position information, and the like as a learning data set.
  • The recognition device 50 is an example of a computer device that recognizes the skeleton, that is, the orientation and position of each joint of the performer 1, using the distance image measured by the 3D laser sensor 5. Specifically, the recognition device 50 inputs the distance image measured by the 3D laser sensor 5 into the trained machine learning model trained by the learning device 40, and recognizes the skeleton based on the output of the machine learning model. After that, the recognition device 50 outputs the recognized skeleton to the scoring device 90. For example, the information obtained as a result of skeleton recognition is information on the three-dimensional position of each joint.
  • The scoring device 90 is an example of a computer device that uses the recognition result input from the recognition device 50 to identify the transition of movement obtained from the position and orientation of each of the performer's joints, and to identify and score the technique performed by the performer.
  • By the way, when generating such learning data, three-dimensional skeleton information and the like are newly collected using a laser method, an image method, or the like.
  • In the laser method, the laser is emitted about two million times per second, and the depth of each irradiated point, including points on the target person, is obtained from the round-trip travel time (Time of Flight: ToF) of each laser pulse.
  • Although the laser method can acquire highly accurate depth data, it requires complicated configurations and processing, such as laser scanning and ToF measurement, and complicated, expensive hardware, making it difficult to use for general purposes.
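  • As a minimal illustration of the ToF principle above (ours, not from the patent, which only states that depth is derived from a pulse's travel time):

```python
# Hedged sketch of depth-from-ToF: halve the round-trip distance
# travelled at the speed of light.
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_depth(round_trip_time_s: float) -> float:
    """Depth of an irradiated point from the laser pulse's round-trip time."""
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0  # out and back

# e.g. a 66.7 ns round trip corresponds to roughly 10 m
print(tof_depth(66.7e-9))
```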
  • On the other hand, the image method can use an inexpensive RGB camera that acquires RGB data for each pixel with a CMOS (Complementary Metal Oxide Semiconductor) imager, and with recent improvements in deep learning technology, the accuracy of image-based three-dimensional skeleton recognition is also improving.
  • However, if machine learning is executed using an existing data set such as Total Capture, which contains only general postures, it is not possible to generate a machine learning model that can recognize complicated postures such as those in gymnastics.
  • Since a machine learning model that can recognize complicated postures cannot be generated by commonly used methods, learning data for learning complicated postures is generated manually.
  • FIG. 2 is a diagram illustrating the manual generation of learning data.
  • As shown in FIG. 2, time synchronization between the moving image data capturing a series of exercises of the performer 1 and the three-dimensional skeleton series data including three-dimensional skeleton data from a past performance of the performer 1 is performed visually, searching for a frame in which the moving image data and the three-dimensional skeleton data are synchronized (S1). Then, using approximate camera parameters defined as prior knowledge, the 3D skeleton data is projected onto the image (S2), and the camera parameter values are adjusted manually so that the person silhouette in the moving image data and the 3D skeleton data overlap (S3). After that, the three-dimensional skeleton series data is resampled at the frame rate of the moving image data and projected onto the moving image data to create training data over the entire moving image data (S4).
  • Therefore, in the first embodiment, spatially and temporally optimal automatic alignment is performed using the moving image data acquired by a camera and the three-dimensional skeleton series data acquired by 3D sensing or the like for the same motion, providing a technology for generating high-quality training data efficiently.
  • Specifically, the generation device 10 acquires the two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in the two-dimensional moving image data of the performer 1 performing a performance. Subsequently, the generation device 10 identifies, from the three-dimensional skeleton series data on the plurality of joint positions (three-dimensional skeleton data) of the performer 1, each of the plurality of three-dimensional skeleton data corresponding to each of the plurality of frames. Then, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of three-dimensional skeleton data, the generation device 10 optimizes the adjustment amount of the time synchronization between the moving image data and the three-dimensional skeleton series data, together with the camera parameters used when projecting the three-dimensional skeleton series data onto the moving image data. After that, the generation device 10 generates data in which the moving image data and the three-dimensional skeleton series data are associated with each other, using the optimized adjustment amount and camera parameters.
  • That is, the generation device 10 performs camera calibration and time synchronization based on non-linear optimization so that the geometric consistency between the 3D skeleton series data and the 2D joint coordinates is maximized.
  • In this way, spatially and temporally optimal automatic alignment is performed.
  • As a result, the generation device 10 can automatically generate a training data set, including two-dimensional images and three-dimensional skeleton positions, that can be used to train a machine learning model that estimates three-dimensional skeleton positions with high accuracy.
  • FIG. 3 is a functional block diagram showing a functional configuration of the generator 10 according to the first embodiment.
  • the generation device 10 includes a communication unit 11, a storage unit 12, and a control unit 20.
  • the communication unit 11 is a processing unit that controls communication with other devices, and is realized by, for example, a communication interface.
  • For example, the communication unit 11 receives the moving image data of the performer 1 captured by a camera or the like, and receives the three-dimensional skeleton series data of the performer 1 measured by the 3D laser sensor 5.
  • the storage unit 12 is a processing unit that stores various data, programs executed by the control unit 20, and the like, and is realized by, for example, a memory or a hard disk.
  • the storage unit 12 stores the moving image data 13, the three-dimensional skeleton sequence data 14, and the learning data set 15.
  • the moving image data 13 is a series of moving image data taken by a camera or the like when the performer 1 performs, and is composed of a plurality of frames.
  • FIG. 4 is a diagram showing an example of moving image data 13.
  • FIG. 4 shows, as an example, one frame in the moving image data 13 taken while acting a pommel horse.
  • The moving image data 13 uses the position, orientation, and resolution of the camera as its coordinate system, and the camera-specific time stamps and sampling rate as its time system.
  • the three-dimensional skeleton series data 14 is series data including three-dimensional skeleton data showing three-dimensional joint coordinates related to a plurality of joint positions.
  • the three-dimensional skeleton sequence data 14 is a series of three-dimensional skeleton data captured by a 3D laser sensor or the like when the performer 1 performs.
  • the three-dimensional skeleton data includes information on the three-dimensional skeleton position (skeleton information) of each joint when the performer 1 is performing.
  • FIG. 5 is a diagram illustrating an example of three-dimensional skeleton series data 14.
  • FIG. 5 illustrates one piece of three-dimensional skeleton data in the three-dimensional skeleton series data 14 generated from the performance of the performer 1.
  • the three-dimensional skeleton series data 14 is data acquired by the 3D sensing technique, and is data including three-dimensional coordinates of each joint.
  • each joint is, for example, 18 joints designated in advance such as the right shoulder, the left shoulder, and the right ankle, or a plurality of joints arbitrarily set by the user.
  • The three-dimensional skeleton series data 14 uses the position and orientation of the sensor as its coordinate system, and the sensor-specific time stamps and sampling rate as its time system.
  • the learning data set 15 is a database that is generated by the control unit 20 described later and includes a plurality of training data used for generating a machine learning model.
  • the learning data set 15 is information in which the moving image data 13 is associated with the three-dimensional skeleton data and the camera parameters.
  • the control unit 20 is a processing unit that controls the entire generation device 10, and is realized by, for example, a processor.
  • the control unit 20 has a data acquisition unit 21, a coordinate acquisition unit 22, and a learning data generation unit 23.
  • the data acquisition unit 21, the coordinate acquisition unit 22, and the learning data generation unit 23 are realized by an electronic circuit possessed by the processor, a process executed by the processor, or the like.
  • the data acquisition unit 21 is a processing unit that acquires moving image data 13 and three-dimensional skeleton series data 14 and stores them in the storage unit 12.
  • For example, the data acquisition unit 21 can acquire the moving image data 13 from the camera, or can read moving image data 13 captured in advance by a known method from its storage destination and store it in the storage unit 12.
  • Similarly, the data acquisition unit 21 can acquire the three-dimensional skeleton series data 14 from the 3D laser sensor, or can read three-dimensional skeleton series data 14 captured in advance by a known method from its storage destination and store it in the storage unit 12.
  • The coordinate acquisition unit 22 is a processing unit that acquires two-dimensional coordinates, that is, the two-dimensional joint coordinates of each of the plurality of joints, from each of the plurality of frames included in the moving image data 13. Specifically, the coordinate acquisition unit 22 selects several suitable frames (for example, 10 frames) from the moving image data and acquires, automatically or manually, the two-dimensional joint positions of the target person in each frame.
  • FIG. 6 is a diagram illustrating acquisition of two-dimensional joint coordinates.
  • As shown in FIG. 6, the coordinate acquisition unit 22 selects a frame from the moving image data 13 and sets eight pre-designated joints in the selected frame as annotation targets. Then, using an existing model or the like for annotation, the coordinate acquisition unit 22 acquires two-dimensional coordinates indicating the joint positions of (1) the right elbow, (2) the right wrist, (3) the left elbow, (4) the left wrist, (5) the right knee, (6) the right ankle, (7) the left knee, and (8) the left ankle.
  • In other words, the two-dimensional joint coordinates can be obtained by automatic annotation using an existing two-dimensional skeleton recognition method, or by visual or manual annotation.
  • The learning data generation unit 23 has an initial setting unit 24, an optimization unit 25, and an output unit 26, and is a processing unit that generates the learning data set 15 in which the moving image data 13 and the three-dimensional skeleton data are associated. Specifically, since the learning data generation unit 23 has both the three-dimensional skeleton series data 14 and the moving image data 13 for the same performance, it can generate a large number of images annotated with two-dimensional or three-dimensional joint coordinates for complicated postures, without acquiring new data, by projecting the three-dimensional skeleton series data 14 onto the moving image data 13.
  • Here, the three-dimensional skeleton series data 14 and the moving image data 13 are based on different coordinate systems and time systems. Therefore, in order to project the three-dimensional skeleton series data 14 onto the moving image data 13, the learning data generation unit 23 performs camera calibration, corresponding to spatial alignment, and time synchronization and resampling, corresponding to temporal alignment.
  • FIG. 7 is a diagram illustrating camera calibration.
  • To project a 3D point onto an image, camera-specific parameters such as focal length and resolution (camera internal parameters) are required for the camera that captured the image, as well as the position and orientation of the camera (camera external parameters) in the reference coordinate system for 3D points (the world coordinate system).
  • the process of obtaining these parameters (camera parameters) is called camera calibration.
  • In the standard projection relation s · [x, y, 1]^T = K (R · [X, Y, Z]^T + t), where s is the projective scale, [X, Y, Z] indicates the coordinates of the 3D point that is the projection source, and [x, y] indicates the coordinates of the projected point on the image that is the projection destination.
  • K is an internal parameter of the camera and is a 3 × 3 intrinsic matrix.
  • R is an external parameter of the camera and is a 3 × 3 rotation matrix, and t is a 3 × 1 translation vector. Of these, R and t are the targets of camera calibration.
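  • The projection relation above can be written directly in code. The following is a minimal sketch (ours, not from the patent; the function name and array shapes are our assumptions):

```python
import numpy as np

def project_points(X: np.ndarray, K: np.ndarray, R: np.ndarray,
                   t: np.ndarray) -> np.ndarray:
    """Project Nx3 world-coordinate joints to Nx2 pixel coordinates.

    Implements x ~ K (R X + t): rotate/translate into the camera frame,
    apply the 3x3 intrinsic matrix, then divide by depth.
    """
    Xc = X @ R.T + t.reshape(1, 3)    # world -> camera coordinates
    uvw = Xc @ K.T                    # apply intrinsic matrix K
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide -> [x, y]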
  • FIG. 8 is a diagram illustrating time synchronization and resampling.
  • As shown in FIG. 8, since the moving image data 13 and the three-dimensional skeleton series data 14 are acquired in different time systems, the two data sets are first synchronized as a whole, and then the three-dimensional skeleton series data 14 is resampled at the time of each frame of the moving image data 13.
  • By resampling the 3D skeleton series data 14 in this way, 3D skeleton data synchronized with the moving image data 13 can be acquired.
  • That is, time synchronization defines the conversion between the time systems of the moving image data 13 and the three-dimensional skeleton series data 14, and resampling interpolates the three-dimensional skeleton data at the time of each frame of the moving image data 13.
  • For example, let t be the real-world time system, t_s the time system of the three-dimensional skeleton series data 14, and t_v the time system of the moving image data 13.
  • The three-dimensional skeleton series data 14 is sampled at a sampling period T_s, yielding three-dimensional skeleton data at times t_s,0, t_s,1, t_s,2, and so on.
  • The moving image data 13 is sampled at a sampling period T_v, yielding frames at times t_v,0, t_v,1, t_v,2, and so on.
  • The difference between time t_s,0 at the head of the three-dimensional skeleton series data 14 and time t_v,0 at the head of the moving image data 13 is the time shift amount T_v,s.
  • Then, the three-dimensional skeleton data corresponding to a time t_v,j is obtained by interpolation, referring to the three-dimensional skeleton data within a predetermined time range around t_v,j converted by the time-system conversion formula (a code sketch follows below).
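  • A minimal sketch of this time-system conversion and resampling (our illustration; per-joint linear interpolation and the array shapes are our choices, not the patent's specification):

```python
import numpy as np

def resample_skeleton(skel: np.ndarray, T_s: float, t_v: np.ndarray,
                      T_vs: float) -> np.ndarray:
    """Interpolate 3D skeleton data at each video frame time.

    skel : (N, J, 3) skeleton samples taken at t_s,k = k * T_s
    t_v  : (M,) video frame times in the video time system
    T_vs : time shift between the heads of the two sequences
    Returns (M, J, 3) skeleton data synchronized with the video frames.
    """
    t_s = t_v + T_vs                           # video time -> sensor time
    k = np.clip(t_s / T_s, 0, skel.shape[0] - 1)
    k0 = np.floor(k).astype(int)               # neighboring sample indices
    k1 = np.minimum(k0 + 1, skel.shape[0] - 1)
    w = (k - k0)[:, None, None]                # fractional position in [0, 1)
    return (1.0 - w) * skel[k0] + w * skel[k1]  # per-joint linear interp
```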
  • the learning data generation unit 23 executes "camera calibration” and "time synchronization and resampling” in order to project the three-dimensional skeleton sequence data 14 onto the moving image data 13.
  • At this time, in order to maximize the geometric consistency between the three-dimensional skeleton series data 14 and the moving image data 13, the learning data generation unit 23 performs camera calibration and time synchronization based on non-linear optimization.
  • To do so, it defines a cost function that implements camera calibration and time synchronization at the same time.
  • The learning data generation unit 23 then calculates the optimal time synchronization and camera calibration by optimizing this cost function.
  • The initial setting unit 24 is a processing unit that sets the initial parameters of the cost function. Specifically, for each synchronization pattern in which the time synchronization is changed, the initial setting unit 24 uses each of the plurality of frames and each of the plurality of three-dimensional skeleton data associated by resampling to calculate camera parameters, by solving an estimation problem that estimates the position and orientation of the camera when projecting the three-dimensional skeleton series data 14 onto the moving image data 13. Then, using the camera parameters calculated for each synchronization pattern, the initial setting unit 24 calculates a likelihood indicating the validity of each synchronization pattern's time synchronization and camera parameters. After that, the initial setting unit 24 sets the synchronization pattern with the highest likelihood, together with each of the plurality of frames and each of the plurality of three-dimensional skeleton data identified for it, as the initial values.
  • FIG. 9 is a diagram illustrating the estimation of the initial parameters.
  • As shown in FIG. 9, while appropriately changing the time synchronization, the initial setting unit 24 obtains by resampling the three-dimensional skeleton data corresponding to the frames of the moving image data 13 for which two-dimensional joint coordinates have been acquired. Then, the initial setting unit 24 estimates the camera parameters by solving the PnP (Perspective-n-Point) problem on the correspondences between the two-dimensional joint positions and the three-dimensional skeleton data.
  • Further, the initial setting unit 24 quantitatively calculates the geometric consistency between the three-dimensional skeleton series data 14 and the two-dimensional joint coordinates as a likelihood indicating the validity of the obtained camera parameters and time synchronization. For example, the initial setting unit 24 calculates, as the likelihood, the ratio of the number of joints whose reprojection error of the three-dimensional joint coordinates in the resampled three-dimensional skeleton data is less than a threshold value. Then, among all the time-synchronization trials, the initial setting unit 24 adopts the time synchronization and camera parameters at which the likelihood takes its maximum value as the quasi-optimal solution, and outputs them to the optimization unit 25 as initial values.
  • FIG. 9 shows an example in which the initial setting unit 24 calculates the quasi-optimal solution by executing trial i-1, trial i, and trial i+1, which correspond to synchronization patterns in which the first frame of the moving image data 13 is shifted at regular intervals. Taking trial i-1 as an example, for the frame of the moving image data 13 for which two-dimensional coordinates have been acquired, the initial setting unit 24 generates data A1 by resampling using the three-dimensional skeleton data between time t_s,0 and time t_s,1.
  • Similarly, the initial setting unit 24 generates data A3 by resampling a frame for which two-dimensional coordinates have been acquired, using the three-dimensional skeleton data between time t_s,2 and time t_s,3, and generates data A4 by resampling using the three-dimensional skeleton data between time t_s,3 and time t_s,4.
  • the initial setting unit 24 estimates the camera parameters by solving the PnP problem using the data A1, the data A3, and the data A4 including the three-dimensional skeleton data and the two-dimensional coordinates.
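  • A sketch of this PnP step, using OpenCV's solvePnP as one possible solver (the patent does not name an implementation; the internal matrix K is assumed known here, since R and t are the calibration targets):

```python
import cv2
import numpy as np

def estimate_camera(points_3d: np.ndarray, points_2d: np.ndarray,
                    K: np.ndarray):
    """Solve the PnP problem for one time-synchronization trial.

    points_3d : (N, 3) resampled 3D joint coordinates (data A1, A3, A4, ...)
    points_2d : (N, 2) annotated 2D joint coordinates in the frames
    K         : (3, 3) camera internal matrix (assumed known)
    Returns the external parameters R (3x3) and t (3x1).
    """
    ok, rvec, tvec = cv2.solvePnP(points_3d.astype(np.float64),
                                  points_2d.astype(np.float64), K, None)
    if not ok:
        raise RuntimeError("PnP failed for this trial")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return R, tvec
```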
  • the initial setting unit 24 projects the three-dimensional skeleton data on the frame of the moving image data 13 using the estimated camera parameters.
  • Specifically, for each joint in the frame, the initial setting unit 24 calculates the distance between the annotated two-dimensional coordinates of the joint and the two-dimensional position obtained by projecting the corresponding three-dimensional coordinates of the joint in the three-dimensional skeleton data (using only the resulting two-dimensional coordinates).
  • the initial setting unit 24 calculates the ratio of joints whose distance is less than the threshold value as the likelihood.
  • the likelihood can be calculated using the distance of each joint when the three-dimensional skeleton data is projected onto one or a plurality of frames.
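  • This likelihood computation can be sketched as follows (ours; the pixel threshold is an assumed value, and project_points is the projection sketch given earlier):

```python
import numpy as np

def likelihood(points_2d: np.ndarray, points_3d: np.ndarray,
               K: np.ndarray, R: np.ndarray, t: np.ndarray,
               thresh_px: float = 10.0) -> float:
    """Ratio of joints whose reprojection error is below a threshold.

    points_2d : (N, 2) annotated joint coordinates
    points_3d : (N, 3) resampled 3D coordinates of the same joints
    Uses project_points() from the earlier projection sketch.
    """
    proj = project_points(points_3d, K, R, t.reshape(3))
    err = np.linalg.norm(proj - points_2d, axis=1)  # per-joint pixel error
    return float(np.mean(err < thresh_px))          # fraction below threshold
```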
  • In this way, the initial setting unit 24 executes the above processing and calculates the likelihood for each of trial i-1, trial i, and trial i+1. Then, the initial setting unit 24 determines the time synchronization and camera parameters of trial i, which has the highest likelihood, as the quasi-optimal solution (initial values).
  • The optimization unit 25 is a processing unit that optimizes the cost function relating to the time-synchronization adjustment amount and the camera parameters, using the quasi-optimal solution calculated by the initial setting unit 24 as the initial value. Specifically, the optimization unit 25 defines a cost function C (Equation 1) relating to the time-synchronization adjustment amount Δt and the camera parameters, and minimizes it by applying non-linear optimization with the quasi-optimal solution as the initial value. At this time, the optimization unit 25 expresses the three-dimensional skeleton series data, which is discrete data, as a differentiable continuous function f(t) for each joint, and incorporates f(t) into the cost function C as the resampling process of the three-dimensional skeleton data. As f(t), cubic spline interpolation or the like can be applied.
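  • Equation 1 is not reproduced in this text. A plausible form, consistent with the description, sums the reprojection errors of the spline-resampled joints over the annotated frames: C(Δt, R, t) = Σ_i Σ_j || x_{i,j} − π(K, R, t, f_j(t_{v,i} + Δt)) ||^2. The sketch below is our reconstruction under that assumption, not the patent's exact formulation; it uses SciPy's cubic spline for f(t) and non-linear least squares for the minimization:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def optimize(t_frames, joints_2d, t_samples, skel, K, dt0, rvec0, tvec0):
    """Jointly refine the time shift Δt and camera externals R, t.

    t_frames  : (M,) times of the annotated video frames
    joints_2d : (M, J, 2) annotated 2D joint coordinates
    t_samples : (N,) sample times of the 3D skeleton series
    skel      : (N, J, 3) 3D skeleton series data
    dt0, rvec0, tvec0 : quasi-optimal initial values from the PnP stage
    """
    f = CubicSpline(t_samples, skel, axis=0)   # differentiable f(t) per joint

    def residuals(p):
        dt, rvec, tvec = p[0], p[1:4], p[4:7]
        R = Rotation.from_rotvec(rvec).as_matrix()
        X = f(t_frames + dt)                   # resampling: (M, J, 3)
        Xc = X @ R.T + tvec                    # world -> camera coordinates
        uvw = Xc @ K.T
        proj = uvw[..., :2] / uvw[..., 2:3]    # perspective divide
        return (proj - joints_2d).ravel()      # reprojection errors

    p0 = np.concatenate([[dt0], rvec0, tvec0])
    res = least_squares(residuals, p0)         # non-linear optimization
    return res.x[0], res.x[1:4], res.x[4:7]    # Δt*, rvec*, tvec*
```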
  • FIG. 10 is a diagram illustrating parameter optimization.
  • As shown in FIG. 10, the optimization unit 25 sets the quasi-optimal solution as the initial value of the cost function C, and repeats optimization and resampling to calculate the optimal value of each parameter.
  • For example, the optimization unit 25 optimizes the cost function using data B1, in which two-dimensional coordinates and three-dimensional skeleton data are associated; when the next data B2 is to be used, the above resampling is performed to generate data B2, and the cost function is then optimized using data B2. In this way, the optimization unit 25 can optimize Δt and the camera parameters simultaneously.
  • FIG. 11 is a diagram illustrating the result of optimization.
  • the initial setting unit 24 sets initial parameters for the frame (image) in which the two-dimensional coordinates of the joints (1) to (8) are acquired.
  • the optimization unit 25 executes the optimization to improve the geometrical consistency, so that the athlete's body on the image and the resampled three-dimensional skeleton data match.
  • the optimization unit 25 outputs the optimized ⁇ t and the camera parameter to the output unit 26.
  • The output unit 26 is a processing unit that generates learning data using the optimization result from the optimization unit 25. Specifically, the output unit 26 generates images annotated with two-dimensional joint coordinates and three-dimensional joint coordinates based on the optimized time-synchronization adjustment amount and camera parameters, and stores them in the learning data set 15 as training data.
  • FIG. 12 is a diagram showing an example of the generated learning data.
  • As shown in FIG. 12, the output unit 26 stores entries in the form "image, three-dimensional skeleton data, camera parameters", such as "I_1, ({X_1,1, Y_1,1, Z_1,1} ... {X_1,j, Y_1,j, Z_1,j}), (K, R, t)" and "I_2, ({X_2,1, Y_2,1, Z_2,1} ... {X_2,j, Y_2,j, Z_2,j}), (K, R, t)".
  • In this example, the three-dimensional joint coordinates "{X_1,1, Y_1,1, Z_1,1} ... {X_1,j, Y_1,j, Z_1,j}" are associated with the two-dimensional image I_1, and the camera parameters at the time of association are "K, R, t".
  • One set of camera parameters is determined for a series of moving image data 13.
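  • For illustration only, one stored entry might look like the following (field names are our assumptions; the layout follows the "image, three-dimensional skeleton data, camera parameters" form of FIG. 12):

```python
# One learning-data entry, mirroring FIG. 12 (names are illustrative):
entry = {
    "image": "I_1.png",          # frame of the moving image data
    "joints_3d": [               # {X_1,k, Y_1,k, Z_1,k} per joint
        [0.12, 1.43, 0.88],
        [0.15, 1.10, 0.91],
        # ... one [X, Y, Z] triple per joint
    ],
    "camera": {"K": "3x3 intrinsics", "R": "3x3 rotation",
               "t": "3x1 translation"},
}
```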
  • FIG. 13 is a flowchart showing the flow of the learning data generation process.
  • As shown in FIG. 13, the coordinate acquisition unit 22 reads the moving image data 13 and the three-dimensional skeleton series data 14 from the storage unit 12 (S101), and acquires the two-dimensional joint coordinates in several frames of the moving image data 13 (S102).
  • Subsequently, the learning data generation unit 23 estimates initial values for the camera parameters and the time synchronization (S103), optimizes the camera parameters and the time synchronization (S104), and generates training data using the optimization result (S105).
  • FIG. 14 is a flowchart showing the flow of processing from the estimation of the initial value to the optimization.
  • the learning data generation unit 23 generates a candidate group for time synchronization between the moving image data 13 and the three-dimensional skeleton series data 14 (S201).
  • the learning data generation unit 23 resamples the three-dimensional skeleton data corresponding to the frame having the two-dimensional joint coordinates for each candidate for time synchronization (S202). Then, the learning data generation unit 23 estimates the camera parameters for each candidate for time synchronization by solving the PnP problem based on the correspondence between the two-dimensional joint position and the three-dimensional skeleton data (S203).
  • Then, the learning data generation unit 23 calculates, for each time-synchronization candidate, the likelihood indicating the validity of the time synchronization and the camera parameters (S204), and determines the time synchronization and camera parameters whose likelihood is the maximum among the candidates as the quasi-optimal solution (S205).
  • After that, the learning data generation unit 23 defines a cost function relating to the time synchronization and the camera parameters that incorporates the resampling processing of the three-dimensional skeleton data (S206), and executes non-linear optimization with the quasi-optimal solution as the initial value to minimize the cost function and obtain the optimal time synchronization and camera parameters (S207).
  • As described above, the generation device 10 simultaneously estimates quasi-optimal camera parameters and time synchronization so that the geometric consistency between the 3D skeleton series data 14 and the 2D joint coordinates is maximized, and then executes spatially and temporally optimal automatic alignment by non-linear optimization with the estimation result as the initial value. Therefore, the generation device 10 can efficiently generate training data for image-based skeleton recognition using the moving image data 13 and the three-dimensional skeleton series data 14 acquired asynchronously by a camera and 3D sensing technology.
  • Further, since the generation device 10 can calculate, as the likelihood, the ratio of the number of joints whose per-joint reprojection error onto a frame of the moving image data is less than a threshold value, it can set accurate initial values. As a result, the generation device 10 can execute the optimization from a state already narrowed down to some extent, reducing the cost and shortening the processing time of the optimization process.
  • the target data type, cost function, machine learning model, learning data, various parameters, etc. used in the above embodiment are merely examples and can be arbitrarily changed.
  • In the above embodiment, gymnastics has been described as an example, but the present invention is not limited to this and can be applied to other competitions in which an athlete performs a series of techniques and a referee scores them. Examples of other competitions include figure skating, rhythmic gymnastics, cheerleading, diving, karate kata, and mogul skiing aerials. Further, it can be applied not only to sports but also to posture detection of drivers of trucks, taxis, trains, and the like, and to posture detection of pilots.
  • the training data generated from the above can be adopted in various models for skeletal recognition.
  • an application example of learning data will be described.
  • FIGS. 15 to 17 are diagrams illustrating processing examples of skeleton recognition.
  • FIG. 15 is an example of executing three-dimensional skeleton recognition by a mathematical formula after executing two-dimensional skeleton recognition.
  • In this method, a person detection model is used to detect a person from a large number of images, and a 2D detection model is used to identify a plurality of two-dimensional joint coordinates from the heat maps of the detected person's joints.
  • Then, the three-dimensional joint coordinates are obtained algebraically from the two-dimensional joint coordinates of a plurality of viewpoints by triangulation, as sketched below.
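  • A sketch of the triangulation step (ours; it uses OpenCV's two-view triangulation, while the description allows any number of viewpoints):

```python
import cv2
import numpy as np

def triangulate(P1: np.ndarray, P2: np.ndarray,
                pts1: np.ndarray, pts2: np.ndarray) -> np.ndarray:
    """Recover (N, 3) joint positions from two calibrated views.

    P1, P2     : 3x4 projection matrices K [R | t] of the two cameras
    pts1, pts2 : (N, 2) 2D joint coordinates from each view's 2D model
    """
    X_h = cv2.triangulatePoints(P1, P2,
                                pts1.T.astype(np.float64),
                                pts2.T.astype(np.float64))  # 4xN homogeneous
    return (X_h[:3] / X_h[3]).T                             # dehomogenize
```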
  • FIG. 16 is an example of executing three-dimensional skeleton recognition by a model after executing two-dimensional skeleton recognition.
  • In this method as well, a person detection model is used to detect a person from a large number of images, and a 2D detection model is used to identify a plurality of two-dimensional joint coordinates from the heat maps of the detected person's joints.
  • Unlike the method of FIG. 15, the three-dimensional joint coordinates are then obtained using a 3D estimation model obtained by learning.
  • Specifically, the 3D joint coordinates are estimated from 3D voxel data that integrates the 2D heat maps of the multiple viewpoints, using the trained 3D estimation model.
  • For the model that identifies two-dimensional joint coordinates, machine learning is executed using the learning data generated in the first embodiment, in which images and two-dimensional coordinates are associated.
  • For the 3D estimation model, machine learning is executed using the training data in which images, 3D coordinates, and camera parameters are associated. As a result, the accuracy of each model can be improved.
  • FIG. 17 is an example of directly executing three-dimensional skeleton recognition.
  • In this method, a person detection model is used to detect a person from an image, and a 3D detection model obtained by learning is used to estimate the three-dimensional joint coordinates of each joint of the detected person.
  • Two machine learning models are used in this method. For these models, machine learning is executed using the learning data associated with "images, 3D coordinates, and camera parameters". As a result, the accuracy of each model can be improved.
  • the coordinate acquisition unit 22 is an example of an acquisition unit
  • The learning data generation unit 23 is an example of an identification unit, an execution unit, and a generation unit.
  • the three-dimensional skeleton series data 14 is an example of the three-dimensional series data.
  • the camera parameter is an example of the projection parameter.
  • the two-dimensional joint coordinates are coordinates when the joint positions are expressed in two dimensions, and are synonymous with the two-dimensional skeletal coordinates.
  • Further, the three-dimensional joint coordinates are the coordinates when the joint positions are expressed in three dimensions, and are synonymous with the three-dimensional skeletal coordinates.
  • Each component of each device shown in the figures is a functional concept and does not necessarily have to be physically configured as shown. That is, the specific form of distribution or integration of each device is not limited to that shown in the figures; all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
  • each processing function performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware by wired logic.
  • FIG. 18 is a diagram illustrating a hardware configuration example.
  • the generation device 10 includes a communication device 10a, an HDD (Hard Disk Drive) 10b, a memory 10c, and a processor 10d. Further, the parts shown in FIG. 18 are connected to each other by a bus or the like.
  • the communication device 10a is a network interface card or the like, and communicates with other servers.
  • The HDD 10b stores programs and a DB for operating the functions shown in FIG. 3.
  • the processor 10d reads a program that executes the same processing as each processing unit shown in FIG. 3 from the HDD 10b or the like and expands the program into the memory 10c to operate a process that executes each function described in FIG. 3 or the like. For example, this process executes the same function as each processing unit of the generation device 10. Specifically, the processor 10d reads a program having the same functions as the data acquisition unit 21, the coordinate acquisition unit 22, the learning data generation unit 23, and the like from the HDD 10b and the like. Then, the processor 10d executes a process of executing the same processing as the data acquisition unit 21, the coordinate acquisition unit 22, the learning data generation unit 23, and the like.
  • the generation device 10 operates as an information processing device that executes various information processing methods by reading and executing the program. Further, the generation device 10 can realize the same function as that of the above-described embodiment by reading the program from the recording medium by the medium reader and executing the read program.
  • The program described in this embodiment is not limited to being executed by the generation device 10.
  • The present invention can be similarly applied when another computer or server executes the program, or when these execute the program in cooperation with each other.
  • This program can be distributed via networks such as the Internet.
  • Further, this program can be recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a CD-ROM, an MO (Magneto-Optical disk), or a DVD (Digital Versatile Disc), and can be executed by being read from the recording medium by a computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

According to the present invention, a generation device acquires the two-dimensional coordinates of a plurality of joints from each of a plurality of frames included in moving image data obtained by imaging a subject performing a predetermined motion. The generation device identifies a plurality of pieces of three-dimensional skeleton data corresponding to the respective frames from three-dimensional series data including three-dimensional skeleton data relating to a plurality of joint positions of the subject performing the predetermined motion. Using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data, the generation device optimizes an adjustment amount relating to the time synchronization between the moving image data and the three-dimensional series data, and a projection parameter used when the three-dimensional series data is projected onto the moving image data. The generation device then generates data in which the moving image data is associated with the three-dimensional series data, using the optimized adjustment amount and projection parameter.
PCT/JP2020/026232 2020-07-03 2020-07-03 Procédé de génération de données, programme de génération de données et dispositif de traitement d'informations WO2022003963A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022533003A JP7318814B2 (ja) 2020-07-03 2020-07-03 データ生成方法、データ生成プログラムおよび情報処理装置
PCT/JP2020/026232 WO2022003963A1 (fr) 2020-07-03 2020-07-03 Procédé de génération de données, programme de génération de données et dispositif de traitement d'informations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/026232 WO2022003963A1 (fr) 2020-07-03 2020-07-03 Procédé de génération de données, programme de génération de données et dispositif de traitement d'informations

Publications (1)

Publication Number Publication Date
WO2022003963A1 true WO2022003963A1 (fr) 2022-01-06

Family

ID=79314967

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/026232 WO2022003963A1 (fr) 2020-07-03 2020-07-03 Procédé de génération de données, programme de génération de données et dispositif de traitement d'informations

Country Status (2)

Country Link
JP (1) JP7318814B2 (fr)
WO (1) WO2022003963A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018129009A (ja) * 2017-02-10 2018-08-16 日本電信電話株式会社 画像合成装置、画像合成方法及びコンピュータプログラム
WO2018211571A1 (fr) * 2017-05-15 2018-11-22 富士通株式会社 Programme, procédé dispositif d'affichage de prestation
WO2019043928A1 (fr) * 2017-09-01 2019-03-07 富士通株式会社 Programme d'aide à l'entraînement, procédé d'aide à l'entraînement et système d'aide à l'entraînement

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024024055A1 (fr) * 2022-07-28 2024-02-01 富士通株式会社 Procédé, dispositif et programme de traitement d'informations
CN115311314A (zh) * 2022-10-13 2022-11-08 深圳市华汉伟业科技有限公司 一种线激光轮廓数据的重采样方法、系统和存储介质
CN115311314B (zh) * 2022-10-13 2023-02-17 深圳市华汉伟业科技有限公司 一种线激光轮廓数据的重采样方法、系统和存储介质

Also Published As

Publication number Publication date
JPWO2022003963A1 (fr) 2022-01-06
JP7318814B2 (ja) 2023-08-01

Similar Documents

Publication Publication Date Title
Zheng et al. Hybridfusion: Real-time performance capture using a single depth sensor and sparse imus
WO2020054442A1 (fr) Procédé et dispositif d'acquisition de position d'articulation, et procédé et dispositif d'acquisition de mouvement
JP7410499B2 (ja) 組立ロボットの遠隔操作環境のデジタルツインモデリング方法及びシステム
CN111402290B (zh) 一种基于骨骼关键点的动作还原方法以及装置
Shiratori et al. Motion capture from body-mounted cameras
KR101591779B1 (ko) 모션 데이터 및 영상 데이터를 이용한 골격 모델 생성 장치및 방법
KR101640039B1 (ko) 영상 처리 장치 및 방법
Wang et al. Video-based hand manipulation capture through composite motion control
KR101616926B1 (ko) 영상 처리 장치 및 방법
KR101812379B1 (ko) 포즈를 추정하기 위한 방법 및 장치
Zhang et al. Leveraging depth cameras and wearable pressure sensors for full-body kinematics and dynamics capture
JP7367764B2 (ja) 骨格認識方法、骨格認識プログラムおよび情報処理装置
JP2023502795A (ja) 実世界環境の4d時空間モデルを生成するためのリアルタイムシステム
JP7164045B2 (ja) 骨格認識方法、骨格認識プログラムおよび骨格認識システム
CN113449570A (zh) 图像处理方法和装置
JP2019079487A (ja) パラメータ最適化装置、パラメータ最適化方法、プログラム
CN109934847A (zh) 弱纹理三维物体姿态估计的方法和装置
WO2022003963A1 (fr) Procédé de génération de données, programme de génération de données et dispositif de traitement d'informations
JP2021060868A (ja) 情報処理装置、情報処理方法、およびプログラム
CN113284192A (zh) 运动捕捉方法、装置、电子设备以及机械臂控制系统
Zou et al. Automatic reconstruction of 3D human motion pose from uncalibrated monocular video sequences based on markerless human motion tracking
JP5503510B2 (ja) 姿勢推定装置および姿勢推定プログラム
Xu Single-view and multi-view methods in marker-less 3d human motion capture
JP2024501161A (ja) 画像または映像におけるオブジェクトの3次元場所特定
KR20230112636A (ko) 정보 처리 장치, 정보 처리 방법 및 프로그램

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20943599

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022533003

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20943599

Country of ref document: EP

Kind code of ref document: A1