WO2022003963A1 - Data generation method, data generation program, and information-processing device - Google Patents

Data generation method, data generation program, and information-processing device

Info

Publication number
WO2022003963A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
dimensional
moving image
image data
skeleton
Prior art date
Application number
PCT/JP2020/026232
Other languages
French (fr)
Japanese (ja)
Inventor
創輔 山尾
Original Assignee
富士通株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 富士通株式会社 filed Critical 富士通株式会社
Priority to JP2022533003A priority Critical patent/JP7318814B2/en
Priority to PCT/JP2020/026232 priority patent/WO2022003963A1/en
Publication of WO2022003963A1 publication Critical patent/WO2022003963A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • the present invention relates to a data generation method, a data generation program, and an information processing apparatus.
  • skeleton recognition that detects three-dimensional human movement in various sports is performed using two-dimensional images such as color images. For example, a method of calculating representative three-dimensional joint coordinates from a plurality of two-dimensional joint coordinates by triangulation is used. In recent years, to improve the accuracy of skeleton recognition, methods that estimate three-dimensional joint coordinates from the two-dimensional joint coordinates of multiple viewpoints using an estimation model generated by machine learning are also known.
  • the above estimation model is generated by machine learning using learning data that includes two-dimensional images and three-dimensional skeleton positions, but applying it to skeleton recognition of complicated movements such as gymnastics requires a very large amount of learning data to achieve sufficient estimation accuracy. However, such learning data is generally generated manually, which is inaccurate and inefficient; as a result, the accuracy of three-dimensional skeleton recognition using machine learning degrades and its cost increases.
  • a learning data set could also be generated by synthesizing CG (Computer Graphics) images in many variations, changing textures and rendering conditions, using a human body CG model posed according to three-dimensional skeleton information acquired by motion capture or the like; however, when athletes wear uniforms or use multiple pieces of apparatus, as in gymnastics, it is difficult to simulate these at a quality close to actual camera images.
  • One aspect is to provide a data generation method, a data generation program, and an information processing apparatus capable of automatically generating a learning data set including a two-dimensional image and a three-dimensional skeleton position.
  • the computer executes a process of acquiring the two-dimensional coordinates of each of the plurality of joints from each of the plurality of frames included in the moving image data of the subject performing the predetermined operation.
  • the computer identifies each of the plurality of three-dimensional skeleton data corresponding to each of the plurality of frames from the three-dimensional series data including the three-dimensional skeleton data relating to the plurality of joint positions of the subject performing the predetermined operation. Execute the process to be performed.
  • the computer optimizes an adjustment amount for the time synchronization between the moving image data and the three-dimensional series data, and a projection parameter used when projecting the three-dimensional series data onto the moving image data, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of three-dimensional skeleton data.
  • a computer executes a process of generating data in which the moving image data and the three-dimensional series data are associated with each other by using the optimized adjustment amount and the projection parameter.
  • a learning data set including a two-dimensional image and a three-dimensional skeleton position can be automatically generated.
  • FIG. 1 is a diagram illustrating an example of an overall configuration of a system using skeleton recognition.
  • FIG. 2 is a diagram illustrating the manual generation of learning data.
  • FIG. 3 is a functional block diagram showing a functional configuration of the generator according to the first embodiment.
  • FIG. 4 is a diagram showing an example of moving image data.
  • FIG. 5 is a diagram illustrating an example of three-dimensional skeleton series data.
  • FIG. 6 is a diagram illustrating acquisition of two-dimensional joint coordinates.
  • FIG. 7 is a diagram illustrating camera calibration.
  • FIG. 8 is a diagram illustrating time synchronization and resampling.
  • FIG. 9 is a diagram illustrating the estimation of the initial parameters.
  • FIG. 10 is a diagram illustrating parameter optimization.
  • FIG. 11 is a diagram illustrating the result of optimization.
  • FIG. 12 is a diagram showing an example of the generated learning data.
  • FIG. 13 is a flowchart showing the flow of the learning data generation process.
  • FIG. 14 is a flowchart showing the flow of processing from the estimation of the initial value to the optimization.
  • FIG. 15 is a diagram illustrating a processing example of skeleton recognition.
  • FIG. 16 is a diagram illustrating a processing example of skeleton recognition.
  • FIG. 17 is a diagram illustrating a processing example of skeleton recognition.
  • FIG. 18 is a diagram illustrating a hardware configuration example.
  • FIG. 1 is a diagram illustrating an example of an overall configuration of a system using skeleton recognition.
  • as shown in FIG. 1, this system has a 3D (Three-Dimensional) laser sensor 5, a generation device 10, a learning device 40, a recognition device 50, and a scoring device 90; it captures three-dimensional data of the performer 1, who is the subject, recognizes the skeleton and the like, and scores techniques accurately.
  • in this embodiment, gymnastics is described as an example, but the present invention is not limited to this; it can also be applied to other competitions in which an athlete performs a series of techniques and a referee scores them, and to various human actions and movements. Further, in this embodiment, two dimensions may be written as 2D and three dimensions as 3D.
  • the current scoring method in gymnastics is visually performed by a plurality of graders, but with the sophistication of techniques, it is becoming more difficult for the graders to visually score.
  • an automatic scoring system and a scoring support system for scoring competitions using a 3D laser sensor 5 are known.
  • in these systems, a distance image, which is three-dimensional data of the athlete, is acquired by the 3D laser sensor 5, and the skeleton, such as the orientation and angle of each of the athlete's joints, is recognized from the distance image.
  • in the scoring support system, the result of skeleton recognition is displayed as a 3D model, supporting the grader in scoring more correctly by checking the performer's detailed posture.
  • in the automatic scoring system, the performed technique is recognized from the result of skeleton recognition, and scoring is performed according to the scoring rules.
  • in the scoring support system and the automatic scoring system, performances carried out at any time must be supported or scored automatically in a timely manner.
  • usually, a method of recognizing a performer's three-dimensional skeleton from a distance image or a color image incurs long processing times and reduced skeleton-recognition accuracy due to insufficient memory and the like.
  • for example, in a configuration in which the result of automatic scoring by the automatic scoring system is provided to the grader, who compares it with his or her own scoring result, conventional techniques delay the provision of information to the grader.
  • furthermore, when the accuracy of skeleton recognition decreases, the subsequent technique recognition may be erroneous, and as a result the score determined for the technique will also be wrong.
  • similarly, in the scoring support system, when the angles and positions of the performer's joints are displayed using a 3D model, the display may be delayed or the displayed angles may be incorrect; in that case, scoring by a grader using the scoring support system may be erroneous.
  • the 3D laser sensor 5 is an example of a sensor device that measures (senses) the distance to an object for each pixel using an infrared laser or the like.
  • the distance image includes the distance to each pixel. That is, the distance image is a depth image showing the depth of the subject as seen from the 3D laser sensor (depth sensor) 5.
  • the learning device 40 is an example of a computer device that learns a machine learning model for skeleton recognition. Specifically, the learning device 40 generates a machine learning model by executing machine learning such as deep learning using two-dimensional skeletal position information, three-dimensional skeletal position information, and the like as a learning data set.
  • the recognition device 50 is an example of a computer device that recognizes the skeleton, such as the orientation and position of each joint of the performer 1, using the distance image measured by the 3D laser sensor 5. Specifically, the recognition device 50 inputs the distance image measured by the 3D laser sensor 5 into the trained machine learning model produced by the learning device 40 and recognizes the skeleton based on the output of the model. After that, the recognition device 50 outputs the recognized skeleton to the scoring device 90. For example, the information obtained as a result of skeleton recognition is information on the three-dimensional position of each joint.
  • the scoring device 90 identifies the transition of the movement obtained from the position and orientation of each joint of the performer by using the recognition result information input by the recognition device 50, and identifies and scores the technique performed by the performer. This is an example of a computer device.
  • to generate a large amount of learning data, three-dimensional skeleton information and the like are newly collected using laser methods, image methods, and the like.
  • in the laser method using a 3D laser sensor, the laser is emitted about two million times per second, and the depth of each irradiated point, including points on the target person, is obtained from the travel time (Time of Flight: ToF) of each laser pulse.
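  • for reference, the depth of each irradiated point follows directly from the measured round-trip time; with c the speed of light and \Delta t the measured time of flight, in LaTeX notation:

      d = \frac{c \, \Delta t}{2}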
  • although this laser method can acquire highly accurate depth data, it requires complicated configurations and processing such as laser scanning and ToF measurement, and the hardware is complex and expensive, so it is difficult to use for general purposes.
  • in the image method, an inexpensive RGB camera that acquires RGB data for each pixel with a CMOS (Complementary Metal Oxide Semiconductor) imager can be used, and with recent improvements in deep learning technology, the accuracy of three-dimensional skeleton recognition has also been improving.
  • however, when machine learning is executed using an existing data set such as Total Capture, which contains only general postures, it is not possible to generate a machine learning model that can recognize complicated postures such as those in gymnastics.
  • since a machine learning model that can recognize complicated postures cannot be generated by commonly used methods, learning data for learning complicated postures is generated manually.
  • FIG. 2 is a diagram illustrating the manual generation of learning data.
  • as shown in FIG. 2, time synchronization is performed visually between the moving image data capturing a series of exercises of the performer 1 and the three-dimensional skeleton series data containing the three-dimensional skeleton data of a past performance of the performer 1, and a frame in which the moving image data and the three-dimensional skeleton data are synchronized is searched for (S1). Then, using approximate camera parameters defined as prior knowledge, the 3D skeleton data is projected onto the image (S2), and the camera parameter values are manually adjusted so that the person silhouette in the moving image data and the 3D skeleton data overlap (S3). After that, the three-dimensional skeleton series data is resampled at the frame rate of the moving image data and projected onto the moving image data, creating training data over the entire moving image sequence (S4).
  • in contrast, the first embodiment performs spatially and temporally optimal automatic alignment between the moving image data acquired by a camera and the three-dimensional skeleton series data acquired by 3D sensing or the like for the same motion, and provides a technique for generating high-quality training data efficiently.
  • specifically, the generation device 10 acquires the two-dimensional coordinates of each of a plurality of joints from each of the plurality of frames included in the two-dimensional moving image data of the performer 1 giving a performance. Subsequently, the generation device 10 identifies each of the plurality of three-dimensional skeleton data corresponding to each of the plurality of frames from the three-dimensional skeleton series data on the plurality of joint positions of the performer 1. Then, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of three-dimensional skeleton data, the generation device 10 optimizes the adjustment amount for the time synchronization between the moving image data and the three-dimensional skeleton series data, together with the camera parameters used when projecting the three-dimensional skeleton series data onto the moving image data. After that, the generation device 10 generates data in which the moving image data and the three-dimensional skeleton series data are associated with each other, using the optimized adjustment amount and camera parameters.
  • the generator 10 performs camera calibration and time synchronization based on non-linear optimization so that the geometrical consistency between the 3D skeletal sequence data and the 2D joint coordinates is maximized.
  • spatially and temporally optimal automatic alignment is performed.
  • as a result, the generation device 10 can automatically generate a training data set, including two-dimensional images and three-dimensional skeleton positions, that can be used to train a machine learning model that estimates three-dimensional skeleton positions with high accuracy.
  • FIG. 3 is a functional block diagram showing a functional configuration of the generator 10 according to the first embodiment.
  • the generation device 10 includes a communication unit 11, a storage unit 12, and a control unit 20.
  • the communication unit 11 is a processing unit that controls communication with other devices, and is realized by, for example, a communication interface.
  • for example, the communication unit 11 receives the moving image data of the performer 1 captured with a camera or the like, and receives the three-dimensional skeleton series data of the performer 1 captured with the 3D laser sensor 5.
  • the storage unit 12 is a processing unit that stores various data, programs executed by the control unit 20, and the like, and is realized by, for example, a memory or a hard disk.
  • the storage unit 12 stores the moving image data 13, the three-dimensional skeleton sequence data 14, and the learning data set 15.
  • the moving image data 13 is a series of moving image data taken by a camera or the like when the performer 1 performs, and is composed of a plurality of frames.
  • FIG. 4 is a diagram showing an example of moving image data 13.
  • FIG. 4 shows, as an example, one frame of the moving image data 13 captured during a pommel horse performance.
  • the coordinate system of the moving image data 13 is determined by the position, orientation, and resolution of the camera, and its time system by the camera-specific time stamps and sampling rate.
  • the three-dimensional skeleton series data 14 is series data including three-dimensional skeleton data showing three-dimensional joint coordinates related to a plurality of joint positions.
  • the three-dimensional skeleton sequence data 14 is a series of three-dimensional skeleton data captured by a 3D laser sensor or the like when the performer 1 performs.
  • the three-dimensional skeleton data includes information on the three-dimensional skeleton position (skeleton information) of each joint when the performer 1 is performing.
  • FIG. 5 is a diagram illustrating an example of three-dimensional skeleton series data 14.
  • FIG. 5 illustrates one piece of three-dimensional skeleton data in the three-dimensional skeleton series data 14 generated from a performance of the performer 1.
  • the three-dimensional skeleton series data 14 is data acquired by the 3D sensing technique, and is data including three-dimensional coordinates of each joint.
  • each joint is, for example, 18 joints designated in advance such as the right shoulder, the left shoulder, and the right ankle, or a plurality of joints arbitrarily set by the user.
  • the coordinate system of the three-dimensional skeleton series data 14 is determined by the position and orientation of the sensor, and its time system by the sensor-specific time stamps and sampling rate.
  • the learning data set 15 is a database that is generated by the control unit 20 described later and includes a plurality of training data used for generating a machine learning model.
  • the learning data set 15 is information in which the moving image data 13 is associated with the three-dimensional skeleton data and the camera parameters.
  • the control unit 20 is a processing unit that controls the entire generation device 10, and is realized by, for example, a processor.
  • the control unit 20 has a data acquisition unit 21, a coordinate acquisition unit 22, and a learning data generation unit 23.
  • the data acquisition unit 21, the coordinate acquisition unit 22, and the learning data generation unit 23 are realized by an electronic circuit possessed by the processor, a process executed by the processor, or the like.
  • the data acquisition unit 21 is a processing unit that acquires moving image data 13 and three-dimensional skeleton series data 14 and stores them in the storage unit 12.
  • for example, the data acquisition unit 21 can acquire the moving image data 13 from the camera, or can read moving image data 13 captured by a previously known method from its storage destination and store it in the storage unit 12.
  • similarly, the data acquisition unit 21 can acquire the three-dimensional skeleton series data 14 from the 3D laser sensor, or can read three-dimensional skeleton series data 14 captured by a previously known method from its storage destination and store it in the storage unit 12.
  • the coordinate acquisition unit 22 is a processing unit that acquires two-dimensional coordinates, namely the two-dimensional joint coordinates of each of the plurality of joints, from each of the plurality of frames included in the moving image data 13. Specifically, the coordinate acquisition unit 22 selects several appropriate frames (for example, 10 frames) from the moving image data and automatically or manually acquires the two-dimensional joint positions of the target person in each frame.
  • FIG. 6 is a diagram illustrating acquisition of two-dimensional joint coordinates.
  • as shown in FIG. 6, the coordinate acquisition unit 22 selects a frame from the moving image data 13 and sets eight pre-designated joints in the selected frame as annotation targets. Then, using an existing model or the like, the coordinate acquisition unit 22 acquires the two-dimensional coordinates indicating the joint positions of (1) the right elbow, (2) the right wrist, (3) the left elbow, (4) the left wrist, (5) the right knee, (6) the right ankle, (7) the left knee, and (8) the left ankle.
  • in this way, the two-dimensional joint coordinates can be obtained by automatic annotation using an existing two-dimensional skeleton recognition method, or by visual or manual annotation; a sketch of this step follows.
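  • as a minimal sketch of this acquisition step (Python; OpenCV is assumed for video access, and estimate_2d_joints is a hypothetical stand-in for whichever existing 2D skeleton recognition model is used; only the frame selection and the eight-joint layout follow the description above):

      import cv2
      import numpy as np

      # The eight pre-designated annotation targets from FIG. 6.
      JOINTS = ["r_elbow", "r_wrist", "l_elbow", "l_wrist",
                "r_knee", "r_ankle", "l_knee", "l_ankle"]

      def estimate_2d_joints(frame):
          """Hypothetical stand-in for an existing 2D skeleton recognition
          model; returns an (8, 2) array of pixel coordinates for JOINTS."""
          raise NotImplementedError

      def acquire_2d_coordinates(video_path, num_frames=10):
          cap = cv2.VideoCapture(video_path)
          total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
          # Spread the selected frames over the whole clip.
          picks = np.linspace(0, total - 1, num_frames, dtype=int)
          annotations = {}
          for idx in picks:
              cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
              ok, frame = cap.read()
              if ok:
                  annotations[int(idx)] = estimate_2d_joints(frame)
          cap.release()
          return annotations  # frame index -> (8, 2) joint coordinates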
  • the learning data generation unit 23 has an initial setting unit 24, an optimization unit 25, and an output unit 26, and is a processing unit that generates the learning data set 15 in which the moving image data 13 is associated with the three-dimensional skeleton data. Specifically, since both the three-dimensional skeleton series data 14 and the moving image data 13 relate to the same performance, projecting the three-dimensional skeleton series data 14 onto the moving image data 13 makes it possible to generate a large number of images annotated with two-dimensional or three-dimensional joint coordinates for complicated postures, without acquiring new data.
  • however, the three-dimensional skeleton series data 14 and the moving image data 13 are based on different coordinate systems and time systems. Therefore, in order to project the three-dimensional skeleton series data 14 onto the moving image data 13, the learning data generation unit 23 performs camera calibration, which corresponds to spatial alignment, and time synchronization and resampling, which correspond to temporal alignment.
  • FIG. 7 is a diagram illustrating camera calibration.
  • to project a 3D point onto an image, two sets of parameters are required for the camera that captured the image: camera-specific internal parameters such as focal length and resolution, and external parameters describing the camera position and orientation in the coordinate system (world coordinate system) that serves as the reference for the 3D points.
  • the process of obtaining these parameters (camera parameters) is called camera calibration.
  • [X, Y, Z] indicates the coordinates of the 3D point that is the projection source
  • [x, y] indicates the coordinates of the projection point on the image that is the projection destination.
  • K is the camera's internal parameter, a 3 × 3 intrinsic matrix.
  • R is a camera external parameter, a 3 × 3 rotation matrix.
  • t is a 3 × 1 translation vector. Of these, R and t are the targets of camera calibration.
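  • putting these parameters together, the projection follows the standard pinhole camera model (a well-known relation stated here for reference, not quoted from the patent text), with s an arbitrary scale factor; in LaTeX notation:

      s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = K \, [\, R \mid t \,] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}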
  • FIG. 8 is a diagram illustrating time synchronization and resampling.
  • as shown in FIG. 8, since the moving image data 13 and the three-dimensional skeleton series data 14 are acquired in different time systems, the two data sets are first synchronized as a whole.
  • by then resampling the three-dimensional skeleton series data 14 at each frame time of the moving image data 13, three-dimensional skeleton data synchronized with the moving image data 13 can be acquired.
  • that is, time synchronization defines the conversion between the time systems of the moving image data 13 and the three-dimensional skeleton series data 14, and resampling interpolates the three-dimensional skeleton data at the time of each frame of the moving image data 13.
  • here, let t denote the real-world time system, t_s the time system of the three-dimensional skeleton series data 14, and t_v the time system of the moving image data 13.
  • the three-dimensional skeleton series data 14 is sampled with period T_s, yielding three-dimensional skeleton data at times t_{s,0}, t_{s,1}, t_{s,2}, and so on.
  • the moving image data 13 is sampled with period T_v, yielding frames at times t_{v,0}, t_{v,1}, t_{v,2}, and so on.
  • the difference between t_{s,0}, the head of the three-dimensional skeleton series data 14, and t_{v,0}, the head of the moving image data 13, is the time shift amount T_{v,s}.
  • the three-dimensional skeleton data corresponding to a frame time t_{v,j} is obtained by converting t_{v,j} with the time-system conversion formula and interpolating the three-dimensional skeleton data sampled within a predetermined range around the converted time; a resampling sketch follows.
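  • a minimal resampling sketch under these definitions (Python/NumPy; the skeleton series is assumed to be stored as an (N, J, 3) array sampled uniformly with period T_s, and linear interpolation stands in for the neighborhood-based interpolation described above):

      import numpy as np

      def resample_skeleton(skel, t_s0, T_s, frame_times, shift):
          """skel: (N, J, 3) 3D skeleton series sampled at t_s0 + k*T_s.
          frame_times: video frame times t_{v,j}; shift: time shift T_{v,s}.
          Returns (len(frame_times), J, 3) skeleton data at each frame time."""
          # Convert each frame time into the skeleton time system.
          ts = np.asarray(frame_times) + shift
          sample_times = t_s0 + T_s * np.arange(skel.shape[0])
          out = np.empty((len(ts), skel.shape[1], 3))
          for j in range(skel.shape[1]):    # per joint
              for c in range(3):            # per coordinate axis
                  out[:, j, c] = np.interp(ts, sample_times, skel[:, j, c])
          return out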
  • the learning data generation unit 23 executes "camera calibration” and "time synchronization and resampling” in order to project the three-dimensional skeleton sequence data 14 onto the moving image data 13.
  • at this time, the learning data generation unit 23 performs camera calibration and time synchronization based on non-linear optimization so that the geometrical consistency between the three-dimensional skeleton series data 14 and the moving image data 13 is maximized.
  • to do so, it defines a cost function that implements camera calibration and time synchronization at the same time.
  • the learning data generation unit 23 then calculates the optimal time synchronization and camera calibration by optimizing this cost function.
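  • the patent calls this cost function Equation 1 without reproducing it in this text, so the following LaTeX form is an assumption consistent with the reprojection-error-based consistency described above, summing over annotated frames i and joints j:

      C(\Delta t, R, t) = \sum_{i} \sum_{j} \left\| x_{i,j} - \pi\left( K, R, t, \; f_j(t_{v,i} + \Delta t) \right) \right\|^{2}

    here x_{i,j} is the observed two-dimensional coordinate of joint j in frame i, f_j is the continuous interpolation of joint j's three-dimensional trajectory, and \pi is the pinhole projection shown after FIG. 7.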
  • the initial setting unit 24 is a processing unit that sets the initial parameters of the cost function. Specifically, for each synchronization pattern obtained by varying the time synchronization, the initial setting unit 24 uses each of the plurality of frames associated by resampling and each of the plurality of three-dimensional skeleton data to calculate camera parameters, by solving an estimation problem that estimates the position and orientation of the camera when projecting the three-dimensional skeleton series data 14 onto the moving image data 13. Then, using the camera parameters calculated for each synchronization pattern, the initial setting unit 24 calculates a likelihood indicating the validity of each synchronization pattern's time synchronization and camera parameters. After that, the initial setting unit 24 sets the time synchronization and camera parameters of the synchronization pattern with the highest likelihood as the initial values.
  • FIG. 9 is a diagram illustrating the estimation of the initial parameters.
  • as shown in FIG. 9, while varying the time synchronization appropriately, the initial setting unit 24 obtains by resampling the three-dimensional skeleton data corresponding to the frames of the moving image data 13 for which the two-dimensional joint coordinates were acquired. Then, the initial setting unit 24 estimates the camera parameters by solving the PnP (Perspective-n-Point) problem on the correspondence between the two-dimensional joint positions and the three-dimensional skeleton data.
  • further, the initial setting unit 24 quantitatively calculates the geometrical consistency between the three-dimensional skeleton series data 14 and the two-dimensional joint coordinates as a likelihood indicating the validity of the obtained camera parameters and time synchronization. For example, the initial setting unit 24 calculates as the likelihood the ratio of joints whose reprojection error of the three-dimensional joint coordinates in the resampled three-dimensional skeleton data is below a threshold value. Then, among all trials of the time synchronization, the initial setting unit 24 adopts the time synchronization and camera parameters at which the likelihood is maximal as the quasi-optimal solution, and outputs them to the optimization unit 25 as the initial values.
  • FIG. 9 shows an example in which the initial setting unit 24 executes trial i-1, trial i, and trial i+1, corresponding to synchronization patterns in which the first frame of the moving image data 13 is shifted at regular intervals, to calculate the quasi-optimal solution. Taking trial i-1 as an example, for the frame of the moving image data 13 corresponding to time t_{s,0}, at which the two-dimensional coordinates were acquired, the initial setting unit 24 generates data A1 as the result of resampling using the three-dimensional skeleton data between time t_{s,0} and time t_{s,1}.
  • similarly, the initial setting unit 24 generates data A3 by resampling the frame for which the two-dimensional coordinates were acquired, using the three-dimensional skeleton data between time t_{s,2} and time t_{s,3}, and generates data A4 by resampling using the three-dimensional skeleton data between time t_{s,3} and time t_{s,4}.
  • the initial setting unit 24 estimates the camera parameters by solving the PnP problem using the data A1, the data A3, and the data A4 including the three-dimensional skeleton data and the two-dimensional coordinates.
  • the initial setting unit 24 projects the three-dimensional skeleton data on the frame of the moving image data 13 using the estimated camera parameters.
  • then, for each joint in the frame, the initial setting unit 24 calculates the distance between the annotated two-dimensional coordinates of the joint and the projected two-dimensional coordinates of the corresponding joint in the three-dimensional skeleton data.
  • the initial setting unit 24 calculates the ratio of joints whose distance is less than the threshold value as the likelihood.
  • the likelihood can be calculated using the distance of each joint when the three-dimensional skeleton data is projected onto one or a plurality of frames.
  • the initial setting unit 24 executes the above processing for each of trial i-1, trial i, and trial i+1 and calculates the likelihood. Then, the initial setting unit 24 determines the time synchronization and camera parameters of the trial with the highest likelihood, here trial i, as the quasi-optimal solution (initial values); a sketch of one trial follows.
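  • a sketch of a single trial (Python; OpenCV's solvePnP is used for the PnP step, and the below-threshold joint ratio is the likelihood; the intrinsic matrix K and the 10-pixel threshold are assumptions, since the text calibrates only R and t and does not state a threshold value):

      import cv2
      import numpy as np

      def trial_likelihood(pts3d, pts2d, K, threshold_px=10.0):
          """pts3d: (M, 3) resampled 3D joint coordinates; pts2d: (M, 2)
          annotated 2D joint coordinates for the same joints.
          Returns (likelihood, rvec, tvec) for this synchronization trial."""
          ok, rvec, tvec = cv2.solvePnP(
              pts3d.astype(np.float64), pts2d.astype(np.float64), K, None)
          if not ok:
              return 0.0, None, None
          proj, _ = cv2.projectPoints(
              pts3d.astype(np.float64), rvec, tvec, K, None)
          # Reprojection error per joint, then the ratio under the threshold.
          err = np.linalg.norm(proj.reshape(-1, 2) - pts2d, axis=1)
          likelihood = float(np.mean(err < threshold_px))
          return likelihood, rvec, tvec

    the trial (time shift) with the highest returned likelihood is then taken as the quasi-optimal solution.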
  • the optimization unit 25 is a processing unit that optimizes the cost function over the time-synchronization adjustment amount and the camera parameters, using the quasi-optimal solution calculated by the initial setting unit 24 as the initial value. Specifically, the optimization unit 25 defines the cost function C (Equation 1) over the time-synchronization adjustment amount Δt and the camera parameters, applies non-linear optimization with the quasi-optimal solution as the initial value, and minimizes the cost function. At this time, the optimization unit 25 expresses the three-dimensional skeleton series data, which is discrete data, as a continuous, differentiable function f(t) for each joint, and incorporates it into the cost function C as the resampling process of the three-dimensional skeleton data. As f(t), third-order spline interpolation or the like can be applied.
  • FIG. 10 is a diagram illustrating parameter optimization.
  • the optimization unit 25 sets the quasi-optimal solution to the initial value of the cost function C, and repeats optimization and resampling to calculate the optimum value of each parameter.
  • for example, the optimization unit 25 optimizes the cost function using data B1, in which the two-dimensional coordinates and the three-dimensional skeleton data are associated; when the next data B2 is to be used, the above resampling is performed to generate data B2, and the cost function is then optimized using data B2. By doing so, the optimization unit 25 can optimize Δt and the camera parameters at the same time; a sketch of this step follows.
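  • a hedged sketch of this simultaneous optimization (Python; SciPy's least_squares over [Δt, rvec, tvec] with a per-joint cubic spline as f(t); the residual, the per-joint reprojection error, is one plausible reading of Equation 1, which is not reproduced in this text):

      import cv2
      import numpy as np
      from scipy.interpolate import CubicSpline
      from scipy.optimize import least_squares

      def optimize_sync_and_pose(sample_times, skel, frame_times, pts2d, K,
                                 dt0, rvec0, tvec0):
          """skel: (N, J, 3) skeleton series at sample_times; pts2d: (F, J, 2)
          annotated 2D joints at frame_times. dt0, rvec0, tvec0 come from the
          quasi-optimal solution of the initial setting unit."""
          n = skel.shape[0]
          # f(t): differentiable per-joint trajectories (third-order spline),
          # built into the cost as the resampling step.
          f = CubicSpline(sample_times, skel.reshape(n, -1))

          def residual(params):
              dt, rvec, tvec = params[0], params[1:4], params[4:7]
              pts3d = f(np.asarray(frame_times) + dt).reshape(-1, 3)
              proj, _ = cv2.projectPoints(pts3d, rvec, tvec, K, None)
              return (proj.reshape(-1, 2) - pts2d.reshape(-1, 2)).ravel()

          x0 = np.concatenate([[dt0], np.ravel(rvec0), np.ravel(tvec0)])
          res = least_squares(residual, x0)  # non-linear minimization
          return res.x[0], res.x[1:4], res.x[4:7]  # dt, rvec, tvec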
  • FIG. 11 is a diagram illustrating the result of optimization.
  • the initial setting unit 24 sets initial parameters for the frame (image) in which the two-dimensional coordinates of the joints (1) to (8) are acquired.
  • the optimization unit 25 executes the optimization to improve the geometrical consistency, so that the athlete's body on the image and the resampled three-dimensional skeleton data match.
  • the optimization unit 25 outputs the optimized ⁇ t and the camera parameter to the output unit 26.
  • the output unit 26 is a processing unit that generates learning data using the optimization result from the optimization unit 25. Specifically, the output unit 26 generates images annotated with two-dimensional joint coordinates and three-dimensional joint coordinates on the basis of the optimized time-synchronization adjustment amount and camera parameters, and stores them in the learning data set 15 as training data.
  • FIG. 12 is a diagram showing an example of the generated learning data.
  • as shown in FIG. 12, the output unit 26 stores entries of the form "image, three-dimensional skeleton data, camera parameters", such as "I_1, ({X_{1,1}, Y_{1,1}, Z_{1,1}} ... {X_{1,j}, Y_{1,j}, Z_{1,j}}), (K, R, t)" and "I_2, ({X_{2,1}, Y_{2,1}, Z_{2,1}} ... {X_{2,j}, Y_{2,j}, Z_{2,j}}), (K, R, t)".
  • this indicates that the three-dimensional joint coordinates "{X_{1,1}, Y_{1,1}, Z_{1,1}} ... {X_{1,j}, Y_{1,j}, Z_{1,j}}" are associated with the two-dimensional image I_1, and that the camera parameters at the time of association are "K, R, t".
  • One camera parameter is set for a series of moving image data 13.
  • FIG. 13 is a flowchart showing the flow of the learning data generation process.
  • as shown in FIG. 13, the coordinate acquisition unit 22 reads the moving image data 13 and the three-dimensional skeleton series data 14 from the storage unit 12 (S101) and acquires the two-dimensional joint coordinates in several frames of the moving image data 13 (S102).
  • then, the learning data generation unit 23 estimates the initial values of the camera parameters and the time synchronization (S103), optimizes the camera parameters and the time synchronization (S104), and generates training data using the optimization result (S105).
  • FIG. 14 is a flowchart showing the flow of processing from the estimation of the initial value to the optimization.
  • the learning data generation unit 23 generates a candidate group for time synchronization between the moving image data 13 and the three-dimensional skeleton series data 14 (S201).
  • the learning data generation unit 23 resamples the three-dimensional skeleton data corresponding to the frame having the two-dimensional joint coordinates for each candidate for time synchronization (S202). Then, the learning data generation unit 23 estimates the camera parameters for each candidate for time synchronization by solving the PnP problem based on the correspondence between the two-dimensional joint position and the three-dimensional skeleton data (S203).
  • further, the learning data generation unit 23 calculates, for each time-synchronization candidate, the likelihood indicating the validity of that time synchronization and its camera parameters (S204), and determines the time synchronization and camera parameters at which the likelihood is maximal among the candidates as the quasi-optimal solution (S205).
  • after that, the learning data generation unit 23 defines a cost function over the time synchronization and camera parameters that incorporates the resampling of the three-dimensional skeleton data (S206), and executes non-linear optimization with the quasi-optimal solution as the initial value to minimize the cost function and obtain the optimal time synchronization and camera parameters (S207).
  • as described above, the generation device 10 simultaneously estimates quasi-optimal camera parameters and time synchronization so that the geometrical consistency between the 3D skeleton series data 14 and the 2D joint coordinates is maximized. After that, spatially and temporally optimal automatic alignment is executed by non-linear optimization with the estimation result as the initial value. Therefore, the generation device 10 can efficiently generate training data for image-based skeleton recognition using the moving image data 13 and the three-dimensional skeleton series data 14 acquired asynchronously by a camera and 3D sensing technology.
  • further, since the generation device 10 can calculate as the likelihood the ratio of joints whose per-joint reprojection error onto a frame of the moving image data is below a threshold, it can set accurate initial values. As a result, the generation device 10 can execute the optimization from an already narrowed-down state, which reduces the cost of the optimization process and shortens its processing time.
  • the target data type, cost function, machine learning model, learning data, various parameters, etc. used in the above embodiment are merely examples and can be arbitrarily changed.
  • in the above description, gymnastics has been used as an example, but the present invention is not limited to this and can be applied to other competitions in which an athlete performs a series of techniques and a referee scores them. Examples of other competitions include figure skating, rhythmic gymnastics, cheerleading, diving, karate kata, and mogul airs. Further, the invention can be applied not only to sports but also to posture detection for drivers of trucks, taxis, trains, and the like, and for pilots.
  • the training data generated from the above can be adopted in various models for skeletal recognition.
  • an application example of learning data will be described.
  • FIGS. 15 to 17 are diagrams illustrating processing examples of skeleton recognition.
  • FIG. 15 is an example of executing three-dimensional skeleton recognition by a mathematical formula after executing two-dimensional skeleton recognition.
  • in this method, a human detection model is used to detect a person in a large number of images, and a 2D detection model is used to identify a plurality of two-dimensional joint coordinates from the heat maps of the detected person's joints. Then, the three-dimensional joint coordinates are obtained algebraically from the plurality of two-dimensional joint coordinates by triangulation; a sketch of this step follows.
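  • a minimal two-view sketch of this triangulation step (Python/OpenCV; the projection matrices P1 and P2 are assumed known, and the number of views is fixed at two here only for illustration):

      import cv2
      import numpy as np

      def triangulate_joints(P1, P2, joints_cam1, joints_cam2):
          """P1, P2: (3, 4) projection matrices K[R|t] of two calibrated
          cameras. joints_cam1/2: (J, 2) 2D joint coordinates in each view.
          Returns (J, 3) 3D joint coordinates."""
          pts1 = joints_cam1.T.astype(np.float64)  # (2, J) as OpenCV expects
          pts2 = joints_cam2.T.astype(np.float64)
          X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # (4, J) homogeneous
          return (X_h[:3] / X_h[3]).T  # dehomogenize to (J, 3)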
  • FIG. 16 is an example of executing three-dimensional skeleton recognition by a model after executing two-dimensional skeleton recognition.
  • in this method, a person detection model is used to detect a person in a large number of images, and a 2D detection model is used to identify a plurality of two-dimensional joint coordinates from the heat maps of each joint of the detected person. Then, the three-dimensional joint coordinates are obtained using a 3D estimation model obtained by learning: the 3D joint coordinates are estimated from 3D voxel data that integrates the 2D heat maps of multiple viewpoints.
  • for the human detection model and the 2D detection model used in these methods, machine learning is executed using the learning data associating "image and two-dimensional coordinates" generated in the first embodiment.
  • for the 3D estimation model, machine learning is executed using the training data associating "image, 3D coordinates, and camera parameters". As a result, the accuracy of each model can be improved.
  • FIG. 17 is an example of directly executing three-dimensional skeleton recognition.
  • in this method, a human detection model is used to detect a person in an image, and a 3D detection model obtained by learning is used to estimate the three-dimensional joint coordinates of each joint of the detected person.
  • Two machine learning models are used in this method. For these models, machine learning is executed using the learning data associated with "images, 3D coordinates, and camera parameters". As a result, the accuracy of each model can be improved.
  • the coordinate acquisition unit 22 is an example of an acquisition unit
  • the learning data generation unit 23 is an example of a specifying unit, an execution unit, and a generation unit.
  • the three-dimensional skeleton series data 14 is an example of the three-dimensional series data.
  • the camera parameter is an example of the projection parameter.
  • the two-dimensional joint coordinates are coordinates when the joint positions are expressed in two dimensions, and are synonymous with the two-dimensional skeletal coordinates.
  • similarly, the three-dimensional joint coordinates are the coordinates when the joint positions are expressed in three dimensions, and are synonymous with the three-dimensional skeletal coordinates.
  • each component of each device shown in the figures is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution or integration of each device is not limited to the illustrated one; all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
  • each processing function performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware by wired logic.
  • FIG. 18 is a diagram illustrating a hardware configuration example.
  • the generation device 10 includes a communication device 10a, an HDD (Hard Disk Drive) 10b, a memory 10c, and a processor 10d. Further, the parts shown in FIG. 18 are connected to each other by a bus or the like.
  • the communication device 10a is a network interface card or the like, and communicates with other servers.
  • the HDD 10b stores a program and a DB for operating the functions shown in FIG. 3.
  • the processor 10d reads a program that executes the same processing as each processing unit shown in FIG. 3 from the HDD 10b or the like and expands the program into the memory 10c to operate a process that executes each function described in FIG. 3 or the like. For example, this process executes the same function as each processing unit of the generation device 10. Specifically, the processor 10d reads a program having the same functions as the data acquisition unit 21, the coordinate acquisition unit 22, the learning data generation unit 23, and the like from the HDD 10b and the like. Then, the processor 10d executes a process of executing the same processing as the data acquisition unit 21, the coordinate acquisition unit 22, the learning data generation unit 23, and the like.
  • the generation device 10 operates as an information processing device that executes various information processing methods by reading and executing the program. Further, the generation device 10 can realize the same function as that of the above-described embodiment by reading the program from the recording medium by the medium reader and executing the read program.
  • the program referred to in the other embodiment is not limited to being executed by the generation device 10.
  • the present invention can be similarly applied when other computers or servers execute programs, or when they execute programs in cooperation with each other.
  • This program can be distributed via networks such as the Internet.
  • further, this program can be recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a CD-ROM, an MO (Magneto-Optical disk), or a DVD (Digital Versatile Disc), and can be executed by being read from the recording medium by a computer.

Abstract

According to the present invention, a generation device obtains two-dimensional coordinates of a plurality of joints from each of a plurality of frames included in moving image data obtained by imaging a subject performing a predetermined motion. The generation device identifies a plurality of pieces of three-dimensional skeletal data corresponding to the respective frames from three-dimensional sequential data including three-dimensional skeletal data related to a plurality of joint positions of the subject performing the predetermined motion. The generation device optimizes the amount of adjustment related to time synchronization between the moving image data and the three-dimensional sequential data, and a projection parameter that is used when the three-dimensional sequential data is projected to the moving image data, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeletal data. The generation device generates data in which the moving image data is associated with the three-dimensional sequential data, using the optimized adjustment amount and projection parameter.

Description

Data generation method, data generation program, and information processing device
The present invention relates to a data generation method, a data generation program, and an information processing apparatus.
Skeleton recognition that detects three-dimensional human movement in various sports is performed using two-dimensional images such as color images. For example, a method of calculating representative three-dimensional joint coordinates from a plurality of two-dimensional joint coordinates by triangulation is used. In recent years, to improve the accuracy of skeleton recognition, methods that estimate three-dimensional joint coordinates from the two-dimensional joint coordinates of multiple viewpoints using an estimation model generated by machine learning are also known.
The above estimation model is generated by machine learning using learning data that includes two-dimensional images and three-dimensional skeleton positions, but applying it to skeleton recognition of complicated movements such as gymnastics requires a very large amount of learning data to achieve sufficient estimation accuracy. However, such learning data is generally generated manually, which is inaccurate and inefficient; as a result, the accuracy of three-dimensional skeleton recognition using machine learning degrades and its cost increases.
It is also conceivable to generate a learning data set by synthesizing CG images in many variations, changing textures and rendering conditions, using a human body CG (Computer Graphics) model posed according to three-dimensional skeleton information acquired by motion capture or the like. However, when athletes wear uniforms or use multiple pieces of apparatus, as in gymnastics, it is difficult to simulate these and synthesize CG images at a quality close to actual camera images.
One aspect aims to provide a data generation method, a data generation program, and an information processing apparatus capable of automatically generating a learning data set including two-dimensional images and three-dimensional skeleton positions.
In a first proposal, in the data generation method, a computer executes a process of acquiring the two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data capturing a subject performing a predetermined motion. The computer identifies each of a plurality of three-dimensional skeleton data corresponding to each of the plurality of frames from three-dimensional series data including three-dimensional skeleton data on a plurality of joint positions of the subject performing the predetermined motion. Using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of three-dimensional skeleton data, the computer optimizes an adjustment amount for the time synchronization between the moving image data and the three-dimensional series data, and a projection parameter used when projecting the three-dimensional series data onto the moving image data. The computer then generates data in which the moving image data and the three-dimensional series data are associated with each other, using the optimized adjustment amount and projection parameter.
According to one embodiment, a learning data set including two-dimensional images and three-dimensional skeleton positions can be automatically generated.
FIG. 1 is a diagram illustrating an example of an overall configuration of a system using skeleton recognition.
FIG. 2 is a diagram illustrating the manual generation of learning data.
FIG. 3 is a functional block diagram showing a functional configuration of the generation device according to the first embodiment.
FIG. 4 is a diagram showing an example of moving image data.
FIG. 5 is a diagram illustrating an example of three-dimensional skeleton series data.
FIG. 6 is a diagram illustrating acquisition of two-dimensional joint coordinates.
FIG. 7 is a diagram illustrating camera calibration.
FIG. 8 is a diagram illustrating time synchronization and resampling.
FIG. 9 is a diagram illustrating the estimation of the initial parameters.
FIG. 10 is a diagram illustrating parameter optimization.
FIG. 11 is a diagram illustrating the result of optimization.
FIG. 12 is a diagram showing an example of the generated learning data.
FIG. 13 is a flowchart showing the flow of the learning data generation process.
FIG. 14 is a flowchart showing the flow of processing from the estimation of the initial values to the optimization.
FIG. 15 is a diagram illustrating a processing example of skeleton recognition.
FIG. 16 is a diagram illustrating a processing example of skeleton recognition.
FIG. 17 is a diagram illustrating a processing example of skeleton recognition.
FIG. 18 is a diagram illustrating a hardware configuration example.
Hereinafter, embodiments of the data generation method, data generation program, and information processing apparatus according to the present invention will be described in detail with reference to the drawings. The present invention is not limited to these embodiments, and the embodiments can be combined as appropriate within a consistent scope.
[System configuration]
FIG. 1 is a diagram illustrating an example of an overall configuration of a system using skeleton recognition. As shown in FIG. 1, this system has a 3D (Three-Dimensional) laser sensor 5, a generation device 10, a learning device 40, a recognition device 50, and a scoring device 90; it captures three-dimensional data of the performer 1, who is the subject, recognizes the skeleton and the like, and scores techniques accurately. In this embodiment, gymnastics is described as an example, but the present invention is not limited to this; it can also be applied to other competitions in which an athlete performs a series of techniques and a referee scores them, and to various human actions and movements. Further, in this embodiment, two dimensions may be written as 2D and three dimensions as 3D.
Generally, the current scoring method in gymnastics is performed visually by a plurality of graders, but as techniques become more sophisticated, it is increasingly difficult for graders to score visually. In recent years, automatic scoring systems and scoring support systems for scored competitions using the 3D laser sensor 5 have become known. For example, in these systems, a distance image, which is three-dimensional data of the athlete, is acquired by the 3D laser sensor 5, and the skeleton, such as the orientation and angle of each of the athlete's joints, is recognized from the distance image. In the scoring support system, the result of skeleton recognition is displayed as a 3D model, supporting the grader in scoring more correctly by checking the performer's detailed posture. In the automatic scoring system, the performed technique is recognized from the result of skeleton recognition, and scoring is performed according to the scoring rules.
Here, the scoring support system and the automatic scoring system are required to support scoring or automatically score performances carried out at any time in a timely manner. Usually, methods that recognize a performer's three-dimensional skeleton from distance images or color images incur long processing times and reduced skeleton-recognition accuracy due to insufficient memory and the like.
For example, in a configuration in which the result of automatic scoring by the automatic scoring system is provided to the grader, who compares it with his or her own scoring result, conventional techniques delay the provision of information to the grader. Furthermore, when the accuracy of skeleton recognition decreases, the subsequent technique recognition may be erroneous, and as a result the score determined for the technique will also be wrong. Similarly, in the scoring support system, when the angles and positions of the performer's joints are displayed using a 3D model, the display may be delayed or the displayed angles may be incorrect. In that case, scoring by a grader using the scoring support system may be erroneous.
As described above, poor skeleton-recognition accuracy or long processing times in the automatic scoring system or scoring support system lead to scoring errors and prolonged scoring. For this reason, using a machine learning model generated by machine learning realizes highly accurate skeleton recognition and suppresses recognition errors and prolonged scoring. Regarding three-dimensional human movement detection (skeleton recognition), 3D sensing technology that extracts three-dimensional joint coordinates with high accuracy from a plurality of 3D laser sensors 5 is being established, and its deployment to other sports and other fields is expected.
Here, each device constituting the system in FIG. 1 will be described. The 3D laser sensor 5 is an example of a sensor device that measures (senses) the distance to an object for each pixel using an infrared laser or the like. A distance image contains the distance at each pixel. That is, the distance image is a depth image representing the depth of the subject as seen from the 3D laser sensor (depth sensor) 5.
The learning device 40 is an example of a computer device that trains a machine learning model for skeleton recognition. Specifically, the learning device 40 generates a machine learning model by executing machine learning such as deep learning using two-dimensional skeleton position information, three-dimensional skeleton position information, and the like as a learning data set.
The recognition device 50 is an example of a computer device that recognizes the skeleton, such as the orientation and position of each joint of the performer 1, using the distance image measured by the 3D laser sensor 5. Specifically, the recognition device 50 inputs the distance image measured by the 3D laser sensor 5 into the trained machine learning model produced by the learning device 40, and recognizes the skeleton based on the output of the machine learning model. The recognition device 50 then outputs the recognized skeleton to the scoring device 90. For example, the information obtained as a result of skeleton recognition is information on the three-dimensional position of each joint.
The scoring device 90 is an example of a computer device that uses the recognition result information input from the recognition device 50 to identify the transition of the movement obtained from the positions and orientations of the performer's joints, and identifies and scores the technique performed by the performer.
In order to perform three-dimensional skeleton recognition for complicated postures such as gymnastics with the skeleton recognition technology using the machine learning model described above, a large amount of training data on the complicated postures needs to be newly prepared and used for training.
In recent years, in order to generate a large amount of training data, three-dimensional skeleton information and the like have been newly collected using laser methods, image methods, and the like. For example, in the laser method using a 3D laser sensor, the laser is emitted approximately two million times per second, and the depth of each irradiated point, including points on the target person, is obtained based on the time of flight (ToF) of each laser pulse. The laser method can acquire highly accurate depth data, but it requires complicated configurations and processing such as laser scanning and ToF measurement, making the hardware complicated and expensive, so it is difficult to use for general purposes.
In the image method, which acquires RGB data for each pixel with a CMOS (Complementary Metal Oxide Semiconductor) imager, an inexpensive RGB camera can be used, and with recent improvements in deep learning technology, the accuracy of three-dimensional skeleton recognition is also improving. However, even if machine learning is executed using an existing data set such as Total Capture, which contains only general postures, a machine learning model that can recognize complicated postures such as gymnastics cannot be generated.
As described above, since the generally used methods cannot generate a machine learning model that can recognize complicated postures, training data for learning complicated postures has been generated manually.
FIG. 2 is a diagram illustrating the manual generation of training data. As shown in FIG. 2, moving image data capturing a series of gymnastic performances by the performer 1 and three-dimensional skeleton series data including three-dimensional skeleton data from a past performance by the performer 1 are visually time-synchronized, and a frame in which the moving image data and the three-dimensional skeleton data are synchronized is searched for (S1). Subsequently, the three-dimensional skeleton data is projected onto the image using approximate camera parameters defined as prior knowledge (S2), and the values of the camera parameters are manually adjusted so that the person silhouette in the moving image data and the three-dimensional skeleton data overlap (S3). After that, the three-dimensional skeleton series data is resampled at the frame rate of the moving image data and projected onto the moving image data, thereby creating training data over the entire moving image data (S4).
When training data is created manually in this way, the spatial and temporal alignment of the moving image data and the three-dimensional skeleton series data is performed visually and by hand, so the resulting training data lacks sufficient accuracy and, moreover, an enormous amount of work time is required.
Therefore, the first embodiment provides a technique that performs spatially and temporally optimal automatic alignment using moving image data acquired by a camera and three-dimensional skeleton series data acquired by 3D sensing or the like for the same motion, thereby generating training data efficiently and with high quality.
Specifically, the generation device 10 acquires the two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in two-dimensional moving image data capturing the performer 1. Subsequently, the generation device 10 identifies each of a plurality of pieces of three-dimensional skeleton data corresponding to each of the plurality of frames from the three-dimensional skeleton series data on a plurality of joint positions (three-dimensional skeleton data) of the performer 1. Then, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data, the generation device 10 optimizes both the adjustment amount for time synchronization between the moving image data and the three-dimensional skeleton series data and the camera parameters used when projecting the three-dimensional skeleton series data onto the moving image data. After that, the generation device 10 uses the optimized adjustment amount and camera parameters to generate data in which the moving image data and the three-dimensional skeleton series data are associated with each other.
In this way, the generation device 10 performs spatially and temporally optimal automatic alignment by simultaneously executing camera calibration and time synchronization based on non-linear optimization so that the geometric consistency between the three-dimensional skeleton series data and the two-dimensional joint coordinates is maximized. As a result, the generation device 10 can automatically generate a training data set including two-dimensional images and three-dimensional skeleton positions that can be used to train a machine learning model that estimates three-dimensional skeleton positions with high accuracy.
[Functional configuration]
FIG. 3 is a functional block diagram showing the functional configuration of the generation device 10 according to the first embodiment. As shown in FIG. 3, the generation device 10 includes a communication unit 11, a storage unit 12, and a control unit 20.
The communication unit 11 is a processing unit that controls communication with other devices and is realized by, for example, a communication interface. For example, the communication unit 11 receives the moving image data of the performer 1 captured with a camera or the like, and receives the three-dimensional skeleton series of the performer 1 captured with the 3D laser sensor 5.
The storage unit 12 is a processing unit that stores various data, programs executed by the control unit 20, and the like, and is realized by, for example, a memory or a hard disk. The storage unit 12 stores moving image data 13, three-dimensional skeleton series data 14, and a learning data set 15.
The moving image data 13 is a series of moving image data captured by a camera or the like while the performer 1 performs, and is composed of a plurality of frames. FIG. 4 is a diagram showing an example of the moving image data 13. As an example, FIG. 4 shows one frame of the moving image data 13 captured during a pommel horse performance. The coordinate system of the moving image data 13 is based on the position, orientation, and resolution of the camera, and its time system is based on the camera-specific time stamps and sampling rate.
The three-dimensional skeleton series data 14 is series data including three-dimensional skeleton data indicating the three-dimensional joint coordinates of a plurality of joint positions. Specifically, the three-dimensional skeleton series data 14 is a series of three-dimensional skeleton data captured by a 3D laser sensor or the like while the performer 1 performs. The three-dimensional skeleton data includes information on the three-dimensional skeleton position (skeleton information) of each joint while the performer 1 is performing.
FIG. 5 is a diagram illustrating an example of the three-dimensional skeleton series data 14. FIG. 5 shows one piece of three-dimensional skeleton data in the three-dimensional skeleton series data 14 generated from the performance of the performer 1. As shown in FIG. 5, the three-dimensional skeleton series data 14 is data acquired by 3D sensing technology and includes the three-dimensional coordinates of each joint. Here, the joints are, for example, 18 joints designated in advance, such as the right shoulder, left shoulder, and right ankle, or a plurality of joints arbitrarily set by the user. The coordinate system of the three-dimensional skeleton series data 14 is based on the position and orientation of the sensor, and its time system is based on the sensor-specific time stamps and sampling rate.
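As an illustration only, the two inputs can be represented by containers like the following Python sketch; the class and field names are assumptions made for this example and do not appear in the embodiment.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SkeletonFrame:
    timestamp: float      # time in the sensor's own time system t_s [s]
    joints: np.ndarray    # (J, 3) array of 3D joint coordinates, e.g. J = 18

@dataclass
class VideoFrame:
    timestamp: float      # time in the camera's own time system t_v [s]
    image: np.ndarray     # (H, W, 3) RGB frame

# the skeleton series is then simply a list of SkeletonFrame sampled at T_s,
# and the moving image data a list of VideoFrame sampled at T_v
```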
The learning data set 15 is a database containing a plurality of pieces of training data used for generating machine learning models, and is generated by the control unit 20 described later. For example, the learning data set 15 is information in which three-dimensional skeleton data and camera parameters are associated with the moving image data 13.
The control unit 20 is a processing unit that controls the entire generation device 10 and is realized by, for example, a processor. The control unit 20 includes a data acquisition unit 21, a coordinate acquisition unit 22, and a learning data generation unit 23. The data acquisition unit 21, the coordinate acquisition unit 22, and the learning data generation unit 23 are realized by an electronic circuit of the processor, a process executed by the processor, or the like.
The data acquisition unit 21 is a processing unit that acquires the moving image data 13 and the three-dimensional skeleton series data 14 and stores them in the storage unit 12. For example, the data acquisition unit 21 can acquire the moving image data 13 from a camera, or can read moving image data 13 previously captured by a known method from its storage destination and store it in the storage unit 12. Similarly, the data acquisition unit 21 can acquire the three-dimensional skeleton series data 14 from the 3D laser sensor, or can read three-dimensional skeleton series data 14 previously captured by a known method from its storage destination and store it in the storage unit 12.
The coordinate acquisition unit 22 is a processing unit that acquires, from each of the plurality of frames included in the moving image data 13, the two-dimensional coordinates that are the two-dimensional joint coordinates of each of the plurality of joints. Specifically, the coordinate acquisition unit 22 selects several suitable frames (for example, about 10) from the moving image data and acquires the two-dimensional joint positions of the target person in each frame automatically or manually.
FIG. 6 is a diagram illustrating the acquisition of two-dimensional joint coordinates. As shown in FIG. 6, the coordinate acquisition unit 22 selects a frame from the moving image data 13 and sets eight joints designated in advance in the selected frame as annotation targets. Then, using an existing model or the like, the coordinate acquisition unit 22 acquires the two-dimensional coordinates indicating the joint positions of the annotation targets: (1) right elbow, (2) right wrist, (3) left elbow, (4) left wrist, (5) right knee, (6) right ankle, (7) left knee, and (8) left ankle.
A subset of the joints in the three-dimensional skeleton series data can also be used as annotation targets. The two-dimensional joint coordinates can be acquired by automatic annotation using an existing two-dimensional skeleton recognition method, or by visual or manual annotation.
The learning data generation unit 23 includes an initial setting unit 24, an optimization unit 25, and an output unit 26, and is a processing unit that generates the learning data set 15 in which three-dimensional skeleton data is associated with the moving image data 13. Specifically, since the learning data generation unit 23 holds both the three-dimensional skeleton series data 14 and the moving image data 13 for the same performance, it can project the three-dimensional skeleton series data 14 onto the moving image data 13 and thereby generate a large number of images annotated with two-dimensional or three-dimensional joint coordinates for complicated postures, without acquiring any new data.
However, as described with reference to FIGS. 4 and 5, the three-dimensional skeleton series data 14 and the moving image data 13 are based on different coordinate systems and time systems. Therefore, in order to project the three-dimensional skeleton series data 14 onto the moving image data 13, the learning data generation unit 23 executes camera calibration, which corresponds to spatial alignment, and time synchronization and resampling, which correspond to temporal alignment.
(Camera calibration)
Here, camera calibration will be described. FIG. 7 is a diagram illustrating camera calibration. As shown in FIG. 7, obtaining the projection point of a 3D point onto an image requires camera-specific parameters such as focal length and resolution (camera intrinsic parameters) of the camera that captured the image, and the parameters of the camera's position and orientation in the coordinate system serving as the reference for the 3D point, that is, the world coordinate system (camera extrinsic parameters). The process of obtaining these parameters (camera parameters) is called camera calibration.
In the example of FIG. 7, the perspective projection of a 3D point onto the image can be expressed as w[x, y, 1]^T = K[R | t][X, Y, Z, 1]^T. Here, [X, Y, Z] are the coordinates of the 3D point being projected, and [x, y] are the coordinates of the projected point on the image. K is the camera intrinsic parameter, a 3×3 intrinsic matrix. R is a camera extrinsic parameter, a 3×3 rotation matrix, and t is a 3×1 translation vector. Of these, R and t are the targets of camera calibration.
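For illustration only, the transformation above can be written as a short Python sketch using NumPy; the function name and array shapes are assumptions made for this example.

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D world points X (N, 3) onto the image plane via
    w [x, y, 1]^T = K [R | t] [X, Y, Z, 1]^T."""
    Xc = X @ R.T + t                 # world -> camera coordinates
    uvw = Xc @ K.T                   # apply the 3x3 intrinsic matrix K
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division by w
```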
(Time synchronization and resampling)
Next, time synchronization and resampling will be described. FIG. 8 is a diagram illustrating time synchronization and resampling. As shown in FIG. 8, when the moving image data 13 and the three-dimensional skeleton series data 14 are acquired in different time systems, the time systems of the two data sets are first synchronized as a whole, and the three-dimensional skeleton series data 14 is then resampled at each frame time of the moving image data 13, so that three-dimensional skeleton data synchronized with the moving image data 13 can be acquired.
Here, time synchronization means defining the conversion between the time systems of the moving image data 13 and the three-dimensional skeleton series data 14, and resampling means interpolating the three-dimensional skeleton data at the time of each frame of the moving image data 13.
In the case of FIG. 8, the real-world time system is t, the time system of the three-dimensional skeleton series data 14 is t_s, and the time system of the moving image data 13 is t_v. In this state, the three-dimensional skeleton series data 14 is sampled at a sampling period T_s, yielding three-dimensional skeleton data at times t_{s,0}, t_{s,1}, t_{s,2}, and so on. The moving image data 13 is sampled at a sampling period T_v, yielding frames at times t_{v,0}, t_{v,1}, t_{v,2}, and so on.
The difference between the time t_{s,0} at the head of the three-dimensional skeleton series data 14 and the time t_{v,0} at the head of the moving image data 13 is the time shift amount T_{v,s}. Time synchronization can be calculated by defining the time conversion between the two time systems as τ(t_{v,j}) = t_{s,0} + T_{v,s} + j·T_v. Resampling can be executed by using this conversion to refer to the three-dimensional skeleton data within a predetermined time range around t_{v,j} and interpolating the three-dimensional skeleton data corresponding to time t_{v,j} by, for example, bilinear interpolation. For example, for the frame at time t_{v,0} of the moving image data 13, resampled three-dimensional skeleton data can be extracted by associating the three-dimensional skeleton data that lies between times t_{s,2} and t_{s,3} of the three-dimensional skeleton series data 14 and is time-synchronized at τ(t_{v,0}).
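The conversion and interpolation just described might be sketched as follows; plain linear interpolation between the two neighboring samples stands in for the interpolation mentioned above, and all identifiers are illustrative.

```python
import numpy as np

def to_skeleton_time(j, t_s0, T_vs, T_v):
    """Map video frame index j into the skeleton time system:
    tau(t_{v,j}) = t_{s,0} + T_{v,s} + j * T_v."""
    return t_s0 + T_vs + j * T_v

def resample(skel_times, skel_joints, t):
    """Interpolate the (N, J, 3) skeleton samples at time t using the
    two neighboring samples."""
    k = np.searchsorted(skel_times, t) - 1
    k = np.clip(k, 0, len(skel_times) - 2)
    a = (t - skel_times[k]) / (skel_times[k + 1] - skel_times[k])
    return (1.0 - a) * skel_joints[k] + a * skel_joints[k + 1]
```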
As described above, the learning data generation unit 23 executes camera calibration and time synchronization and resampling in order to project the three-dimensional skeleton series data 14 onto the moving image data 13. At this time, the learning data generation unit 23 defines a cost function that performs camera calibration and time synchronization simultaneously based on non-linear optimization so that the geometric consistency between the three-dimensional skeleton series data 14 and the moving image data 13 is maximized. The learning data generation unit 23 then calculates the optimal time synchronization and camera calibration by optimizing the cost function.
Returning to FIG. 3, the initial setting unit 24 is a processing unit that sets the initial parameters of the cost function. Specifically, for each synchronization pattern obtained by varying the time synchronization, the initial setting unit 24 calculates camera parameters by solving an estimation problem that estimates the position and orientation of the camera when the three-dimensional skeleton series data 14 is projected onto the moving image data 13, using each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data associated by resampling. Then, using the camera parameters calculated for each synchronization pattern, the initial setting unit 24 calculates a likelihood representing the validity of the time synchronization of each synchronization pattern and of the corresponding camera parameters. After that, the initial setting unit 24 sets, as initial values, the frames and the pieces of three-dimensional skeleton data identified for the synchronization pattern with the highest likelihood.
FIG. 9 is a diagram illustrating the estimation of the initial parameters. As shown in FIG. 9, while varying the time synchronization appropriately, the initial setting unit 24 acquires, by the resampling shown in FIG. 8, the three-dimensional skeleton data corresponding to the frames of the moving image data 13 for which two-dimensional joint coordinates have been acquired. Then, the initial setting unit 24 estimates the camera parameters by solving the PnP (Perspective-n-Point) problem on the correspondences between the two-dimensional joint positions and the three-dimensional skeleton data.
At this time, the initial setting unit 24 quantitatively computes the geometric consistency between the three-dimensional skeleton series data 14 and the two-dimensional joint coordinates as a likelihood representing the validity of the obtained camera parameters and time synchronization. For example, the initial setting unit 24 calculates, as the likelihood, the ratio of the number of joints for which the reprojection error of the three-dimensional joint coordinates in the resampled three-dimensional skeleton data is less than a threshold. Then, among all the trials of time synchronization, the initial setting unit 24 adopts the time synchronization and camera parameters at which the likelihood is maximized as a quasi-optimal solution, and outputs them to the optimization unit 25 as initial values.
In the example of FIG. 9, the initial setting unit 24 executes trial i-1, trial i, and trial i+1, which correspond to synchronization patterns in which the first frame of the moving image data 13 is shifted at regular intervals, and thereby calculates a quasi-optimal solution. Taking trial i-1 as an example, for the frame of the moving image data 13 corresponding to time t_{s,0} at which the two-dimensional coordinates were acquired by the coordinate acquisition unit 22, the initial setting unit 24 generates data A1 by resampling using the three-dimensional skeleton data between times t_{s,0} and t_{s,1}. Similarly, for the other frames in which two-dimensional coordinates were acquired, the initial setting unit 24 generates data A3 by resampling using the three-dimensional skeleton data between times t_{s,2} and t_{s,3}, and generates data A4 by resampling using the three-dimensional skeleton data between times t_{s,3} and t_{s,4}.
Subsequently, the initial setting unit 24 estimates the camera parameters by solving the PnP problem using data A1, data A3, and data A4, each of which contains three-dimensional skeleton data and two-dimensional coordinates. The initial setting unit 24 then projects the three-dimensional skeleton data onto the frames of the moving image data 13 using the estimated camera parameters. For each joint in a frame, the initial setting unit 24 calculates the distance between the annotated two-dimensional coordinates of the joint and the projected two-dimensional position of the corresponding joint in the three-dimensional skeleton data. The initial setting unit 24 then calculates, as the likelihood, the ratio of joints whose distance is less than a threshold. The likelihood can be calculated using the joint distances obtained when the three-dimensional skeleton data is projected onto one or more frames.
In this way, the initial setting unit 24 executes the above processing for each of trial i-1, trial i, and trial i+1 and calculates the likelihood for each. The initial setting unit 24 then determines the time synchronization and camera parameters of trial i, which has the highest likelihood, as the quasi-optimal solution (initial values).
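A minimal sketch of this initial estimation is shown below, assuming OpenCV's solvePnP as the PnP solver; the callback resample_3d, the pixel threshold, and the trial structure are assumptions made for the example.

```python
import cv2
import numpy as np

def estimate_initial(K, trials, pts2d, resample_3d, thresh_px=10.0):
    """For each candidate time shift, resample the 3D skeleton at the
    annotated frame times, solve PnP, and score the result by the
    fraction of joints whose reprojection error is below a threshold.
    pts2d maps frame times to (J, 2) annotated joints; resample_3d(t, shift)
    returns the matched (J, 3) skeleton."""
    best = None
    for shift in trials:
        obj = np.vstack([resample_3d(t, shift) for t in pts2d])  # (M, 3)
        img = np.vstack([pts2d[t] for t in pts2d])               # (M, 2)
        ok, rvec, tvec = cv2.solvePnP(obj.astype(np.float64),
                                      img.astype(np.float64), K, None)
        if not ok:
            continue
        proj, _ = cv2.projectPoints(obj, rvec, tvec, K, None)
        err = np.linalg.norm(proj.reshape(-1, 2) - img, axis=1)
        likelihood = float(np.mean(err < thresh_px))  # inlier-joint ratio
        if best is None or likelihood > best[0]:
            best = (likelihood, shift, rvec, tvec)
    return best  # quasi-optimal time shift and camera pose
```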
Returning to FIG. 3, the optimization unit 25 is a processing unit that optimizes the cost function relating to the adjustment amount for time synchronization and the camera parameters, using the quasi-optimal solution calculated by the initial setting unit 24 as the initial value. Specifically, the optimization unit 25 defines a cost function C (Equation (1)) relating to the time synchronization adjustment amount Δt and the camera parameters, applies non-linear optimization with the quasi-optimal solution as the initial value, and minimizes the cost function. Here, the optimization unit 25 expresses the three-dimensional skeleton series data, which is discrete data, by a continuous function f(t) that is differentiable for each joint, and incorporates it into the cost function C as the resampling process for the three-dimensional skeleton data. For f(t), cubic spline interpolation or the like can be applied.
C(Δt, π) = Σ_i Σ_t || p_{i,t} − π(f_i(t + Δt)) ||²    (1)
In Equation (1), i denotes a joint, and t is the frame time of the moving image data 13 at which the two-dimensional coordinates were acquired in the quasi-optimal solution. p_{i,t} denotes the two-dimensional joint position of joint i at time t. f_i(t) is the position of joint i at time t, that is, the resampled three-dimensional joint coordinates. π(X) is the perspective projection of the 3D point X using the camera parameters, yielding two-dimensional joint coordinates. In this cost function C, the time adjustment amount Δt and π are the optimization targets.
FIG. 10 is a diagram illustrating parameter optimization. As shown in FIG. 10, the optimization unit 25 sets the quasi-optimal solution as the initial value of the cost function C, and calculates the optimal value of each parameter by repeating optimization and resampling. For example, the optimization unit 25 optimizes the cost function using data B1, in which two-dimensional coordinates and three-dimensional skeleton data are associated; when the next data B2 is to be used, it performs the above resampling to generate data B2 and then optimizes the cost function using data B2. In this way, the optimization unit 25 can optimize Δt and the camera parameters simultaneously.
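A minimal sketch of this simultaneous optimization, assuming SciPy's least_squares as the non-linear solver and a cubic spline as f(t); the packing of the parameters into [Δt, rvec, tvec] is an illustrative choice, not the embodiment's exact parameterization.

```python
import cv2
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.optimize import least_squares

def optimize(skel_times, skel_joints, frame_times, pts2d, K, x0):
    """Jointly refine the time shift dt and the camera pose (rvec, tvec)
    by minimising C = sum_{i,t} || p_{i,t} - pi(f_i(t + dt)) ||^2.
    skel_joints is (N, J, 3); pts2d is a list of (J, 2) arrays aligned
    with frame_times; x0 = [dt, rvec, tvec] is the quasi-optimal solution."""
    f = CubicSpline(skel_times, skel_joints, axis=0)  # differentiable f(t)

    def residuals(x):
        dt, rvec, tvec = x[0], x[1:4], x[4:7]
        res = []
        for t, p2d in zip(frame_times, pts2d):
            X = f(t + dt)                              # resampled 3D joints
            proj, _ = cv2.projectPoints(X, rvec, tvec, K, None)
            res.append((proj.reshape(-1, 2) - p2d).ravel())
        return np.concatenate(res)

    sol = least_squares(residuals, x0)                 # non-linear optimisation
    return sol.x  # optimised [dt, rvec, tvec]
```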
FIG. 11 is a diagram illustrating the result of the optimization. As shown in FIG. 11, the initial setting unit 24 sets the initial parameters for a frame (image) in which the two-dimensional coordinates of joints (1) to (8) have been acquired. At this point, the geometric consistency is still insufficient, so the athlete's body in the image and the resampled three-dimensional skeleton data are misaligned. After the optimization unit 25 executes the optimization, the geometric consistency improves, and the athlete's body in the image and the resampled three-dimensional skeleton data coincide. The optimization unit 25 then outputs the optimized Δt and camera parameters to the output unit 26.
The output unit 26 is a processing unit that generates training data using the optimization result of the optimization unit 25. Specifically, given the optimized time synchronization adjustment amount and camera parameters, the output unit 26 generates images annotated with two-dimensional or three-dimensional joint coordinates and stores them in the learning data set 15 as training data.
FIG. 12 is a diagram showing an example of the generated training data. As shown in FIG. 12, the output unit 26 stores entries of the form "image, three-dimensional skeleton data, camera parameters", such as "I_1, ({X_{1,1}, Y_{1,1}, Z_{1,1}} ... {X_{1,j}, Y_{1,j}, Z_{1,j}}), (K, R, t)" and "I_2, ({X_{2,1}, Y_{2,1}, Z_{2,1}} ... {X_{2,j}, Y_{2,j}, Z_{2,j}}), (K, R, t)".
In this example, the three-dimensional joint coordinates {X_{1,1}, Y_{1,1}, Z_{1,1}} ... {X_{1,j}, Y_{1,j}, Z_{1,j}} are associated with the two-dimensional image I_1, and the camera parameters at the time of association are (K, R, t). One set of camera parameters is set for an entire series of moving image data 13.
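Under the same assumptions as the earlier sketches, emitting one such record per frame once the optimized Δt and camera pose are available could look like the following; the record layout is illustrative.

```python
def emit_records(frame_times, images, f, dt, K, R, t):
    """Build one training record per video frame: the image, the
    resampled 3D joints f(time + dt), and the single camera parameter
    set (K, R, t) shared by the whole sequence."""
    records = []
    for time, image in zip(frame_times, images):
        X = f(time + dt)  # (J, 3) resampled skeleton at this frame
        records.append({"image": image, "joints3d": X,
                        "camera": {"K": K, "R": R, "t": t}})
    return records
```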
[Processing flow]
FIG. 13 is a flowchart showing the flow of the training data generation process. As shown in FIG. 13, the coordinate acquisition unit 22 reads the moving image data 13 and the three-dimensional skeleton series data 14 from the storage unit 12 (S101), and acquires the two-dimensional joint coordinates in several frames of the moving image data 13 (S102).
Then, the learning data generation unit 23 estimates the initial values of the camera parameters and the time synchronization (S103), optimizes the camera parameters and the time synchronization (S104), and generates training data using the optimization result (S105).
Here, the details of the processing executed in S103 and S104 will be described. FIG. 14 is a flowchart showing the flow of processing from initial value estimation to optimization. As shown in FIG. 14, the learning data generation unit 23 generates a group of candidates for time synchronization between the moving image data 13 and the three-dimensional skeleton series data 14 (S201).
Subsequently, for each time synchronization candidate, the learning data generation unit 23 resamples the three-dimensional skeleton data corresponding to the frames that have two-dimensional joint coordinates (S202). Then, for each candidate, the learning data generation unit 23 estimates the camera parameters by solving the PnP problem based on the correspondences between the two-dimensional joint positions and the three-dimensional skeleton data (S203).
After that, for each time synchronization candidate, the learning data generation unit 23 calculates the likelihood indicating the validity of the time synchronization and the camera parameters (S204), and determines the time synchronization and camera parameters of the candidate with the maximum likelihood as the quasi-optimal solution (S205).
Then, the learning data generation unit 23 defines a cost function relating to the time synchronization and camera parameters that incorporates the resampling process for the three-dimensional skeleton data (S206), executes non-linear optimization with the quasi-optimal solution as the initial value to minimize the cost function, and obtains the optimal time synchronization and camera parameters (S207).
[Effect]
As described above, the generation device 10 simultaneously estimates quasi-optimal camera parameters and time synchronization so that the geometric consistency between the three-dimensional skeleton series data 14 and the two-dimensional joint coordinates is maximized, and then executes spatially and temporally optimal automatic alignment by non-linear optimization using the estimation result as the initial value. Therefore, the generation device 10 can efficiently generate training data for image-based skeleton recognition using the moving image data 13 and the three-dimensional skeleton series data 14 acquired asynchronously by a camera and 3D sensing technology.
In addition, since the generation device 10 can calculate, as the likelihood, the ratio of the number of joints whose reprojection error is less than a threshold when projected onto a frame of the moving image data, it can set accurate initial values. As a result, the generation device 10 can execute the optimization from an already narrowed-down state, which reduces the cost of the optimization process and shortens its processing time.
Although the embodiments of the present invention have been described above, the present invention may be implemented in various other forms besides the embodiments described above.
[Numerical values, etc.]
The data types, cost function, machine learning models, training data, various parameters, and the like used in the above embodiment are merely examples and can be changed arbitrarily. The above embodiment was described using gymnastics as an example, but the invention is not limited to this and can also be applied to other competitions in which an athlete performs a series of techniques and referees score them. Examples of other competitions include figure skating, rhythmic gymnastics, cheerleading, swimming dives, karate kata, and mogul aerials. Furthermore, application is not limited to sports; the invention can also be applied to posture detection of drivers of trucks, taxis, trains, and the like, and to posture detection of pilots.
[Application example]
The training data generated as described above can be employed in various models for skeleton recognition. Here, application examples of the training data will be described. FIGS. 15 to 17 are diagrams illustrating processing examples of skeleton recognition.
FIG. 15 is an example in which two-dimensional skeleton recognition is executed and three-dimensional skeleton recognition is then performed algebraically. In the example of FIG. 15, a human detection model is used to detect a person in images from multiple viewpoints, and a 2D detection model is used to identify a plurality of two-dimensional joint coordinates from the heat maps of each detected person's joints; the three-dimensional joint coordinates are then obtained algebraically from the plurality of two-dimensional joint coordinates by triangulation.
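A minimal sketch of the triangulation step, assuming the standard DLT (direct linear transform) formulation for two views; the embodiment does not specify a particular solver, so this is an illustration only.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one joint from two views.
    P1, P2 are 3x4 projection matrices K[R | t]; x1, x2 are the 2D
    joint coordinates of the same joint in each view."""
    A = np.stack([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)      # null vector of A minimises ||A X||
    X = Vt[-1]
    return X[:3] / X[3]              # homogeneous -> 3D joint position
```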
This approach uses two machine learning models: the human detection model and the 2D detection model. The accuracy of each model can be improved by executing machine learning on these two models using the training data generated in the first embodiment, in which images and two-dimensional coordinates are associated.
FIG. 16 is an example in which two-dimensional skeleton recognition is executed and three-dimensional skeleton recognition is then performed by a model. In the example of FIG. 16, a human detection model is used to detect a person in images from multiple viewpoints, and a 2D detection model is used to identify a plurality of two-dimensional joint coordinates from the heat maps of each detected person's joints. Then, three-dimensional joint coordinates are obtained from the two-dimensional joint coordinates of the multiple viewpoints using a 3D estimation model obtained by training. Alternatively, three-dimensional joint coordinates are obtained, using a trained 3D estimation model, from 3D voxel data that integrates the two-dimensional heat maps of the multiple viewpoints.
This approach uses three machine learning models. Of these, the human detection model and the 2D detection model are trained using the training data generated in the first embodiment in which images and two-dimensional coordinates are associated. The 3D estimation model is trained using the training data in which images, three-dimensional coordinates, and camera parameters are associated. As a result, the accuracy of each model can be improved.
FIG. 17 is an example in which three-dimensional skeleton recognition is executed directly. In the example of FIG. 17, a human detection model is used to detect a person in an image, and a 3D detection model obtained by training is used to estimate the three-dimensional joint coordinates of each joint of the detected person. This approach uses two machine learning models. These models are trained using the training data in which images, three-dimensional coordinates, and camera parameters are associated. As a result, the accuracy of each model can be improved.
[System]
The processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified. The coordinate acquisition unit 22 is an example of an acquisition unit, and the learning data generation unit 23 is an example of an identifying unit, an execution unit, and a generation unit. The three-dimensional skeleton series data 14 is an example of three-dimensional series data, and the camera parameters are an example of projection parameters. The two-dimensional joint coordinates are the coordinates of a joint position expressed in two dimensions and are synonymous with two-dimensional skeleton coordinates. Similarly, the three-dimensional joint coordinates are the coordinates of a joint position expressed in three dimensions and are synonymous with three-dimensional skeleton coordinates.
Each component of each device shown in the drawings is functionally conceptual and does not necessarily have to be physically configured as shown. That is, the specific forms of distribution and integration of the devices are not limited to those shown in the drawings; all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
Furthermore, all or any part of the processing functions performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware using wired logic.
[Hardware]
Next, a hardware configuration example will be described. FIG. 18 is a diagram illustrating a hardware configuration example. As shown in FIG. 18, the generation device 10 includes a communication device 10a, an HDD (Hard Disk Drive) 10b, a memory 10c, and a processor 10d. The units shown in FIG. 18 are connected to each other by a bus or the like.
The communication device 10a is a network interface card or the like and communicates with other servers. The HDD 10b stores programs and DBs for operating the functions shown in FIG. 3.
The processor 10d reads a program that executes the same processing as each processing unit shown in FIG. 3 from the HDD 10b or the like and loads it into the memory 10c, thereby running a process that executes each function described with reference to FIG. 3 and elsewhere. For example, this process executes the same functions as each processing unit of the generation device 10. Specifically, the processor 10d reads a program having the same functions as the data acquisition unit 21, the coordinate acquisition unit 22, the learning data generation unit 23, and the like from the HDD 10b or the like, and executes a process that performs the same processing as these units.
In this way, the generation device 10 operates as an information processing apparatus that executes various information processing methods by reading and executing the program. The generation device 10 can also realize the same functions as the above embodiment by reading the program from a recording medium with a medium reader and executing the read program. Note that the program in these other embodiments is not limited to being executed by the generation device 10. For example, the present invention can be applied in the same way when another computer or server executes the program, or when they execute the program in cooperation.
This program can be distributed via a network such as the Internet. The program can also be recorded on a computer-readable recording medium such as a hard disk, flexible disk (FD), CD-ROM, MO (Magneto-Optical disk), or DVD (Digital Versatile Disc), and executed by being read from the recording medium by a computer.
10 Generation device
11 Communication unit
12 Storage unit
13 Moving image data
14 Three-dimensional skeleton series data
15 Learning data set
20 Control unit
21 Data acquisition unit
22 Coordinate acquisition unit
23 Learning data generation unit
24 Initial setting unit
25 Optimization unit
26 Output unit

Claims (7)

1. A data generation method in which a computer executes a process comprising:
acquiring two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data obtained by imaging a subject performing a predetermined motion;
identifying, from three-dimensional series data including three-dimensional skeleton data relating to a plurality of joint positions of the subject performing the predetermined motion, each of a plurality of pieces of three-dimensional skeleton data corresponding to each of the plurality of frames;
executing optimization of an adjustment amount relating to time synchronization between the moving image data and the three-dimensional series data and of projection parameters used when projecting the three-dimensional series data onto the moving image data, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data; and
generating data in which the moving image data and the three-dimensional series data are associated with each other, using the optimized adjustment amount and the optimized projection parameters.
2. The data generation method according to claim 1, wherein the identifying includes identifying each of the plurality of pieces of three-dimensional skeleton data corresponding to each of the plurality of frames by resampling using the three-dimensional skeleton data within a sampling period of the three-dimensional series data that includes the time of each of the plurality of frames.
3. The data generation method according to claim 2, wherein the computer further executes a process of:
calculating the projection parameters for each synchronization pattern in which the time synchronization between the time of the moving image data and the time of the three-dimensional series data is adjusted, by solving an estimation problem that estimates the position and orientation of a camera using each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data associated by the resampling; and
calculating a likelihood representing the validity of each of the projection parameters calculated for each synchronization pattern and of the time synchronization of each synchronization pattern,
wherein the executing includes executing optimization of a cost function relating to the adjustment amount and the projection parameters, with the projection parameters and the time synchronization calculated for the synchronization pattern having the highest likelihood as initial values.
4. The data generation method according to claim 3, wherein the calculating includes calculating, as the likelihood, the ratio of the number of joints for which the reprojection error of each joint is less than a threshold when each of the resampled plurality of pieces of three-dimensional skeleton data is projected onto the corresponding frame of the moving image data.
5.  The data generation method according to claim 1, wherein the generating synchronizes the time of the moving image data and the time of the three-dimensional series data according to the optimized adjustment amount for the time synchronization, and generates the data by projecting, using the optimized projection parameters, each piece of three-dimensional skeleton data in the three-dimensional series data onto each frame of the moving image data whose time is synchronized by the time synchronization.
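     After the optimization has converged, generating the associated data is a single pass over the video: shift each frame time by the optimized adjustment amount, resample the skeleton, and project it with the optimized parameters. Again a sketch only, reusing the hypothetical resample_skeleton helper and OpenCV's projectPoints:

        import numpy as np
        import cv2

        def generate_dataset(frame_times, frames, mocap_times, skeletons,
                             dt, rvec, tvec, K):
            dataset = []
            for t, frame in zip(frame_times, frames):
                x3d = resample_skeleton(t + dt, mocap_times, skeletons)
                uv, _ = cv2.projectPoints(x3d.astype(np.float64),
                                          rvec, tvec, K, None)
                # Each record pairs a 2-D image with its 3-D skeleton position:
                # one sample of the learning data the method is meant to produce.
                dataset.append({"image": frame,
                                "joints_2d": uv.reshape(-1, 2),
                                "skeleton_3d": x3d})
            return dataset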
6.  A data generation program that causes a computer to execute a process comprising:
     acquiring two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data that captures a subject performing a predetermined motion;
     identifying, from three-dimensional series data that includes three-dimensional skeleton data on a plurality of joint positions of the subject performing the predetermined motion, each of a plurality of pieces of three-dimensional skeleton data corresponding to each of the plurality of frames;
     executing, by using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data, optimization of an adjustment amount for time synchronization between the moving image data and the three-dimensional series data and of projection parameters used when projecting the three-dimensional series data onto the moving image data; and
     generating, by using the optimized adjustment amount and projection parameters, data in which the moving image data and the three-dimensional series data are associated with each other.
7.  An information processing apparatus comprising:
     an acquisition unit that acquires two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data that captures a subject performing a predetermined motion;
     an identification unit that identifies, from three-dimensional series data that includes three-dimensional skeleton data on a plurality of joint positions of the subject performing the predetermined motion, each of a plurality of pieces of three-dimensional skeleton data corresponding to each of the plurality of frames;
     an execution unit that executes, by using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data, optimization of an adjustment amount for time synchronization between the moving image data and the three-dimensional series data and of projection parameters used when projecting the three-dimensional series data onto the moving image data; and
     a generation unit that generates, by using the optimized adjustment amount and projection parameters, data in which the moving image data and the three-dimensional series data are associated with each other.
PCT/JP2020/026232 2020-07-03 2020-07-03 Data generation method, data generation program, and information-processing device WO2022003963A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022533003A JP7318814B2 (en) 2020-07-03 2020-07-03 DATA GENERATION METHOD, DATA GENERATION PROGRAM AND INFORMATION PROCESSING DEVICE
PCT/JP2020/026232 WO2022003963A1 (en) 2020-07-03 2020-07-03 Data generation method, data generation program, and information-processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/026232 WO2022003963A1 (en) 2020-07-03 2020-07-03 Data generation method, data generation program, and information-processing device

Publications (1)

Publication Number Publication Date
WO2022003963A1 (en) 2022-01-06

Family

ID=79314967

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/026232 WO2022003963A1 (en) 2020-07-03 2020-07-03 Data generation method, data generation program, and information-processing device

Country Status (2)

Country Link
JP (1) JP7318814B2 (en)
WO (1) WO2022003963A1 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018129009A (en) * 2017-02-10 2018-08-16 日本電信電話株式会社 Image compositing device, image compositing method, and computer program
WO2018211571A1 (en) * 2017-05-15 2018-11-22 富士通株式会社 Performance display program, performance display method, and performance display device
WO2019043928A1 (en) * 2017-09-01 2019-03-07 富士通株式会社 Practice support program, practice support method and practice support system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024024055A1 (en) * 2022-07-28 2024-02-01 富士通株式会社 Information processing method, device, and program
CN115311314A (en) * 2022-10-13 2022-11-08 深圳市华汉伟业科技有限公司 Resampling method, system and storage medium for line laser contour data
CN115311314B (en) * 2022-10-13 2023-02-17 深圳市华汉伟业科技有限公司 Resampling method, system and storage medium for line laser contour data

Also Published As

Publication number Publication date
JPWO2022003963A1 (en) 2022-01-06
JP7318814B2 (en) 2023-08-01

Similar Documents

Publication Publication Date Title
Zheng et al. Hybridfusion: Real-time performance capture using a single depth sensor and sparse imus
WO2020054442A1 (en) Articulation position acquisition method and device, and motion acquisition method and device
JP7410499B2 (en) Digital twin modeling method and system for remote control environment of assembly robots
Shiratori et al. Motion capture from body-mounted cameras
CN111402290B (en) Action restoration method and device based on skeleton key points
KR101591779B1 (en) Apparatus and method for generating skeleton model using motion data and image data
KR101640039B1 (en) Image processing apparatus and method
Wang et al. Video-based hand manipulation capture through composite motion control
KR101616926B1 (en) Image processing apparatus and method
KR101812379B1 (en) Method and apparatus for estimating a pose
Zhang et al. Leveraging depth cameras and wearable pressure sensors for full-body kinematics and dynamics capture
JP7367764B2 (en) Skeleton recognition method, skeleton recognition program, and information processing device
EP3382648A1 (en) Three-dimensional model generating system, three-dimensional model generating method, and program
JP7164045B2 (en) Skeleton Recognition Method, Skeleton Recognition Program and Skeleton Recognition System
JP2023502795A (en) A real-time system for generating 4D spatio-temporal models of real-world environments
JP2019079487A (en) Parameter optimization device, parameter optimization method and program
CN109934847A (en) The method and apparatus of weak texture three-dimension object Attitude estimation
WO2022003963A1 (en) Data generation method, data generation program, and information-processing device
CN113449570A (en) Image processing method and device
JP2021060868A (en) Information processing apparatus, information processing method, and program
Zou et al. Automatic reconstruction of 3D human motion pose from uncalibrated monocular video sequences based on markerless human motion tracking
JP5503510B2 (en) Posture estimation apparatus and posture estimation program
CN113284192A (en) Motion capture method and device, electronic equipment and mechanical arm control system
Xu Single-view and multi-view methods in marker-less 3d human motion capture
JP2024501161A (en) 3D localization of objects in images or videos

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20943599
    Country of ref document: EP
    Kind code of ref document: A1
ENP Entry into the national phase
    Ref document number: 2022533003
    Country of ref document: JP
    Kind code of ref document: A
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20943599
    Country of ref document: EP
    Kind code of ref document: A1