WO2022003963A1 - Data generation method, data generation program, and information-processing device - Google Patents

Data generation method, data generation program, and information-processing device

Info

Publication number
WO2022003963A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
dimensional
moving image
image data
skeleton
Prior art date
Application number
PCT/JP2020/026232
Other languages
French (fr)
Japanese (ja)
Inventor
創輔 山尾
Original Assignee
富士通株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 富士通株式会社 filed Critical 富士通株式会社
Priority to JP2022533003A priority Critical patent/JP7318814B2/en
Priority to PCT/JP2020/026232 priority patent/WO2022003963A1/en
Publication of WO2022003963A1 publication Critical patent/WO2022003963A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • the present invention relates to a data generation method, a data generation program, and an information processing apparatus.
  • skeleton recognition that detects three-dimensional human movement in various sports is performed using two-dimensional images such as color images. For example, a method of calculating representative three-dimensional joint coordinates from a plurality of two-dimensional joint coordinates by triangulation is used. In recent years, to improve the accuracy of skeleton recognition, methods that estimate three-dimensional joint coordinates from the two-dimensional joint coordinates of multiple viewpoints using an estimation model generated by machine learning are also known.
  • the above estimation model is generated by machine learning using learning data that includes two-dimensional images and three-dimensional skeleton positions, but applying it to skeleton recognition of complicated movements such as gymnastics requires a very large amount of learning data to achieve sufficient estimation accuracy. However, such learning data is generally generated manually, which is inaccurate and inefficient; as a result, the accuracy of three-dimensional skeleton recognition using machine learning degrades and its cost increases.
  • a learning data set could also be generated by synthesizing CG (Computer Graphics) images in many variations, changing textures and rendering conditions, using a human body CG model posed according to three-dimensional skeleton information acquired by motion capture or the like; however, when athletes wear uniforms or use multiple pieces of apparatus, as in gymnastics, it is difficult to simulate these at a quality close to actual camera images.
  • One aspect is to provide a data generation method, a data generation program, and an information processing apparatus capable of automatically generating a learning data set including a two-dimensional image and a three-dimensional skeleton position.
  • the computer executes a process of acquiring the two-dimensional coordinates of each of the plurality of joints from each of the plurality of frames included in the moving image data of the subject performing the predetermined operation.
  • the computer identifies each of the plurality of three-dimensional skeleton data corresponding to each of the plurality of frames from the three-dimensional series data including the three-dimensional skeleton data relating to the plurality of joint positions of the subject performing the predetermined operation. Execute the process to be performed.
  • the computer optimizes an adjustment amount for the time synchronization between the moving image data and the three-dimensional series data, and a projection parameter used when projecting the three-dimensional series data onto the moving image data, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of three-dimensional skeleton data.
  • a computer executes a process of generating data in which the moving image data and the three-dimensional series data are associated with each other by using the optimized adjustment amount and the projection parameter.
  • a learning data set including a two-dimensional image and a three-dimensional skeleton position can be automatically generated.
  • FIG. 1 is a diagram illustrating an example of an overall configuration of a system using skeleton recognition.
  • FIG. 2 is a diagram illustrating the manual generation of learning data.
  • FIG. 3 is a functional block diagram showing a functional configuration of the generator according to the first embodiment.
  • FIG. 4 is a diagram showing an example of moving image data.
  • FIG. 5 is a diagram illustrating an example of three-dimensional skeleton series data.
  • FIG. 6 is a diagram illustrating acquisition of two-dimensional joint coordinates.
  • FIG. 7 is a diagram illustrating camera calibration.
  • FIG. 8 is a diagram illustrating time synchronization and resampling.
  • FIG. 9 is a diagram illustrating the estimation of the initial parameters.
  • FIG. 10 is a diagram illustrating parameter optimization.
  • FIG. 11 is a diagram illustrating the result of optimization.
  • FIG. 12 is a diagram showing an example of the generated learning data.
  • FIG. 13 is a flowchart showing the flow of the learning data generation process.
  • FIG. 14 is a flowchart showing the flow of processing from the estimation of the initial value to the optimization.
  • FIG. 15 is a diagram illustrating a processing example of skeleton recognition.
  • FIG. 16 is a diagram illustrating a processing example of skeleton recognition.
  • FIG. 17 is a diagram illustrating a processing example of skeleton recognition.
  • FIG. 18 is a diagram illustrating a hardware configuration example.
  • FIG. 1 is a diagram illustrating an example of an overall configuration of a system using skeleton recognition.
  • as shown in FIG. 1, this system has a 3D (Three-Dimensional) laser sensor 5, a generation device 10, a learning device 40, a recognition device 50, and a scoring device 90; it captures three-dimensional data of the performer 1, who is the subject, recognizes the skeleton and the like, and scores techniques accurately.
  • in this embodiment, gymnastics is described as an example, but the present invention is not limited to this; it can also be applied to other competitions in which an athlete performs a series of techniques and a referee scores them, and to various human actions and movements. Further, in this embodiment, two dimensions may be written as 2D and three dimensions as 3D.
  • the current scoring method in gymnastics is visually performed by a plurality of graders, but with the sophistication of techniques, it is becoming more difficult for the graders to visually score.
  • an automatic scoring system and a scoring support system for scoring competitions using a 3D laser sensor 5 are known.
  • in these systems, a distance image, which is three-dimensional data of the athlete, is acquired by the 3D laser sensor 5, and the skeleton, such as the orientation and angle of each of the athlete's joints, is recognized from the distance image.
  • in the scoring support system, the result of skeleton recognition is displayed as a 3D model, supporting the grader in scoring more correctly by checking the performer's detailed posture.
  • in the automatic scoring system, the performed technique is recognized from the result of skeleton recognition, and scoring is performed according to the scoring rules.
  • in the scoring support system and the automatic scoring system, performances carried out at any time must be supported or scored automatically in a timely manner.
  • usually, a method of recognizing a performer's three-dimensional skeleton from a distance image or a color image incurs long processing times and reduced skeleton-recognition accuracy due to insufficient memory and the like.
  • for example, in a configuration in which the result of automatic scoring by the automatic scoring system is provided to the grader, who compares it with his or her own scoring result, conventional techniques delay the provision of information to the grader.
  • furthermore, when the accuracy of skeleton recognition decreases, the subsequent technique recognition may be erroneous, and as a result the score determined for the technique will also be wrong.
  • similarly, in the scoring support system, when the angles and positions of the performer's joints are displayed using a 3D model, the display may be delayed or the displayed angles may be incorrect; in that case, scoring by a grader using the scoring support system may be erroneous.
  • the 3D laser sensor 5 is an example of a sensor device that measures (senses) the distance to an object for each pixel using an infrared laser or the like.
  • the distance image includes the distance to each pixel. That is, the distance image is a depth image showing the depth of the subject as seen from the 3D laser sensor (depth sensor) 5.
  • the learning device 40 is an example of a computer device that learns a machine learning model for skeleton recognition. Specifically, the learning device 40 generates a machine learning model by executing machine learning such as deep learning using two-dimensional skeletal position information, three-dimensional skeletal position information, and the like as a learning data set.
  • the recognition device 50 is an example of a computer device that recognizes the skeleton, such as the orientation and position of each joint of the performer 1, using the distance image measured by the 3D laser sensor 5. Specifically, the recognition device 50 inputs the distance image measured by the 3D laser sensor 5 into the trained machine learning model produced by the learning device 40 and recognizes the skeleton based on the output of the model. After that, the recognition device 50 outputs the recognized skeleton to the scoring device 90. For example, the information obtained as a result of skeleton recognition is information on the three-dimensional position of each joint.
  • the scoring device 90 identifies the transition of the movement obtained from the position and orientation of each joint of the performer by using the recognition result information input by the recognition device 50, and identifies and scores the technique performed by the performer. This is an example of a computer device.
  • to generate a large amount of learning data, three-dimensional skeleton information and the like are newly collected using laser methods, image methods, and the like.
  • in the laser method using a 3D laser sensor, the laser is emitted about two million times per second, and the depth of each irradiated point, including points on the target person, is obtained from the travel time (Time of Flight: ToF) of each laser pulse.
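  • for reference, the depth of each irradiated point follows directly from the measured round-trip time; with c the speed of light and \Delta t the measured time of flight, in LaTeX notation:

      d = \frac{c \, \Delta t}{2}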
  • although this laser method can acquire highly accurate depth data, it requires complicated configurations and processing such as laser scanning and ToF measurement, and the hardware is complex and expensive, so it is difficult to use for general purposes.
  • in the image method, an inexpensive RGB camera that acquires RGB data for each pixel with a CMOS (Complementary Metal Oxide Semiconductor) imager can be used, and with recent improvements in deep learning technology, the accuracy of three-dimensional skeleton recognition has also been improving.
  • however, when machine learning is executed using an existing data set such as Total Capture, which contains only general postures, it is not possible to generate a machine learning model that can recognize complicated postures such as those in gymnastics.
  • since a machine learning model that can recognize complicated postures cannot be generated by commonly used methods, learning data for learning complicated postures is generated manually.
  • FIG. 2 is a diagram illustrating the manual generation of learning data.
  • as shown in FIG. 2, time synchronization is performed visually between the moving image data capturing a series of exercises of the performer 1 and the three-dimensional skeleton series data containing the three-dimensional skeleton data of a past performance of the performer 1, and a frame in which the moving image data and the three-dimensional skeleton data are synchronized is searched for (S1). Then, using approximate camera parameters defined as prior knowledge, the 3D skeleton data is projected onto the image (S2), and the camera parameter values are manually adjusted so that the person silhouette in the moving image data and the 3D skeleton data overlap (S3). After that, the three-dimensional skeleton series data is resampled at the frame rate of the moving image data and projected onto the moving image data, creating training data over the entire moving image sequence (S4).
  • in contrast, the first embodiment performs spatially and temporally optimal automatic alignment between the moving image data acquired by a camera and the three-dimensional skeleton series data acquired by 3D sensing or the like for the same motion, and provides a technique for generating high-quality training data efficiently.
  • specifically, the generation device 10 acquires the two-dimensional coordinates of each of a plurality of joints from each of the plurality of frames included in the two-dimensional moving image data of the performer 1 giving a performance. Subsequently, the generation device 10 identifies each of the plurality of three-dimensional skeleton data corresponding to each of the plurality of frames from the three-dimensional skeleton series data on the plurality of joint positions of the performer 1. Then, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of three-dimensional skeleton data, the generation device 10 optimizes the adjustment amount for the time synchronization between the moving image data and the three-dimensional skeleton series data, together with the camera parameters used when projecting the three-dimensional skeleton series data onto the moving image data. After that, the generation device 10 generates data in which the moving image data and the three-dimensional skeleton series data are associated with each other, using the optimized adjustment amount and camera parameters.
  • the generator 10 performs camera calibration and time synchronization based on non-linear optimization so that the geometrical consistency between the 3D skeletal sequence data and the 2D joint coordinates is maximized.
  • spatially and temporally optimal automatic alignment is performed.
  • as a result, the generation device 10 can automatically generate a training data set, including two-dimensional images and three-dimensional skeleton positions, that can be used to train a machine learning model that estimates three-dimensional skeleton positions with high accuracy.
  • FIG. 3 is a functional block diagram showing a functional configuration of the generator 10 according to the first embodiment.
  • the generation device 10 includes a communication unit 11, a storage unit 12, and a control unit 20.
  • the communication unit 11 is a processing unit that controls communication with other devices, and is realized by, for example, a communication interface.
  • for example, the communication unit 11 receives the moving image data of the performer 1 captured with a camera or the like, and receives the three-dimensional skeleton series data of the performer 1 captured with the 3D laser sensor 5.
  • the storage unit 12 is a processing unit that stores various data, programs executed by the control unit 20, and the like, and is realized by, for example, a memory or a hard disk.
  • the storage unit 12 stores the moving image data 13, the three-dimensional skeleton sequence data 14, and the learning data set 15.
  • the moving image data 13 is a series of moving image data taken by a camera or the like when the performer 1 performs, and is composed of a plurality of frames.
  • FIG. 4 is a diagram showing an example of moving image data 13.
  • FIG. 4 shows, as an example, one frame of the moving image data 13 captured during a pommel horse performance.
  • the coordinate system of the moving image data 13 is determined by the position, orientation, and resolution of the camera, and its time system by the camera-specific time stamps and sampling rate.
  • the three-dimensional skeleton series data 14 is series data including three-dimensional skeleton data showing three-dimensional joint coordinates related to a plurality of joint positions.
  • the three-dimensional skeleton sequence data 14 is a series of three-dimensional skeleton data captured by a 3D laser sensor or the like when the performer 1 performs.
  • the three-dimensional skeleton data includes information on the three-dimensional skeleton position (skeleton information) of each joint when the performer 1 is performing.
  • FIG. 5 is a diagram illustrating an example of three-dimensional skeleton series data 14.
  • FIG. 5 illustrates one piece of three-dimensional skeleton data in the three-dimensional skeleton series data 14 generated from a performance of the performer 1.
  • the three-dimensional skeleton series data 14 is data acquired by the 3D sensing technique, and is data including three-dimensional coordinates of each joint.
  • each joint is, for example, 18 joints designated in advance such as the right shoulder, the left shoulder, and the right ankle, or a plurality of joints arbitrarily set by the user.
  • the coordinate system of the three-dimensional skeleton series data 14 is determined by the position and orientation of the sensor, and its time system by the sensor-specific time stamps and sampling rate.
  • the learning data set 15 is a database that is generated by the control unit 20 described later and includes a plurality of training data used for generating a machine learning model.
  • the learning data set 15 is information in which the moving image data 13 is associated with the three-dimensional skeleton data and the camera parameters.
  • the control unit 20 is a processing unit that controls the entire generation device 10, and is realized by, for example, a processor.
  • the control unit 20 has a data acquisition unit 21, a coordinate acquisition unit 22, and a learning data generation unit 23.
  • the data acquisition unit 21, the coordinate acquisition unit 22, and the learning data generation unit 23 are realized by an electronic circuit possessed by the processor, a process executed by the processor, or the like.
  • the data acquisition unit 21 is a processing unit that acquires moving image data 13 and three-dimensional skeleton series data 14 and stores them in the storage unit 12.
  • for example, the data acquisition unit 21 can acquire the moving image data 13 from the camera, or can read moving image data 13 captured by a previously known method from its storage destination and store it in the storage unit 12.
  • similarly, the data acquisition unit 21 can acquire the three-dimensional skeleton series data 14 from the 3D laser sensor, or can read three-dimensional skeleton series data 14 captured by a previously known method from its storage destination and store it in the storage unit 12.
  • the coordinate acquisition unit 22 is a processing unit that acquires two-dimensional coordinates, namely the two-dimensional joint coordinates of each of the plurality of joints, from each of the plurality of frames included in the moving image data 13. Specifically, the coordinate acquisition unit 22 selects several appropriate frames (for example, 10 frames) from the moving image data and automatically or manually acquires the two-dimensional joint positions of the target person in each frame.
  • FIG. 6 is a diagram illustrating acquisition of two-dimensional joint coordinates.
  • as shown in FIG. 6, the coordinate acquisition unit 22 selects a frame from the moving image data 13 and sets eight pre-designated joints in the selected frame as annotation targets. Then, using an existing model or the like, the coordinate acquisition unit 22 acquires the two-dimensional coordinates indicating the joint positions of (1) the right elbow, (2) the right wrist, (3) the left elbow, (4) the left wrist, (5) the right knee, (6) the right ankle, (7) the left knee, and (8) the left ankle.
  • in this way, the two-dimensional joint coordinates can be obtained by automatic annotation using an existing two-dimensional skeleton recognition method, or by visual or manual annotation; a sketch of this step follows.
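  • as a minimal sketch of this acquisition step (Python; OpenCV is assumed for video access, and estimate_2d_joints is a hypothetical stand-in for whichever existing 2D skeleton recognition model is used; only the frame selection and the eight-joint layout follow the description above):

      import cv2
      import numpy as np

      # The eight pre-designated annotation targets from FIG. 6.
      JOINTS = ["r_elbow", "r_wrist", "l_elbow", "l_wrist",
                "r_knee", "r_ankle", "l_knee", "l_ankle"]

      def estimate_2d_joints(frame):
          """Hypothetical stand-in for an existing 2D skeleton recognition
          model; returns an (8, 2) array of pixel coordinates for JOINTS."""
          raise NotImplementedError

      def acquire_2d_coordinates(video_path, num_frames=10):
          cap = cv2.VideoCapture(video_path)
          total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
          # Spread the selected frames over the whole clip.
          picks = np.linspace(0, total - 1, num_frames, dtype=int)
          annotations = {}
          for idx in picks:
              cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
              ok, frame = cap.read()
              if ok:
                  annotations[int(idx)] = estimate_2d_joints(frame)
          cap.release()
          return annotations  # frame index -> (8, 2) joint coordinates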
  • the learning data generation unit 23 has an initial setting unit 24, an optimization unit 25, and an output unit 26, and is a processing unit that generates the learning data set 15 in which the moving image data 13 is associated with the three-dimensional skeleton data. Specifically, since both the three-dimensional skeleton series data 14 and the moving image data 13 relate to the same performance, projecting the three-dimensional skeleton series data 14 onto the moving image data 13 makes it possible to generate a large number of images annotated with two-dimensional or three-dimensional joint coordinates for complicated postures, without acquiring new data.
  • however, the three-dimensional skeleton series data 14 and the moving image data 13 are based on different coordinate systems and time systems. Therefore, in order to project the three-dimensional skeleton series data 14 onto the moving image data 13, the learning data generation unit 23 performs camera calibration, which corresponds to spatial alignment, and time synchronization and resampling, which correspond to temporal alignment.
  • FIG. 7 is a diagram illustrating camera calibration.
  • to project a 3D point onto an image, two sets of parameters are required for the camera that captured the image: camera-specific internal parameters such as focal length and resolution, and external parameters describing the camera position and orientation in the coordinate system (world coordinate system) that serves as the reference for the 3D points.
  • the process of obtaining these parameters (camera parameters) is called camera calibration.
  • [X, Y, Z] indicates the coordinates of the 3D point that is the projection source
  • [x, y] indicates the coordinates of the projection point on the image that is the projection destination.
  • K is the camera's internal parameter, a 3 × 3 intrinsic matrix.
  • R is a camera external parameter, a 3 × 3 rotation matrix.
  • t is a 3 × 1 translation vector. Of these, R and t are the targets of camera calibration.
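  • putting these parameters together, the projection follows the standard pinhole camera model (a well-known relation stated here for reference, not quoted from the patent text), with s an arbitrary scale factor; in LaTeX notation:

      s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = K \, [\, R \mid t \,] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}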
  • FIG. 8 is a diagram illustrating time synchronization and resampling.
  • as shown in FIG. 8, since the moving image data 13 and the three-dimensional skeleton series data 14 are acquired in different time systems, the two data sets are first synchronized as a whole.
  • by then resampling the three-dimensional skeleton series data 14 at each frame time of the moving image data 13, three-dimensional skeleton data synchronized with the moving image data 13 can be acquired.
  • that is, time synchronization defines the conversion between the time systems of the moving image data 13 and the three-dimensional skeleton series data 14, and resampling interpolates the three-dimensional skeleton data at the time of each frame of the moving image data 13.
  • here, let t denote the real-world time system, t_s the time system of the three-dimensional skeleton series data 14, and t_v the time system of the moving image data 13.
  • the three-dimensional skeleton series data 14 is sampled with period T_s, yielding three-dimensional skeleton data at times t_{s,0}, t_{s,1}, t_{s,2}, and so on.
  • the moving image data 13 is sampled with period T_v, yielding frames at times t_{v,0}, t_{v,1}, t_{v,2}, and so on.
  • the difference between t_{s,0}, the head of the three-dimensional skeleton series data 14, and t_{v,0}, the head of the moving image data 13, is the time shift amount T_{v,s}.
  • the three-dimensional skeleton data corresponding to a frame time t_{v,j} is obtained by converting t_{v,j} with the time-system conversion formula and interpolating the three-dimensional skeleton data sampled within a predetermined range around the converted time; a resampling sketch follows.
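  • a minimal resampling sketch under these definitions (Python/NumPy; the skeleton series is assumed to be stored as an (N, J, 3) array sampled uniformly with period T_s, and linear interpolation stands in for the neighborhood-based interpolation described above):

      import numpy as np

      def resample_skeleton(skel, t_s0, T_s, frame_times, shift):
          """skel: (N, J, 3) 3D skeleton series sampled at t_s0 + k*T_s.
          frame_times: video frame times t_{v,j}; shift: time shift T_{v,s}.
          Returns (len(frame_times), J, 3) skeleton data at each frame time."""
          # Convert each frame time into the skeleton time system.
          ts = np.asarray(frame_times) + shift
          sample_times = t_s0 + T_s * np.arange(skel.shape[0])
          out = np.empty((len(ts), skel.shape[1], 3))
          for j in range(skel.shape[1]):    # per joint
              for c in range(3):            # per coordinate axis
                  out[:, j, c] = np.interp(ts, sample_times, skel[:, j, c])
          return out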
  • the learning data generation unit 23 executes "camera calibration” and "time synchronization and resampling” in order to project the three-dimensional skeleton sequence data 14 onto the moving image data 13.
  • at this time, the learning data generation unit 23 performs camera calibration and time synchronization based on non-linear optimization so that the geometrical consistency between the three-dimensional skeleton series data 14 and the moving image data 13 is maximized.
  • to do so, it defines a cost function that implements camera calibration and time synchronization at the same time.
  • the learning data generation unit 23 then calculates the optimal time synchronization and camera calibration by optimizing this cost function.
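  • the patent calls this cost function Equation 1 without reproducing it in this text, so the following LaTeX form is an assumption consistent with the reprojection-error-based consistency described above, summing over annotated frames i and joints j:

      C(\Delta t, R, t) = \sum_{i} \sum_{j} \left\| x_{i,j} - \pi\left( K, R, t, \; f_j(t_{v,i} + \Delta t) \right) \right\|^{2}

    here x_{i,j} is the observed two-dimensional coordinate of joint j in frame i, f_j is the continuous interpolation of joint j's three-dimensional trajectory, and \pi is the pinhole projection shown after FIG. 7.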
  • the initial setting unit 24 is a processing unit that sets the initial parameters of the cost function. Specifically, for each synchronization pattern obtained by varying the time synchronization, the initial setting unit 24 uses each of the plurality of frames associated by resampling and each of the plurality of three-dimensional skeleton data to calculate camera parameters, by solving an estimation problem that estimates the position and orientation of the camera when projecting the three-dimensional skeleton series data 14 onto the moving image data 13. Then, using the camera parameters calculated for each synchronization pattern, the initial setting unit 24 calculates a likelihood indicating the validity of each synchronization pattern's time synchronization and camera parameters. After that, the initial setting unit 24 sets the time synchronization and camera parameters of the synchronization pattern with the highest likelihood as the initial values.
  • FIG. 9 is a diagram illustrating the estimation of the initial parameters.
  • as shown in FIG. 9, while varying the time synchronization appropriately, the initial setting unit 24 obtains by resampling the three-dimensional skeleton data corresponding to the frames of the moving image data 13 for which the two-dimensional joint coordinates were acquired. Then, the initial setting unit 24 estimates the camera parameters by solving the PnP (Perspective-n-Point) problem on the correspondence between the two-dimensional joint positions and the three-dimensional skeleton data.
  • further, the initial setting unit 24 quantitatively calculates the geometrical consistency between the three-dimensional skeleton series data 14 and the two-dimensional joint coordinates as a likelihood indicating the validity of the obtained camera parameters and time synchronization. For example, the initial setting unit 24 calculates as the likelihood the ratio of joints whose reprojection error of the three-dimensional joint coordinates in the resampled three-dimensional skeleton data is below a threshold value. Then, among all trials of the time synchronization, the initial setting unit 24 adopts the time synchronization and camera parameters at which the likelihood is maximal as the quasi-optimal solution, and outputs them to the optimization unit 25 as the initial values.
  • FIG. 9 shows an example in which the initial setting unit 24 executes trial i-1, trial i, and trial i+1, corresponding to synchronization patterns in which the first frame of the moving image data 13 is shifted at regular intervals, to calculate the quasi-optimal solution. Taking trial i-1 as an example, for the frame of the moving image data 13 corresponding to time t_{s,0}, at which the two-dimensional coordinates were acquired, the initial setting unit 24 generates data A1 as the result of resampling using the three-dimensional skeleton data between time t_{s,0} and time t_{s,1}.
  • similarly, the initial setting unit 24 generates data A3 by resampling the frame for which the two-dimensional coordinates were acquired, using the three-dimensional skeleton data between time t_{s,2} and time t_{s,3}, and generates data A4 by resampling using the three-dimensional skeleton data between time t_{s,3} and time t_{s,4}.
  • the initial setting unit 24 estimates the camera parameters by solving the PnP problem using the data A1, the data A3, and the data A4 including the three-dimensional skeleton data and the two-dimensional coordinates.
  • the initial setting unit 24 projects the three-dimensional skeleton data on the frame of the moving image data 13 using the estimated camera parameters.
  • then, for each joint in the frame, the initial setting unit 24 calculates the distance between the annotated two-dimensional coordinates of the joint and the projected two-dimensional coordinates of the corresponding joint in the three-dimensional skeleton data.
  • the initial setting unit 24 calculates the ratio of joints whose distance is less than the threshold value as the likelihood.
  • the likelihood can be calculated using the distance of each joint when the three-dimensional skeleton data is projected onto one or a plurality of frames.
  • the initial setting unit 24 executes the above processing for each of trial i-1, trial i, and trial i+1 and calculates the likelihood. Then, the initial setting unit 24 determines the time synchronization and camera parameters of the trial with the highest likelihood, here trial i, as the quasi-optimal solution (initial values); a sketch of one trial follows.
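  • a sketch of a single trial (Python; OpenCV's solvePnP is used for the PnP step, and the below-threshold joint ratio is the likelihood; the intrinsic matrix K and the 10-pixel threshold are assumptions, since the text calibrates only R and t and does not state a threshold value):

      import cv2
      import numpy as np

      def trial_likelihood(pts3d, pts2d, K, threshold_px=10.0):
          """pts3d: (M, 3) resampled 3D joint coordinates; pts2d: (M, 2)
          annotated 2D joint coordinates for the same joints.
          Returns (likelihood, rvec, tvec) for this synchronization trial."""
          ok, rvec, tvec = cv2.solvePnP(
              pts3d.astype(np.float64), pts2d.astype(np.float64), K, None)
          if not ok:
              return 0.0, None, None
          proj, _ = cv2.projectPoints(
              pts3d.astype(np.float64), rvec, tvec, K, None)
          # Reprojection error per joint, then the ratio under the threshold.
          err = np.linalg.norm(proj.reshape(-1, 2) - pts2d, axis=1)
          likelihood = float(np.mean(err < threshold_px))
          return likelihood, rvec, tvec

    the trial (time shift) with the highest returned likelihood is then taken as the quasi-optimal solution.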
  • the optimization unit 25 is a processing unit that optimizes the cost function over the time-synchronization adjustment amount and the camera parameters, using the quasi-optimal solution calculated by the initial setting unit 24 as the initial value. Specifically, the optimization unit 25 defines the cost function C (Equation 1) over the time-synchronization adjustment amount Δt and the camera parameters, applies non-linear optimization with the quasi-optimal solution as the initial value, and minimizes the cost function. At this time, the optimization unit 25 expresses the three-dimensional skeleton series data, which is discrete data, as a continuous, differentiable function f(t) for each joint, and incorporates it into the cost function C as the resampling process of the three-dimensional skeleton data. As f(t), third-order spline interpolation or the like can be applied.
  • FIG. 10 is a diagram illustrating parameter optimization.
  • the optimization unit 25 sets the quasi-optimal solution to the initial value of the cost function C, and repeats optimization and resampling to calculate the optimum value of each parameter.
  • for example, the optimization unit 25 optimizes the cost function using data B1, in which the two-dimensional coordinates and the three-dimensional skeleton data are associated; when the next data B2 is to be used, the above resampling is performed to generate data B2, and the cost function is then optimized using data B2. By doing so, the optimization unit 25 can optimize Δt and the camera parameters at the same time; a sketch of this step follows.
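  • a hedged sketch of this simultaneous optimization (Python; SciPy's least_squares over [Δt, rvec, tvec] with a per-joint cubic spline as f(t); the residual, the per-joint reprojection error, is one plausible reading of Equation 1, which is not reproduced in this text):

      import cv2
      import numpy as np
      from scipy.interpolate import CubicSpline
      from scipy.optimize import least_squares

      def optimize_sync_and_pose(sample_times, skel, frame_times, pts2d, K,
                                 dt0, rvec0, tvec0):
          """skel: (N, J, 3) skeleton series at sample_times; pts2d: (F, J, 2)
          annotated 2D joints at frame_times. dt0, rvec0, tvec0 come from the
          quasi-optimal solution of the initial setting unit."""
          n = skel.shape[0]
          # f(t): differentiable per-joint trajectories (third-order spline),
          # built into the cost as the resampling step.
          f = CubicSpline(sample_times, skel.reshape(n, -1))

          def residual(params):
              dt, rvec, tvec = params[0], params[1:4], params[4:7]
              pts3d = f(np.asarray(frame_times) + dt).reshape(-1, 3)
              proj, _ = cv2.projectPoints(pts3d, rvec, tvec, K, None)
              return (proj.reshape(-1, 2) - pts2d.reshape(-1, 2)).ravel()

          x0 = np.concatenate([[dt0], np.ravel(rvec0), np.ravel(tvec0)])
          res = least_squares(residual, x0)  # non-linear minimization
          return res.x[0], res.x[1:4], res.x[4:7]  # dt, rvec, tvec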
  • FIG. 11 is a diagram illustrating the result of optimization.
  • the initial setting unit 24 sets initial parameters for the frame (image) in which the two-dimensional coordinates of the joints (1) to (8) are acquired.
  • the optimization unit 25 executes the optimization to improve the geometrical consistency, so that the athlete's body on the image and the resampled three-dimensional skeleton data match.
  • the optimization unit 25 outputs the optimized ⁇ t and the camera parameter to the output unit 26.
  • the output unit 26 is a processing unit that generates learning data using the optimization result from the optimization unit 25. Specifically, the output unit 26 generates images annotated with two-dimensional joint coordinates and three-dimensional joint coordinates on the basis of the optimized time-synchronization adjustment amount and camera parameters, and stores them in the learning data set 15 as training data.
  • FIG. 12 is a diagram showing an example of the generated learning data.
  • as shown in FIG. 12, the output unit 26 stores entries of the form "image, three-dimensional skeleton data, camera parameters", such as "I_1, ({X_{1,1}, Y_{1,1}, Z_{1,1}} ... {X_{1,j}, Y_{1,j}, Z_{1,j}}), (K, R, t)" and "I_2, ({X_{2,1}, Y_{2,1}, Z_{2,1}} ... {X_{2,j}, Y_{2,j}, Z_{2,j}}), (K, R, t)".
  • this indicates that the three-dimensional joint coordinates "{X_{1,1}, Y_{1,1}, Z_{1,1}} ... {X_{1,j}, Y_{1,j}, Z_{1,j}}" are associated with the two-dimensional image I_1, and that the camera parameters at the time of association are "K, R, t".
  • One camera parameter is set for a series of moving image data 13.
  • FIG. 13 is a flowchart showing the flow of the learning data generation process.
  • as shown in FIG. 13, the coordinate acquisition unit 22 reads the moving image data 13 and the three-dimensional skeleton series data 14 from the storage unit 12 (S101) and acquires the two-dimensional joint coordinates in several frames of the moving image data 13 (S102).
  • then, the learning data generation unit 23 estimates the initial values of the camera parameters and the time synchronization (S103), optimizes the camera parameters and the time synchronization (S104), and generates training data using the optimization result (S105).
  • FIG. 14 is a flowchart showing the flow of processing from the estimation of the initial value to the optimization.
  • the learning data generation unit 23 generates a candidate group for time synchronization between the moving image data 13 and the three-dimensional skeleton series data 14 (S201).
  • the learning data generation unit 23 resamples the three-dimensional skeleton data corresponding to the frame having the two-dimensional joint coordinates for each candidate for time synchronization (S202). Then, the learning data generation unit 23 estimates the camera parameters for each candidate for time synchronization by solving the PnP problem based on the correspondence between the two-dimensional joint position and the three-dimensional skeleton data (S203).
  • further, the learning data generation unit 23 calculates, for each time-synchronization candidate, the likelihood indicating the validity of that time synchronization and its camera parameters (S204), and determines the time synchronization and camera parameters at which the likelihood is maximal among the candidates as the quasi-optimal solution (S205).
  • after that, the learning data generation unit 23 defines a cost function over the time synchronization and camera parameters that incorporates the resampling of the three-dimensional skeleton data (S206), and executes non-linear optimization with the quasi-optimal solution as the initial value to minimize the cost function and obtain the optimal time synchronization and camera parameters (S207).
  • as described above, the generation device 10 simultaneously estimates quasi-optimal camera parameters and time synchronization so that the geometrical consistency between the 3D skeleton series data 14 and the 2D joint coordinates is maximized. After that, spatially and temporally optimal automatic alignment is executed by non-linear optimization with the estimation result as the initial value. Therefore, the generation device 10 can efficiently generate training data for image-based skeleton recognition using the moving image data 13 and the three-dimensional skeleton series data 14 acquired asynchronously by a camera and 3D sensing technology.
  • further, since the generation device 10 can calculate as the likelihood the ratio of joints whose per-joint reprojection error onto a frame of the moving image data is below a threshold, it can set accurate initial values. As a result, the generation device 10 can execute the optimization from an already narrowed-down state, which reduces the cost of the optimization process and shortens its processing time.
  • the target data type, cost function, machine learning model, learning data, various parameters, etc. used in the above embodiment are merely examples and can be arbitrarily changed.
  • in the above description, gymnastics has been used as an example, but the present invention is not limited to this and can be applied to other competitions in which an athlete performs a series of techniques and a referee scores them. Examples of other competitions include figure skating, rhythmic gymnastics, cheerleading, diving, karate kata, and mogul airs. Further, the invention can be applied not only to sports but also to posture detection for drivers of trucks, taxis, trains, and the like, and for pilots.
  • the training data generated from the above can be adopted in various models for skeletal recognition.
  • an application example of learning data will be described.
  • FIGS. 15 to 17 are diagrams illustrating processing examples of skeleton recognition.
  • FIG. 15 is an example of executing three-dimensional skeleton recognition by a mathematical formula after executing two-dimensional skeleton recognition.
  • in this method, a human detection model is used to detect a person in a large number of images, and a 2D detection model is used to identify a plurality of two-dimensional joint coordinates from the heat maps of the detected person's joints. Then, the three-dimensional joint coordinates are obtained algebraically from the plurality of two-dimensional joint coordinates by triangulation; a sketch of this step follows.
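  • a minimal two-view sketch of this triangulation step (Python/OpenCV; the projection matrices P1 and P2 are assumed known, and the number of views is fixed at two here only for illustration):

      import cv2
      import numpy as np

      def triangulate_joints(P1, P2, joints_cam1, joints_cam2):
          """P1, P2: (3, 4) projection matrices K[R|t] of two calibrated
          cameras. joints_cam1/2: (J, 2) 2D joint coordinates in each view.
          Returns (J, 3) 3D joint coordinates."""
          pts1 = joints_cam1.T.astype(np.float64)  # (2, J) as OpenCV expects
          pts2 = joints_cam2.T.astype(np.float64)
          X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # (4, J) homogeneous
          return (X_h[:3] / X_h[3]).T  # dehomogenize to (J, 3)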
  • FIG. 16 is an example of executing three-dimensional skeleton recognition by a model after executing two-dimensional skeleton recognition.
  • in this method, a person detection model is used to detect a person in a large number of images, and a 2D detection model is used to identify a plurality of two-dimensional joint coordinates from the heat maps of each joint of the detected person. Then, the three-dimensional joint coordinates are obtained using a 3D estimation model obtained by learning: the 3D joint coordinates are estimated from 3D voxel data that integrates the 2D heat maps of multiple viewpoints.
  • for the human detection model and the 2D detection model used in these methods, machine learning is executed using the learning data associating "image and two-dimensional coordinates" generated in the first embodiment.
  • for the 3D estimation model, machine learning is executed using the training data associating "image, 3D coordinates, and camera parameters". As a result, the accuracy of each model can be improved.
  • FIG. 17 is an example of directly executing three-dimensional skeleton recognition.
  • in this method, a human detection model is used to detect a person in an image, and a 3D detection model obtained by learning is used to estimate the three-dimensional joint coordinates of each joint of the detected person.
  • Two machine learning models are used in this method. For these models, machine learning is executed using the learning data associated with "images, 3D coordinates, and camera parameters". As a result, the accuracy of each model can be improved.
  • the coordinate acquisition unit 22 is an example of an acquisition unit
  • the learning data generation unit 23 is an example of a specifying unit, an execution unit, and a generation unit.
  • the three-dimensional skeleton series data 14 is an example of the three-dimensional series data.
  • the camera parameter is an example of the projection parameter.
  • the two-dimensional joint coordinates are coordinates when the joint positions are expressed in two dimensions, and are synonymous with the two-dimensional skeletal coordinates.
  • similarly, the three-dimensional joint coordinates are the coordinates when the joint positions are expressed in three dimensions, and are synonymous with the three-dimensional skeletal coordinates.
  • each component of each device shown in the figures is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution or integration of each device is not limited to the illustrated one; all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
  • each processing function performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware by wired logic.
  • FIG. 18 is a diagram illustrating a hardware configuration example.
  • the generation device 10 includes a communication device 10a, an HDD (Hard Disk Drive) 10b, a memory 10c, and a processor 10d. Further, the parts shown in FIG. 18 are connected to each other by a bus or the like.
  • the communication device 10a is a network interface card or the like, and communicates with other servers.
  • the HDD 10b stores a program and a DB for operating the functions shown in FIG. 3.
  • the processor 10d reads a program that executes the same processing as each processing unit shown in FIG. 3 from the HDD 10b or the like and expands the program into the memory 10c to operate a process that executes each function described in FIG. 3 or the like. For example, this process executes the same function as each processing unit of the generation device 10. Specifically, the processor 10d reads a program having the same functions as the data acquisition unit 21, the coordinate acquisition unit 22, the learning data generation unit 23, and the like from the HDD 10b and the like. Then, the processor 10d executes a process of executing the same processing as the data acquisition unit 21, the coordinate acquisition unit 22, the learning data generation unit 23, and the like.
  • the generation device 10 operates as an information processing device that executes various information processing methods by reading and executing the program. Further, the generation device 10 can realize the same function as that of the above-described embodiment by reading the program from the recording medium by the medium reader and executing the read program.
  • the program referred to in the other embodiment is not limited to being executed by the generation device 10.
  • the present invention can be similarly applied when other computers or servers execute programs, or when they execute programs in cooperation with each other.
  • This program can be distributed via networks such as the Internet.
  • further, this program can be recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a CD-ROM, an MO (Magneto-Optical disk), or a DVD (Digital Versatile Disc), and can be executed by being read from the recording medium by a computer.

Abstract

According to the present invention, a generation device obtains two-dimensional coordinates of a plurality of joints from each of a plurality of frames included in moving image data obtained by imaging a subject performing a predetermined motion. The generation device identifies a plurality of pieces of three-dimensional skeletal data corresponding to the respective frames from three-dimensional sequential data including three-dimensional skeletal data related to a plurality of joint positions of the subject performing the predetermined motion. The generation device optimizes the amount of adjustment related to time synchronization between the moving image data and the three-dimensional sequential data, and a projection parameter that is used when the three-dimensional sequential data is projected to the moving image data, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeletal data. The generation device generates data in which the moving image data is associated with the three-dimensional sequential data, using the optimized adjustment amount and projection parameter.

Description

Data generation method, data generation program, and information processing device
The present invention relates to a data generation method, a data generation program, and an information processing apparatus.
Skeleton recognition that detects three-dimensional human movement in various sports is performed using two-dimensional images such as color images. For example, a method of calculating representative three-dimensional joint coordinates from a plurality of two-dimensional joint coordinates by triangulation is used. In recent years, to improve the accuracy of skeleton recognition, methods that estimate three-dimensional joint coordinates from the two-dimensional joint coordinates of multiple viewpoints using an estimation model generated by machine learning are also known.
The above estimation model is generated by machine learning using learning data that includes two-dimensional images and three-dimensional skeleton positions, but applying it to skeleton recognition of complicated movements such as gymnastics requires a very large amount of learning data to achieve sufficient estimation accuracy. However, such learning data is generally generated manually, which is inaccurate and inefficient; as a result, the accuracy of three-dimensional skeleton recognition using machine learning degrades and its cost increases.
It is also conceivable to generate a learning data set by synthesizing CG images in many variations, changing textures and rendering conditions, using a human body CG (Computer Graphics) model posed according to three-dimensional skeleton information acquired by motion capture or the like. However, when athletes wear uniforms or use multiple pieces of apparatus, as in gymnastics, it is difficult to simulate these and synthesize CG images at a quality close to actual camera images.
One aspect aims to provide a data generation method, a data generation program, and an information processing apparatus capable of automatically generating a learning data set including two-dimensional images and three-dimensional skeleton positions.
In a first proposal, in the data generation method, a computer executes a process of acquiring the two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data capturing a subject performing a predetermined motion. The computer identifies each of a plurality of three-dimensional skeleton data corresponding to each of the plurality of frames from three-dimensional series data including three-dimensional skeleton data on a plurality of joint positions of the subject performing the predetermined motion. Using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of three-dimensional skeleton data, the computer optimizes an adjustment amount for the time synchronization between the moving image data and the three-dimensional series data, and a projection parameter used when projecting the three-dimensional series data onto the moving image data. The computer then generates data in which the moving image data and the three-dimensional series data are associated with each other, using the optimized adjustment amount and projection parameter.
According to one embodiment, a learning data set including two-dimensional images and three-dimensional skeleton positions can be automatically generated.
FIG. 1 is a diagram illustrating an example of an overall configuration of a system using skeleton recognition.
FIG. 2 is a diagram illustrating the manual generation of learning data.
FIG. 3 is a functional block diagram showing a functional configuration of the generation device according to the first embodiment.
FIG. 4 is a diagram showing an example of moving image data.
FIG. 5 is a diagram illustrating an example of three-dimensional skeleton series data.
FIG. 6 is a diagram illustrating acquisition of two-dimensional joint coordinates.
FIG. 7 is a diagram illustrating camera calibration.
FIG. 8 is a diagram illustrating time synchronization and resampling.
FIG. 9 is a diagram illustrating the estimation of the initial parameters.
FIG. 10 is a diagram illustrating parameter optimization.
FIG. 11 is a diagram illustrating the result of optimization.
FIG. 12 is a diagram showing an example of the generated learning data.
FIG. 13 is a flowchart showing the flow of the learning data generation process.
FIG. 14 is a flowchart showing the flow of processing from the estimation of the initial values to the optimization.
FIG. 15 is a diagram illustrating a processing example of skeleton recognition.
FIG. 16 is a diagram illustrating a processing example of skeleton recognition.
FIG. 17 is a diagram illustrating a processing example of skeleton recognition.
FIG. 18 is a diagram illustrating a hardware configuration example.
Hereinafter, embodiments of the data generation method, data generation program, and information processing apparatus according to the present invention will be described in detail with reference to the drawings. The present invention is not limited to these embodiments, and the embodiments can be combined as appropriate within a consistent scope.
[System configuration]
FIG. 1 is a diagram illustrating an example of an overall configuration of a system using skeleton recognition. As shown in FIG. 1, this system has a 3D (Three-Dimensional) laser sensor 5, a generation device 10, a learning device 40, a recognition device 50, and a scoring device 90; it captures three-dimensional data of the performer 1, who is the subject, recognizes the skeleton and the like, and scores techniques accurately. In this embodiment, gymnastics is described as an example, but the present invention is not limited to this; it can also be applied to other competitions in which an athlete performs a series of techniques and a referee scores them, and to various human actions and movements. Further, in this embodiment, two dimensions may be written as 2D and three dimensions as 3D.
Generally, the current scoring method in gymnastics is performed visually by a plurality of graders, but as techniques become more sophisticated, it is increasingly difficult for graders to score visually. In recent years, automatic scoring systems and scoring support systems for scored competitions using the 3D laser sensor 5 have become known. For example, in these systems, a distance image, which is three-dimensional data of the athlete, is acquired by the 3D laser sensor 5, and the skeleton, such as the orientation and angle of each of the athlete's joints, is recognized from the distance image. In the scoring support system, the result of skeleton recognition is displayed as a 3D model, supporting the grader in scoring more correctly by checking the performer's detailed posture. In the automatic scoring system, the performed technique is recognized from the result of skeleton recognition, and scoring is performed according to the scoring rules.
Here, the scoring support system and the automatic scoring system are required to support scoring or automatically score performances carried out at any time in a timely manner. Usually, methods that recognize a performer's three-dimensional skeleton from distance images or color images incur long processing times and reduced skeleton-recognition accuracy due to insufficient memory and the like.
For example, in a configuration in which the result of automatic scoring by the automatic scoring system is provided to the grader, who compares it with his or her own scoring result, conventional techniques delay the provision of information to the grader. Furthermore, when the accuracy of skeleton recognition decreases, the subsequent technique recognition may be erroneous, and as a result the score determined for the technique will also be wrong. Similarly, in the scoring support system, when the angles and positions of the performer's joints are displayed using a 3D model, the display may be delayed or the displayed angles may be incorrect. In that case, scoring by a grader using the scoring support system may be erroneous.
As described above, poor skeleton-recognition accuracy or long processing times in the automatic scoring system or scoring support system lead to scoring errors and prolonged scoring. For this reason, using a machine learning model generated by machine learning realizes highly accurate skeleton recognition and suppresses recognition errors and prolonged scoring. Regarding three-dimensional human movement detection (skeleton recognition), 3D sensing technology that extracts three-dimensional joint coordinates with high accuracy from a plurality of 3D laser sensors 5 is being established, and its deployment to other sports and other fields is expected.
Here, each device constituting the system in FIG. 1 will be described. The 3D laser sensor 5 is an example of a sensor device that measures (senses) the distance to an object for each pixel using an infrared laser or the like. A distance image contains the distance at each pixel. That is, the distance image is a depth image representing the depth of the subject as seen from the 3D laser sensor (depth sensor) 5.
The learning device 40 is an example of a computer device that trains a machine learning model for skeleton recognition. Specifically, the learning device 40 generates a machine learning model by executing machine learning such as deep learning using two-dimensional skeleton position information, three-dimensional skeleton position information, and the like as a learning data set.
The recognition device 50 is an example of a computer device that recognizes the skeleton, such as the orientation and position of each joint of the performer 1, using the distance image measured by the 3D laser sensor 5. Specifically, the recognition device 50 inputs the distance image measured by the 3D laser sensor 5 into the trained machine learning model produced by the learning device 40, and recognizes the skeleton based on the output of the machine learning model. The recognition device 50 then outputs the recognized skeleton to the scoring device 90. For example, the information obtained as a result of skeleton recognition is information on the three-dimensional position of each joint.
The scoring device 90 is an example of a computer device that uses the recognition result information input from the recognition device 50 to identify the transition of the movement obtained from the positions and orientations of the performer's joints, and identifies and scores the technique performed by the performer.
In order to perform three-dimensional skeleton recognition for complicated postures such as gymnastics with the skeleton recognition technology using the machine learning model described above, a large amount of training data on the complicated postures needs to be newly prepared and used for training.
In recent years, in order to generate a large amount of training data, three-dimensional skeleton information and the like have been newly collected using laser methods, image methods, and the like. For example, in the laser method using a 3D laser sensor, the laser is emitted approximately two million times per second, and the depth of each irradiated point, including points on the target person, is obtained based on the time of flight (ToF) of each laser pulse. The laser method can acquire highly accurate depth data, but it requires complicated configurations and processing such as laser scanning and ToF measurement, making the hardware complicated and expensive, so it is difficult to use for general purposes.
In the image method, which acquires RGB data for each pixel with a CMOS (Complementary Metal Oxide Semiconductor) imager, an inexpensive RGB camera can be used, and with recent improvements in deep learning technology, the accuracy of three-dimensional skeleton recognition is also improving. However, even if machine learning is executed using an existing data set such as Total Capture, which contains only general postures, a machine learning model that can recognize complicated postures such as gymnastics cannot be generated.
As described above, since the generally used methods cannot generate a machine learning model that can recognize complicated postures, training data for learning complicated postures has been generated manually.
FIG. 2 is a diagram illustrating the manual generation of training data. As shown in FIG. 2, moving image data capturing a series of gymnastic performances by the performer 1 and three-dimensional skeleton series data including three-dimensional skeleton data from a past performance by the performer 1 are visually time-synchronized, and a frame in which the moving image data and the three-dimensional skeleton data are synchronized is searched for (S1). Subsequently, the three-dimensional skeleton data is projected onto the image using approximate camera parameters defined as prior knowledge (S2), and the values of the camera parameters are manually adjusted so that the person silhouette in the moving image data and the three-dimensional skeleton data overlap (S3). After that, the three-dimensional skeleton series data is resampled at the frame rate of the moving image data and projected onto the moving image data, thereby creating training data over the entire moving image data (S4).
When training data is created manually in this way, the spatial and temporal alignment of the moving image data and the three-dimensional skeleton series data is performed visually and by hand, so the resulting training data lacks sufficient accuracy and, moreover, an enormous amount of work time is required.
Therefore, the first embodiment provides a technique that performs spatially and temporally optimal automatic alignment using moving image data acquired by a camera and three-dimensional skeleton series data acquired by 3D sensing or the like for the same motion, thereby generating training data efficiently and with high quality.
Specifically, the generation device 10 acquires the two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in two-dimensional moving image data capturing the performer 1. Subsequently, the generation device 10 identifies each of a plurality of pieces of three-dimensional skeleton data corresponding to each of the plurality of frames from the three-dimensional skeleton series data on a plurality of joint positions (three-dimensional skeleton data) of the performer 1. Then, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data, the generation device 10 optimizes both the adjustment amount for time synchronization between the moving image data and the three-dimensional skeleton series data and the camera parameters used when projecting the three-dimensional skeleton series data onto the moving image data. After that, the generation device 10 uses the optimized adjustment amount and camera parameters to generate data in which the moving image data and the three-dimensional skeleton series data are associated with each other.
In this way, the generation device 10 performs spatially and temporally optimal automatic alignment by simultaneously executing camera calibration and time synchronization based on non-linear optimization so that the geometric consistency between the three-dimensional skeleton series data and the two-dimensional joint coordinates is maximized. As a result, the generation device 10 can automatically generate a training data set including two-dimensional images and three-dimensional skeleton positions that can be used to train a machine learning model that estimates three-dimensional skeleton positions with high accuracy.
[Functional configuration]
FIG. 3 is a functional block diagram showing the functional configuration of the generation device 10 according to the first embodiment. As shown in FIG. 3, the generation device 10 includes a communication unit 11, a storage unit 12, and a control unit 20.
The communication unit 11 is a processing unit that controls communication with other devices and is realized by, for example, a communication interface. For example, the communication unit 11 receives the moving image data of the performer 1 captured with a camera or the like, and receives the three-dimensional skeleton series of the performer 1 captured with the 3D laser sensor 5.
The storage unit 12 is a processing unit that stores various data, programs executed by the control unit 20, and the like, and is realized by, for example, a memory or a hard disk. The storage unit 12 stores moving image data 13, three-dimensional skeleton series data 14, and a learning data set 15.
The moving image data 13 is a series of moving image data captured by a camera or the like while the performer 1 performs, and is composed of a plurality of frames. FIG. 4 is a diagram showing an example of the moving image data 13. As an example, FIG. 4 shows one frame of the moving image data 13 captured during a pommel horse performance. The coordinate system of the moving image data 13 is based on the position, orientation, and resolution of the camera, and its time system is based on the camera-specific time stamps and sampling rate.
The three-dimensional skeleton series data 14 is series data including three-dimensional skeleton data indicating the three-dimensional joint coordinates of a plurality of joint positions. Specifically, the three-dimensional skeleton series data 14 is a series of three-dimensional skeleton data captured by a 3D laser sensor or the like while the performer 1 performs. The three-dimensional skeleton data includes information on the three-dimensional skeleton position (skeleton information) of each joint while the performer 1 is performing.
FIG. 5 is a diagram illustrating an example of the three-dimensional skeleton series data 14. FIG. 5 shows one piece of three-dimensional skeleton data in the three-dimensional skeleton series data 14 generated from the performance of the performer 1. As shown in FIG. 5, the three-dimensional skeleton series data 14 is data acquired by 3D sensing technology and includes the three-dimensional coordinates of each joint. Here, the joints are, for example, 18 joints designated in advance, such as the right shoulder, left shoulder, and right ankle, or a plurality of joints arbitrarily set by the user. The coordinate system of the three-dimensional skeleton series data 14 is based on the position and orientation of the sensor, and its time system is based on the sensor-specific time stamps and sampling rate.
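As an illustration only, the two inputs can be represented by containers like the following Python sketch; the class and field names are assumptions made for this example and do not appear in the embodiment.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SkeletonFrame:
    timestamp: float      # time in the sensor's own time system t_s [s]
    joints: np.ndarray    # (J, 3) array of 3D joint coordinates, e.g. J = 18

@dataclass
class VideoFrame:
    timestamp: float      # time in the camera's own time system t_v [s]
    image: np.ndarray     # (H, W, 3) RGB frame

# the skeleton series is then simply a list of SkeletonFrame sampled at T_s,
# and the moving image data a list of VideoFrame sampled at T_v
```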
The learning data set 15 is a database containing a plurality of pieces of training data used for generating machine learning models, and is generated by the control unit 20 described later. For example, the learning data set 15 is information in which three-dimensional skeleton data and camera parameters are associated with the moving image data 13.
The control unit 20 is a processing unit that controls the entire generation device 10 and is realized by, for example, a processor. The control unit 20 includes a data acquisition unit 21, a coordinate acquisition unit 22, and a learning data generation unit 23. The data acquisition unit 21, the coordinate acquisition unit 22, and the learning data generation unit 23 are realized by an electronic circuit of the processor, a process executed by the processor, or the like.
The data acquisition unit 21 is a processing unit that acquires the moving image data 13 and the three-dimensional skeleton series data 14 and stores them in the storage unit 12. For example, the data acquisition unit 21 can acquire the moving image data 13 from a camera, or can read moving image data 13 previously captured by a known method from its storage destination and store it in the storage unit 12. Similarly, the data acquisition unit 21 can acquire the three-dimensional skeleton series data 14 from the 3D laser sensor, or can read three-dimensional skeleton series data 14 previously captured by a known method from its storage destination and store it in the storage unit 12.
The coordinate acquisition unit 22 is a processing unit that acquires, from each of the plurality of frames included in the moving image data 13, the two-dimensional coordinates that are the two-dimensional joint coordinates of each of the plurality of joints. Specifically, the coordinate acquisition unit 22 selects several suitable frames (for example, about 10) from the moving image data and acquires the two-dimensional joint positions of the target person in each frame automatically or manually.
FIG. 6 is a diagram illustrating the acquisition of two-dimensional joint coordinates. As shown in FIG. 6, the coordinate acquisition unit 22 selects a frame from the moving image data 13 and sets eight joints designated in advance in the selected frame as annotation targets. Then, using an existing model or the like, the coordinate acquisition unit 22 acquires the two-dimensional coordinates indicating the joint positions of the annotation targets: (1) right elbow, (2) right wrist, (3) left elbow, (4) left wrist, (5) right knee, (6) right ankle, (7) left knee, and (8) left ankle.
A subset of the joints in the three-dimensional skeleton series data can also be used as annotation targets. The two-dimensional joint coordinates can be acquired by automatic annotation using an existing two-dimensional skeleton recognition method, or by visual or manual annotation.
The learning data generation unit 23 includes an initial setting unit 24, an optimization unit 25, and an output unit 26, and is a processing unit that generates the learning data set 15 in which three-dimensional skeleton data is associated with the moving image data 13. Specifically, since the learning data generation unit 23 holds both the three-dimensional skeleton series data 14 and the moving image data 13 for the same performance, it can project the three-dimensional skeleton series data 14 onto the moving image data 13 and thereby generate a large number of images annotated with two-dimensional or three-dimensional joint coordinates for complicated postures, without acquiring any new data.
However, as described with reference to FIGS. 4 and 5, the three-dimensional skeleton series data 14 and the moving image data 13 are based on different coordinate systems and time systems. Therefore, in order to project the three-dimensional skeleton series data 14 onto the moving image data 13, the learning data generation unit 23 executes camera calibration, which corresponds to spatial alignment, and time synchronization and resampling, which correspond to temporal alignment.
(Camera calibration)
Here, camera calibration will be described. FIG. 7 is a diagram illustrating camera calibration. As shown in FIG. 7, obtaining the projection point of a 3D point onto an image requires camera-specific parameters such as focal length and resolution (camera intrinsic parameters) of the camera that captured the image, and the parameters of the camera's position and orientation in the coordinate system serving as the reference for the 3D point, that is, the world coordinate system (camera extrinsic parameters). The process of obtaining these parameters (camera parameters) is called camera calibration.
In the example of FIG. 7, the perspective projection of a 3D point onto the image can be expressed as w[x, y, 1]^T = K[R | t][X, Y, Z, 1]^T. Here, [X, Y, Z] are the coordinates of the 3D point being projected, and [x, y] are the coordinates of the projected point on the image. K is the camera intrinsic parameter, a 3×3 intrinsic matrix. R is a camera extrinsic parameter, a 3×3 rotation matrix, and t is a 3×1 translation vector. Of these, R and t are the targets of camera calibration.
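For illustration only, the transformation above can be written as a short Python sketch using NumPy; the function name and array shapes are assumptions made for this example.

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D world points X (N, 3) onto the image plane via
    w [x, y, 1]^T = K [R | t] [X, Y, Z, 1]^T."""
    Xc = X @ R.T + t                 # world -> camera coordinates
    uvw = Xc @ K.T                   # apply the 3x3 intrinsic matrix K
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division by w
```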
(Time synchronization and resampling)
Next, time synchronization and resampling will be described. FIG. 8 is a diagram illustrating time synchronization and resampling. As shown in FIG. 8, when the moving image data 13 and the three-dimensional skeleton series data 14 are acquired in different time systems, the time systems of the two data sets are first synchronized as a whole, and the three-dimensional skeleton series data 14 is then resampled at each frame time of the moving image data 13, so that three-dimensional skeleton data synchronized with the moving image data 13 can be acquired.
Here, time synchronization means defining the conversion between the time systems of the moving image data 13 and the three-dimensional skeleton series data 14, and resampling means interpolating the three-dimensional skeleton data at the time of each frame of the moving image data 13.
In the case of FIG. 8, the real-world time system is t, the time system of the three-dimensional skeleton series data 14 is t_s, and the time system of the moving image data 13 is t_v. In this state, the three-dimensional skeleton series data 14 is sampled at a sampling period T_s, yielding three-dimensional skeleton data at times t_{s,0}, t_{s,1}, t_{s,2}, and so on. The moving image data 13 is sampled at a sampling period T_v, yielding frames at times t_{v,0}, t_{v,1}, t_{v,2}, and so on.
The difference between the time t_{s,0} at the head of the three-dimensional skeleton series data 14 and the time t_{v,0} at the head of the moving image data 13 is the time shift amount T_{v,s}. Time synchronization can be calculated by defining the time conversion between the two time systems as τ(t_{v,j}) = t_{s,0} + T_{v,s} + j·T_v. Resampling can be executed by using this conversion to refer to the three-dimensional skeleton data within a predetermined time range around t_{v,j} and interpolating the three-dimensional skeleton data corresponding to time t_{v,j} by, for example, bilinear interpolation. For example, for the frame at time t_{v,0} of the moving image data 13, resampled three-dimensional skeleton data can be extracted by associating the three-dimensional skeleton data that lies between times t_{s,2} and t_{s,3} of the three-dimensional skeleton series data 14 and is time-synchronized at τ(t_{v,0}).
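The conversion and interpolation just described might be sketched as follows; plain linear interpolation between the two neighboring samples stands in for the interpolation mentioned above, and all identifiers are illustrative.

```python
import numpy as np

def to_skeleton_time(j, t_s0, T_vs, T_v):
    """Map video frame index j into the skeleton time system:
    tau(t_{v,j}) = t_{s,0} + T_{v,s} + j * T_v."""
    return t_s0 + T_vs + j * T_v

def resample(skel_times, skel_joints, t):
    """Interpolate the (N, J, 3) skeleton samples at time t using the
    two neighboring samples."""
    k = np.searchsorted(skel_times, t) - 1
    k = np.clip(k, 0, len(skel_times) - 2)
    a = (t - skel_times[k]) / (skel_times[k + 1] - skel_times[k])
    return (1.0 - a) * skel_joints[k] + a * skel_joints[k + 1]
```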
As described above, the learning data generation unit 23 executes camera calibration and time synchronization and resampling in order to project the three-dimensional skeleton series data 14 onto the moving image data 13. At this time, the learning data generation unit 23 defines a cost function that performs camera calibration and time synchronization simultaneously based on non-linear optimization so that the geometric consistency between the three-dimensional skeleton series data 14 and the moving image data 13 is maximized. The learning data generation unit 23 then calculates the optimal time synchronization and camera calibration by optimizing the cost function.
Returning to FIG. 3, the initial setting unit 24 is a processing unit that sets the initial parameters of the cost function. Specifically, for each synchronization pattern obtained by varying the time synchronization, the initial setting unit 24 calculates camera parameters by solving an estimation problem that estimates the position and orientation of the camera when the three-dimensional skeleton series data 14 is projected onto the moving image data 13, using each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data associated by resampling. Then, using the camera parameters calculated for each synchronization pattern, the initial setting unit 24 calculates a likelihood representing the validity of the time synchronization of each synchronization pattern and of the corresponding camera parameters. After that, the initial setting unit 24 sets, as initial values, the frames and the pieces of three-dimensional skeleton data identified for the synchronization pattern with the highest likelihood.
FIG. 9 is a diagram illustrating the estimation of the initial parameters. As shown in FIG. 9, while varying the time synchronization appropriately, the initial setting unit 24 acquires, by the resampling shown in FIG. 8, the three-dimensional skeleton data corresponding to the frames of the moving image data 13 for which two-dimensional joint coordinates have been acquired. Then, the initial setting unit 24 estimates the camera parameters by solving the PnP (Perspective-n-Point) problem on the correspondences between the two-dimensional joint positions and the three-dimensional skeleton data.
At this time, the initial setting unit 24 quantitatively computes the geometric consistency between the three-dimensional skeleton series data 14 and the two-dimensional joint coordinates as a likelihood representing the validity of the obtained camera parameters and time synchronization. For example, the initial setting unit 24 calculates, as the likelihood, the ratio of the number of joints for which the reprojection error of the three-dimensional joint coordinates in the resampled three-dimensional skeleton data is less than a threshold. Then, among all the trials of time synchronization, the initial setting unit 24 adopts the time synchronization and camera parameters at which the likelihood is maximized as a quasi-optimal solution, and outputs them to the optimization unit 25 as initial values.
In the example of FIG. 9, the initial setting unit 24 executes trial i-1, trial i, and trial i+1, which correspond to synchronization patterns in which the first frame of the moving image data 13 is shifted at regular intervals, and thereby calculates a quasi-optimal solution. Taking trial i-1 as an example, for the frame of the moving image data 13 corresponding to time t_{s,0} at which the two-dimensional coordinates were acquired by the coordinate acquisition unit 22, the initial setting unit 24 generates data A1 by resampling using the three-dimensional skeleton data between times t_{s,0} and t_{s,1}. Similarly, for the other frames in which two-dimensional coordinates were acquired, the initial setting unit 24 generates data A3 by resampling using the three-dimensional skeleton data between times t_{s,2} and t_{s,3}, and generates data A4 by resampling using the three-dimensional skeleton data between times t_{s,3} and t_{s,4}.
Subsequently, the initial setting unit 24 estimates the camera parameters by solving the PnP problem using data A1, data A3, and data A4, each of which contains three-dimensional skeleton data and two-dimensional coordinates. The initial setting unit 24 then projects the three-dimensional skeleton data onto the frames of the moving image data 13 using the estimated camera parameters. For each joint in a frame, the initial setting unit 24 calculates the distance between the annotated two-dimensional coordinates of the joint and the projected two-dimensional position of the corresponding joint in the three-dimensional skeleton data. The initial setting unit 24 then calculates, as the likelihood, the ratio of joints whose distance is less than a threshold. The likelihood can be calculated using the joint distances obtained when the three-dimensional skeleton data is projected onto one or more frames.
In this way, the initial setting unit 24 executes the above processing for each of trial i-1, trial i, and trial i+1 and calculates the likelihood for each. The initial setting unit 24 then determines the time synchronization and camera parameters of trial i, which has the highest likelihood, as the quasi-optimal solution (initial values).
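A minimal sketch of this initial estimation is shown below, assuming OpenCV's solvePnP as the PnP solver; the callback resample_3d, the pixel threshold, and the trial structure are assumptions made for the example.

```python
import cv2
import numpy as np

def estimate_initial(K, trials, pts2d, resample_3d, thresh_px=10.0):
    """For each candidate time shift, resample the 3D skeleton at the
    annotated frame times, solve PnP, and score the result by the
    fraction of joints whose reprojection error is below a threshold.
    pts2d maps frame times to (J, 2) annotated joints; resample_3d(t, shift)
    returns the matched (J, 3) skeleton."""
    best = None
    for shift in trials:
        obj = np.vstack([resample_3d(t, shift) for t in pts2d])  # (M, 3)
        img = np.vstack([pts2d[t] for t in pts2d])               # (M, 2)
        ok, rvec, tvec = cv2.solvePnP(obj.astype(np.float64),
                                      img.astype(np.float64), K, None)
        if not ok:
            continue
        proj, _ = cv2.projectPoints(obj, rvec, tvec, K, None)
        err = np.linalg.norm(proj.reshape(-1, 2) - img, axis=1)
        likelihood = float(np.mean(err < thresh_px))  # inlier-joint ratio
        if best is None or likelihood > best[0]:
            best = (likelihood, shift, rvec, tvec)
    return best  # quasi-optimal time shift and camera pose
```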
Returning to FIG. 3, the optimization unit 25 is a processing unit that optimizes the cost function relating to the adjustment amount for time synchronization and the camera parameters, using the quasi-optimal solution calculated by the initial setting unit 24 as the initial value. Specifically, the optimization unit 25 defines a cost function C (Equation (1)) relating to the time synchronization adjustment amount Δt and the camera parameters, applies non-linear optimization with the quasi-optimal solution as the initial value, and minimizes the cost function. Here, the optimization unit 25 expresses the three-dimensional skeleton series data, which is discrete data, by a continuous function f(t) that is differentiable for each joint, and incorporates it into the cost function C as the resampling process for the three-dimensional skeleton data. For f(t), cubic spline interpolation or the like can be applied.
C(Δt, π) = Σ_i Σ_t || p_{i,t} − π(f_i(t + Δt)) ||²    (1)
In Equation (1), i denotes a joint, and t is the frame time of the moving image data 13 at which the two-dimensional coordinates were acquired in the quasi-optimal solution. p_{i,t} denotes the two-dimensional joint position of joint i at time t. f_i(t) is the position of joint i at time t, that is, the resampled three-dimensional joint coordinates. π(X) is the perspective projection of the 3D point X using the camera parameters, yielding two-dimensional joint coordinates. In this cost function C, the time adjustment amount Δt and π are the optimization targets.
FIG. 10 is a diagram illustrating parameter optimization. As shown in FIG. 10, the optimization unit 25 sets the quasi-optimal solution as the initial value of the cost function C, and calculates the optimal value of each parameter by repeating optimization and resampling. For example, the optimization unit 25 optimizes the cost function using data B1, in which two-dimensional coordinates and three-dimensional skeleton data are associated; when the next data B2 is to be used, it performs the above resampling to generate data B2 and then optimizes the cost function using data B2. In this way, the optimization unit 25 can optimize Δt and the camera parameters simultaneously.
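A minimal sketch of this simultaneous optimization, assuming SciPy's least_squares as the non-linear solver and a cubic spline as f(t); the packing of the parameters into [Δt, rvec, tvec] is an illustrative choice, not the embodiment's exact parameterization.

```python
import cv2
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.optimize import least_squares

def optimize(skel_times, skel_joints, frame_times, pts2d, K, x0):
    """Jointly refine the time shift dt and the camera pose (rvec, tvec)
    by minimising C = sum_{i,t} || p_{i,t} - pi(f_i(t + dt)) ||^2.
    skel_joints is (N, J, 3); pts2d is a list of (J, 2) arrays aligned
    with frame_times; x0 = [dt, rvec, tvec] is the quasi-optimal solution."""
    f = CubicSpline(skel_times, skel_joints, axis=0)  # differentiable f(t)

    def residuals(x):
        dt, rvec, tvec = x[0], x[1:4], x[4:7]
        res = []
        for t, p2d in zip(frame_times, pts2d):
            X = f(t + dt)                              # resampled 3D joints
            proj, _ = cv2.projectPoints(X, rvec, tvec, K, None)
            res.append((proj.reshape(-1, 2) - p2d).ravel())
        return np.concatenate(res)

    sol = least_squares(residuals, x0)                 # non-linear optimisation
    return sol.x  # optimised [dt, rvec, tvec]
```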
FIG. 11 is a diagram illustrating the result of the optimization. As shown in FIG. 11, the initial setting unit 24 sets the initial parameters for a frame (image) in which the two-dimensional coordinates of joints (1) to (8) have been acquired. At this point, the geometric consistency is still insufficient, so the athlete's body in the image and the resampled three-dimensional skeleton data are misaligned. After the optimization unit 25 executes the optimization, the geometric consistency improves, and the athlete's body in the image and the resampled three-dimensional skeleton data coincide. The optimization unit 25 then outputs the optimized Δt and camera parameters to the output unit 26.
The output unit 26 is a processing unit that generates training data using the optimization result of the optimization unit 25. Specifically, given the optimized time synchronization adjustment amount and camera parameters, the output unit 26 generates images annotated with two-dimensional or three-dimensional joint coordinates and stores them in the learning data set 15 as training data.
FIG. 12 is a diagram showing an example of the generated training data. As shown in FIG. 12, the output unit 26 stores entries of the form "image, three-dimensional skeleton data, camera parameters", such as "I_1, ({X_{1,1}, Y_{1,1}, Z_{1,1}} ... {X_{1,j}, Y_{1,j}, Z_{1,j}}), (K, R, t)" and "I_2, ({X_{2,1}, Y_{2,1}, Z_{2,1}} ... {X_{2,j}, Y_{2,j}, Z_{2,j}}), (K, R, t)".
In this example, the three-dimensional joint coordinates {X_{1,1}, Y_{1,1}, Z_{1,1}} ... {X_{1,j}, Y_{1,j}, Z_{1,j}} are associated with the two-dimensional image I_1, and the camera parameters at the time of association are (K, R, t). One set of camera parameters is set for an entire series of moving image data 13.
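Under the same assumptions as the earlier sketches, emitting one such record per frame once the optimized Δt and camera pose are available could look like the following; the record layout is illustrative.

```python
def emit_records(frame_times, images, f, dt, K, R, t):
    """Build one training record per video frame: the image, the
    resampled 3D joints f(time + dt), and the single camera parameter
    set (K, R, t) shared by the whole sequence."""
    records = []
    for time, image in zip(frame_times, images):
        X = f(time + dt)  # (J, 3) resampled skeleton at this frame
        records.append({"image": image, "joints3d": X,
                        "camera": {"K": K, "R": R, "t": t}})
    return records
```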
[Processing flow]
FIG. 13 is a flowchart showing the flow of the training data generation process. As shown in FIG. 13, the coordinate acquisition unit 22 reads the moving image data 13 and the three-dimensional skeleton series data 14 from the storage unit 12 (S101), and acquires the two-dimensional joint coordinates in several frames of the moving image data 13 (S102).
Then, the learning data generation unit 23 estimates the initial values of the camera parameters and the time synchronization (S103), optimizes the camera parameters and the time synchronization (S104), and generates training data using the optimization result (S105).
Here, the details of the processing executed in S103 and S104 will be described. FIG. 14 is a flowchart showing the flow of processing from initial value estimation to optimization. As shown in FIG. 14, the learning data generation unit 23 generates a group of candidates for time synchronization between the moving image data 13 and the three-dimensional skeleton series data 14 (S201).
Subsequently, for each time synchronization candidate, the learning data generation unit 23 resamples the three-dimensional skeleton data corresponding to the frames that have two-dimensional joint coordinates (S202). Then, for each candidate, the learning data generation unit 23 estimates the camera parameters by solving the PnP problem based on the correspondences between the two-dimensional joint positions and the three-dimensional skeleton data (S203).
After that, for each time synchronization candidate, the learning data generation unit 23 calculates the likelihood indicating the validity of the time synchronization and the camera parameters (S204), and determines the time synchronization and camera parameters of the candidate with the maximum likelihood as the quasi-optimal solution (S205).
Then, the learning data generation unit 23 defines a cost function relating to the time synchronization and camera parameters that incorporates the resampling process for the three-dimensional skeleton data (S206), executes non-linear optimization with the quasi-optimal solution as the initial value to minimize the cost function, and obtains the optimal time synchronization and camera parameters (S207).
[Effect]
As described above, the generation device 10 simultaneously estimates quasi-optimal camera parameters and time synchronization so that the geometric consistency between the three-dimensional skeleton series data 14 and the two-dimensional joint coordinates is maximized, and then executes spatially and temporally optimal automatic alignment by non-linear optimization using the estimation result as the initial value. Therefore, the generation device 10 can efficiently generate training data for image-based skeleton recognition using the moving image data 13 and the three-dimensional skeleton series data 14 acquired asynchronously by a camera and 3D sensing technology.
In addition, since the generation device 10 can calculate, as the likelihood, the ratio of the number of joints whose reprojection error is less than a threshold when projected onto a frame of the moving image data, it can set accurate initial values. As a result, the generation device 10 can execute the optimization from an already narrowed-down state, which reduces the cost of the optimization process and shortens its processing time.
Although the embodiments of the present invention have been described above, the present invention may be implemented in various other forms besides the embodiments described above.
[Numerical values, etc.]
The data types, cost function, machine learning models, training data, various parameters, and the like used in the above embodiment are merely examples and can be changed arbitrarily. The above embodiment was described using gymnastics as an example, but the invention is not limited to this and can also be applied to other competitions in which an athlete performs a series of techniques and referees score them. Examples of other competitions include figure skating, rhythmic gymnastics, cheerleading, swimming dives, karate kata, and mogul aerials. Furthermore, application is not limited to sports; the invention can also be applied to posture detection of drivers of trucks, taxis, trains, and the like, and to posture detection of pilots.
[Application example]
The training data generated as described above can be employed in various models for skeleton recognition. Here, application examples of the training data will be described. FIGS. 15 to 17 are diagrams illustrating processing examples of skeleton recognition.
FIG. 15 is an example in which two-dimensional skeleton recognition is executed and three-dimensional skeleton recognition is then performed algebraically. In the example of FIG. 15, a human detection model is used to detect a person in images from multiple viewpoints, and a 2D detection model is used to identify a plurality of two-dimensional joint coordinates from the heat maps of each detected person's joints; the three-dimensional joint coordinates are then obtained algebraically from the plurality of two-dimensional joint coordinates by triangulation.
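A minimal sketch of the triangulation step, assuming the standard DLT (direct linear transform) formulation for two views; the embodiment does not specify a particular solver, so this is an illustration only.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one joint from two views.
    P1, P2 are 3x4 projection matrices K[R | t]; x1, x2 are the 2D
    joint coordinates of the same joint in each view."""
    A = np.stack([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)      # null vector of A minimises ||A X||
    X = Vt[-1]
    return X[:3] / X[3]              # homogeneous -> 3D joint position
```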
This approach uses two machine learning models: the human detection model and the 2D detection model. The accuracy of each model can be improved by executing machine learning on these two models using the training data generated in the first embodiment, in which images and two-dimensional coordinates are associated.
FIG. 16 is an example in which two-dimensional skeleton recognition is executed and three-dimensional skeleton recognition is then performed by a model. In the example of FIG. 16, a human detection model is used to detect a person in images from multiple viewpoints, and a 2D detection model is used to identify a plurality of two-dimensional joint coordinates from the heat maps of each detected person's joints. Then, three-dimensional joint coordinates are obtained from the two-dimensional joint coordinates of the multiple viewpoints using a 3D estimation model obtained by training. Alternatively, three-dimensional joint coordinates are obtained, using a trained 3D estimation model, from 3D voxel data that integrates the two-dimensional heat maps of the multiple viewpoints.
This approach uses three machine learning models. Of these, the human detection model and the 2D detection model are trained using the training data generated in the first embodiment in which images and two-dimensional coordinates are associated. The 3D estimation model is trained using the training data in which images, three-dimensional coordinates, and camera parameters are associated. As a result, the accuracy of each model can be improved.
FIG. 17 is an example in which three-dimensional skeleton recognition is executed directly. In the example of FIG. 17, a human detection model is used to detect a person in an image, and a 3D detection model obtained by training is used to estimate the three-dimensional joint coordinates of each joint of the detected person. This approach uses two machine learning models. These models are trained using the training data in which images, three-dimensional coordinates, and camera parameters are associated. As a result, the accuracy of each model can be improved.
[System]
The processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified. The coordinate acquisition unit 22 is an example of an acquisition unit, and the learning data generation unit 23 is an example of an identifying unit, an execution unit, and a generation unit. The three-dimensional skeleton series data 14 is an example of three-dimensional series data, and the camera parameters are an example of projection parameters. The two-dimensional joint coordinates are the coordinates of a joint position expressed in two dimensions and are synonymous with two-dimensional skeleton coordinates. Similarly, the three-dimensional joint coordinates are the coordinates of a joint position expressed in three dimensions and are synonymous with three-dimensional skeleton coordinates.
Each component of each device shown in the drawings is functionally conceptual and does not necessarily have to be physically configured as shown. That is, the specific forms of distribution and integration of the devices are not limited to those shown in the drawings; all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.
Furthermore, all or any part of the processing functions performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware using wired logic.
[Hardware]
Next, a hardware configuration example will be described. FIG. 18 is a diagram illustrating a hardware configuration example. As shown in FIG. 18, the generation device 10 includes a communication device 10a, an HDD (Hard Disk Drive) 10b, a memory 10c, and a processor 10d. The units shown in FIG. 18 are connected to each other by a bus or the like.
The communication device 10a is a network interface card or the like and communicates with other servers. The HDD 10b stores programs and DBs for operating the functions shown in FIG. 3.
The processor 10d reads a program that executes the same processing as each processing unit shown in FIG. 3 from the HDD 10b or the like and loads it into the memory 10c, thereby running a process that executes each function described with reference to FIG. 3 and elsewhere. For example, this process executes the same functions as each processing unit of the generation device 10. Specifically, the processor 10d reads a program having the same functions as the data acquisition unit 21, the coordinate acquisition unit 22, the learning data generation unit 23, and the like from the HDD 10b or the like, and executes a process that performs the same processing as these units.
In this way, the generation device 10 operates as an information processing apparatus that executes various information processing methods by reading and executing the program. The generation device 10 can also realize the same functions as the above embodiment by reading the program from a recording medium with a medium reader and executing the read program. Note that the program in these other embodiments is not limited to being executed by the generation device 10. For example, the present invention can be applied in the same way when another computer or server executes the program, or when they execute the program in cooperation.
This program can be distributed via a network such as the Internet. The program can also be recorded on a computer-readable recording medium such as a hard disk, flexible disk (FD), CD-ROM, MO (Magneto-Optical disk), or DVD (Digital Versatile Disc), and executed by being read from the recording medium by a computer.
10 Generation device
11 Communication unit
12 Storage unit
13 Moving image data
14 Three-dimensional skeleton series data
15 Learning data set
20 Control unit
21 Data acquisition unit
22 Coordinate acquisition unit
23 Learning data generation unit
24 Initial setting unit
25 Optimization unit
26 Output unit

Claims (7)

1. A data generation method in which a computer executes a process comprising:
acquiring two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data obtained by imaging a subject performing a predetermined motion;
identifying, from three-dimensional series data including three-dimensional skeleton data relating to a plurality of joint positions of the subject performing the predetermined motion, each of a plurality of pieces of three-dimensional skeleton data corresponding to each of the plurality of frames;
executing optimization of an adjustment amount relating to time synchronization between the moving image data and the three-dimensional series data and of projection parameters used when projecting the three-dimensional series data onto the moving image data, using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data; and
generating data in which the moving image data and the three-dimensional series data are associated with each other, using the optimized adjustment amount and the optimized projection parameters.
2. The data generation method according to claim 1, wherein the identifying includes identifying each of the plurality of pieces of three-dimensional skeleton data corresponding to each of the plurality of frames by resampling using the three-dimensional skeleton data within a sampling period of the three-dimensional series data that includes the time of each of the plurality of frames.
3. The data generation method according to claim 2, wherein the computer further executes a process of:
calculating the projection parameters for each synchronization pattern in which the time synchronization between the time of the moving image data and the time of the three-dimensional series data is adjusted, by solving an estimation problem that estimates the position and orientation of a camera using each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data associated by the resampling; and
calculating a likelihood representing the validity of each of the projection parameters calculated for each synchronization pattern and of the time synchronization of each synchronization pattern,
wherein the executing includes executing optimization of a cost function relating to the adjustment amount and the projection parameters, with the projection parameters and the time synchronization calculated for the synchronization pattern having the highest likelihood as initial values.
4. The data generation method according to claim 3, wherein the calculating includes calculating, as the likelihood, the ratio of the number of joints for which the reprojection error of each joint is less than a threshold when each of the resampled plurality of pieces of three-dimensional skeleton data is projected onto the corresponding frame of the moving image data.
5.  The data generation method according to claim 1, wherein the generating synchronizes the time of the moving image data and the time of the three-dimensional series data according to the optimized adjustment amount for the time synchronization, and generates the data by projecting, using the optimized projection parameters, each piece of three-dimensional skeleton data in the three-dimensional series data onto each frame of the moving image data whose time is synchronized by the time synchronization.
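     After the optimization has converged, generating the associated data is a single pass over the video: shift each frame time by the optimized adjustment amount, resample the skeleton, and project it with the optimized parameters. Again a sketch only, reusing the hypothetical resample_skeleton helper and OpenCV's projectPoints:

        import numpy as np
        import cv2

        def generate_dataset(frame_times, frames, mocap_times, skeletons,
                             dt, rvec, tvec, K):
            dataset = []
            for t, frame in zip(frame_times, frames):
                x3d = resample_skeleton(t + dt, mocap_times, skeletons)
                uv, _ = cv2.projectPoints(x3d.astype(np.float64),
                                          rvec, tvec, K, None)
                # Each record pairs a 2-D image with its 3-D skeleton position:
                # one sample of the learning data the method is meant to produce.
                dataset.append({"image": frame,
                                "joints_2d": uv.reshape(-1, 2),
                                "skeleton_3d": x3d})
            return dataset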
6.  A data generation program that causes a computer to execute a process comprising:
     acquiring two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data that captures a subject performing a predetermined motion;
     identifying, from three-dimensional series data that includes three-dimensional skeleton data on a plurality of joint positions of the subject performing the predetermined motion, each of a plurality of pieces of three-dimensional skeleton data corresponding to each of the plurality of frames;
     executing, by using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data, optimization of an adjustment amount for time synchronization between the moving image data and the three-dimensional series data and of projection parameters used when projecting the three-dimensional series data onto the moving image data; and
     generating, by using the optimized adjustment amount and projection parameters, data in which the moving image data and the three-dimensional series data are associated with each other.
7.  An information processing apparatus comprising:
     an acquisition unit that acquires two-dimensional coordinates of each of a plurality of joints from each of a plurality of frames included in moving image data that captures a subject performing a predetermined motion;
     an identification unit that identifies, from three-dimensional series data that includes three-dimensional skeleton data on a plurality of joint positions of the subject performing the predetermined motion, each of a plurality of pieces of three-dimensional skeleton data corresponding to each of the plurality of frames;
     an execution unit that executes, by using the two-dimensional coordinates of each of the plurality of frames and each of the plurality of pieces of three-dimensional skeleton data, optimization of an adjustment amount for time synchronization between the moving image data and the three-dimensional series data and of projection parameters used when projecting the three-dimensional series data onto the moving image data; and
     a generation unit that generates, by using the optimized adjustment amount and projection parameters, data in which the moving image data and the three-dimensional series data are associated with each other.
PCT/JP2020/026232 2020-07-03 2020-07-03 Data generation method, data generation program, and information-processing device WO2022003963A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022533003A JP7318814B2 (en) 2020-07-03 2020-07-03 DATA GENERATION METHOD, DATA GENERATION PROGRAM AND INFORMATION PROCESSING DEVICE
PCT/JP2020/026232 WO2022003963A1 (en) 2020-07-03 2020-07-03 Data generation method, data generation program, and information-processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/026232 WO2022003963A1 (en) 2020-07-03 2020-07-03 Data generation method, data generation program, and information-processing device

Publications (1)

Publication Number Publication Date
WO2022003963A1 (en) 2022-01-06

Family

ID=79314967

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/026232 WO2022003963A1 (en) 2020-07-03 2020-07-03 Data generation method, data generation program, and information-processing device

Country Status (2)

Country Link
JP (1) JP7318814B2 (en)
WO (1) WO2022003963A1 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018129009A (en) * 2017-02-10 2018-08-16 日本電信電話株式会社 Image compositing device, image compositing method, and computer program
WO2018211571A1 (en) * 2017-05-15 2018-11-22 富士通株式会社 Performance display program, performance display method, and performance display device
WO2019043928A1 (en) * 2017-09-01 2019-03-07 富士通株式会社 Practice support program, practice support method and practice support system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024024055A1 (en) * 2022-07-28 2024-02-01 富士通株式会社 Information processing method, device, and program
CN115311314A (en) * 2022-10-13 2022-11-08 深圳市华汉伟业科技有限公司 Resampling method, system and storage medium for line laser contour data
CN115311314B (en) * 2022-10-13 2023-02-17 深圳市华汉伟业科技有限公司 Resampling method, system and storage medium for line laser contour data

Also Published As

Publication number Publication date
JPWO2022003963A1 (en) 2022-01-06
JP7318814B2 (en) 2023-08-01

Similar Documents

Publication Publication Date Title
Zheng et al. Hybridfusion: Real-time performance capture using a single depth sensor and sparse imus
WO2020054442A1 (en) Articulation position acquisition method and device, and motion acquisition method and device
JP7410499B2 (en) Digital twin modeling method and system for remote control environment of assembly robots
Shiratori et al. Motion capture from body-mounted cameras
CN111402290B (en) Action restoration method and device based on skeleton key points
KR101591779B1 (en) Apparatus and method for generating skeleton model using motion data and image data
KR101640039B1 (en) Image processing apparatus and method
Wang et al. Video-based hand manipulation capture through composite motion control
KR101616926B1 (en) Image processing apparatus and method
KR101812379B1 (en) Method and apparatus for estimating a pose
Zhang et al. Leveraging depth cameras and wearable pressure sensors for full-body kinematics and dynamics capture
JP7367764B2 (en) Skeleton recognition method, skeleton recognition program, and information processing device
EP3382648A1 (en) Three-dimensional model generating system, three-dimensional model generating method, and program
JP7164045B2 (en) Skeleton Recognition Method, Skeleton Recognition Program and Skeleton Recognition System
JP2023502795A (en) A real-time system for generating 4D spatio-temporal models of real-world environments
JP2019079487A (en) Parameter optimization device, parameter optimization method and program
CN109934847A (en) The method and apparatus of weak texture three-dimension object Attitude estimation
WO2022003963A1 (en) Data generation method, data generation program, and information-processing device
CN113449570A (en) Image processing method and device
JP2021060868A (en) Information processing apparatus, information processing method, and program
Zou et al. Automatic reconstruction of 3D human motion pose from uncalibrated monocular video sequences based on markerless human motion tracking
JP5503510B2 (en) Posture estimation apparatus and posture estimation program
CN113284192A (en) Motion capture method and device, electronic equipment and mechanical arm control system
Xu Single-view and multi-view methods in marker-less 3d human motion capture
JP2024501161A (en) 3D localization of objects in images or videos

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20943599
    Country of ref document: EP
    Kind code of ref document: A1
ENP Entry into the national phase
    Ref document number: 2022533003
    Country of ref document: JP
    Kind code of ref document: A
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20943599
    Country of ref document: EP
    Kind code of ref document: A1