WO2023223508A1 - Video processing device, video processing method, and program - Google Patents

Video processing device, video processing method, and program

Info

Publication number
WO2023223508A1
Authority
WO
WIPO (PCT)
Prior art keywords
posture information
frame
subject
pattern
video processing
Application number
PCT/JP2022/020848
Other languages
French (fr)
Japanese (ja)
Inventor
明男 亀田
誠明 松村
裕司 青野
Original Assignee
Nippon Telegraph and Telephone Corporation
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2022/020848
Publication of WO2023223508A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis


Abstract

A video processing device according to one aspect of the present invention comprises a 2D posture information generation unit, a posture information processing unit, and a 3D conversion processing unit. The 2D posture information generation unit generates 2D posture information of a subject from video data of the subject. The posture information processing unit corrects the 2D posture information. The 3D conversion processing unit generates 3D posture information of the subject from the corrected 2D posture information. The posture information processing unit comprises a left-right reversal determination unit that performs left-right reversal determination processing. In the left-right reversal determination processing, first through fourth patterns are set for an image frame of the subject in the video data, the patterns corresponding to the upper body and the lower body each being reversed or not reversed, and the pattern with the lowest score among the first through fourth patterns is selected, the score being calculated by an evaluation formula that takes as variables the variance and the average of the distances between corresponding joint positions in the previous frame and the current frame.

Description

Video processing device, video processing method, and program
One aspect of the present invention relates to a video processing device for detecting, for example, a person's posture and creating a 3D (three-dimensional) wireframe model, to a computer-implemented video processing method, and to a program.
[Embodied knowledge] is expressed in a person's posture and movements, for example as [techniques]. Technology that uses video processing to digitize and record such information, which is not verbalized and is based, so to speak, on human senses, is being actively researched. If a person's posture can be quantified (hereinafter, skill capture), later analysis and comparison become possible, for example visualization for analysis by an instructor. Furthermore, to effectively reproduce an instructor's techniques and efficiently teach them to practitioners, it is desirable to be able to estimate (acquire) a person's posture with high accuracy.
OpenPose is an open-source library that can detect skeletal coordinates from image data of a person as a subject and generate a wireframe model. A technique is known that uses it to estimate the position of each joint point (2D posture information) for each image frame and thereby quantify the 2D posture (for example, see Non-Patent Document 1). Techniques for converting 2D (two-dimensional) posture information into three dimensions to obtain 3D (three-dimensional) posture information are also known (for example, see Non-Patent Documents 2 and 3).
In existing techniques for the three-dimensional reconstruction of a group of subjects using posture estimation, the left and right coordinates are sometimes reversed when the 2D posture is estimated (for example, skeletal coordinates belonging to the right arm are erroneously estimated as skeletal coordinates belonging to the left arm). If 2D skeletal coordinate data whose left and right sides are reversed is used, the accuracy of estimating 3D posture information decreases. This is undesirable because it degrades, for example, the processing accuracy of skill capture.
The present invention was made in view of the above circumstances, and its purpose is to provide a technology that suppresses left-right reversal of the skeletal coordinates in 2D posture information, thereby making it possible to accurately estimate 3D posture information.
A video processing device according to one aspect of the present invention includes a 2D posture information generation unit, a posture information processing unit, and a 3D conversion processing unit. The 2D posture information generation unit generates 2D posture information of a subject from video data of the subject. The posture information processing unit corrects the 2D posture information. The 3D conversion processing unit generates 3D posture information of the subject from the corrected 2D posture information. The posture information processing unit includes a left-right reversal determination unit that performs left-right reversal determination processing. In the left-right reversal determination processing, first through fourth patterns corresponding to the upper body and the lower body each being reversed or not reversed are set for an image frame of the subject in the video data, and from among the first through fourth patterns, the pattern that minimizes a score calculated by an evaluation formula taking as variables the variance and the average of the distances between corresponding joint positions in the previous frame and the current frame is selected.
According to one aspect of the present invention, it is possible to provide a technique that suppresses left-right reversal of the skeletal coordinates in 2D posture information and thereby makes it possible to accurately estimate 3D posture information.
FIG. 1 is a diagram showing an example of a workflow for reproducing and transmitting embodied knowledge.
FIG. 2 is a diagram showing an example of a wireframe model used in skill capture.
FIG. 3 is a diagram for explaining the estimation of 3D posture information from 2D posture information.
FIG. 4 is a diagram for explaining left-right reversal.
FIG. 5 is a functional block diagram showing an example of a video processing device according to an embodiment.
FIG. 6 is a diagram showing an example of a table stored as the 2D posture information 61.
FIG. 7 is a diagram showing an example of a table stored as the 3D posture information 62.
FIG. 8 is a diagram showing the flow of data between the functional blocks shown in FIG. 5.
FIG. 9 is a diagram for explaining an overview of the [basic processing].
FIG. 10 is a diagram for explaining the [correction 2] process.
FIG. 11 is a diagram for explaining the [correction 3] process.
FIG. 12 is a diagram for explaining the details of the [correction 3] process.
Embodiments of the present invention will be described below with reference to the drawings.
FIG. 1 is a diagram showing an example of a workflow for reproducing and transmitting embodied knowledge (including simulated experiences and the like). Embodied knowledge is put to effective use by analyzing skills captured by cameras and sensors, for example by comparing video data of an expert with video data of a practitioner, extracting points for improvement, and feeding them back through an appropriate presentation method.
Skill capture can be described as a technology that, using multiple cameras installed without particular care, acquires at once items such as the camera parameters (position, orientation, viewing angle, distortion), the tracking of each subject, and the three-dimensional posture of each subject (three-dimensional coordinates of the joints), without camera calibration.
Skill capture comprises a series of processes: 2D posture estimation from video data, a digitization process, rotation angle fitting, and rotation angle noise removal.
In 2D posture estimation, the posture of feature points (2D skeletal coordinates and the like) is estimated using each frame of the video as input. Multiple people can also be handled.
In the digitization process, each subject is separated and tracked, the posture of the 3D skeletal coordinates of the skeleton is estimated, and the camera parameters are estimated at the same time.
In rotation angle fitting and noise removal, the 3D coordinates are converted into a 3D rotation angle model, and noise is further removed.
To reproduce and transmit embodied knowledge more effectively, it is desirable to refine each element shown in FIG. 1. Here, the "skills" to be analyzed include physical movements (down to the fingertips), physiological information, physiological reactions, psychological states, and the like. The scope of analysis can be expanded further.
FIG. 2 is a diagram showing an example of a wireframe model used in skill capture. In the wireframe model, a number (part ID) is assigned to each body part, such as a joint. Two-dimensional posture information is expressed by acquiring the coordinates of each part of the wireframe model.
FIG. 3 is a diagram for explaining the estimation of 3D posture information from 2D posture information. 2D posture information is acquired by each of a plurality of cameras (camera 1, camera 2) installed at different viewpoints. The plurality of cameras are synchronized with each other, and each acquires video data of a subject (such as a person).
2D posture information consists of 2D coordinates representing the joint positions of the subject in the n-th frame of each camera image. For example, it can be represented by a set of 4-element vectors (camera ID, part ID, part x coordinate, part y coordinate). In FIG. 3, the right wrist in the frame of camera 2 is expressed by 2D posture information such as (camera ID = 2, part ID = 4, part x coordinate = 30, part y coordinate = 100).
3D posture information consists of 3D coordinates representing the joint positions estimated in 3D from the 2D posture information of the multiple cameras in the n-th frame of each camera image. For example, it can be represented by a set of 4-element vectors (part ID, part x coordinate, part y coordinate, part z coordinate). In the example of FIG. 3, the 3D posture information is expressed, for example, as (part ID = 4, part x coordinate = 50, part y coordinate = 300, part z coordinate = 150). Although two cameras are shown in FIG. 3, the same applies to three or more cameras.
FIG. 4 is a diagram for explaining left-right reversal. In FIG. 4(a), it can be seen that the left-right correspondence of the subject's upper-body parts is reversed. This state is "left-right reversal." In FIG. 4(b), on the other hand, the right hand correctly corresponds to the right hand and the left hand to the left hand, and no left-right reversal has occurred.
Left-right reversal is caused by recognition errors and the like during video processing, and can occur in either the upper body or the lower body. The embodiment considers the combinations of four cases: (upper body, not reversed), (upper body, reversed), (lower body, not reversed), and (lower body, reversed).
Thus, in the skill capture process, the posture of feature points (2D skeletal coordinates and the like) is estimated using each frame of the video as input. At this time, the 2D skeletal coordinates may be left-right reversed. This is undesirable because it lowers the accuracy of the 3D skeletal coordinates estimated from the 2D skeletal coordinates. When estimating the posture of the 3D skeletal coordinates, existing techniques do not include a determination algorithm that solves the left-right reversal problem for the input 2D posture information.
Therefore, in the embodiment, when estimating 3D skeletal coordinates from 2D skeletal coordinates, the presence or absence of left-right reversal is determined for each frame and each subject ID, and correction processing is further performed based on the result. The following describes the [basic processing] that determines the presence or absence of left-right reversal, and a plurality of correction processes distinguished as [correction 1], [correction 2], and [correction 3].
Here, past 3D skeletal coordinates cannot be referenced in the first frame of the video data. From the next frame onward, however, past 3D skeletal coordinates become available (case 1). That is, the subject can be tracked between frames (a predicted subject for the current (n-th) frame can be generated from the past ((n-1)-th) frame). The embodiment therefore describes (case 1) in detail.
[Basic processing]
The predicted subject is assumed to be correct (it is corrected in a later step), and the pattern with the smallest variance and average of the joint position distances to the current frame is selected from the four patterns (upper/lower body, reversed or not).

(Intra-frame correction)
(1-1) [Correction 1] Determine and correct based on the number of cameras showing the reversed and non-reversed patterns.
(1-2) [Correction 2] In addition to the basic processing, perform correction 1 twice (the predicted subject for the second pass is generated from the result of the first pass; this is referred to as two-stage left-right reversal determination).

(Inter-frame correction)
(1-3) [Correction 3] Determine the presence or absence of left-right reversal between the current frame and a past frame (inter-frame left-right determination), and perform correction.
FIG. 5 is a functional block diagram showing an example of a video processing device according to an embodiment. The video processing device 10 is an information processing device (computer) that includes a processor 20 and a memory 90. In addition, the video processing device 10 includes a storage 60 and an interface unit 70 connected to the plurality of cameras 1-1 to 1-n. The units of the video processing device 10 are connected to one another via a bus 80.
The storage 60 stores 2D posture information 61, 3D posture information 62, and a program 63. The program 63 is loaded into the memory 90 by the OS (operating system) of the video processing device 10 and executed by the processor 20. The program 63 causes the processor 20 to function as a 2D posture information generation unit 30, a posture information processing unit 40, and a 3D conversion processing unit 50. The posture information processing unit 40 includes, as processing routines, a left-right reversal determination unit 41 and a left-right reversal correction processing unit 42.
The 2D posture information generation unit 30 generates 2D posture information of the subject from video data of the subject.
The posture information processing unit 40 corrects the 2D posture information generated by the 2D posture information generation unit 30.
The 3D conversion processing unit 50 generates 3D posture information of the subject from the 2D posture information corrected by the posture information processing unit 40.
The left-right reversal determination unit 41 sets four patterns for an image frame of the subject in the video data of the subject: [as detected], [upper body reversed], [lower body reversed], and [entirely reversed]. These patterns are the combinations of the four cases (upper body, not reversed), (upper body, reversed), (lower body, not reversed), and (lower body, reversed). In other words, these patterns correspond to the upper body and the lower body each being reversed or not reversed.
The left-right reversal determination unit 41 then selects, from among these four patterns, the pattern that minimizes the score calculated by an evaluation formula taking as variables the variance and the average of the distances between the corresponding joint positions in the previous frame and the current frame.
The left-right reversal correction processing unit 42 performs left-right reversal determination processing within a frame.
Alternatively, the left-right reversal correction processing unit 42 performs two-stage left-right reversal determination processing within a frame.
Alternatively, the left-right reversal correction processing unit 42 performs left-right reversal determination processing between consecutive frames.
Here, the left-right reversal correction processing performed by the left-right reversal correction processing unit 42 is similar to the processing performed by the left-right reversal determination unit 41.
FIG. 6 is a diagram showing an example of a table stored as the 2D posture information 61. The 2D posture information 61 is a table whose columns are subject ID, camera ID, part ID, part x coordinate, and part y coordinate.
FIG. 7 is a diagram showing an example of a table stored as the 3D posture information 62. The 3D posture information 62 is a table whose columns are subject ID, part ID, part x coordinate, part y coordinate, and part z coordinate.
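As a minimal sketch, the two tables can be modeled as record types like the following (the field names mirror the columns above; the Python types and the example subject ID are assumptions, not taken from the publication):

```python
from dataclasses import dataclass

@dataclass
class Pose2DRow:
    """One row of the 2D posture information 61 (FIG. 6)."""
    subject_id: int
    camera_id: int
    part_id: int
    x: float  # part x coordinate on the camera image
    y: float  # part y coordinate on the camera image

@dataclass
class Pose3DRow:
    """One row of the 3D posture information 62 (FIG. 7)."""
    subject_id: int
    part_id: int
    x: float
    y: float
    z: float

# The right-wrist example from FIG. 3 (subject_id is illustrative).
wrist_2d = Pose2DRow(subject_id=1, camera_id=2, part_id=4, x=30, y=100)
wrist_3d = Pose3DRow(subject_id=1, part_id=4, x=50, y=300, z=150)
```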
FIG. 8 is a diagram showing the flow of data between the functional blocks shown in FIG. 5. FIG. 8 shows the case in which the camera parameters are calibrated in advance. In FIG. 8, multi-view video data from the cameras 1-1 to 1-n is passed to the current frame extraction unit 21 together with the camera IDs.
The current frame extraction unit 21 extracts the n-th frame image for each camera ID from the video data and passes it to the 2D posture information generation unit 30.
The 2D posture information generation unit 30 calculates the 2D posture information in the n-th frame image and passes it to the left-right reversal determination unit 41 of the posture information processing unit 40. The 2D posture information is also stored in the 2D posture information storage unit of the posture information processing unit 40.
The posture information processing unit 40 performs the [basic processing] with the left-right reversal determination unit 41, and performs the [correction 1], [correction 2], and [correction 3] processes with the left-right reversal correction processing unit 42 to calculate the 3D posture information. The 3D posture information is output to the outside and is also stored in the 3D posture information storage unit.
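The per-frame flow can be sketched roughly as follows; detect_2d, project, and estimate_3d are assumed callables standing in for the 2D posture information generation unit, the camera projection, and the 3D conversion processing unit, and select_pattern and correction1 are sketched in the sections below:

```python
def process_frame(frames_by_camera, predicted_3d, detect_2d, project, estimate_3d):
    """One pass of the FIG. 8 data flow for the n-th frame (a sketch)."""
    # 2D posture information generation unit 30: per-camera 2D poses.
    poses_2d = {cam: detect_2d(img) for cam, img in frames_by_camera.items()}
    # Posture information processing unit 40: basic processing + correction 1
    # (select_pattern and correction1 are defined in the sketches that follow).
    selected = correction1({
        cam: select_pattern(pose, project(predicted_3d, cam))
        for cam, pose in poses_2d.items()
    })
    # 3D conversion processing unit 50: corrected 2D poses -> 3D posture.
    return estimate_3d(poses_2d, selected)
```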
[About the basic processing]
FIG. 9 is a diagram for explaining an overview of the [basic processing]. The [basic processing] follows the idea that "in the current frame, the predicted subject is projected for each camera, and if the distance between the predicted subject and the two-dimensional posture information of the current frame is large, left-right reversal processing is performed."
To deal with left-right reversal, the [basic processing] projects the predicted three-dimensional subject onto each camera and matches it with the detected subject for each camera. That is, the matching processing is performed on the 2D plane of each camera image.
It is assumed that the three-dimensional subject generated from the past ((n-1)-th) frame (the predicted three-dimensional subject) has correct left-right assignments (no reversal). Then, for one of the two-dimensional subjects detected from the current (n-th) frame (the detected subject), four patterns are generated: [as detected], [upper body reversed], [lower body reversed], and [entirely reversed]. From these, the pattern with the small variance or average of the joint position distances (errors) to the predicted subject (the predicted three-dimensional subject projected onto each camera) is selected. This is because, if the left-right assignments are correct, the average distance should be about the same for all joints and the variance should be small. Focusing on the variance is particularly effective.
For example, the evaluation value can be determined using equation (1). The publication gives the equation only as an image; a form consistent with the surrounding description, taking the variance and the average of the joint position distances d as variables with a weight α, is:

$$\mathrm{Score} = \mathrm{Var}(d) + \alpha \cdot \mathrm{Avg}(d) \qquad \text{(1)}$$
In equation (1), α is a parameter and is set to 1, for example. Using equation (1), the evaluation value of each of the four patterns is calculated, and the pattern with the smallest evaluation value (the score of each pattern) is selected. For example, if the upper body is actually reversed, the score of [as detected] becomes large, and the score of [upper body reversed] becomes smaller than it.
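A minimal sketch of this pattern generation and scoring, assuming the score takes the form Var(d) + α·Avg(d) above and using illustrative left/right joint-index pairs (neither the exact formula nor the indices are specified in the text):

```python
import numpy as np

# Illustrative left/right joint-index pairs (not from the publication).
UPPER_PAIRS = [(2, 5), (3, 6), (4, 7)]      # shoulders, elbows, wrists
LOWER_PAIRS = [(8, 11), (9, 12), (10, 13)]  # hips, knees, ankles

def swap_pairs(joints, pairs):
    """Return a copy of the (N, 2) joint array with the given pairs swapped."""
    out = joints.copy()
    for l, r in pairs:
        out[[l, r]] = out[[r, l]]
    return out

def make_patterns(detected):
    """The four candidate patterns for one detected 2D subject."""
    return {
        "as_detected": detected,
        "upper_reversed": swap_pairs(detected, UPPER_PAIRS),
        "lower_reversed": swap_pairs(detected, LOWER_PAIRS),
        "entirely_reversed": swap_pairs(swap_pairs(detected, UPPER_PAIRS), LOWER_PAIRS),
    }

def score(pattern, projected_prediction, alpha=1.0):
    """Equation (1) sketch: variance plus alpha times average joint distance."""
    d = np.linalg.norm(pattern - projected_prediction, axis=1)
    return d.var() + alpha * d.mean()

def select_pattern(detected, projected_prediction, alpha=1.0):
    """Pick the pattern name with the smallest evaluation value."""
    patterns = make_patterns(detected)
    return min(patterns, key=lambda name: score(patterns[name], projected_prediction, alpha))
```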
[About correction 1]
[Correction 1] is, so to speak, a left-right reversal determination process within a frame. The predicted subject may be incorrect. However, assuming that 2D skeletal information is only rarely left-right reversed, when many reversed patterns are selected (per camera and per subject), it is likely that the selection of the reversed pattern has failed. Therefore, in [correction 1], if many cameras show reversal, the result is judged to be erroneous and is not adopted (no correction is made).
If the number of cameras with reversal is large, the prediction is judged to be wrong and the detection is adopted as-is. That is, with [as detected] = n1 cameras, [upper body reversed] = n2 cameras, [lower body reversed] = n3 cameras, and [entirely reversed] = n4 cameras, the total number of cameras N is N = n1 + n2 + n3 + n4. If n2 + n3 + n4 > N/2 at this time, all cameras are corrected back to [as detected].
[About correction 2]
FIG. 10 is a diagram for explaining the [correction 2] process. [Correction 2] is, so to speak, a two-stage left-right reversal determination process within a frame.
If the 2D skeletal coordinates are only rarely left-right reversed and the left-right reversal determination of [correction 1] makes few errors, then the 3D posture information (the estimated three-dimensional subject) estimated from the predicted 3D posture information (the predicted three-dimensional subject, that is, the three-dimensional subject of one frame earlier) is likely to be close to the true coordinates. Therefore, to approach the true coordinates more closely, [correction 2] repeats the basic processing plus [correction 1] once more.
(a) The estimated three-dimensional subject obtained in the first stage is input to the second stage as the predicted three-dimensional subject.
(b) In the second stage, left-right reversal is determined from the estimated three-dimensional coordinates and the detected subject, and a second estimate of the three-dimensional coordinates is obtained.
[About correction 3]
FIG. 11 is a diagram for explaining the [correction 3] process. To simplify the explanation, FIG. 11 and the following description focus on left-right reversal of the upper body only; in practice, the same processing is performed for the lower body. In FIG. 11, if (1) there is no reversal in the past frame and (2) there is reversal between the frames, there is a high possibility of reversal in the current frame (the n-th frame).
[Correction 3] is, so to speak, an inter-frame left-right reversal determination process.
By introducing logic that detects left-right reversal from the 2D posture information between frames, left-right reversal in the current frame can be predicted. For example, if there is no reversal in the past frame and there is reversal between the frames, reversal in the current frame is highly likely. The following comparisons therefore yield the estimates (1) and (2).
(1) The left-right reversal state of the past frame can be estimated by comparing the 3D posture information with the 2D posture information of the past frame.
(2) By comparing the 2D posture information of the past frame with that of the current frame (inter-frame comparison), it can be estimated whether the 2D posture information is left-right reversed between the current frame and the past frame.
FIG. 12 is a diagram for explaining the details of the [correction 3] process. The inter-frame left-right reversal determination process for the current frame, which uses an inter-frame comparison, is described in detail below with reference to FIG. 12. For simplicity, two patterns are described, with and without reversal of the upper body; in practice, similar processing is performed for the lower body.
By comparing FIG. 12(a) and FIG. 12(b), left-right reversal in the current frame can be predicted. That is, if "the prediction of left-right reversal between the past frame and the current frame (= d = a + b)" matches "the presence or absence of left-right reversal in the current frame (= c)", the result of c is considered reliable. A reliable pattern is then made easier to adopt by lowering its score value.
[1] Left-right reversal determination for the past frame (FIG. 12(a))
The three-dimensional subject estimated in the past frame is projected onto the screen and compared with the 2D posture information of the past frame. The comparison determines the left-right reversal of the target camera and subject in the past frame. For this determination, left-right reversal patterns are generated, and the pattern with the smallest positional distance between the joints and the projected subject is selected. The left-right reversal estimate at this point is based not on a prediction from the motion of past frames but on three-dimensional coordinates estimated from the observation points of multiple cameras, so it is considered more accurate than a result that relies on prediction.
[2] Inter-frame left-right reversal determination (FIG. 12(b))
The 2D posture information of the past frame and that of the current frame are compared to determine whether left-right reversal has occurred between the frames for the target camera and subject.
The difference in joint positions between the past frame and the current frame is calculated, and when the positional differences of the joints of the right half and the left half of the body each exceed a threshold, it is determined that left-right reversal has occurred between the frames. This rests on the assumption that, unless left and right are flipped, the joint positions do not change significantly. The joint position differences are scaled by the size of the subject on the screen, as in the following expression. The threshold d_th is set to 0.8, and left-right reversal between frames is assumed when the condition of equation (2) is satisfied for both the right half and the left half of the body.
The publication gives equation (2) only as an image; a form consistent with the variable definitions below is:

$$\frac{\operatorname{avg}\left(\lVert p_{\mathrm{previous}} - p_{\mathrm{current}} \rVert\right)}{\mathrm{length}} > d_{th} \qquad \text{(2)}$$
In equation (2), length indicates the two-dimensional length on the screen when the three-dimensional length between the hip joints of the target subject is projected onto the screen. p_previous indicates a joint position in the past frame. p_current indicates a joint position in the current frame. avg() is a function that computes the average.
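A sketch of this check (continuing the numpy-based sketches above; the joint-index sets and the use of the on-screen hip distance as length are assumptions):

```python
def interframe_reversed(prev_joints, curr_joints, right_ids, left_ids,
                        hip_ids=(8, 11), d_th=0.8):
    """Equation (2) sketch: reversal between frames is assumed only when the
    scaled joint displacement exceeds d_th for BOTH body halves."""
    # 'length': on-screen distance between the hip joints (an approximation).
    length = np.linalg.norm(prev_joints[hip_ids[0]] - prev_joints[hip_ids[1]])

    def exceeds(ids):
        d = np.linalg.norm(prev_joints[list(ids)] - curr_joints[list(ids)], axis=1)
        return d.mean() / length > d_th

    return exceeds(right_ids) and exceeds(left_ids)
```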
[3] Score calculation for the left-right reversal patterns of the current frame (FIG. 12(c))
The current-frame left-right reversal determination is the same as the method based on comparing left-right reversal patterns: using the predicted three-dimensional subject and the 2D posture information of the current frame, a score is calculated for each reversal pattern using equation (1).
[4] Left-right reversal determination for the current frame (FIG. 12(d))
The left-right reversal of the current frame is predicted from the results of [1] and [2]. If either [1] or [2] determines that there is reversal, left-right reversal is predicted. If [1] and [2] both show no reversal, or both show reversal, no left-right reversal is predicted. When both show reversal, a state that was left-right reversed in the past has been reversed again between frames; this is a reversal of a reversal, so the result is no reversal. The reliability of the prediction obtained here is insufficient for it to be used on its own, so it is treated as a correction to the score calculated in [3].
The score calculated in [3] is corrected so that the predicted left-right reversal pattern is more likely to be selected. Adding this correction to the score of equation (1) yields equation (3).
The publication gives equation (3) only as an image; a form consistent with the variable definitions below, which lowers the score of the predicted pattern by a correction term, is:

$$\mathrm{Score}'(S) = \mathrm{Score}(S) - \beta \cdot \mathrm{length} \cdot \mathbb{1}\left[S = S_{\mathrm{pred}}\right] \qquad \text{(3)}$$
In equation (3), S indicates each left-right reversal pattern. S_pred indicates the left-right reversal pattern of the current frame predicted from the results of [1] and [2]. length indicates the two-dimensional length on the screen when the three-dimensional length between the hip joints of the target subject is projected onto the screen. β is the coefficient (parameter) of the correction term.
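A sketch of this correction, assuming the correction term simply lowers the predicted pattern's score by β·length as in the reconstructed equation (3):

```python
def corrected_scores(raw_scores, predicted_pattern, length, beta=1.0):
    """Equation (3) sketch: subtract beta * length from the score of the
    pattern predicted in [4], making it more likely to be selected."""
    return {
        name: s - beta * length * (name == predicted_pattern)
        for name, s in raw_scores.items()
    }

# Usage sketch: the pattern with the smallest corrected score wins.
# adj = corrected_scores(scores, predicted, length)
# best = min(adj, key=adj.get)
```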
As described above, according to the embodiment, from the four patterns corresponding to whether the upper body and the lower body are each reversed or not, the pattern with the smallest variance and average of the distances between corresponding joint positions in the previous frame and the current frame is selected. Intra-frame correction and inter-frame correction are then performed based on the selected pattern. This suppresses left-right reversal errors, and therefore makes it possible to estimate 2D posture information and 3D posture information accurately.
In other words, according to the embodiment, an algorithm that resolves the left-right reversal problem in 2D posture information is introduced as post-processing. This improves the robustness of the processing and raises the estimation accuracy of the 3D posture information. The embodiment thus suppresses the left-right reversal of skeletal coordinates in 2D posture information, making accurate estimation of 3D posture information possible.
The present invention is not limited to the above embodiment; at the implementation stage, the constituent elements may be modified without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of components disclosed in the above embodiment: for example, some components may be deleted from all the components shown in the embodiment, and components from different embodiments may be combined as appropriate.
1-1 to 1-n … Camera
10 … Video processing device
20 … Processor
21 … Current frame extraction unit
30 … 2D posture information generation unit
40 … Posture information processing unit
41 … Left-right reversal determination unit
42 … Left-right reversal correction processing unit
50 … 3D conversion processing unit
60 … Storage
61 … 2D posture information
62 … 3D posture information
63 … Program
70 … Interface unit
80 … Bus
90 … Memory

Claims (6)

1.  A video processing device comprising:
    a 2D posture information generation unit that generates 2D posture information of a subject from video data of the subject;
    a posture information processing unit that corrects the 2D posture information; and
    a 3D conversion processing unit that generates 3D posture information of the subject from the corrected 2D posture information,
    wherein the posture information processing unit includes a left-right reversal determination unit that performs left-right reversal determination processing, and
    the left-right reversal determination processing sets, for an image frame of the subject in the video data, first to fourth patterns corresponding to whether the upper body and the lower body are each reversed or not, and selects, from among the first to fourth patterns, the pattern that minimizes a score calculated by an evaluation formula whose variables are the variance and the average of the distances between corresponding joint positions in a previous frame and a current frame.
2.  The video processing device according to claim 1, wherein the posture information processing unit further comprises a left-right reversal correction processing unit that performs the left-right reversal determination processing within a frame.
3.  The video processing device according to claim 1, wherein the posture information processing unit further comprises a left-right reversal correction processing unit that performs the left-right reversal determination processing in two stages within a frame.
4.  The video processing device according to claim 1, wherein the posture information processing unit further comprises a left-right reversal correction processing unit that performs the left-right reversal determination processing between consecutive frames.
5.  A video processing method executed by a processor of a video processing device comprising the processor and a storage, the method comprising:
    a step in which the processor generates 2D posture information of a subject from video data of the subject;
    a step in which the processor corrects the 2D posture information; and
    a step in which the processor generates 3D posture information of the subject from the corrected 2D posture information,
    wherein the step of correcting the 2D posture information includes:
    a step in which the processor sets, for an image frame of the subject in the video data, first to fourth patterns corresponding to whether the upper body and the lower body are each reversed or not;
    a step in which the processor calculates, for each of the first to fourth patterns, a score using an evaluation formula whose variables are the variance and the average of the distances between corresponding joint positions in a previous frame and a current frame; and
    a step in which the processor selects the pattern that minimizes the score.
6.  A program that causes a computer to function as each unit of the video processing device according to any one of claims 1 to 4.

PCT/JP2022/020848 2022-05-19 2022-05-19 Video processing device, video processing method, and program WO2023223508A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/020848 WO2023223508A1 (en) 2022-05-19 2022-05-19 Video processing device, video processing method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/020848 WO2023223508A1 (en) 2022-05-19 2022-05-19 Video processing device, video processing method, and program

Publications (1)

Publication Number Publication Date
WO2023223508A1 true WO2023223508A1 (en) 2023-11-23

Family

ID=88834939

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/020848 WO2023223508A1 (en) 2022-05-19 2022-05-19 Video processing device, video processing method, and program

Country Status (1)

Country Link
WO (1) WO2023223508A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021048988A1 (en) * 2019-09-12 2021-03-18 富士通株式会社 Skeleton recognition method, skeleton recognition program, and information processing device
WO2022074886A1 (en) * 2020-10-05 2022-04-14 株式会社島津製作所 Posture detection device, posture detection method, and sleeping posture determination method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MORIMOTO TAKASHI, IKUHISA MIKAMI: "Motion capture system through spatiotemporal integration of posture data from multiple Kinects", IPSJ SIG TECHNICAL REPORT, vol. 2019-CVIM-2017, no. 25, 30 May 2019 (2019-05-30), XP093109000 *

Similar Documents

Publication Publication Date Title
Zago et al. 3D tracking of human motion using visual skeletonization and stereoscopic vision
US11763603B2 (en) Physical activity quantification and monitoring
US9330470B2 (en) Method and system for modeling subjects from a depth map
KR101616926B1 (en) Image processing apparatus and method
JP4349367B2 (en) Estimation system, estimation method, and estimation program for estimating the position and orientation of an object
JP4148281B2 (en) Motion capture device, motion capture method, and motion capture program
US11398049B2 (en) Object tracking device, object tracking method, and object tracking program
CN108921907B (en) Exercise test scoring method, device, equipment and storage medium
Ye et al. A depth camera motion analysis framework for tele-rehabilitation: Motion capture and person-centric kinematics analysis
JP7367764B2 (en) Skeleton recognition method, skeleton recognition program, and information processing device
US20100208038A1 (en) Method and system for gesture recognition
US11403882B2 (en) Scoring metric for physical activity performance and tracking
JP6584208B2 (en) Information processing apparatus, information processing method, and program
US20200311395A1 (en) Method and apparatus for estimating and correcting human body posture
CN110751100A (en) Auxiliary training method and system for stadium
JP2005339288A (en) Image processor and its method
JP2016170605A (en) Posture estimation device
WO2023223508A1 (en) Video processing device, video processing method, and program
WO2019244536A1 (en) Object tracking device, object tracking system, and object tracking method
CN115841602A (en) Construction method and device of three-dimensional attitude estimation data set based on multiple visual angles
JPH06213632A (en) Image measurement device
Munn et al. FixTag: An algorithm for identifying and tagging fixations to simplify the analysis of data collected by portable eye trackers
JP2005309782A (en) Image processor
Ahmed Unified Skeletal Animation Reconstruction with Multiple Kinects.
WO2023062762A1 (en) Estimation program, estimation method, and information processing device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22942712

Country of ref document: EP

Kind code of ref document: A1