WO2022181253A1 - Joint point detection device, teaching model generation device, joint point detection method, teaching model generation method, and computer-readable recording medium - Google Patents

Joint point detection device, teaching model generation device, joint point detection method, teaching model generation method, and computer-readable recording medium

Info

Publication number
WO2022181253A1
Authority
WO
WIPO (PCT)
Prior art keywords
graph structure
graph
feature
feature extractor
output
Prior art date
Application number
PCT/JP2022/003767
Other languages
French (fr)
Japanese (ja)
Inventor
遊哉 石井
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社 filed Critical 日本電気株式会社
Priority to JP2023502226A priority Critical patent/JPWO2022181253A5/en
Publication of WO2022181253A1 publication Critical patent/WO2022181253A1/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • The present invention relates to a joint point detection device and a joint point detection method for detecting joint points of an object from an image, and further relates to a computer-readable recording medium recording a program for realizing these.
  • The present invention also relates to a learning model generation device and a learning model generation method for generating a learning model for detecting joint points of an object from an image, and further relates to a computer-readable recording medium recording a program for realizing these.
  • Non-Patent Document 1 discloses a system for estimating the posture of a person, in particular the posture of a person's hand, from an image.
  • The system disclosed in Non-Patent Document 1 first acquires image data that includes an image of a hand, and estimates the two-dimensional coordinates of each joint point of the hand.
  • Next, the system of Non-Patent Document 1 inputs the estimated two-dimensional coordinates of each joint point to a graph convolution network (hereinafter also referred to as a "GCN") and estimates the three-dimensional coordinates of each joint point. A GCN is a network that takes as input a graph structure composed of a plurality of nodes and performs convolution processing using adjacent nodes (see, for example, Patent Document 1).
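The convolution processing using adjacent nodes described above can be sketched as follows. This is the generic normalized graph convolution, given here only as an illustration; the function and variable names are not from the patent:

```python
import numpy as np

def graph_conv(features, adjacency, weights):
    """One graph-convolution step: each node aggregates its own and its
    neighbours' feature vectors, then applies a learned linear map."""
    # Add self-loops so each node keeps its own features.
    a_hat = adjacency + np.eye(adjacency.shape[0])
    # Symmetric degree normalization, commonly used in GCNs.
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(deg ** -0.5)
    norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
    return np.maximum(norm @ features @ weights, 0.0)  # ReLU activation

# 3-node chain graph (e.g. three joint points along one finger), 2-D features.
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
feats = np.random.randn(3, 2)   # one 2-D feature amount per node
w = np.random.randn(2, 4)       # lift 2 input channels to 4
out = graph_conv(feats, adj, w)
print(out.shape)                # (3, 4): the node count is unchanged
```

Note that this operation changes only the per-node feature dimension, not the number of nodes; node-count changes are handled by the pooling and unpooling described next.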
  • In the front stage, the GCN performs pooling processing multiple times, each time reducing the number of nodes of the input graph structure, until the number of nodes finally becomes one. In the rear stage, the GCN performs unpooling processing, which increases the number of nodes, on the graph structure having one node, the same number of times as the pooling processing. In addition, in the rear stage, the GCN concatenates the graph structure being processed with the front-stage graph structure having the same number of nodes and executes convolution, and finally outputs a graph structure having the same number of nodes as the input.
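The pooling and unpooling stages with same-node-count concatenation can be illustrated with a toy sketch. Mean-pooling over fixed node groups and copy-unpooling are assumptions made for brevity; the patent does not specify the concrete operators:

```python
import numpy as np

def pool(x, assign):
    """Reduce node count: `assign` maps each coarse node to a list of
    fine-node indices; coarse features are the mean over that group."""
    return np.stack([x[idx].mean(axis=0) for idx in assign])

def unpool(x, assign, n_fine):
    """Increase node count: copy each coarse node's features back to the
    fine nodes it was pooled from (the inverse of `pool`)."""
    out = np.zeros((n_fine, x.shape[1]))
    for coarse, idx in enumerate(assign):
        out[idx] = x[coarse]
    return out

# Toy front/rear stages over 4 nodes -> 2 -> 1 -> 2 -> 4, concatenating
# each rear-stage graph with the front-stage graph of the same node count.
assign_a = [[0, 1], [2, 3]]   # 4 fine nodes -> 2 coarse nodes
assign_b = [[0, 1]]           # 2 -> 1
x0 = np.random.randn(4, 3)
x1 = pool(x0, assign_a)                  # front stage: 2 nodes
x2 = pool(x1, assign_b)                  # bottleneck: 1 node
y1 = unpool(x2, assign_b, 2)             # rear stage: back to 2 nodes
y1 = np.concatenate([y1, x1], axis=1)    # concatenate same-size graphs
y0 = unpool(y1, assign_a, 4)             # back to 4 nodes
y0 = np.concatenate([y0, x0], axis=1)
print(y0.shape)                          # (4, 9): input node count restored
```

In a real network a convolution would follow each concatenation; the sketch only tracks how the node count shrinks to one and is then restored.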
  • In machine learning of the GCN, the graph structure of the two-dimensional coordinates of each joint point is input to the GCN, and the parameters are trained so that the difference between the output graph structure and the graph structure of the three-dimensional coordinates of each joint point, which serves as correct data, becomes small.
  • "HOPE-Net: A Graph-based Model for Hand-Object Pose Estimation", [online], IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Georgia University, March 31, 2020, [searched February 12, 2021], Internet <URL: https://arxiv.org/abs/2004.00060>
  • An example of an object of the present invention is to provide a joint point detection device, a learning model generation device, a joint point detection method, a learning model generation method, and a computer-readable recording medium that improve the detection accuracy when detecting the three-dimensional coordinates of joint points of a target from an image.
  • In order to achieve the above object, a joint point detection device in one aspect of the present invention includes: a graph structure acquisition unit that acquires a first graph structure in which the two-dimensional feature amount of each of a plurality of target joint points is represented by a node; and a graph structure output unit that receives the first graph structure as input and, using a graph convolution network, outputs a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points. The graph convolution network includes a plurality of intermediate layers and an output layer, and is constructed by machine learning the relationship between the first graph structure and the second graph structure. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • In addition, in order to achieve the above object, a learning model generation device in one aspect of the present invention includes: a training data acquisition unit that acquires, as training data, a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points and, as correct data, a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points;
  • and a learning model generation unit that inputs the first graph structure to a graph convolution network, calculates the difference between the graph structure output from the graph convolution network and the correct data, and generates a machine learning model constructed by the graph convolution network by machine learning the parameters in the graph convolution network so that the calculated difference becomes small.
  • The graph convolution network includes a plurality of intermediate layers and an output layer. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • In addition, in order to achieve the above object, a joint point detection method in one aspect of the present invention includes: a graph structure acquisition step of acquiring a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points; and a graph structure output step of receiving the first graph structure as input and, using a graph convolution network, outputting a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points. The graph convolution network includes a plurality of intermediate layers and an output layer, and is constructed by machine learning the relationship between the first graph structure and the second graph structure. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • In addition, in order to achieve the above object, a learning model generation method in one aspect of the present invention includes: a training data acquisition step of acquiring, as training data, a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points and, as correct data, a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points; and a learning model generation step of inputting the first graph structure to a graph convolution network, calculating the difference between the graph structure output from the graph convolution network and the correct data, and machine learning the parameters in the graph convolution network so that the calculated difference becomes small.
  • The graph convolution network includes a plurality of intermediate layers and an output layer. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • In addition, in order to achieve the above object, a first computer-readable recording medium in one aspect of the present invention records a program that causes a computer to execute: a graph structure acquisition step of acquiring a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points; and a graph structure output step of receiving the first graph structure as input and, using a graph convolution network, outputting a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points.
  • The graph convolution network includes a plurality of intermediate layers and an output layer, and is constructed by machine learning the relationship between the first graph structure and the second graph structure. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • In addition, in order to achieve the above object, a second computer-readable recording medium in one aspect of the present invention records a program that causes a computer to execute: a training data acquisition step of acquiring, as training data, a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points and, as correct data, a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points; and a learning model generation step of inputting the first graph structure to a graph convolution network, calculating the difference between the graph structure output from the graph convolution network and the correct data, and machine learning the parameters in the graph convolution network so that the calculated difference becomes small.
  • The graph convolution network includes a plurality of intermediate layers and an output layer. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • Furthermore, in order to achieve the above object, a graph convolution network in one aspect of the present invention includes a plurality of intermediate layers and an output layer, and is constructed by machine learning the relationship between a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points and a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points.
  • All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • FIG. 1 is a configuration diagram showing a schematic configuration of a learning model generation device according to Embodiment 1.
  • FIG. 2 is a configuration diagram specifically showing the configuration of the learning model generation device according to the first embodiment.
  • FIG. 3 is a configuration diagram showing the configuration of the graph convolutional network according to Embodiment 1.
  • FIG. 4 is an explanatory diagram for explaining processing in the graph convolution network shown in FIG. 3.
  • FIG. 5 is a flowchart showing the operation of the learning model generation device according to Embodiment 1.
  • FIG. 6 is a configuration diagram showing a schematic configuration of a joint point detection device according to Embodiment 2.
  • FIG. 7 is a diagram more specifically showing the configuration of the joint point detection device according to the second embodiment.
  • FIG. 8 is a flowchart showing the operation of the joint point detection device according to the second embodiment.
  • FIG. 9 is a block diagram showing an example of a computer that realizes the learning model generation device according to Embodiment 1 and the joint point detection device according to Embodiment 2.
  • Embodiment 1. A learning model generation device, a learning model generation method, a learning model generation program, and a graph convolution network in Embodiment 1 will be described below with reference to FIGS. 1 to 5.
  • FIG. 1 is a configuration diagram showing a schematic configuration of the learning model generation device according to Embodiment 1.
  • The learning model generation device 10 is a device that generates a machine learning model for detecting joint points of a target. As shown in FIG. 1, the learning model generation device 10 includes a training data acquisition unit 11 and a learning model generation unit 12.
  • The training data acquisition unit 11 acquires, as training data, a first graph structure indicating the two-dimensional feature amount of each of the target joint points and, as correct data, a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points.
  • The learning model generation unit 12 inputs the first graph structure to a graph convolution network (GCN) and calculates the difference between the graph structure output from the graph convolution network and the correct data. Then, the learning model generation unit 12 generates a machine learning model constructed by the graph convolution network by machine learning the parameters in the graph convolution network so that the calculated difference becomes small.
  • The graph convolution network includes a plurality of intermediate layers and an output layer. All or some of the plurality of intermediate layers include feature extractors that perform feature extraction without changing the number of nodes of the graph structure and feature extractors that perform feature extraction while reducing the number of nodes of the graph structure.
  • In each intermediate layer, each feature extractor uses as input the graph structure output by each feature extractor in the layer above. The output layer uses as input the graph structure output by each feature extractor in the lowest intermediate layer and outputs a graph structure.
  • As described above, the graph convolution network includes, in its intermediate layers, both a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure. Therefore, according to the graph convolution network, it is possible to avoid both a situation in which the feature amount cannot be sufficiently extracted because the number of convolutions is small and a situation in which the feature amount cannot be accurately extracted because the number of dimensions is insufficient. As a result, according to the first embodiment, it is possible to improve the detection accuracy when detecting the three-dimensional coordinates of the joint points from an image.
  • FIG. 2 is a configuration diagram specifically showing the configuration of the learning model generation device according to the first embodiment.
  • As shown in FIG. 2, the learning model generation device 10 includes a storage unit 13 in addition to the training data acquisition unit 11 and the learning model generation unit 12 described above.
  • The storage unit 13 stores a graph convolution network 30 (hereinafter referred to as the "GCN 30").
  • In the first embodiment, the target is a human hand. However, the target is not limited to the human hand, and may be the entire human body or another part of it. The target may be anything that has joint points, and may be something other than a person, such as a robot.
  • Since the target is the human hand, the nodes constituting the graph structure are represented by the two-dimensional or three-dimensional feature amounts of the joint points of the hand. A specific example of the feature amount is a coordinate value.
  • In FIG. 2, reference numeral 21 denotes the graph structure obtained from the image data 20. Its nodes 22 represent, as feature amounts, the two-dimensional coordinate values of the joint points of the hand; the graph structure 21 is the first graph structure. Reference numeral 23 denotes the second graph structure, which is the graph structure serving as correct data. Its nodes 24 represent, as feature amounts, the three-dimensional coordinate values of the joint points of the hand.
  • The nodes of the graph structure may also represent two-dimensional or three-dimensional feature amounts of parts other than the joint points, for example, characteristic parts such as fingertips. The first graph structure serving as training data can be obtained by inputting target image data into a machine learning model that has machine-learned the relationship between image data of joint points and the graph structure.
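As an illustration of such a first graph structure, a hand can be encoded as an adjacency matrix plus one 2-D coordinate per node. The 21-keypoint layout below (wrist plus four joints per finger) is a common convention assumed for this sketch; the patent itself does not fix the number of joint points:

```python
import numpy as np

# Assumed 21-keypoint hand skeleton: node 0 is the wrist, and each finger
# is a chain of 4 nodes starting at indices 1, 5, 9, 13, 17.
N_JOINTS = 21
edges = [(0, b) for b in (1, 5, 9, 13, 17)]                    # wrist to finger bases
edges += [(i, i + 1) for b in (1, 5, 9, 13, 17) for i in range(b, b + 3)]

adj = np.zeros((N_JOINTS, N_JOINTS))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0        # undirected bone connections

# First graph structure: one 2-D coordinate (feature amount) per node.
coords_2d = np.random.rand(N_JOINTS, 2)
print(adj.sum())                       # 40.0: 20 bones, counted both ways
```

The correct-data second graph structure would use the same adjacency with 3-D coordinates per node instead.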
  • In the first embodiment, the training data acquisition unit 11 acquires, as training data, the first graph structure 21 and, as correct data, the second graph structure 23. The training data acquisition unit 11 then inputs the acquired training data to the learning model generation unit 12.
  • In the first embodiment, the learning model generation unit 12 first acquires the GCN 30 from the storage unit 13. Next, the learning model generation unit 12 inputs the first graph structure 21 constituting the training data to the GCN 30, and calculates the difference between the second graph structure output from the GCN 30 and the second graph structure 23 serving as correct data. Then, the learning model generation unit 12 updates the parameters of the GCN 30 so that the calculated difference becomes small, and stores the GCN 30 with the updated parameters in the storage unit 13. As a result, a GCN for detecting the three-dimensional coordinates of the target joint points is generated.
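The parameter-update loop performed by the learning model generation unit 12 can be sketched as follows, with a single linear graph-convolution layer standing in for the GCN 30. This is a deliberately minimal assumption for illustration, not the patent's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the GCN: one linear graph-convolution layer mapping
# each node's 2-D feature amount to a 3-D feature amount.
n_nodes = 5
adj = np.eye(n_nodes)                  # trivial adjacency keeps the math simple
W = rng.normal(size=(2, 3)) * 0.1      # the network's learnable parameters

# One training pair: first graph structure (2-D) and correct data (3-D).
x = rng.normal(size=(n_nodes, 2))
y = x @ rng.normal(size=(2, 3))        # synthetic correct second graph structure

lr = 0.1
for _ in range(2000):
    pred = adj @ x @ W                 # graph structure output from the network
    diff = pred - y                    # difference from the correct data
    grad = x.T @ (adj.T @ diff) / n_nodes
    W -= lr * grad                     # update parameters so the difference shrinks
```

After training, the output graph structure is close to the correct data, which mirrors the "update parameters so the calculated difference becomes small" step in the text.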
  • FIG. 3 is a configuration diagram showing the configuration of the graph convolutional network according to Embodiment 1.
  • FIG. 4 is an explanatory diagram for explaining processing in the graph convolution network shown in FIG.
  • As shown in FIG. 3, the GCN 30 includes an input layer 31, a plurality of intermediate layers, and an output layer 33. The input layer 31 accepts the input of the first graph structure 21. The plurality of intermediate layers, consisting of a first intermediate layer 32a and a second intermediate layer 32b, perform feature extraction on the first graph structure. The output layer 33 uses as input the graph structure output by each feature extractor of the lowest intermediate layer, and outputs the second graph structure.
  • The first intermediate layer 32a includes only first feature extractors (the unhatched circles in FIG. 3) that perform feature extraction without changing the number of nodes of the graph structure. A first feature extractor performs convolution.
  • In the first intermediate layer 32a, each feature extractor uses as input the graph structure output by the upper-layer feature extractor that performs feature extraction with the same number of nodes. When a plurality of graph structures are input, the feature extractor of the first intermediate layer 32a concatenates the graph structures output by the respective feature extractors and executes convolution.
  • The second intermediate layer 32b includes a second feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, a third feature extractor (the hatched circles in FIG. 3) that performs feature extraction while increasing the number of nodes of the graph structure, or both. A second feature extractor performs pooling, and a third feature extractor performs unpooling.
  • The second intermediate layer 32b may also include a first feature extractor. In each second intermediate layer 32b, each feature extractor uses as input the plurality of graph structures output by the feature extractors in the layer above. The input layer 31 also includes a first feature extractor, and the output layer 33 includes a first feature extractor and a third feature extractor.
  • As shown in FIG. 4, graph structures with different numbers of nodes are generated by the feature extractors, and graph structures with different numbers of nodes are exchanged between the feature extractors. In FIG. 4, the numbers attached to the graph structures indicate the numbers of nodes.
  • FIG. 5 is a flowchart showing the operation of the learning model generation device according to Embodiment 1.
  • FIGS. 1 to 4 will be referred to as needed in the following description. In Embodiment 1, the learning model generation method is implemented by operating the learning model generation device 10. Therefore, the following description of the operation of the learning model generation device 10 substitutes for a description of the learning model generation method in Embodiment 1.
  • As shown in FIG. 5, first, the training data acquisition unit 11 acquires, as training data, a first graph structure indicating the two-dimensional feature amount of each of the target joint points and, as correct data, a second graph structure indicating the three-dimensional feature amount of each of the joint points (step A1).
  • Next, the learning model generation unit 12 inputs the first graph structure acquired as training data in step A1 to the GCN 30 and calculates the difference between the graph structure output from the GCN and the correct data. Then, the learning model generation unit 12 updates the parameters of the GCN so that the calculated difference becomes small (step A2).
  • Next, the learning model generation unit 12 stores the GCN whose parameters were updated in step A2 in the storage unit 13 (step A3). A GCN that can detect the three-dimensional coordinates of joint points is thereby generated.
  • As described above, according to Embodiment 1, a GCN that can sufficiently and accurately extract feature amounts is constructed, so the detection accuracy when detecting the three-dimensional coordinates of the target joint points from an image is improved.
  • The program for generating a learning model in Embodiment 1 may be any program that causes a computer to execute steps A1 to A3 shown in FIG. 5. By installing this program in a computer and executing it, the learning model generation device and the learning model generation method in Embodiment 1 can be realized.
  • In this case, the processor of the computer functions as the training data acquisition unit 11 and the learning model generation unit 12 and performs the processing.
  • The storage unit 13 may be realized by storing the data files constituting it in a storage device, such as a hard disk, provided in the computer, or by storing the data files in a storage device of another computer. Besides a general-purpose PC, the computer may be a smartphone or a tablet terminal device.
  • The learning model generation program in Embodiment 1 may also be executed by a computer system constructed from a plurality of computers. In this case, each computer may function as either the training data acquisition unit 11 or the learning model generation unit 12.
  • Embodiment 2. Next, a joint point detection device, a joint point detection method, and a joint point detection program in Embodiment 2 will be described with reference to FIGS. 6 and 7.
  • FIG. 6 is a configuration diagram showing a schematic configuration of a joint point detection device according to Embodiment 2.
  • The joint point detection device 40 according to Embodiment 2 shown in FIG. 6 is a device for detecting joint points of a target, for example, a living body or a robot. As shown in FIG. 6, the joint point detection device 40 includes a graph structure acquisition unit 41 and a graph structure output unit 42.
  • The graph structure acquisition unit 41 acquires a first graph structure in which the two-dimensional feature amount of each of a plurality of target joint points is represented by a node. The graph structure output unit 42 receives the first graph structure as input and, using a graph convolution network, outputs a second graph structure indicating the three-dimensional feature amount of each of the joint points.
  • The graph convolution network includes a plurality of intermediate layers and an output layer, and is constructed by machine learning the relationship between the first graph structure and the second graph structure. All or some of the plurality of intermediate layers include feature extractors that perform feature extraction without changing the number of nodes of the graph structure and feature extractors that perform feature extraction while reducing the number of nodes of the graph structure.
  • In each intermediate layer, each feature extractor uses as input the graph structure output by each feature extractor in the layer above. The output layer uses as input the graph structure output by each feature extractor in the lowest intermediate layer and outputs a graph structure.
  • As described above, in the second embodiment, a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure are used to output the second graph structure. Therefore, according to the second embodiment, it is possible to avoid both a situation in which the feature amount cannot be sufficiently extracted because the number of convolutions is small and a situation in which the feature amount cannot be accurately extracted because the number of dimensions is insufficient. As a result, according to the second embodiment, it is possible to improve the detection accuracy when detecting the three-dimensional coordinates of the joint points from an image.
  • FIG. 7 is a diagram more specifically showing the configuration of the joint point detection device according to the second embodiment.
  • As shown in FIG. 7, the joint point detection device 40 includes a storage unit 43 in addition to the graph structure acquisition unit 41 and the graph structure output unit 42 described above. The storage unit 43 stores the GCN 30 shown in FIG. 2 in the first embodiment.
  • In the second embodiment, the target is a human hand. However, the target of joint point detection is not limited to the human hand, and may be the entire human body or another part of it. The target of joint point detection may be any object that has joint points, and may be an object other than a person, such as a robot.
  • Examples of the feature amounts indicated by the nodes of the graph structure include two-dimensional or three-dimensional coordinate values. The nodes of the graph structure may also represent two-dimensional or three-dimensional feature amounts of parts other than the joint points, for example, characteristic parts such as fingertips.
  • In the second embodiment, the graph structure acquisition unit 41 acquires a first graph structure 50 obtained from image data of a human hand, as shown in FIG. 7. The first graph structure can be obtained by inputting the image data into a machine learning model that has machine-learned the relationship between image data of joint points and the graph structure.
  • In the second embodiment, the graph structure output unit 42 acquires the GCN 30 from the storage unit 43. Then, the graph structure output unit 42 inputs the first graph structure 50 to the GCN 30 and causes the GCN to output a second graph structure 51 indicating the three-dimensional feature amount of each of the plurality of joint points.
  • The GCN 30 includes the input layer 31, the plurality of intermediate layers 32a and 32b, and the output layer 33, as described in the first embodiment, and is constructed by machine learning the relationship between the first graph structure and the second graph structure. Therefore, in the output second graph structure 51, the three-dimensional feature amount (coordinate value) indicated by each node is a highly accurate value.
  • FIG. 8 is a flowchart showing the operation of the joint point detection device according to the second embodiment. FIGS. 6 and 7 will be referred to as needed in the following description. In the second embodiment, the joint point detection method is implemented by operating the joint point detection device 40. Therefore, the following description of the operation of the joint point detection device 40 substitutes for a description of the joint point detection method in the second embodiment.
  • the graph structure acquisition unit 41 first acquires a first graph structure 50 obtained from image data of a human hand (step B1).
  • the graph structure output unit 42 then inputs the first graph structure to the GCN 30 and causes the GCN 30 to output a second graph structure indicating three-dimensional feature amounts of each of the plurality of joint points (step B2).
  • each node represents the three-dimensional coordinates of each joint point of the target human hand, so each joint point of the human hand is detected in step B2.
  • the three-dimensional feature values (coordinate values) indicated by each node are highly accurate values in the output second graph structure.
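Steps B1 and B2 above can be pictured with the following sketch; the class, its method names, and the stand-in lifting function are hypothetical and only mimic the roles of the graph structure acquisition unit 41 and the graph structure output unit 42:

```python
import numpy as np

class JointPointDetector:
    """Mirrors the two steps of the device: graph structure acquisition (B1)
    and graph structure output (B2). The lifting function passed in is a
    stand-in for the trained GCN 30."""

    def __init__(self, lift_fn):
        self.lift_fn = lift_fn  # trained GCN: (N, 2) -> (N, 3)

    def acquire_first_graph(self, joints_2d):
        # Step B1: the first graph structure from 2D joint estimates.
        return np.asarray(joints_2d, dtype=float)

    def detect(self, joints_2d):
        # Step B2: feed the first graph to the GCN, get the second graph.
        first_graph = self.acquire_first_graph(joints_2d)
        return self.lift_fn(first_graph)

# Placeholder "GCN" for the sketch: append a zero depth channel to each node.
dummy_gcn = lambda g: np.concatenate([g, np.zeros((g.shape[0], 1))], axis=1)

detector = JointPointDetector(dummy_gcn)
second_graph = detector.detect(np.random.rand(21, 2))
print(second_graph.shape)  # (21, 3)
```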
  • the joint point detection program in the second embodiment may be any program that causes a computer to execute steps B1 and B2 shown in FIG. 8. By installing this program in a computer and executing it, the joint point detection device and the joint point detection method in the second embodiment can be realized.
  • the processor of the computer functions as a graph structure acquisition unit 41 and a graph structure output unit 42 to perform processing.
  • the storage unit 43 may be realized by storing the data files constituting it in a storage device, such as a hard disk, provided in the computer, or may be realized by a storage device of another computer. Moreover, examples of the computer include a general-purpose PC, as well as a smartphone and a tablet-type terminal device.
  • the joint point detection program in Embodiment 2 may be executed by a computer system constructed by a plurality of computers.
  • each computer may function as either the graph structure acquisition unit 41 or the graph structure output unit 42, respectively.
  • FIG. 9 is a block diagram showing an example of a computer that realizes the learning model generation device according to Embodiment 1 and the joint point detection device according to Embodiment 2.
  • a computer 110 includes a CPU (Central Processing Unit) 111, a main memory 112, a storage device 113, an input interface 114, a display controller 115, a data reader/writer 116, and a communication interface 117. These units are connected to each other via a bus 121 so as to be able to communicate with each other.
  • the computer 110 may include a GPU (Graphics Processing Unit) or an FPGA (Field-Programmable Gate Array) in addition to the CPU 111 or instead of the CPU 111.
  • a GPU or FPGA can execute the programs in the embodiments.
  • the CPU 111 expands the program in the embodiment, which is composed of a code group stored in the storage device 113, into the main memory 112 and executes various operations by executing each code in a predetermined order.
  • the main memory 112 is typically a volatile storage device such as DRAM (Dynamic Random Access Memory).
  • the program in the embodiment is provided stored in a computer-readable recording medium 120. It should be noted that the program in the embodiment may be distributed over the Internet, connected via the communication interface 117.
  • Input interface 114 mediates data transmission between CPU 111 and input devices 118 such as a keyboard and mouse.
  • the display controller 115 is connected to the display device 119 and controls display on the display device 119 .
  • the data reader/writer 116 mediates data transmission between the CPU 111 and the recording medium 120, reads programs from the recording medium 120, and writes processing results in the computer 110 to the recording medium 120.
  • Communication interface 117 mediates data transmission between CPU 111 and other computers.
  • examples of the recording medium 120 include general-purpose semiconductor storage devices such as CF (Compact Flash (registered trademark)) and SD (Secure Digital), magnetic recording media such as flexible disks, and optical recording media such as CD-ROM (Compact Disk Read Only Memory).
  • the learning model generation device 10 and the joint point detection device 40 can also be realized by using hardware corresponding to each part instead of a computer in which a program is installed. Further, the learning model generation device 10 and the joint point detection device 40 may be partly implemented by a program and the rest by hardware.
  • (Appendix 1) A joint point detection device comprising: a graph structure acquisition unit that acquires a first graph structure in which a two-dimensional feature amount of each of a plurality of joint points of a target is represented by a node; and a graph structure output unit that receives the first graph structure as input and, using a graph convolutional network, outputs a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points, wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
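An intermediate layer of the claimed form can be sketched as two parallel feature extractors, one preserving and one reducing the node count, each consuming the graph structure handed down from the layer above. The averaging pooling matrix and all dimensions below are assumptions for illustration, not the patented extractors:

```python
import numpy as np

rng = np.random.default_rng(1)

def keep_extractor(h, w):
    """Feature extraction without changing the number of nodes."""
    return np.maximum(h @ w, 0.0)

def reduce_extractor(h, pool, w):
    """Feature extraction while reducing the number of nodes: a pooling
    matrix merges groups of nodes before channel mixing."""
    return np.maximum(pool @ h @ w, 0.0)

n, c = 21, 8
h_in = rng.random((n, c))  # graph structure output by the upper layer

# Hypothetical pooling: average every group of 3 nodes -> 7 nodes.
pool = np.zeros((7, n))
for g in range(7):
    pool[g, 3 * g:3 * g + 3] = 1.0 / 3.0

w_keep = rng.random((c, c)) - 0.5
w_red = rng.random((c, c)) - 0.5

# One intermediate layer emits both graph structures; the next layer's
# extractors would take these as their inputs.
out_full = keep_extractor(h_in, w_keep)          # (21, 8), node count kept
out_small = reduce_extractor(h_in, pool, w_red)  # (7, 8), node count reduced

print(out_full.shape, out_small.shape)
```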
  • (Appendix 2) The joint point detection device according to Appendix 1, wherein, in the graph convolutional network, the plurality of intermediate layers comprises first intermediate layers and a second intermediate layer,
  • each first intermediate layer comprises only feature extractors that perform feature extraction without changing the number of nodes of the graph structure, and in each first intermediate layer, each feature extractor uses as input the graph structure output by the upper-layer feature extractor whose output has the same number of nodes, and
  • the second intermediate layer includes a feature extractor that performs feature extraction without changing the number of nodes of the graph structure, a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and a feature extractor that performs feature extraction while increasing the number of nodes of the graph structure.
  • a learning model generation unit that generates a machine learning model constructed by the graph convolutional network by machine-learning parameters in the graph convolutional network so that the calculated difference becomes small,
  • wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • a learning model generation device characterized by:
  • the learning model generation device described above, wherein the plurality of intermediate layers comprises first intermediate layers and a second intermediate layer,
  • each first intermediate layer comprises only feature extractors that perform feature extraction without changing the number of nodes of the graph structure, and in each first intermediate layer, each feature extractor uses as input the graph structure output by the upper-layer feature extractor whose output has the same number of nodes, and
  • the second intermediate layer includes a feature extractor that performs feature extraction without changing the number of nodes of the graph structure, a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and a feature extractor that performs feature extraction while increasing the number of nodes of the graph structure.
  • wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • a machine learning model generation step of generating a machine learning model constructed by the graph convolutional network by machine-learning parameters in the graph convolutional network so that the calculated difference becomes small,
  • wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • a learning model generation method characterized by:
  • wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • a computer-readable recording medium characterized by:
  • (Appendix 11) A graph convolutional network comprising a plurality of intermediate layers and an output layer, constructed by machine-learning the relationship between a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target and a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points, wherein
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure,
  • in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • a graph convolutional network characterized by:
  • According to the present invention, it is possible to improve detection accuracy when detecting the three-dimensional coordinates of joint points from an image.
  • INDUSTRIAL APPLICABILITY The present invention is useful in fields that require posture detection of objects having joint points, such as people and robots. Specific fields include video surveillance and user interfaces.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A teaching model generation device 10 comprising: a training data acquisition unit 11 that acquires, as training data, a first graph structure indicating a two-dimensional feature value for each joint point and, as correct-answer data, a second graph structure indicating a three-dimensional feature value for each joint point; and a teaching model generation unit 12 that inputs the first graph structure into a graph convolution network, calculates the difference between the output graph structure and the correct-answer data, and machine-learns parameters of the graph convolution network so as to reduce the difference. The graph convolution network comprises intermediate layers and an output layer. The intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure. Each feature extractor uses as input the graph structures output by the upper-layer feature extractors. The output layer outputs a graph structure, using as input the graph structures output by each feature extractor of the lowest intermediate layer.
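The training procedure in the abstract (input the first graph structure, compare the network's output with the correct-answer data, and adjust parameters so the difference shrinks) can be sketched with a toy one-layer graph convolution; the loss, optimizer, learning rate, and dimensions below are illustrative assumptions, not the patented training method:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 21
a_norm = np.eye(n)                   # stand-in for a normalized adjacency
first_graph = rng.random((n, 2))     # training input (2D feature per node)
correct_data = rng.random((n, 3))    # correct-answer data (3D feature per node)

w = rng.random((2, 3)) * 0.1         # single learnable parameter matrix

def forward(g, w):
    return a_norm @ g @ w            # minimal one-layer graph convolution

losses = []
for _ in range(200):
    out = forward(first_graph, w)
    diff = out - correct_data
    losses.append((diff ** 2).mean())  # difference between output and answer
    grad = 2.0 * (a_norm @ first_graph).T @ diff / diff.size
    w -= 0.5 * grad                  # update parameters to shrink the difference

assert losses[-1] < losses[0]        # the difference has been reduced
print(round(losses[0], 4), round(losses[-1], 4))
```

A real implementation would of course use many training pairs, the full multi-layer network, and an autograd framework; the point here is only the loop of forward pass, difference calculation, and parameter update.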

Description

Joint point detection device, learning model generation device, joint point detection method, learning model generation method, and computer-readable recording medium

The present invention relates to a joint point detection device and a joint point detection method for detecting the joint points of a target from an image, and further relates to a computer-readable recording medium recording a program for realizing these. The present invention also relates to a learning model generation device and a learning model generation method for generating a learning model for detecting the joint points of a target from an image, and further relates to a computer-readable recording medium recording a program for realizing these.
In recent years, systems for estimating a person's posture from an image have been proposed. Such systems are expected to be used in fields such as image surveillance and user interfaces. For example, if an image surveillance system can estimate a person's posture, it can estimate what the person captured by the camera is doing, thereby improving surveillance accuracy. Likewise, if a user interface can estimate a person's posture, input by gestures becomes possible.

For example, Non-Patent Document 1 discloses a system for estimating a person's posture, in particular the posture of a person's hand, from an image. The system disclosed in Non-Patent Document 1 first acquires image data including an image of a hand, inputs the acquired image data into a neural network that has machine-learned an image feature amount for each joint point, and estimates the two-dimensional coordinates of each joint point.

Subsequently, the system disclosed in Non-Patent Document 1 inputs the estimated two-dimensional coordinates of each joint point into a graph convolutional network (hereinafter also referred to as a "GCN" (Graph Convolution Network)) to estimate the three-dimensional coordinates of each joint point. A GCN is a network that takes as input a graph structure composed of a plurality of nodes and performs convolution processing using adjacent nodes (see, for example, Patent Document 1). In Non-Patent Document 1, each node of the input graph structure consists of the two-dimensional coordinates of one joint point.

In the system disclosed in Non-Patent Document 1, in its former stage, the GCN executes pooling processing that reduces the number of nodes of the input graph structure multiple times, finally reducing the number of nodes to one. In its latter stage, the GCN performs unpooling processing that increases the number of nodes of the one-node graph structure as many times as the pooling processing was performed. Also in the latter stage, the GCN concatenates the graph structure being processed with the former-stage graph structure having the same number of nodes, executes convolution, and finally outputs a graph structure with the same number of nodes as the input graph structure.

Furthermore, in the system disclosed in Non-Patent Document 1, a graph structure of the two-dimensional coordinates of each joint point is input to the GCN as training data, and machine learning of the GCN is performed so that the difference between the output graph structure and the graph structure of the three-dimensional coordinates of each joint point, serving as correct-answer data, becomes small.
JP 2020-27399 A
In the system disclosed in Non-Patent Document 1, the former-stage graph structure is convolved into the latter-stage graph structure, so a great deal of spatial information is convolved; however, because the number of convolutions in the former stage is small, feature extraction there is insufficient. In addition, in the latter stage, the target is a graph structure with only one node, so the number of dimensions is insufficient and feature amounts cannot be extracted accurately. For this reason, the system disclosed in Non-Patent Document 1 has the problem that the accuracy of the three-dimensional coordinates of each joint point is low.

An example of an object of the present invention is to provide a joint point detection device, a learning model generation device, a joint point detection method, a learning model generation method, and a computer-readable recording medium that can improve detection accuracy when detecting the three-dimensional coordinates of joint points from an image.
In order to achieve the above object, a joint point detection device according to one aspect of the present invention includes:
a graph structure acquisition unit that acquires a first graph structure in which a two-dimensional feature amount of each of a plurality of joint points of a target is represented by a node; and
a graph structure output unit that receives the first graph structure as input and, using a graph convolutional network, outputs a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
In order to achieve the above object, a learning model generation device according to one aspect of the present invention includes:
a training data acquisition unit that acquires, as training data, a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points; and
a learning model generation unit that inputs the first graph structure to a graph convolutional network, calculates the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generates a machine learning model constructed by the graph convolutional network by machine-learning parameters in the graph convolutional network so that the calculated difference becomes small,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
In order to achieve the above object, a joint point detection method according to one aspect of the present invention includes:
a graph structure acquisition step of acquiring a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target; and
a graph structure output step of receiving the first graph structure as input and, using a graph convolutional network, outputting a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
In order to achieve the above object, a learning model generation method according to one aspect of the present invention includes:
a training data acquisition step of acquiring, as training data, a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points; and
a machine learning model generation step of inputting the first graph structure to a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generating a machine learning model constructed by the graph convolutional network by machine-learning parameters in the graph convolutional network so that the calculated difference becomes small,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
In order to achieve the above object, a first computer-readable recording medium according to one aspect of the present invention records a program including instructions that cause a computer to execute:
a graph structure acquisition step of acquiring a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target; and
a graph structure output step of receiving the first graph structure as input and, using a graph convolutional network, outputting a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
In order to achieve the above object, a second computer-readable recording medium according to one aspect of the present invention records a program including instructions that cause a computer to execute:
a training data acquisition step of acquiring, as training data, a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points; and
a machine learning model generation step of inputting the first graph structure to a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generating a machine learning model constructed by the graph convolutional network by machine-learning parameters in the graph convolutional network so that the calculated difference becomes small,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
To achieve the above object, a graph convolutional network in one aspect of the present invention includes a plurality of intermediate layers and an output layer, and is constructed by machine-learning a relationship between a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target and a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points,
wherein all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure,
in each of the plurality of intermediate layers, each feature extractor uses, as an input, a graph structure output by each feature extractor in an upper layer, and
the output layer outputs a graph structure by using, as an input, the graph structure output by each feature extractor in the lowest intermediate layer.
As described above, according to the present invention, it is possible to improve the detection accuracy when detecting the three-dimensional coordinates of joint points from an image.
FIG. 1 is a configuration diagram showing a schematic configuration of the learning model generation device according to Embodiment 1.
FIG. 2 is a configuration diagram specifically showing the configuration of the learning model generation device according to Embodiment 1.
FIG. 3 is a configuration diagram showing the configuration of the graph convolutional network according to Embodiment 1.
FIG. 4 is an explanatory diagram for explaining processing in the graph convolutional network shown in FIG. 3.
FIG. 5 is a flow diagram showing the operation of the learning model generation device according to Embodiment 1.
FIG. 6 is a configuration diagram showing a schematic configuration of the joint point detection device according to Embodiment 2.
FIG. 7 is a diagram more specifically showing the configuration of the joint point detection device according to Embodiment 2.
FIG. 8 is a flow diagram showing the operation of the joint point detection device according to Embodiment 2.
FIG. 9 is a block diagram showing an example of a computer that realizes the learning model generation device according to Embodiment 1 and the joint point detection device according to Embodiment 2.
(Embodiment 1)
In Embodiment 1, a learning model generation device, a learning model generation method, and a program for learning model generation, as well as a graph convolutional network, will be described below with reference to FIGS. 1 to 5.
[Device configuration]
First, a schematic configuration of the learning model generation device according to Embodiment 1 will be described with reference to FIG. 1. FIG. 1 is a configuration diagram showing a schematic configuration of the learning model generation device according to Embodiment 1.
The learning model generation device 10 according to Embodiment 1 shown in FIG. 1 is a device that generates a machine learning model for detecting the joint points of a target. As shown in FIG. 1, the learning model generation device 10 includes a training data acquisition unit 11 and a learning model generation unit 12.
The training data acquisition unit 11 acquires, as training data, a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target, and a second graph structure indicating, as correct answer data, a three-dimensional feature amount of each of the plurality of joint points.
The learning model generation unit 12 inputs the first graph structure to a graph convolutional network (GCN) and calculates the difference between the graph structure output from the graph convolutional network and the correct answer data. Then, the learning model generation unit 12 generates a machine learning model constructed by the graph convolutional network by machine-learning the parameters in the graph convolutional network so that the calculated difference becomes small.
The graph convolutional network includes a plurality of intermediate layers and an output layer. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure. In each of the plurality of intermediate layers, each feature extractor uses, as an input, the graph structure output by each feature extractor in the upper layer. The output layer outputs a graph structure by using, as an input, the graph structure output by each feature extractor in the lowest intermediate layer.
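For illustration only, the two kinds of feature extractors described above can be sketched as follows. This is a minimal NumPy sketch, not the claimed implementation; the mean-aggregation convolution and the cluster-averaging pooling are assumptions introduced for the example.

```python
import numpy as np

def graph_conv(X, A, W):
    """Feature extraction without changing the number of nodes: each node
    aggregates its neighbors (and itself) and is linearly transformed.
    X: (N, F) node features, A: (N, N) adjacency, W: (F, F') parameters."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)    # node degrees
    return (A_hat / deg) @ X @ W              # mean aggregation + projection

def graph_pool(X, clusters):
    """Feature extraction with a reduced number of nodes: each group of
    nodes is merged (here by averaging) into one output node.
    clusters: list of index lists, one per output node."""
    return np.stack([X[idx].mean(axis=0) for idx in clusters])

# Tiny example: 4 nodes in a chain, 2-dimensional features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)
W = np.eye(2)
H = graph_conv(X, A, W)               # still 4 nodes
P = graph_pool(H, [[0, 1], [2, 3]])   # reduced to 2 nodes
print(H.shape, P.shape)               # (4, 2) (2, 2)
```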
As described above, in Embodiment 1, the graph convolutional network includes, in its intermediate layers, a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure. Therefore, the graph convolutional network can avoid both a situation in which feature amounts cannot be sufficiently extracted because the number of convolutions is small, and a situation in which feature amounts cannot be accurately extracted because the number of dimensions is insufficient. As a result, according to Embodiment 1, the detection accuracy when detecting the three-dimensional coordinates of joint points from an image is improved.
Next, the configuration of the learning model generation device 10 according to Embodiment 1 will be described more specifically with reference to FIG. 2. FIG. 2 is a configuration diagram specifically showing the configuration of the learning model generation device according to Embodiment 1.
As shown in FIG. 2, in Embodiment 1, the learning model generation device 10 includes a storage unit 13 in addition to the training data acquisition unit 11 and the learning model generation unit 12 described above. The storage unit 13 stores a graph convolutional network 30 (hereinafter referred to as "GCN 30").
In the following description, the case where the target is a human hand is taken as an example. Note that, in Embodiment 1, the target is not limited to a human hand, and may be the entire human body or another body part. The target may be anything that has joint points, and may be something other than a person, for example, a robot.
In Embodiment 1, since the target is a human hand, the nodes constituting the graph structures are represented by two-dimensional or three-dimensional feature amounts of the respective joint points of the hand. A specific example of such a feature amount is a coordinate value.
In FIG. 2, reference numeral 21 denotes a graph structure obtained from image data 20. In the graph structure 21, each node 22 represents, as a feature amount, the two-dimensional coordinate values of a joint point of the hand. The graph structure 21 is the first graph structure. Reference numeral 23 denotes the graph structure serving as the correct answer data, that is, the second graph structure. In the second graph structure 23, each node 24 represents, as a feature amount, the three-dimensional coordinate values of a joint point of the hand.
Note that, in addition to the joint points, the nodes of a graph structure may represent two-dimensional or three-dimensional feature amounts of parts other than the joint points, for example, characteristic parts such as fingertips. In Embodiment 1, the first graph structure serving as training data can be obtained by inputting the image data of the target into a machine learning model that has machine-learned the relationship between image data of joint points and graph structures.
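For illustration, a first graph structure of a hand can be held as a node feature matrix of two-dimensional coordinates together with an adjacency matrix over the joint points. The 21-joint layout and the edge list below are assumptions made for this sketch; the embodiment does not fix the number or connectivity of the joint points.

```python
import numpy as np

# Assumed 21-joint hand skeleton: wrist (0) plus four joints per finger.
NUM_JOINTS = 21
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4),         # thumb
         (0, 5), (5, 6), (6, 7), (7, 8),         # index finger
         (0, 9), (9, 10), (10, 11), (11, 12),    # middle finger
         (0, 13), (13, 14), (14, 15), (15, 16),  # ring finger
         (0, 17), (17, 18), (18, 19), (19, 20)]  # little finger

def hand_adjacency():
    """Undirected adjacency matrix of the assumed hand skeleton."""
    A = np.zeros((NUM_JOINTS, NUM_JOINTS))
    for i, j in EDGES:
        A[i, j] = A[j, i] = 1.0
    return A

# First graph structure: one 2D coordinate (feature amount) per node.
X2d = np.random.rand(NUM_JOINTS, 2)
A = hand_adjacency()
print(A.shape, X2d.shape)   # (21, 21) (21, 2)
```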
In Embodiment 1, the training data acquisition unit 11 acquires, as the training data, the first graph structure 21 and the second graph structure 23 serving as the correct answer data. The training data acquisition unit 11 then inputs the acquired training data to the learning model generation unit 12.
In Embodiment 1, the learning model generation unit 12 first acquires the GCN 30 from the storage unit 13. Next, the learning model generation unit 12 inputs the first graph structure 21 constituting the training data to the GCN 30, and calculates the difference between the second graph structure output from the GCN 30 and the second graph structure 23 serving as the correct answer data. Then, the learning model generation unit 12 updates the parameters of the GCN 30 so that the calculated difference is minimized, and stores the GCN 30 with the updated parameters in the storage unit 13. As a result, a GCN for detecting the three-dimensional coordinates of the joint points of the target is generated.
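The update performed by the learning model generation unit 12 can be sketched, for illustration only, with a single-layer linear stand-in for the multi-layer GCN 30. The use of mean-squared error as the "difference" and gradient descent as the parameter update are assumptions of this sketch, not limitations of the embodiment.

```python
import numpy as np

def training_step(X2d, Y3d, A, W, lr=0.05):
    """One machine-learning update: run the stand-in single-layer network,
    calculate the difference from the correct answer data Y3d, and adjust
    the parameter W so that the difference becomes smaller."""
    A_hat = A + np.eye(A.shape[0])                  # self-loops
    P = A_hat / A_hat.sum(axis=1, keepdims=True)    # mean aggregation
    pred = P @ X2d @ W                              # output graph structure
    diff = pred - Y3d                               # difference from correct data
    loss = (diff ** 2).mean()
    grad = 2.0 * (P @ X2d).T @ diff / diff.size     # gradient of the loss w.r.t. W
    return W - lr * grad, loss

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-joint chain
X2d = rng.random((3, 2))      # first graph structure (2D feature amounts)
Y3d = rng.random((3, 3))      # correct answer data (3D feature amounts)
W = rng.random((2, 3))
losses = []
for _ in range(300):
    W, loss = training_step(X2d, Y3d, A, W)
    losses.append(loss)
print(losses[0], losses[-1])  # the difference shrinks over the updates
```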
Next, the configuration and functions of the graph convolutional network 30 according to Embodiment 1 will be specifically described with reference to FIGS. 3 and 4. FIG. 3 is a configuration diagram showing the configuration of the graph convolutional network according to Embodiment 1. FIG. 4 is an explanatory diagram for explaining processing in the graph convolutional network shown in FIG. 3.
As shown in FIG. 3, the GCN 30 includes an input layer 31, a plurality of intermediate layers, and an output layer 33. The input layer 31 accepts the input of the first graph structure 21. The plurality of intermediate layers consist of first intermediate layers 32a and second intermediate layers 32b, and perform feature extraction on the first graph structure. The output layer 33 outputs the second graph structure by using, as an input, the graph structure output by each feature extractor in the lowest intermediate layer.
As shown in FIG. 3, each first intermediate layer 32a includes only first feature extractors ("○" in FIG. 3), which perform feature extraction without changing the number of nodes of a graph structure. The first feature extractors perform convolution. In each first intermediate layer 32a, each feature extractor uses, as an input, the graph structure output by a feature extractor in the upper layer that performs feature extraction with the same number of nodes. When there are a plurality of input-source feature extractors, a feature extractor of the first intermediate layer 32a concatenates the graph structures output by those feature extractors and then performs convolution.
Each second intermediate layer 32b includes one or both of a second feature extractor ("●" in FIG. 3), which performs feature extraction while reducing the number of nodes of a graph structure, and a third feature extractor (hatched "○" in FIG. 3), which performs feature extraction while increasing the number of nodes of a graph structure. The second feature extractors perform pooling, and the third feature extractors perform unpooling. Each second intermediate layer 32b also includes first feature extractors ("○" in FIG. 3). In each second intermediate layer 32b, each feature extractor uses, as inputs, the plurality of graph structures output by the feature extractors in the upper layer.
The input layer 31 includes a first feature extractor ("○" in FIG. 3). The output layer 33 includes a first feature extractor ("○" in FIG. 3) and a third feature extractor (hatched "○" in FIG. 3).
As shown in FIG. 4, when the first graph structure is input to the GCN 30 configured as described above, convolution, pooling, and unpooling are performed in the intermediate layers. Furthermore, in the intermediate layers, information is exchanged between feature extractors that operate on graph structures with a larger number of nodes and feature extractors that operate on graph structures with a smaller number of nodes.
That is, as shown in FIG. 4, in the intermediate layers, the feature extractors generate graph structures with different numbers of nodes, and graph structures with different numbers of nodes are exchanged between the feature extractors. In FIG. 4, the number attached to each graph structure indicates its number of nodes. As a result, according to the GCN 30, both a situation in which feature amounts cannot be sufficiently extracted because the number of convolutions is small and a situation in which feature amounts cannot be accurately extracted because the number of dimensions is insufficient are suppressed.
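The exchange of graph structures between node-count resolutions can be sketched, for illustration only, as follows. The concatenation of inputs follows the description of the first intermediate layer above; the averaging pooling, the copy-back unpooling, and the fixed cluster assignment are assumptions of this sketch.

```python
import numpy as np

def pool(X, clusters):
    """Reduce the node count by averaging each cluster into one node."""
    return np.stack([X[c].mean(axis=0) for c in clusters])

def unpool(X, clusters, n_nodes):
    """Increase the node count by copying each cluster's feature
    back to all of its member nodes."""
    out = np.zeros((n_nodes, X.shape[1]))
    for k, c in enumerate(clusters):
        out[c] = X[k]
    return out

def exchange(X_hi, X_lo, clusters):
    """One inter-resolution exchange: each stream receives the other
    stream's output, resampled to its own node count, and concatenates
    it with its own features before the next convolution."""
    n_hi = X_hi.shape[0]
    hi_in = np.concatenate([X_hi, unpool(X_lo, clusters, n_hi)], axis=1)
    lo_in = np.concatenate([X_lo, pool(X_hi, clusters)], axis=1)
    return hi_in, lo_in

clusters = [[0, 1], [2, 3]]            # 4-node graph pooled to 2 nodes
X_hi = np.arange(8.0).reshape(4, 2)    # high-resolution stream
X_lo = np.arange(4.0).reshape(2, 2)    # low-resolution stream
hi_in, lo_in = exchange(X_hi, X_lo, clusters)
print(hi_in.shape, lo_in.shape)        # (4, 4) (2, 4)
```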
[Device operation]
Next, the operation of the learning model generation device 10 according to Embodiment 1 will be described with reference to FIG. 5. FIG. 5 is a flow diagram showing the operation of the learning model generation device according to Embodiment 1. FIGS. 1 to 4 will be referred to as needed in the following description. In Embodiment 1, the learning model generation method is implemented by operating the learning model generation device 10. Therefore, the following description of the operation of the learning model generation device also serves as the description of the learning model generation method according to Embodiment 1.
First, as shown in FIG. 5, the training data acquisition unit 11 acquires, as training data, a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target, and a second graph structure indicating, as correct answer data, a three-dimensional feature amount of each of the plurality of joint points (step A1).
Next, the learning model generation unit 12 inputs the first graph structure acquired as the training data in step A1 to the GCN 30, and calculates the difference between the graph structure output from the GCN and the correct answer data. Then, the learning model generation unit 12 updates the parameters of the GCN so that the calculated difference becomes small (step A2).
After that, the learning model generation unit 12 stores the GCN whose parameters have been updated in step A2 in the storage unit 13 (step A3). As a result, a GCN that can detect the three-dimensional coordinates of joint points is generated.
As described above, according to Embodiment 1, a GCN that can sufficiently and accurately extract feature amounts is constructed, so that the detection accuracy when detecting the three-dimensional coordinates of the joint points of a target from an image is improved.
[Program]
The program for learning model generation according to Embodiment 1 may be any program that causes a computer to execute steps A1 to A3 shown in FIG. 5. By installing this program in a computer and executing it, the learning model generation device and the learning model generation method according to Embodiment 1 can be realized. In this case, the processor of the computer functions as the training data acquisition unit 11 and the learning model generation unit 12 and performs the processing.
In Embodiment 1, the storage unit 13 may be realized by storing the data files constituting the GCN 30 in a storage device such as a hard disk provided in the computer, or may be realized by a storage device of another computer. Examples of the computer include, in addition to a general-purpose PC, a smartphone and a tablet terminal device.
The program for learning model generation according to Embodiment 1 may be executed by a computer system constructed from a plurality of computers. In this case, for example, each computer may function as either the training data acquisition unit 11 or the learning model generation unit 12.
(Embodiment 2)
Next, in Embodiment 2, a joint point detection device, a joint point detection method, and a program for joint point detection will be described with reference to FIGS. 6 to 8.
[Device configuration]
First, a schematic configuration of the joint point detection device according to Embodiment 2 will be described with reference to FIG. 6. FIG. 6 is a configuration diagram showing a schematic configuration of the joint point detection device according to Embodiment 2.
The joint point detection device 40 according to Embodiment 2 shown in FIG. 6 is a device for detecting the joint points of a target, for example, a living body or a robot. As shown in FIG. 6, the joint point detection device 40 includes a graph structure acquisition unit 41 and a graph structure output unit 42.
The graph structure acquisition unit 41 acquires a first graph structure in which the two-dimensional feature amount of each of a plurality of joint points of a target is represented by a node. The graph structure output unit 42 takes the first graph structure as an input and, using a graph convolutional network, outputs a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points.
The graph convolutional network includes a plurality of intermediate layers and an output layer, and is constructed by machine-learning the relationship between the first graph structure and the second graph structure. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure.
In each of the plurality of intermediate layers, each feature extractor uses, as an input, the graph structure output by each feature extractor in the upper layer. The output layer outputs a graph structure by using, as an input, the graph structure output by each feature extractor in the lowest intermediate layer.
As described above, in Embodiment 2, the second graph structure is output using a graph convolutional network that includes, in its intermediate layers, a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure. Therefore, according to Embodiment 2, it is possible to avoid both a situation in which feature amounts cannot be sufficiently extracted because the number of convolutions is small, and a situation in which feature amounts cannot be accurately extracted because the number of dimensions is insufficient. As a result, according to Embodiment 2, the detection accuracy when detecting the three-dimensional coordinates of joint points from an image can be improved.
Next, the configuration and functions of the joint point detection device 40 according to Embodiment 2 will be specifically described with reference to FIG. 7. FIG. 7 is a diagram more specifically showing the configuration of the joint point detection device according to Embodiment 2.
As shown in FIG. 7, in Embodiment 2, the joint point detection device 40 includes a storage unit 43 in addition to the graph structure acquisition unit 41 and the graph structure output unit 42 described above. The storage unit 43 stores the GCN 30 shown in FIG. 2 in Embodiment 1.
In Embodiment 2 as well, the case where the target is a human hand is taken as an example. Note that, in Embodiment 2 as well, the target of joint point detection is not limited to a human hand, and may be the entire human body or another body part. The target of joint point detection may be anything that has joint points, and may be something other than a person, for example, a robot.
In addition, in Embodiment 2 as well, examples of the feature amounts indicated by the nodes of the graph structures include two-dimensional or three-dimensional coordinate values. Furthermore, in addition to the joint points, the nodes of a graph structure may represent two-dimensional or three-dimensional feature amounts of parts other than the joint points, for example, characteristic parts such as fingertips.
In Embodiment 2, as shown in FIG. 7, the graph structure acquisition unit 41 acquires a first graph structure 50 obtained from image data of a human hand. As described in Embodiment 1, the first graph structure can be obtained by inputting the image data into a machine learning model that has machine-learned the relationship between image data of joint points and graph structures.
The graph structure output unit 42 acquires the GCN 30 from the storage unit 43. Then, the graph structure output unit 42 inputs the first graph structure 50 to the GCN 30, and causes the GCN to output a second graph structure 51 indicating a three-dimensional feature amount of each of the plurality of joint points.
As described in Embodiment 1, the GCN 30 includes the input layer 31, the plurality of intermediate layers 32a and 32b, and the output layer 33, and is constructed by machine-learning the relationship between the first graph structure and the second graph structure. Therefore, in the output second graph structure 51, the three-dimensional feature amount (coordinate values) indicated by each node is a highly accurate value.
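At inference time, the graph structure output unit simply runs the first graph structure through the trained network. The sketch below is for illustration only: a single-layer linear mapping stands in for the trained multi-layer GCN 30, and in a real run the learned parameters would be loaded from the storage unit 43.

```python
import numpy as np

def detect_joint_points(X2d, A, W):
    """Output a second graph structure (one 3D feature amount per node)
    from a first graph structure (one 2D feature amount per node).
    W stands in for the learned parameters of the GCN 30."""
    A_hat = A + np.eye(A.shape[0])                  # self-loops
    P = A_hat / A_hat.sum(axis=1, keepdims=True)    # mean aggregation
    return P @ X2d @ W    # (N, 2) -> (N, 3): 3D coordinates per joint point

N = 5
A = np.eye(N, k=1) + np.eye(N, k=-1)   # chain of joint points
X2d = np.random.rand(N, 2)             # first graph structure 50
W = np.random.rand(2, 3)
Y3d = detect_joint_points(X2d, A, W)   # second graph structure 51
print(Y3d.shape)                       # (5, 3)
```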
[Device operation]
Next, the operation of the joint point detection device 40 according to Embodiment 2 will be described with reference to FIG. 8. FIG. 8 is a flow diagram showing the operation of the joint point detection device according to Embodiment 2. FIGS. 6 and 7 will be referred to as needed in the following description. In Embodiment 2, the joint point detection method is implemented by operating the joint point detection device 40. Therefore, the following description of the operation of the joint point detection device 40 also serves as the description of the joint point detection method according to Embodiment 2.
As shown in FIG. 8, first, the graph structure acquisition unit 41 acquires a first graph structure 50 obtained from image data of a human hand (step B1).
Next, the graph structure output unit 42 inputs the first graph structure to the GCN 30, and causes the GCN to output a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points (step B2).
In the second graph structure, each node represents the three-dimensional coordinates of a joint point of the target human hand, so the joint points of the human hand have been detected by step B2. As described above, since the GCN 30 is used in Embodiment 2, the three-dimensional feature amount (coordinate values) indicated by each node in the output second graph structure is a highly accurate value.
[Program]
The program for joint point detection according to Embodiment 2 may be any program that causes a computer to execute steps B1 and B2 shown in FIG. 8. By installing this program in a computer and executing it, the joint point detection device and the joint point detection method according to Embodiment 2 can be realized. In this case, the processor of the computer functions as the graph structure acquisition unit 41 and the graph structure output unit 42 and performs the processing.
In Embodiment 2, the storage unit 43 may be realized by storing the data files constituting the GCN 30 in a storage device such as a hard disk provided in the computer, or may be realized by a storage device of another computer. Examples of the computer include, in addition to a general-purpose PC, a smartphone and a tablet terminal device.
The program for joint point detection according to Embodiment 2 may be executed by a computer system constructed from a plurality of computers. In this case, for example, each computer may function as either the graph structure acquisition unit 41 or the graph structure output unit 42.
[Physical configuration]
Here, a computer that realizes the learning model generation device 10 by executing the program according to Embodiment 1, and a computer that realizes the joint point detection device 40 by executing the program according to Embodiment 2, will be described with reference to FIG. 9. FIG. 9 is a block diagram showing an example of a computer that realizes the learning model generation device according to Embodiment 1 and the joint point detection device according to Embodiment 2.
 As shown in FIG. 9, a computer 110 includes a CPU (Central Processing Unit) 111, a main memory 112, a storage device 113, an input interface 114, a display controller 115, a data reader/writer 116, and a communication interface 117. These units are connected to one another via a bus 121 so that they can exchange data.
 The computer 110 may also include a GPU (Graphics Processing Unit) or an FPGA (Field-Programmable Gate Array) in addition to, or instead of, the CPU 111. In this aspect, the GPU or FPGA can execute the program in the embodiments.
 The CPU 111 loads the program in the embodiments, which consists of a group of code stored in the storage device 113, into the main memory 112 and executes the code in a predetermined order, thereby carrying out various operations. The main memory 112 is typically a volatile storage device such as a DRAM (Dynamic Random Access Memory).
 The program in the embodiments is provided in a state stored in a computer-readable recording medium 120. The program in the embodiments may also be distributed over the Internet, to which the computer is connected via the communication interface 117.
 Specific examples of the storage device 113 include a hard disk drive and a semiconductor storage device such as a flash memory. The input interface 114 mediates data transmission between the CPU 111 and input devices 118 such as a keyboard and a mouse. The display controller 115 is connected to a display device 119 and controls display on the display device 119.
 The data reader/writer 116 mediates data transmission between the CPU 111 and the recording medium 120; it reads the program from the recording medium 120 and writes processing results of the computer 110 to the recording medium 120. The communication interface 117 mediates data transmission between the CPU 111 and other computers.
 Specific examples of the recording medium 120 include general-purpose semiconductor storage devices such as CF (Compact Flash (registered trademark)) and SD (Secure Digital), magnetic recording media such as flexible disks, and optical recording media such as CD-ROM (Compact Disk Read Only Memory).
 The learning model generation device 10 and the joint point detection device 40 can also be realized by using hardware corresponding to each unit, rather than a computer in which the program is installed. Furthermore, the learning model generation device 10 and the joint point detection device 40 may be partly realized by a program and partly by hardware.
 Some or all of the above-described embodiments can be expressed by, but are not limited to, the following (Appendix 1) to (Appendix 11).
(Appendix 1)
 A joint point detection device comprising:
 a graph structure acquisition unit that acquires a first graph structure in which a two-dimensional feature of each of a plurality of joint points of a target is represented by a node; and
 a graph structure output unit that takes the first graph structure as input and, using a graph convolutional network, outputs a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine learning the relationship between the first graph structure and the second graph structure,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
(Appendix 2)
 The joint point detection device according to Appendix 1, wherein,
 in the graph convolutional network, the plurality of intermediate layers include first intermediate layers and second intermediate layers,
 the first intermediate layers include only feature extractors that perform feature extraction without changing the number of nodes of a graph structure, and in each of the first intermediate layers, each feature extractor uses, as input, the graph structure output by the feature extractor of the layer above that performs feature extraction with the same number of nodes, and
 the second intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure, together with one or both of a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure and a feature extractor that performs feature extraction while increasing the number of nodes of a graph structure, and in each of the second intermediate layers, each feature extractor uses, as input, a plurality of graph structures output by the feature extractors of the layer above.
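The distinction drawn here between node-preserving extraction and node-reducing (or node-enlarging) extraction can be illustrated with a pooling matrix. This is only a hedged sketch: the assignment matrix `S` and the six-node coarse grouping (wrist plus five fingers) are hypothetical choices for illustration, not a scheme fixed by the text.

```python
import numpy as np

# Node-preserving extraction keeps the 21-node graph; node-reducing
# extraction pools nodes with an assignment matrix S; node-enlarging
# extraction restores nodes with S (acting as unpooling). A second
# intermediate layer's extractor may then fuse several upper outputs.

N, C = 21, 8
rng = np.random.default_rng(1)
H = rng.random((N, C))                 # features output by the layer above

# Assignment: node 0 -> coarse node 0 (wrist); joints 1-4 -> finger 1, etc.
S = np.zeros((N, 6))
S[0, 0] = 1.0
for f in range(5):
    for j in range(1 + 4 * f, 5 + 4 * f):
        S[j, f + 1] = 1.0
S /= S.sum(axis=0)                     # average-pool within each group

W_keep = rng.standard_normal((C, C))
W_pool = rng.standard_normal((C, C))

H_keep = np.maximum(H @ W_keep, 0.0)          # same node count: (21, C)
H_pool = np.maximum(S.T @ H @ W_pool, 0.0)    # reduced node count: (6, C)

# Fuse a full-resolution branch with an unpooled coarse branch,
# i.e. one extractor taking plural upper graph structures as input.
H_fused = H_keep + S @ H_pool                 # back to (21, C)

print(H_keep.shape, H_pool.shape, H_fused.shape)
```

Reducing the node count lets deeper extractors aggregate per-finger context, while the fused branch retains per-joint resolution; this mirrors why a second intermediate layer consumes plural graph structures from the layer above.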
(Appendix 3)
 The joint point detection device according to Appendix 1 or 2, wherein
 the first graph structure indicates the two-dimensional coordinates of each of the joint points, and
 the second graph structure indicates the three-dimensional coordinates of each of the joint points.
(Appendix 4)
 A learning model generation device comprising:
 a training data acquisition unit that acquires, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature of each of the plurality of joint points; and
 a learning model generation unit that inputs the first graph structure to a graph convolutional network, calculates the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generates a machine learning model constructed by the graph convolutional network by machine learning the parameters of the graph convolutional network so that the calculated difference becomes small,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
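The training procedure of this appendix — output a graph structure, compute the difference from the correct-answer data, and adjust the parameters so that the difference becomes small — can be sketched as follows. A single linear graph layer stands in for the full multi-layer network, and the synthetic data, learning rate, and iteration count are assumptions for illustration only.

```python
import numpy as np

# Toy training loop: mean squared error between the network's output graph
# structure and the correct-answer 3-D graph structure, minimized by
# gradient descent on the layer's parameters W.

N = 21
rng = np.random.default_rng(2)
A_hat = np.eye(N)                 # identity stands in for the normalized adjacency
X = rng.random((N, 2))            # first graph structure (2-D features)
W_true = rng.standard_normal((2, 3))
Y = A_hat @ X @ W_true            # correct-answer second graph structure

W = np.zeros((2, 3))              # parameters to be machine-learned
lr = 0.5
for _ in range(2000):
    pred = A_hat @ X @ W          # graph structure output by the network
    diff = pred - Y               # difference from the correct-answer data
    loss = (diff ** 2).mean()
    grad = 2.0 * (A_hat @ X).T @ diff / diff.size
    W -= lr * grad                # update so the difference becomes small

print(loss)                       # small after training
```

In the embodiment the same principle applies, except that the forward pass runs through all intermediate layers and the output layer of the GCN, and backpropagation updates every extractor's parameters.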
(Appendix 5)
 The learning model generation device according to Appendix 4, wherein,
 in the graph convolutional network, the plurality of intermediate layers include first intermediate layers and second intermediate layers,
 the first intermediate layers include only feature extractors that perform feature extraction without changing the number of nodes of a graph structure, and in each of the first intermediate layers, each feature extractor uses, as input, the graph structure output by the feature extractor of the layer above that performs feature extraction with the same number of nodes, and
 the second intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure, together with one or both of a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure and a feature extractor that performs feature extraction while increasing the number of nodes of a graph structure, and in each of the second intermediate layers, each feature extractor uses, as input, a plurality of graph structures output by the feature extractors of the layer above.
(Appendix 6)
 The learning model generation device according to Appendix 4 or 5, wherein
 the first graph structure indicates the two-dimensional coordinates of each of the joint points, and
 the second graph structure indicates the three-dimensional coordinates of each of the joint points.
(Appendix 7)
 A joint point detection method comprising:
 a graph structure acquisition step of acquiring a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target; and
 a graph structure output step of taking the first graph structure as input and, using a graph convolutional network, outputting a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine learning the relationship between the first graph structure and the second graph structure,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
(Appendix 8)
 A learning model generation method comprising:
 a training data acquisition step of acquiring, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature of each of the plurality of joint points; and
 a machine learning model generation step of inputting the first graph structure to a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generating a machine learning model constructed by the graph convolutional network by machine learning the parameters of the graph convolutional network so that the calculated difference becomes small,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
(Appendix 9)
 A computer-readable recording medium on which a program is recorded, the program including instructions that cause a computer to execute:
 a graph structure acquisition step of acquiring a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target; and
 a graph structure output step of taking the first graph structure as input and, using a graph convolutional network, outputting a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine learning the relationship between the first graph structure and the second graph structure,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
(Appendix 10)
 A computer-readable recording medium on which a program is recorded, the program including instructions that cause a computer to execute:
 a training data acquisition step of acquiring, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature of each of the plurality of joint points; and
 a machine learning model generation step of inputting the first graph structure to a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generating a machine learning model constructed by the graph convolutional network by machine learning the parameters of the graph convolutional network so that the calculated difference becomes small,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
(Appendix 11)
 A graph convolutional network comprising a plurality of intermediate layers and an output layer,
 the graph convolutional network being constructed by machine learning the relationship between a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
 wherein all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
 Although the present invention has been described above with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
 This application claims priority based on Japanese Patent Application No. 2021-029412 filed on February 26, 2021, the entire disclosure of which is incorporated herein.
 As described above, according to the present invention, it is possible to improve the detection accuracy when detecting the three-dimensional coordinates of joint points from an image. The present invention is useful in fields that require posture detection of objects having joint points, such as people and robots. Specific fields include video surveillance and user interfaces.
REFERENCE SIGNS LIST
 10 Learning model generation device
 11 Training data acquisition unit
 12 Learning model generation unit
 13 Storage unit
 20 Image data
 21 First graph structure
 22 Node
 23 Second graph structure
 24 Node
 30 Graph convolutional network (GCN)
 31 Input layer
 32a First intermediate layer
 32b Second intermediate layer
 33 Output layer
 40 Joint point detection device
 41 Graph structure acquisition unit
 42 Graph structure output unit
 43 Storage unit
 50 First graph structure
 51 Second graph structure
 110 Computer
 111 CPU
 112 Main memory
 113 Storage device
 114 Input interface
 115 Display controller
 116 Data reader/writer
 117 Communication interface
 118 Input device
 119 Display device
 120 Recording medium
 121 Bus

Claims (11)

  1.  A joint point detection device comprising:
     graph structure acquisition means for acquiring a first graph structure in which a two-dimensional feature of each of a plurality of joint points of a target is represented by a node; and
     graph structure output means for taking the first graph structure as input and, using a graph convolutional network, outputting a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
     wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine learning the relationship between the first graph structure and the second graph structure,
     all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
     the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
  2.  The joint point detection device according to claim 1, wherein,
     in the graph convolutional network, the plurality of intermediate layers include first intermediate layers and second intermediate layers,
     the first intermediate layers include only feature extractors that perform feature extraction without changing the number of nodes of a graph structure, and in each of the first intermediate layers, each feature extractor uses, as input, the graph structure output by the feature extractor of the layer above that performs feature extraction with the same number of nodes, and
     the second intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure, together with one or both of a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure and a feature extractor that performs feature extraction while increasing the number of nodes of a graph structure, and in each of the second intermediate layers, each feature extractor uses, as input, a plurality of graph structures output by the feature extractors of the layer above.
  3.  The joint point detection device according to claim 1 or 2, wherein
     the first graph structure indicates the two-dimensional coordinates of each of the joint points, and
     the second graph structure indicates the three-dimensional coordinates of each of the joint points.
  4.  A learning model generation device comprising:
     training data acquisition means for acquiring, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature of each of the plurality of joint points; and
     learning model generation means for inputting the first graph structure to a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generating a machine learning model constructed by the graph convolutional network by machine learning the parameters of the graph convolutional network so that the calculated difference becomes small,
     wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
     all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
     the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
  5.  The learning model generation device according to claim 4, wherein,
     in the graph convolutional network, the plurality of intermediate layers include first intermediate layers and second intermediate layers,
     the first intermediate layers include only feature extractors that perform feature extraction without changing the number of nodes of a graph structure, and in each of the first intermediate layers, each feature extractor uses, as input, the graph structure output by the feature extractor of the layer above that performs feature extraction with the same number of nodes, and
     the second intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure, together with one or both of a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure and a feature extractor that performs feature extraction while increasing the number of nodes of a graph structure, and in each of the second intermediate layers, each feature extractor uses, as input, a plurality of graph structures output by the feature extractors of the layer above.
  6.  The learning model generation device according to claim 4 or 5, wherein
     the first graph structure indicates the two-dimensional coordinates of each of the joint points, and
     the second graph structure indicates the three-dimensional coordinates of each of the joint points.
  7.  A joint point detection method comprising:
     acquiring a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target; and
     using the first graph structure as input to a graph convolutional network to output a second graph structure indicating a three-dimensional feature of each of the plurality of joint points, wherein
     the graph convolutional network comprises a plurality of intermediate layers and an output layer, and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
     all or some of the plurality of intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes in the graph structure and a feature extractor that performs feature extraction with a reduced number of nodes in the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors in the upper layer, and
     the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
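The detection method above — take a 2-D per-joint graph, pass it through stacked intermediate layers, and let the output layer emit a 3-D per-joint graph — can be sketched as a minimal forward pass. This is an assumed toy implementation, not the patented network: the chain-skeleton adjacency, layer widths, and weight values are placeholders.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    # one feature-extraction step: neighbourhood aggregation + linear map + ReLU
    return np.maximum(adj @ feats @ weight, 0.0)

def lift_2d_to_3d(adj, joints_2d, hidden_weights, out_weight):
    # Each intermediate layer consumes the graph structure output by the
    # layer above; the output layer maps the lowest layer's features to
    # per-joint 3-D coordinates (no ReLU, so coordinates may be negative).
    h = joints_2d
    for w in hidden_weights:
        h = gcn_layer(adj, h, w)
    return adj @ h @ out_weight

rng = np.random.default_rng(1)
n = 17
# toy chain skeleton: each joint linked to its neighbours, plus self-loops
adj = np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
joints_2d = rng.standard_normal((n, 2))                    # first graph structure
hidden = [rng.standard_normal((2, 16)), rng.standard_normal((16, 16))]
out_w = rng.standard_normal((16, 3))
joints_3d = lift_2d_to_3d(adj, joints_2d, hidden, out_w)   # second graph structure
```

The input and output share the same node set (one node per joint); only the per-node feature changes, from a 2-D quantity to a 3-D one.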
  8.  A learning model generation method comprising:
     acquiring, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and a second graph structure, serving as correct data, indicating a three-dimensional feature of each of the plurality of joint points; and
     inputting the first graph structure into a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct data, and machine-learning the parameters of the graph convolutional network so that the calculated difference becomes small, thereby generating a machine learning model constructed by the graph convolutional network, wherein
     the graph convolutional network comprises a plurality of intermediate layers and an output layer,
     all or some of the plurality of intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes in the graph structure and a feature extractor that performs feature extraction with a reduced number of nodes in the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors in the upper layer, and
     the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
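The training procedure above — compute the difference between the network's output graph and the correct data, then adjust parameters so the difference shrinks — reduces, in the simplest case, to gradient descent on a squared-error loss. The sketch below uses a single linear graph-convolution layer so the gradient can be written by hand; the identity adjacency, learning rate, and step count are illustrative assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(2)
n, f_in, f_out = 17, 2, 3
adj = np.eye(n)                               # trivial adjacency for the sketch
x = rng.standard_normal((n, f_in))            # first graph structure (2-D features)
y = rng.standard_normal((n, f_out))           # correct data (3-D features)
w = np.zeros((f_in, f_out))                   # the parameter being learned

def difference(w):
    # mean squared difference between the network output and the correct data
    return float(((adj @ x @ w - y) ** 2).mean())

loss_before = difference(w)
for _ in range(200):                          # gradient steps shrink the difference
    residual = adj @ x @ w - y
    grad = 2.0 * (adj @ x).T @ residual / residual.size
    w -= 0.1 * grad
loss_after = difference(w)
```

After training, `loss_after` is smaller than `loss_before`, which is exactly the criterion the claim states: the parameters are learned so that the calculated difference becomes small.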
  9.  A computer-readable recording medium recording a program including instructions that cause a computer to:
     acquire a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target; and
     use the first graph structure as input to a graph convolutional network to output a second graph structure indicating a three-dimensional feature of each of the plurality of joint points, wherein
     the graph convolutional network comprises a plurality of intermediate layers and an output layer, and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
     all or some of the plurality of intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes in the graph structure and a feature extractor that performs feature extraction with a reduced number of nodes in the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors in the upper layer, and
     the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
  10.  A computer-readable recording medium recording a program including instructions that cause a computer to:
     acquire, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and a second graph structure, serving as correct data, indicating a three-dimensional feature of each of the plurality of joint points; and
     input the first graph structure into a graph convolutional network, calculate the difference between the graph structure output from the graph convolutional network and the correct data, and machine-learn the parameters of the graph convolutional network so that the calculated difference becomes small, thereby generating a machine learning model constructed by the graph convolutional network, wherein
     the graph convolutional network comprises a plurality of intermediate layers and an output layer,
     all or some of the plurality of intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes in the graph structure and a feature extractor that performs feature extraction with a reduced number of nodes in the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors in the upper layer, and
     the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
  11.  A graph convolutional network comprising a plurality of intermediate layers and an output layer, wherein
     the graph convolutional network is constructed by machine-learning the relationship between a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
     all or some of the plurality of intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes in the graph structure and a feature extractor that performs feature extraction with a reduced number of nodes in the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors in the upper layer, and
     the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
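Putting the architectural claim together, one plausible wiring (an assumption for illustration, not the patented design) is a two-branch forward pass: within an intermediate layer, one extractor keeps the node count while another reduces it, and the next layer consumes both upper-layer outputs before the output layer produces the 3-D graph. The pooling matrix, adjacencies, and widths below are all placeholders.

```python
import numpy as np

def extract(adj, feats, weight):
    # neighbourhood aggregation followed by a learned linear map and ReLU
    return np.maximum(adj @ feats @ weight, 0.0)

rng = np.random.default_rng(3)
n, m = 17, 5                                   # fine / coarse node counts (assumed)
adj_f, adj_c = np.eye(n), np.eye(m)            # placeholder adjacencies
pool = rng.standard_normal((m, n)) / n         # node-reducing map (17 -> 5)
x = rng.standard_normal((n, 2))                # 2-D feature per joint

# intermediate layer 1: one extractor keeps 17 nodes, another reduces to 5
h_fine = extract(adj_f, x, rng.standard_normal((2, 8)))
h_coarse = extract(adj_c, pool @ x, rng.standard_normal((2, 8)))

# intermediate layer 2: its extractor consumes both upper-layer outputs
# (the coarse branch is lifted back to 17 nodes before concatenation)
h2 = extract(adj_f, np.concatenate([h_fine, pool.T @ h_coarse], axis=1),
             rng.standard_normal((16, 8)))

# output layer: maps the lowest intermediate layer's features to 3-D coordinates
joints_3d = adj_f @ h2 @ rng.standard_normal((8, 3))
```

The coarse branch plays the same role as the contracting path of a U-Net (cf. the cited R2U-Net and mU-Net references): the reduced-node graph captures wider context, which is fused back into the full-resolution joint graph before the final 3-D output.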
PCT/JP2022/003767 2021-02-26 2022-02-01 Joint point detection device, teaching model generation device, joint point detection method, teaching model generation method, and computer-readable recording medium WO2022181253A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2023502226A JPWO2022181253A5 (en) 2022-02-01 Joint point detection device, learning model generation device, joint point detection method, learning model generation method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021029412 2021-02-26
JP2021-029412 2021-02-26

Publications (1)

Publication Number Publication Date
WO2022181253A1 true WO2022181253A1 (en) 2022-09-01

Family

ID=83048173

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/003767 WO2022181253A1 (en) 2021-02-26 2022-02-01 Joint point detection device, teaching model generation device, joint point detection method, teaching model generation method, and computer-readable recording medium

Country Status (1)

Country Link
WO (1) WO2022181253A1 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DOOSTI BARDIA; NAHA SHUJON; MIRBAGHERI MAJID; CRANDALL DAVID J: "HOPE-Net: A Graph-Based Model for Hand-Object Pose Estimation", 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 13 June 2020 (2020-06-13), pages 6607 - 6616, XP033804923, DOI: 10.1109/CVPR42600.2020.00664 *
MD ZAHANGIR ALOM; MAHMUDUL HASAN; CHRIS YAKOPCIC; TAREK M. TAHA; VIJAYAN K. ASARI: "Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation", ARXIV.ORG, 20 February 2018 (2018-02-20), pages 1 - 12, XP081216782 *
SEO HYUNSEOK; HUANG CHARLES; BASSENNE MAXIME; XIAO RUOXIU; XING LEI: "Modified U-Net (mU-Net) With Incorporation of Object-Dependent High Level Features for Improved Liver and Liver-Tumor Segmentation in CT Images", IEEE TRANSACTIONS ON MEDICAL IMAGING, vol. 39, no. 5, 18 October 2019 (2019-10-18), USA, pages 1316 - 1325, XP011785778, ISSN: 0278-0062, DOI: 10.1109/TMI.2019.2948320 *

Also Published As

Publication number Publication date
JPWO2022181253A1 (en) 2022-09-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22759294

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023502226

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22759294

Country of ref document: EP

Kind code of ref document: A1