WO2022181253A1 - Joint point detection device, teaching model generation device, joint point detection method, teaching model generation method, and computer-readable recording medium - Google Patents

Joint point detection device, teaching model generation device, joint point detection method, teaching model generation method, and computer-readable recording medium

Info

Publication number
WO2022181253A1
Authority
WO
WIPO (PCT)
Prior art keywords
graph structure
graph
feature
feature extractor
output
Prior art date
Application number
PCT/JP2022/003767
Other languages
French (fr)
Japanese (ja)
Inventor
遊哉 石井
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社 filed Critical 日本電気株式会社
Priority to JP2023502226A priority Critical patent/JPWO2022181253A5/en
Publication of WO2022181253A1 publication Critical patent/WO2022181253A1/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis

Definitions

  • The present invention relates to a joint point detection device and a joint point detection method for detecting joint points of an object from an image, and further relates to a computer-readable recording medium recording a program for realizing these.
  • The present invention also relates to a learning model generation device and a learning model generation method for generating a learning model for detecting joint points of an object from an image, and further relates to a computer-readable recording medium recording a program for realizing these.
  • Non-Patent Document 1 discloses a system for estimating the posture of a person, in particular the posture of a person's hand, from an image.
  • The system disclosed in Non-Patent Document 1 first acquires image data that includes an image of a hand, and estimates the two-dimensional coordinates of each joint point of the hand.
  • Next, the system of Non-Patent Document 1 inputs the estimated two-dimensional coordinates of each joint point to a graph convolution network (hereinafter also referred to as a "GCN") and estimates the three-dimensional coordinates of each joint point. A GCN is a network that takes as input a graph structure composed of a plurality of nodes and performs convolution processing using adjacent nodes (see, for example, Patent Document 1).
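The convolution processing using adjacent nodes described above can be sketched as follows. This is the generic normalized graph convolution, given here only as an illustration; the function and variable names are not from the patent:

```python
import numpy as np

def graph_conv(features, adjacency, weights):
    """One graph-convolution step: each node aggregates its own and its
    neighbours' feature vectors, then applies a learned linear map."""
    # Add self-loops so each node keeps its own features.
    a_hat = adjacency + np.eye(adjacency.shape[0])
    # Symmetric degree normalization, commonly used in GCNs.
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(deg ** -0.5)
    norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
    return np.maximum(norm @ features @ weights, 0.0)  # ReLU activation

# 3-node chain graph (e.g. three joint points along one finger), 2-D features.
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
feats = np.random.randn(3, 2)   # one 2-D feature amount per node
w = np.random.randn(2, 4)       # lift 2 input channels to 4
out = graph_conv(feats, adj, w)
print(out.shape)                # (3, 4): the node count is unchanged
```

Note that this operation changes only the per-node feature dimension, not the number of nodes; node-count changes are handled by the pooling and unpooling described next.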
  • In the front stage, the GCN performs pooling processing multiple times, each time reducing the number of nodes of the input graph structure, until the number of nodes finally becomes one. In the rear stage, the GCN performs unpooling processing, which increases the number of nodes, on the graph structure having one node, the same number of times as the pooling processing. In addition, in the rear stage, the GCN concatenates the graph structure being processed with the front-stage graph structure having the same number of nodes and executes convolution, and finally outputs a graph structure having the same number of nodes as the input.
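The pooling and unpooling stages with same-node-count concatenation can be illustrated with a toy sketch. Mean-pooling over fixed node groups and copy-unpooling are assumptions made for brevity; the patent does not specify the concrete operators:

```python
import numpy as np

def pool(x, assign):
    """Reduce node count: `assign` maps each coarse node to a list of
    fine-node indices; coarse features are the mean over that group."""
    return np.stack([x[idx].mean(axis=0) for idx in assign])

def unpool(x, assign, n_fine):
    """Increase node count: copy each coarse node's features back to the
    fine nodes it was pooled from (the inverse of `pool`)."""
    out = np.zeros((n_fine, x.shape[1]))
    for coarse, idx in enumerate(assign):
        out[idx] = x[coarse]
    return out

# Toy front/rear stages over 4 nodes -> 2 -> 1 -> 2 -> 4, concatenating
# each rear-stage graph with the front-stage graph of the same node count.
assign_a = [[0, 1], [2, 3]]   # 4 fine nodes -> 2 coarse nodes
assign_b = [[0, 1]]           # 2 -> 1
x0 = np.random.randn(4, 3)
x1 = pool(x0, assign_a)                  # front stage: 2 nodes
x2 = pool(x1, assign_b)                  # bottleneck: 1 node
y1 = unpool(x2, assign_b, 2)             # rear stage: back to 2 nodes
y1 = np.concatenate([y1, x1], axis=1)    # concatenate same-size graphs
y0 = unpool(y1, assign_a, 4)             # back to 4 nodes
y0 = np.concatenate([y0, x0], axis=1)
print(y0.shape)                          # (4, 9): input node count restored
```

In a real network a convolution would follow each concatenation; the sketch only tracks how the node count shrinks to one and is then restored.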
  • In machine learning of the GCN, the graph structure of the two-dimensional coordinates of each joint point is input to the GCN, and the parameters are trained so that the difference between the output graph structure and the graph structure of the three-dimensional coordinates of each joint point, which serves as correct data, becomes small.
  • "HOPE-Net: A Graph-based Model for Hand-Object Pose Estimation", [online], IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Georgia University, March 31, 2020, [searched February 12, 2021], Internet <URL: https://arxiv.org/abs/2004.00060>
  • An example of an object of the present invention is to provide a joint point detection device, a learning model generation device, a joint point detection method, a learning model generation method, and a computer-readable recording medium that improve the detection accuracy when detecting the three-dimensional coordinates of joint points of a target from an image.
  • In order to achieve the above object, a joint point detection device in one aspect of the present invention includes: a graph structure acquisition unit that acquires a first graph structure in which the two-dimensional feature amount of each of a plurality of target joint points is represented by a node; and a graph structure output unit that receives the first graph structure as input and, using a graph convolution network, outputs a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points. The graph convolution network includes a plurality of intermediate layers and an output layer, and is constructed by machine learning the relationship between the first graph structure and the second graph structure. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • In addition, in order to achieve the above object, a learning model generation device in one aspect of the present invention includes: a training data acquisition unit that acquires, as training data, a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points and, as correct data, a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points;
  • and a learning model generation unit that inputs the first graph structure to a graph convolution network, calculates the difference between the graph structure output from the graph convolution network and the correct data, and generates a machine learning model constructed by the graph convolution network by machine learning the parameters in the graph convolution network so that the calculated difference becomes small.
  • The graph convolution network includes a plurality of intermediate layers and an output layer. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • In addition, in order to achieve the above object, a joint point detection method in one aspect of the present invention includes: a graph structure acquisition step of acquiring a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points; and a graph structure output step of receiving the first graph structure as input and, using a graph convolution network, outputting a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points. The graph convolution network includes a plurality of intermediate layers and an output layer, and is constructed by machine learning the relationship between the first graph structure and the second graph structure. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • In addition, in order to achieve the above object, a learning model generation method in one aspect of the present invention includes: a training data acquisition step of acquiring, as training data, a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points and, as correct data, a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points; and a learning model generation step of inputting the first graph structure to a graph convolution network, calculating the difference between the graph structure output from the graph convolution network and the correct data, and machine learning the parameters in the graph convolution network so that the calculated difference becomes small.
  • The graph convolution network includes a plurality of intermediate layers and an output layer. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • In addition, in order to achieve the above object, a first computer-readable recording medium in one aspect of the present invention records a program that causes a computer to execute: a graph structure acquisition step of acquiring a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points; and a graph structure output step of receiving the first graph structure as input and, using a graph convolution network, outputting a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points.
  • The graph convolution network includes a plurality of intermediate layers and an output layer, and is constructed by machine learning the relationship between the first graph structure and the second graph structure. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • In addition, in order to achieve the above object, a second computer-readable recording medium in one aspect of the present invention records a program that causes a computer to execute: a training data acquisition step of acquiring, as training data, a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points and, as correct data, a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points; and a learning model generation step of inputting the first graph structure to a graph convolution network, calculating the difference between the graph structure output from the graph convolution network and the correct data, and machine learning the parameters in the graph convolution network so that the calculated difference becomes small.
  • The graph convolution network includes a plurality of intermediate layers and an output layer. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • Furthermore, in order to achieve the above object, a graph convolution network in one aspect of the present invention includes a plurality of intermediate layers and an output layer, and is constructed by machine learning the relationship between a first graph structure indicating the two-dimensional feature amount of each of a plurality of target joint points and a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points.
  • All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure.
  • In each of the plurality of intermediate layers, each feature extractor uses as input the graph structure output by each feature extractor in the layer above, and the output layer uses as input the graph structure output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • FIG. 1 is a configuration diagram showing a schematic configuration of a learning model generation device according to Embodiment 1.
  • FIG. 2 is a configuration diagram specifically showing the configuration of the learning model generation device according to the first embodiment.
  • FIG. 3 is a configuration diagram showing the configuration of the graph convolutional network according to Embodiment 1.
  • FIG. 4 is an explanatory diagram for explaining processing in the graph convolution network shown in FIG. 3.
  • FIG. 5 is a flowchart showing the operation of the learning model generation device according to Embodiment 1.
  • FIG. 6 is a configuration diagram showing a schematic configuration of a joint point detection device according to Embodiment 2.
  • FIG. 7 is a diagram more specifically showing the configuration of the joint point detection device according to the second embodiment.
  • FIG. 8 is a flowchart showing the operation of the joint point detection device according to the second embodiment.
  • FIG. 9 is a block diagram showing an example of a computer that realizes the learning model generation device according to Embodiment 1 and the joint point detection device according to Embodiment 2.
  • Embodiment 1. A learning model generation device, a learning model generation method, a learning model generation program, and a graph convolution network in Embodiment 1 will be described below with reference to FIGS. 1 to 5.
  • FIG. 1 is a configuration diagram showing a schematic configuration of the learning model generation device according to Embodiment 1.
  • The learning model generation device 10 is a device that generates a machine learning model for detecting joint points of a target. As shown in FIG. 1, the learning model generation device 10 includes a training data acquisition unit 11 and a learning model generation unit 12.
  • The training data acquisition unit 11 acquires, as training data, a first graph structure indicating the two-dimensional feature amount of each of the target joint points and, as correct data, a second graph structure indicating the three-dimensional feature amount of each of the plurality of joint points.
  • The learning model generation unit 12 inputs the first graph structure to a graph convolution network (GCN) and calculates the difference between the graph structure output from the graph convolution network and the correct data. Then, the learning model generation unit 12 generates a machine learning model constructed by the graph convolution network by machine learning the parameters in the graph convolution network so that the calculated difference becomes small.
  • The graph convolution network includes a plurality of intermediate layers and an output layer. All or some of the plurality of intermediate layers include feature extractors that perform feature extraction without changing the number of nodes of the graph structure and feature extractors that perform feature extraction while reducing the number of nodes of the graph structure.
  • In each intermediate layer, each feature extractor uses as input the graph structure output by each feature extractor in the layer above. The output layer uses as input the graph structure output by each feature extractor in the lowest intermediate layer and outputs a graph structure.
  • As described above, the graph convolution network includes, in its intermediate layers, both a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure. Therefore, according to the graph convolution network, it is possible to avoid both a situation in which the feature amount cannot be sufficiently extracted because the number of convolutions is small and a situation in which the feature amount cannot be accurately extracted because the number of dimensions is insufficient. As a result, according to the first embodiment, it is possible to improve the detection accuracy when detecting the three-dimensional coordinates of the joint points from an image.
  • FIG. 2 is a configuration diagram specifically showing the configuration of the learning model generation device according to the first embodiment.
  • As shown in FIG. 2, the learning model generation device 10 includes a storage unit 13 in addition to the training data acquisition unit 11 and the learning model generation unit 12 described above.
  • The storage unit 13 stores a graph convolution network 30 (hereinafter referred to as the "GCN 30").
  • In the first embodiment, the target is a human hand. However, the target is not limited to the human hand, and may be the entire human body or another part of it. The target may be anything that has joint points, and may be something other than a person, such as a robot.
  • Since the target is the human hand, the nodes constituting the graph structure are represented by the two-dimensional or three-dimensional feature amounts of the joint points of the hand. A specific example of the feature amount is a coordinate value.
  • In FIG. 2, reference numeral 21 denotes the graph structure obtained from the image data 20. Its nodes 22 represent, as feature amounts, the two-dimensional coordinate values of the joint points of the hand; the graph structure 21 is the first graph structure. Reference numeral 23 denotes the second graph structure, which is the graph structure serving as correct data. Its nodes 24 represent, as feature amounts, the three-dimensional coordinate values of the joint points of the hand.
  • The nodes of the graph structure may also represent two-dimensional or three-dimensional feature amounts of parts other than the joint points, for example, characteristic parts such as fingertips. The first graph structure serving as training data can be obtained by inputting target image data into a machine learning model that has machine-learned the relationship between image data of joint points and the graph structure.
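As an illustration of such a first graph structure, a hand can be encoded as an adjacency matrix plus one 2-D coordinate per node. The 21-keypoint layout below (wrist plus four joints per finger) is a common convention assumed for this sketch; the patent itself does not fix the number of joint points:

```python
import numpy as np

# Assumed 21-keypoint hand skeleton: node 0 is the wrist, and each finger
# is a chain of 4 nodes starting at indices 1, 5, 9, 13, 17.
N_JOINTS = 21
edges = [(0, b) for b in (1, 5, 9, 13, 17)]                    # wrist to finger bases
edges += [(i, i + 1) for b in (1, 5, 9, 13, 17) for i in range(b, b + 3)]

adj = np.zeros((N_JOINTS, N_JOINTS))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0        # undirected bone connections

# First graph structure: one 2-D coordinate (feature amount) per node.
coords_2d = np.random.rand(N_JOINTS, 2)
print(adj.sum())                       # 40.0: 20 bones, counted both ways
```

The correct-data second graph structure would use the same adjacency with 3-D coordinates per node instead.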
  • In the first embodiment, the training data acquisition unit 11 acquires, as training data, the first graph structure 21 and, as correct data, the second graph structure 23. The training data acquisition unit 11 then inputs the acquired training data to the learning model generation unit 12.
  • In the first embodiment, the learning model generation unit 12 first acquires the GCN 30 from the storage unit 13. Next, the learning model generation unit 12 inputs the first graph structure 21 constituting the training data to the GCN 30, and calculates the difference between the second graph structure output from the GCN 30 and the second graph structure 23 serving as correct data. Then, the learning model generation unit 12 updates the parameters of the GCN 30 so that the calculated difference becomes small, and stores the GCN 30 with the updated parameters in the storage unit 13. As a result, a GCN for detecting the three-dimensional coordinates of the target joint points is generated.
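The parameter-update loop performed by the learning model generation unit 12 can be sketched as follows, with a single linear graph-convolution layer standing in for the GCN 30. This is a deliberately minimal assumption for illustration, not the patent's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the GCN: one linear graph-convolution layer mapping
# each node's 2-D feature amount to a 3-D feature amount.
n_nodes = 5
adj = np.eye(n_nodes)                  # trivial adjacency keeps the math simple
W = rng.normal(size=(2, 3)) * 0.1      # the network's learnable parameters

# One training pair: first graph structure (2-D) and correct data (3-D).
x = rng.normal(size=(n_nodes, 2))
y = x @ rng.normal(size=(2, 3))        # synthetic correct second graph structure

lr = 0.1
for _ in range(2000):
    pred = adj @ x @ W                 # graph structure output from the network
    diff = pred - y                    # difference from the correct data
    grad = x.T @ (adj.T @ diff) / n_nodes
    W -= lr * grad                     # update parameters so the difference shrinks
```

After training, the output graph structure is close to the correct data, which mirrors the "update parameters so the calculated difference becomes small" step in the text.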
  • FIG. 3 is a configuration diagram showing the configuration of the graph convolutional network according to Embodiment 1.
  • FIG. 4 is an explanatory diagram for explaining processing in the graph convolution network shown in FIG.
  • As shown in FIG. 3, the GCN 30 includes an input layer 31, a plurality of intermediate layers, and an output layer 33. The input layer 31 accepts the input of the first graph structure 21. The plurality of intermediate layers, consisting of a first intermediate layer 32a and a second intermediate layer 32b, perform feature extraction on the first graph structure. The output layer 33 uses as input the graph structure output by each feature extractor of the lowest intermediate layer, and outputs the second graph structure.
  • The first intermediate layer 32a includes only first feature extractors (the unhatched circles in FIG. 3) that perform feature extraction without changing the number of nodes of the graph structure. A first feature extractor performs convolution.
  • In the first intermediate layer 32a, each feature extractor uses as input the graph structure output by the upper-layer feature extractor that performs feature extraction with the same number of nodes. When a plurality of graph structures are input, the feature extractor of the first intermediate layer 32a concatenates the graph structures output by the respective feature extractors and executes convolution.
  • The second intermediate layer 32b includes a second feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, a third feature extractor (the hatched circles in FIG. 3) that performs feature extraction while increasing the number of nodes of the graph structure, or both. A second feature extractor performs pooling, and a third feature extractor performs unpooling.
  • The second intermediate layer 32b may also include a first feature extractor. In each second intermediate layer 32b, each feature extractor uses as input the plurality of graph structures output by the feature extractors in the layer above. The input layer 31 also includes a first feature extractor, and the output layer 33 includes a first feature extractor and a third feature extractor.
  • As shown in FIG. 4, graph structures with different numbers of nodes are generated by the feature extractors, and graph structures with different numbers of nodes are exchanged between the feature extractors. In FIG. 4, the numbers attached to the graph structures indicate the numbers of nodes.
  • FIG. 5 is a flowchart showing the operation of the learning model generation device according to Embodiment 1.
  • FIGS. 1 to 4 will be referred to as needed in the following description. In Embodiment 1, the learning model generation method is implemented by operating the learning model generation device 10. Therefore, the following description of the operation of the learning model generation device 10 substitutes for a description of the learning model generation method in Embodiment 1.
  • As shown in FIG. 5, first, the training data acquisition unit 11 acquires, as training data, a first graph structure indicating the two-dimensional feature amount of each of the target joint points and, as correct data, a second graph structure indicating the three-dimensional feature amount of each of the joint points (step A1).
  • Next, the learning model generation unit 12 inputs the first graph structure acquired as training data in step A1 to the GCN 30 and calculates the difference between the graph structure output from the GCN and the correct data. Then, the learning model generation unit 12 updates the parameters of the GCN so that the calculated difference becomes small (step A2).
  • Next, the learning model generation unit 12 stores the GCN whose parameters were updated in step A2 in the storage unit 13 (step A3). A GCN that can detect the three-dimensional coordinates of joint points is thereby generated.
  • As described above, according to Embodiment 1, a GCN that can sufficiently and accurately extract feature amounts is constructed, so the detection accuracy when detecting the three-dimensional coordinates of the target joint points from an image is improved.
  • The program for generating a learning model in Embodiment 1 may be any program that causes a computer to execute steps A1 to A3 shown in FIG. 5. By installing this program in a computer and executing it, the learning model generation device and the learning model generation method in Embodiment 1 can be realized.
  • In this case, the processor of the computer functions as the training data acquisition unit 11 and the learning model generation unit 12 and performs the processing.
  • The storage unit 13 may be realized by storing the data files constituting it in a storage device, such as a hard disk, provided in the computer, or by storing the data files in a storage device of another computer. Besides a general-purpose PC, the computer may be a smartphone or a tablet terminal device.
  • The learning model generation program in Embodiment 1 may also be executed by a computer system constructed from a plurality of computers. In this case, each computer may function as either the training data acquisition unit 11 or the learning model generation unit 12.
  • Embodiment 2. Next, a joint point detection device, a joint point detection method, and a joint point detection program in Embodiment 2 will be described with reference to FIGS. 6 and 7.
  • FIG. 6 is a configuration diagram showing a schematic configuration of a joint point detection device according to Embodiment 2.
  • The joint point detection device 40 according to Embodiment 2 shown in FIG. 6 is a device for detecting joint points of a target, for example, a living body or a robot. As shown in FIG. 6, the joint point detection device 40 includes a graph structure acquisition unit 41 and a graph structure output unit 42.
  • The graph structure acquisition unit 41 acquires a first graph structure in which the two-dimensional feature amount of each of a plurality of target joint points is represented by a node. The graph structure output unit 42 receives the first graph structure as input and, using a graph convolution network, outputs a second graph structure indicating the three-dimensional feature amount of each of the joint points.
  • The graph convolution network includes a plurality of intermediate layers and an output layer, and is constructed by machine learning the relationship between the first graph structure and the second graph structure. All or some of the plurality of intermediate layers include feature extractors that perform feature extraction without changing the number of nodes of the graph structure and feature extractors that perform feature extraction while reducing the number of nodes of the graph structure.
  • In each intermediate layer, each feature extractor uses as input the graph structure output by each feature extractor in the layer above. The output layer uses as input the graph structure output by each feature extractor in the lowest intermediate layer and outputs a graph structure.
  • As described above, in the second embodiment, a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure are used to output the second graph structure. Therefore, according to the second embodiment, it is possible to avoid both a situation in which the feature amount cannot be sufficiently extracted because the number of convolutions is small and a situation in which the feature amount cannot be accurately extracted because the number of dimensions is insufficient. As a result, according to the second embodiment, it is possible to improve the detection accuracy when detecting the three-dimensional coordinates of the joint points from an image.
  • FIG. 7 is a diagram more specifically showing the configuration of the joint point detection device according to the second embodiment.
  • As shown in FIG. 7, the joint point detection device 40 includes a storage unit 43 in addition to the graph structure acquisition unit 41 and the graph structure output unit 42 described above. The storage unit 43 stores the GCN 30 shown in FIG. 2 in the first embodiment.
  • In the second embodiment, the target is a human hand. However, the target of joint point detection is not limited to the human hand, and may be the entire human body or another part of it. The target of joint point detection may be any object that has joint points, and may be an object other than a person, such as a robot.
  • Examples of the feature amounts indicated by the nodes of the graph structure include two-dimensional or three-dimensional coordinate values. The nodes of the graph structure may also represent two-dimensional or three-dimensional feature amounts of parts other than the joint points, for example, characteristic parts such as fingertips.
  • In the second embodiment, the graph structure acquisition unit 41 acquires a first graph structure 50 obtained from image data of a human hand, as shown in FIG. 7. The first graph structure can be obtained by inputting the image data into a machine learning model that has machine-learned the relationship between image data of joint points and the graph structure.
  • In the second embodiment, the graph structure output unit 42 acquires the GCN 30 from the storage unit 43. Then, the graph structure output unit 42 inputs the first graph structure 50 to the GCN 30 and causes the GCN to output a second graph structure 51 indicating the three-dimensional feature amount of each of the plurality of joint points.
  • The GCN 30 includes the input layer 31, the plurality of intermediate layers 32a and 32b, and the output layer 33, as described in the first embodiment, and is constructed by machine learning the relationship between the first graph structure and the second graph structure. Therefore, in the output second graph structure 51, the three-dimensional feature amount (coordinate value) indicated by each node is a highly accurate value.
  • FIG. 8 is a flowchart showing the operation of the joint point detection device according to the second embodiment. FIGS. 6 and 7 will be referred to as needed in the following description. In the second embodiment, the joint point detection method is implemented by operating the joint point detection device 40. Therefore, the following description of the operation of the joint point detection device 40 substitutes for a description of the joint point detection method in the second embodiment.
  • the graph structure acquisition unit 41 first acquires a first graph structure 50 obtained from image data of a human hand (step B1).
  • the graph structure output unit 42 then inputs the first graph structure to the GCN 30 and causes the GCN 30 to output a second graph structure indicating three-dimensional feature amounts of each of the plurality of joint points (step B2).
  • each node represents the three-dimensional coordinates of each joint point of the target human hand, so each joint point of the human hand is detected in step B2.
  • the three-dimensional feature values (coordinate values) indicated by each node are highly accurate values in the output second graph structure.
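Steps B1 and B2 above can be pictured with the following sketch; the class, its method names, and the stand-in lifting function are hypothetical and only mimic the roles of the graph structure acquisition unit 41 and the graph structure output unit 42:

```python
import numpy as np

class JointPointDetector:
    """Mirrors the two steps of the device: graph structure acquisition (B1)
    and graph structure output (B2). The lifting function passed in is a
    stand-in for the trained GCN 30."""

    def __init__(self, lift_fn):
        self.lift_fn = lift_fn  # trained GCN: (N, 2) -> (N, 3)

    def acquire_first_graph(self, joints_2d):
        # Step B1: the first graph structure from 2D joint estimates.
        return np.asarray(joints_2d, dtype=float)

    def detect(self, joints_2d):
        # Step B2: feed the first graph to the GCN, get the second graph.
        first_graph = self.acquire_first_graph(joints_2d)
        return self.lift_fn(first_graph)

# Placeholder "GCN" for the sketch: append a zero depth channel to each node.
dummy_gcn = lambda g: np.concatenate([g, np.zeros((g.shape[0], 1))], axis=1)

detector = JointPointDetector(dummy_gcn)
second_graph = detector.detect(np.random.rand(21, 2))
print(second_graph.shape)  # (21, 3)
```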
  • the joint point detection program in the second embodiment may be any program that causes a computer to execute steps B1 and B2 shown in FIG. 8. By installing this program in a computer and executing it, the joint point detection device and the joint point detection method in the second embodiment can be realized.
  • the processor of the computer functions as a graph structure acquisition unit 41 and a graph structure output unit 42 to perform processing.
  • the storage unit 43 may be realized by storing the data files constituting it in a storage device, such as a hard disk, provided in the computer, or may be realized by a storage device of another computer. Moreover, examples of the computer include a general-purpose PC, as well as a smartphone and a tablet-type terminal device.
  • the joint point detection program in Embodiment 2 may be executed by a computer system constructed by a plurality of computers.
  • each computer may function as either the graph structure acquisition unit 41 or the graph structure output unit 42, respectively.
  • FIG. 9 is a block diagram showing an example of a computer that realizes the learning model generation device according to Embodiment 1 and the joint point detection device according to Embodiment 2.
  • a computer 110 includes a CPU (Central Processing Unit) 111, a main memory 112, a storage device 113, an input interface 114, a display controller 115, a data reader/writer 116, and a communication interface 117. These units are connected to each other via a bus 121 so as to be able to communicate with each other.
  • the computer 110 may include a GPU (Graphics Processing Unit) or an FPGA (Field-Programmable Gate Array) in addition to the CPU 111 or instead of the CPU 111.
  • a GPU or FPGA can execute the programs in the embodiments.
  • the CPU 111 expands the program in the embodiment, which is composed of a code group stored in the storage device 113, into the main memory 112 and executes various operations by executing each code in a predetermined order.
  • the main memory 112 is typically a volatile storage device such as DRAM (Dynamic Random Access Memory).
  • the program in the embodiment is provided stored in a computer-readable recording medium 120. It should be noted that the program in the embodiment may be distributed over the Internet, connected via the communication interface 117.
  • Input interface 114 mediates data transmission between CPU 111 and input devices 118 such as a keyboard and mouse.
  • the display controller 115 is connected to the display device 119 and controls display on the display device 119 .
  • the data reader/writer 116 mediates data transmission between the CPU 111 and the recording medium 120, reads programs from the recording medium 120, and writes processing results in the computer 110 to the recording medium 120.
  • Communication interface 117 mediates data transmission between CPU 111 and other computers.
  • examples of the recording medium 120 include general-purpose semiconductor storage devices such as CF (Compact Flash (registered trademark)) and SD (Secure Digital), magnetic recording media such as flexible disks, and optical recording media such as CD-ROM (Compact Disk Read Only Memory).
  • the learning model generation device 10 and the joint point detection device 40 can also be realized by using hardware corresponding to each part instead of a computer in which a program is installed. Further, the learning model generation device 10 and the joint point detection device 40 may be partly implemented by a program and the rest by hardware.
  • (Appendix 1) A joint point detection device comprising: a graph structure acquisition unit that acquires a first graph structure in which a two-dimensional feature amount of each of a plurality of joint points of a target is represented by a node; and a graph structure output unit that receives the first graph structure as input and, using a graph convolutional network, outputs a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points, wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
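An intermediate layer of the claimed form can be sketched as two parallel feature extractors, one preserving and one reducing the node count, each consuming the graph structure handed down from the layer above. The averaging pooling matrix and all dimensions below are assumptions for illustration, not the patented extractors:

```python
import numpy as np

rng = np.random.default_rng(1)

def keep_extractor(h, w):
    """Feature extraction without changing the number of nodes."""
    return np.maximum(h @ w, 0.0)

def reduce_extractor(h, pool, w):
    """Feature extraction while reducing the number of nodes: a pooling
    matrix merges groups of nodes before channel mixing."""
    return np.maximum(pool @ h @ w, 0.0)

n, c = 21, 8
h_in = rng.random((n, c))  # graph structure output by the upper layer

# Hypothetical pooling: average every group of 3 nodes -> 7 nodes.
pool = np.zeros((7, n))
for g in range(7):
    pool[g, 3 * g:3 * g + 3] = 1.0 / 3.0

w_keep = rng.random((c, c)) - 0.5
w_red = rng.random((c, c)) - 0.5

# One intermediate layer emits both graph structures; the next layer's
# extractors would take these as their inputs.
out_full = keep_extractor(h_in, w_keep)          # (21, 8), node count kept
out_small = reduce_extractor(h_in, pool, w_red)  # (7, 8), node count reduced

print(out_full.shape, out_small.shape)
```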
  • (Appendix 2) The joint point detection device according to Appendix 1, wherein, in the graph convolutional network, the plurality of intermediate layers comprises first intermediate layers and a second intermediate layer,
  • each first intermediate layer comprises only feature extractors that perform feature extraction without changing the number of nodes of the graph structure, and in each first intermediate layer, each feature extractor uses as input the graph structure output by the upper-layer feature extractor whose output has the same number of nodes, and
  • the second intermediate layer includes a feature extractor that performs feature extraction without changing the number of nodes of the graph structure, a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and a feature extractor that performs feature extraction while increasing the number of nodes of the graph structure.
  • a learning model generation unit that generates a machine learning model constructed by the graph convolutional network by machine-learning parameters in the graph convolutional network so that the calculated difference becomes small,
  • wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • a learning model generation device characterized by:
  • the learning model generation device described above, wherein the plurality of intermediate layers comprises first intermediate layers and a second intermediate layer,
  • each first intermediate layer comprises only feature extractors that perform feature extraction without changing the number of nodes of the graph structure, and in each first intermediate layer, each feature extractor uses as input the graph structure output by the upper-layer feature extractor whose output has the same number of nodes, and
  • the second intermediate layer includes a feature extractor that performs feature extraction without changing the number of nodes of the graph structure, a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and a feature extractor that performs feature extraction while increasing the number of nodes of the graph structure.
  • wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • a machine learning model generation step of generating a machine learning model constructed by the graph convolutional network by machine-learning parameters in the graph convolutional network so that the calculated difference becomes small,
  • wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • a learning model generation method characterized by:
  • wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • wherein the graph convolutional network comprises a plurality of intermediate layers and an output layer,
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • a computer-readable recording medium characterized by:
  • (Appendix 11) A graph convolutional network comprising a plurality of intermediate layers and an output layer, constructed by machine-learning the relationship between a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target and a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points, wherein
  • all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure,
  • in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
  • the output layer uses as input the graph structures output by each feature extractor of the lowest intermediate layer and outputs a graph structure.
  • a graph convolutional network characterized by:
  • According to the present invention, it is possible to improve detection accuracy when detecting the three-dimensional coordinates of joint points from an image.
  • INDUSTRIAL APPLICABILITY The present invention is useful in fields that require posture detection of objects having joint points, such as people and robots. Specific fields include video surveillance and user interfaces.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A teaching model generation device 10 comprising: a training data acquisition unit 11 that acquires, as training data, a first graph structure indicating a two-dimensional feature value for each joint point and, as correct-answer data, a second graph structure indicating a three-dimensional feature value for each joint point; and a teaching model generation unit 12 that inputs the first graph structure into a graph convolution network, calculates the difference between the output graph structure and the correct-answer data, and machine-learns parameters of the graph convolution network so as to reduce the difference. The graph convolution network comprises intermediate layers and an output layer. The intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure. Each feature extractor uses as input the graph structures output by the upper-layer feature extractors. The output layer outputs a graph structure, using as input the graph structures output by each feature extractor of the lowest intermediate layer.
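The training procedure in the abstract (input the first graph structure, compare the network's output with the correct-answer data, and adjust parameters so the difference shrinks) can be sketched with a toy one-layer graph convolution; the loss, optimizer, learning rate, and dimensions below are illustrative assumptions, not the patented training method:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 21
a_norm = np.eye(n)                   # stand-in for a normalized adjacency
first_graph = rng.random((n, 2))     # training input (2D feature per node)
correct_data = rng.random((n, 3))    # correct-answer data (3D feature per node)

w = rng.random((2, 3)) * 0.1         # single learnable parameter matrix

def forward(g, w):
    return a_norm @ g @ w            # minimal one-layer graph convolution

losses = []
for _ in range(200):
    out = forward(first_graph, w)
    diff = out - correct_data
    losses.append((diff ** 2).mean())  # difference between output and answer
    grad = 2.0 * (a_norm @ first_graph).T @ diff / diff.size
    w -= 0.5 * grad                  # update parameters to shrink the difference

assert losses[-1] < losses[0]        # the difference has been reduced
print(round(losses[0], 4), round(losses[-1], 4))
```

A real implementation would of course use many training pairs, the full multi-layer network, and an autograd framework; the point here is only the loop of forward pass, difference calculation, and parameter update.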

Description

Joint point detection device, learning model generation device, joint point detection method, learning model generation method, and computer-readable recording medium

The present invention relates to a joint point detection device and a joint point detection method for detecting the joint points of a target from an image, and further relates to a computer-readable recording medium recording a program for realizing these. The present invention also relates to a learning model generation device and a learning model generation method for generating a learning model for detecting the joint points of a target from an image, and further relates to a computer-readable recording medium recording a program for realizing these.
In recent years, systems for estimating a person's posture from an image have been proposed. Such systems are expected to be used in fields such as image surveillance and user interfaces. For example, if an image surveillance system can estimate a person's posture, it can estimate what the person captured by the camera is doing, thereby improving surveillance accuracy. Likewise, if a user interface can estimate a person's posture, input by gestures becomes possible.

For example, Non-Patent Document 1 discloses a system for estimating a person's posture, in particular the posture of a person's hand, from an image. The system disclosed in Non-Patent Document 1 first acquires image data including an image of a hand, inputs the acquired image data into a neural network that has machine-learned an image feature amount for each joint point, and estimates the two-dimensional coordinates of each joint point.

Subsequently, the system disclosed in Non-Patent Document 1 inputs the estimated two-dimensional coordinates of each joint point into a graph convolutional network (hereinafter also referred to as a "GCN" (Graph Convolution Network)) to estimate the three-dimensional coordinates of each joint point. A GCN is a network that takes as input a graph structure composed of a plurality of nodes and performs convolution processing using adjacent nodes (see, for example, Patent Document 1). In Non-Patent Document 1, each node of the input graph structure consists of the two-dimensional coordinates of one joint point.

In the system disclosed in Non-Patent Document 1, in its former stage, the GCN executes pooling processing that reduces the number of nodes of the input graph structure multiple times, finally reducing the number of nodes to one. In its latter stage, the GCN performs unpooling processing that increases the number of nodes of the one-node graph structure as many times as the pooling processing was performed. Also in the latter stage, the GCN concatenates the graph structure being processed with the former-stage graph structure having the same number of nodes, executes convolution, and finally outputs a graph structure with the same number of nodes as the input graph structure.

Furthermore, in the system disclosed in Non-Patent Document 1, a graph structure of the two-dimensional coordinates of each joint point is input to the GCN as training data, and machine learning of the GCN is performed so that the difference between the output graph structure and the graph structure of the three-dimensional coordinates of each joint point, serving as correct-answer data, becomes small.
JP 2020-27399 A
In the system disclosed in Non-Patent Document 1, the former-stage graph structure is convolved into the latter-stage graph structure, so a great deal of spatial information is convolved; however, because the number of convolutions in the former stage is small, feature extraction there is insufficient. In addition, in the latter stage, the target is a graph structure with only one node, so the number of dimensions is insufficient and feature amounts cannot be extracted accurately. For this reason, the system disclosed in Non-Patent Document 1 has the problem that the accuracy of the three-dimensional coordinates of each joint point is low.

An example of an object of the present invention is to provide a joint point detection device, a learning model generation device, a joint point detection method, a learning model generation method, and a computer-readable recording medium that can improve detection accuracy when detecting the three-dimensional coordinates of joint points from an image.
In order to achieve the above object, a joint point detection device according to one aspect of the present invention includes:
a graph structure acquisition unit that acquires a first graph structure in which a two-dimensional feature amount of each of a plurality of joint points of a target is represented by a node; and
a graph structure output unit that receives the first graph structure as input and, using a graph convolutional network, outputs a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
In order to achieve the above object, a learning model generation device according to one aspect of the present invention includes:
a training data acquisition unit that acquires, as training data, a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points; and
a learning model generation unit that inputs the first graph structure to a graph convolutional network, calculates the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generates a machine learning model constructed by the graph convolutional network by machine-learning parameters in the graph convolutional network so that the calculated difference becomes small,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
In order to achieve the above object, a joint point detection method according to one aspect of the present invention includes:
a graph structure acquisition step of acquiring a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target; and
a graph structure output step of receiving the first graph structure as input and, using a graph convolutional network, outputting a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
In order to achieve the above object, a learning model generation method according to one aspect of the present invention includes:
a training data acquisition step of acquiring, as training data, a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points; and
a machine learning model generation step of inputting the first graph structure to a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generating a machine learning model constructed by the graph convolutional network by machine-learning parameters in the graph convolutional network so that the calculated difference becomes small,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
In order to achieve the above object, a first computer-readable recording medium according to one aspect of the present invention records a program including instructions that cause a computer to execute:
a graph structure acquisition step of acquiring a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target; and
a graph structure output step of receiving the first graph structure as input and, using a graph convolutional network, outputting a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
In order to achieve the above object, a second computer-readable recording medium according to one aspect of the present invention records a program including instructions that cause a computer to execute:
a training data acquisition step of acquiring, as training data, a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points; and
a machine learning model generation step of inputting the first graph structure to a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generating a machine learning model constructed by the graph convolutional network by machine-learning parameters in the graph convolutional network so that the calculated difference becomes small,
wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of the graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors of the upper layer, and
the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
To achieve the above object, a graph convolutional network in one aspect of the present invention includes a plurality of intermediate layers and an output layer, and is constructed by machine-learning a relationship between a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target and a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points,
wherein all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure,
in each of the plurality of intermediate layers, each feature extractor uses, as an input, a graph structure output by each feature extractor in an upper layer, and
the output layer outputs a graph structure by using, as an input, the graph structure output by each feature extractor in the lowest intermediate layer.
As described above, according to the present invention, it is possible to improve the detection accuracy when detecting the three-dimensional coordinates of joint points from an image.
FIG. 1 is a configuration diagram showing a schematic configuration of the learning model generation device according to Embodiment 1.
FIG. 2 is a configuration diagram specifically showing the configuration of the learning model generation device according to Embodiment 1.
FIG. 3 is a configuration diagram showing the configuration of the graph convolutional network according to Embodiment 1.
FIG. 4 is an explanatory diagram for explaining processing in the graph convolutional network shown in FIG. 3.
FIG. 5 is a flow diagram showing the operation of the learning model generation device according to Embodiment 1.
FIG. 6 is a configuration diagram showing a schematic configuration of the joint point detection device according to Embodiment 2.
FIG. 7 is a diagram more specifically showing the configuration of the joint point detection device according to Embodiment 2.
FIG. 8 is a flow diagram showing the operation of the joint point detection device according to Embodiment 2.
FIG. 9 is a block diagram showing an example of a computer that realizes the learning model generation device according to Embodiment 1 and the joint point detection device according to Embodiment 2.
(Embodiment 1)
In Embodiment 1, a learning model generation device, a learning model generation method, and a program for learning model generation, as well as a graph convolutional network, will be described below with reference to FIGS. 1 to 5.
[Device configuration]
First, a schematic configuration of the learning model generation device according to Embodiment 1 will be described with reference to FIG. 1. FIG. 1 is a configuration diagram showing a schematic configuration of the learning model generation device according to Embodiment 1.
The learning model generation device 10 according to Embodiment 1 shown in FIG. 1 is a device that generates a machine learning model for detecting the joint points of a target. As shown in FIG. 1, the learning model generation device 10 includes a training data acquisition unit 11 and a learning model generation unit 12.
The training data acquisition unit 11 acquires, as training data, a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target, and a second graph structure indicating, as correct answer data, a three-dimensional feature amount of each of the plurality of joint points.
The learning model generation unit 12 inputs the first graph structure to a graph convolutional network (GCN) and calculates the difference between the graph structure output from the graph convolutional network and the correct answer data. Then, the learning model generation unit 12 generates a machine learning model constructed by the graph convolutional network by machine-learning the parameters in the graph convolutional network so that the calculated difference becomes small.
The graph convolutional network includes a plurality of intermediate layers and an output layer. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure. In each of the plurality of intermediate layers, each feature extractor uses, as an input, the graph structure output by each feature extractor in the upper layer. The output layer outputs a graph structure by using, as an input, the graph structure output by each feature extractor in the lowest intermediate layer.
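For illustration only, the two kinds of feature extractors described above can be sketched as follows. This is a minimal NumPy sketch, not the claimed implementation; the mean-aggregation convolution and the cluster-averaging pooling are assumptions introduced for the example.

```python
import numpy as np

def graph_conv(X, A, W):
    """Feature extraction without changing the number of nodes: each node
    aggregates its neighbors (and itself) and is linearly transformed.
    X: (N, F) node features, A: (N, N) adjacency, W: (F, F') parameters."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)    # node degrees
    return (A_hat / deg) @ X @ W              # mean aggregation + projection

def graph_pool(X, clusters):
    """Feature extraction with a reduced number of nodes: each group of
    nodes is merged (here by averaging) into one output node.
    clusters: list of index lists, one per output node."""
    return np.stack([X[idx].mean(axis=0) for idx in clusters])

# Tiny example: 4 nodes in a chain, 2-dimensional features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)
W = np.eye(2)
H = graph_conv(X, A, W)               # still 4 nodes
P = graph_pool(H, [[0, 1], [2, 3]])   # reduced to 2 nodes
print(H.shape, P.shape)               # (4, 2) (2, 2)
```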
As described above, in Embodiment 1, the graph convolutional network includes, in its intermediate layers, a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure. Therefore, the graph convolutional network can avoid both a situation in which feature amounts cannot be sufficiently extracted because the number of convolutions is small, and a situation in which feature amounts cannot be accurately extracted because the number of dimensions is insufficient. As a result, according to Embodiment 1, the detection accuracy when detecting the three-dimensional coordinates of joint points from an image is improved.
Next, the configuration of the learning model generation device 10 according to Embodiment 1 will be described more specifically with reference to FIG. 2. FIG. 2 is a configuration diagram specifically showing the configuration of the learning model generation device according to Embodiment 1.
As shown in FIG. 2, in Embodiment 1, the learning model generation device 10 includes a storage unit 13 in addition to the training data acquisition unit 11 and the learning model generation unit 12 described above. The storage unit 13 stores a graph convolutional network 30 (hereinafter referred to as "GCN 30").
In the following description, the case where the target is a human hand is taken as an example. Note that, in Embodiment 1, the target is not limited to a human hand, and may be the entire human body or another body part. The target may be anything that has joint points, and may be something other than a person, for example, a robot.
In Embodiment 1, since the target is a human hand, the nodes constituting the graph structures are represented by two-dimensional or three-dimensional feature amounts of the respective joint points of the hand. A specific example of such a feature amount is a coordinate value.
In FIG. 2, reference numeral 21 denotes a graph structure obtained from image data 20. In the graph structure 21, each node 22 represents, as a feature amount, the two-dimensional coordinate values of a joint point of the hand. The graph structure 21 is the first graph structure. Reference numeral 23 denotes the graph structure serving as the correct answer data, that is, the second graph structure. In the second graph structure 23, each node 24 represents, as a feature amount, the three-dimensional coordinate values of a joint point of the hand.
Note that, in addition to the joint points, the nodes of a graph structure may represent two-dimensional or three-dimensional feature amounts of parts other than the joint points, for example, characteristic parts such as fingertips. In Embodiment 1, the first graph structure serving as training data can be obtained by inputting the image data of the target into a machine learning model that has machine-learned the relationship between image data of joint points and graph structures.
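For illustration, a first graph structure of a hand can be held as a node feature matrix of two-dimensional coordinates together with an adjacency matrix over the joint points. The 21-joint layout and the edge list below are assumptions made for this sketch; the embodiment does not fix the number or connectivity of the joint points.

```python
import numpy as np

# Assumed 21-joint hand skeleton: wrist (0) plus four joints per finger.
NUM_JOINTS = 21
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4),         # thumb
         (0, 5), (5, 6), (6, 7), (7, 8),         # index finger
         (0, 9), (9, 10), (10, 11), (11, 12),    # middle finger
         (0, 13), (13, 14), (14, 15), (15, 16),  # ring finger
         (0, 17), (17, 18), (18, 19), (19, 20)]  # little finger

def hand_adjacency():
    """Undirected adjacency matrix of the assumed hand skeleton."""
    A = np.zeros((NUM_JOINTS, NUM_JOINTS))
    for i, j in EDGES:
        A[i, j] = A[j, i] = 1.0
    return A

# First graph structure: one 2D coordinate (feature amount) per node.
X2d = np.random.rand(NUM_JOINTS, 2)
A = hand_adjacency()
print(A.shape, X2d.shape)   # (21, 21) (21, 2)
```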
In Embodiment 1, the training data acquisition unit 11 acquires, as the training data, the first graph structure 21 and the second graph structure 23 serving as the correct answer data. The training data acquisition unit 11 then inputs the acquired training data to the learning model generation unit 12.
In Embodiment 1, the learning model generation unit 12 first acquires the GCN 30 from the storage unit 13. Next, the learning model generation unit 12 inputs the first graph structure 21 constituting the training data to the GCN 30, and calculates the difference between the second graph structure output from the GCN 30 and the second graph structure 23 serving as the correct answer data. Then, the learning model generation unit 12 updates the parameters of the GCN 30 so that the calculated difference is minimized, and stores the GCN 30 with the updated parameters in the storage unit 13. As a result, a GCN for detecting the three-dimensional coordinates of the joint points of the target is generated.
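The update performed by the learning model generation unit 12 can be sketched, for illustration only, with a single-layer linear stand-in for the multi-layer GCN 30. The use of mean-squared error as the "difference" and gradient descent as the parameter update are assumptions of this sketch, not limitations of the embodiment.

```python
import numpy as np

def training_step(X2d, Y3d, A, W, lr=0.05):
    """One machine-learning update: run the stand-in single-layer network,
    calculate the difference from the correct answer data Y3d, and adjust
    the parameter W so that the difference becomes smaller."""
    A_hat = A + np.eye(A.shape[0])                  # self-loops
    P = A_hat / A_hat.sum(axis=1, keepdims=True)    # mean aggregation
    pred = P @ X2d @ W                              # output graph structure
    diff = pred - Y3d                               # difference from correct data
    loss = (diff ** 2).mean()
    grad = 2.0 * (P @ X2d).T @ diff / diff.size     # gradient of the loss w.r.t. W
    return W - lr * grad, loss

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-joint chain
X2d = rng.random((3, 2))      # first graph structure (2D feature amounts)
Y3d = rng.random((3, 3))      # correct answer data (3D feature amounts)
W = rng.random((2, 3))
losses = []
for _ in range(300):
    W, loss = training_step(X2d, Y3d, A, W)
    losses.append(loss)
print(losses[0], losses[-1])  # the difference shrinks over the updates
```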
Next, the configuration and functions of the graph convolutional network 30 according to Embodiment 1 will be specifically described with reference to FIGS. 3 and 4. FIG. 3 is a configuration diagram showing the configuration of the graph convolutional network according to Embodiment 1. FIG. 4 is an explanatory diagram for explaining processing in the graph convolutional network shown in FIG. 3.
As shown in FIG. 3, the GCN 30 includes an input layer 31, a plurality of intermediate layers, and an output layer 33. The input layer 31 accepts the input of the first graph structure 21. The plurality of intermediate layers consist of first intermediate layers 32a and second intermediate layers 32b, and perform feature extraction on the first graph structure. The output layer 33 outputs the second graph structure by using, as an input, the graph structure output by each feature extractor in the lowest intermediate layer.
As shown in FIG. 3, each first intermediate layer 32a includes only first feature extractors ("○" in FIG. 3), which perform feature extraction without changing the number of nodes of a graph structure. The first feature extractors perform convolution. In each first intermediate layer 32a, each feature extractor uses, as an input, the graph structure output by a feature extractor in the upper layer that performs feature extraction with the same number of nodes. When there are a plurality of input-source feature extractors, a feature extractor of the first intermediate layer 32a concatenates the graph structures output by those feature extractors and then performs convolution.
Each second intermediate layer 32b includes one or both of a second feature extractor ("●" in FIG. 3), which performs feature extraction while reducing the number of nodes of a graph structure, and a third feature extractor (hatched "○" in FIG. 3), which performs feature extraction while increasing the number of nodes of a graph structure. The second feature extractors perform pooling, and the third feature extractors perform unpooling. Each second intermediate layer 32b also includes first feature extractors ("○" in FIG. 3). In each second intermediate layer 32b, each feature extractor uses, as inputs, the plurality of graph structures output by the feature extractors in the upper layer.
The input layer 31 includes a first feature extractor ("○" in FIG. 3). The output layer 33 includes a first feature extractor ("○" in FIG. 3) and a third feature extractor (hatched "○" in FIG. 3).
As shown in FIG. 4, when the first graph structure is input to the GCN 30 configured as described above, convolution, pooling, and unpooling are performed in the intermediate layers. Furthermore, in the intermediate layers, information is exchanged between feature extractors that operate on graph structures with a larger number of nodes and feature extractors that operate on graph structures with a smaller number of nodes.
That is, as shown in FIG. 4, in the intermediate layers, the feature extractors generate graph structures with different numbers of nodes, and graph structures with different numbers of nodes are exchanged between the feature extractors. In FIG. 4, the number attached to each graph structure indicates its number of nodes. As a result, according to the GCN 30, both a situation in which feature amounts cannot be sufficiently extracted because the number of convolutions is small and a situation in which feature amounts cannot be accurately extracted because the number of dimensions is insufficient are suppressed.
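The exchange of graph structures between node-count resolutions can be sketched, for illustration only, as follows. The concatenation of inputs follows the description of the first intermediate layer above; the averaging pooling, the copy-back unpooling, and the fixed cluster assignment are assumptions of this sketch.

```python
import numpy as np

def pool(X, clusters):
    """Reduce the node count by averaging each cluster into one node."""
    return np.stack([X[c].mean(axis=0) for c in clusters])

def unpool(X, clusters, n_nodes):
    """Increase the node count by copying each cluster's feature
    back to all of its member nodes."""
    out = np.zeros((n_nodes, X.shape[1]))
    for k, c in enumerate(clusters):
        out[c] = X[k]
    return out

def exchange(X_hi, X_lo, clusters):
    """One inter-resolution exchange: each stream receives the other
    stream's output, resampled to its own node count, and concatenates
    it with its own features before the next convolution."""
    n_hi = X_hi.shape[0]
    hi_in = np.concatenate([X_hi, unpool(X_lo, clusters, n_hi)], axis=1)
    lo_in = np.concatenate([X_lo, pool(X_hi, clusters)], axis=1)
    return hi_in, lo_in

clusters = [[0, 1], [2, 3]]            # 4-node graph pooled to 2 nodes
X_hi = np.arange(8.0).reshape(4, 2)    # high-resolution stream
X_lo = np.arange(4.0).reshape(2, 2)    # low-resolution stream
hi_in, lo_in = exchange(X_hi, X_lo, clusters)
print(hi_in.shape, lo_in.shape)        # (4, 4) (2, 4)
```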
[Device operation]
Next, the operation of the learning model generation device 10 according to Embodiment 1 will be described with reference to FIG. 5. FIG. 5 is a flow diagram showing the operation of the learning model generation device according to Embodiment 1. FIGS. 1 to 4 will be referred to as needed in the following description. In Embodiment 1, the learning model generation method is implemented by operating the learning model generation device 10. Therefore, the following description of the operation of the learning model generation device also serves as the description of the learning model generation method according to Embodiment 1.
First, as shown in FIG. 5, the training data acquisition unit 11 acquires, as training data, a first graph structure indicating a two-dimensional feature amount of each of a plurality of joint points of a target, and a second graph structure indicating, as correct answer data, a three-dimensional feature amount of each of the plurality of joint points (step A1).
Next, the learning model generation unit 12 inputs the first graph structure acquired as the training data in step A1 to the GCN 30, and calculates the difference between the graph structure output from the GCN and the correct answer data. Then, the learning model generation unit 12 updates the parameters of the GCN so that the calculated difference becomes small (step A2).
After that, the learning model generation unit 12 stores the GCN whose parameters have been updated in step A2 in the storage unit 13 (step A3). As a result, a GCN that can detect the three-dimensional coordinates of joint points is generated.
As described above, according to Embodiment 1, a GCN that can sufficiently and accurately extract feature amounts is constructed, so that the detection accuracy when detecting the three-dimensional coordinates of the joint points of a target from an image is improved.
[Program]
The program for learning model generation according to Embodiment 1 may be any program that causes a computer to execute steps A1 to A3 shown in FIG. 5. By installing this program in a computer and executing it, the learning model generation device and the learning model generation method according to Embodiment 1 can be realized. In this case, the processor of the computer functions as the training data acquisition unit 11 and the learning model generation unit 12 and performs the processing.
In Embodiment 1, the storage unit 13 may be realized by storing the data files constituting the GCN 30 in a storage device such as a hard disk provided in the computer, or may be realized by a storage device of another computer. Examples of the computer include, in addition to a general-purpose PC, a smartphone and a tablet terminal device.
The program for learning model generation according to Embodiment 1 may be executed by a computer system constructed from a plurality of computers. In this case, for example, each computer may function as either the training data acquisition unit 11 or the learning model generation unit 12.
(Embodiment 2)
Next, in Embodiment 2, a joint point detection device, a joint point detection method, and a program for joint point detection will be described with reference to FIGS. 6 to 8.
[Device configuration]
First, a schematic configuration of the joint point detection device according to Embodiment 2 will be described with reference to FIG. 6. FIG. 6 is a configuration diagram showing a schematic configuration of the joint point detection device according to Embodiment 2.
The joint point detection device 40 according to Embodiment 2 shown in FIG. 6 is a device for detecting the joint points of a target, for example, a living body or a robot. As shown in FIG. 6, the joint point detection device 40 includes a graph structure acquisition unit 41 and a graph structure output unit 42.
The graph structure acquisition unit 41 acquires a first graph structure in which the two-dimensional feature amount of each of a plurality of joint points of a target is represented by a node. The graph structure output unit 42 takes the first graph structure as an input and, using a graph convolutional network, outputs a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points.
The graph convolutional network includes a plurality of intermediate layers and an output layer, and is constructed by machine-learning the relationship between the first graph structure and the second graph structure. All or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure.
In each of the plurality of intermediate layers, each feature extractor uses, as an input, the graph structure output by each feature extractor in the upper layer. The output layer outputs a graph structure by using, as an input, the graph structure output by each feature extractor in the lowest intermediate layer.
As described above, in Embodiment 2, the second graph structure is output using a graph convolutional network that includes, in its intermediate layers, a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure. Therefore, according to Embodiment 2, it is possible to avoid both a situation in which feature amounts cannot be sufficiently extracted because the number of convolutions is small, and a situation in which feature amounts cannot be accurately extracted because the number of dimensions is insufficient. As a result, according to Embodiment 2, the detection accuracy when detecting the three-dimensional coordinates of joint points from an image can be improved.
Next, the configuration and functions of the joint point detection device 40 according to Embodiment 2 will be specifically described with reference to FIG. 7. FIG. 7 is a diagram more specifically showing the configuration of the joint point detection device according to Embodiment 2.
As shown in FIG. 7, in Embodiment 2, the joint point detection device 40 includes a storage unit 43 in addition to the graph structure acquisition unit 41 and the graph structure output unit 42 described above. The storage unit 43 stores the GCN 30 shown in FIG. 2 in Embodiment 1.
In Embodiment 2 as well, the case where the target is a human hand is taken as an example. Note that, in Embodiment 2 as well, the target of joint point detection is not limited to a human hand, and may be the entire human body or another body part. The target of joint point detection may be anything that has joint points, and may be something other than a person, for example, a robot.
In addition, in Embodiment 2 as well, examples of the feature amounts indicated by the nodes of the graph structures include two-dimensional or three-dimensional coordinate values. Furthermore, in addition to the joint points, the nodes of a graph structure may represent two-dimensional or three-dimensional feature amounts of parts other than the joint points, for example, characteristic parts such as fingertips.
In Embodiment 2, as shown in FIG. 7, the graph structure acquisition unit 41 acquires a first graph structure 50 obtained from image data of a human hand. As described in Embodiment 1, the first graph structure can be obtained by inputting the image data into a machine learning model that has machine-learned the relationship between image data of joint points and graph structures.
The graph structure output unit 42 acquires the GCN 30 from the storage unit 43. Then, the graph structure output unit 42 inputs the first graph structure 50 to the GCN 30, and causes the GCN to output a second graph structure 51 indicating a three-dimensional feature amount of each of the plurality of joint points.
As described in Embodiment 1, the GCN 30 includes the input layer 31, the plurality of intermediate layers 32a and 32b, and the output layer 33, and is constructed by machine-learning the relationship between the first graph structure and the second graph structure. Therefore, in the output second graph structure 51, the three-dimensional feature amount (coordinate values) indicated by each node is a highly accurate value.
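At inference time, the graph structure output unit simply runs the first graph structure through the trained network. The sketch below is for illustration only: a single-layer linear mapping stands in for the trained multi-layer GCN 30, and in a real run the learned parameters would be loaded from the storage unit 43.

```python
import numpy as np

def detect_joint_points(X2d, A, W):
    """Output a second graph structure (one 3D feature amount per node)
    from a first graph structure (one 2D feature amount per node).
    W stands in for the learned parameters of the GCN 30."""
    A_hat = A + np.eye(A.shape[0])                  # self-loops
    P = A_hat / A_hat.sum(axis=1, keepdims=True)    # mean aggregation
    return P @ X2d @ W    # (N, 2) -> (N, 3): 3D coordinates per joint point

N = 5
A = np.eye(N, k=1) + np.eye(N, k=-1)   # chain of joint points
X2d = np.random.rand(N, 2)             # first graph structure 50
W = np.random.rand(2, 3)
Y3d = detect_joint_points(X2d, A, W)   # second graph structure 51
print(Y3d.shape)                       # (5, 3)
```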
[Device operation]
Next, the operation of the joint point detection device 40 according to Embodiment 2 will be described with reference to FIG. 8. FIG. 8 is a flow diagram showing the operation of the joint point detection device according to Embodiment 2. FIGS. 6 and 7 will be referred to as needed in the following description. In Embodiment 2, the joint point detection method is implemented by operating the joint point detection device 40. Therefore, the following description of the operation of the joint point detection device 40 also serves as the description of the joint point detection method according to Embodiment 2.
As shown in FIG. 8, first, the graph structure acquisition unit 41 acquires a first graph structure 50 obtained from image data of a human hand (step B1).
Next, the graph structure output unit 42 inputs the first graph structure to the GCN 30, and causes the GCN to output a second graph structure indicating a three-dimensional feature amount of each of the plurality of joint points (step B2).
In the second graph structure, each node represents the three-dimensional coordinates of a joint point of the target human hand, so the joint points of the human hand have been detected by step B2. As described above, since the GCN 30 is used in Embodiment 2, the three-dimensional feature amount (coordinate values) indicated by each node in the output second graph structure is a highly accurate value.
[Program]
The program for joint point detection according to Embodiment 2 may be any program that causes a computer to execute steps B1 and B2 shown in FIG. 8. By installing this program in a computer and executing it, the joint point detection device and the joint point detection method according to Embodiment 2 can be realized. In this case, the processor of the computer functions as the graph structure acquisition unit 41 and the graph structure output unit 42 and performs the processing.
In Embodiment 2, the storage unit 43 may be realized by storing the data files constituting the GCN 30 in a storage device such as a hard disk provided in the computer, or may be realized by a storage device of another computer. Examples of the computer include, in addition to a general-purpose PC, a smartphone and a tablet terminal device.
The program for joint point detection according to Embodiment 2 may be executed by a computer system constructed from a plurality of computers. In this case, for example, each computer may function as either the graph structure acquisition unit 41 or the graph structure output unit 42.
[Physical configuration]
Here, a computer that realizes the learning model generation device 10 by executing the program according to Embodiment 1, and a computer that realizes the joint point detection device 40 by executing the program according to Embodiment 2, will be described with reference to FIG. 9. FIG. 9 is a block diagram showing an example of a computer that realizes the learning model generation device according to Embodiment 1 and the joint point detection device according to Embodiment 2.
 As shown in FIG. 9, a computer 110 includes a CPU (Central Processing Unit) 111, a main memory 112, a storage device 113, an input interface 114, a display controller 115, a data reader/writer 116, and a communication interface 117. These units are connected to one another via a bus 121 so that they can exchange data.
 The computer 110 may also include a GPU (Graphics Processing Unit) or an FPGA (Field-Programmable Gate Array) in addition to, or instead of, the CPU 111. In this aspect, the GPU or FPGA can execute the program in the embodiments.
 The CPU 111 loads the program in the embodiments, which consists of a group of code stored in the storage device 113, into the main memory 112 and executes the code in a predetermined order, thereby carrying out various operations. The main memory 112 is typically a volatile storage device such as a DRAM (Dynamic Random Access Memory).
 The program in the embodiments is provided in a state stored in a computer-readable recording medium 120. The program in the embodiments may also be distributed over the Internet, to which the computer is connected via the communication interface 117.
 Specific examples of the storage device 113 include a hard disk drive and a semiconductor storage device such as a flash memory. The input interface 114 mediates data transmission between the CPU 111 and input devices 118 such as a keyboard and a mouse. The display controller 115 is connected to a display device 119 and controls display on the display device 119.
 The data reader/writer 116 mediates data transmission between the CPU 111 and the recording medium 120; it reads the program from the recording medium 120 and writes processing results of the computer 110 to the recording medium 120. The communication interface 117 mediates data transmission between the CPU 111 and other computers.
 Specific examples of the recording medium 120 include general-purpose semiconductor storage devices such as CF (Compact Flash (registered trademark)) and SD (Secure Digital), magnetic recording media such as flexible disks, and optical recording media such as CD-ROM (Compact Disk Read Only Memory).
 The learning model generation device 10 and the joint point detection device 40 can also be realized by using hardware corresponding to each unit, rather than a computer in which the program is installed. Furthermore, the learning model generation device 10 and the joint point detection device 40 may be partly realized by a program and partly by hardware.
 Some or all of the above-described embodiments can be expressed by, but are not limited to, the following (Appendix 1) to (Appendix 11).
(Appendix 1)
 A joint point detection device comprising:
 a graph structure acquisition unit that acquires a first graph structure in which a two-dimensional feature of each of a plurality of joint points of a target is represented by a node; and
 a graph structure output unit that takes the first graph structure as input and, using a graph convolutional network, outputs a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine learning the relationship between the first graph structure and the second graph structure,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
(Appendix 2)
 The joint point detection device according to Appendix 1, wherein,
 in the graph convolutional network, the plurality of intermediate layers include first intermediate layers and second intermediate layers,
 the first intermediate layers include only feature extractors that perform feature extraction without changing the number of nodes of a graph structure, and in each of the first intermediate layers, each feature extractor uses, as input, the graph structure output by the feature extractor of the layer above that performs feature extraction with the same number of nodes, and
 the second intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure, together with one or both of a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure and a feature extractor that performs feature extraction while increasing the number of nodes of a graph structure, and in each of the second intermediate layers, each feature extractor uses, as input, a plurality of graph structures output by the feature extractors of the layer above.
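The distinction drawn here between node-preserving extraction and node-reducing (or node-enlarging) extraction can be illustrated with a pooling matrix. This is only a hedged sketch: the assignment matrix `S` and the six-node coarse grouping (wrist plus five fingers) are hypothetical choices for illustration, not a scheme fixed by the text.

```python
import numpy as np

# Node-preserving extraction keeps the 21-node graph; node-reducing
# extraction pools nodes with an assignment matrix S; node-enlarging
# extraction restores nodes with S (acting as unpooling). A second
# intermediate layer's extractor may then fuse several upper outputs.

N, C = 21, 8
rng = np.random.default_rng(1)
H = rng.random((N, C))                 # features output by the layer above

# Assignment: node 0 -> coarse node 0 (wrist); joints 1-4 -> finger 1, etc.
S = np.zeros((N, 6))
S[0, 0] = 1.0
for f in range(5):
    for j in range(1 + 4 * f, 5 + 4 * f):
        S[j, f + 1] = 1.0
S /= S.sum(axis=0)                     # average-pool within each group

W_keep = rng.standard_normal((C, C))
W_pool = rng.standard_normal((C, C))

H_keep = np.maximum(H @ W_keep, 0.0)          # same node count: (21, C)
H_pool = np.maximum(S.T @ H @ W_pool, 0.0)    # reduced node count: (6, C)

# Fuse a full-resolution branch with an unpooled coarse branch,
# i.e. one extractor taking plural upper graph structures as input.
H_fused = H_keep + S @ H_pool                 # back to (21, C)

print(H_keep.shape, H_pool.shape, H_fused.shape)
```

Reducing the node count lets deeper extractors aggregate per-finger context, while the fused branch retains per-joint resolution; this mirrors why a second intermediate layer consumes plural graph structures from the layer above.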
(Appendix 3)
 The joint point detection device according to Appendix 1 or 2, wherein
 the first graph structure indicates the two-dimensional coordinates of each of the joint points, and
 the second graph structure indicates the three-dimensional coordinates of each of the joint points.
(Appendix 4)
 A learning model generation device comprising:
 a training data acquisition unit that acquires, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature of each of the plurality of joint points; and
 a learning model generation unit that inputs the first graph structure to a graph convolutional network, calculates the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generates a machine learning model constructed by the graph convolutional network by machine learning the parameters of the graph convolutional network so that the calculated difference becomes small,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
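The training procedure of this appendix — output a graph structure, compute the difference from the correct-answer data, and adjust the parameters so that the difference becomes small — can be sketched as follows. A single linear graph layer stands in for the full multi-layer network, and the synthetic data, learning rate, and iteration count are assumptions for illustration only.

```python
import numpy as np

# Toy training loop: mean squared error between the network's output graph
# structure and the correct-answer 3-D graph structure, minimized by
# gradient descent on the layer's parameters W.

N = 21
rng = np.random.default_rng(2)
A_hat = np.eye(N)                 # identity stands in for the normalized adjacency
X = rng.random((N, 2))            # first graph structure (2-D features)
W_true = rng.standard_normal((2, 3))
Y = A_hat @ X @ W_true            # correct-answer second graph structure

W = np.zeros((2, 3))              # parameters to be machine-learned
lr = 0.5
for _ in range(2000):
    pred = A_hat @ X @ W          # graph structure output by the network
    diff = pred - Y               # difference from the correct-answer data
    loss = (diff ** 2).mean()
    grad = 2.0 * (A_hat @ X).T @ diff / diff.size
    W -= lr * grad                # update so the difference becomes small

print(loss)                       # small after training
```

In the embodiment the same principle applies, except that the forward pass runs through all intermediate layers and the output layer of the GCN, and backpropagation updates every extractor's parameters.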
(Appendix 5)
 The learning model generation device according to Appendix 4, wherein,
 in the graph convolutional network, the plurality of intermediate layers include first intermediate layers and second intermediate layers,
 the first intermediate layers include only feature extractors that perform feature extraction without changing the number of nodes of a graph structure, and in each of the first intermediate layers, each feature extractor uses, as input, the graph structure output by the feature extractor of the layer above that performs feature extraction with the same number of nodes, and
 the second intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure, together with one or both of a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure and a feature extractor that performs feature extraction while increasing the number of nodes of a graph structure, and in each of the second intermediate layers, each feature extractor uses, as input, a plurality of graph structures output by the feature extractors of the layer above.
(Appendix 6)
 The learning model generation device according to Appendix 4 or 5, wherein
 the first graph structure indicates the two-dimensional coordinates of each of the joint points, and
 the second graph structure indicates the three-dimensional coordinates of each of the joint points.
(Appendix 7)
 A joint point detection method comprising:
 a graph structure acquisition step of acquiring a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target; and
 a graph structure output step of taking the first graph structure as input and, using a graph convolutional network, outputting a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine learning the relationship between the first graph structure and the second graph structure,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
(Appendix 8)
 A learning model generation method comprising:
 a training data acquisition step of acquiring, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature of each of the plurality of joint points; and
 a machine learning model generation step of inputting the first graph structure to a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generating a machine learning model constructed by the graph convolutional network by machine learning the parameters of the graph convolutional network so that the calculated difference becomes small,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
(Appendix 9)
 A computer-readable recording medium on which a program is recorded, the program including instructions that cause a computer to execute:
 a graph structure acquisition step of acquiring a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target; and
 a graph structure output step of taking the first graph structure as input and, using a graph convolutional network, outputting a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine learning the relationship between the first graph structure and the second graph structure,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
(Appendix 10)
 A computer-readable recording medium on which a program is recorded, the program including instructions that cause a computer to execute:
 a training data acquisition step of acquiring, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature of each of the plurality of joint points; and
 a machine learning model generation step of inputting the first graph structure to a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generating a machine learning model constructed by the graph convolutional network by machine learning the parameters of the graph convolutional network so that the calculated difference becomes small,
 wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
 all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
(Appendix 11)
 A graph convolutional network comprising a plurality of intermediate layers and an output layer,
 the graph convolutional network being constructed by machine learning the relationship between a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
 wherein all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
 the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
 Although the present invention has been described above with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
 This application claims priority based on Japanese Patent Application No. 2021-029412 filed on February 26, 2021, the entire disclosure of which is incorporated herein.
 As described above, according to the present invention, it is possible to improve the detection accuracy when detecting the three-dimensional coordinates of joint points from an image. The present invention is useful in fields that require posture detection of objects having joint points, such as people and robots. Specific fields include video surveillance and user interfaces.
REFERENCE SIGNS LIST
 10 Learning model generation device
 11 Training data acquisition unit
 12 Learning model generation unit
 13 Storage unit
 20 Image data
 21 First graph structure
 22 Node
 23 Second graph structure
 24 Node
 30 Graph convolutional network (GCN)
 31 Input layer
 32a First intermediate layer
 32b Second intermediate layer
 33 Output layer
 40 Joint point detection device
 41 Graph structure acquisition unit
 42 Graph structure output unit
 43 Storage unit
 50 First graph structure
 51 Second graph structure
 110 Computer
 111 CPU
 112 Main memory
 113 Storage device
 114 Input interface
 115 Display controller
 116 Data reader/writer
 117 Communication interface
 118 Input device
 119 Display device
 120 Recording medium
 121 Bus

Claims (11)

  1.  A joint point detection device comprising:
     graph structure acquisition means for acquiring a first graph structure in which a two-dimensional feature of each of a plurality of joint points of a target is represented by a node; and
     graph structure output means for taking the first graph structure as input and, using a graph convolutional network, outputting a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
     wherein the graph convolutional network includes a plurality of intermediate layers and an output layer and is constructed by machine learning the relationship between the first graph structure and the second graph structure,
     all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
     the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
  2.  The joint point detection device according to claim 1, wherein,
     in the graph convolutional network, the plurality of intermediate layers include first intermediate layers and second intermediate layers,
     the first intermediate layers include only feature extractors that perform feature extraction without changing the number of nodes of a graph structure, and in each of the first intermediate layers, each feature extractor uses, as input, the graph structure output by the feature extractor of the layer above that performs feature extraction with the same number of nodes, and
     the second intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure, together with one or both of a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure and a feature extractor that performs feature extraction while increasing the number of nodes of a graph structure, and in each of the second intermediate layers, each feature extractor uses, as input, a plurality of graph structures output by the feature extractors of the layer above.
  3.  The joint point detection device according to claim 1 or 2, wherein
     the first graph structure indicates the two-dimensional coordinates of each of the joint points, and
     the second graph structure indicates the three-dimensional coordinates of each of the joint points.
  4.  A learning model generation device comprising:
     training data acquisition means for acquiring, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and, as correct-answer data, a second graph structure indicating a three-dimensional feature of each of the plurality of joint points; and
     learning model generation means for inputting the first graph structure to a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct-answer data, and generating a machine learning model constructed by the graph convolutional network by machine learning the parameters of the graph convolutional network so that the calculated difference becomes small,
     wherein the graph convolutional network includes a plurality of intermediate layers and an output layer,
     all or some of the plurality of intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure and a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure, and in each of the plurality of intermediate layers, each feature extractor uses, as input, the graph structures output by the feature extractors of the layer above, and
     the output layer uses, as input, the graph structures output by the feature extractors of the lowest intermediate layer, and outputs a graph structure.
  5.  The learning model generation device according to claim 4, wherein,
     in the graph convolutional network, the plurality of intermediate layers include first intermediate layers and second intermediate layers,
     the first intermediate layers include only feature extractors that perform feature extraction without changing the number of nodes of a graph structure, and in each of the first intermediate layers, each feature extractor uses, as input, the graph structure output by the feature extractor of the layer above that performs feature extraction with the same number of nodes, and
     the second intermediate layers include a feature extractor that performs feature extraction without changing the number of nodes of a graph structure, together with one or both of a feature extractor that performs feature extraction while reducing the number of nodes of a graph structure and a feature extractor that performs feature extraction while increasing the number of nodes of a graph structure, and in each of the second intermediate layers, each feature extractor uses, as input, a plurality of graph structures output by the feature extractors of the layer above.
  6.  The learning model generation device according to claim 4 or 5, wherein
     the first graph structure indicates the two-dimensional coordinates of each of the joint points, and
     the second graph structure indicates the three-dimensional coordinates of each of the joint points.
  7.  A joint point detection method comprising:
     acquiring a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target; and
     using the first graph structure as input to a graph convolutional network to output a second graph structure indicating a three-dimensional feature of each of the plurality of joint points, wherein
     the graph convolutional network comprises a plurality of intermediate layers and an output layer, and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
     all or some of the plurality of intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes in the graph structure and a feature extractor that performs feature extraction with a reduced number of nodes in the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors in the upper layer, and
     the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
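The detection method above — take a 2-D per-joint graph, pass it through stacked intermediate layers, and let the output layer emit a 3-D per-joint graph — can be sketched as a minimal forward pass. This is an assumed toy implementation, not the patented network: the chain-skeleton adjacency, layer widths, and weight values are placeholders.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    # one feature-extraction step: neighbourhood aggregation + linear map + ReLU
    return np.maximum(adj @ feats @ weight, 0.0)

def lift_2d_to_3d(adj, joints_2d, hidden_weights, out_weight):
    # Each intermediate layer consumes the graph structure output by the
    # layer above; the output layer maps the lowest layer's features to
    # per-joint 3-D coordinates (no ReLU, so coordinates may be negative).
    h = joints_2d
    for w in hidden_weights:
        h = gcn_layer(adj, h, w)
    return adj @ h @ out_weight

rng = np.random.default_rng(1)
n = 17
# toy chain skeleton: each joint linked to its neighbours, plus self-loops
adj = np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
joints_2d = rng.standard_normal((n, 2))                    # first graph structure
hidden = [rng.standard_normal((2, 16)), rng.standard_normal((16, 16))]
out_w = rng.standard_normal((16, 3))
joints_3d = lift_2d_to_3d(adj, joints_2d, hidden, out_w)   # second graph structure
```

The input and output share the same node set (one node per joint); only the per-node feature changes, from a 2-D quantity to a 3-D one.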
  8.  A learning model generation method comprising:
     acquiring, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and a second graph structure, serving as correct data, indicating a three-dimensional feature of each of the plurality of joint points; and
     inputting the first graph structure into a graph convolutional network, calculating the difference between the graph structure output from the graph convolutional network and the correct data, and machine-learning the parameters of the graph convolutional network so that the calculated difference becomes small, thereby generating a machine learning model constructed by the graph convolutional network, wherein
     the graph convolutional network comprises a plurality of intermediate layers and an output layer,
     all or some of the plurality of intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes in the graph structure and a feature extractor that performs feature extraction with a reduced number of nodes in the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors in the upper layer, and
     the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
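The training procedure above — compute the difference between the network's output graph and the correct data, then adjust parameters so the difference shrinks — reduces, in the simplest case, to gradient descent on a squared-error loss. The sketch below uses a single linear graph-convolution layer so the gradient can be written by hand; the identity adjacency, learning rate, and step count are illustrative assumptions, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(2)
n, f_in, f_out = 17, 2, 3
adj = np.eye(n)                               # trivial adjacency for the sketch
x = rng.standard_normal((n, f_in))            # first graph structure (2-D features)
y = rng.standard_normal((n, f_out))           # correct data (3-D features)
w = np.zeros((f_in, f_out))                   # the parameter being learned

def difference(w):
    # mean squared difference between the network output and the correct data
    return float(((adj @ x @ w - y) ** 2).mean())

loss_before = difference(w)
for _ in range(200):                          # gradient steps shrink the difference
    residual = adj @ x @ w - y
    grad = 2.0 * (adj @ x).T @ residual / residual.size
    w -= 0.1 * grad
loss_after = difference(w)
```

After training, `loss_after` is smaller than `loss_before`, which is exactly the criterion the claim states: the parameters are learned so that the calculated difference becomes small.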
  9.  A computer-readable recording medium recording a program including instructions that cause a computer to:
     acquire a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target; and
     use the first graph structure as input to a graph convolutional network to output a second graph structure indicating a three-dimensional feature of each of the plurality of joint points, wherein
     the graph convolutional network comprises a plurality of intermediate layers and an output layer, and is constructed by machine-learning the relationship between the first graph structure and the second graph structure,
     all or some of the plurality of intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes in the graph structure and a feature extractor that performs feature extraction with a reduced number of nodes in the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors in the upper layer, and
     the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
  10.  A computer-readable recording medium recording a program including instructions that cause a computer to:
     acquire, as training data, a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and a second graph structure, serving as correct data, indicating a three-dimensional feature of each of the plurality of joint points; and
     input the first graph structure into a graph convolutional network, calculate the difference between the graph structure output from the graph convolutional network and the correct data, and machine-learn the parameters of the graph convolutional network so that the calculated difference becomes small, thereby generating a machine learning model constructed by the graph convolutional network, wherein
     the graph convolutional network comprises a plurality of intermediate layers and an output layer,
     all or some of the plurality of intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes in the graph structure and a feature extractor that performs feature extraction with a reduced number of nodes in the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors in the upper layer, and
     the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
  11.  A graph convolutional network comprising a plurality of intermediate layers and an output layer, wherein
     the graph convolutional network is constructed by machine-learning the relationship between a first graph structure indicating a two-dimensional feature of each of a plurality of joint points of a target and a second graph structure indicating a three-dimensional feature of each of the plurality of joint points,
     all or some of the plurality of intermediate layers comprise a feature extractor that performs feature extraction without changing the number of nodes in the graph structure and a feature extractor that performs feature extraction with a reduced number of nodes in the graph structure, and in each of the plurality of intermediate layers, each feature extractor uses as input the graph structures output by the feature extractors in the upper layer, and
     the output layer uses as input the graph structures output by the feature extractors of the lowest intermediate layer and outputs a graph structure.
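Putting the architectural claim together, one plausible wiring (an assumption for illustration, not the patented design) is a two-branch forward pass: within an intermediate layer, one extractor keeps the node count while another reduces it, and the next layer consumes both upper-layer outputs before the output layer produces the 3-D graph. The pooling matrix, adjacencies, and widths below are all placeholders.

```python
import numpy as np

def extract(adj, feats, weight):
    # neighbourhood aggregation followed by a learned linear map and ReLU
    return np.maximum(adj @ feats @ weight, 0.0)

rng = np.random.default_rng(3)
n, m = 17, 5                                   # fine / coarse node counts (assumed)
adj_f, adj_c = np.eye(n), np.eye(m)            # placeholder adjacencies
pool = rng.standard_normal((m, n)) / n         # node-reducing map (17 -> 5)
x = rng.standard_normal((n, 2))                # 2-D feature per joint

# intermediate layer 1: one extractor keeps 17 nodes, another reduces to 5
h_fine = extract(adj_f, x, rng.standard_normal((2, 8)))
h_coarse = extract(adj_c, pool @ x, rng.standard_normal((2, 8)))

# intermediate layer 2: its extractor consumes both upper-layer outputs
# (the coarse branch is lifted back to 17 nodes before concatenation)
h2 = extract(adj_f, np.concatenate([h_fine, pool.T @ h_coarse], axis=1),
             rng.standard_normal((16, 8)))

# output layer: maps the lowest intermediate layer's features to 3-D coordinates
joints_3d = adj_f @ h2 @ rng.standard_normal((8, 3))
```

The coarse branch plays the same role as the contracting path of a U-Net (cf. the cited R2U-Net and mU-Net references): the reduced-node graph captures wider context, which is fused back into the full-resolution joint graph before the final 3-D output.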
PCT/JP2022/003767 2021-02-26 2022-02-01 Joint point detection device, teaching model generation device, joint point detection method, teaching model generation method, and computer-readable recording medium WO2022181253A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2023502226A JPWO2022181253A5 (en) 2022-02-01 Joint point detection device, learning model generation device, joint point detection method, learning model generation method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021029412 2021-02-26
JP2021-029412 2021-02-26

Publications (1)

Publication Number Publication Date
WO2022181253A1 true WO2022181253A1 (en) 2022-09-01

Family

ID=83048173

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/003767 WO2022181253A1 (en) 2021-02-26 2022-02-01 Joint point detection device, teaching model generation device, joint point detection method, teaching model generation method, and computer-readable recording medium

Country Status (1)

Country Link
WO (1) WO2022181253A1 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DOOSTI BARDIA; NAHA SHUJON; MIRBAGHERI MAJID; CRANDALL DAVID J: "HOPE-Net: A Graph-Based Model for Hand-Object Pose Estimation", 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 13 June 2020 (2020-06-13), pages 6607 - 6616, XP033804923, DOI: 10.1109/CVPR42600.2020.00664 *
MD ZAHANGIR ALOM; MAHMUDUL HASAN; CHRIS YAKOPCIC; TAREK M. TAHA; VIJAYAN K. ASARI: "Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation", ARXIV.ORG, 20 February 2018 (2018-02-20), pages 1 - 12, XP081216782 *
SEO HYUNSEOK; HUANG CHARLES; BASSENNE MAXIME; XIAO RUOXIU; XING LEI: "Modified U-Net (mU-Net) With Incorporation of Object-Dependent High Level Features for Improved Liver and Liver-Tumor Segmentation in CT Images", IEEE TRANSACTIONS ON MEDICAL IMAGING, vol. 39, no. 5, 18 October 2019 (2019-10-18), USA, pages 1316 - 1325, XP011785778, ISSN: 0278-0062, DOI: 10.1109/TMI.2019.2948320 *

Also Published As

Publication number Publication date
JPWO2022181253A1 (en) 2022-09-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22759294

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023502226

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22759294

Country of ref document: EP

Kind code of ref document: A1