WO2023148909A1 - Machine learning device, skilled action determination device, machine learning method, and machine learning program - Google Patents

Machine learning device, skilled action determination device, machine learning method, and machine learning program

Info

Publication number
WO2023148909A1
WO2023148909A1 (PCT application PCT/JP2022/004364)
Authority
WO
WIPO (PCT)
Prior art keywords
graph
image
machine learning
unit
action
Prior art date
Application number
PCT/JP2022/004364
Other languages
French (fr)
Japanese (ja)
Inventor
雄一 佐々木
翔貴 宮川
勇 小川
雅浩 虻川
Original Assignee
三菱電機株式会社 (Mitsubishi Electric Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation (三菱電機株式会社)
Priority to JP2023532819A priority Critical patent/JP7387069B1/en
Priority to PCT/JP2022/004364 priority patent/WO2023148909A1/en
Priority to TW111127906A priority patent/TW202333089A/en
Publication of WO2023148909A1 publication Critical patent/WO2023148909A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion

Definitions

  • The present disclosure relates to a machine learning device, a machine learning method, and a machine learning program for learning a learning model for inferring the action proficiency level of an action subject in an image, and to a skilled action determination device for inferring the action proficiency level of an action subject in an image.
  • Transfer learning is known in which a user corrects a region of interest generated for an image by a neural network (NN) in a learning model (that is, human knowledge is embedded in the learning model), and learning is performed using the corrected region of interest as correct data (see, for example, Non-Patent Document 1).
  • Transfer learning is a Human-in-the-Loop (HITL) type of learning.
  • A Spatio-Temporal Graph Convolution Network (ST-GCN) is known as a learning model for detecting human movement from the skeleton, using a graph in which a person's joint coordinates are nodes and the relationships between the joints are edges (see, for example, Non-Patent Document 2).
  • Graph Region Based Convolutional Neural Networks (Graph R-CNN) is known as a method that uses Relationship Proposal Networks (RePN), a general object detection model, to extract objects and the image features linked to those objects, and then learns a scene graph having a graph structure that represents their relationships (see, for example, Non-Patent Document 3).
  • the scene graph is a graph in which objects appearing in an image are nodes, and relationships established between the nodes are edges (for example, directed edges).
  • However, in the method of Non-Patent Document 1, since the user only corrects the region of interest of the image, it is not possible to generate a learning model that can infer, with high prediction accuracy, the skill level of the behavior of a person as the action subject.
  • In the method of Non-Patent Document 2, since the graph structure uses only skeletal information, it is considered difficult to generate a learning model that can infer the proficiency level of human behavior with high prediction accuracy.
  • The method of Non-Patent Document 3 deals only with simple relationships between objects appearing in images (for example, positional relationships between a tree and a bird, a tree and its leaves, or a tree and its branches). Therefore, it is considered difficult to generate a learning model that can infer the proficiency level of a person's behavior with high prediction accuracy.
  • An object of the present disclosure is to provide a machine learning device, a machine learning method, and a machine learning program for learning a learning model capable of inferring the action proficiency of an action subject with high prediction accuracy, and a skilled action determination device that uses the learning model to infer the action proficiency of an action subject in an image.
  • A machine learning device of the present disclosure is a device that learns a learning model for inferring the action proficiency of an action subject in an image. The device includes: a graph input unit that acquires, based on a user's input operation, a graph composed of a plurality of nodes corresponding to a plurality of parts of the action subject and information indicating the relationships between the plurality of nodes; a storage unit that stores the graph acquired by the graph input unit; an object recognition unit that recognizes and outputs a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes are present; a skilled action feature extraction unit that extracts a first feature amount, which is a feature amount of the actions of the plurality of parts of the action subject present in the image; a region-of-interest generation unit that generates a region of interest in the image based on the first feature amount; a graph-object feature extraction unit that generates a second feature amount in which the first feature amount is emphasized for the region where the region of interest and the object regions overlap; and a graph model learning unit that generates the learning model based on the second feature amount when the image is learning data collected in advance.
  • A machine learning method of the present disclosure is a method implemented by a machine learning device that learns a learning model for inferring the action proficiency of an action subject in an image. The method includes: a step of extracting a first feature amount, which is a feature amount of the actions of a plurality of parts of the action subject present in the image; a step of acquiring, based on a user's input operation, a graph composed of a plurality of nodes corresponding to the plurality of parts of the action subject and information indicating the relationships between the plurality of nodes, and storing the graph; a step of recognizing and outputting a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes exist; a step of generating a region of interest in the image based on the first feature amount; a step of generating a second feature amount in which the first feature amount is emphasized for the region where the region of interest and the object regions overlap; and a step of generating the learning model based on the second feature amount when the image is learning data collected in advance.
  • By using the machine learning device, machine learning method, and machine learning program of the present disclosure, it is possible to generate a learning model that can infer the action proficiency of an action subject with high prediction accuracy.
  • FIG. 1 is a diagram illustrating an example of the hardware configuration of the machine learning device according to Embodiment 1.
  • FIG. 2 is a functional block diagram schematically showing the configuration of the machine learning device according to Embodiment 1.
  • FIG. 3 is an explanatory diagram showing the operation during learning of the machine learning device according to Embodiment 1.
  • FIG. 4 is a diagram showing, in tabular form, an example of the operation during learning of the machine learning device according to Embodiment 1.
  • FIG. 5 is a flowchart showing the operation during learning of the machine learning device according to Embodiment 1.
  • FIG. 6 is an explanatory diagram showing the operation during inference of the machine learning device (skilled action determination device) according to Embodiment 1.
  • FIG. 7 is a flowchart showing the operation during inference of the machine learning device (skilled action determination device) according to Embodiment 1.
  • FIG. 8 is a functional block diagram schematically showing the configuration of the machine learning device according to Embodiment 2.
  • FIGS. 9(A) and 9(B) are explanatory diagrams showing the operation of the machine learning device according to Embodiment 2.
  • FIG. 10 is a flowchart showing the operation during learning of the machine learning device according to Embodiment 2.
  • FIG. 11 is a functional block diagram schematically showing the configuration of the machine learning device according to Embodiment 3.
  • FIG. 12 is an explanatory diagram showing the operation during learning of the machine learning device according to Embodiment 3.
  • FIGS. 13(A) and 13(B) are explanatory diagrams showing the operation of the learning rate adjustment unit of the machine learning device according to Embodiment 3.
  • FIG. 14 is a flowchart showing the operation during learning of the machine learning device according to Embodiment 3.
  • FIG. 15 is a functional block diagram schematically showing the configuration of the machine learning device according to Embodiment 4.
  • FIG. 16 is a flowchart showing the operation during learning of the machine learning device according to Embodiment 4.
  • FIG. 17 is an explanatory diagram showing the operation of the machine learning device according to Embodiment 4.
  • FIG. 18 is an explanatory diagram showing the operation of the machine learning device according to Embodiment 4.
  • A machine learning device, a skilled behavior inference device, a machine learning method, and a machine learning program according to embodiments will be described below with reference to the drawings.
  • The following embodiments are merely examples; the embodiments can be combined as appropriate, and each embodiment can be modified as appropriate.
  • the machine learning device is a device that learns a learning model for inferring the proficiency level of actions of an action subject in an image.
  • a machine learning device is, for example, a computer as an information processing device.
  • The action subject is a person who performs work (also called a worker, a technician, a skilled worker, etc.), or a mechanism or device that operates in conjunction with a person's movements to perform work (for example, a robot arm or an endoscope).
  • a machine learning method is a method that can be implemented by a machine learning device.
  • This machine learning method is a method of learning a learning model for inferring the action proficiency of the action subject in the image.
  • a machine learning program is a program that can be executed by a computer as a machine learning device.
  • This machine learning program is a program for learning a learning model for inferring the action proficiency of the action subject in the image.
  • a skilled action inference device is a device that infers the action proficiency of an action subject using a learning model generated by a machine learning device, a machine learning method, or a machine learning program.
  • a skilled action reasoning device is, for example, a computer.
  • The expert behavior inference device and the machine learning device may be configured on a common computer, or they may be configured on different computers.
  • FIG. 1 is a diagram showing an example of a hardware configuration of a machine learning device 100 according to the first embodiment.
  • The machine learning device 100 according to Embodiment 1 is a device that executes a learning process of generating a learning model M by performing machine learning. The machine learning device 100 is also a skilled action determination device.
  • The machine learning device 100 includes a processor 101 such as a CPU (Central Processing Unit), a memory 102 that is a volatile storage device, a nonvolatile storage device 103 such as a hard disk drive (HDD) or a solid state drive (SSD), and an interface 104.
  • the memory 102 is, for example, a semiconductor memory such as a RAM (Random Access Memory).
  • the machine learning device 100 may have a communication device that communicates with an external device.
  • The functions of the machine learning device 100 are implemented by processing circuitry. The processing circuitry may be dedicated hardware, or it may be the processor 101 that executes a program stored in the memory 102 (for example, the machine learning program according to the embodiment).
  • The processor 101 may be a processing device, an arithmetic device, a microprocessor, a microcomputer, or a DSP (Digital Signal Processor).
  • When the processing circuitry is dedicated hardware, the processing circuitry is, for example, an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
  • When the processing circuitry is the processor 101, the machine learning method is implemented by software, firmware, or a combination of software and firmware. The software and firmware are written as programs and stored in the memory 102. The processor 101 implements the machine learning method according to the first embodiment by reading and executing the programs stored in the memory 102.
  • The machine learning device 100 may be partially implemented by dedicated hardware and partially implemented by software or firmware. In this way, the processing circuitry can implement each of the functions described above by hardware, software, firmware, or a combination thereof.
  • the interface 104 is used to communicate with other devices.
  • An external storage device, a display 105, an input device 106 serving as a user operation unit, and the like are connected to the interface 104.
  • the input device 106 is, for example, a mouse, keyboard, touch panel, or the like.
  • FIG. 2 is a functional block diagram schematically showing the configuration of machine learning device 100 according to Embodiment 1.
  • the machine learning device 100 is a device that learns a learning model M for inferring the proficiency level of actions of an action subject in an image.
  • The machine learning device 100 includes a skilled action determination model 11, a graph input unit 15 serving as a correlation/causal graph input unit, a graph-object feature extraction unit 16, a user input area extraction table 17 stored in a storage unit, and an object recognition unit 18.
  • the skillful action determination model 11 has a skillful action feature extraction unit 12 , a region-of-interest generation unit 13 , and a graph model learning unit 14 .
  • At the time of learning, in the skilled action determination model 11, the learning model generation unit 11a generates a learning model M and stores it in the learning model storage unit 11b.
  • At the time of inference, the inference unit 11c reads out the learning model M from the learning model storage unit 11b, performs inference based on the input data, and outputs the inference result.
  • The graph input unit 15 acquires a graph G composed of a plurality of nodes corresponding to a plurality of parts of the action subject in the image and information indicating the relationships between the plurality of nodes, based on an input operation by the user 50.
  • The user inputs nodes and the correlations/causal relationships between the nodes from the input device to the graph input unit 15. That is, the graph input unit 15 enters the relationships between the nodes into the causality/correlation graph.
  • The user 50 uses the graph input unit 15 to specify the regions in which knowledge is to be embedded (for example, right hand RH, left hand LH, head HE), and registers, in the user input area extraction table 17, the methods for extracting the objects of those regions. Furthermore, through the graph input unit 15, the user 50 registers in advance information about the relationships that he or she expects between the regions.
  • The user input area extraction table 17 is a table of information acquired in advance or through input operations from the graph input unit 15.
  • The object recognition unit 18 recognizes and outputs the plurality of object areas O in the image in which the plurality of objects corresponding to the plurality of nodes (also referred to as "graph nodes") exist.
  • The object recognition unit 18 recognizes an object corresponding to a node input by the user 50 (for example, right hand, left hand, head) and an object area O (for example, a rectangular area) containing the object. That is, the object recognition unit 18 reads the object extraction method registered in the user input area extraction table 17 and extracts the corresponding object area from the moving image and time-series sensor data according to that method.
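  • The patent does not give an implementation, but the table-driven extraction just described can be pictured with the following Python sketch. The table layout, the skin-color HSV range, and the leftmost/rightmost/topmost assignment rule are all assumptions made for illustration, not the patent's method.

```python
# Hypothetical sketch of the user input area extraction table and the
# object recognition step that reads extraction methods from it.
import cv2

# Assumed table: node name -> descriptor of how to extract its region.
USER_INPUT_AREA_EXTRACTION_TABLE = {
    "right_hand": {"method": "skin_color", "pick": "rightmost"},
    "left_hand":  {"method": "skin_color", "pick": "leftmost"},
    "head":       {"method": "skin_color", "pick": "topmost"},
}

def skin_color_boxes(frame_bgr):
    """Bounding rectangles of skin-colored regions (rough HSV range)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 40, 60), (25, 255, 255))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 100]

def recognize_object_areas(frame_bgr):
    """Assign each registered node a rectangle (x, y, w, h) by position."""
    boxes = skin_color_boxes(frame_bgr)
    pickers = {
        "rightmost": lambda bs: max(bs, key=lambda b: b[0]),
        "leftmost":  lambda bs: min(bs, key=lambda b: b[0]),
        "topmost":   lambda bs: min(bs, key=lambda b: b[1]),
    }
    areas = {}
    for node, spec in USER_INPUT_AREA_EXTRACTION_TABLE.items():
        if spec["method"] == "skin_color" and boxes:
            areas[node] = pickers[spec["pick"]](boxes)
    return areas
```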
  • the skillful action feature extraction unit 12 extracts a first feature amount F1 that is a feature amount (that is, an intermediate feature amount) of actions of a plurality of portions of the action subject present in the image.
  • the multiple parts of the action subject are, for example, the operator's right hand RH, left hand LH, and head HE.
  • the skillful action feature extraction unit 12 acquires intermediate features using a feature extractor such as a CNN (Convolutional Neural Network), for example.
  • the region-of-interest generation unit 13 generates a region of interest A in the image based on the first feature amount F1.
  • the region-of-interest generation unit 13 generates heat map information indicating which region of the image to focus on to obtain the skill level, using a network mechanism such as an attention branch network (ABN).
  • The region-of-interest generation unit 13 registers the visualization result, which is intermediate information produced while generating the heat map information, as part of the learning results.
  • ABN is described in Non-Patent Document 1, for example.
  • the graph-object feature extraction unit 16 generates a second feature quantity F2 that emphasizes the first feature quantity F1 for the region where the region of interest A and the object region O overlap.
  • the graph-object feature extraction unit 16 associates a sensor feature amount such as an image with a region in which human knowledge is assumed to be embedded.
  • In the first feature amount F1 extracted by the skilled action determination model 11, the graph-object feature extraction unit 16 masks all areas other than the area where the object region O extracted by the object recognition unit 18 overlaps the region of interest A generated by the region-of-interest generation unit 13, and associates the graph G, which is the user's input, with the first feature amount F1, which is the skillful action feature.
  • the graph-object feature extracting unit 16 extracts features for nodes by performing mask processing using heat map information indicating the object region and the region of interest.
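  • As an illustration only (the tensor shapes are assumptions and this is not the patent's code), the mask processing just described might look like the following: activations outside the intersection of the object rectangle and the fired attention region are zeroed, the surviving part is added back to F1 in ABN style, and the result is pooled into one feature vector per node.

```python
import torch

def node_feature(f1, attention, box, thresh=0.5):
    """f1: (T, Ch, H, W) intermediate feature; attention: (H, W) in [0, 1];
    box: (x, y, w, h) object rectangle. Returns a (T, Ch) node feature."""
    _, _, H, W = f1.shape
    mask = torch.zeros(H, W)
    x, y, w, h = box
    mask[y:y + h, x:x + w] = 1.0                 # keep the object rectangle...
    mask = mask * (attention >= thresh).float()  # ...where attention also fires
    f2 = f1 + f1 * mask                          # emphasize the overlap (ABN-style sum)
    return f2.mean(dim=(2, 3))                   # pool H, W -> one vector per time
```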
  • the graph model learning unit 14 generates a learning model M based on the second feature amount F2 when the image input to the skillful behavior feature extraction unit 12 is the learning data L collected in advance.
  • the graph model learning unit 14 advances learning using, for example, a graph convolution learning method such as ST-GCN, and accumulates the learning results in the storage unit.
  • In the storage unit, videos serving as learning data, videos serving as inference data, time-series sensor data, and the like are accumulated.
  • Since the user 50 gives information indicating the interrelationships between objects in advance through the graph input unit 15, the objects to be recognized can be designated according to the problem to be solved.
  • By selecting image features using the region of interest extracted in the process of recognizing the problem to be solved, it is possible to acquire more detailed graph features, such as the region of interest for the skillful handling of objects. In addition, more accurate analysis becomes possible.
  • FIG. 3 is an explanatory diagram showing the operation during learning of the machine learning device 100 according to the first embodiment.
  • FIG. 4 is a diagram showing, in tabular form, information indicating an example of the operation of the machine learning device 100 during learning.
  • The user inputs, for example, the causality and correlations among the right hand, left hand, and head to the graph input unit 15.
  • For example, a bird's-eye view image of the work is acquired, and the user 50 manually gives, through the input, the relationships between items of information such as the "right hand", "left hand", and "head" shown in the image.
  • When the causality is expressed as a directed graph, the user gives the nodes "right hand", "left hand", and "head" and the edges of the graph, as shown in FIG. 3.
  • The graph input unit 15 may be provided with a machine learning model for detecting objects from an image, as shown in Non-Patent Document 3. Alternatively, by image processing that extracts skin color, the object on the right side of the image can be identified as the right hand and the object on the left side of the image as the left hand.
  • An example of the user input area extraction table 17 is shown in FIG. 4. Information may be registered in the user input area extraction table 17 as a means of extracting the sensor data corresponding to the input by the user 50.
  • the data set storage unit 60 stores learning data used during machine learning.
  • In the data set storage unit 60, sensor data such as videos, pressure sensor data, acceleration sensor data, and sounds are stored, together with the results of quality judgments and the like obtained for the data.
  • When the skillful action feature extraction unit 12 employs a paired comparison method such as Attention Pairwise Ranking or Pairwise Deep Ranking, the results of superiority comparisons between two items of sensor data may be retained.
  • FIG. 5 is a flowchart showing the operation of the machine learning device 100 during learning.
  • the object recognizing unit 18 extracts regions that become nodes of the correlation/causal graph input by the user from the image or sensor data (step S101).
  • When an object detection model is used, the recognition results of the right hand, left hand, and head, and the rectangle information surrounding the recognized objects, are extracted.
  • Alternatively, the same information as with the above object detection model can be obtained from regions within a predetermined color range and their positional relationships.
  • The skillful action feature extraction unit 12 is a model that extracts features from images, such as a CNN.
  • When handling time-series data such as acceleration sensor data or sound, the skillful action feature extraction unit 12 may use a model that handles time, such as an RNN (Recurrent Neural Network).
  • A 3D-CNN that includes convolution in the time direction may be used, or image features first convolved by a CNN may be input to a model that handles time series, such as an RNN, in combination. A model such as a TSN (Temporal Segment Network) may also be used.
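  • As one way to picture the combination just described (a per-frame CNN followed by a time-series model), here is a minimal PyTorch sketch; the channel sizes and the choice of a GRU are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class CnnRnnFeatureExtractor(nn.Module):
    """Per-frame CNN followed by a recurrent model over time (a sketch)."""
    def __init__(self, ch=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # (B*T, ch, 1, 1)
        )
        self.rnn = nn.GRU(ch, hidden, batch_first=True)

    def forward(self, video):                 # video: (B, T, 3, H, W)
        B, T, C, H, W = video.shape
        x = self.cnn(video.reshape(B * T, C, H, W)).reshape(B, T, -1)
        out, _ = self.rnn(x)                  # (B, T, hidden) time-series features
        return out
```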
  • Data is input to the skillful action feature extraction unit 12 to obtain a first feature quantity F1, which is an intermediate feature of time t, width W, height H, and number of channels Ch (step S102).
  • the region-of-interest generation unit 13 uses the intermediate features extracted by the skillful action feature extraction unit 12 to generate a region of interest for judging the skill level (step S103).
  • The region of interest is obtained as a heat map ranging from 0 to 1 over width W and height H, by global average pooling in the channel direction and the time t direction followed by an activation function or normalization by maximum and minimum values.
  • The heat map has, for example, a CAM (Class Activation Map) structure. Through error backpropagation for skillful action determination, a point of interest is acquired indicating which of the features extracted by the skillful action feature extraction unit 12 should be focused on to determine the skill level easily.
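  • A minimal sketch of this pooling and normalization, under the same assumed (T, Ch, H, W) shape (an illustration, not the patent's code):

```python
import torch

def attention_heatmap(f1):
    """f1: (T, Ch, H, W) intermediate feature -> (H, W) heat map in [0, 1]."""
    pooled = f1.mean(dim=(0, 1))              # global average over time and channels
    lo, hi = pooled.min(), pooled.max()
    return (pooled - lo) / (hi - lo + 1e-8)   # min-max normalization to [0, 1]
```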
  • The graph-object feature extraction unit 16 applies a mask based on the object extraction results to the intermediate feature amount (time t × width W × height H × number of channels Ch) output by the skillful behavior feature extraction unit 12, and extracts the feature values associated with each node ("right hand", "left hand", and "head") (step S104).
  • The intermediate features extracted by the object recognition unit 18 are not given to the nodes as features as they are. This is because the problem to be solved is acquiring the skill level, not extracting the right hand, left hand, and head; therefore, the features extracted by the skillful action feature extraction unit 12 are applied.
  • As in an Attention Branch Network, the feature may be obtained by taking the sum of a feature amount F1', produced by masking the first feature amount F1 (an intermediate feature amount), and the original first feature amount F1. Alternatively, the feature amount F1' may be used on its own, eliminating the summing portion of the above method. Also, the regions other than the region of interest (attention region) extracted by the region-of-interest generation unit 13 are masked, so that appropriate features are extracted for the nodes of the graph.
  • The graph model learning unit 14 uses the adjacency matrix of the causality and correlations given by the user to learn from the features extracted at each time t with a Graph Convolutional Neural Network (Graph-CNN) method; error backpropagation is repeated so that the proficiency levels of the data set are fitted (step S105). In the error backpropagation, object recognition is excluded, and the weight parameters are updated up to the feature extraction.
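  • The following is a minimal sketch of graph convolution with a user-given adjacency matrix: a single layer standing in for the ST-GCN, with invented shapes and labels. The real model and training loop are not specified at this level in the patent.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One graph convolution: X' = A_hat X W (a stand-in for ST-GCN)."""
    def __init__(self, adj, in_dim, n_classes):
        super().__init__()
        a = adj + torch.eye(adj.size(0))             # add self-loops
        self.a_hat = a / a.sum(dim=1, keepdim=True)  # row-normalize adjacency
        self.w = nn.Linear(in_dim, n_classes)

    def forward(self, x):                            # x: (B, nodes, in_dim)
        return self.w(self.a_hat @ x).mean(dim=1)    # (B, n_classes)

# User-given directed edges among "right hand", "left hand", "head"
adj = torch.tensor([[0., 1., 0.],
                    [0., 0., 1.],
                    [0., 0., 0.]])
model = SimpleGraphConv(adj, in_dim=64, n_classes=2)  # e.g. skilled/unskilled
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 64)            # per-node features for 8 samples
y = torch.randint(0, 2, (8,))        # proficiency labels from the data set
loss = loss_fn(model(x), y)
opt.zero_grad()
loss.backward()                      # backpropagation stops at feature
opt.step()                           # extraction; object recognition is excluded
```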
  • FIG. 6 is an explanatory diagram showing the operation during inference of the machine learning device (skilled behavior determination device) 100 according to the first embodiment.
  • FIG. 7 is a flowchart showing the operation of the machine learning device (skilled action determination device) 100 during inference.
  • the object recognition unit 18 extracts an object (node) input in advance by the user and its area (step S111).
  • the skillful action feature extraction unit 12 extracts features for judging the skill level (step S112).
  • the region-of-interest generation unit 13 generates a heat map of width W ⁇ height H (step S113).
  • the graph-object feature extracting unit 16 extracts the feature amount of the object input by the user in advance based on the object recognition result and the region-of-interest generation result (step S114).
  • the inference unit recognizes the skill level by graph convolution (step S115).
  • By selecting image features using the points of interest extracted in the process of recognizing the problem to be solved, rather than simply extracting the features of the things (objects) corresponding to the nodes and feeding them to machine learning, the region-of-interest generation unit 13 has the effect of making it possible to acquire more detailed characteristics, such as the "points of interest in handling things well".
  • FIG. 8 is a functional block diagram schematically showing the configuration of the machine learning device 200 according to the second embodiment.
  • In FIG. 8, configurations that are the same as or correspond to those shown in FIG. 2 are given the same reference numerals as in FIG. 2.
  • The machine learning device 200 differs from the machine learning device 100 according to the first embodiment in the operation of the object recognition unit 28.
  • The machine learning device 200 is a device capable of implementing the machine learning method according to the second embodiment.
  • The hardware configuration of the machine learning device 200 is the same as that shown in FIG. 1.
  • FIGS. 9(A) and 9(B) are explanatory diagrams showing the operation of the machine learning device 200.
  • When dealing with a working video of an expert, problems may occur in which objects specified by the user overlap, as shown in FIG. 9(B), or objects specified by the user disappear from the screen, as shown in FIG. 9(A).
  • the machine learning device 200 provides a means for estimating the occurrence of the above problem in the object recognition unit 28, and appropriately updates the graph features linked to the image according to these states.
  • FIG. 10 is a flowchart showing the operation of the machine learning device 200 during learning.
  • The operation of FIG. 10 differs from the operation of the machine learning device 100 according to the first embodiment in the operations of the object recognition unit 28 and the graph-object feature extraction unit 16.
  • The machine learning device 200 recognizes objects (step S201), extracts the first feature amount F1 (step S202), generates the region of interest A (step S203), extracts the graph-object feature (the second feature amount F2) (step S204), and generates a graph model as the learning model (step S205).
  • The object recognition unit 28 operates with a position filtering technique, such as the Kalman filter, in which Gaussian noise is assumed on position predictions based on current and past observations.
  • The flow estimation unit 28a holds information on the positions and/or velocities that have been filtered and estimated up to the previous time, and based on this, estimates the position where an object is predicted to exist.
  • The object existence probability estimation unit 28b calculates the existence probability so that the variance value at a position becomes small when that position is observed in object recognition, and the variance value gradually increases when the position is not observed. If the variance of the positions observed by the above filter is greater than a certain value, or if the position of a hand including its variance is estimated to have moved outside the screen, it is recognized that the right hand or left hand has ceased to be detected partway through.
  • In the Kalman filter, according to the variance values, it is estimated which of the position estimated by the flow estimation unit 28a and the position observed by the object recognition unit 28 should be given more weight when calculating the position.
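  • A minimal 1-D constant-velocity Kalman filter sketch of this idea (all noise magnitudes and the variance limit are assumptions): the predicted variance grows while the object goes unobserved, an observation shrinks it again, and a large variance is read as "the object is no longer detected".

```python
import numpy as np

class NodeTracker:
    """1-D constant-velocity Kalman filter per tracked object (a sketch)."""
    def __init__(self, pos, q=1.0, r=4.0):
        self.x = np.array([pos, 0.0])            # state: position, velocity
        self.p = np.eye(2) * 10.0                # state covariance
        self.f = np.array([[1.0, 1.0], [0.0, 1.0]])
        self.h = np.array([[1.0, 0.0]])
        self.q, self.r = q * np.eye(2), r        # process / observation noise

    def predict(self):
        self.x = self.f @ self.x
        self.p = self.f @ self.p @ self.f.T + self.q   # variance grows over time
        return self.x[0], self.p[0, 0]

    def update(self, z):
        if z is None:                            # not observed: variance keeps
            return                               # growing through predict()
        s = self.h @ self.p @ self.h.T + self.r
        k = self.p @ self.h.T / s                # gain: weighting of prediction
        self.x = self.x + (k * (z - self.h @ self.x)).ravel()
        self.p = (np.eye(2) - k @ self.h) @ self.p     # vs. observation

    def lost(self, var_limit=50.0):
        """Treat the object as undetected once the variance exceeds a limit."""
        return self.p[0, 0] > var_limit
```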
  • the overlap determination unit 28c recognizes that two objects overlap when the positions of the objects overlap during filtering and only one hand is found in object recognition.
  • When an object is no longer recognized, the graph-object feature extraction unit 16 assigns to the node the feature amount that was extracted before the object ceased to be recognized.
  • When, for example, the right-hand and left-hand objects overlap as a result of object recognition, the graph-object feature extraction unit 16 determines a weight based on the area ratio of the overlapping and non-overlapping portions of the Gaussian distributions, mixes, by weighted sum, the feature values of the right hand and left hand up to the previous time with the feature values of the overlapping portion, and assigns the result to the nodes.
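  • For illustration, the weighted mixing could be sketched as below; a rectangle-intersection ratio stands in here for the Gaussian overlap-area ratio, which in the device would come from the tracker's covariances.

```python
def overlap_ratio(box_a, box_b):
    """Intersection area / area of box_a (a stand-in for the Gaussian
    overlap-area ratio). Boxes are (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    return (ix * iy) / float(aw * ah)

def mix_node_feature(prev_feat, merged_feat, w):
    """Weighted sum of the node's feature up to the previous time and the
    feature pooled from the merged (overlapping) region."""
    return (1.0 - w) * prev_feat + w * merged_feat
```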
  • The object recognition unit 28 detects that objects have disappeared or overlap each other, and based on this, the feature amounts to be assigned to the nodes are determined appropriately. As a result, learning such as ST-GCN can be performed more stably even if an object is not detected at a certain time.
  • In other respects, the second embodiment is the same as the first embodiment.
  • FIG. 11 is a functional block diagram schematically showing the configuration of the machine learning device 300 according to the third embodiment.
  • In FIG. 11, configurations that are the same as or correspond to those shown in FIG. 2 are given the same reference numerals as in FIG. 2.
  • The machine learning device 300 differs from the machine learning device 100 according to Embodiment 1 in that it has a learning data generation unit 35 and in the configuration and operation of the skilled action determination model 31.
  • The machine learning device 300 is a device capable of implementing the machine learning method according to the third embodiment.
  • The hardware configuration of the machine learning device 300 is the same as that shown in FIG. 1.
  • the machine learning device 300 is a device that learns a learning model M for inferring the proficiency level of the action of the action subject in the image.
  • the machine learning device 300 acquires a graph G composed of a plurality of nodes corresponding to a plurality of parts of an action subject and information indicating relationships between the plurality of nodes based on an input operation by the user 50.
  • The machine learning device 300 also includes a learning data generation unit 35 that generates learning data linked to a plurality of object regions O, and an object recognition/skilled action determination model learning unit 33 that learns an action inference model M2 for inferring the actions, linked to the plurality of object regions, of the plurality of parts (for example, right hand, left hand, head) of the action subject present in the image.
  • Furthermore, the machine learning device 300 includes an object recognition/skilled action feature extraction unit 34 that recognizes the actions linked to the plurality of object regions inferred using the action inference model M2 and extracts a first feature amount F1 that is the feature amount of those actions, the graph-object feature extraction unit 16 that generates a second feature amount F2 in which the first feature amount F1 is emphasized, and the graph model learning unit 14 that generates the learning model M based on the second feature amount F2 when the image is the learning data.
  • FIG. 12 is an explanatory diagram showing the operation of the machine learning device 300 during learning.
  • A learning rate adjustment unit 32 is provided, and features are extracted with more weight on the CNN at the beginning and more weight on the ST-GCN in the latter half, making it easier to learn the interrelationships among the right hand, left hand, and head.
  • The learning data generation unit 35 registers, in the data set storage unit 60, the recognition results of the right hand, left hand, head, and the like obtained by the object recognition unit 28.
  • The object recognition/skilled action determination model learning unit 33 performs multitask learning with a model such as an ordinary CNN, and extracts from the learning data the skill level and the feature amounts covering the right hand, left hand, and head.
  • The graph-object feature extraction unit 16 associates the above feature amounts with the right hand, left hand, and head, and obtains the node feature amounts for the ST-GCN.
  • The learning rate adjustment unit 32 emphasizes extracting features for finding the left hand, right hand, and head in the first half of learning, and emphasizes the ST-GCN in the second half of learning, so that the focus gradually shifts to the interrelationships among the human body parts.
  • The object recognition/skilled behavior determination model learning unit 33 learns the action inference model M2, which is a model based on a deep learning algorithm that associates a label or category with every pixel in an image (for example, an algorithm capable of recognizing groups of pixels that form characteristic categories).
  • the object recognition/skilled action feature extraction unit 34 recognizes actions linked to a plurality of object regions inferred using the action inference model M2, and extracts a first feature amount F1, which is the feature amount of the action. Extract.
  • With the action inference model M2, it is possible to extract features related to proficiency through multitask learning. Semantic segmentation, for example, is known as such an algorithm. Therefore, it is possible to extract detailed areas related to the skill level without providing a mechanism such as the region-of-interest generation unit 13 of the first embodiment.
  • the graph-object feature extraction unit 16 can link nodes and features by using segmentation results and masks.
  • FIGS. 13(A) and 13(B) are explanatory diagrams showing the operation of the learning rate adjustment unit 32 of the machine learning device 300. It is assumed that the operation image of the learning rate adjustment unit 32 and the following loss function are given.
  • L_usr_cnn + L_skill_cnn is the loss related to the object recognition/skilled behavior determination model learning unit 33, and L_skill_gcn is the loss related to the graph model learning unit 14.
  • In Embodiments 1 and 2, learning is performed using a graph structure, which is manually embedded knowledge, but the graph structure does not include features for extracting objects such as the right hand, left hand, and head.
  • Therefore, the learning rate adjustment unit 32 causes the object recognition/skilled behavior determination model learning unit 33 to execute learning by multitask learning immediately after the start of learning (that is, in the first period of learning), and after a certain amount of time has elapsed, the value of α in the following loss function Loss is adjusted so that the object recognition rate does not fall below a certain level.
  • In this way, the ST-GCN incorporates features related to object extraction, and the learning is adjusted so as to calculate the skill level from the graph.
  • Loss = α(L_usr_cnn + L_skill_cnn) + (1 - α)L_skill_gcn
  • a network configuration example is shown below.
  • As described above, the learning rate adjustment unit 32 extracts features with more weight on the CNN at the beginning and more weight on the ST-GCN in the latter half, making it easier to learn the interrelationships among the right hand, left hand, and head.
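  • A minimal sketch of the schedule implied by the loss above (the linear decay curve and the floor value are assumptions): α starts near 1 so the CNN multitask terms dominate, then decays toward a floor chosen so that the object recognition rate does not collapse, shifting the weight to the ST-GCN loss.

```python
def alpha_schedule(epoch, total_epochs, floor=0.2):
    """CNN-heavy early, ST-GCN-heavy late; `floor` keeps enough weight on
    the recognition terms so the object recognition rate does not collapse."""
    decay = 1.0 - epoch / total_epochs          # linear decay (an assumption)
    return max(floor, decay)

def combined_loss(l_usr_cnn, l_skill_cnn, l_skill_gcn, alpha):
    """Loss = alpha * (L_usr_cnn + L_skill_cnn) + (1 - alpha) * L_skill_gcn."""
    return alpha * (l_usr_cnn + l_skill_cnn) + (1.0 - alpha) * l_skill_gcn
```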
  • FIG. 14 is a flowchart showing the operation of the machine learning device 300 during learning.
  • The machine learning device 300 recognizes objects (step S301), generates learning data (step S302), extracts the object recognition/skilled behavior features (step S303), adjusts the learning rate (step S304), extracts the graph-object feature (the second feature amount F2) (step S305), and generates a graph model as the learning model (step S306).
  • In this way, the ST-GCN can also hold features related to the objects. As a result, it becomes possible to learn to determine skilled actions based on the features related to hand and head extraction, which can be expected to make learning more stable.
  • In other respects, Embodiment 3 is the same as Embodiment 1 or 2.
  • FIG. 15 is a functional block diagram schematically showing the configuration of machine learning device 400 according to the fourth embodiment.
  • In FIG. 15, configurations that are the same as or correspond to those shown in FIG. 2 are given the same reference numerals as in FIG. 2.
  • Machine learning device 400 differs from machine learning device 100 according to Embodiment 1 in the configuration of skilled action determination model 41 and in having graph candidate generation unit 43 .
  • Machine learning device 400 is a device capable of implementing the machine learning method according to the fourth embodiment.
  • The hardware configuration of the machine learning device 400 is the same as that shown in FIG. 1.
  • However, the knowledge provided may become noise, contrary to the user's intention.
  • Therefore, in the machine learning device 400, the region of interest in the time direction of each object is extracted by the Attention Branch Network, and the graph candidate generation unit 43 generates graph candidates from the firing order of the heat map (that is, information indicating which region of interest is considered important when determining the skill level).
  • In the machine learning device 400 according to the fourth embodiment, when the user 50 inputs only node candidate information, the correlations/causalities between the nodes are automatically discovered.
  • the machine learning device 400 is a device that learns a learning model M for inferring the proficiency level of actions of the action subject in the image.
  • the machine learning device 400 acquires a graph G composed of a plurality of nodes corresponding to a plurality of parts of an action subject and information indicating relationships between the plurality of nodes based on an input operation by the user 50.
  • The machine learning device 400 also includes the skillful action feature extraction unit 12, which extracts a first feature amount F1 that is the feature amount of the actions of a plurality of parts (for example, right hand, left hand, head) of the action subject present in the image, and the region-of-interest generation unit 13, which generates a region of interest A overlapping one of the plurality of object regions O, based on the plurality of object regions O and the first feature amount F1, and outputs the region of interest as a heat map. It also has a region-of-interest storage unit 42 and a graph candidate generation unit 43 that, based on the heat map, generates information for presenting to the user the graph candidates to be input from the graph input unit 15.
  • Furthermore, the machine learning device 400 has the graph-object feature extraction unit 16, which generates a second feature amount F2 in which the first feature amount F1 is emphasized for the region of interest A, and the graph model learning unit 14, which generates the learning model M based on the second feature amount F2 when the image is the learning data L collected in advance.
  • FIG. 16 is a flow chart showing the operation of the machine learning device 400 during learning.
  • The machine learning device 400 recognizes objects (step S401), extracts the first feature amount F1 (step S402), generates the region of interest A (step S403), extracts the graph-object feature (the second feature amount F2) (step S404), and generates a graph model as the learning model (step S405).
  • FIG. 17 is an explanatory diagram showing the operation of the machine learning device 400.
  • In Embodiment 4, the user 50 defines only the likely relevant nodes, such as the right hand, left hand, and head. The methods for extracting these are registered in the user input area extraction table 17.
  • In addition to the heat map information indicating where in each object attention is focused to determine the skill level, the region-of-interest generation unit 13 calculates the degree of superimposition of the heat map on the regions where the objects are recognized, and the firing order of the nodes is generated as shown in FIG. 17.
  • The graph candidate generation unit 43 generates graph candidates from the firing order of the heat map (that is, information indicating which node is given importance in determining the skill level).
  • The above is an example of extracting the firing order of nodes such as the right hand, left hand, and head.
  • The graph candidate generation unit finds the node candidates based on time-series attention information divided into N segments and information indicating when each node was attended to.
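  • For illustration (the segment count N and the firing threshold are assumptions), the firing order could be read off the per-node attention time series as follows:

```python
import numpy as np

def firing_order(attention_by_node, n_segments=8, thresh=0.5):
    """attention_by_node: dict node -> 1-D attention time series in [0, 1].
    Returns node names ordered by when their attention first fires."""
    first_fire = {}
    for node, series in attention_by_node.items():
        segments = np.array_split(np.asarray(series), n_segments)
        means = np.array([s.mean() for s in segments])
        hits = np.flatnonzero(means >= thresh)
        first_fire[node] = hits[0] if hits.size else n_segments  # never fired
    return sorted(first_fire, key=first_fire.get)
```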
  • FIG. 18 is an explanatory diagram showing the operation of the machine learning device 400.
  • In addition to the automatically discovered relationships between the nodes, the graph candidate generation unit 43 may verify whether or not there is a causal relationship between the nodes in judging the skill level. For example, assuming that the time-series attention information shown in FIG. 18 has been obtained, the graph candidate generation unit 43 invalidates some of the attention information and investigates the effect. As described in FIG. 18 as causal relationship extraction, it verifies whether there is causality from the left hand to the right hand and from the left hand to the head.
  • When verifying the causality between the right hand and the left hand, the graph candidate generation unit 43 first invalidates the attention information of the head. It then shifts the time zone of the heat map of the left hand so that it coincides with the time of the right hand, and finds the change in loss (Δloss) at this time.
  • Similarly, when verifying the causality between the left hand and the head, the graph candidate generation unit 43 first invalidates the attention information of the right hand, then shifts the time zone of the heat map of the left hand so that it coincides with the time of the head, and finds the change in loss (Δloss) at this time.
  • Through the loss calculation described above, the graph candidate generation unit 43 verifies that the loss changes greatly when the relationship of a directed edge pointing from the left hand to the right hand is broken, and can present candidates indicating that the graph actually obtained may contain causality.
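  • The Δloss probe could be sketched as below; everything about the interface is an assumption (`model_loss` is a hypothetical hook that evaluates the trained model's loss for given per-node attention series).

```python
import numpy as np

def delta_loss(model_loss, attention, cause, disabled, shift):
    """attention: dict node -> 1-D np.array attention series.
    model_loss: callable(dict) -> float (hypothetical evaluation hook).
    Returns the loss change when `cause`'s attention is shifted by `shift`
    steps onto the effect node's time zone, with `disabled` zeroed out."""
    base = model_loss(attention)
    probe = {k: v.copy() for k, v in attention.items()}
    probe[disabled] = np.zeros_like(probe[disabled])  # invalidate one node
    probe[cause] = np.roll(probe[cause], shift)       # break the time order
    return model_loss(probe) - base                   # large change -> edge matters
```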
  • Since the graph candidate generation unit 43 presents information that enables the user 50 to discover the relationships between the nodes, it is possible to avoid giving a relationship definition that would become noise through the input of an inappropriate relationship between the nodes.
  • In other respects, Embodiment 4 is the same as any of Embodiments 1 to 3.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A machine learning device (100) has: a graph input unit (15) that acquires a graph (G) constituted by a plurality of nodes corresponding to a plurality of parts of an action subject, and information indicating a relationship among the plurality of nodes; an object recognition unit (18) that recognizes and outputs a plurality of object regions (O) within an image; a skilled action feature extraction unit (12) that extracts a first feature amount (F1) which is a feature amount of actions in the plurality of parts of the action subject present in the image; a focus region generation unit (13) that generates a focus region (A) on the basis of the first feature amount (F1); a graph-object feature extraction unit (16) that generates a second feature amount (F2) which emphasizes the first feature amount (F1) for regions where the focus region (A) and the object region (O) overlap; and a graph model training unit (14) that generates a training model (M) on the basis of the second feature amount (F2) when the image input into the skilled action feature extraction unit (12) is training data which has been collected ahead of time.

Description

Machine learning device, skilled behavior determination device, machine learning method, and machine learning program
 The present disclosure relates to a machine learning device, a machine learning method, and a machine learning program for learning a learning model for inferring the action proficiency level of an action subject in an image, and to a skilled action determination device for inferring the action proficiency level of an action subject in an image.
 Transfer learning is known in which a user corrects a region of interest generated for an image by a neural network (NN) in a learning model (that is, human knowledge is embedded in the learning model), and learning is performed using the corrected region of interest as correct data (see, for example, Non-Patent Document 1). Transfer learning is a Human-in-the-Loop (HITL) type of learning. By transfer learning, for example, a skilled action determination model is generated, which is a learning model that determines the skill level of a person's action in an image while interacting with the user.
 In addition, the Spatio-Temporal Graph Convolution Network (ST-GCN) is known as a learning model for detecting human movement from the skeleton (see, for example, Non-Patent Document 2). This method uses a graph in which a person's joint coordinates are nodes and the relationships between the joints are edges.
 In addition, Graph Region Based Convolutional Neural Networks (Graph R-CNN) is known as a method that uses Relationship Proposal Networks (RePN), a general object detection model, to extract objects and the image features linked to those objects, and then learns a scene graph having a graph structure that represents their relationships (see, for example, Non-Patent Document 3). Here, a scene graph is a graph in which the objects appearing in an image are nodes and the relationships established between the nodes are edges (for example, directed edges).
 However, in the method of Non-Patent Document 1, since the user only corrects the region of interest of the image, it is not possible to generate a learning model that can infer, with high prediction accuracy, the skill level of the behavior of a person as the action subject.
 In addition, in the method of Non-Patent Document 2, since the graph structure uses only skeletal information, it is considered difficult to generate a learning model that can infer the proficiency level of human behavior with high prediction accuracy.
 Furthermore, the method of Non-Patent Document 3 deals only with simple relationships between objects appearing in images (for example, positional relationships between a tree and a bird, a tree and its leaves, or a tree and its branches). Therefore, it is considered difficult to generate a learning model that can infer the proficiency level of a person's behavior with high prediction accuracy.
 An object of the present disclosure is to provide a machine learning device, a machine learning method, and a machine learning program for learning a learning model capable of inferring the action proficiency of an action subject with high prediction accuracy, and a skilled action determination device that uses the learning model to infer the action proficiency of an action subject in an image.
 A machine learning device of the present disclosure is a device that learns a learning model for inferring the action proficiency of an action subject in an image. The device includes: a graph input unit that acquires, based on a user's input operation, a graph composed of a plurality of nodes corresponding to a plurality of parts of the action subject and information indicating the relationships between the plurality of nodes; a storage unit that stores the graph acquired by the graph input unit; an object recognition unit that recognizes and outputs a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes are present; a skilled action feature extraction unit that extracts a first feature amount, which is a feature amount of the actions of the plurality of parts of the action subject present in the image; a region-of-interest generation unit that generates a region of interest in the image based on the first feature amount; a graph-object feature extraction unit that generates a second feature amount in which the first feature amount is emphasized for the region where the region of interest and the object regions overlap; and a graph model learning unit that generates the learning model based on the second feature amount when the image is learning data collected in advance.
 A machine learning method of the present disclosure is a method implemented by a machine learning device that learns a learning model for inferring the action proficiency of an action subject in an image. The method includes: a step of extracting a first feature amount, which is a feature amount of the actions of a plurality of parts of the action subject present in the image; a step of acquiring, based on a user's input operation, a graph composed of a plurality of nodes corresponding to the plurality of parts of the action subject and information indicating the relationships between the plurality of nodes, and storing the graph; a step of recognizing and outputting a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes exist; a step of generating a region of interest in the image based on the first feature amount; a step of generating a second feature amount in which the first feature amount is emphasized for the region where the region of interest and the object regions overlap; and a step of generating the learning model based on the second feature amount when the image is learning data collected in advance.
 By using the machine learning device, machine learning method, and machine learning program of the present disclosure, it is possible to generate a learning model that can infer the action proficiency of an action subject with high prediction accuracy.
 Also, by using the skilled action determination device of the present disclosure, it is possible to infer the action proficiency of an action subject with high prediction accuracy.
 FIG. 1 is a diagram illustrating an example of the hardware configuration of the machine learning device according to Embodiment 1.
 FIG. 2 is a functional block diagram schematically showing the configuration of the machine learning device according to Embodiment 1.
 FIG. 3 is an explanatory diagram showing the operation during learning of the machine learning device according to Embodiment 1.
 FIG. 4 is a diagram showing, in tabular form, an example of the operation during learning of the machine learning device according to Embodiment 1.
 FIG. 5 is a flowchart showing the operation during learning of the machine learning device according to Embodiment 1.
 FIG. 6 is an explanatory diagram showing the operation during inference of the machine learning device (skilled action determination device) according to Embodiment 1.
 FIG. 7 is a flowchart showing the operation during inference of the machine learning device (skilled action determination device) according to Embodiment 1.
 FIG. 8 is a functional block diagram schematically showing the configuration of the machine learning device according to Embodiment 2.
 FIGS. 9(A) and 9(B) are explanatory diagrams showing the operation of the machine learning device according to Embodiment 2.
 FIG. 10 is a flowchart showing the operation during learning of the machine learning device according to Embodiment 2.
 FIG. 11 is a functional block diagram schematically showing the configuration of the machine learning device according to Embodiment 3.
 FIG. 12 is an explanatory diagram showing the operation during learning of the machine learning device according to Embodiment 3.
 FIGS. 13(A) and 13(B) are explanatory diagrams showing the operation of the learning rate adjustment unit of the machine learning device according to Embodiment 3.
 FIG. 14 is a flowchart showing the operation during learning of the machine learning device according to Embodiment 3.
 FIG. 15 is a functional block diagram schematically showing the configuration of the machine learning device according to Embodiment 4.
 FIG. 16 is a flowchart showing the operation during learning of the machine learning device according to Embodiment 4.
 FIG. 17 is an explanatory diagram showing the operation of the machine learning device according to Embodiment 4.
 FIG. 18 is an explanatory diagram showing the operation of the machine learning device according to Embodiment 4.
 以下に、実施の形態に係る機械学習装置、熟練行動推論装置、機械学習方法、及び機械学習プログラムを、図面を参照しながら説明する。以下の実施の形態は、例にすぎず、実施の形態を適宜組み合わせること及び各実施の形態を適宜変更することが可能である。 A machine learning device, a skilled behavior inference device, a machine learning method, and a machine learning program according to embodiments will be described below with reference to the drawings. The following embodiments are merely examples, and the embodiments can be combined as appropriate and each embodiment can be modified as appropriate.
The machine learning device according to the embodiments is a device that learns a learning model for inferring the skill level of the actions of an action subject in an image. The machine learning device according to the embodiments is, for example, a computer as an information processing device. The action subject is a person who performs work (also called a worker, technician, expert, and so on), or a mechanism or device that operates in conjunction with a person's movements to perform work (for example, a robot arm or an endoscope).
 実施の形態に係る機械学習方法は、機械学習装置によって実施されることができる方法である。この機械学習方法は、画像内の動作主体の行動の熟練度を推論するための学習モデルを学習する方法である。 A machine learning method according to an embodiment is a method that can be implemented by a machine learning device. This machine learning method is a method of learning a learning model for inferring the action proficiency of the action subject in the image.
 実施の形態に係る機械学習プログラムは、機械学習装置としてのコンピュータによって実行されることができるプログラムである。この機械学習プログラムは、画像内の動作主体の行動の熟練度を推論するための学習モデルを学習するプログラムである。 A machine learning program according to the embodiment is a program that can be executed by a computer as a machine learning device. This machine learning program is a program for learning a learning model for inferring the action proficiency of the action subject in the image.
 実施の形態に係る熟練行動推論装置は、機械学習装置、機械学習方法、又は機械学習プログラムによって生成された学習モデルを用いて、動作主体の行動の熟練度を推論する装置である。熟練行動推論装置は、例えば、コンピュータである。熟練行動推論装置と機械学習装置とは、共通のコンピュータで構成されてもよい。また、熟練行動推論装置と機械学習装置とは、異なるコンピュータで構成されてもよい。 A skilled action inference device according to an embodiment is a device that infers the action proficiency of an action subject using a learning model generated by a machine learning device, a machine learning method, or a machine learning program. A skilled action reasoning device is, for example, a computer. The expert behavior inference device and the machine learning device may be configured by a common computer. Also, the expert behavior inference device and the machine learning device may be configured by different computers.
《1》実施の形態1
《1-1》構成
 図1は、実施の形態1に係る機械学習装置100のハードウェア構成の例を示す図である。実施の形態1に係る機械学習装置100は、機械学習を行うことで学習モデルMを生成する学習プロセスを実行する装置である。また、機械学習装置100は、熟練行動判定装置でもある。機械学習装置100は、CPU(Central Processing Unit)などのプロセッサ101と、揮発性の記憶装置であるメモリ102と、ハードディスクドライブ(HDD)又はソリッドステートドライブ(SSD)などの不揮発性記憶装置103と、インタフェース104とを有している。メモリ102は、例えば、RAM(Random Access Memory)などの半導体メモリである。機械学習装置100は、外部の装置との通信を行う通信装置を有してもよい。
<<1>> Embodiment 1
<<1-1>> Configuration FIG. 1 is a diagram showing an example of a hardware configuration of a machine learning device 100 according to the first embodiment. The machine learning device 100 according to Embodiment 1 is a device that executes a learning process of generating a learning model M by performing machine learning. Further, the machine learning device 100 is also a skilled action determination device. The machine learning device 100 includes a processor 101 such as a CPU (Central Processing Unit), a memory 102 that is a volatile storage device, a nonvolatile storage device 103 such as a hard disk drive (HDD) or solid state drive (SSD), and an interface 104 . The memory 102 is, for example, a semiconductor memory such as a RAM (Random Access Memory). The machine learning device 100 may have a communication device that communicates with an external device.
 機械学習装置100の各機能は、処理回路により実現される。処理回路は、専用のハードウェアである。処理回路は、メモリ102に格納されるプログラム(例えば、実施の形態に係る機械学習プログラム)を実行するプロセッサ101であってもよい。プロセッサ101は、処理装置、演算装置、マイクロプロセッサ、マイクロコンピュータ、又はDSP(Digital Signal Processor)であってもよい。 Each function of the machine learning device 100 is realized by a processing circuit. The processing circuitry is dedicated hardware. The processing circuit may be processor 101 that executes a program stored in memory 102 (for example, a machine learning program according to the embodiment). The processor 101 may be a processing device, an arithmetic device, a microprocessor, a microcomputer, or a DSP (Digital Signal Processor).
 処理回路が専用のハードウェアである場合、処理回路は、例えば、ASIC(Application Specific Integrated Circuit)又はFPGA(Field Programmable Gate Array)などである。 When the processing circuit is dedicated hardware, the processing circuit is, for example, ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array).
 処理回路がプロセッサ101である場合、機械学習方法は、ソフトウェア、ファームウェア、又はソフトウェアとファームウェアとの組み合わせにより実行される。ソフトウェア及びファームウェアは、プログラムとして記述され、メモリ102に格納される。プロセッサ101は、メモリ102に記憶されたプログラムを読み出して実行することにより、実施の形態1に係る機械学習方法を実施することができる。 When the processing circuit is the processor 101, the machine learning method is implemented by software, firmware, or a combination of software and firmware. Software and firmware are written as programs and stored in memory 102 . Processor 101 can implement the machine learning method according to the first embodiment by reading and executing the program stored in memory 102 .
 なお、機械学習装置100は、一部を専用のハードウェアで実現し、他の一部をソフトウェア又はファームウェアで実現するようにしてもよい。このように、処理回路は、ハードウェア、ソフトウェア、ファームウェア、又はこれらのうちのいずれかの組み合わせによって、上述の各機能を実現することができる。 It should be noted that the machine learning device 100 may be partially implemented by dedicated hardware and partially implemented by software or firmware. As such, the processing circuitry may implement each of the functions described above in hardware, software, firmware, or any combination thereof.
 インタフェース104は、他の装置と通信するために用いられる。インタフェース104には、外部の記憶装置、ディスプレイ105、及びユーザ操作部としての入力装置106、などが接続される。入力装置106は、例えば、マウス、キーボード、タッチパネル、などである。 The interface 104 is used to communicate with other devices. An external storage device, a display 105, an input device 106 as a user operation unit, and the like are connected to the interface 104 . The input device 106 is, for example, a mouse, keyboard, touch panel, or the like.
 図2は、実施の形態1に係る機械学習装置100の構成を概略的に示す機能ブロック図である。機械学習装置100は、画像内の動作主体の行動の熟練度を推論するための学習モデルMを学習する装置である。機械学習装置100は、熟練行動判定モデル11と、相関・因果グラフ入力部としてのグラフ入力部15と、グラフ-オブジェクト特徴抽出部16と、記憶部に記憶されたユーザ入力領域抽出テーブル17と、オブジェクト認識部18とを有している。熟練行動判定モデル11は、熟練行動特徴抽出部12と、着目領域生成部13と、グラフモデル学習部14とを有している。 FIG. 2 is a functional block diagram schematically showing the configuration of machine learning device 100 according to Embodiment 1. As shown in FIG. The machine learning device 100 is a device that learns a learning model M for inferring the proficiency level of actions of an action subject in an image. The machine learning device 100 includes a skilled action determination model 11, a graph input unit 15 as a correlation/causal graph input unit, a graph-object feature extraction unit 16, a user input area extraction table 17 stored in a storage unit, and an object recognition unit 18 . The skillful action determination model 11 has a skillful action feature extraction unit 12 , a region-of-interest generation unit 13 , and a graph model learning unit 14 .
At the time of learning, in the skilled action determination model 11, the learning model generation unit 11a generates a learning model M and stores it in the learning model storage unit 11b. At the time of inference, in the skilled action determination model 11, the inference unit 11c reads the learning model M from the learning model storage unit 11b, uses it to perform inference based on the input data, and outputs the inference result.
Based on an input operation by the user 50, the graph input unit 15 acquires a graph G composed of a plurality of nodes corresponding to a plurality of parts of the action subject in the image and information indicating the relationships between the nodes. That is, the user inputs, through the input device, the nodes and the correlational/causal relationships between the nodes into the graph input unit 15, so that the relationships between nodes are entered into the correlation/causal graph. Using the graph input unit 15, the user 50 specifies the regions in which knowledge is to be embedded (for example, right hand RH, left hand LH, head HE) and registers, in the user input region extraction table 17, the objects for extracting those regions. Furthermore, through the graph input unit 15, the user 50 registers in advance information about the interrelationships that he or she expects between the regions. The user input region extraction table 17 is a table of information acquired in advance or through input operations from the graph input unit 15.
 オブジェクト認識部18は、複数のノード(「グラフノード」ともいう。)に対応する複数のオブジェクトが存在する複数のオブジェクト領域Oを認識して出力する。オブジェクト認識部18は、ユーザ50が入力したノード(例えば、右手、左手、頭)に対応するオブジェクトと、オブジェクトを含む領域であるオブジェクト領域O(例えば、矩形領域)を認識する。つまり、オブジェクト認識部18は、ユーザ入力領域抽出テーブル17に登録されているオブジェクトの抽出方法を読み取り、その方式に従って動画、時系列のセンサデータから該当するオブジェクトの領域を抽出する。 The object recognition unit 18 recognizes and outputs multiple object areas O in which multiple objects corresponding to multiple nodes (also referred to as "graph nodes") exist. The object recognition unit 18 recognizes an object corresponding to a node (for example, right hand, left hand, head) input by the user 50 and an object area O (for example, rectangular area) that is an area containing the object. That is, the object recognition unit 18 reads the object extraction method registered in the user input area extraction table 17, and extracts the corresponding object area from the moving image and time-series sensor data according to the method.
 熟練行動特徴抽出部12は、画像内に存在する動作主体の複数の部分の行動の特徴量(すなわち、中間特徴量)である第1の特徴量F1を抽出する。動作主体の複数の部分は、例えば、作業者の右手RH、左手LH、及び頭HEである。熟練行動特徴抽出部12は、例えば、CNN(Convolutional Neural Network)などの特徴抽出機により、中間特徴を取得する。 The skillful action feature extraction unit 12 extracts a first feature amount F1 that is a feature amount (that is, an intermediate feature amount) of actions of a plurality of portions of the action subject present in the image. The multiple parts of the action subject are, for example, the operator's right hand RH, left hand LH, and head HE. The skillful action feature extraction unit 12 acquires intermediate features using a feature extractor such as a CNN (Convolutional Neural Network), for example.
 着目領域生成部13は、第1の特徴量F1に基づいて画像内における着目領域Aを生成する。着目領域生成部13は、Attention branch network(ABN)のようなネットワーク機構により、画像のどの領域に着目することで熟練度を求めることができるかを示すヒートマップ情報を生成する。また、着目領域生成部13は、ヒートマップ情報の生成途中の情報である可視化結果を学習結果部分に登録する。ABNは、例えば、非特許文献1に記載されている。 The region-of-interest generation unit 13 generates a region of interest A in the image based on the first feature amount F1. The region-of-interest generation unit 13 generates heat map information indicating which region of the image to focus on to obtain the skill level, using a network mechanism such as an attention branch network (ABN). In addition, the region-of-interest generation unit 13 registers the visualization result, which is information in the middle of generation of the heat map information, in the learning result portion. ABN is described in Non-Patent Document 1, for example.
The graph-object feature extraction unit 16 generates a second feature amount F2 in which the first feature amount F1 is emphasized for the region where the region of interest A and the object region O overlap. The graph-object feature extraction unit 16 associates sensor feature amounts, such as those of images, with the regions in which human knowledge is to be embedded. For the first feature amount F1 extracted by the skilled action determination model 11, the graph-object feature extraction unit 16 masks everything except the region where the object region O extracted by the object recognition unit 18 overlaps the region of interest A containing the points of interest generated by the region-of-interest generation unit 13, thereby associating the user-input graph G with the first feature amount F1, the skilled action feature. With this configuration, the relationships among the right hand RH, left hand LH, and head HE can be specified in advance, and the objects shown together with the hands can be analyzed, including "how things are handled", for example how tools such as screwdrivers and pens are handled. In other words, the graph-object feature extraction unit 16 extracts features for the nodes by performing mask processing using the heat map information indicating the object regions and the region of interest.
 グラフモデル学習部14は、熟練行動特徴抽出部12に入力される画像が予め収集された学習用データLであるときにおける第2の特徴量F2に基づいて、学習モデルMを生成する。グラフモデル学習部14は、例えば、ST-GCNのようなグラフを畳み込む学習方式を用い学習を進め、学習結果を記憶部に蓄積する。 The graph model learning unit 14 generates a learning model M based on the second feature amount F2 when the image input to the skillful behavior feature extraction unit 12 is the learning data L collected in advance. The graph model learning unit 14 advances learning using, for example, a graph convolution learning method such as ST-GCN, and accumulates the learning results in the storage unit.
 データセット記憶部60には、学習用データである動画及び推論用テータである動画、時系列のセンサデータ、などが蓄積されている。 In the data set storage unit 60, videos that are learning data, videos that are inference data, time-series sensor data, and the like are accumulated.
In the first embodiment, the user 50 gives information indicating the interrelationships between the objects in advance through the graph input unit 15, so the objects to be recognized can be designated according to the task to be solved. In addition, by selecting image features using the region of interest extracted in the process of recognizing that task, more detailed features, such as a "region of interest for handling an object well", can be acquired for the graph, and more accurate analysis becomes possible.
《1-2》学習時の動作
 図3は、実施の形態1に係る機械学習装置100の学習時の動作を示す説明図である。図4は、機械学習装置100の学習時の動作の例を示す情報を表形式で示す図である。
<<1-2>> Operation During Learning FIG. 3 is an explanatory diagram showing the operation during learning of the machine learning device 100 according to the first embodiment. FIG. 4 is a diagram showing, in tabular form, information indicating an example of the operation of the machine learning device 100 during learning.
The user inputs, for example, the causal and correlational relationships among the right hand, left hand, and head to the graph input unit 15. In the example of FIG. 3, to judge the skill level of "drawing", a bird's-eye view video of the work is acquired, and the user 50 manually provides the relationships among information such as the "right hand", "left hand", and "head" shown in it.
For example, when the causality is expressed as a directed graph, as shown in FIG. 3, the nodes "right hand", "left hand", and "head" and the edges of the graph are given such that the movement of the right hand is determined by the movement of the left hand, and the head and the right hand move in conjunction with each other.
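As a concrete illustration (a minimal Python sketch; the node ordering and the use of NumPy are assumptions for illustration, not part of the disclosure), the directed and undirected edges described above can be encoded as an adjacency matrix:

```python
import numpy as np

# Hypothetical node ordering: 0 = right hand, 1 = left hand, 2 = head.
NODES = ["right_hand", "left_hand", "head"]

A = np.zeros((3, 3))
A[1, 0] = 1.0              # directed edge: left hand -> right hand (cause -> effect)
A[0, 2] = A[2, 0] = 1.0    # undirected edge: head <-> right hand (mutual coupling)
np.fill_diagonal(A, 1.0)   # self-loops, commonly added for graph convolution
```

An adjacency matrix of this form corresponds to the causal/correlational matrix consumed by the graph model learning unit 14 in step S105 below.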
 また、グラフ入力部15のノードを抽出するための方法をユーザ入力領域抽出テーブル17に与える。グラフ入力部15は、非特許文献3に示されるような画像から物体を検出する機械学習モデルを与えてもよい。また、肌の色を抽出する画像処理によって、画像内の右側に映っているものを右手、左側に映っているものを左手と識別してもよい。 Also, a method for extracting the nodes of the graph input unit 15 is given to the user input region extraction table 17. The graph input unit 15 may provide a machine learning model for detecting an object from an image as shown in Non-Patent Document 3. Further, by image processing for extracting the color of the skin, it is possible to identify the object on the right side of the image as the right hand and the object on the left side of the image as the left hand.
 ユーザ入力領域抽出テーブル17の例は、例えば、図4に示される。ユーザ入力領域抽出テーブル17は、センサデータのうちの、ユーザ50の入力に該当するものを抽出する手段によって、情報が登録されてもよい。 An example of the user input area extraction table 17 is shown in FIG. 4, for example. Information may be registered in the user input area extraction table 17 by means of extracting sensor data corresponding to the input by the user 50 .
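One possible in-memory representation of the user input region extraction table 17 is sketched below; the field names and parameter values are hypothetical and are shown only to make the table's role concrete (the actual example is the one in FIG. 4):

```python
# Hypothetical layout: node name -> how to extract that node's region from a frame.
user_input_region_extraction_table = {
    "right_hand": {"method": "object_detection", "model": "hand_detector", "image_side": "right"},
    "left_hand":  {"method": "skin_color", "hue_range": (0, 30), "image_side": "left"},
    "head":       {"method": "object_detection", "model": "head_detector"},
}
```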
The data set storage unit 60 stores the learning data used during machine learning. For a model that judges skilled actions, sensor data such as video, pressure sensor, acceleration sensor, and sound data are stored, together with the skill level of the actions measured by that sensor data and the pass/fail judgment results for the quality obtained as a result of the actions.
 熟練行動特徴抽出部12がAttention Pairwise Ranking、Pairwise Deep Rankingのような一対比較手法を採用している場合、2つのセンサデータの優劣比較結果を保持してもよい。 If the expert behavior feature extraction unit 12 employs a paired comparison method such as Attention Pairwise Ranking or Pairwise Deep Ranking, the superiority comparison results of the two sensor data may be retained.
FIG. 5 is a flowchart showing the operation of the machine learning device 100 during learning. The object recognition unit 18 extracts, from the image or sensor data, the regions that become the nodes of the correlation/causal graph input by the user (step S101). When an object detection model is used, the recognition results for the right hand, left hand, and head and the rectangle information surrounding each recognized object are extracted. When the right hand, left hand, and head are recognized by image processing using skin color, hair color, and the like, information equivalent to that of the object detection model is extracted from the regions falling within predetermined color ranges and their positional relationships.
The skilled action feature extraction unit 12 is a model, such as a CNN, that extracts features from images. For time-series data such as acceleration sensor readings or sound, a model that handles time, such as an RNN (Recurrent Neural Network), may be used. For data that is both image-like and time-series, such as video, a 3D-CNN that includes convolution in the time direction may be used, or image features first convolved by a CNN may be fed into a model such as an RNN that handles the time series. Alternatively, a model such as a TSN (Temporal Segment Network), in which frames sampled at regular intervals (for example, at every 1/3 of the video) are each input to a CNN, may be used. Data is input to the skilled action feature extraction unit 12 to obtain the first feature amount F1, an intermediate feature of time t, width W, height H, and number of channels Ch (step S102).
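A minimal sketch of such a feature extractor, assuming PyTorch and illustrative layer sizes (neither is specified in the disclosure), could look as follows; the output tensor plays the role of the first feature amount F1:

```python
import torch
import torch.nn as nn

class SkillFeatureExtractor(nn.Module):
    """Toy 3D-CNN: video (B, 3, T, H, W) -> intermediate features (B, Ch, T, H/4, W/4)."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),        # pool space only, keep time t
            nn.Conv3d(32, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # Intermediate features of time t x width W x height H x channels Ch.
        return self.net(video)
```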
 着目領域生成部13は、熟練行動特徴抽出部12が抽出した中間特徴を用いて、熟練度を判定するための着目領域を生成する(ステップS103)。着目領域は、チャンネル方向、時間t方向へのGlobal Average Poolingと、活性化関数、もしくは最大、最小値の正規化により、幅W、高さHに関する0~1の範囲のヒートマップを持つ。ヒートマップは、例えば、CAM(Class Activation Map)構造を持つ。熟練行動判定に対する誤差逆伝播により、熟練行動特徴抽出部12が抽出した特徴のうちどこに着目すれば、熟練度が判定しやすいかの着目点を獲得する。 The region-of-interest generation unit 13 uses the intermediate features extracted by the skillful action feature extraction unit 12 to generate a region of interest for judging the skill level (step S103). The region of interest has a heat map ranging from 0 to 1 with respect to width W and height H by global average pooling in the channel direction and time t direction, activation function, or normalization of maximum and minimum values. The heat map has, for example, a CAM (Class Activation Map) structure. Error backpropagation for skillful action determination acquires a point of interest indicating which of the features extracted by the skillful action feature extraction unit 12 should be focused on to easily determine the skill level.
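A rough sketch of this heat map computation (global average pooling over the channel and time axes followed by min-max normalisation; an ABN-style implementation would instead learn the map through an attention branch trained by the backpropagated skill-judgement error) might be:

```python
import torch

def attention_heatmap(f1: torch.Tensor) -> torch.Tensor:
    """f1: intermediate features (B, Ch, T, H, W) -> heat map (B, H, W) in the range 0..1."""
    pooled = f1.mean(dim=(1, 2))                 # average over channels Ch and time t
    lo = pooled.amin(dim=(1, 2), keepdim=True)   # per-sample minimum
    hi = pooled.amax(dim=(1, 2), keepdim=True)   # per-sample maximum
    return (pooled - lo) / (hi - lo + 1e-8)      # min-max normalisation to 0..1
```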
The graph-object feature extraction unit 16 applies a mask based on the object extraction results to the intermediate features (time t × width W × height H × channels Ch) output by the skilled action feature extraction unit 12, and extracts the feature amounts associated with the nodes "right hand", "left hand", and "head" (step S104). In Embodiment 1, the intermediate features extracted by the object recognition unit 18 are not given to the nodes as-is. The task to be solved is, after all, to obtain the skill level, not to capture features for extracting the right hand, left hand, and head; therefore, the mask processing is applied to the features used by the model for judging skilled actions.
As the mask processing method, as in the Attention Branch Network, the feature amount F1', obtained by applying mask processing to the first (intermediate) feature amount F1, may be summed with the original first feature amount F1. Alternatively, the sum in the above method may be omitted and the feature amount F1' may be used by itself. In addition, regions other than the region of interest (attention region) extracted by the region-of-interest generation unit 13, described later, are masked so that appropriate features are extracted for the nodes of the graph.
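The two masking variants described above could be sketched as follows (assuming PyTorch tensors; the binary object-region mask and the choice between the ABN-style residual sum and the mask-only output reflect the two options named in the text):

```python
import torch

def masked_node_feature(f1: torch.Tensor, heatmap: torch.Tensor,
                        box_mask: torch.Tensor, abn_residual: bool = True) -> torch.Tensor:
    """f1: (B, Ch, T, H, W); heatmap: (B, H, W) region of interest A in 0..1;
    box_mask: (B, H, W) binary object region O. Keeps features only where A and O overlap."""
    m = (heatmap * box_mask)[:, None, None, :, :]  # broadcast mask over Ch and T
    f1_masked = f1 * m                             # F1' in the text
    return f1 + f1_masked if abn_residual else f1_masked  # ABN-style sum, or F1' alone
```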
The graph model learning unit 14 (ST-GCN) uses the adjacency matrix of the causal/correlational relationships between nodes given by the user and, following the learning procedure of a Graph Convolutional Neural Network (Graph-CNN), repeats error backpropagation on the features extracted at each time t so that the skill levels in the data set are predicted correctly (step S105). The error backpropagation does not extend to the object recognition; the weight parameters up to the feature extraction are updated.
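One spatial graph-convolution step of the kind repeated inside such a Graph-CNN/ST-GCN model could be sketched as below (degree-normalised propagation; the normalisation choice is illustrative, not prescribed by the disclosure):

```python
import torch

def graph_conv(node_feats: torch.Tensor, adj: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """node_feats: (B, N, C) per-node features at one time t; adj: (N, N) user-given
    adjacency matrix; weight: (C, C_out) learnable projection."""
    deg = adj.sum(dim=1).clamp(min=1.0)
    a_norm = adj / deg[:, None]                      # row-normalise by node degree
    return torch.relu(a_norm @ node_feats @ weight)  # aggregate neighbours, then project
```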
《1-3》推論時の動作
 図6は、実施の形態1に係る機械学習装置(熟練行動判定装置)100の推論時の動作を示す説明図である。図7は、機械学習装置(熟練行動判定装置)100の推論時の動作を示すフローチャートである。
<<1-3>> Operation During Inference FIG. 6 is an explanatory diagram showing the operation during inference of the machine learning device (skilled behavior determination device) 100 according to the first embodiment. FIG. 7 is a flowchart showing the operation of the machine learning device (skilled action determination device) 100 during inference.
 オブジェクト認識部18は、ユーザが予め入力したオブジェクト(ノード)と、その領域を抽出する(ステップS111)。熟練行動特徴抽出部12が、熟練度を判定するための特徴を抽出する(ステップS112)。着目領域生成部13が、幅W×高さHのヒートマップを生成する(ステップS113)。グラフ-オブジェクト特徴抽出部16が、ユーザが予め入力したオブジェクトに対する特徴量を、オブジェクト認識結果、及び、着目領域生成結果を基に抽出する(ステップS114)。推論部が、グラフ畳み込みにより熟練度を認識する(ステップS115)。 The object recognition unit 18 extracts an object (node) input in advance by the user and its area (step S111). The skillful action feature extraction unit 12 extracts features for judging the skill level (step S112). The region-of-interest generation unit 13 generates a heat map of width W×height H (step S113). The graph-object feature extracting unit 16 extracts the feature amount of the object input by the user in advance based on the object recognition result and the region-of-interest generation result (step S114). The inference unit recognizes the skill level by graph convolution (step S115).
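Steps S111 to S115 could be orchestrated roughly as in the following sketch, where each argument is a placeholder for the corresponding unit described above:

```python
def infer_skill(video, object_recognizer, feature_extractor, attention_generator,
                graph_feature_extractor, graph_model):
    boxes = object_recognizer(video)                          # S111: user-registered objects/regions
    f1 = feature_extractor(video)                             # S112: intermediate features
    heatmap = attention_generator(f1)                         # S113: W x H heat map
    node_feats = graph_feature_extractor(f1, heatmap, boxes)  # S114: per-node features
    return graph_model(node_feats)                            # S115: skill level via graph convolution
```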
《1-4》効果
 グラフ入力部15で、ユーザが予めオブジェクト同士の相互関係を与え、解きたい課題に合わせて、認識すべきオブジェクトの抽出方法を指定することができる。このような実施の形態により、解きたい課題に関連するユーザの知見を知識グラフという形で機械学習に取り込むことができるようになる。
<<1-4>> Effect With the graph input unit 15, the user can specify the mutual relationship between the objects in advance, and specify the extraction method of the object to be recognized according to the problem to be solved. According to such an embodiment, it becomes possible to incorporate the user's knowledge related to the problem to be solved into machine learning in the form of a knowledge graph.
By selecting image features using the points of interest extracted in the process of recognizing the task to be solved, the region-of-interest generation unit 13 has the effect that more detailed features, such as "points of interest in handling an object well", can be acquired, rather than simply extracting the features of the object corresponding to each node and feeding them to machine learning as they are.
《2》実施の形態2
《2-1》構成
 図8は、実施の形態2に係る機械学習装置200の構成を概略的に示す機能ブロック図である。図8において、図2に示される構成と同一又は対応する構成には、図2に示される符号と同じ符号が付されている。機械学習装置200は、オブジェクト認識部28の動作の点で、実施の形態1に係る機械学習装置100と相違する。機械学習装置200は、実施の形態2に係る機械学習方法を実施できる装置である。機械学習装置200のハードウェア構成は、図1のものと同様である。
<<2>> Embodiment 2
<<2-1>> Configuration FIG. 8 is a functional block diagram schematically showing the configuration of the machine learning device 200 according to the second embodiment. In FIG. 8, configurations that are the same as or correspond to those shown in FIG. 2 are given the same reference characters as in FIG. 2. The machine learning device 200 differs from the machine learning device 100 according to the first embodiment in the operation of the object recognition unit 28. The machine learning device 200 is a device capable of implementing the machine learning method according to the second embodiment. The hardware configuration of the machine learning device 200 is the same as that of FIG. 1.
FIGS. 9(A) and 9(B) are explanatory diagrams showing the operation of the machine learning device 200. When handling work videos of experts, problems can occur in which, as shown in FIG. 9(B), the objects specified by the user overlap each other, or, as shown in FIG. 9(A), an object specified by the user disappears from the screen. To address this, the machine learning device 200 provides, in the object recognition unit 28, means for estimating the occurrence of these problems, and appropriately updates the graph features linked to the image according to these states.
《2-2》動作
 図10は、機械学習装置200の学習時の動作を示すフローチャートである。図10の動作は、オブジェクト認識部28の動作とグラフ-オブジェクト特徴抽出部16の動作の点で、実施の形態1に係る機械学習装置100の動作と相違する。機械学習装置200は、オブジェクトの認識(ステップS201)、第1の特徴量F1の抽出(ステップS202)、着目領域Aの生成(ステップS203)、グラフオブジェクト特徴(第2の特徴量F2)の抽出(ステップS204)、学習モデルとしてのグラフモデルの生成(ステップS205)を行う。
<<2-2>> Operation FIG. 10 is a flowchart showing the operation of the machine learning device 200 during learning. The operation of FIG. 10 differs from the operation of the machine learning device 100 according to the first embodiment in the operations of the object recognition unit 28 and the operations of the graph-object feature extraction unit 16. FIG. The machine learning device 200 recognizes an object (step S201), extracts a first feature amount F1 (step S202), generates a region of interest A (step S203), extracts a graph object feature (second feature amount F2). (Step S204), generating a graph model as a learning model (step S205).
 オブジェクト認識部28は、観測値及び過去の観測に基づく位置の予想に対しガウス分布のノイズが加わったカルマンフィルタのような位置のフィルタリング手法で動作する。 The object recognition unit 28 operates with a position filtering technique such as the Kalman filter, in which Gaussian noise is added to position predictions based on observations and past observations.
 フロー推定部28aは、前回までにフィルタ推定した位置、又は速度、又は位置及び速度の情報を保持し、これに基づいてオブジェクトが存在すると予測される位置を推定する。 The flow estimating unit 28a holds information on positions, velocities, or positions and velocities that have been filtered and estimated by the previous time, and based on this, estimates the position where an object is predicted to exist.
The object existence probability estimation unit 28b calculates the existence probability such that, when a position is observed by object recognition, the variance observed at that position becomes small, and when no position is observed, the variance gradually increases. When the variance of the filtered position exceeds a certain level, or when the hand position, including its variance, is estimated to have moved off the screen, the right hand or left hand is treated as no longer detected from that point onward.
According to the magnitude of the variance, the Kalman filter determines how much weight to place on the position estimated by the flow estimation unit 28a versus the position observed by the object recognition unit 28 when calculating the position.
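A minimal one-dimensional sketch of this filtering behaviour (a constant-velocity Kalman-style filter per coordinate axis; the noise values are illustrative assumptions) shows how the variance grows while an object is unobserved and how the gain shifts weight between prediction and observation:

```python
class NodeTracker:
    """Tracks one coordinate of an object region's centre."""
    def __init__(self, pos: float, q: float = 1.0, r: float = 4.0):
        self.pos, self.vel, self.var = pos, 0.0, 1.0
        self.q, self.r = q, r                 # process / observation noise (illustrative)

    def step(self, observed_pos=None):
        self.pos += self.vel                  # flow estimation from past position and velocity
        self.var += self.q                    # variance grows while unobserved
        if observed_pos is None:              # missed detection: keep the prediction only
            return self.pos, self.var
        k = self.var / (self.var + self.r)    # gain: weight on observation vs. prediction
        innovation = observed_pos - self.pos
        self.pos += k * innovation
        self.vel += k * innovation            # crude velocity correction
        self.var *= (1.0 - k)
        return self.pos, self.var
```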
 重なり判定部28cは、フィルタ時の位置が重なりあい、オブジェクト認識も片方の手しか見つからない場合、2つのオブジェクトが重なりあったものとして認識する。 The overlap determination unit 28c recognizes that two objects overlap when the positions of the objects overlap during filtering and only one hand is found in object recognition.
 グラフ-オブジェクト特徴抽出部16は、前記オブジェクト認識の結果で、途中からオブジェクトが認識されなくなった場合、認識されなくなる前に抽出していた特徴量をノードに割り当てる。 If the object is not recognized partway through as a result of the object recognition, the graph-object feature extraction unit 16 assigns the feature amount that was extracted before the object was not recognized to the node.
When the object recognition results show that, for example, the right-hand and left-hand objects overlap, the graph-object feature extraction unit 16 determines weights from the area ratio of the overlapping and non-overlapping portions of the Gaussian distributions, blends the feature of the overlapping portion with each hand's previous feature amount by a weighted sum, and assigns the results to the nodes.
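The weighted blending described here could be sketched as follows, where w_right and w_left stand for the overlap-area ratios computed from the Gaussian distributions (how those ratios are computed is left abstract in this sketch):

```python
import numpy as np

def mix_overlapping_features(prev_right: np.ndarray, prev_left: np.ndarray,
                             overlap_feat: np.ndarray,
                             w_right: float, w_left: float):
    """Blend the shared region's feature into each hand's node by its overlap ratio (0..1)."""
    right = (1.0 - w_right) * prev_right + w_right * overlap_feat
    left = (1.0 - w_left) * prev_left + w_left * overlap_feat
    return right, left
```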
《2-3》効果
 実施の形態2によれば、オブジェクト認識部2で、オブジェクトが存在しないもしくは重なり合っていることを検知し、これに基づいて適切にノードに割り当てる特徴量を決定することで、ある時刻でオブジェクトが検出されない場合でもより安定的にST-GCNのような学習を実行することができる。
<<2-3>> Effect According to Embodiment 2, the object recognition unit 28 detects that an object is absent or that objects overlap, and the feature amounts assigned to the nodes are determined appropriately on this basis, so that learning such as ST-GCN can be executed more stably even when an object is not detected at some time.
 上記以外に関し、実施の形態2は、実施の形態1と同じである。 Except for the above, the second embodiment is the same as the first embodiment.
《3》実施の形態3
《3-1》構成
 図11は、実施の形態3に係る機械学習装置300の構成を概略的に示す機能ブロック図である。図11において、図2に示される構成と同一又は対応する構成には、図2に示される符号と同じ符号が付されている。機械学習装置300は、学習データ生成部35を有する点及び特徴行動判定モデル31の構成及び動作の点において、実施の形態1に係る機械学習装置100と相違する。機械学習装置300は、実施の形態3に係る機械学習方法を実施できる装置である。機械学習装置300のハードウェア構成は、図1のものと同様である。
<<3>> Embodiment 3
<<3-1>> Configuration FIG. 11 is a functional block diagram schematically showing the configuration of the machine learning device 300 according to the third embodiment. In FIG. 11, configurations that are the same as or correspond to those shown in FIG. 2 are given the same reference characters as in FIG. 2. The machine learning device 300 differs from the machine learning device 100 according to Embodiment 1 in that it has a learning data generation unit 35 and in the configuration and operation of the skilled action determination model 31. The machine learning device 300 is a device capable of implementing the machine learning method according to the third embodiment. The hardware configuration of the machine learning device 300 is the same as that of FIG. 1.
The machine learning device 300 is a device that learns a learning model M for inferring the skill level of the actions of an action subject in an image. The machine learning device 300 has a graph input unit 15 that, based on an input operation by the user 50, acquires a graph G composed of a plurality of nodes corresponding to a plurality of parts of the action subject and information indicating the relationships between the nodes, a user input region extraction table 17 that stores the graph G acquired by the graph input unit 15, and an object recognition unit 18 that recognizes and outputs a plurality of object regions O in the image in which the plurality of objects corresponding to the plurality of nodes exist. The machine learning device 300 also has a learning data generation unit 35 that generates learning data linked to the plurality of object regions O, and a skilled action determination model learning unit 33 that learns an action inference model M2 for inferring the actions, linked to the plurality of object regions, of the plurality of parts of the action subject in the image (for example, the right hand, left hand, and head). Furthermore, the machine learning device 300 has an object recognition and skilled action feature extraction unit 34 that recognizes the actions linked to the plurality of object regions inferred using the action inference model M2 and extracts a first feature amount F1, which is the feature amount of those actions, a graph-object feature extraction unit 16 that generates a second feature amount F2 emphasizing the first feature amount F1, and a graph model learning unit 14 that generates the learning model M based on the second feature amount F2 when the image is the learning data.
《3-2》動作
 図12は、機械学習装置300の学習時の動作を示す説明図である。学習率調整部32を設け、最初はCNNに比重を置いて特徴抽出して、後半に行くほどST-GCNに比重を置くことで、右手、左手、頭の相互関係を習得しやすくする。学習データ生成部35は、オブジェクト認識部28により、右手、左手、頭などの認識結果をデータセット記憶部60に登録する。これに基づいて、先ず、オブジェクト認識・熟練行動判定モデル学習部33は、通常のCNN等のモデルでマルチタスクラーニングを行い、学習用データから熟練度と右手、左手、頭も含めた特徴量とを抽出する。
<<3-2>> Operation FIG. 12 is an explanatory diagram showing the operation of the machine learning device 300 during learning. A learning rate adjustment unit 32 is provided; features are extracted with more weight on the CNN at first and with more weight on the ST-GCN toward the second half, which makes the interrelationships among the right hand, left hand, and head easier to learn. The learning data generation unit 35 registers the recognition results for the right hand, left hand, head, and so on obtained by the object recognition unit 28 in the data set storage unit 60. Based on these, the object recognition and skilled action determination model learning unit 33 first performs multitask learning with an ordinary model such as a CNN, and extracts, from the learning data, the skill level and the feature amounts including those of the right hand, left hand, and head.
 グラフ-オブジェクト特徴抽出部16は、上記特徴量と右手、左手、頭との間の紐づけを行い、ST-GCNのノード特徴量を求める。 The graph-object feature extraction unit 16 associates the above feature amount with the right hand, left hand, and head, and obtains the node feature amount of ST-GCN.
The learning rate adjustment unit 32 devotes the first half of learning to extracting feature amounts for finding the left hand, right hand, and head, and places more weight on the ST-GCN in the second half, so that the learning gradually focuses on the interrelationships among the human body parts.
The object recognition and skilled action determination model learning unit 33 learns the action inference model M2, which is a model such as a deep learning algorithm that associates a label or category with every pixel in an image (for example, an algorithm capable of recognizing groups of pixels that form characteristic categories).
The object recognition and skilled action feature extraction unit 34 recognizes the actions linked to the plurality of object regions inferred using the action inference model M2 and extracts the first feature amount F1, which is the feature amount of those actions. With the action inference model M2, features related to the skill level can also be extracted through multitask learning. Semantic segmentation, for example, is known as such an algorithm. Therefore, fine-grained region extraction related to the skill level becomes possible without providing anything like the region-of-interest generation unit 13 of Embodiment 1.
 また、セマンティックセグメンテーションを用いた場合は、グラフ-オブジェクト特徴抽出部16では、セグメンテーション結果とマスクを用いることで、ノードと特徴の紐づけを行うことができる。 Also, when semantic segmentation is used, the graph-object feature extraction unit 16 can link nodes and features by using segmentation results and masks.
FIGS. 13(A) and 13(B) are explanatory diagrams showing the operation of the learning rate adjustment unit 32 of the machine learning device 300. As an illustration of the operation of the learning rate adjustment unit 32, assume that the following loss function is given, where L_usr_cnn + L_skill_cnn is the loss for the object recognition and skilled action determination model learning unit 33 and L_skill_gcn is the loss for the graph model learning unit 14.
In Embodiments 1 and 2, learning proceeds using the graph structure, which is knowledge embedded by hand, but the graph structure does not include features for extracting objects such as the right hand, left hand, and head. For this reason, immediately after the start of learning (that is, in the first period of learning), the learning rate adjustment unit 32 executes learning by multitask learning for the object recognition and skilled action determination model learning unit 33; after a certain amount of time has elapsed and the right hand, left hand, and head can be extracted stably, it adjusts the value of α in the loss function Loss below so that the object recognition rate does not fall below a certain level. On this basis, the learning is adjusted so that the ST-GCN incorporates features related to object extraction and calculates the skill level from the graph.
Loss = β(α(L_usr_cnn + L_skill_cnn) + (1 − α)L_skill_gcn)
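A minimal sketch of this combined loss and of one possible schedule for α follows; the warm-up length, the floor value, and the linear decay are assumptions for illustration, since the disclosure only specifies that learning starts CNN-heavy and that α is kept from dropping below a level that hurts object recognition:

```python
def combined_loss(l_usr_cnn: float, l_skill_cnn: float, l_skill_gcn: float,
                  alpha: float, beta: float = 1.0) -> float:
    return beta * (alpha * (l_usr_cnn + l_skill_cnn) + (1.0 - alpha) * l_skill_gcn)

def schedule_alpha(epoch: int, warmup_epochs: int = 20, alpha_floor: float = 0.1) -> float:
    """CNN-heavy at first (alpha near 1), shifting weight to the ST-GCN term later,
    while keeping alpha above a floor so object recognition does not degrade."""
    return max(1.0 - epoch / warmup_epochs, alpha_floor)
```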
 ネットワーク構成例を以下に示す。学習率調整部32を設け、最初はCNNに比重を置いて特徴抽出して、後半に行くほどST-GCNに比重を置くことで、右手、左手、頭の相互関係を習得しやすくする。 A network configuration example is shown below. A learning rate adjustment unit 32 is provided, and features are extracted with more weight on the CNN at the beginning, and more weight on the ST-GCN in the latter half, making it easier to learn the interrelationship between the right hand, the left hand, and the head.
 図14は、機械学習装置300の学習時の動作を示すフローチャートである。機械学習装置300は、オブジェクトの認識(ステップS301)、学習用データの生成(ステップS302)、オブジェクト認識・熟練行動特徴を抽出し(ステップS303)、学習率を調整し(ステップS304)、グラフオブジェクト特徴(第2の特徴量F2)の抽出(ステップS305)、学習モデルとしてのグラフモデルの生成(ステップS306)を行う。 FIG. 14 is a flowchart showing the operation of the machine learning device 300 during learning. The machine learning device 300 recognizes an object (step S301), generates learning data (step S302), extracts object recognition/skilled behavior features (step S303), adjusts the learning rate (step S304), A feature (second feature amount F2) is extracted (step S305), and a graph model is generated as a learning model (step S306).
《3-3》効果
 以上に説明したように、実施の形態3によれば、オブジェクト認識・熟練行動判定モデル学習部33を設けることで、ST-GCNに対してオブジェクトに関する特徴も持たせることができ、その結果、手及び頭の抽出に関する特徴をベースに熟練行動を判定するような学習が可能となる。これにより、学習をより安定にすることが期待できる。
<<3-3>> Effect As described above, according to the third embodiment, by providing the object recognition/skilled action determination model learning unit 33, ST-GCN can also have features related to objects. As a result, it becomes possible to learn to determine a skilled action based on the features related to hand and head extraction. This can be expected to make learning more stable.
 上記以外に関し、実施の形態3は、実施の形態1又は2と同じである。 Except for the above, Embodiment 3 is the same as Embodiment 1 or 2.
《4》実施の形態4
《4-1》構成
 図15は、実施の形態4に係る機械学習装置400の構成を概略的に示す機能ブロック図である。図15において、図2に示される構成と同一又は対応する構成には、図2に示される符号と同じ符号が付されている。機械学習装置400は、熟練行動判定モデル41の構成の点及びグラフ候補生成部43を有する点において、実施の形態1に係る機械学習装置100と相違する。機械学習装置400は、実施の形態4に係る機械学習方法を実施できる装置である。機械学習装置400のハードウェア構成は、図1のものと同様である。
<<4>> Embodiment 4
<<4-1>> Configuration FIG. 15 is a functional block diagram schematically showing the configuration of machine learning device 400 according to the fourth embodiment. In FIG. 15, the same reference numerals as those shown in FIG. 2 are attached to the same or corresponding configurations as those shown in FIG. Machine learning device 400 differs from machine learning device 100 according to Embodiment 1 in the configuration of skilled action determination model 41 and in having graph candidate generation unit 43 . Machine learning device 400 is a device capable of implementing the machine learning method according to the fourth embodiment. The hardware configuration of machine learning device 400 is the same as that of FIG.
When the user 50 provides the machine learning device 400 with knowledge that predefines the relationships between objects, the provided knowledge may, contrary to the user's intention, become noise. For example, in the correlation-based graph generation described in Non-Patent Document 3, feature amounts cannot be exchanged between nodes whose features are dissimilar. In Embodiment 4, the region of interest of each object in the time direction is extracted by an Attention Branch Network, and the graph candidate generation unit 43 generates graph candidates from the firing order of the heat maps (that is, information indicating which regions of interest were treated as important in judging the skill level). In the machine learning device 400 according to Embodiment 4, the user 50 only has to input node candidate information, and the correlations and causal relationships between the nodes are discovered automatically.
The machine learning device 400 is a device that learns a learning model M for inferring the skill level of the actions of an action subject in an image. The machine learning device 400 has a graph input unit 15 that, based on an input operation by the user 50, acquires a graph G composed of a plurality of nodes corresponding to a plurality of parts of the action subject and information indicating the relationships between the nodes, a user input region extraction table 17 that stores the graph G acquired by the graph input unit 15, and an object recognition unit 18 that recognizes and outputs a plurality of object regions O in the image in which the plurality of objects corresponding to the plurality of nodes exist. The machine learning device 400 also has a skilled action feature extraction unit 12 that extracts a first feature amount F1, which is the feature amount of the actions of the plurality of parts of the action subject in the image (for example, the right hand, left hand, and head), a region-of-interest generation unit 13 that, based on the plurality of object regions O and the first feature amount F1, generates a region of interest A overlapping one of the plurality of object regions O and outputs the region of interest as a heat map, a region-of-interest storage unit 42, and a graph candidate generation unit 43 that, based on the heat map, generates information for presenting to the user the candidates for the graph to be input from the graph input unit 15. Furthermore, the machine learning device 400 has a graph-object feature extraction unit 16 that generates a second feature amount F2 emphasizing the first feature amount F1 for the region of interest A, and a graph model learning unit 14 that generates the learning model M based on the second feature amount F2 when the image is the learning data L collected in advance.
《4-2》動作
 図16は、機械学習装置400の学習時の動作を示すフローチャートである。機械学習装置400は、オブジェクトの認識(ステップS401)、第1の特徴量F1の抽出(ステップS402)、着目領域Aの生成(ステップS403)、グラフオブジェクト特徴(第2の特徴量F2)の抽出(ステップS404)、学習モデルとしてのグラフモデルの生成(ステップS405)を行う。
<<4-2>> Operation FIG. 16 is a flow chart showing the operation of the machine learning device 400 during learning. The machine learning device 400 recognizes an object (step S401), extracts a first feature amount F1 (step S402), generates a region of interest A (step S403), extracts a graph object feature (second feature amount F2). (Step S404), a graph model is generated as a learning model (step S405).
 図17は、機械学習装置400の動作を示す説明図である。グラフ入力部15で、ユーザ50は、右手、左手、頭のような関わりのありそうなノードのみを定義する。また、これらの抽出方法は、ユーザ入力領域抽出テーブル17に登録される。 FIG. 17 is an explanatory diagram showing the operation of the machine learning device 400. FIG. In the graph input section 15, the user 50 defines only the likely relevant nodes such as right hand, left hand and head. Also, these extraction methods are registered in the user input area extraction table 17 .
 グラフモデル学習部14では、ノード同士は全てエッジによって結合されているものとして、学習を行う。 In the graph model learning unit 14, learning is performed assuming that all nodes are connected by edges.
In addition to the heat map information indicating where in each object attention was placed to judge the skill level, the region-of-interest generation unit 13 calculates the degree to which the heat map overlaps the recognized object's region of interest, and generates, as shown in FIG. 17, the firing order of the nodes whose degree of overlap is at or above a certain level. The graph candidate generation unit 43 generates graph candidates from the firing order of the heat maps (that is, information indicating which regions were treated as important in judging the skill level).
The above is one example of extracting the firing order of nodes such as the right hand, left hand, and head; for example, as in Non-Patent Document 2, an attention generation unit for the nodes themselves may be provided and used for the analysis.
Based on the time-series attention information divided into N segments as described below, that is, on information indicating at which times each node received attention, the graph candidate generation unit finds candidate relationships between the nodes.
Referring to FIG. 17, when the left hand receives attention and then the right hand receives attention, a directed edge is assumed from the left hand to the right hand; when the right hand and the head receive attention at the same time, an undirected edge is assumed between the right hand and the head. When only the left hand fires, a self-loop is assumed on the left hand. The time series of such a graph, divided into a fixed number of segments, is searched with a fixed window width, sliding the window little by little to discover graph candidates. From the discovered graph candidates, a plurality of frequent patterns are extracted as candidates by graph pattern matching.
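The rules above (sequential firing gives a directed edge, simultaneous firing an undirected edge, a lone firing a self-loop) could be sketched for one window as follows; the threshold and data layout are assumptions:

```python
import numpy as np

def edge_candidates(firing: np.ndarray, names=("right_hand", "left_hand", "head"),
                    thresh: float = 0.5) -> set:
    """firing: (N_nodes, N_slots) attention strength per node per time slot in one window."""
    active = firing >= thresh
    edges = set()
    for t in range(active.shape[1]):
        now = np.flatnonzero(active[:, t])
        for i in now:                                   # simultaneous firing -> undirected edge
            for j in now:
                if i < j:
                    edges.add((names[i], "<->", names[j]))
        if len(now) == 1:                               # lone firing -> self-loop
            edges.add((names[now[0]], "self", names[now[0]]))
        if t + 1 < active.shape[1]:                     # sequential firing -> directed edge
            nxt = np.flatnonzero(active[:, t + 1])
            for i in now:
                for j in nxt:
                    if i != j:
                        edges.add((names[i], "->", names[j]))
    return edges
```

Sliding such a window over the divided time series and applying graph pattern matching to the per-window edge sets then yields the frequent candidates described above.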
 図18は、機械学習装置400の動作を示す説明図である。グラフ候補生成部43は、自動発見したノード間の関係性に加え、熟練度を判定する上で本当にノード間の因果あるかどうかを検証できるようにしてもよい。グラフ候補生成部43は、例えば、図18の時系列の着目情報が得られているものとし、そのうち一部の着目情報を無効化することで、その影響を調査する。グラフ候補生成部43は、図18に因果関係の抽出として記載しているように、左手から右手、左手から頭の因果があるかどうかを検証する。 FIG. 18 is an explanatory diagram showing the operation of the machine learning device 400. FIG. The graph candidate generating unit 43 may verify whether or not there is a causal relationship between the nodes in judging the skill level, in addition to the automatically discovered relationships between the nodes. For example, the graph candidate generating unit 43 assumes that the time-series focused information shown in FIG. 18 has been obtained, and invalidates some of the focused information to investigate the effect. The graph candidate generation unit 43 verifies whether there is causality from the left hand to the right hand and from the left hand to the head, as described in FIG. 18 as causal relationship extraction.
When verifying the causality between the right hand and the left hand, the graph candidate generation unit 43 first invalidates the attention information of the head, then shifts the time span of the left hand's heat map so that it coincides with that of the right hand, and obtains the resulting change in loss (Δloss).
On the other hand, when verifying the causality between the left hand and the head, the graph candidate generation unit 43 first invalidates the attention information of the right hand, then shifts the time span of the left hand's heat map so that it coincides with that of the right hand, and obtains the resulting change in loss (Δloss).
Through the loss calculations described above, the graph candidate generation unit 43 verifies that the loss changes greatly when the relationship of the directed edge pointing from the left hand to the right hand is broken, and can present the actually obtained graph as a candidate that may contain a causal relationship.
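The Δloss probe for one candidate edge could be sketched as follows; loss_fn is a placeholder callable (an assumption of this sketch) that recomputes the skill-judgement loss for a given assignment of per-node attention series:

```python
import numpy as np

def delta_loss_for_edge(loss_fn, attention: dict, src: str, dst: str, mute: str) -> float:
    """attention: node name -> 1-D per-time attention array. Mutes a third node,
    aligns src's attention peak with dst's, and measures the change in loss."""
    base = loss_fn(attention)
    probed = {k: v.copy() for k, v in attention.items()}
    probed[mute][:] = 0.0                                  # invalidate the third node
    shift = int(np.argmax(probed[dst]) - np.argmax(probed[src]))
    probed[src] = np.roll(probed[src], shift)              # align src's heat map with dst's
    return loss_fn(probed) - base                          # a large change hints at causality
```

In the scenario above, a large Δloss when the left hand's attention is shifted onto the right hand's time span, with the head muted, supports presenting the left hand → right hand edge as a causal candidate.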
《4-3》効果
 以上に説明したように、実施の形態4によれば、グラフ候補生成部43がユーザ50のノードに対し、関係性を発見できるようにする情報を提示することで、ノード間の不適切な関係性の入力によって、ノイズとなるような関係性の定義を与えることを回避できる。
<<4-3>> Effect As described above, according to the fourth embodiment, the graph candidate generation unit 43 presents the user 50 with information that makes it possible to discover relationships between the nodes, which avoids giving relationship definitions that would become noise through the input of inappropriate relationships between nodes.
 上記以外に関し、実施の形態4は、実施の形態1から3のいずれかと同じである。 Except for the above, Embodiment 4 is the same as any of Embodiments 1 to 3.
 11、21、31、41 熟練行動判定モデル、 11a 学習モデル生成部、 11b 学習モデル記憶部、 11c 推論部、 12 熟練行動特徴抽出部、 13 着目領域生成部、 14 グラフモデル学習部、 15 グラフ入力部、 16 グラフ-オブジェクト特徴抽出部、 17 ユーザ入力領域抽出テーブル(記憶部)、 18 オブジェクト認識部、 28a フロー推定部、 28b オブジェクト存在確率確定部、 28c 重なり判定部、 32 学習率調整部、 33 オブジェクト認識・熟練行動判定モデル学習部、 34 オブジェクト認識・熟練行動特徴抽出部、 35 学習データ生成部、 50 ユーザ、 60 データセット記憶部、 100、200、300、400 機械学習装置、 A 着目領域、 F1 第1の特徴量(中間特徴)、 F2 第2の特徴量、 G グラフ、 L 学習用データ、 M 学習モデル、 M2 行動推論モデル、 O オブジェクト領域、 RH 右手、 LH 左手、 HE 頭。 11, 21, 31, 41 Skilled behavior determination model, 11a Learning model generation unit, 11b Learning model storage unit, 11c Inference unit, 12 Skilled behavior feature extraction unit, 13 Region of interest generation unit, 14 Graph model learning unit, 15 Graph input 16 Graph-object feature extraction unit 17 User input region extraction table (storage unit) 18 Object recognition unit 28a Flow estimation unit 28b Object existence probability determination unit 28c Overlap determination unit 32 Learning rate adjustment unit 33 Object recognition/skilled action determination model learning unit, 34 Object recognition/skilled action feature extraction unit, 35 Learning data generation unit, 50 User, 60 Data set storage unit, 100, 200, 300, 400 Machine learning device, A Region of interest, F1 First feature quantity (intermediate feature), F2 Second feature quantity, G graph, L learning data, M learning model, M2 action inference model, O object area, RH right hand, LH left hand, HE head.

Claims (11)

  1.  A machine learning device that learns a learning model for inferring the proficiency level of an action of an action subject in an image, the machine learning device comprising:
     a graph input unit that acquires, based on a user's input operation, a graph composed of a plurality of nodes corresponding to a plurality of parts of the action subject and information indicating relationships between the plurality of nodes;
     a storage unit that stores the graph acquired by the graph input unit;
     an object recognition unit that recognizes and outputs a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes exist;
     a skilled action feature extraction unit that extracts a first feature quantity, which is a feature quantity of the actions of the plurality of parts of the action subject present in the image;
     a region-of-interest generation unit that generates a region of interest in the image based on the first feature quantity;
     a graph-object feature extraction unit that generates a second feature quantity in which the first feature quantity is emphasized for a region where the region of interest and the object regions overlap; and
     a graph model learning unit that generates the learning model based on the second feature quantity when the image is learning data collected in advance.
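As a hedged aside (our sketch, not the claimed implementation), the emphasis of the first feature quantity F1 in the overlap between the region of interest A and an object region O, yielding the second feature quantity F2, could in principle look like the following; the mask encoding and gain factor are assumptions.

```python
import numpy as np

# Minimal illustrative sketch: boost feature values where the attention map
# (region of interest) and an object mask overlap, producing F2 from F1.

def emphasize_overlap(f1, attention_map, object_mask, gain=2.0):
    """f1: (H, W, C) feature map; attention_map, object_mask: (H, W) in [0, 1]."""
    overlap = (attention_map > 0.5) & (object_mask > 0.5)  # region of interest and object region overlap
    f2 = f1.copy()
    f2[overlap] *= gain                                    # emphasize F1 in the overlap -> F2
    return f2
```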
  2.  The machine learning device according to claim 1, wherein
     the object recognition unit holds past information on the positions and velocities of the plurality of objects corresponding to the plurality of nodes, predicts the positions of the plurality of object regions based on the past information, and determines the overlap of the plurality of object regions; and
     the graph-object feature extraction unit changes the first feature quantity of overlapping object regions among the plurality of object regions based on the first feature quantity of the overlapping object regions.
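To illustrate the overlap determination of claim 2 (an assumption-laden sketch, not the patented method), the stored position and velocity information could drive a simple constant-velocity prediction followed by a pairwise box-intersection test; the constant-velocity model is our simplification.

```python
# Sketch: predict each object box forward from past position/velocity,
# then test whether two predicted axis-aligned boxes overlap.

def predict_box(box, velocity, dt=1.0):
    """box: (x1, y1, x2, y2); velocity: (vx, vy) estimated from past frames."""
    vx, vy = velocity
    x1, y1, x2, y2 = box
    return (x1 + vx * dt, y1 + vy * dt, x2 + vx * dt, y2 + vy * dt)

def boxes_overlap(a, b):
    """True if two axis-aligned boxes (x1, y1, x2, y2) intersect."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])
```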
  3.  A machine learning device that learns a learning model for inferring the proficiency level of an action of an action subject in an image, the machine learning device comprising:
     a graph input unit that acquires, based on a user's input operation, a graph composed of a plurality of nodes corresponding to a plurality of parts of the action subject and information indicating relationships between the plurality of nodes;
     a storage unit that stores the graph acquired by the graph input unit;
     an object recognition unit that recognizes and outputs a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes exist;
     a learning data generation unit that generates learning data linked to the plurality of object regions;
     an action determination model learning unit that learns an action inference model for inferring actions linked to the plurality of object regions, the actions being the actions of the plurality of parts of the action subject present in the image;
     an object recognition / skilled action feature extraction unit that recognizes the actions linked to the plurality of object regions inferred using the action inference model and extracts a first feature quantity, which is a feature quantity of the actions;
     a graph-object feature extraction unit that generates a second feature quantity in which the first feature quantity is emphasized; and
     a graph model learning unit that generates the learning model based on the second feature quantity when the image is the learning data.
  4.  A machine learning device that learns a learning model for inferring the proficiency level of an action of an action subject in an image, the machine learning device comprising:
     a graph input unit that acquires, based on a user's input operation, a graph composed of a plurality of nodes corresponding to a plurality of parts of the action subject and information indicating relationships between the plurality of nodes;
     a storage unit that stores the graph acquired by the graph input unit;
     an object recognition unit that recognizes and outputs a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes exist;
     a skilled action feature extraction unit that extracts a first feature quantity, which is a feature quantity of the actions of the plurality of parts of the action subject present in the image;
     a region-of-interest generation unit that generates, based on the plurality of object regions and the first feature quantity, a region of interest that overlaps one of the plurality of object regions, and outputs the region of interest as a heat map;
     a graph candidate generation unit that generates, based on the heat map, information for presenting to the user candidates for the graph to be input from the graph input unit;
     a graph-object feature extraction unit that generates a second feature quantity in which the first feature quantity is emphasized for the region of interest; and
     a graph model learning unit that generates the learning model based on the second feature quantity when the image is learning data collected in advance.
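One conceivable way (purely illustrative; the function names, scoring rule, and `top_k` parameter are our assumptions) for a graph candidate generation unit to turn a heat map into graph candidates is to rank object regions by the attention mass they receive and propose directed edges among the top-ranked regions.

```python
import numpy as np

# Illustrative sketch: score each object region by summed heat-map attention,
# then propose directed edges from the most attended region to the runners-up.

def propose_edge_candidates(heatmap, object_masks, top_k=2):
    """heatmap: (H, W) array; object_masks: dict name -> (H, W) boolean mask."""
    scores = {name: float(heatmap[mask].sum()) for name, mask in object_masks.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k + 1]
    # Directed edges from the most attended node to each of the others.
    return [(ranked[0], other) for other in ranked[1:]]
```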
  5.  The machine learning device according to any one of claims 1 to 4, wherein the action subject is a person, and the plurality of parts include a plurality of body parts of the person.
  6.  The machine learning device according to any one of claims 1 to 4, wherein the action subject is a mechanism that moves in conjunction with the movement of a part of a human body, and the plurality of parts are a plurality of parts of the mechanism.
  7.  The machine learning device according to any one of claims 1 to 6, wherein the information indicating the relationships between the plurality of nodes is a directed edge.
  8.  The machine learning device according to any one of claims 1 to 6, wherein the information indicating the relationships between the plurality of nodes includes one or more of:
     information indicating the position of each of the plurality of parts;
     information indicating the direction and speed of movement of each of the plurality of parts; and
     information indicating the order of movement of the plurality of parts.
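For illustration only, the three kinds of relationship information listed in claim 8 could be carried as edge attributes; the following dataclass is a hypothetical encoding on our part, not the claimed data format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical edge-attribute record mirroring the three items of claim 8:
# part position, movement direction and speed, and movement order.

@dataclass
class NodeRelation:
    source: str                                       # e.g. "left_hand"
    target: str                                       # e.g. "right_hand"
    position: Optional[Tuple[float, float]] = None    # position of the part
    direction: Optional[Tuple[float, float]] = None   # movement direction vector
    speed: Optional[float] = None                     # movement speed
    order: Optional[int] = None                       # order of movement

edge = NodeRelation("left_hand", "right_hand", direction=(1.0, 0.0), speed=0.3, order=1)
```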
  9.  A skilled action determination device comprising:
     the machine learning device according to any one of claims 1 to 8; and
     the learning model, which infers the proficiency level of the action of the action subject based on the second feature quantity when the image input to the skilled action feature extraction unit is an image to be inferred.
  10.  A machine learning method implemented by a machine learning device that learns a learning model for inferring the proficiency level of an action of an action subject in an image, the method comprising the steps of:
     extracting a first feature quantity, which is a feature quantity of the actions of a plurality of parts of the action subject present in the image;
     acquiring, based on a user's input operation, a graph composed of a plurality of nodes corresponding to the plurality of parts of the action subject and information indicating relationships between the plurality of nodes, and storing the graph;
     recognizing and outputting a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes exist;
     generating a region of interest in the image based on the first feature quantity;
     generating a second feature quantity in which the first feature quantity is emphasized for a region where the region of interest and the object regions overlap; and
     generating the learning model based on the second feature quantity when the image is learning data collected in advance.
  11.  A machine learning program that causes a computer that learns a learning model for inferring the proficiency level of an action of an action subject in an image to execute the steps of:
     extracting a first feature quantity, which is a feature quantity of the actions of a plurality of parts of the action subject present in the image;
     acquiring, based on a user's input operation, a graph composed of a plurality of nodes corresponding to the plurality of parts of the action subject and information indicating relationships between the plurality of nodes, and storing the graph;
     recognizing and outputting a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes exist;
     generating a region of interest in the image based on the first feature quantity;
     generating a second feature quantity in which the first feature quantity is emphasized for a region where the region of interest and the object regions overlap; and
     generating the learning model based on the second feature quantity when the image is learning data collected in advance.
PCT/JP2022/004364 2022-02-04 2022-02-04 Machine learning device, skilled action determination device, machine learning method, and machine learning program WO2023148909A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2023532819A JP7387069B1 (en) 2022-02-04 2022-02-04 Machine learning device, skilled behavior determination device, machine learning method, and machine learning program
PCT/JP2022/004364 WO2023148909A1 (en) 2022-02-04 2022-02-04 Machine learning device, skilled action determination device, machine learning method, and machine learning program
TW111127906A TW202333089A (en) 2022-02-04 2022-07-26 Machine learning device, skilled action determination device, machine learning method, and machine learning program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/004364 WO2023148909A1 (en) 2022-02-04 2022-02-04 Machine learning device, skilled action determination device, machine learning method, and machine learning program

Publications (1)

Publication Number Publication Date
WO2023148909A1 true WO2023148909A1 (en) 2023-08-10

Family

ID=87553402

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/004364 WO2023148909A1 (en) 2022-02-04 2022-02-04 Machine learning device, skilled action determination device, machine learning method, and machine learning program

Country Status (3)

Country Link
JP (1) JP7387069B1 (en)
TW (1) TW202333089A (en)
WO (1) WO2023148909A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021077230A (en) * 2019-11-12 2021-05-20 オムロン株式会社 Movement recognition device, movement recognition method, movement recognition program, and movement recognition system
JP2021135898A (en) * 2020-02-28 2021-09-13 富士通株式会社 Behavior recognition method, behavior recognition program and behavior recognition device
JP2021163293A (en) * 2020-04-01 2021-10-11 株式会社デンソーウェーブ Work analyzer and work analysis program

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009048098A (en) 2007-08-22 2009-03-05 Fujitsu Ltd Skill measuring program, computer readable recording medium with the program recorded thereon, skill measuring device, and skill measuring method
CN113239897B (en) 2021-06-16 2023-08-18 石家庄铁道大学 Human body action evaluation method based on space-time characteristic combination regression


Also Published As

Publication number Publication date
JPWO2023148909A1 (en) 2023-08-10
JP7387069B1 (en) 2023-11-27
TW202333089A (en) 2023-08-16

Similar Documents

Publication Publication Date Title
WO2020224403A1 (en) Classification task model training method, apparatus and device and storage medium
CN109670474B (en) Human body posture estimation method, device and equipment based on video
JP7274048B2 (en) Motion recognition method, apparatus, computer program and computer device
CN113642361B (en) Fall behavior detection method and equipment
Huang et al. Deepfinger: A cascade convolutional neuron network approach to finger key point detection in egocentric vision with mobile camera
Harrou et al. Malicious attacks detection in crowded areas using deep learning-based approach
KR102397248B1 (en) Image analysis-based patient motion monitoring system and method for providing the same
CN110738650A (en) infectious disease infection identification method, terminal device and storage medium
CN112836641A (en) Hand hygiene monitoring method based on machine vision
Ansar et al. Robust hand gesture tracking and recognition for healthcare via Recurent neural network
Michel et al. Gesture recognition supporting the interaction of humans with socially assistive robots
WO2023148909A1 (en) Machine learning device, skilled action determination device, machine learning method, and machine learning program
KR20230080938A (en) Method and apparatus of gesture recognition and classification using convolutional block attention module
CN108089753B (en) Positioning method for predicting fingertip position by using fast-RCNN
JP7459949B2 (en) Learning devices, learning methods, tracking devices and programs
Abdulhamied et al. Real-time recognition of American sign language using long-short term memory neural network and hand detection
JP2021047538A (en) Image processing device, image processing method, and program
CN111833375A (en) Method and system for tracking animal group track
JP7254262B2 (en) Work estimating device, work estimating method, and work estimating program
Jankowski et al. Neural network classifier for fall detection improved by Gram-Schmidt variable selection
JP7205628B2 (en) Information processing device, control method, and program
Saliaj et al. Artificial Neural Networks for COVID-19 Time Series Forecasting
CN110852394A (en) Data processing method and device, computer system and readable storage medium
US20230298336A1 (en) Video-based surgical skill assessment using tool tracking
WO2023079943A1 (en) Information processing device, information processing method, and information processing program

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase (Ref document number: 2023532819; Country of ref document: JP)
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22924819; Country of ref document: EP; Kind code of ref document: A1)