WO2023148909A1

WO2023148909A1 - Machine learning device, skilled action determination device, machine learning method, and machine learning program

Info

Publication number: WO2023148909A1
Application number: PCT/JP2022/004364
Authority: WO
Inventors: 雄一佐々木; 翔貴宮川; 勇小川; 雅浩虻川
Original assignee: 三菱電機株式会社
Priority date: 2022-02-04
Filing date: 2022-02-04
Publication date: 2023-08-10
Also published as: JPWO2023148909A1; JP7387069B1; TW202333089A

Abstract

A machine learning device (100) has: a graph input unit (15) that acquires a graph (G) constituted by a plurality of nodes corresponding to a plurality of parts of an action subject, and information indicating a relationship among the plurality of nodes; an object recognition unit (18) that recognizes and outputs a plurality of object regions (O) within an image; a skilled action feature extraction unit (12) that extracts a first feature amount (F1) which is a feature amount of actions in the plurality of parts of the action subject present in the image; a focus region generation unit (13) that generates a focus region (A) on the basis of the first feature amount (F1); a graph-object feature extraction unit (16) that generates a second feature amount (F2) which emphasizes the first feature amount (F1) for regions where the focus region (A) and the object region (O) overlap; and a graph model training unit (14) that generates a training model (M) on the basis of the second feature amount (F2) when the image input into the skilled action feature extraction unit (12) is training data which has been collected ahead of time.

Description

Machine learning device, skilled behavior determination device, machine learning method, and machine learning program

The present disclosure provides a machine learning device, a machine learning method, and a machine learning program for learning a learning model for inferring the action proficiency level of an action subject in an image, and the action proficiency level of an action subject in an image. The present invention relates to an inference skilled action determination device.

Transfer learning in which a user modifies a region of interest generated for an image by a neural network (NN) in a learning model (that is, human knowledge is embedded in the learning model), and learning is performed using the corrected region of interest as correct data. is known (see, for example, Non-Patent Document 1). Transfer learning is a Human-in-the-Loop (HITL) type of learning. By transfer learning, for example, a skilled action determination model is generated, which is a learning model that determines the skill level of human action in an image while interacting with the user.

In addition, the Spatio-Temporal Graph Convolution Network (ST-GCN) is known as a learning model for detecting human movement from the skeleton (see, for example, Non-Patent Document 2). This method uses a graph in which nodes are the joint coordinates of a person and edges are the relationships between the joints.

In addition, using Relationship Proposal Networks (RePN), which is a general object detection model, we extract objects and extract image features linked to objects, and create a scene graph with a graph structure that shows their relationships. Graph Region Based Convolutional Neural Networks (Graph R-CNN) is known as a learning method (see, for example, Non-Patent Document 3). Here, the scene graph is a graph in which objects appearing in an image are nodes, and relationships established between the nodes are edges (for example, directed edges).

However, in the method of Non-Patent Document 1, since the user only corrects the region of interest of the image, it is possible to generate a learning model that can infer the skill level of human behavior as the subject of action with high prediction accuracy. can't

In addition, in the method of Non-Patent Document 2, since the graph structure uses only skeletal information, it is considered difficult to generate a learning model that can infer the proficiency level of human behavior with high prediction accuracy. .

Furthermore, the method of Non-Patent Document 3 deals only with simple relationships between objects in images (positional relationships between trees and birds, trees and leaves, trees and branches, etc.). Therefore, it is considered difficult to generate a learning model that can infer the proficiency level of a person's behavior with high prediction accuracy.

The present disclosure provides a machine learning device, a machine learning method, and a machine learning program for learning a learning model capable of inferring the proficiency level of an action of an action subject with high prediction accuracy, and using the learning model, It is an object of the present invention to provide a skilled action determination device for inferring the skill level of actions of an action subject.

A machine learning device according to the present disclosure is a device that learns a learning model for inferring the action proficiency of an action subject in an image, and corresponds to a plurality of parts of the action subject based on a user's input operation. a graph input unit for acquiring a graph composed of a plurality of nodes and information indicating relationships between the plurality of nodes; a storage unit for storing the graph acquired by the graph input unit; an object recognition unit for recognizing and outputting a plurality of object areas in the image in which a plurality of objects corresponding to the nodes of are present; a skillful action feature extraction unit that extracts a first feature amount, a region of interest generation unit that generates a region of interest in the image based on the first feature amount, and the region of interest and the object region are a graph-object feature extraction unit that generates a second feature amount that emphasizes the first feature amount for the overlapping region; and a graph model learning unit that generates the learning model.

A machine learning method of the present disclosure is a method implemented by a machine learning device that learns a learning model for inferring the action proficiency of an actor in an image, the method comprising: a step of extracting a first feature amount that is a feature amount of actions of a plurality of parts; and a step between a plurality of nodes corresponding to the plurality of parts of the action subject and the plurality of nodes based on a user's input operation. and storing the graph; and recognizing a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes exist. a step of generating a region of interest in the image based on the first feature quantity; and a second step of emphasizing the first feature quantity for a region where the region of interest and the object region overlap. and generating the learning model based on the second feature when the image is learning data collected in advance.

By using the machine learning device, machine learning method, and machine learning program of the present disclosure, it is possible to generate a learning model that can infer the proficiency level of the actions of the action subject with high prediction accuracy.

Also, by using the skilled action determination device of the present disclosure, it is possible to infer the skill level of the action subject's action with high prediction accuracy.

1 is a diagram illustrating an example of a hardware configuration of a machine learning device according to Embodiment 1; FIG. 1 is a functional block diagram schematically showing the configuration of a machine learning device according to Embodiment 1; FIG. FIG. 4 is an explanatory diagram showing an operation during learning of the machine learning device according to Embodiment 1; FIG. 10 is a diagram showing an example of operation during learning of the machine learning device according to Embodiment 1 in a tabular form; 4 is a flow chart showing the operation of the machine learning device according to Embodiment 1 during learning. FIG. 4 is an explanatory diagram showing the operation of the machine learning device (skilled action determination device) according to Embodiment 1 during inference; 4 is a flow chart showing operations during inference of the machine learning device (skilled action determination device) according to Embodiment 1; FIG. 4 is a functional block diagram schematically showing the configuration of a machine learning device according to Embodiment 2; (A) and (B) are explanatory diagrams showing the operation of the machine learning device according to the second embodiment. 10 is a flow chart showing operations during learning of the machine learning device according to Embodiment 2. FIG. FIG. 12 is a functional block diagram schematically showing the configuration of a machine learning device according to Embodiment 3; FIG. 12 is an explanatory diagram showing the operation of the machine learning device according to Embodiment 3 during learning; (A) and (B) are explanatory diagrams showing the operation of the learning rate adjusting unit of the machine learning device according to the third embodiment. 11 is a flow chart showing operations during learning of the machine learning device according to Embodiment 3. FIG. FIG. 11 is a functional block diagram schematically showing the configuration of a machine learning device according to Embodiment 4; FIG. 13 is a flow chart showing an operation during learning of the machine learning device according to Embodiment 4; FIG. FIG. 12 is an explanatory diagram showing the operation of the machine learning device according to Embodiment 4; FIG. 12 is an explanatory diagram showing the operation of the machine learning device according to Embodiment 4;

A machine learning device, a skilled behavior inference device, a machine learning method, and a machine learning program according to embodiments will be described below with reference to the drawings. The following embodiments are merely examples, and the embodiments can be combined as appropriate and each embodiment can be modified as appropriate.

The machine learning device according to the embodiment is a device that learns a learning model for inferring the proficiency level of actions of an action subject in an image. A machine learning device according to an embodiment is, for example, a computer as an information processing device. The subject of action is a person who performs the work (also called a worker, a technician, a skilled worker, etc.), or a mechanism or device (for example, a robot arm, an endoscope, etc.) that works in conjunction with the movement of a person , etc.).

A machine learning method according to an embodiment is a method that can be implemented by a machine learning device. This machine learning method is a method of learning a learning model for inferring the action proficiency of the action subject in the image.

A machine learning program according to the embodiment is a program that can be executed by a computer as a machine learning device. This machine learning program is a program for learning a learning model for inferring the action proficiency of the action subject in the image.

A skilled action inference device according to an embodiment is a device that infers the action proficiency of an action subject using a learning model generated by a machine learning device, a machine learning method, or a machine learning program. A skilled action reasoning device is, for example, a computer. The expert behavior inference device and the machine learning device may be configured by a common computer. Also, the expert behavior inference device and the machine learning device may be configured by different computers.

<<1>> Embodiment 1
<<1-1>> Configuration FIG. 1 is a diagram showing an example of a hardware configuration of a machine learning device 100 according to the first embodiment. The machine learning device 100 according to Embodiment 1 is a device that executes a learning process of generating a learning model M by performing machine learning. Further, the machine learning device 100 is also a skilled action determination device. The machine learning device 100 includes a processor 101 such as a CPU (Central Processing Unit), a memory 102 that is a volatile storage device, a nonvolatile storage device 103 such as a hard disk drive (HDD) or solid state drive (SSD), and an interface 104 . The memory 102 is, for example, a semiconductor memory such as a RAM (Random Access Memory). The machine learning device 100 may have a communication device that communicates with an external device.

Each function of the machine learning device 100 is realized by a processing circuit. The processing circuitry is dedicated hardware. The processing circuit may be processor 101 that executes a program stored in memory 102 (for example, a machine learning program according to the embodiment). The processor 101 may be a processing device, an arithmetic device, a microprocessor, a microcomputer, or a DSP (Digital Signal Processor).

When the processing circuit is dedicated hardware, the processing circuit is, for example, ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array).

When the processing circuit is the processor 101, the machine learning method is implemented by software, firmware, or a combination of software and firmware. Software and firmware are written as programs and stored in memory 102 . Processor 101 can implement the machine learning method according to the first embodiment by reading and executing the program stored in memory 102 .

It should be noted that the machine learning device 100 may be partially implemented by dedicated hardware and partially implemented by software or firmware. As such, the processing circuitry may implement each of the functions described above in hardware, software, firmware, or any combination thereof.

The interface 104 is used to communicate with other devices. An external storage device, a display 105, an input device 106 as a user operation unit, and the like are connected to the interface 104 . The input device 106 is, for example, a mouse, keyboard, touch panel, or the like.

FIG. 2 is a functional block diagram schematically showing the configuration of machine learning device 100 according to Embodiment 1. As shown in FIG. The machine learning device 100 is a device that learns a learning model M for inferring the proficiency level of actions of an action subject in an image. The machine learning device 100 includes a skilled action determination model 11, a graph input unit 15 as a correlation/causal graph input unit, a graph-object feature extraction unit 16, a user input area extraction table 17 stored in a storage unit, and an object recognition unit 18 . The skillful action determination model 11 has a skillful action feature extraction unit 12 , a region-of-interest generation unit 13 , and a graph model learning unit 14 .

At the time of learning, in the skilled action determination model 11, the learning model generation unit 11a generates a learning model M and stores it in the learning model storage unit 11b. At the time of inference, in the skilled action determination model 11, the inference section 11c reads out the learning model M from the learning model storage section 11b, performs inference based on the input data, and outputs the inference result.

The graph input unit 15 generates a graph composed of a plurality of nodes corresponding to a plurality of portions of the action subject in the image and information indicating the relationship between the plurality of nodes based on the input operation of the user 50. get G. In other words, the user inputs nodes and correlations/causal relationships between the nodes from the input device to the graph input unit 15 . That is, the graph input unit 15 inputs relationships between nodes into the causality/correlation graph. The user 50 uses the graph input unit 15 to specify a region (for example, right hand RH, left hand LH, head HE) in which knowledge is to be embedded, and selects an object for extracting the region in which knowledge is to be embedded. Register in Table 17. Furthermore, in the graph input unit 15, the user 50 registers in advance information about the relationship between regions that he/she expects. The user input area extraction table 17 is a table of information acquired in advance or by an input operation from the graph input unit 15 .

The object recognition unit 18 recognizes and outputs multiple object areas O in which multiple objects corresponding to multiple nodes (also referred to as "graph nodes") exist. The object recognition unit 18 recognizes an object corresponding to a node (for example, right hand, left hand, head) input by the user 50 and an object area O (for example, rectangular area) that is an area containing the object. That is, the object recognition unit 18 reads the object extraction method registered in the user input area extraction table 17, and extracts the corresponding object area from the moving image and time-series sensor data according to the method.

The skillful action feature extraction unit 12 extracts a first feature amount F1 that is a feature amount (that is, an intermediate feature amount) of actions of a plurality of portions of the action subject present in the image. The multiple parts of the action subject are, for example, the operator's right hand RH, left hand LH, and head HE. The skillful action feature extraction unit 12 acquires intermediate features using a feature extractor such as a CNN (Convolutional Neural Network), for example.

The region-of-interest generation unit 13 generates a region of interest A in the image based on the first feature amount F1. The region-of-interest generation unit 13 generates heat map information indicating which region of the image to focus on to obtain the skill level, using a network mechanism such as an attention branch network (ABN). In addition, the region-of-interest generation unit 13 registers the visualization result, which is information in the middle of generation of the heat map information, in the learning result portion. ABN is described in Non-Patent Document 1, for example.

The graph-object feature extraction unit 16 generates a second feature quantity F2 that emphasizes the first feature quantity F1 for the region where the region of interest A and the object region O overlap. The graph-object feature extraction unit 16 associates a sensor feature amount such as an image with a region in which human knowledge is assumed to be embedded. The graph-object feature extraction unit 16 includes the object region O extracted by the object recognition unit 18 and the focus point generated by the region-of-interest generation unit 13 in the first feature quantity F1 extracted by the skilled action determination model 11. Areas other than the area overlapping with the area of interest A are masked, and the graph G, which is the user input, is associated with the first feature amount F1, which is the skillful action feature. With such a configuration, it is possible to specify in advance the relationship between the right hand RH, the left hand LH, and the head HE, and furthermore, analyze the object including "how to handle things" that are shown together with the hand. can do. Here, "how to handle things" means, for example, how to handle tools such as screwdrivers and pens. In other words, the graph-object feature extracting unit 16 extracts features for nodes by performing mask processing using heat map information indicating the object region and the region of interest.

The graph model learning unit 14 generates a learning model M based on the second feature amount F2 when the image input to the skillful behavior feature extraction unit 12 is the learning data L collected in advance. The graph model learning unit 14 advances learning using, for example, a graph convolution learning method such as ST-GCN, and accumulates the learning results in the storage unit.

In the data set storage unit 60, videos that are learning data, videos that are inference data, time-series sensor data, and the like are accumulated.

In the first embodiment, the user 50 gives the information indicating the interrelationship between the objects in advance from the graph input unit 15, so that the object to be recognized can be designated according to the problem to be solved. In addition, by selecting image features using the region of interest extracted in the process of recognizing the problem to be solved, it is possible to acquire more detailed features of the graph, such as the region of interest for skillful handling of objects. can be done. In addition, more accurate analysis becomes possible.

<<1-2>> Operation During Learning FIG. 3 is an explanatory diagram showing the operation during learning of the machine learning device 100 according to the first embodiment. FIG. 4 is a diagram showing, in tabular form, information indicating an example of the operation of the machine learning device 100 during learning.

The user inputs, for example, causality and correlation between the right hand, left hand, and head to the graph input unit 15 . In the example of FIG. 3, in order to determine the skill level of "drawing", a bird's-eye view image of the work is acquired, and the user 50 manually determines the relation of information such as "right hand", "left hand", and "head" shown in the image. given in the input.

For example, when causality is expressed as a directed graph, as shown in FIG. Give "left hand", "head", and the edges of the graph.

Also, a method for extracting the nodes of the graph input unit 15 is given to the user input region extraction table 17. The graph input unit 15 may provide a machine learning model for detecting an object from an image as shown in Non-Patent Document 3. Further, by image processing for extracting the color of the skin, it is possible to identify the object on the right side of the image as the right hand and the object on the left side of the image as the left hand.

An example of the user input area extraction table 17 is shown in FIG. 4, for example. Information may be registered in the user input area extraction table 17 by means of extracting sensor data corresponding to the input by the user 50 .

The data set storage unit 60 stores learning data used during machine learning. In the case of a model that determines skillful behavior, sensor data such as videos, pressure sensors, acceleration sensors, and sounds are stored. It stores the result of quality judgment and the like obtained.

If the expert behavior feature extraction unit 12 employs a paired comparison method such as Attention Pairwise Ranking or Pairwise Deep Ranking, the superiority comparison results of the two sensor data may be retained.

FIG. 5 is a flowchart showing the operation of the machine learning device 100 during learning. The object recognizing unit 18 extracts regions that become nodes of the correlation/causal graph input by the user from the image or sensor data (step S101). When an object detection model is used, the recognition results of the right hand, the left hand, and the head, and the rectangle information surrounding the objects of the recognition results are extracted. In addition, when recognizing the right hand, left hand, and head by image processing using skin color, hair color, etc., the same information as the above object detection model can be obtained from the regions within the predetermined color range and their positional relationships. to extract

The skillful action feature extraction unit 12 is a model that extracts features from images such as CNN. The skillful action feature extraction unit 12 may use a model that handles time, such as an RNN (Recurrent Neural Network), in the case of a model that handles a time series, such as an acceleration sensor or sound. When dealing with time series such as videos, 3D-CNN including convolution in the time direction or image features once convolved by CNN are input, and a model such as RNN for handling time series is combined. good too. In addition to the above, a model such as TSN (Temporal Segment Network) may be used in which time points obtained by dividing a moving image at regular intervals (for example, every 1/3 size of a moving image) are input to CNN. Data is input to the skillful action feature extraction unit 12 to obtain a first feature quantity F1, which is an intermediate feature of time t, width W, height H, and number of channels Ch (step S102).

The region-of-interest generation unit 13 uses the intermediate features extracted by the skillful action feature extraction unit 12 to generate a region of interest for judging the skill level (step S103). The region of interest has a heat map ranging from 0 to 1 with respect to width W and height H by global average pooling in the channel direction and time t direction, activation function, or normalization of maximum and minimum values. The heat map has, for example, a CAM (Class Activation Map) structure. Error backpropagation for skillful action determination acquires a point of interest indicating which of the features extracted by the skillful action feature extraction unit 12 should be focused on to easily determine the skill level.

The graph-object feature extraction unit 16 adds the intermediate feature amount (time t×width W×height H×channel number Ch) output by the skillful behavior feature extraction unit 12 to a mask based on the object extraction result, and extracts each node “ The feature values associated with the right hand, left hand, and head are extracted (step S104). In Embodiment 1, the intermediate features extracted by the object recognition unit 18 are used as they are and are not given as features to nodes. The problem we want to solve is to acquire the skill level, not the features for extracting the right hand, left hand, and head. Apply.

As a method of mask processing, a feature amount F1' obtained by masking the first feature amount F1, which is an intermediate feature amount, such as Attention Branch Network, and the first feature amount F1, which is an original intermediate feature amount, are used. may be obtained by taking the sum of In addition to this, the feature amount F1' may be used by eliminating the sum portion of the above method. Also, regions other than the region of interest (Attention region) extracted by the region-of-interest generation unit 13, which will be described later, are masked to extract appropriate features for the nodes of the graph.

The graph model learning unit 14 (ST-GCN) uses the adjacency matrix of causality and correlation between graphs given by the user to learn the features extracted at each time t in the Graph Convolutional Neural Network (Graph-CNN) By the method, error backpropagation is repeated so that the proficiency level of the data set is applied, and learning is performed (step S105). For error backpropagation, object recognition is not performed, and weight parameters are updated up to feature extraction.

<<1-3>> Operation During Inference FIG. 6 is an explanatory diagram showing the operation during inference of the machine learning device (skilled behavior determination device) 100 according to the first embodiment. FIG. 7 is a flowchart showing the operation of the machine learning device (skilled action determination device) 100 during inference.

The object recognition unit 18 extracts an object (node) input in advance by the user and its area (step S111). The skillful action feature extraction unit 12 extracts features for judging the skill level (step S112). The region-of-interest generation unit 13 generates a heat map of width W×height H (step S113). The graph-object feature extracting unit 16 extracts the feature amount of the object input by the user in advance based on the object recognition result and the region-of-interest generation result (step S114). The inference unit recognizes the skill level by graph convolution (step S115).

<<1-4>> Effect With the graph input unit 15, the user can specify the mutual relationship between the objects in advance, and specify the extraction method of the object to be recognized according to the problem to be solved. According to such an embodiment, it becomes possible to incorporate the user's knowledge related to the problem to be solved into machine learning in the form of a knowledge graph.

The region-of-interest generation unit 13 selects image features using the points of interest extracted in the process of recognizing the problem to be solved, extracting the features of the thing (object) corresponding to the node, and simply taking them out for machine learning. It has the effect that it is possible to acquire more detailed characteristics such as ``points of interest in handling things well'' instead of doing.

<<2>> Embodiment 2
<<2-1>> Configuration FIG. 8 is a functional block diagram schematically showing the configuration of the machine learning device 200 according to the second embodiment. In FIG. 8, the same reference numerals as those shown in FIG. 2 are attached to the same or corresponding configurations as those shown in FIG. Machine learning device 200 differs from machine learning device 100 according to the first embodiment in the operation of object recognition section 28 . Machine learning device 200 is a device capable of implementing the machine learning method according to the fourth embodiment. The hardware configuration of the machine learning device 200 is the same as that of FIG.

9(A) and (B) are explanatory diagrams showing the operation of the machine learning device 200. FIG. When dealing with a working video of an expert, objects specified by the user overlap as shown in FIG. 9B, or objects specified by the user disappear from the screen as shown in FIG. 9A. problems may occur. In response to this, the machine learning device 200 provides a means for estimating the occurrence of the above problem in the object recognition unit 28, and appropriately updates the graph features linked to the image according to these states.

<<2-2>> Operation FIG. 10 is a flowchart showing the operation of the machine learning device 200 during learning. The operation of FIG. 10 differs from the operation of the machine learning device 100 according to the first embodiment in the operations of the object recognition unit 28 and the operations of the graph-object feature extraction unit 16. FIG. The machine learning device 200 recognizes an object (step S201), extracts a first feature amount F1 (step S202), generates a region of interest A (step S203), extracts a graph object feature (second feature amount F2). (Step S204), generating a graph model as a learning model (step S205).

The object recognition unit 28 operates with a position filtering technique such as the Kalman filter, in which Gaussian noise is added to position predictions based on observations and past observations.

The flow estimating unit 28a holds information on positions, velocities, or positions and velocities that have been filtered and estimated by the previous time, and based on this, estimates the position where an object is predicted to exist.

The object existence probability estimating unit 28b calculates the existence probability so that the variance value observed at the position becomes small when the position is observed in object recognition, and the variance value gradually increases when the position is not observed. If the dispersion of the positions observed by the above filter is greater than a certain value, or if the position of the hand including the dispersion is estimated to have moved outside the screen, it is recognized that the right hand and left hand are not detected from the middle.

A Kalman filter is used to estimate which of the position estimated by the flow estimation unit 28a and the position observed by the object recognition unit 28 should be given more weight to calculate the position, according to the value of the variance value.

The overlap determination unit 28c recognizes that two objects overlap when the positions of the objects overlap during filtering and only one hand is found in object recognition.

If the object is not recognized partway through as a result of the object recognition, the graph-object feature extraction unit 16 assigns the feature amount that was extracted before the object was not recognized to the node.

The graph-object feature extraction unit 16 determines a weight based on the area ratio of the Gaussian distribution overlapping portion and non-overlapping portion when, for example, the right and left hand objects overlap as a result of the object recognition, The feature values of the right hand and left hand up to the previous time and the feature values of the overlapping portions are mixed by weighted sum and assigned to nodes.

<<2-3>> Effect According to Embodiment 2, the object recognition unit 2 detects that objects do not exist or overlap each other, and based on this, appropriately determines the feature amount to be assigned to the node. Learning like ST-GCN can be performed more stably even if an object is not detected at a certain time.

Except for the above, the second embodiment is the same as the first embodiment.

<<3>> Embodiment 3
<<3-1>> Configuration FIG. 11 is a functional block diagram schematically showing the configuration of the machine learning device 300 according to the third embodiment. In FIG. 11, the same reference numerals as those shown in FIG. 2 are attached to the same or corresponding configurations as those shown in FIG. The machine learning device 300 differs from the machine learning device 100 according to Embodiment 1 in that it has a learning data generation unit 35 and the configuration and operation of the characteristic behavior determination model 31 . Machine learning device 300 is a device capable of implementing the machine learning method according to the third embodiment. The hardware configuration of machine learning device 300 is the same as that of FIG.

The machine learning device 300 is a device that learns a learning model M for inferring the proficiency level of the action of the action subject in the image. The machine learning device 300 acquires a graph G composed of a plurality of nodes corresponding to a plurality of parts of an action subject and information indicating relationships between the plurality of nodes based on an input operation by the user 50. An input unit 15, a user input region extraction table 17 storing the graph G acquired by the graph input unit 15, and a plurality of object regions O in an image in which a plurality of objects corresponding to a plurality of nodes exist. and an object recognition unit 18 for outputting. The machine learning device 300 also includes a learning data generation unit 35 that generates learning data linked to a plurality of object regions O, and a plurality of subjects existing in an image that are linked to the plurality of object regions. (for example, right hand, left hand, head), and a skilled action determination model learning unit 33 for learning an action inference model M2 for inferring actions linked to a plurality of object regions. ing. Furthermore, the machine learning device 300 recognizes actions linked to a plurality of object regions inferred using the action inference model M2, and extracts a first feature amount F1 that is a feature amount of the actions. The skillful action feature extraction unit 34, the graph-object feature extraction unit 16 that generates a second feature amount F2 that emphasizes the first feature amount F1, and the second feature when the image is the learning data and a graph model learning unit 14 that generates a learning model M based on the quantity F2.

<<3-2>> Operation FIG. 12 is an explanatory diagram showing the operation of the machine learning device 300 during learning. A learning rate adjustment unit 32 is provided, and features are extracted with more weight on the CNN at the beginning, and more weight on the ST-GCN in the latter half, making it easier to learn the interrelationship between the right hand, the left hand, and the head. The learning data generation unit 35 registers the recognition results of the right hand, left hand, head, etc. in the data set storage unit 60 by the object recognition unit 28 . Based on this, first, the object recognition/skilled action determination model learning unit 33 performs multitask learning with a model such as a normal CNN, and from the learning data, the skill level, the right hand, the left hand, and the feature amount including the head. to extract

The graph-object feature extraction unit 16 associates the above feature amount with the right hand, left hand, and head, and obtains the node feature amount of ST-GCN.

The learning rate adjustment unit 32 extracts features for finding the left hand, right hand, and head in the first half of learning, and emphasizes ST-GCN in the second half of learning, so that the interaction between human body parts gradually increases. Focus on relationships.

The object recognition/skilled behavior determination model learning unit 33 is a deep learning algorithm that associates labels or categories with all pixels in an image (for example, an algorithm capable of recognizing a group of pixels forming a characteristic category). A behavioral inference model M2, which is a model such as

The object recognition/skilled action feature extraction unit 34 recognizes actions linked to a plurality of object regions inferred using the action inference model M2, and extracts a first feature amount F1, which is the feature amount of the action. Extract. By using the action inference model M2, it is possible to extract features related to proficiency through multitask learning. Semantic segmentation, for example, is known as such an algorithm. Therefore, it is possible to extract a detailed area related to the skill level without providing a device such as the region-of-interest generation unit 13 in the first embodiment.

Also, when semantic segmentation is used, the graph-object feature extraction unit 16 can link nodes and features by using segmentation results and masks.

13A and 13B are explanatory diagrams showing the operation of the learning rate adjusting section 32 of the machine learning device 300. FIG. It is assumed that the operation image of the learning rate adjustment unit 32 and the following loss function are given. L _{usr_cnn} +L _{skill_cnn} is the loss related to the object recognition/skilled behavior determination model learning unit 33 , and L _{skill_gcn} is the loss related to the graph model learning unit 14 .

In

Embodiments

1 and 2, learning is performed using a graph structure, which is manually embedded knowledge, but the graph structure does not include features for extracting objects such as the right hand, left hand, and head. For this reason, the learning rate adjustment unit 32 executes learning by multitask learning for the object recognition/skilled behavior determination model learning unit 33 immediately after the start of learning (that is, the first period of learning), and a certain amount of time elapses. After that, after the right hand, left hand, and head can be stably extracted, the value of α in the following loss function Loss is adjusted so as not to fall below a certain object recognition rate. On this basis, the ST-GCN incorporates features related to object extraction, and the learning is adjusted so as to calculate the skill level from the graph.

Loss = β (α (L _{usr_cnn} + L _{skill_cnn} ) + (1-α) L _{skill_gcn} )

A network configuration example is shown below. A learning rate adjustment unit 32 is provided, and features are extracted with more weight on the CNN at the beginning, and more weight on the ST-GCN in the latter half, making it easier to learn the interrelationship between the right hand, the left hand, and the head.

FIG. 14 is a flowchart showing the operation of the machine learning device 300 during learning. The machine learning device 300 recognizes an object (step S301), generates learning data (step S302), extracts object recognition/skilled behavior features (step S303), adjusts the learning rate (step S304), A feature (second feature amount F2) is extracted (step S305), and a graph model is generated as a learning model (step S306).

<<3-3>> Effect As described above, according to the third embodiment, by providing the object recognition/skilled action determination model learning unit 33, ST-GCN can also have features related to objects. As a result, it becomes possible to learn to determine a skilled action based on the features related to hand and head extraction. This can be expected to make learning more stable.

Except for the above, Embodiment 3 is the same as

Embodiment

1 or 2.

<<4>> Embodiment 4
<<4-1>> Configuration FIG. 15 is a functional block diagram schematically showing the configuration of machine learning device 400 according to the fourth embodiment. In FIG. 15, the same reference numerals as those shown in FIG. 2 are attached to the same or corresponding configurations as those shown in FIG. Machine learning device 400 differs from machine learning device 100 according to Embodiment 1 in the configuration of skilled action determination model 41 and in having graph candidate generation unit 43 . Machine learning device 400 is a device capable of implementing the machine learning method according to the fourth embodiment. The hardware configuration of machine learning device 400 is the same as that of FIG.

When the user 50 provides the machine learning device 400 with knowledge that predefines the relationships between objects, the knowledge provided may become noise, contrary to the user's intention. For example, in generating a graph based on correlation described in Non-Patent Document 3, it is not possible to exchange feature amounts between nodes that have dissimilar features. In Embodiment 4, the region of interest in the time direction of each object is extracted by the Attention Branch Network, and the graph candidate generation unit 43 determines the firing order of the heat map (that is, which region of interest is considered important when determining the skill level). information) to generate graph candidates. In the machine learning device 400 according to the fourth embodiment, when the user 50 inputs only node candidate information, node correlation/causality is automatically discovered.

The machine learning device 400 is a device that learns a learning model M for inferring the proficiency level of actions of the action subject in the image. The machine learning device 400 acquires a graph G composed of a plurality of nodes corresponding to a plurality of parts of an action subject and information indicating relationships between the plurality of nodes based on an input operation by the user 50. An input unit 15, a user input region extraction table 17 storing the graph G acquired by the graph input unit 15, and a plurality of object regions O in an image in which a plurality of objects corresponding to a plurality of nodes exist. and an object recognition unit 18 for outputting. The machine learning device 400 also includes a skillful action feature extraction unit 12 that extracts a first feature amount F1 that is a feature amount of actions of a plurality of parts (for example, right hand, left hand, head) of the action subject present in the image. and a region-of-interest generation unit 13 that generates a region-of-interest A that overlaps one of the plurality of object regions O based on the plurality of object regions O and the first feature amount F1, and outputs the region-of-interest as a heat map; It has a region-of-interest storage unit 42 and a graph candidate generation unit 43 that generates information for presenting to the user graph candidates input from the graph input unit 15 based on the heat map. Furthermore, the machine learning device 400 uses the graph-object feature extraction unit 16 that generates a second feature amount F2 that emphasizes the first feature amount F1 for the region of interest A, and the learning data L in which images are collected in advance. and a graph model learning unit 14 that generates a learning model M based on the second feature amount F2 at a certain time.

<<4-2>> Operation FIG. 16 is a flow chart showing the operation of the machine learning device 400 during learning. The machine learning device 400 recognizes an object (step S401), extracts a first feature amount F1 (step S402), generates a region of interest A (step S403), extracts a graph object feature (second feature amount F2). (Step S404), a graph model is generated as a learning model (step S405).

FIG. 17 is an explanatory diagram showing the operation of the machine learning device 400. FIG. In the graph input section 15, the user 50 defines only the likely relevant nodes such as right hand, left hand and head. Also, these extraction methods are registered in the user input area extraction table 17 .

In the graph model learning unit 14, learning is performed assuming that all nodes are connected by edges.

The region-of-interest generating unit 13 calculates the degree of superimposition of the heat map with respect to the region of interest in which the object is recognized, in addition to the heat map information indicating where in each object the focus is focused to determine the skill level. The firing order of nodes is generated as shown in FIG. The graph candidate generation unit 43 generates graph candidates from the firing order of the heat map (that is, information indicating which is given importance in determining the skill level).

The above is an example of extracting the firing order of nodes such as right hand, left hand, and head. .

The graph candidate generation unit finds node candidates based on the following N-divided time-series attention information (Attention) information and information indicating when each node was noticed.

Referring to FIG. 17, when the right hand is focused after the left hand is focused, there is a directed edge from the left hand to the right hand, and when the right hand and the head are focused at the same time, there are undirected edges on the right and left hands. Assume that there is If only the left hand fires, it is determined that there is a self-loop in the left hand. The graph as described above is divided into a certain number of divisions, and the time is searched with a certain window width, and the window is slid little by little to find graph candidates. A plurality of frequent patterns are extracted as candidates from the discovered graph candidates by graph pattern matching.

FIG. 18 is an explanatory diagram showing the operation of the machine learning device 400. FIG. The graph candidate generating unit 43 may verify whether or not there is a causal relationship between the nodes in judging the skill level, in addition to the automatically discovered relationships between the nodes. For example, the graph candidate generating unit 43 assumes that the time-series focused information shown in FIG. 18 has been obtained, and invalidates some of the focused information to investigate the effect. The graph candidate generation unit 43 verifies whether there is causality from the left hand to the right hand and from the left hand to the head, as described in FIG. 18 as causal relationship extraction.

When verifying the causality of the right hand and left hand, the graph candidate generation unit 43 first invalidates the focus information of the head. Then, shift the time zone of the heat map on the left side so that it is the same as the time on the right side, and find the change in loss (Δ loss) at this time.

On the other hand, when verifying the causality of the left hand and the head, the graph candidate generation unit 43 first disables the attention information of the right hand, and then sets the time zone of the heat map of the left hand to the same time as that of the right hand. Then, find the change in loss (Δ loss) at this time.

The graph candidate generating unit 43 verifies that the loss changes greatly when the relationship between the directed edges pointing from the left hand to the right is broken by the calculation of the loss described above, and the causality of the actually obtained graph is verified. It is possible to present candidates that there may be.

<<4-3>> Effect As described above, according to the fourth embodiment, the graph candidate generating unit 43 presents information that enables the user 50 to discover the relationship between the nodes. It is possible to avoid giving a relationship definition that would be noisy by entering an inappropriate relationship between.

Except for the above, Embodiment 4 is the same as any of Embodiments 1 to 3.

11, 21, 31, 41 Skilled behavior determination model, 11a Learning model generation unit, 11b Learning model storage unit, 11c Inference unit, 12 Skilled behavior feature extraction unit, 13 Region of interest generation unit, 14 Graph model learning unit, 15 Graph input 16 Graph-object feature extraction unit 17 User input region extraction table (storage unit) 18 Object recognition unit 28a Flow estimation unit 28b Object existence probability determination unit 28c Overlap determination unit 32 Learning rate adjustment unit 33 Object recognition/skilled action determination model learning unit, 34 Object recognition/skilled action feature extraction unit, 35 Learning data generation unit, 50 User, 60 Data set storage unit, 100, 200, 300, 400 Machine learning device, A Region of interest, F1 First feature quantity (intermediate feature), F2 Second feature quantity, G graph, L learning data, M learning model, M2 action inference model, O object area, RH right hand, LH left hand, HE head.

Claims

A machine learning device that learns a learning model for inferring the action proficiency of an action subject in an image,
a graph input unit that acquires a graph composed of a plurality of nodes corresponding to a plurality of parts of the action subject and information indicating relationships between the plurality of nodes, based on a user's input operation;
a storage unit that stores the graph acquired by the graph input unit;
an object recognition unit that recognizes and outputs a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes exist;
a skillful action feature extracting unit for extracting a first feature amount, which is a feature amount of actions of the plurality of portions of the action subject existing in the image;
a region-of-interest generation unit that generates a region of interest in the image based on the first feature amount;
a graph-object feature extracting unit that generates a second feature quantity that emphasizes the first feature quantity for a region where the region of interest and the object region overlap;
a graph model learning unit that generates the learning model based on the second feature amount when the image is learning data collected in advance;
A machine learning device comprising:
The object recognition unit holds past information about positions and velocities of the plurality of objects corresponding to the plurality of nodes, predicts positions of the plurality of object areas based on the past information, determine the overlap of object regions,
The graph-object feature extracting unit changes the first feature amount of the overlapping object areas based on the first feature amount of the overlapping object areas among the plurality of object areas. The machine learning device according to claim 1.
A machine learning device that learns a learning model for inferring the action proficiency of an action subject in an image,
a graph input unit that acquires a graph composed of a plurality of nodes corresponding to a plurality of parts of the action subject and information indicating relationships between the plurality of nodes, based on a user's input operation;
a storage unit that stores the graph acquired by the graph input unit;
an object recognition unit that recognizes and outputs a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes exist;
a learning data generation unit that generates learning data linked to the plurality of object regions;
an action inference model for inferring actions of the plurality of parts of the subject existing in the image, which are associated with the plurality of object areas, and which are associated with the plurality of object areas; an action determination model learning unit that learns
an object recognition/skilled behavior feature extracting unit for recognizing actions linked to the plurality of object regions inferred using the action inference model and extracting a first feature amount that is a feature amount of the actions; ,
a graph-object feature extracting unit that generates a second feature amount that emphasizes the first feature amount;
a graph model learning unit that generates the learning model based on the second feature amount when the image is the learning data;
A machine learning device comprising:
A machine learning device that learns a learning model for inferring the action proficiency of an action subject in an image,
a graph input unit that acquires a graph composed of a plurality of nodes corresponding to a plurality of parts of the action subject and information indicating relationships between the plurality of nodes, based on a user's input operation;
a storage unit that stores the graph acquired by the graph input unit;
an object recognition unit that recognizes and outputs a plurality of object regions in the image in which a plurality of objects corresponding to the plurality of nodes exist;
a skillful action feature extracting unit for extracting a first feature amount, which is a feature amount of actions of the plurality of portions of the action subject existing in the image;
a region-of-interest generating unit that generates a region of interest that overlaps one of the plurality of object regions based on the plurality of object regions and the first feature amount, and outputs the region of interest as a heat map;
a graph candidate generation unit that generates information for presenting the user with the graph candidate input from the graph input unit based on the heat map;
a graph-object feature extracting unit that generates a second feature amount that emphasizes the first feature amount for the region of interest;
a graph model learning unit that generates the learning model based on the second feature amount when the image is learning data collected in advance;
A machine learning device comprising:
the subject of action is a person,
The machine learning device according to any one of claims 1 to 4, wherein the plurality of parts includes a plurality of body parts of the person.
The subject of action is a mechanism that moves in conjunction with the movement of a part of the human body,
The machine learning device according to any one of claims 1 to 4, wherein the plurality of parts are a plurality of parts of the mechanism.
The machine learning device according to any one of claims 1 to 6, wherein the information indicating the relationship between the plurality of nodes is a directed edge.
The information indicating the relationship between the plurality of nodes,
information indicating the position of each of the plurality of portions;
information indicating the direction and speed of movement of each of the plurality of parts, and information indicating the order of movement of the plurality of parts;
The machine learning device according to any one of claims 1 to 6, comprising one or more of
A machine learning device according to any one of claims 1 to 8;
the learning model for inferring the skill level of the behavior of the action subject based on the second feature amount when the image input to the skillful action feature extraction unit is the inference target image;
A skilled action determination device, comprising:
A machine learning method implemented by a machine learning device that learns a learning model for inferring proficiency of actions of an agent in an image, comprising:
a step of extracting a first feature quantity that is a feature quantity of actions of a plurality of portions of the action subject present in the image;
Acquiring a graph composed of a plurality of nodes corresponding to the plurality of parts of the action subject and information indicating relationships between the plurality of nodes based on a user's input operation, and storing the graph. a step;
recognizing and outputting a plurality of object regions in the image in which there are a plurality of objects corresponding to the plurality of nodes;
generating a region of interest in the image based on the first feature amount;
generating a second feature quantity that emphasizes the first feature quantity for a region where the region of interest and the object region overlap;
generating the learning model based on the second feature amount when the image is learning data collected in advance;
A machine learning method characterized by comprising:
A computer that learns a learning model for inferring the action proficiency of the action subject in the image,
a step of extracting a first feature quantity that is a feature quantity of actions of a plurality of portions of the action subject present in the image;
Acquiring a graph composed of a plurality of nodes corresponding to the plurality of parts of the action subject and information indicating relationships between the plurality of nodes based on a user's input operation, and storing the graph. a step;
recognizing and outputting a plurality of object regions in the image in which there are a plurality of objects corresponding to the plurality of nodes;
generating a region of interest in the image based on the first feature amount;
generating a second feature quantity that emphasizes the first feature quantity for a region where the region of interest and the object region overlap;
generating the learning model based on the second feature amount when the image is learning data collected in advance;
A machine learning program characterized by executing