WO2023243397A1

WO2023243397A1 - Recognition device, recognition system, and computer program

Info

Publication number: WO2023243397A1
Application number: PCT/JP2023/020076
Authority: WO
Inventors: 大気関井
Original assignee: コニカミノルタ株式会社
Priority date: 2022-06-13
Filing date: 2023-05-30
Publication date: 2023-12-21

Abstract

Provided is a learning method of a learning model that recognizes the occurrence of a plurality of events. According to a learning method of a machine learning model that recognizes a plurality of events of an object, information (first feature point group data 201) about a plurality of feature points generated from an image of the object captured upon occurrence of an event A, and information (second feature point group data 211) about a plurality of feature points generated from an image of the object captured upon occurrence of an event B are prepared. Thereafter, the first feature point group data 201 and the second feature point group data 211 are combined to generate combined feature point group data 221. The combined feature point group data 221, and label data 222 which is for learning that a first event and a second event have occurred, are used as teacher data 223 for performing learning.

Description

Recognition device, recognition system and computer program

The present disclosure relates to a technique for recognizing an event of an object from a moving image captured by a camera, and particularly relates to a technique for recognizing the occurrence of a plurality of different events caused by a plurality of objects.

Technology for recognizing the actions of people, etc. from moving images generated by cameras is needed in a variety of fields, such as video analysis of surveillance cameras and sports video analysis.

According to Non-Patent Document 1, a person's skeleton, that is, a set of joint points of the person, is detected from an input video image, and processing by DNN (Deep Neural Network) is applied to each detected joint point. Then, the event in the input moving image is recognized.

According to Non-Patent Document 1, the DNN is trained on the premise that only one event of a target object (a person, an object, etc.) occurs in one scene. Therefore, if a plurality of objects each causing a different event are present in a moving image, this premise does not hold, and the accuracy of event recognition decreases. For example, learning is performed on the assumption that the estimated probabilities of the events sum to 1, so when high probabilities should be calculated for multiple events, the probability is divided among those events and each may be calculated as a low value. Further, when two events, event A and event B, occur, event A may be recognized as occurring while event B is recognized as not occurring.
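The effect of the sum-to-1 assumption can be illustrated with a small sketch. The source does not name the output functions involved; softmax (as the sum-to-1 case) and independent per-event sigmoids (as the unconstrained case) are assumptions made here for illustration, as are the score values.

```python
import math

def softmax(logits):
    """Probabilities constrained to sum to 1 (single-event premise)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def independent_sigmoid(logits):
    """Per-event probabilities with no sum-to-1 constraint."""
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]

# Hypothetical scores for (fall, walk, run): fall and walk both score high.
logits = [4.0, 4.0, -4.0]
single_event = softmax(logits)            # fall and walk share the probability mass
multi_event = independent_sigmoid(logits) # fall and walk can both be close to 1
```

Under the sum-to-1 constraint, the two high-scoring events each end up near one half, whereas the unconstrained outputs allow both to be high at the same time.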

The present disclosure aims to provide a learning method for a learning model that can recognize the occurrence of multiple events, and a recognition method and recognition device using the learning model learned by the learning method.

To achieve this objective, one aspect of the present disclosure is a learning method for a machine learning model, characterized by: preparing first feature point group data consisting of information on a plurality of feature points of a first object, generated from a first image captured when a first event of the first object occurs, and second feature point group data consisting of information on a plurality of feature points of a second object, generated from a second image captured when a second event of the second object occurs; combining the first feature point group data and the second feature point group data to generate combined feature point group data; and training the learning model using, as teacher data, the combined feature point group data and label data for learning that the first event and the second event have occurred.

According to the learning method of the present disclosure, when a plurality of events occur among a plurality of objects, each of those events can be recognized accurately.

FIG. 1 shows the configuration of a monitoring system 1 according to Example 1.
FIG. 2 is a block diagram showing the configuration of a recognition device 10 of Example 1.
FIG. 3 is a block diagram showing the configuration of a typical neural network 50.
FIG. 4 is a schematic diagram showing one neuron U of the neural network 50.
FIG. 5 is a diagram schematically showing a data propagation model during pre-learning (training) in the neural network 50.
FIG. 6 is a diagram schematically showing a data propagation model during practical inference in the neural network 50.
FIG. 7 is a block diagram showing the configuration of a recognition processing unit 121.
FIG. 8 is a flowchart showing the operation of recognition processing in the recognition device 10.
FIG. 9 is a diagram schematically illustrating synthesis processing of feature point group data.
FIG. 10 is a diagram schematically showing learning processing of the DNN unit 142.
FIG. 11 is a flowchart showing the operation of learning processing by the DNN unit 142.
FIG. 12 is a block diagram showing the configuration of a recognition processing unit 121a in Example 2.
FIG. 13 is a diagram schematically showing a process of synthesizing feature point group data according to a modified example.

1 Example 1
1.1 Monitoring system 1
A monitoring system 1 (recognition system) according to a first embodiment will be explained using FIG. 1.

The monitoring system 1 constitutes a part of the security management system, and is composed of a camera 5 (photographing device) and a recognition device 10.

The camera 5 is fixed at a predetermined position and is installed facing a predetermined direction. Camera 5 is connected to recognition device 10 via cable 11.

The camera 5 photographs objects such as people and objects passing through the passageway 6, and generates a frame image. Since the camera 5 continuously photographs objects within the photographing range, it generates a plurality of frame images. In this way, the camera 5 generates a moving image (scene image) consisting of a plurality of frame images. The camera 5 transmits moving images to the recognition device 10 at any time. The recognition device 10 receives moving images from the camera 5.

The recognition device 10 analyzes the moving image received from the camera 5 and recognizes events (behavior patterns) of objects appearing in the moving image. For example, if a person or the like appearing in the moving image is playing a sport (baseball, basketball, soccer, etc.), the recognition device 10 analyzes the received moving image and recognizes, as the behavior pattern, that the person or the like is playing a sport.

Note that in FIG. 1, the frame image 132a indicates a frame image generated by the camera 5. This does not indicate that the frame image 132a is projected onto the wall of the passageway 6.

1.2 Recognition device 10
As shown in FIG. 2, the recognition device 10 is composed of a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, a storage circuit 104, an input circuit 109, and a network communication circuit 111, which are connected to a bus B1, and a GPU (Graphics Processing Unit) 105, a ROM 106, a RAM 107, and a storage circuit 108, which are connected to a bus B2. Bus B1 and bus B2 are interconnected.

(CPU101, ROM102, RAM103)
The RAM 103 is composed of a semiconductor memory, and provides a work area when the CPU 101 executes a program.

The ROM 102 is composed of a semiconductor memory. The ROM 102 stores a control program, which is a computer program, for causing the recognition device 10 to execute processing.

The CPU 101 is a processor that operates according to a control program stored in the ROM 102.

The CPU 101 uses the RAM 103 as a work area and operates according to the control program stored in the ROM 102, so that the CPU 101, the ROM 102, and the RAM 103 constitute the main control unit 110.

(Network communication circuit 111)
The network communication circuit 111 is connected to an external information terminal via a network. The network communication circuit 111 relays transmission and reception of information to and from an external information terminal via the network. For example, the network communication circuit 111 transmits the recognition result by the recognition processing unit 121, which will be described later, to an external information terminal via the network.

(Input circuit 109)
Input circuit 109 is connected to camera 5 via cable 11.

The input circuit 109 receives a moving image from the camera 5 and writes the received moving image into the storage circuit 104.

(Storage circuit 104)
The storage circuit 104 includes, for example, a hard disk drive.

The storage circuit 104 stores the moving image 131 received from the camera 5 via the input circuit 109, for example.

(Main control unit 110)
The main control unit 110 centrally controls the entire recognition device 10.

The main control unit 110 also performs control so that the moving image 131 stored in the storage circuit 104 is written into the storage circuit 108 as a moving image 132 via the bus B1 and the bus B2. The main control unit 110 also outputs an instruction to the recognition processing unit 121 to start recognition processing.

(GPU105, ROM106, RAM107)
The RAM 107 is composed of a semiconductor memory, and provides a work area when the GPU 105 executes a program.

The ROM 106 is composed of a semiconductor memory. The ROM 106 stores a control program, which is a computer program for causing the recognition processing unit 121 to execute processing.

The GPU 105 is a graphics processor that operates according to a control program stored in the ROM 106.

The GPU 105 uses the RAM 107 as a work area and operates according to the control program stored in the ROM 106, so that the GPU 105, the ROM 106, and the RAM 107 constitute the recognition processing unit 121.

The recognition processing unit 121 incorporates a neural network and the like. The neural network and the like incorporated in the recognition processing unit 121 perform their functions when the GPU 105 operates according to a control program stored in the ROM 106.

Details of the recognition processing unit 121 will be described later.

(Storage circuit 108)
The storage circuit 108 is composed of a semiconductor memory. The storage circuit 108 is, for example, an SSD (Solid State Drive).

The storage circuit 108 stores, for example, a moving image 132 consisting of frame images 132a, 132b, 132c, . . . (see FIG. 7).

1.3 Typical Neural Network
As an example of a typical neural network, a neural network 50 shown in FIG. 3 will be described.

(1) Structure of Neural Network 50
As shown in FIG. 3, the neural network 50 is a hierarchical neural network having an input layer 50a, a feature extraction layer 50b, and a recognition layer 50c.

Here, a neural network is an information processing system that imitates a human neural network. In the neural network 50, an engineering neuron model corresponding to a nerve cell is herein referred to as a neuron U. The input layer 50a, the feature extraction layer 50b, and the recognition layer 50c each include a plurality of neurons U.

The input layer 50a usually consists of one layer. Each neuron U of the input layer 50a receives, for example, the pixel value of each pixel constituting one image. The received pixel values are output directly from each neuron U of the input layer 50a to the feature extraction layer 50b.

The feature extraction layer 50b extracts features from the data (all pixel values forming one image) received from the input layer 50a and outputs them to the recognition layer 50c. This feature extraction layer 50b extracts, for example, a region in which a person is shown from the received image by calculations in each neuron U.

The recognition layer 50c performs identification using the features extracted by the feature extraction layer 50b. The recognition layer 50c identifies, for example, the direction of the person, the gender of the person, the clothing of the person, etc. from the region of the person extracted in the feature extraction layer 50b, through calculations in each neuron U.

As the neuron U, a multi-input, single-output element is usually used, as shown in FIG. 4. The signal is transmitted in only one direction, and each input signal xi (i = 1, 2, . . . , n) is multiplied by a neuron weight value SUwi and input to the neuron U. This neuron weight value represents the strength of the connection between neurons U arranged hierarchically, and can be changed by learning. The neuron U obtains a value X by subtracting the neuron threshold θU from the sum of the weighted input values (SUwi × xi), and outputs the result of transforming X by the response function f(X). That is, the output value y of the neuron U is expressed by the following formula.

y = f(X)
where
X = Σ(SUwi × xi) − θU
Note that, for example, a sigmoid function can be used as the response function.
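The formula above can be written out as a minimal sketch, assuming the sigmoid function as the response function f. The function name neuron_output and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(x):
    """Sigmoid response function f."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, threshold, response=sigmoid):
    """Compute y = f(X) with X = sum(SUwi * xi) - thetaU."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    x_total = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return response(x_total)

# Illustrative values: the weighted sum roughly cancels the threshold,
# so X is approximately 0 and y is approximately 0.5.
y = neuron_output(inputs=[1.0, 0.5], weights=[0.8, -0.4], threshold=0.6)
```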

Each neuron U in the input layer 50a usually does not have a sigmoid characteristic or a neuron threshold. Therefore, the input value appears as is in the output. On the other hand, each neuron U in the final layer (output layer) of the recognition layer 50c outputs the identification result in the recognition layer 50c.

As a learning algorithm for the neural network 50, for example, an error backpropagation method is used, in which the neuron weight values and the like of the recognition layer 50c and of the feature extraction layer 50b are sequentially changed using the steepest descent method so that the squared error between the value (data) indicating the correct answer and the output value (data) from the recognition layer 50c is minimized.

(2) Training process
The training process in the neural network 50 will be explained.

The training step is a step in which the neural network 50 is trained in advance. In the training step, the neural network 50 is trained in advance using image data with correct answers (supervised, annotated) obtained in advance.

FIG. 5 schematically shows a data propagation model during pre-learning.

Image data is input to the input layer 50a of the neural network 50 for each image, and is output from the input layer 50a to the feature extraction layer 50b. Each neuron U of the feature extraction layer 50b performs calculations with neuron weights on input data. Through this calculation, the feature extraction layer 50b extracts a feature (for example, a region of a person) from the input data, and data indicating the extracted feature is output to the recognition layer 50c (step S51).

Each neuron U of the recognition layer 50c performs calculations with neuron weights on input data (step S52). As a result, identification (for example, identification of a person) is performed based on the above characteristics. Data indicating the identification result is output from the recognition layer 50c.

The output value (data) of the recognition layer 50c is compared with the value indicating the correct answer, and their error (loss) is calculated (step S53). In order to reduce this error, the neuron weight values of the recognition layer 50c and the neuron weight values of the feature extraction layer 50b are sequentially changed (back propagation) (step S54). Thereby, the recognition layer 50c and the feature extraction layer 50b are trained.
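The forward pass, error calculation, and weight update described in steps S51 to S54 can be sketched for a single sigmoid neuron with a squared-error loss. This is a simplified illustration, not the multi-layer network 50 itself; the learning rate, initial values, and the function name train_step are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(weights, threshold, inputs, target, lr=0.5):
    """One forward pass, squared-error loss, and steepest-descent update."""
    x_total = sum(w * x for w, x in zip(weights, inputs)) - threshold
    y = sigmoid(x_total)                       # forward computation (cf. S51, S52)
    loss = (y - target) ** 2                   # error against the correct answer (cf. S53)
    # Chain rule: dLoss/dX = 2 (y - target) * y (1 - y)
    grad_x = 2.0 * (y - target) * y * (1.0 - y)
    # dX/dwi = xi and dX/dtheta = -1, so the updates move against the gradient (cf. S54)
    new_weights = [w - lr * grad_x * x for w, x in zip(weights, inputs)]
    new_threshold = threshold + lr * grad_x
    return new_weights, new_threshold, loss

weights, threshold = [0.0, 0.0], 0.0
losses = []
for _ in range(200):
    weights, threshold, loss = train_step(weights, threshold, [1.0, 0.5], 1.0)
    losses.append(loss)
```

Repeating the step reduces the recorded loss, which is the behavior the training process relies on.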

(3) Practical recognition process
The practical recognition process in the neural network 50 will be explained.

FIG. 6 shows a data propagation model when recognition (for example, recognizing the gender of a person) is actually performed by inputting data obtained in the field to the neural network 50 trained through the above training process.

In the practical recognition step in the neural network 50, feature extraction and recognition are performed using the learned feature extraction layer 50b and the learned recognition layer 50c (step S55).

1.4 Recognition processing unit 121
The recognition processing unit 121 includes a point detection unit 141 and a DNN unit 142, as shown in FIG. 7.

The recognition processing unit 121 receives an instruction to start recognition processing from the main control unit 110. Upon receiving the instruction to start the recognition process, the recognition processing unit 121 starts the recognition process.

(1) Point detection section 141
Upon receiving an instruction to start recognition processing from the main control unit 110, the point detection unit 141 (point detection means) reads the moving image 132 consisting of frame images 132a, 132b, 132c, . . . from the storage circuit 108. Here, the frame images 132a, 132b, 132c, . . . are each referred to as a frame, and as shown in FIG. 7, the respective frames are indicated as F1, F2, F3, . . . .

Here, as shown in FIG. 7, as an example, the frame image 132a includes objects representing a person A, a person B, and a person C, respectively. Note that images of people, images of objects, etc. included in the frame images 132a, 132b, 132c, . . . are referred to as objects.

The point detection unit 141 detects and recognizes objects such as people and objects from the frame images 132a, 132b, 132c, . . . that constitute the moving image 132.

In addition, the point detection unit 141 uses OpenPose (see Non-Patent Document 2) to detect, from the frame images 132a, 132b, 132c, . . . that constitute the moving image 132, feature point information indicating skeleton points (joint points) on the skeleton of an object such as a person. Here, a skeleton point is represented by the coordinate values (X coordinate value, Y coordinate value) of the position where the skeleton point exists in the frame image and the coordinate value on the time axis (time t) corresponding to the frame number of the frame image in which the skeleton point exists.

Note that the point detection unit 141 may instead use YOLO (see Non-Patent Document 3) to detect, from the frame images 132a, 132b, 132c, . . . that constitute the moving image 132, feature point information indicating end points (vertices) on the contour of a physical object (hereinafter referred to as an object). Here, an end point is likewise represented by the coordinate values (X coordinate value, Y coordinate value) of the position where the end point exists in the frame image and the coordinate value on the time axis (time t) corresponding to the frame number of the frame image in which the end point exists.

The feature point information may also include at least one of (a) a detection score indicating the likelihood of the skeleton point or vertex indicated by the detected feature point information, (b) a feature vector indicating the type of the object that includes the skeleton point or vertex indicated by the feature point information, (c) a feature vector indicating the type of the feature point, and (d) a feature vector indicating the appearance of the object.

The point detection unit 141 generates, for each of the frame images 132a, 132b, 132c, . . . , feature point group data 133a, 133b, 133c, . . . , each composed of a plurality of pieces of detected feature point information (indicating a plurality of skeleton points or a plurality of end points).

The point detection unit 141 writes feature point group data 133 consisting of feature point group data 133a, 133b, 133c, . . . into the storage circuit 108.
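One possible in-memory layout for the feature point information and the per-frame feature point group data is sketched below. The class name FeaturePoint, the helper group_by_frame, and the coordinate values are hypothetical; the source specifies only that each point carries X, Y, and time-axis coordinates and may carry optional items such as a detection score.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class FeaturePoint:
    x: float                        # X coordinate within the frame image
    y: float                        # Y coordinate within the frame image
    t: int                          # time-axis coordinate (frame number)
    score: Optional[float] = None   # optional detection score

def group_by_frame(points: List[FeaturePoint]) -> Dict[int, List[FeaturePoint]]:
    """Collect points into one group per frame (cf. 133a, 133b, 133c, ...)."""
    groups: Dict[int, List[FeaturePoint]] = {}
    for p in points:
        groups.setdefault(p.t, []).append(p)
    return groups

# Illustrative detections: two points in frame 1 and one point in frame 2.
detected = [
    FeaturePoint(120.0, 80.0, 1, 0.90),
    FeaturePoint(122.0, 140.0, 1, 0.80),
    FeaturePoint(125.0, 82.0, 2, 0.95),
]
per_frame = group_by_frame(detected)
```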

The point detection unit 141 may detect feature point information from one frame image, or from some of the frame images, among the plurality of frame images constituting the moving image.

Additionally, the point detection unit 141 may detect the feature point information by detection processing based on neural network computation. In this case, the point detection unit 141 may use one or both of a Convolutional Neural Network and a Self-Attention mechanism.

(2) DNN section 142
The DNN unit 142 (recognition unit) is a deep neural network (DNN). DNN is a neural network that supports deep learning and has four or more layers.

The DNN unit 142 reads feature point group data 133 consisting of feature point group data 133a, 133b, 133c, . . . from the storage circuit 108.

The DNN unit 142 estimates a label 134 using DNN for the read feature point group data 133.

Here, the label 134 is vector data in which each component is the probability that an event to be recognized has occurred for an object. If the events to be recognized are, for example, three events, namely a person falling, a person walking, and a person running, the label 134 is three-dimensional vector data consisting of a component representing the probability that the event of a person falling has occurred, a component representing the probability that the event of a person walking has occurred, and a component representing the probability that the event of a person running has occurred. If, as a result of the recognition processing by the DNN unit 142, it is recognized for example that a person is walking and a person is falling but no person is running, the component representing the probability of the falling event and the component representing the probability of the walking event take values close to 1, and the component representing the probability of the running event takes a value close to 0.
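Reading such a label can be sketched as follows. The decision threshold of 0.5, the component order, and the function name events_from_label are assumptions for illustration; the source states only that components take values close to 1 or close to 0.

```python
# Assumed component order of the label vector.
EVENTS = ("fall", "walk", "run")

def events_from_label(label, threshold=0.5):
    """Return the names of events whose probability reaches the threshold."""
    if len(label) != len(EVENTS):
        raise ValueError("label must have one component per event")
    return [name for name, p in zip(EVENTS, label) if p >= threshold]

# Illustrative label: falling and walking recognized, running not recognized.
label_134 = [0.97, 0.94, 0.03]
recognized = events_from_label(label_134)
```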

The DNN unit 142 writes the label 134 obtained by estimation into the storage circuit 108.

The DNN unit 142 may estimate the label from the feature point group data using a neural network with a permutation-invariant property, that is, one that produces the same output even if the order of the input changes.

The DNN unit 142 may estimate the label from the feature point group data using PointNet (see Non-Patent Document 4).
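The permutation-invariant property can be sketched with a shared per-point transform followed by an element-wise maximum over all points, which is the general idea behind PointNet-style architectures. The weights below are fixed, hypothetical values rather than learned parameters, and the sketch omits the layers that would map the pooled feature to a label.

```python
import math
import random

def point_features(point):
    """Shared per-point transform (hypothetical fixed weights, not learned)."""
    x, y, t = point
    return [
        math.tanh(0.01 * x - 0.02 * y),
        math.tanh(0.03 * y + 0.10 * t),
        math.tanh(0.02 * x - 0.05 * t),
    ]

def global_feature(points):
    """Element-wise maximum over all points: a symmetric (order-independent) function."""
    feats = [point_features(p) for p in points]
    return [max(column) for column in zip(*feats)]

# Illustrative (x, y, t) feature points and a reordered copy of the same set.
points = [(120.0, 80.0, 1), (122.0, 140.0, 1), (125.0, 82.0, 2), (60.0, 200.0, 3)]
shuffled = points[:]
random.Random(0).shuffle(shuffled)
```

Because the maximum does not depend on the order of its arguments, global_feature returns the same vector for points and shuffled.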

1.5 Operation of the recognition device 10 during recognition
The operation of the recognition device 10 during recognition will be explained using the flowchart shown in FIG. 8.

The input circuit 109 acquires a moving image consisting of a plurality of frame images from the camera 5 and writes the acquired moving image into the storage circuit 104. The main control unit 110 controls the moving image 131 stored in the storage circuit 104 to be written into the storage circuit 108 as the moving image 132 (step S101).

The point detection unit 141 recognizes objects from each frame image included in the moving image, detects skeleton points or end points, and generates feature point group data 133 (step S102).

The DNN unit 142 estimates the label 134 using the DNN from the feature point group data 133, and writes the label 134 obtained by the estimation into the storage circuit 108 (step S103).

With the above, the recognition operation in the recognition device 10 is completed.

1.6 Learning of the DNN unit 142
(1) Synthesis of learning data
The DNN unit 142 is trained using synthetic learning data, which is obtained by combining learning data prepared for learning on the premise that one event of an object occurs in one scene.

The synthetic learning data will be explained with reference to FIG. 9.

The learning data 203 in FIG. 9 is learning data for recognizing the event A (for example, a person's fall), and consists of feature point group data 201a, 201b, 201c, . . . and a teacher label 202. The feature point group data 201a, 201b, 201c, . . . is feature point information detected from each frame image of the moving image in which the event A was photographed. The teacher label 202 is vector data whose components are the probability of occurrence of each event to be recognized, with the probability of occurrence of event A being 1 and the probability of occurrence of other events being 0.

The learning data 213 in FIG. 9 is learning data for recognizing the event B (for example, the walking of a person), and consists of feature point group data 211a, 211b, 211c, . . . and a teacher label 212. The feature point group data 211a, 211b, 211c, . . . is feature point information detected from each frame image of the moving image in which the event B was photographed. The teacher label 212 is vector data whose components are the probability of occurrence of each event to be recognized, with the probability of occurrence of event B being 1 and the probability of occurrence of other events being 0.

The learning data 203 and 213 in FIG. 9 are learning data used in learning on the premise that one object event occurs in one scene.

The DNN learning unit 143 that controls the learning of the DNN unit 142 in the recognition device 10 synthesizes the learning data 203 and the learning data 213 to generate synthetic learning data 223.

The synthetic learning data 223 is learning data for recognizing the occurrence of the event A and the event B, and consists of synthetic feature point group data 221a, 221b, 221c, . . . and a synthetic teacher label 222. The synthetic feature point group data 221a, 221b, 221c, . . . is a concatenation of the feature point group data 201a, 201b, 201c, . . . and the feature point group data 211a, 211b, 211c, . . . . The synthetic teacher label 222 is vector data whose components are the probability of occurrence of each event to be recognized, with the probability of occurrence of event A and event B being 1, and the probability of occurrence of other events being 0.
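The synthesis described above can be sketched as a concatenation of the two feature point groups plus a merged teacher label. Taking the element-wise maximum of the two teacher labels is one way, assumed here, to obtain a label in which events A and B are 1 and other events are 0. The function name synthesize and all coordinate values are hypothetical.

```python
def synthesize(points_a, label_a, points_b, label_b):
    """Concatenate two feature point groups and merge their teacher labels."""
    if len(label_a) != len(label_b):
        raise ValueError("teacher labels must have the same number of components")
    combined_points = list(points_a) + list(points_b)
    # Element-wise maximum: an event is marked 1 if either source marks it 1.
    combined_label = [max(a, b) for a, b in zip(label_a, label_b)]
    return combined_points, combined_label

# Illustrative (x, y, t) points; label components are ordered (event A, event B, other).
points_201 = [(100.0, 50.0, 1), (102.0, 90.0, 2)]
label_202 = [1.0, 0.0, 0.0]
points_211 = [(300.0, 60.0, 1), (305.0, 61.0, 2)]
label_212 = [0.0, 1.0, 0.0]

points_221, label_222 = synthesize(points_201, label_202, points_211, label_212)
```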

(2) Learning processing using synthetic learning data
Learning processing using the synthetic learning data 223 will be described with reference to FIG. 10.

When training the DNN unit 142, the DNN learning unit 143 inputs the synthetic feature point group data 221a, 221b, 221c, . . . to the DNN unit 142, and the DNN unit 142 outputs a recognition result label 135.

The DNN learning unit 143 calculates the error between the recognition result label 135 and the synthetic teacher label 222, and updates the parameters of the DNN unit 142 using the error backpropagation method.

The operation of the DNN unit 142 during learning will be explained with reference to FIG. 11.

The DNN learning unit 143 acquires each frame image (scene A frame image group) of a moving image in which event A of the recognition target object is photographed (step S201).

The DNN learning unit 143 inputs the scene A frame image group to the point detection unit 141, and the point detection unit 141 outputs information on a plurality of feature points of the recognition target object detected from the scene A frame image group (first feature point group data: feature point group data 201a, 201b, 201c, . . . in FIG. 9) (step S202).

The DNN learning unit 143 acquires each frame image (scene B frame image group) of the moving image in which event B of the recognition target object is photographed (step S203).

The DNN learning unit 143 inputs the scene B frame image group to the point detection unit 141, and the point detection unit 141 outputs information on a plurality of feature points of the recognition target object detected from the scene B frame image group (second feature point group data: feature point group data 211a, 211b, 211c, . . . in FIG. 9) (step S204).

The DNN learning unit 143 synthesizes (concatenates) the first feature point group data and the second feature point group data to generate synthetic feature point group data (synthetic feature point group data 221a, 221b, 221c, . . . in FIGS. 9 and 10) (step S205).

The DNN learning unit 143 generates teacher labels (composite teacher labels 222 in FIGS. 9 and 10) corresponding to the occurrence of event A and event B (step S206).

The DNN learning unit 143 inputs the synthetic feature point group data to the DNN unit 142, and the DNN unit 142 outputs the label of the recognition result (label 135 in FIG. 10) (step S207).

The DNN learning unit 143 calculates the error between the recognition result label and the synthetic teacher label, and updates the parameters of the DNN unit 142 using the error backpropagation method (step S208).

With the above, the operation of the DNN unit 142 during learning is completed.

1.7 Summary
As explained above, according to the first embodiment, object events are recognized by the DNN unit 142 trained using learning data corresponding to the occurrence of multiple events, so that even if a moving image includes multiple events, the occurrence of each of those events can be recognized.

2 Example 2
Example 2 is a modification of Example 1.

Here, the differences from Example 1 will be mainly explained.

In the second embodiment, a value (degree of contribution) is calculated that indicates how much each feature point included in the feature point group data 133 contributed to the generation of the recognition result label.

The error between the label estimated by the configuration of Example 1 and the teacher label when a predetermined action (occurrence of a predetermined event) is taken as the correct answer is calculated. Next, using the error backpropagation method, gradient information indicating the gradient of the error with respect to the input value of each feature point is calculated, and the degree of contribution of each feature point is calculated using the calculated gradient information.

In the second embodiment, the GPU 105 uses the RAM 107 as a work area and operates according to the control program stored in the ROM 106, so that a recognition processing unit 121a is configured, as shown in FIG. 12, instead of the recognition processing unit 121 of the first embodiment.

The recognition processing unit 121a includes a contribution calculation unit 144 in addition to the configuration of the recognition processing unit 121 of the first embodiment.

The contribution calculation unit 144 calculates the error L between the label D estimated by the configuration of Example 1 and the teacher label T when a predetermined action is determined as the correct answer.

L = |T − D|

Next, the contribution calculation unit 144 calculates the gradients ∂L/∂x, ∂L/∂y, ∂L/∂t, . . . using the error backpropagation method. Here, (x, y, t, . . .) are the values of each dimension of the feature point information of one feature point, for example, (x coordinate value, y coordinate value, time axis coordinate value (frame number), feature point detection score, feature vector indicating the object type, feature vector indicating the type of feature point, feature vector indicating the appearance of the object).

Next, the contribution calculation unit 144 uses the calculated gradient information to calculate the degree of contribution as follows.

contribution = (∂L/∂x)² + (∂L/∂y)² + (∂L/∂t)² + . . .

In this way, the contribution calculation unit 144 calculates the degree to which each feature point contributed to the recognition result by back-propagating, through the neural network computation, gradient information obtained from the recognition result.

The higher the obtained degree of contribution, the more it can be determined that the feature point contributed to label estimation.

As a result, it is possible to know which feature points were important in the inference of behavior classification.
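The contribution calculation can be sketched on a toy model. Two simplifications are assumed here and are not taken from the source: the DNN is replaced by a hypothetical max-pooling model that outputs a single event probability, and the gradients are approximated by central finite differences instead of the error backpropagation method.

```python
import math

def toy_model(points):
    """Hypothetical stand-in for the DNN: one event probability D (ignores t)."""
    pooled = max(0.02 * x - 0.01 * y for x, y, t in points)
    return 1.0 / (1.0 + math.exp(-pooled))

def error(points, teacher):
    """L = |T - D| for a single label component."""
    return abs(teacher - toy_model(points))

def contributions(points, teacher, eps=1e-4):
    """Sum of squared partial derivatives of L for each feature point."""
    result = []
    for i, point in enumerate(points):
        total = 0.0
        for d in range(len(point)):
            plus = [list(p) for p in points]
            minus = [list(p) for p in points]
            plus[i][d] += eps
            minus[i][d] -= eps
            grad = (error(plus, teacher) - error(minus, teacher)) / (2.0 * eps)
            total += grad ** 2
        result.append(total)
    return result

# Illustrative (x, y, t) points; the first one dominates the max pooling.
points = [(100.0, 50.0, 1.0), (10.0, 20.0, 1.0), (50.0, 100.0, 2.0)]
contrib = contributions(points, teacher=1.0)
most_important = contrib.index(max(contrib))
```

In this toy setting only the point selected by the max pooling receives a nonzero contribution, which mirrors the idea that a higher value marks a feature point that mattered more to the estimate.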

3 Modifications
The present invention has been described above based on examples, but it goes without saying that the present invention is not limited to the above-mentioned embodiments, and the following modifications are of course included in the technical scope of the present invention.

(1) In the above embodiment, the learning data 203 for recognizing event A and the learning data 213 for recognizing event B are combined to generate synthetic learning data 223 corresponding to the occurrence of multiple events. However, the method for generating synthetic learning data corresponding to the occurrence of multiple events is not limited to this.

FIG. 13 is a diagram schematically showing a modification of the method of generating synthetic learning data corresponding to the occurrence of multiple events.

The difference from the above embodiment is that each feature point included in the feature point group data 201a, 201b, 201c, . . . and each feature point included in the feature point group data 211a, 211b, 211c, . . . is subjected to an affine transformation in a three-dimensional space consisting of frame coordinates and frame numbers, thereby generating feature point group data 201A, 201B, 201C, . . . and feature point group data 211A, 211B, 211C, . . . .

The synthetic learning data 233 is learning data for recognizing the occurrence of the event A and the event B, and consists of synthetic feature point group data 231a, 231b, 231c, . . . and a synthetic teacher label 232. The synthetic feature point group data 231a, 231b, 231c, . . . is a concatenation of the feature point group data 201A, 201B, 201C, . . . and the feature point group data 211A, 211B, 211C, . . . . The synthetic teacher label 232 is vector data whose components are the probabilities of occurrence of the respective events to be recognized, with the probabilities of occurrence of event A and event B set to 1 and the probabilities of occurrence of the other events set to 0.

Here, regarding the affine transformation of each feature point included in the feature point group data 201a, 201b, 201c, . . . and the feature point group data 211a, 211b, 211c, . . . , the feature point group included in the feature point group data 201a, 201b, 201c, . . . and the feature point group included in the feature point group data 211a, 211b, 211c, . . . may be subjected to affine transformation with the same settings or with different settings. In addition, affine transformation may be applied to only one of the feature point group included in the feature point group data 201a, 201b, 201c, . . . and the feature point group included in the feature point group data 211a, 211b, 211c, . . . .
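The generation of synthetic learning data in this modification can be sketched as follows. This is a minimal illustration under stated assumptions: feature points are reduced to (x, y, frame number) tuples, and the event list, matrices, and offsets are hypothetical values that do not come from the source. Each point group may be transformed with its own settings, with shared settings, or not at all, and the results are then concatenated and paired with a teacher label vector.

```python
# Hypothetical sketch: feature points are (x, y, frame) tuples. The event list,
# matrices and offsets below are illustrative values, not from the source.
EVENTS = ["A", "B", "C"]


def affine_3d(points, matrix, offset):
    """Apply p' = M p + b to every (x, y, frame) feature point."""
    out = []
    for p in points:
        out.append(tuple(
            sum(matrix[r][c] * p[c] for c in range(3)) + offset[r]
            for r in range(3)
        ))
    return out


def synthesize(points_a, points_b, transform_a=None, transform_b=None):
    """Transform either, both, or neither point group, then concatenate them."""
    if transform_a is not None:
        points_a = affine_3d(points_a, *transform_a)
    if transform_b is not None:
        points_b = affine_3d(points_b, *transform_b)
    return list(points_a) + list(points_b)


def teacher_label(occurred):
    """Vector with probability 1 for each event that occurred and 0 otherwise."""
    return [1.0 if e in occurred else 0.0 for e in EVENTS]


event_a_points = [(10.0, 20.0, 0.0), (12.0, 22.0, 1.0)]
event_b_points = [(50.0, 60.0, 0.0), (51.0, 61.0, 1.0)]

# Event A: shift 5 pixels in x and 2 frames later. Event B: scale x and y by 0.5.
shift_a = ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [5.0, 0.0, 2.0])
scale_b = ([[0.5, 0, 0], [0, 0.5, 0], [0, 0, 1]], [0.0, 0.0, 0.0])

combined = synthesize(event_a_points, event_b_points, shift_a, scale_b)
label = teacher_label({"A", "B"})
```

A general affine matrix can produce non-integer frame values; how such values would be handled is not stated in the source and is left open here.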

(2) In the above embodiment, the training data for recognizing two different events are combined to generate synthetic training data, but three or more sets of training data may be combined to generate the synthetic training data.

(3) In the above embodiment, feature point group data of one object to be recognized is generated for one event. For example, for one event such as a person falling, feature point group data is generated from an image of one person. The present invention is not limited to this, and feature point group data of a plurality of objects to be recognized may be generated for one event. For example, feature point group data may be generated from images of a plurality of people for events such as a collision between people, a handshake between people, and a hug between people.

(4) In the above embodiment, the object detector 112 uses OpenPose to detect the joint points of the object and YOLO to detect the circumscribed rectangle of the object, but a neural network that detects other feature points may be used instead.

(5) The above embodiments and the above modifications may be combined with one another.

4 Others

One aspect of the present disclosure is a learning method for a machine learning model, characterized in that first feature point group data consisting of information on a plurality of feature points of a first object generated from a first image taken at the time of occurrence of a first event of the first object, and second feature point group data consisting of information on a plurality of feature points of a second object generated from a second image taken at the time of occurrence of a second event of the second object, are prepared; the first feature point group data and the second feature point group data are combined to generate synthetic feature point group data; and the learning model is trained using, as teacher data, the synthetic feature point group data and label data for learning that the first event and the second event have occurred.

In the above learning method, the first feature point group data may include information on each feature point of the first object detected for each frame of the first video, and the second feature point group data may include information on each feature point of the second object detected for each frame of the second video.

In the above learning method, each of the information on the plurality of feature points may include information indicating the frame coordinates of the feature point and information identifying the frame number of the frame in which the feature point was detected.

In the above learning method, each of the information on the plurality of feature points may further include at least one of likelihood information indicating the plausibility of the detected feature point, a feature vector representing the type of the object, a feature vector representing the type of feature point, and a feature vector representing the appearance of the object.
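One possible in-memory layout for the feature point information described above is sketched below. The class name, field names, and the `to_vector` helper are hypothetical and chosen only for illustration; the source does not prescribe any particular data structure. The required fields are the frame coordinates and the frame number, and the optional fields correspond to the likelihood information and the three feature vectors.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FeaturePoint:
    """Hypothetical record for the information on one feature point."""
    x: float                       # frame coordinate (horizontal)
    y: float                       # frame coordinate (vertical)
    frame: int                     # number of the frame in which it was detected
    score: Optional[float] = None  # likelihood of the detection, if available
    object_type: List[float] = field(default_factory=list)  # object type vector
    point_type: List[float] = field(default_factory=list)   # feature point type vector
    appearance: List[float] = field(default_factory=list)   # object appearance vector

    def to_vector(self) -> List[float]:
        """Flatten the fields that are present into one list of numbers."""
        values = [self.x, self.y, float(self.frame)]
        if self.score is not None:
            values.append(self.score)
        return values + self.object_type + self.point_type + self.appearance


# Example: a joint point detected in frame 3 with a one-hot feature point type.
p = FeaturePoint(x=1.0, y=2.0, frame=3, score=0.9, point_type=[0.0, 1.0])
vector = p.to_vector()
```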

In the above learning method, each of the information on the plurality of feature points may be generated by a feature point detection process using a single frame image of a video of the object or a plurality of frame images as input.

In the above learning method, the feature point detection process may use neural calculations.

In the above learning method, the machine learning model may use neural computation.

In the above learning method, the machine learning model may use a permutation-invariant DNN (Deep Neural Network).
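To illustrate what permutation invariance means here, the following sketch encodes every feature point independently and then applies max pooling over the points, so the output does not depend on the order in which the points are listed. The architecture (a PointNet-style encoder with max pooling) and all weights are assumptions for illustration; the source does not state which permutation-invariant DNN is used.

```python
import math
import random

# Hypothetical weights; the source does not specify the network.
W_POINT = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]
W_OUT = [1.0, -1.5]


def encode(point):
    """Per-point encoder: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, point))) for row in W_POINT]


def recognize(points):
    """Return an event probability that is independent of the point order."""
    encoded = [encode(p) for p in points]
    # Max pooling per feature dimension discards the ordering of the points.
    pooled = [max(column) for column in zip(*encoded)]
    logit = sum(w * v for w, v in zip(W_OUT, pooled))
    return 1.0 / (1.0 + math.exp(-logit))


points = [(0.1, 0.2, 0.0), (1.5, -1.0, 1.0), (0.0, 0.4, 2.0), (2.0, 1.0, 3.0)]
shuffled = list(points)
random.Random(0).shuffle(shuffled)

same_output = recognize(points) == recognize(shuffled)
```

Because the pooled representation ignores ordering, feature points from two concatenated point groups can be fed to such a model without imposing any arrangement between the groups.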

One aspect of the present disclosure is a recognition method characterized in that a recognition device including a learning model trained by the above learning method is prepared, information on a plurality of feature points generated from a new video of a target object is input to the recognition device and a recognition result is output, and the degree of contribution of each piece of information on the plurality of feature points generated from the new video to the recognition result is calculated using an error backpropagation method related to neural operations.

In the above learning method, when generating the synthetic feature point group data, the synthesis may be performed by applying an affine transformation in a three-dimensional space consisting of frame coordinates and frame numbers.

One aspect of the present disclosure is a recognition device comprising: a feature point detection unit that detects a plurality of feature points of a target object from an image of the target object and generates feature point group data consisting of information on the plurality of feature points; and a recognition unit including a learning model that receives the feature point group data as input and recognizes an event of the object, wherein, when the feature point group data includes information on feature points related to a first event and information on feature points related to a second event, the learning model recognizes the first event and the second event at once.

In the above recognition device, the learning model may be trained by preparing first feature point group data consisting of information on a plurality of feature points of a first object generated from a first image taken at the time of occurrence of a first event of the first object, and second feature point group data consisting of information on a plurality of feature points of a second object generated from a second image taken at the time of occurrence of a second event of the second object, combining the first feature point group data and the second feature point group data to generate synthetic feature point group data, and using, as teacher data, the synthetic feature point group data and label data for learning that the first event and the second event have occurred.

One aspect of the present disclosure is a recognition system characterized by comprising a photographing device that generates an image by photographing, and the above recognition device.

One aspect of the present disclosure is a control computer program used in a recognition device that performs recognition processing on an image obtained by shooting, the computer program causing the recognition device, which is a computer, to execute: a feature point detection step of detecting a plurality of feature points of an object from a captured image of the object and generating feature point group data consisting of information on the plurality of feature points; and a recognition step using a learning model that recognizes an event of the object using the feature point group data as input. The learning model is characterized in that it is trained by preparing first feature point group data consisting of information on a plurality of feature points of a first object generated from a first image taken at the time of occurrence of a first event of the first object, and second feature point group data consisting of information on a plurality of feature points of a second object generated from a second image taken at the time of occurrence of a second event of the second object, combining the first feature point group data and the second feature point group data to generate synthetic feature point group data, and using, as teacher data, the synthetic feature point group data and label data for learning that the first event and the second event have occurred.

The recognition device according to the present invention is useful as a technology for recognizing multiple actions of multiple people, etc. from a moving image generated by photography.

1 Surveillance system
5 Camera
10 Recognition device
11 Cable
50 Neural network
101 CPU
102 ROM
103 RAM
104 Memory circuit
105 GPU
106 ROM
107 RAM
108 Memory circuit
109 Input circuit
110 Main control section
111 Network communication circuit
121 Recognition processing section
141 Point detection section
142 DNN section
143 DNN learning section
144 Contribution degree calculation section

Claims

A learning method for a machine learning model, the method comprising:
preparing first feature point group data consisting of information on a plurality of feature points of a first object generated from a first image taken at the time of occurrence of a first event of the first object, and second feature point group data consisting of information on a plurality of feature points of a second object generated from a second image taken at the time of occurrence of a second event of the second object;
synthesizing the first feature point group data and the second feature point group data to generate synthetic feature point group data; and
training the learning model using, as teacher data, the synthetic feature point group data and label data for learning that the first event and the second event have occurred.
The learning method according to claim 1, wherein the first feature point group data includes information on each feature point of the first object detected for each frame of the first image, and the second feature point group data includes information on each feature point of the second object detected for each frame of the second image.
The learning method according to claim 1, wherein each of the information on the plurality of feature points is expressed by information indicating the frame coordinates of the feature point and information identifying the frame number of the frame in which the feature point was detected.
The learning method according to claim 3, wherein each of the information on the plurality of feature points further includes at least one of:
likelihood information indicating the plausibility of the detected feature point;
a feature vector representing the type of the object;
a feature vector representing the type of the feature point; and
a feature vector representing the appearance of the object.
The learning method according to claim 1, wherein each of the plurality of feature point information is generated by a feature point detection process using a single frame image or a plurality of frame images of an image of the object as input.
The learning method according to claim 5, wherein the feature point detection process uses a neural calculation.
The learning method according to claim 1, wherein the machine learning model uses neural operations.
The learning method according to claim 7, wherein the machine learning model uses a permutation-invariant DNN (Deep Neural Network).
A recognition method comprising:
preparing a recognition device including a learning model trained by the learning method according to claim 7;
inputting information on a plurality of feature points generated from a new video of a target object into the recognition device and outputting a recognition result, the new video being different from the video of the object taken when the first event occurs and the video of the object taken when the second event occurs; and
calculating the degree of contribution, to the recognition result, of each piece of information on the plurality of feature points generated from the new video, using an error backpropagation method related to neural calculations.
The learning method according to claim 1, wherein, when generating the synthetic feature point group data, the synthesis is performed by applying, to each of the feature points included in at least one of the first feature point group data and the second feature point group data, an affine transformation in a three-dimensional space consisting of frame coordinates and frame numbers.
A recognition device comprising:
a feature point detection unit that detects a plurality of feature points of an object from an image of the object and generates feature point group data consisting of information on the plurality of feature points; and
a recognition unit including a learning model that receives the feature point group data as input and recognizes an event of the object,
wherein the learning model recognizes a first event and a second event at once when the feature point group data includes information on feature points related to the first event and information on feature points related to the second event.
The recognition device according to claim 11, wherein the learning model is trained by:
preparing first feature point group data consisting of information on a plurality of feature points of a first object generated from a first image taken at the time of occurrence of a first event of the first object, and second feature point group data consisting of information on a plurality of feature points of a second object generated from a second image taken at the time of occurrence of a second event of the second object;
synthesizing the first feature point group data and the second feature point group data to generate synthetic feature point group data; and
using, as teacher data, the synthetic feature point group data and label data for learning that the first event and the second event have occurred.
A recognition system comprising: a photographing device that generates an image by photographing; and the recognition device according to claim 12.
A control computer program used in a recognition device that performs recognition processing on images obtained by shooting, the computer program causing the recognition device, which is a computer, to execute:
a feature point detection step of detecting a plurality of feature points of an object from an image of the object and generating feature point group data consisting of information on the plurality of feature points; and
a recognition step using a learning model that recognizes an event of the object using the feature point group data as input,
wherein the learning model is trained by:
preparing first feature point group data consisting of information on a plurality of feature points of a first object generated from a first image taken at the time of occurrence of a first event of the first object, and second feature point group data consisting of information on a plurality of feature points of a second object generated from a second image taken at the time of occurrence of a second event of the second object;
synthesizing the first feature point group data and the second feature point group data to generate synthetic feature point group data; and
using, as teacher data, the synthetic feature point group data and label data for learning that the first event and the second event have occurred.