WO2024038969A1 - Deep learning-based action recognition device and method - Google Patents

Deep learning-based action recognition device and method Download PDF

Info

Publication number
WO2024038969A1
Authority
WO
WIPO (PCT)
Prior art keywords
deep learning
action recognition
frame
video
results
Prior art date
Application number
PCT/KR2022/019596
Other languages
French (fr)
Korean (ko)
Inventor
장달원
이종설
이재원
Original Assignee
한국전자기술연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자기술연구원 filed Critical 한국전자기술연구원
Publication of WO2024038969A1 publication Critical patent/WO2024038969A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Definitions

  • the present invention relates to a technology for recognizing human or animal behavior by performing deep learning on video captured through a camera, and more specifically to a deep learning-based action recognition device and method that performs deep learning training on a reduced amount of input video data while preventing the resulting degradation in action recognition performance.
  • the image size and frame rate of the input video are adjusted appropriately during the preprocessing process, regardless of which deep learning structure is used internally.
  • an action recognition process is finally performed to determine what actions exist in the input video data.
  • several candidate behavior classes are selected and the behavior is inferred based on one of them.
  • the above input video data is input into the deep learning network as is, or information such as optical flow is additionally input.
  • the action classes pursued by each application and the composition of the videos may be completely different. Therefore, it is necessary to newly train the deep learning network in a form suitable for each application.
  • the inference process can also consume a lot of time due to the complexity of the deep learning network. In order to improve this, it is necessary to simplify the deep learning network, but this has the problem of lowering performance.
  • the technical problem that the present invention aims to achieve is to perform deep learning on videos captured through a camera to recognize the behavior of people or animals, and to perform deep learning training by reducing the amount of input video data.
  • the purpose is to provide a deep learning-based action recognition device and method that compensates during the inference process so that the reduced data volume does not degrade action recognition performance.
  • a deep learning-based behavior recognition device includes a video input unit for inputting a video; a frame classification unit that extracts one frame from every K frames from the video and sequentially assigns it to K frame groups; A deep learning network unit comprising K deep learning networks, each training a corresponding deep learning network among the K deep learning networks using a corresponding frame group among the K frame groups, and inferring respective action recognition results accordingly; and a final action recognition decision unit that performs weighted voting on the action recognition results and selects and outputs the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
  • the K frame groups each have a size of W × H × 3 × fps × N / K, where W means the width of the video, H means the height of the frame, N means the input time of the video, fps means the frame rate (frames per second) of the video, and 3 means that the video has the three primary color channels R, G, and B.
  • the deep learning network is characterized by being implemented based on a 3D CNN (Convolutional Neural Network) or by applying LSTM (Long Short-Term Memory) to a 2D CNN.
  • the deep learning network unit includes a first deep learning network that performs training with the first group of frames supplied from the frame classification unit and infers a first action recognition result accordingly; a second deep learning network that performs training with a second group of frames supplied from the frame classification unit and infers a second action recognition result accordingly; and a third deep learning network that performs training with the third frame group supplied from the frame classification unit and infers the third action recognition result accordingly.
  • the action recognition final decision unit performs weighted voting on the first to third action recognition results output from the first to third deep learning networks, and outputs the final action recognition result using the weighted voting results.
  • the action recognition final decision unit calculates the probability of each of the first to third action recognition results for each class, sums the calculation results per class, and outputs the class with the highest probability as the final action recognition result.
  • the deep learning-based action recognition method includes a video input step of inputting a video; a frame classification step of extracting one frame from every K frames of the video and sequentially assigning it to K frame groups; a deep learning network training step in which K deep learning networks are provided and each is trained with the corresponding frame group among the K frame groups to infer a respective action recognition result; and a final action recognition decision step of conducting weighted voting on the action recognition results and selecting and outputting the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
  • weighted voting is characterized in that it refers to a voting method in which more votes are given to behavior recognition results with higher reliability.
  • the deep learning-based action recognition device and method of the present invention have the following effects.
  • Figure 1 is a block diagram of a deep learning-based behavior recognition device according to the present invention.
  • Figure 2 is a diagram illustrating the structure of a video according to the present invention.
  • Figure 3 is an explanatory diagram of the deep learning-based action recognition principle according to the present invention.
  • Figure 4 is a flow chart of the deep learning-based action recognition method according to the present invention.
  • Figure 1 is a block diagram of a deep learning-based action recognition device according to an embodiment of the present invention,
  • Figure 2 is a diagram illustrating the form of a video according to the present invention, and
  • Figure 3 is an explanatory diagram of the deep learning-based action recognition principle according to the present invention.
  • the deep learning-based action recognition device 100 includes a video input unit 110, a frame classification unit 120, a deep learning network unit 130, and a final action recognition decision unit 140.
  • the video input unit 110 inputs video captured through a camera.
  • the supply path for the video is not particularly limited.
  • the video may include video supplied from a camera or storage media such as memory and database, or video transmitted through a wired or wireless network.
  • Figure 2 exemplarily shows the form of the video.
  • the video (MP) may be composed of several video frames (hereinafter referred to as 'frames').
  • W means the width of the video (MP) made up of frames
  • H means the height of the frame
  • N means the input time of the video (MP)
  • fps means the frame rate (frames per second) of the video (MP)
  • a pre-processing process may be performed in the video input unit 110, whereby the video is converted into the form W × H × 3 × fps × N.
  • 3 is given because the input video has three primary color information: red (R), green (G), and blue (B).
  • the frame classification unit 120 extracts one frame for every K frames from the input video and creates K frame groups, each with a size of W × H × 3 × fps × N / K.
  • FIG. 3 exemplarily shows that three frame groups (FG1-FG3) are formed by the frame classification unit 120.
  • when the input video (MP) consists of 15 frames (F1-F15), one frame is selected for every K frames (e.g., K = 3) and assigned in turn to the first to third frame groups (FG1-FG3)
  • the frames F1, F4, F7, F10, and F13 are allocated to the first frame group (FG1)
  • the frames F2, F5, F8, F11, and F14 are allocated to the second frame group (FG2).
  • the frames F3, F6, F9, F12, and F15 are allocated to the third frame group (FG3).
  • the 1st-3rd frame groups (FG1-FG3) have a size reduced by 1/K on the time axis compared to the entire video (MP). That is, the size of the 1st-3rd frame groups (FG1-FG3) is reduced by 1/K compared to the size (data capacity) of the entire video (MP).
  • the deep learning network unit 130 performs training with the 1-3 frame groups (FG1-FG3) as shown in FIG. 3 and infers each action recognition result accordingly.
  • the deep learning network unit 130 may be implemented as a typical deep learning network based on a 3D CNN (Convolutional Neural Network), or may be implemented by applying LSTM (Long Short-Term Memory), etc. to a 2D CNN. In this case, it can be implemented to estimate dynamic information by initially configuring a network of a previously trained 2D CNN and extending it on the time axis.
  • the deep learning network unit 130 may include first-third deep learning networks (131-133).
  • the first deep learning network 131 performs training with the first frame group (FG1) and infers the first action recognition result accordingly.
  • the second deep learning network 132 performs training with the second frame group (FG2) and infers the second action recognition result accordingly.
  • the third deep learning network 133 performs training with the third frame group (FG3) and infers the third action recognition result accordingly. The first to third deep learning networks 131 to 133 can therefore be implemented with a complexity reduced by a factor of K compared to a typical deep learning network.
  • the action recognition final decision unit 140 performs weighted voting on the first to third action recognition results output from the first to third deep learning networks 131 to 133, and outputs the final action recognition result using the weighted voting results.
  • the weighted voting refers to a voting method that gives more votes to action recognition results with higher reliability relative to the first to third action recognition results.
  • the action recognition final decision unit 140 calculates the probability of each of the first to third action recognition results for each class, sums these results per class, and outputs the class with the highest probability as the final action recognition result.
  • for example, assuming three action recognition classes (running, swimming, and arm exercise), the weighted voting performed by the action recognition final decision unit 140 on the action recognition results output from the first to third deep learning networks 131 to 133 proceeds as follows:
  • the action recognition final decision unit 140 performs weighted voting on the first action recognition result output from the first deep learning network 131 and may assign probability values of 0.5, 0.2, and 0.3 to running, swimming, and arm exercise. Likewise, it may assign probability values of 0.4, 0.3, and 0.2 for the second action recognition result output from the second deep learning network 132, and 0.3, 0.2, and 0.3 for the third action recognition result output from the third deep learning network 133.
  • the action recognition final decision unit 140 then sums the corresponding probability values per class, obtaining 1.2 (0.5 + 0.4 + 0.3) for running, 0.7 (0.2 + 0.3 + 0.2) for swimming, and 0.8 (0.3 + 0.2 + 0.3) for arm exercise.
  • the action recognition final decision unit 140 selects and outputs the run with the largest value among the summation results of probability values as the final action recognition result.
  • Figure 4 is a flowchart of the deep learning-based action recognition method according to the present invention.
  • the deep learning-based action recognition method includes a video input step (S1), a frame classification step (S2), a deep learning network training step (S3), and a final action recognition decision step (S4).
  • the video input unit 110 inputs a video captured through a camera.
  • the video can be supplied through the same path as in the description above, and has the same form as in FIG. 2.
  • the frame classification unit 120 extracts one frame for every K frames of the video as shown in FIG. 3 and creates frame groups (FG1-FG3), each with a size of W × H × 3 × fps × N / K.
  • the deep learning network unit 130 trains the first to third deep learning networks 131-133 with the first to third frame groups (FG1-FG3) and infers the first to third action recognition results accordingly.
  • the action recognition final decision unit 140 performs weighted voting on the first to third action recognition results output from the first to third deep learning networks 131-133. It then calculates the probability of each result for each class, sums these results per class, and outputs the class with the highest probability as the final action recognition result.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a deep learning-based action recognition device and method for performing deep learning on a video filmed by a camera to recognize a person's or animal's actions. The deep learning-based action recognition device according to the present invention comprises: a deep learning network unit which trains each of K deep learning networks with the corresponding frame group among K frame groups and infers a respective action recognition result; and a final action recognition determination unit which performs weighted voting on the action recognition results and selects and outputs, as the final action recognition result, the action recognition result having the highest probability value among the weighted voting results.

Description

Deep learning-based action recognition device and method
The present invention relates to a technology for recognizing human or animal behavior by performing deep learning on video captured through a camera, and more specifically to a deep learning-based action recognition device and method that performs deep learning training on a reduced amount of input video data while preventing the resulting degradation in action recognition performance.
Recently, deep learning-based recognition technology has been widely used to implement functions such as detecting a specific object (object detection) or recognizing a face (face recognition) in images captured by a camera. Improving the recognition rate of such image recognition technology requires a large amount of data and a deep learning training process based on it, which in turn requires high-performance computing power.
When recognizing an object in a video captured by a camera, object recognition is possible from a single image extracted from each frame of the video. In contrast, recognizing the behavior of people or animals in a video requires information from several frames of the video. A large amount of data is then needed during the deep learning training process, and technology development is required so that training can proceed smoothly.
To recognize human or animal behavior in a video, the image size and frame rate of the input video are adjusted appropriately during preprocessing, regardless of which deep learning structure is used internally.
Afterwards, after a classification process, an action recognition process is finally performed to determine what actions exist in the input video data. At this point, several candidate action classes are selected and the action is inferred based on one of them. In the action recognition process, the input video data is fed into the deep learning network as is, or information such as optical flow is additionally input.
Paper [1] below presents a technique that performs action recognition with a single network structure without using optical flow. Paper [2] presents a technique that combines a deep learning network taking optical flow as input with a deep learning network taking the video itself as input.
Very high complexity is required for a deep learning network to achieve optimal performance, and when a deep learning network is newly trained for a special situation, the expected performance may not be achieved because of that complexity.
When a deep learning network is used to recognize actions in input videos, the action classes pursued by each application and the composition of the videos may be completely different. Therefore, the deep learning network needs to be newly trained in a form suited to each application.
In addition, the inference process can also consume a lot of time due to the complexity of the deep learning network. Simplifying the deep learning network improves this, but at the cost of lower performance.
[Prior art literature]
[Papers]
[1] Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, Boqing Gong, "MoViNets: Mobile Video Networks for Efficient Video Recognition," in Proc. on CVPR 2021.
[2] J. Carreira and A. Zisserman, "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset," in Proc. on CVPR 2017.
The technical problem the present invention aims to solve is, when performing deep learning on videos captured through a camera to recognize the behavior of people or animals, to perform deep learning training on a reduced amount of input video data and to compensate during the inference process, thereby providing a deep learning-based action recognition device and method in which the reduced data volume does not degrade action recognition performance.
To achieve the above object, a deep learning-based action recognition device according to the present invention includes a video input unit for inputting a video; a frame classification unit that extracts one frame from every K frames of the video and assigns it sequentially to K frame groups; a deep learning network unit comprising K deep learning networks, which trains each of the K deep learning networks with the corresponding frame group among the K frame groups and infers a respective action recognition result; and a final action recognition decision unit that performs weighted voting on the action recognition results and selects and outputs the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
Each of the K frame groups has a size of W × H × 3 × fps × N / K, where W means the width of the video, H means the height of the frame, N means the input time of the video, fps means the frame rate (frames per second) of the video, and 3 means that the video has the three primary color channels R, G, and B.
The deep learning network is implemented based on a 3D CNN (Convolutional Neural Network) or by applying LSTM (Long Short-Term Memory) to a 2D CNN.
The deep learning network unit includes a first deep learning network that performs training with a first frame group supplied from the frame classification unit and infers a first action recognition result; a second deep learning network that performs training with a second frame group supplied from the frame classification unit and infers a second action recognition result; and a third deep learning network that performs training with a third frame group supplied from the frame classification unit and infers a third action recognition result.
The final action recognition decision unit performs weighted voting on the first to third action recognition results output from the first to third deep learning networks and outputs the final action recognition result using the weighted voting results.
The final action recognition decision unit calculates the probability of each of the first to third action recognition results for each class, sums the results per class, and outputs the class with the highest probability as the final action recognition result.
To achieve the above object, a deep learning-based action recognition method according to the present invention includes a video input step of inputting a video; a frame classification step of extracting one frame from every K frames of the video and assigning it sequentially to K frame groups; a deep learning network training step of providing K deep learning networks, training each of the K deep learning networks with the corresponding frame group among the K frame groups, and inferring a respective action recognition result; and a final action recognition decision step of performing weighted voting on the action recognition results and selecting and outputting the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
The weighted voting refers to a voting method that gives more votes to the more reliable ones among the action recognition results.
The deep learning-based action recognition device and method of the present invention have the following effects.
First, when processing a large amount of training data, training is performed on data of reduced size, so action recognition can be processed quickly.
Second, by changing the input of the deep learning network for action recognition in a way that reduces the number of frames, performance degradation can be avoided.
Third, by using a low frame rate, an action recognition device with a structure well suited to detecting dynamic information can be implemented.
Figure 1 is a block diagram of a deep learning-based action recognition device according to the present invention.
Figure 2 is a diagram illustrating the structure of a video according to the present invention.
Figure 3 is an explanatory diagram of the deep learning-based action recognition principle according to the present invention.
Figure 4 is a flowchart of the deep learning-based action recognition method according to the present invention.
Hereinafter, embodiments of the present invention will be described in detail with reference to the attached drawings. First, note that when adding reference numerals to the components in each drawing, the same components are given the same reference numerals as far as possible, even if they appear in different drawings. In addition, in describing the present invention, detailed descriptions of related known configurations or functions are omitted when they are obvious to those skilled in the art or could obscure the gist of the present invention.
Figure 1 is a block diagram of a deep learning-based action recognition device according to an embodiment of the present invention, Figure 2 is a diagram illustrating the form of a video according to the present invention, and Figure 3 is an explanatory diagram of the deep learning-based action recognition principle according to the present invention.
Referring to FIG. 1, the deep learning-based action recognition device 100 includes a video input unit 110, a frame classification unit 120, a deep learning network unit 130, and a final action recognition decision unit 140.
The operation of the deep learning-based action recognition device according to an embodiment of the present invention configured as described above is explained below with reference to FIGS. 2 and 3.
The video input unit 110 inputs video captured through a camera. The supply path of the video is not particularly limited. For example, the video may be supplied from a camera or from a storage medium such as a memory or database, or it may be transmitted over a wired or wireless network.
Figure 2 shows the form of the video by way of example. As shown in FIG. 2, the video (MP) may be composed of several video frames (hereinafter referred to as 'frames'). Here, W means the width of the video (MP) made up of frames, H means the height of the frame, N means the input time of the video (MP), and fps means the frame rate (frames per second) of the video (MP).
A preprocessing step may be performed in the video input unit 110, whereby the video is converted into the form W × H × 3 × fps × N. Here, 3 is included because the input video carries the three primary color channels red (R), green (G), and blue (B).
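As a rough illustration of this preprocessed form, the Python sketch below (not part of the original disclosure) stacks T = fps × N decoded RGB frames into a single array and reports its size; the concrete values 224 × 224 pixels, 30 fps, and 10 seconds are assumed purely for the example.

```python
import numpy as np

# Assumed example values; the patent does not fix W, H, fps, or N.
W, H, FPS, N_SECONDS = 224, 224, 30, 10
T = FPS * N_SECONDS                        # total number of frames (fps x N)

# Stand-in for decoded camera frames; a real pipeline would read these from a video file.
frames = [np.zeros((H, W, 3), dtype=np.uint8) for _ in range(T)]

video = np.stack(frames)                   # shape (T, H, W, 3): the W x H x 3 x fps x N form
print(video.shape)                         # (300, 224, 224, 3)
print(f"{video.nbytes / 1e6:.1f} MB of raw pixel data")   # data volume before any reduction
```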
The frame classification unit 120 extracts one frame for every K frames of the input video and creates K frame groups, each with a size of W × H × 3 × fps × N / K. FIG. 3 shows, by way of example, three frame groups (FG1-FG3) formed by the frame classification unit 120.
Referring to FIG. 3, when the input video (MP) consists of 15 frames (F1-F15), one frame is selected for every K frames (e.g., K = 3) and assigned in turn to the first to third frame groups (FG1-FG3). Accordingly, frames F1, F4, F7, F10, and F13 are assigned to the first frame group (FG1), frames F2, F5, F8, F11, and F14 are assigned to the second frame group (FG2), and frames F3, F6, F9, F12, and F15 are assigned to the third frame group (FG3).
As a result, the first to third frame groups (FG1-FG3) are 1/K the size of the entire video (MP) on the time axis. That is, the size (data volume) of each of the first to third frame groups (FG1-FG3) is reduced to 1/K of the size of the entire video (MP). A minimal sketch of this frame grouping is shown below.
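The frame grouping just described can be sketched in a few lines of Python; the helper name split_into_frame_groups and the array layout (T, H, W, 3) are our assumptions, not taken from the patent.

```python
import numpy as np

def split_into_frame_groups(video: np.ndarray, k: int) -> list[np.ndarray]:
    """Assign every k-th frame to the same group (FG1..FGk), as in FIG. 3."""
    t = (video.shape[0] // k) * k                 # drop trailing frames that do not fill a group
    return [video[g:t:k] for g in range(k)]

# The FIG. 3 example: 15 frames, K = 3.
video = np.zeros((15, 224, 224, 3), dtype=np.uint8)
fg1, fg2, fg3 = split_into_frame_groups(video, k=3)
print(fg1.shape)   # (5, 224, 224, 3) -> frames F1, F4, F7, F10, F13; 1/K of the frames
```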
The deep learning network unit 130 performs training with the first to third frame groups (FG1-FG3) as shown in FIG. 3 and infers the corresponding action recognition results. The deep learning network unit 130 may be implemented as a typical deep learning network based on a 3D CNN (Convolutional Neural Network), or by applying LSTM (Long Short-Term Memory) or the like to a 2D CNN. In the latter case, it can be implemented so as to estimate dynamic information by building the initial configuration from a previously trained 2D CNN network and extending it along the time axis. One possible form of such a network is sketched below.
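The patent does not commit to a particular architecture, so the following is only one possible sketch of a reduced-size per-group classifier consistent with the 3D CNN option mentioned above; the framework (PyTorch) and the layer sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Small3DCNN(nn.Module):
    """Illustrative 3D-CNN action classifier; sizes are assumptions, not from the patent."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),    # input: (B, 3, T, H, W)
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                        # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        return self.classifier(h)           # per-class scores; softmax gives probabilities

# One such reduced-size network would be instantiated per frame group.
net = Small3DCNN(num_classes=3)
clip = torch.randn(1, 3, 5, 112, 112)       # one frame group: 5 frames of 112x112 RGB
print(net(clip).shape)                      # torch.Size([1, 3])
```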
The deep learning network unit 130 may include first to third deep learning networks (131-133). In this case, the first deep learning network 131 performs training with the first frame group (FG1) and infers a first action recognition result, the second deep learning network 132 performs training with the second frame group (FG2) and infers a second action recognition result, and the third deep learning network 133 performs training with the third frame group (FG3) and infers a third action recognition result. Accordingly, the first to third deep learning networks 131-133 can be implemented with a complexity reduced by a factor of K compared to a typical deep learning network. A sketch of the corresponding per-group training loop follows.
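Under the assumption that each of the K networks is a PyTorch module (such as the Small3DCNN sketch above) and that a DataLoader of (clip, label) pairs has been prepared from each frame group, the per-group training described here might look roughly as follows:

```python
import torch
import torch.nn as nn

def train_per_group(networks, frame_group_loaders, epochs: int = 10, lr: float = 1e-3):
    """Train network k only on frame group k, as described for units 131-133.

    networks: list of K torch.nn.Module classifiers.
    frame_group_loaders: list of K DataLoaders, each yielding (clip, label) pairs
                         built from one frame group.
    """
    criterion = nn.CrossEntropyLoss()
    for net, loader in zip(networks, frame_group_loaders):
        optimizer = torch.optim.Adam(net.parameters(), lr=lr)
        net.train()
        for _ in range(epochs):
            for clips, labels in loader:
                optimizer.zero_grad()
                loss = criterion(net(clips), labels)   # each network sees only its own group
                loss.backward()
                optimizer.step()
```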
The final action recognition decision unit 140 performs weighted voting on the first to third action recognition results output from the first to third deep learning networks 131-133 and outputs the final action recognition result using the weighted voting results. Here, weighted voting refers to a voting method that gives more votes to the more reliable ones among the first to third action recognition results.
That is, the final action recognition decision unit 140 calculates the probability of each of the first to third action recognition results for each class, sums these results per class, and outputs the class with the highest probability as the final action recognition result.
For example, assuming that there are three action recognition classes (running, swimming, and arm exercise), the weighted voting performed by the final action recognition decision unit 140 on the action recognition results output from the first to third deep learning networks 131-133 can be explained as follows.
The final action recognition decision unit 140 performs weighted voting on the first action recognition result output from the first deep learning network 131 and may assign probability values of 0.5, 0.2, and 0.3 to running, swimming, and arm exercise. Likewise, it may assign probability values of 0.4, 0.3, and 0.2 for the second action recognition result output from the second deep learning network 132, and 0.3, 0.2, and 0.3 for the third action recognition result output from the third deep learning network 133.
The final action recognition decision unit 140 then sums the corresponding probability values per class, obtaining 1.2 (0.5 + 0.4 + 0.3) for running, 0.7 (0.2 + 0.3 + 0.2) for swimming, and 0.8 (0.3 + 0.2 + 0.3) for arm exercise.
Accordingly, the final action recognition decision unit 140 selects running, which has the largest summed probability value, as the final action recognition result and outputs it.
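The class-wise summation in this example can be written compactly; the sketch below reproduces the running / swimming / arm exercise numbers from the text, with the helper name weighted_vote being our own.

```python
def weighted_vote(per_network_probs: list[dict[str, float]]) -> str:
    """Sum per-class probabilities over all networks and return the top class."""
    totals: dict[str, float] = {}
    for probs in per_network_probs:
        for cls, p in probs.items():
            totals[cls] = totals.get(cls, 0.0) + p
    return max(totals, key=totals.get)

# The example from the text: three networks, classes running / swimming / arm exercise.
results = [
    {"running": 0.5, "swimming": 0.2, "arm exercise": 0.3},   # network 131
    {"running": 0.4, "swimming": 0.3, "arm exercise": 0.2},   # network 132
    {"running": 0.3, "swimming": 0.2, "arm exercise": 0.3},   # network 133
]
print(weighted_vote(results))   # "running" (1.2 vs 0.7 vs 0.8)
```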
Meanwhile, Figure 4 is a flowchart of the deep learning-based action recognition method according to the present invention.
Referring to FIG. 4, the deep learning-based action recognition method according to the present invention includes a video input step (S1), a frame classification step (S2), a deep learning network training step (S3), and a final action recognition decision step (S4).
The deep learning-based action recognition method according to an embodiment of the present invention configured as described above is explained below with reference to FIGS. 1 to 3.
In the video input step (S1), the video input unit 110 inputs a video captured through a camera. The video can be supplied through the same path as described above and has the form shown in FIG. 2.
In the frame classification step (S2), the frame classification unit 120 extracts one frame for every K frames of the video as shown in FIG. 3 and creates frame groups (FG1-FG3), each with a size of W × H × 3 × fps × N / K.
In the deep learning network training step (S3), the deep learning network unit 130 trains the first to third deep learning networks 131-133 with the first to third frame groups (FG1-FG3) and infers the first to third action recognition results.
In the final action recognition decision step (S4), the final action recognition decision unit 140 performs weighted voting on the first to third action recognition results output from the first to third deep learning networks 131-133. It then calculates the probability of each of the first to third action recognition results for each class, sums these results per class, and outputs the class with the highest probability as the final action recognition result.
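Putting steps S1-S4 together, and reusing the hypothetical split_into_frame_groups and weighted_vote helpers sketched above, the overall flow could look roughly like this, assuming each trained network is a callable that maps a frame group to a vector of per-class scores:

```python
import numpy as np

def recognize_action(video: np.ndarray, networks: list, k: int, classes: list[str]) -> str:
    """Illustrative S1-S4 flow: split frames (S2), infer per group (S3), weighted vote (S4)."""
    frame_groups = split_into_frame_groups(video, k)           # S2: one group per network
    per_network_probs = []
    for net, group in zip(networks, frame_groups):
        scores = np.asarray(net(group), dtype=float)           # S3: per-class scores for this group
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                                   # softmax -> class probabilities
        per_network_probs.append(dict(zip(classes, probs)))
    return weighted_vote(per_network_probs)                    # S4: class with highest summed probability
```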
Although preferred embodiments have been described and illustrated above to explain the technical idea of the present invention, the present invention is not limited to the configurations and operations shown and described, and those skilled in the art will appreciate that many changes and modifications are possible without departing from the scope of the technical idea. Accordingly, all such appropriate changes, modifications, and equivalents should be considered to fall within the scope of the present invention.

Claims (8)

  1. A deep learning-based action recognition device, comprising: a video input unit for inputting a video;
    a frame classification unit that extracts one frame from every K frames of the video and sequentially assigns it to K frame groups;
    a deep learning network unit comprising K deep learning networks, which trains each of the K deep learning networks with the corresponding frame group among the K frame groups and infers a respective action recognition result; and
    a final action recognition decision unit that performs weighted voting on the action recognition results and selects and outputs the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
  2. The deep learning-based action recognition device of claim 1, wherein each of the K frame groups has a size of W × H × 3 × fps × N / K,
    where W means the width of the video, H means the height of the frame, N means the input time of the video, fps means the frame rate (frames per second) of the video, and 3 means that the video has the three primary color channels R, G, and B.
  3. The deep learning-based action recognition device of claim 1, wherein the deep learning network is implemented based on a 3D CNN (Convolutional Neural Network) or by applying LSTM (Long Short-Term Memory) to a 2D CNN.
  4. The deep learning-based action recognition device of claim 1, wherein the deep learning network unit comprises:
    a first deep learning network that performs training with a first frame group supplied from the frame classification unit and infers a first action recognition result;
    a second deep learning network that performs training with a second frame group supplied from the frame classification unit and infers a second action recognition result; and
    a third deep learning network that performs training with a third frame group supplied from the frame classification unit and infers a third action recognition result.
  5. The deep learning-based action recognition device of claim 4, wherein the final action recognition decision unit performs weighted voting on the first to third action recognition results output from the first to third deep learning networks and outputs the final action recognition result using the weighted voting results.
  6. The deep learning-based action recognition device of claim 4, wherein the final action recognition decision unit calculates the probability of each of the first to third action recognition results for each class, sums the results per class, and outputs the class with the highest probability as the final action recognition result.
  7. A deep learning-based action recognition method, comprising: a video input step of inputting a video;
    a frame classification step of extracting one frame from every K frames of the video and sequentially assigning it to K frame groups;
    a deep learning network training step of providing K deep learning networks and training each of the K deep learning networks with the corresponding frame group among the K frame groups to infer a respective action recognition result; and
    a final action recognition decision step of performing weighted voting on the action recognition results and selecting and outputting the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
  8. The deep learning-based action recognition method of claim 7, wherein the weighted voting refers to a voting method that gives more votes to the more reliable ones among the action recognition results.
PCT/KR2022/019596 2022-08-19 2022-12-05 Deep learning-based action recognition device and method WO2024038969A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020220104171A KR20240026400A (en) 2022-08-19 2022-08-19 Apparatus for Performing Recognition of Activity Based on Deep Learning and Driving Method Thereof
KR10-2022-0104171 2022-08-19

Publications (1)

Publication Number Publication Date
WO2024038969A1 true WO2024038969A1 (en) 2024-02-22

Family

ID=89941982

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/019596 WO2024038969A1 (en) 2022-08-19 2022-12-05 Deep learning-based action recognition device and method

Country Status (2)

Country Link
KR (1) KR20240026400A (en)
WO (1) WO2024038969A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160096460A (en) * 2015-02-05 2016-08-16 삼성전자주식회사 Recognition system based on deep learning including a plurality of classfier and control method thereof
US20200076842A1 (en) * 2018-09-05 2020-03-05 Oracle International Corporation Malicious activity detection by cross-trace analysis and deep learning
KR20210040604A (en) * 2019-10-04 2021-04-14 광주과학기술원 Action recognition method and device
KR20220067833A (en) * 2020-11-18 2022-05-25 한국전자통신연구원 Position estimation system and method based on 3D terrain data
KR20220097329A (en) * 2020-12-31 2022-07-07 서울대학교산학협력단 Method and algorithm of deep learning network quantization for variable precision

Also Published As

Publication number Publication date
KR20240026400A (en) 2024-02-28

Similar Documents

Publication Publication Date Title
WO2018174623A1 (en) Apparatus and method for image analysis using virtual three-dimensional deep neural network
WO2018217019A1 (en) Device for detecting variant malicious code on basis of neural network learning, method therefor, and computer-readable recording medium in which program for executing same method is recorded
WO2021096009A1 (en) Method and device for supplementing knowledge on basis of relation network
CN111523410A (en) Video saliency target detection method based on attention mechanism
CN108564052A (en) Multi-cam dynamic human face recognition system based on MTCNN and method
CN109558781A (en) A kind of multi-angle video recognition methods and device, equipment and storage medium
WO2020149601A1 (en) Method and device for high-speed image recognition using 3d cnn
WO2021095987A1 (en) Multi-type entity-based knowledge complementing method and apparatus
CN110348358B (en) Skin color detection system, method, medium and computing device
Zhang et al. Training efficient saliency prediction models with knowledge distillation
CN116052218B (en) Pedestrian re-identification method
CN110046568A (en) A kind of video actions recognition methods based on Time Perception structure
CN114782998A (en) Abnormal behavior recognition method, system, device and medium with enhanced skeleton joint points
CN114743027B (en) Weak supervision learning-guided cooperative significance detection method
WO2022146080A1 (en) Algorithm and method for dynamically changing quantization precision of deep-learning network
WO2024038969A1 (en) Deep learning-based action recognition device and method
WO2021101052A1 (en) Weakly supervised learning-based action frame detection method and device, using background frame suppression
WO2024019337A1 (en) Video enhancement method and apparatus
CN111626212B (en) Method and device for identifying object in picture, storage medium and electronic device
WO2019112084A1 (en) Method for removing compression distortion by using cnn
WO2024101466A1 (en) Attribute-based missing person tracking apparatus and method
WO2022108275A1 (en) Method and device for generating virtual face by using artificial intelligence
CN116091844A (en) Image data processing method and system based on edge calculation
WO2022131720A1 (en) Device and method for generating building image
WO2022107925A1 (en) Deep learning object detection processing device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22955841

Country of ref document: EP

Kind code of ref document: A1