WO2024038969A1 - Deep learning-based action recognition device and method - Google Patents

Deep learning-based action recognition device and method Download PDF

Info

Publication number
WO2024038969A1
Authority
WO
WIPO (PCT)
Prior art keywords
deep learning
action recognition
frame
video
results
Prior art date
Application number
PCT/KR2022/019596
Other languages
French (fr)
Korean (ko)
Inventor
장달원
이종설
이재원
Original Assignee
한국전자기술연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자기술연구원 filed Critical 한국전자기술연구원
Publication of WO2024038969A1 publication Critical patent/WO2024038969A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Definitions

  • the present invention relates to a technology for recognizing human or animal behavior by performing deep learning on video captured through a camera, and more specifically to a deep learning-based action recognition device and method that performs deep learning training on a reduced amount of input video data while preventing the resulting degradation in action recognition performance.
  • the image size and frame rate of the input video are adjusted appropriately during the preprocessing process, regardless of which deep learning structure is used internally.
  • an action recognition process is finally performed to determine what actions exist in the input video data.
  • several candidate behavior classes are selected and the behavior is inferred based on one of them.
  • the above input video data is input into the deep learning network as is, or information such as optical flow is additionally input.
  • the action classes pursued by each application and the composition of the videos may be completely different. Therefore, it is necessary to newly train the deep learning network in a form suitable for each application.
  • the inference process can also consume a lot of time due to the complexity of the deep learning network. In order to improve this, it is necessary to simplify the deep learning network, but this has the problem of lowering performance.
  • the technical problem that the present invention aims to achieve is to perform deep learning on videos captured through a camera to recognize the behavior of people or animals, and to perform deep learning training by reducing the amount of input video data.
  • the purpose is to provide a deep learning-based action recognition device and method that compensates during the inference process so that the reduced data volume does not degrade action recognition performance.
  • a deep learning-based behavior recognition device includes a video input unit for inputting a video; a frame classification unit that extracts one frame from every K frames from the video and sequentially assigns it to K frame groups; A deep learning network unit comprising K deep learning networks, each training a corresponding deep learning network among the K deep learning networks using a corresponding frame group among the K frame groups, and inferring respective action recognition results accordingly; and a final action recognition decision unit that performs weighted voting on the action recognition results and selects and outputs the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
  • the K frame groups each have a size of W × H × 3 × fps × N / K, where W means the width of the video, H means the height of the frame, N means the input time of the video, fps means the frame rate (frames per second) of the video, and 3 means that the video has the three primary color channels R, G, and B.
  • the deep learning network is characterized by being implemented based on a 3D CNN (Convolutional Neural Network) or by applying LSTM (Long Short-Term Memory) to a 2D CNN.
  • the deep learning network unit includes a first deep learning network that performs training with the first group of frames supplied from the frame classification unit and infers a first action recognition result accordingly; a second deep learning network that performs training with a second group of frames supplied from the frame classification unit and infers a second action recognition result accordingly; and a third deep learning network that performs training with the third frame group supplied from the frame classification unit and infers the third action recognition result accordingly.
  • the action recognition final decision unit performs weighted voting on the first to third action recognition results output from the first to third deep learning networks, and outputs the final action recognition result using the weighted voting results.
  • the action recognition final decision unit calculates the probability of each of the first to third action recognition results for each class, sums the calculation results per class, and outputs the class with the highest probability as the final action recognition result.
  • the deep learning-based action recognition method includes a video input step of inputting a video; a frame classification step of extracting one frame from every K frames of the video and sequentially assigning it to K frame groups; a deep learning network training step in which K deep learning networks are provided and each is trained with the corresponding frame group among the K frame groups to infer a respective action recognition result; and a final action recognition decision step of conducting weighted voting on the action recognition results and selecting and outputting the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
  • weighted voting is characterized in that it refers to a voting method in which more votes are given to behavior recognition results with higher reliability.
  • the deep learning-based action recognition device and method of the present invention have the following effects.
  • Figure 1 is a block diagram of a deep learning-based behavior recognition device according to the present invention.
  • Figure 2 is a diagram illustrating the structure of a video according to the present invention.
  • Figure 3 is an explanatory diagram of the deep learning-based action recognition principle according to the present invention.
  • Figure 4 is a flow chart of the deep learning-based action recognition method according to the present invention.
  • Figure 1 is a block diagram of a deep learning-based action recognition device according to an embodiment of the present invention,
  • Figure 2 is a diagram illustrating the form of a video according to the present invention, and
  • Figure 3 is an explanatory diagram of the deep learning-based action recognition principle according to the present invention.
  • the deep learning-based action recognition device 100 includes a video input unit 110, a frame classification unit 120, a deep learning network unit 130, and a final action recognition decision unit 140.
  • the video input unit 110 inputs video captured through a camera.
  • the supply path for the video is not particularly limited.
  • the video may include video supplied from a camera or storage media such as memory and database, or video transmitted through a wired or wireless network.
  • Figure 2 exemplarily shows the form of the video.
  • the video (MP) may be composed of several video frames (hereinafter referred to as 'frames').
  • W means the width of the video (MP) made up of frames
  • H means the height of the frame
  • N means the input time of the video (MP)
  • fps means the frame rate (frames per second) of the video (MP)
  • a pre-processing process may be performed in the video input unit 110, whereby the video is converted into the form W × H × 3 × fps × N.
  • 3 is given because the input video has three primary color information: red (R), green (G), and blue (B).
  • the frame classification unit 120 extracts one frame for every K frames from the input video and creates K frame groups, each with a size of W × H × 3 × fps × N / K.
  • FIG. 3 exemplarily shows that three frame groups (FG1-FG3) are formed by the frame classification unit 120.
  • when the input video (MP) consists of 15 frames (F1-F15), one frame is selected for every K frames (e.g., K = 3) and assigned in turn to the first to third frame groups (FG1-FG3)
  • the frames F1, F4, F7, F10, and F13 are allocated to the first frame group (FG1)
  • the frames F2, F5, F8, F11, and F14 are allocated to the second frame group (FG2).
  • the frames F3, F6, F9, F12, and F15 are allocated to the third frame group (FG3).
  • the 1st-3rd frame groups (FG1-FG3) have a size reduced by 1/K on the time axis compared to the entire video (MP). That is, the size of the 1st-3rd frame groups (FG1-FG3) is reduced by 1/K compared to the size (data capacity) of the entire video (MP).
  • the deep learning network unit 130 performs training with the 1-3 frame groups (FG1-FG3) as shown in FIG. 3 and infers each action recognition result accordingly.
  • the deep learning network unit 130 may be implemented as a typical deep learning network based on a 3D CNN (Convolutional Neural Network), or may be implemented by applying LSTM (Long Short-Term Memory), etc. to a 2D CNN. In this case, it can be implemented to estimate dynamic information by initially configuring a network of a previously trained 2D CNN and extending it on the time axis.
  • the deep learning network unit 130 may include first-third deep learning networks (131-133).
  • the first deep learning network 131 performs training with the first frame group (FG1) and infers the first action recognition result accordingly.
  • the second deep learning network 132 performs training with the second frame group (FG2) and infers the second action recognition result accordingly.
  • the third deep learning network 133 performs training with the third frame group (FG3) and infers the third action recognition result accordingly. The first to third deep learning networks 131 to 133 can therefore be implemented with a complexity reduced by a factor of K compared to a typical deep learning network.
  • the action recognition final decision unit 140 performs weighted voting on the first to third action recognition results output from the first to third deep learning networks 131 to 133, and outputs the final action recognition result using the weighted voting results.
  • the weighted voting refers to a voting method that gives more votes to action recognition results with higher reliability relative to the first to third action recognition results.
  • the action recognition final decision unit 140 calculates the probability of each of the first to third action recognition results for each class, sums these results per class, and outputs the class with the highest probability as the final action recognition result.
  • for example, assuming three action recognition classes (running, swimming, and arm exercise), the weighted voting performed by the action recognition final decision unit 140 on the action recognition results output from the first to third deep learning networks 131 to 133 proceeds as follows:
  • the action recognition final decision unit 140 performs weighted voting on the first action recognition result output from the first deep learning network 131 and may assign probability values of 0.5, 0.2, and 0.3 to running, swimming, and arm exercise. Likewise, it may assign probability values of 0.4, 0.3, and 0.2 for the second action recognition result output from the second deep learning network 132, and 0.3, 0.2, and 0.3 for the third action recognition result output from the third deep learning network 133.
  • the action recognition final decision unit 140 then sums the corresponding probability values per class, obtaining 1.2 (0.5 + 0.4 + 0.3) for running, 0.7 (0.2 + 0.3 + 0.2) for swimming, and 0.8 (0.3 + 0.2 + 0.3) for arm exercise.
  • the action recognition final decision unit 140 selects and outputs the run with the largest value among the summation results of probability values as the final action recognition result.
  • Figure 4 is a flowchart of the deep learning-based action recognition method according to the present invention.
  • the deep learning-based action recognition method includes a video input step (S1), a frame classification step (S2), a deep learning network training step (S3), and a final action recognition decision step (S4).
  • the video input unit 110 inputs a video captured through a camera.
  • the video can be supplied through the same path as in the description above, and has the same form as in FIG. 2.
  • the frame classification unit 120 extracts one frame for every K frames of the video as shown in FIG. 3 and creates frame groups (FG1-FG3), each with a size of W × H × 3 × fps × N / K.
  • the deep learning network unit 130 trains the first to third deep learning networks 131-133 with the first to third frame groups (FG1-FG3) and infers the first to third action recognition results accordingly.
  • the action recognition final decision unit 140 performs weighted voting on the first to third action recognition results output from the first to third deep learning networks 131-133. It then calculates the probability of each result for each class, sums these results per class, and outputs the class with the highest probability as the final action recognition result.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a deep learning-based action recognition device and method for performing deep learning on a video filmed by a camera to recognize a person's or animal's actions. The deep learning-based action recognition device according to the present invention comprises: a deep learning network unit which trains each of K deep learning networks with the corresponding frame group among K frame groups and infers a respective action recognition result; and a final action recognition determination unit which performs weighted voting on the action recognition results and selects and outputs, as the final action recognition result, the action recognition result having the highest probability value among the weighted voting results.

Description

Deep learning-based action recognition device and method
The present invention relates to a technology for recognizing human or animal behavior by performing deep learning on video captured through a camera, and more specifically to a deep learning-based action recognition device and method that performs deep learning training on a reduced amount of input video data while preventing the resulting degradation in action recognition performance.
Recently, deep learning-based recognition technology has been widely used to implement functions such as detecting a specific object (object detection) or recognizing a face (face recognition) in images captured by a camera. Improving the recognition rate of such image recognition technology requires a large amount of data and a deep learning training process based on it, which in turn requires high-performance computing power.
When recognizing an object in a video captured by a camera, object recognition is possible from a single image extracted from each frame of the video. In contrast, recognizing the behavior of people or animals in a video requires information from several frames of the video. A large amount of data is then needed during the deep learning training process, and technology development is required so that training can proceed smoothly.
To recognize human or animal behavior in a video, the image size and frame rate of the input video are adjusted appropriately during preprocessing, regardless of which deep learning structure is used internally.
Afterwards, after a classification process, an action recognition process is finally performed to determine what actions exist in the input video data. At this point, several candidate action classes are selected and the action is inferred based on one of them. In the action recognition process, the input video data is fed into the deep learning network as is, or information such as optical flow is additionally input.
Paper [1] below presents a technique that performs action recognition with a single network structure without using optical flow. Paper [2] presents a technique that combines a deep learning network taking optical flow as input with a deep learning network taking the video itself as input.
Very high complexity is required for a deep learning network to achieve optimal performance, and when a deep learning network is newly trained for a special situation, the expected performance may not be achieved because of that complexity.
When a deep learning network is used to recognize actions in input videos, the action classes pursued by each application and the composition of the videos may be completely different. Therefore, the deep learning network needs to be newly trained in a form suited to each application.
In addition, the inference process can also consume a lot of time due to the complexity of the deep learning network. Simplifying the deep learning network improves this, but at the cost of lower performance.
[Prior art literature]
[Papers]
[1] Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, Boqing Gong, "MoViNets: Mobile Video Networks for Efficient Video Recognition," in Proc. on CVPR 2021.
[2] J. Carreira and A. Zisserman, "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset," in Proc. on CVPR 2017.
The technical problem the present invention aims to solve is, when performing deep learning on videos captured through a camera to recognize the behavior of people or animals, to perform deep learning training on a reduced amount of input video data and to compensate during the inference process, thereby providing a deep learning-based action recognition device and method in which the reduced data volume does not degrade action recognition performance.
To achieve the above object, a deep learning-based action recognition device according to the present invention includes a video input unit for inputting a video; a frame classification unit that extracts one frame from every K frames of the video and assigns it sequentially to K frame groups; a deep learning network unit comprising K deep learning networks, which trains each of the K deep learning networks with the corresponding frame group among the K frame groups and infers a respective action recognition result; and a final action recognition decision unit that performs weighted voting on the action recognition results and selects and outputs the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
Each of the K frame groups has a size of W × H × 3 × fps × N / K, where W means the width of the video, H means the height of the frame, N means the input time of the video, fps means the frame rate (frames per second) of the video, and 3 means that the video has the three primary color channels R, G, and B.
The deep learning network is implemented based on a 3D CNN (Convolutional Neural Network) or by applying LSTM (Long Short-Term Memory) to a 2D CNN.
The deep learning network unit includes a first deep learning network that performs training with a first frame group supplied from the frame classification unit and infers a first action recognition result; a second deep learning network that performs training with a second frame group supplied from the frame classification unit and infers a second action recognition result; and a third deep learning network that performs training with a third frame group supplied from the frame classification unit and infers a third action recognition result.
The final action recognition decision unit performs weighted voting on the first to third action recognition results output from the first to third deep learning networks and outputs the final action recognition result using the weighted voting results.
The final action recognition decision unit calculates the probability of each of the first to third action recognition results for each class, sums the results per class, and outputs the class with the highest probability as the final action recognition result.
To achieve the above object, a deep learning-based action recognition method according to the present invention includes a video input step of inputting a video; a frame classification step of extracting one frame from every K frames of the video and assigning it sequentially to K frame groups; a deep learning network training step of providing K deep learning networks, training each of the K deep learning networks with the corresponding frame group among the K frame groups, and inferring a respective action recognition result; and a final action recognition decision step of performing weighted voting on the action recognition results and selecting and outputting the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
The weighted voting refers to a voting method that gives more votes to the more reliable ones among the action recognition results.
The deep learning-based action recognition device and method of the present invention have the following effects.
First, when processing a large amount of training data, training is performed on data of reduced size, so action recognition can be processed quickly.
Second, by changing the input of the deep learning network for action recognition in a way that reduces the number of frames, performance degradation can be avoided.
Third, by using a low frame rate, an action recognition device with a structure well suited to detecting dynamic information can be implemented.
Figure 1 is a block diagram of a deep learning-based action recognition device according to the present invention.
Figure 2 is a diagram illustrating the structure of a video according to the present invention.
Figure 3 is an explanatory diagram of the deep learning-based action recognition principle according to the present invention.
Figure 4 is a flowchart of the deep learning-based action recognition method according to the present invention.
Hereinafter, embodiments of the present invention will be described in detail with reference to the attached drawings. First, note that when adding reference numerals to the components in each drawing, the same components are given the same reference numerals as far as possible, even if they appear in different drawings. In addition, in describing the present invention, detailed descriptions of related known configurations or functions are omitted when they are obvious to those skilled in the art or could obscure the gist of the present invention.
Figure 1 is a block diagram of a deep learning-based action recognition device according to an embodiment of the present invention, Figure 2 is a diagram illustrating the form of a video according to the present invention, and Figure 3 is an explanatory diagram of the deep learning-based action recognition principle according to the present invention.
Referring to FIG. 1, the deep learning-based action recognition device 100 includes a video input unit 110, a frame classification unit 120, a deep learning network unit 130, and a final action recognition decision unit 140.
The operation of the deep learning-based action recognition device according to an embodiment of the present invention configured as described above is explained below with reference to FIGS. 2 and 3.
The video input unit 110 inputs video captured through a camera. The supply path of the video is not particularly limited. For example, the video may be supplied from a camera or from a storage medium such as a memory or database, or it may be transmitted over a wired or wireless network.
Figure 2 shows the form of the video by way of example. As shown in FIG. 2, the video (MP) may be composed of several video frames (hereinafter referred to as 'frames'). Here, W means the width of the video (MP) made up of frames, H means the height of the frame, N means the input time of the video (MP), and fps means the frame rate (frames per second) of the video (MP).
A preprocessing step may be performed in the video input unit 110, whereby the video is converted into the form W × H × 3 × fps × N. Here, 3 is included because the input video carries the three primary color channels red (R), green (G), and blue (B).
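As a rough illustration of this preprocessed form, the Python sketch below (not part of the original disclosure) stacks T = fps × N decoded RGB frames into a single array and reports its size; the concrete values 224 × 224 pixels, 30 fps, and 10 seconds are assumed purely for the example.

```python
import numpy as np

# Assumed example values; the patent does not fix W, H, fps, or N.
W, H, FPS, N_SECONDS = 224, 224, 30, 10
T = FPS * N_SECONDS                        # total number of frames (fps x N)

# Stand-in for decoded camera frames; a real pipeline would read these from a video file.
frames = [np.zeros((H, W, 3), dtype=np.uint8) for _ in range(T)]

video = np.stack(frames)                   # shape (T, H, W, 3): the W x H x 3 x fps x N form
print(video.shape)                         # (300, 224, 224, 3)
print(f"{video.nbytes / 1e6:.1f} MB of raw pixel data")   # data volume before any reduction
```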
The frame classification unit 120 extracts one frame for every K frames of the input video and creates K frame groups, each with a size of W × H × 3 × fps × N / K. FIG. 3 shows, by way of example, three frame groups (FG1-FG3) formed by the frame classification unit 120.
Referring to FIG. 3, when the input video (MP) consists of 15 frames (F1-F15), one frame is selected for every K frames (e.g., K = 3) and assigned in turn to the first to third frame groups (FG1-FG3). Accordingly, frames F1, F4, F7, F10, and F13 are assigned to the first frame group (FG1), frames F2, F5, F8, F11, and F14 are assigned to the second frame group (FG2), and frames F3, F6, F9, F12, and F15 are assigned to the third frame group (FG3).
As a result, the first to third frame groups (FG1-FG3) are 1/K the size of the entire video (MP) on the time axis. That is, the size (data volume) of each of the first to third frame groups (FG1-FG3) is reduced to 1/K of the size of the entire video (MP). A minimal sketch of this frame grouping is shown below.
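The frame grouping just described can be sketched in a few lines of Python; the helper name split_into_frame_groups and the array layout (T, H, W, 3) are our assumptions, not taken from the patent.

```python
import numpy as np

def split_into_frame_groups(video: np.ndarray, k: int) -> list[np.ndarray]:
    """Assign every k-th frame to the same group (FG1..FGk), as in FIG. 3."""
    t = (video.shape[0] // k) * k                 # drop trailing frames that do not fill a group
    return [video[g:t:k] for g in range(k)]

# The FIG. 3 example: 15 frames, K = 3.
video = np.zeros((15, 224, 224, 3), dtype=np.uint8)
fg1, fg2, fg3 = split_into_frame_groups(video, k=3)
print(fg1.shape)   # (5, 224, 224, 3) -> frames F1, F4, F7, F10, F13; 1/K of the frames
```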
The deep learning network unit 130 performs training with the first to third frame groups (FG1-FG3) as shown in FIG. 3 and infers the corresponding action recognition results. The deep learning network unit 130 may be implemented as a typical deep learning network based on a 3D CNN (Convolutional Neural Network), or by applying LSTM (Long Short-Term Memory) or the like to a 2D CNN. In the latter case, it can be implemented so as to estimate dynamic information by building the initial configuration from a previously trained 2D CNN network and extending it along the time axis. One possible form of such a network is sketched below.
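The patent does not commit to a particular architecture, so the following is only one possible sketch of a reduced-size per-group classifier consistent with the 3D CNN option mentioned above; the framework (PyTorch) and the layer sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Small3DCNN(nn.Module):
    """Illustrative 3D-CNN action classifier; sizes are assumptions, not from the patent."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),    # input: (B, 3, T, H, W)
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                        # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        return self.classifier(h)           # per-class scores; softmax gives probabilities

# One such reduced-size network would be instantiated per frame group.
net = Small3DCNN(num_classes=3)
clip = torch.randn(1, 3, 5, 112, 112)       # one frame group: 5 frames of 112x112 RGB
print(net(clip).shape)                      # torch.Size([1, 3])
```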
The deep learning network unit 130 may include first to third deep learning networks (131-133). In this case, the first deep learning network 131 performs training with the first frame group (FG1) and infers a first action recognition result, the second deep learning network 132 performs training with the second frame group (FG2) and infers a second action recognition result, and the third deep learning network 133 performs training with the third frame group (FG3) and infers a third action recognition result. Accordingly, the first to third deep learning networks 131-133 can be implemented with a complexity reduced by a factor of K compared to a typical deep learning network. A sketch of the corresponding per-group training loop follows.
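Under the assumption that each of the K networks is a PyTorch module (such as the Small3DCNN sketch above) and that a DataLoader of (clip, label) pairs has been prepared from each frame group, the per-group training described here might look roughly as follows:

```python
import torch
import torch.nn as nn

def train_per_group(networks, frame_group_loaders, epochs: int = 10, lr: float = 1e-3):
    """Train network k only on frame group k, as described for units 131-133.

    networks: list of K torch.nn.Module classifiers.
    frame_group_loaders: list of K DataLoaders, each yielding (clip, label) pairs
                         built from one frame group.
    """
    criterion = nn.CrossEntropyLoss()
    for net, loader in zip(networks, frame_group_loaders):
        optimizer = torch.optim.Adam(net.parameters(), lr=lr)
        net.train()
        for _ in range(epochs):
            for clips, labels in loader:
                optimizer.zero_grad()
                loss = criterion(net(clips), labels)   # each network sees only its own group
                loss.backward()
                optimizer.step()
```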
The final action recognition decision unit 140 performs weighted voting on the first to third action recognition results output from the first to third deep learning networks 131-133 and outputs the final action recognition result using the weighted voting results. Here, weighted voting refers to a voting method that gives more votes to the more reliable ones among the first to third action recognition results.
That is, the final action recognition decision unit 140 calculates the probability of each of the first to third action recognition results for each class, sums these results per class, and outputs the class with the highest probability as the final action recognition result.
For example, assuming that there are three action recognition classes (running, swimming, and arm exercise), the weighted voting performed by the final action recognition decision unit 140 on the action recognition results output from the first to third deep learning networks 131-133 can be explained as follows.
The final action recognition decision unit 140 performs weighted voting on the first action recognition result output from the first deep learning network 131 and may assign probability values of 0.5, 0.2, and 0.3 to running, swimming, and arm exercise. Likewise, it may assign probability values of 0.4, 0.3, and 0.2 for the second action recognition result output from the second deep learning network 132, and 0.3, 0.2, and 0.3 for the third action recognition result output from the third deep learning network 133.
The final action recognition decision unit 140 then sums the corresponding probability values per class, obtaining 1.2 (0.5 + 0.4 + 0.3) for running, 0.7 (0.2 + 0.3 + 0.2) for swimming, and 0.8 (0.3 + 0.2 + 0.3) for arm exercise.
Accordingly, the final action recognition decision unit 140 selects running, which has the largest summed probability value, as the final action recognition result and outputs it.
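The class-wise summation in this example can be written compactly; the sketch below reproduces the running / swimming / arm exercise numbers from the text, with the helper name weighted_vote being our own.

```python
def weighted_vote(per_network_probs: list[dict[str, float]]) -> str:
    """Sum per-class probabilities over all networks and return the top class."""
    totals: dict[str, float] = {}
    for probs in per_network_probs:
        for cls, p in probs.items():
            totals[cls] = totals.get(cls, 0.0) + p
    return max(totals, key=totals.get)

# The example from the text: three networks, classes running / swimming / arm exercise.
results = [
    {"running": 0.5, "swimming": 0.2, "arm exercise": 0.3},   # network 131
    {"running": 0.4, "swimming": 0.3, "arm exercise": 0.2},   # network 132
    {"running": 0.3, "swimming": 0.2, "arm exercise": 0.3},   # network 133
]
print(weighted_vote(results))   # "running" (1.2 vs 0.7 vs 0.8)
```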
Meanwhile, Figure 4 is a flowchart of the deep learning-based action recognition method according to the present invention.
Referring to FIG. 4, the deep learning-based action recognition method according to the present invention includes a video input step (S1), a frame classification step (S2), a deep learning network training step (S3), and a final action recognition decision step (S4).
The deep learning-based action recognition method according to an embodiment of the present invention configured as described above is explained below with reference to FIGS. 1 to 3.
In the video input step (S1), the video input unit 110 inputs a video captured through a camera. The video can be supplied through the same path as described above and has the form shown in FIG. 2.
In the frame classification step (S2), the frame classification unit 120 extracts one frame for every K frames of the video as shown in FIG. 3 and creates frame groups (FG1-FG3), each with a size of W × H × 3 × fps × N / K.
In the deep learning network training step (S3), the deep learning network unit 130 trains the first to third deep learning networks 131-133 with the first to third frame groups (FG1-FG3) and infers the first to third action recognition results.
In the final action recognition decision step (S4), the final action recognition decision unit 140 performs weighted voting on the first to third action recognition results output from the first to third deep learning networks 131-133. It then calculates the probability of each of the first to third action recognition results for each class, sums these results per class, and outputs the class with the highest probability as the final action recognition result.
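Putting steps S1-S4 together, and reusing the hypothetical split_into_frame_groups and weighted_vote helpers sketched above, the overall flow could look roughly like this, assuming each trained network is a callable that maps a frame group to a vector of per-class scores:

```python
import numpy as np

def recognize_action(video: np.ndarray, networks: list, k: int, classes: list[str]) -> str:
    """Illustrative S1-S4 flow: split frames (S2), infer per group (S3), weighted vote (S4)."""
    frame_groups = split_into_frame_groups(video, k)           # S2: one group per network
    per_network_probs = []
    for net, group in zip(networks, frame_groups):
        scores = np.asarray(net(group), dtype=float)           # S3: per-class scores for this group
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                                   # softmax -> class probabilities
        per_network_probs.append(dict(zip(classes, probs)))
    return weighted_vote(per_network_probs)                    # S4: class with highest summed probability
```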
Although preferred embodiments have been described and illustrated above to explain the technical idea of the present invention, the present invention is not limited to the configurations and operations shown and described, and those skilled in the art will appreciate that many changes and modifications are possible without departing from the scope of the technical idea. Accordingly, all such appropriate changes, modifications, and equivalents should be considered to fall within the scope of the present invention.

Claims (8)

  1. A deep learning-based action recognition device, comprising: a video input unit for inputting a video;
    a frame classification unit that extracts one frame from every K frames of the video and sequentially assigns it to K frame groups;
    a deep learning network unit comprising K deep learning networks, which trains each of the K deep learning networks with the corresponding frame group among the K frame groups and infers a respective action recognition result; and
    a final action recognition decision unit that performs weighted voting on the action recognition results and selects and outputs the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
  2. The deep learning-based action recognition device of claim 1, wherein each of the K frame groups has a size of W × H × 3 × fps × N / K,
    where W means the width of the video, H means the height of the frame, N means the input time of the video, fps means the frame rate (frames per second) of the video, and 3 means that the video has the three primary color channels R, G, and B.
  3. The deep learning-based action recognition device of claim 1, wherein the deep learning network is implemented based on a 3D CNN (Convolutional Neural Network) or by applying LSTM (Long Short-Term Memory) to a 2D CNN.
  4. The deep learning-based action recognition device of claim 1, wherein the deep learning network unit comprises:
    a first deep learning network that performs training with a first frame group supplied from the frame classification unit and infers a first action recognition result;
    a second deep learning network that performs training with a second frame group supplied from the frame classification unit and infers a second action recognition result; and
    a third deep learning network that performs training with a third frame group supplied from the frame classification unit and infers a third action recognition result.
  5. The deep learning-based action recognition device of claim 4, wherein the final action recognition decision unit performs weighted voting on the first to third action recognition results output from the first to third deep learning networks and outputs the final action recognition result using the weighted voting results.
  6. The deep learning-based action recognition device of claim 4, wherein the final action recognition decision unit calculates the probability of each of the first to third action recognition results for each class, sums the results per class, and outputs the class with the highest probability as the final action recognition result.
  7. A deep learning-based action recognition method, comprising: a video input step of inputting a video;
    a frame classification step of extracting one frame from every K frames of the video and sequentially assigning it to K frame groups;
    a deep learning network training step of providing K deep learning networks and training each of the K deep learning networks with the corresponding frame group among the K frame groups to infer a respective action recognition result; and
    a final action recognition decision step of performing weighted voting on the action recognition results and selecting and outputting the action recognition result with the highest probability value among the weighted voting results as the final action recognition result.
  8. The deep learning-based action recognition method of claim 7, wherein the weighted voting refers to a voting method that gives more votes to the more reliable ones among the action recognition results.
PCT/KR2022/019596 2022-08-19 2022-12-05 Deep learning-based action recognition device and method WO2024038969A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020220104171A KR20240026400A (en) 2022-08-19 2022-08-19 Apparatus for Performing Recognition of Activity Based on Deep Learning and Driving Method Thereof
KR10-2022-0104171 2022-08-19

Publications (1)

Publication Number Publication Date
WO2024038969A1 true WO2024038969A1 (en) 2024-02-22

Family

ID=89941982

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/019596 WO2024038969A1 (en) 2022-08-19 2022-12-05 Deep learning-based action recognition device and method

Country Status (2)

Country Link
KR (1) KR20240026400A (en)
WO (1) WO2024038969A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160096460A (en) * 2015-02-05 2016-08-16 삼성전자주식회사 Recognition system based on deep learning including a plurality of classfier and control method thereof
US20200076842A1 (en) * 2018-09-05 2020-03-05 Oracle International Corporation Malicious activity detection by cross-trace analysis and deep learning
KR20210040604A (en) * 2019-10-04 2021-04-14 광주과학기술원 Action recognition method and device
KR20220067833A (en) * 2020-11-18 2022-05-25 한국전자통신연구원 Position estimation system and method based on 3D terrain data
KR20220097329A (en) * 2020-12-31 2022-07-07 서울대학교산학협력단 Method and algorithm of deep learning network quantization for variable precision

Also Published As

Publication number Publication date
KR20240026400A (en) 2024-02-28

Similar Documents

Publication Publication Date Title
WO2018174623A1 (en) Apparatus and method for image analysis using virtual three-dimensional deep neural network
WO2018217019A1 (en) Device for detecting variant malicious code on basis of neural network learning, method therefor, and computer-readable recording medium in which program for executing same method is recorded
WO2021096009A1 (en) Method and device for supplementing knowledge on basis of relation network
CN111523410A (en) Video saliency target detection method based on attention mechanism
CN108564052A (en) Multi-cam dynamic human face recognition system based on MTCNN and method
CN109558781A (en) A kind of multi-angle video recognition methods and device, equipment and storage medium
WO2020149601A1 (en) Method and device for high-speed image recognition using 3d cnn
WO2021095987A1 (en) Multi-type entity-based knowledge complementing method and apparatus
CN110348358B (en) Skin color detection system, method, medium and computing device
Zhang et al. Training efficient saliency prediction models with knowledge distillation
CN116052218B (en) Pedestrian re-identification method
CN110046568A (en) A kind of video actions recognition methods based on Time Perception structure
CN114782998A (en) Abnormal behavior recognition method, system, device and medium with enhanced skeleton joint points
CN114743027B (en) Weak supervision learning-guided cooperative significance detection method
WO2022146080A1 (en) Algorithm and method for dynamically changing quantization precision of deep-learning network
WO2024038969A1 (en) Deep learning-based action recognition device and method
WO2021101052A1 (en) Weakly supervised learning-based action frame detection method and device, using background frame suppression
WO2024019337A1 (en) Video enhancement method and apparatus
CN111626212B (en) Method and device for identifying object in picture, storage medium and electronic device
WO2019112084A1 (en) Method for removing compression distortion by using cnn
WO2024101466A1 (en) Attribute-based missing person tracking apparatus and method
WO2022108275A1 (en) Method and device for generating virtual face by using artificial intelligence
CN116091844A (en) Image data processing method and system based on edge calculation
WO2022131720A1 (en) Device and method for generating building image
WO2022107925A1 (en) Deep learning object detection processing device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22955841

Country of ref document: EP

Kind code of ref document: A1