WO2021101052A1 - Method and device for detecting action frames based on weakly supervised learning, using background frame suppression

Method and device for detecting action frames based on weakly supervised learning, using background frame suppression

Info

Publication number
WO2021101052A1
Authority
WO
WIPO (PCT)
Prior art keywords
class
activation sequence
frame
generating
behavior
Prior art date
Application number
PCT/KR2020/012645
Other languages
English (en)
Korean (ko)
Inventor
변혜란 (Hyeran Byun)
이필현 (Pilhyeon Lee)
Original Assignee
연세대학교 산학협력단 (Yonsei University Industry-Academic Cooperation Foundation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 연세대학교 산학협력단 (Yonsei University Industry-Academic Cooperation Foundation)
Publication of WO2021101052A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/14Picture signal circuitry for video frequency region
    • H04N5/144Movement detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics

Definitions

  • This study is related to research on source technology for deep learning-based semantic context understanding under the Next-Generation Information Computing Technology Development Program, conducted with the support of the Ministry of Science and ICT (No. NRF-2017M3C4A7069370).
  • A technology that detects only the action frames is therefore required; a video refined by detecting its action frames is easy to use not only for viewers but also as training data for other deep learning models.
  • Patent Document 1: Korean Laid-Open Patent Publication No. 10-2015-0127684 (2015.11.17)
  • Patent Document 2: Korean Registered Patent Publication No. 10-0785076 (2007.12.05)
  • Embodiments of the present invention have as a main purpose to accurately detect action frames based on weakly supervised learning by generating an adjusted class activation sequence through an action classification model in which background frames are suppressed from a feature map extracted from a video.
  • According to one aspect, there is provided a method for detecting action frames based on weakly supervised learning, performed by a computing device, the method including extracting a feature map from a video and applying a foreground weight to the feature map to generate an adjusted class activation sequence through an action classification model.
  • The extracting of the feature map may include converting the video into a plurality of color frames and extracting color features from them, converting the video into optical flow frames and extracting optical flow features from them, and generating the feature map by combining the color features and the optical flow features.
  • The foreground weight may be adjusted by filtering background frames, through a filtering model applied to the feature map, so that they are not activated in the adjusted class activation sequence.
  • Positive learning for the action class and negative learning for the background class may be performed by calculating an adjusted class score from the adjusted class activation sequence.
  • A base class activation sequence may also be generated for the feature map through the action classification model; a base class score may be calculated from it, and positive learning may be performed for the action class and the background class.
  • The generating of the base class activation sequence and the generating of the adjusted class activation sequence may share the weights of the action classification model, so that both are learned together.
  • According to another aspect, there is provided an apparatus for detecting action frames, including at least one processor and a memory storing at least one program executed by the at least one processor, wherein the processor extracts a feature map from a video and generates an adjusted class activation sequence through an action classification model by applying a foreground weight to the feature map.
  • The processor may convert the video into a plurality of color frames and extract color features from them, convert the video into optical flow frames and extract optical flow features from them, and generate the feature map by combining the color features and the optical flow features.
  • The processor may generate the adjusted class activation sequence by filtering background frames, through a filtering model applied to the feature map that adjusts the foreground weight, so that background frames are not activated in the adjusted class activation sequence.
  • The processor may generate a base class activation sequence for the feature map through the action classification model while generating the adjusted class activation sequence.
  • The processor may calculate a base class score from the base class activation sequence to perform positive learning for the action class and the background class, and may calculate an adjusted class score from the adjusted class activation sequence to perform positive learning for the action class and negative learning for the background class.
  • FIG. 1 is a block diagram illustrating an apparatus for detecting an action frame according to an embodiment of the present invention.
  • FIG. 2 is a flowchart illustrating a method of detecting an action frame according to another embodiment of the present invention.
  • FIG. 3 is a diagram illustrating a behavior classification model of a behavior frame detection apparatus according to embodiments of the present invention.
  • Supervised learning is a learning strategy in which the correct answer is given; it presumes that we can know the correct output for each input.
  • Supervised learning on a data set therefore requires a correct answer for each item constituting the data set. Weakly supervised learning, by contrast, uses only coarse labels, such as video-level action labels without frame-level annotations.
  • Accordingly, the present invention generates an adjusted class activation sequence through an action classification model in which background frames are suppressed from the feature map extracted from a video, and accurately detects action frames based on weakly supervised learning.
  • FIG. 1 is a block diagram illustrating an apparatus for detecting an action frame according to an embodiment of the present invention.
  • the action frame detection apparatus 110 includes at least one processor 120, a computer-readable storage medium 130, and a communication bus 170.
  • The processor 120 may cause the action frame detection apparatus 110 to operate according to the exemplary embodiments.
  • the processor 120 may execute one or more programs stored in the computer-readable storage medium 130.
  • One or more programs may include one or more computer-executable instructions, which, when executed by the processor 120, may configure the action frame detection apparatus 110 to perform operations according to an exemplary embodiment.
  • Computer-readable storage medium 130 is configured to store computer-executable instructions or program code, program data, and/or other suitable form of information.
  • the program 140 stored in the computer-readable storage medium 130 includes a set of instructions executable by the processor 120.
  • The computer-readable storage medium 130 includes memory (volatile memory such as random access memory, nonvolatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other types of storage media that can be accessed by the action frame detection apparatus 110 and store desired information, or a suitable combination thereof.
  • The communication bus 170 interconnects various other components of the action frame detection apparatus 110, including the processor 120 and the computer-readable storage medium 130.
  • the action frame detection device 110 may also include one or more input/output interfaces 150 and one or more communication interfaces 160 that provide an interface for one or more input/output devices.
  • the input/output interface 150 and the communication interface 160 are connected to the communication bus 170.
  • the input/output device may be connected to other components of the action frame detection device 110 through the input/output interface 150.
  • The action frame detection apparatus generates an adjusted class activation sequence through an action classification model in which background frames are suppressed from the feature map extracted from a video, and accurately detects action frames based on weakly supervised learning.
  • The action frame detection apparatus generates the adjusted class activation sequence by filtering background frames, through a filtering model applied to the feature map that adjusts the foreground weight, so that background frames are not activated in the adjusted class activation sequence.
  • The action frame detection apparatus generates a base class activation sequence through the action classification model for the feature map while generating the adjusted class activation sequence.
  • The action frame detection apparatus calculates a base class score from the base class activation sequence to perform positive learning for the action class and the background class, and calculates an adjusted class score from the adjusted class activation sequence to perform positive learning for the action class and negative learning for the background class.
  • the action frame detection method may be performed by an action frame detection apparatus or a computing device.
  • In step S210, the processor extracts a feature map from the video.
  • the video is converted into a plurality of color frames, and color features are extracted from the plurality of color frames.
  • the video is converted into an optical flow frame and optical flow features are extracted from the optical flow frame.
  • a feature map is generated by combining a color feature and an optical flow feature.
  • In step S220, the processor generates a base class activation sequence for the feature map through the action classification model.
  • A base class score is calculated from the base class activation sequence, and positive learning is performed for the action class and the background class.
  • In step S230, the processor applies a foreground weight to the feature map to generate an adjusted class activation sequence through the action classification model.
  • The foreground weight is adjusted by filtering background frames, through the filtering model applied to the feature map, so that they are not activated in the adjusted class activation sequence.
  • An adjusted class score is calculated from the adjusted class activation sequence, and positive learning for the action class and negative learning for the background class are performed.
  • In steps S220 and S230, the weights of the action classification model are shared so that both branches are learned together, as in the sketch below.
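A minimal PyTorch-style sketch of this two-branch structure follows. The module names, layer sizes, and layer counts are illustrative assumptions, not taken from the patent; only the overall flow (shared classifier, filtering model, foreground weighting) mirrors steps S210-S230.

```python
import torch
import torch.nn as nn

class ActionFrameDetector(nn.Module):
    def __init__(self, feat_dim=2048, num_action_classes=20):
        super().__init__()
        # Filtering model: one foreground weight in [0, 1] per segment (S230).
        self.filtering = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 1, kernel_size=1), nn.Sigmoid())
        # Action classification model, shared by both branches:
        # C action classes plus one background class.
        self.classifier = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(512, num_action_classes + 1, kernel_size=1))

    def forward(self, x):                 # x: (batch, 2048, T) feature map (S210)
        cas_base = self.classifier(x)     # base class activation sequence (S220)
        w = self.filtering(x)             # foreground weight, shape (batch, 1, T)
        cas_adj = self.classifier(w * x)  # adjusted CAS, background suppressed (S230)
        return cas_base, cas_adj, w
```

Because `self.classifier` is the same module object in both calls, its weights are shared and receive gradients from both branches, which is the weight sharing described above.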
  • FIG. 3 is a diagram illustrating an action classification model of an action frame detection apparatus according to embodiments of the present invention.
  • The action frame detection apparatus includes a feature extraction model and an action classification model.
  • The feature extraction model is a network of connected layers that learns weights and biases.
  • the feature extraction model may be implemented as a neural network such as a convolutional neural network (CNN).
  • The feature extraction model extracts RGB frames and optical flow frames from an input video. After dividing the extracted frames into segments of 16 frames each, 1024-dimensional RGB feature information and 1024-dimensional optical flow feature information are obtained from each segment. By concatenating the RGB feature information and the optical flow feature information, a 2048-dimensional feature map is created, as in the sketch below.
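The segment-level extraction above may be sketched as follows; `rgb_net` and `flow_net` are placeholders for the (unspecified) backbones that map one 16-frame segment to a 1024-dimensional feature vector.

```python
import torch

SEGMENT_LEN = 16  # frames per segment, as described above

def extract_feature_map(rgb_frames, flow_frames, rgb_net, flow_net):
    """rgb_frames, flow_frames: (num_frames, C, H, W) tensors for one video.
    Returns a (2048, T) feature map with T = num_frames // 16 segments."""
    feats = []
    num_segments = rgb_frames.shape[0] // SEGMENT_LEN
    for i in range(num_segments):
        seg = slice(i * SEGMENT_LEN, (i + 1) * SEGMENT_LEN)
        f_rgb = rgb_net(rgb_frames[seg])     # (1024,) color feature
        f_flow = flow_net(flow_frames[seg])  # (1024,) optical flow feature
        feats.append(torch.cat([f_rgb, f_flow]))  # (2048,) segment feature
    return torch.stack(feats, dim=1)  # (2048, T) feature map
```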
  • The action classification model predicts the class of input data and assigns the corresponding label.
  • various classification models implemented by neural networks can be applied.
  • A feature map is input to the convolutional network, and action and background class scores at each time step are generated.
  • The generated class activation sequence is trained to be classified into the action classes and the background class.
  • Counting the background class, the total number of classes is C+1, where C is the number of action classes. The model induces the class activation sequence to be activated in the action portions of the video.
  • The action frame detection apparatus generates a base class activation sequence $A_n$ in order to predict class scores at the segment level; this may be written as $A_n = f_{\text{conv}}(X_n; \phi)$, where $X_n$ is the feature map of the n-th video.
  • $\phi$ denotes the learnable parameters of the convolutional layers.
  • Class scores at the segment level are aggregated into a video-level score.
  • The video-level score is compared with the ground truth.
  • The top-k mean, i.e., the average of the k highest segment-level scores for each class, is applied to calculate the composite score.
  • The video-level class score is used to predict the probability $p_n$ that each class is a positive sample, as in the sketch below.
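The aggregation just described may be sketched as follows; the rule for choosing k and the use of a softmax over the C+1 classes are illustrative assumptions, not fixed by this text.

```python
import torch
import torch.nn.functional as F

def video_level_probs(cas, k):
    """cas: (C+1, T) class activation sequence of one video.
    Returns per-class probabilities p_n of shape (C+1,)."""
    k = min(k, cas.shape[1])
    topk, _ = torch.topk(cas, k, dim=1)    # k highest segment scores per class
    video_scores = topk.mean(dim=1)        # top-k mean = composite class score
    return F.softmax(video_scores, dim=0)  # probability p_n for each class
```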
  • The loss function of the base processing branch is expressed as Equation 4, in standard cross-entropy form: $L_{\text{base}} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C+1} y_n^c \log p_n^c$ (4)
  • Here, $y_n$ is the video-level label of the n-th video and $N$ is the number of training videos.
  • A feature map is fed into the filtering module to calculate a foreground weight, and the foreground weight is then multiplied element-wise with the feature map.
  • In this way, a feature map in which the background is suppressed can be obtained. As in the base processing branch, it is input to the convolutional network to generate a class activation sequence and classify the classes. Since the suppression processing branch is designed to suppress background frames, it learns negatively about the background class; as a result, the filtering module learns to calculate the foreground weight accurately. Positive learning about the background class takes "background present" as the correct answer, whereas negative learning about the background class takes "background absent" as the correct answer.
  • The feature map $X'_n$ multiplied by the foreground weight is expressed as Equation 5: $X'_n = w_n \otimes X_n$ (5), where $w_n$ is the foreground weight from the filtering module and $\otimes$ denotes element-wise multiplication.
  • The adjusted class activation sequence $A'_n$ is expressed as Equation 6: $A'_n = f_{\text{conv}}(X'_n; \phi)$ (6), computed with the same convolutional parameters $\phi$ as the base branch; a sketch follows.
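Under the notation reconstructed above, the suppression processing branch may be sketched as follows, reusing the `filtering` and `classifier` modules assumed in the earlier sketch.

```python
def suppression_branch(x, filtering, classifier):
    """x: (batch, feat_dim, T) feature map; returns the adjusted CAS."""
    w = filtering(x)                         # foreground weight w_n, (batch, 1, T)
    x_suppressed = w * x                     # Equation (5): X'_n = w_n (x) X_n
    cas_adjusted = classifier(x_suppressed)  # Equation (6): A'_n, shared weights
    return cas_adjusted, w
```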
  • The loss function of the suppression processing branch is expressed as Equation 7, the same cross-entropy form as Equation 4, applied to the class probabilities $p'_n$ obtained from the adjusted class activation sequence: $L_{\text{supp}} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C+1} \bar{y}_n^c \log p'^{\,c}_n$ (7)
  • For the base branch (Equation 4), the video-level label $y_n$ sets the label for the background class to 1; for the suppression branch (Equation 7), the video-level label $\bar{y}_n$ sets the label for the background class to 0.
  • $\alpha$, $\beta$, $\gamma$ are optimization parameters, and L1 normalization may be applied to the foreground (attention) weights, giving a total loss of the form $L = \alpha L_{\text{base}} + \beta L_{\text{supp}} + \gamma \lVert w_n \rVert_1$, as sketched below.
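A sketch of this combined objective under the label convention above; the label normalization, the epsilon inside the logarithm, and the default hyperparameter values are assumptions (it is also assumed each video has at least one action label).

```python
import torch

def total_loss(p_base, p_adj, y_action, w, alpha=1.0, beta=1.0, gamma=1e-4):
    """p_base, p_adj: (C+1,) class probabilities from the base and adjusted CAS;
    y_action: (C,) multi-hot video-level action labels; w: foreground weights."""
    y_base = torch.cat([y_action, torch.ones(1)])   # background label set to 1
    y_supp = torch.cat([y_action, torch.zeros(1)])  # background label set to 0
    l_base = -(y_base / y_base.sum() * torch.log(p_base + 1e-8)).sum()  # Eq. (4)
    l_supp = -(y_supp / y_supp.sum() * torch.log(p_adj + 1e-8)).sum()   # Eq. (7)
    l_norm = w.abs().mean()                         # L1 term on foreground weights
    return alpha * l_base + beta * l_supp + gamma * l_norm
```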
  • The action frame detection apparatus detects the action frames using the result of suppressing background frames in the suppression processing branch, as in the sketch below.
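For example, detection may reduce to thresholding the adjusted class activation sequence; the threshold value and the softmax over classes below are illustrative assumptions.

```python
import torch

def detect_action_segments(cas_adjusted, class_idx, threshold=0.5):
    """cas_adjusted: (C+1, T) adjusted CAS; returns indices of action segments."""
    scores = torch.softmax(cas_adjusted, dim=0)[class_idx]  # (T,) per-segment score
    return (scores > threshold).nonzero().flatten()
```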
  • FIGS. 4 to 6 show simulation results performed according to embodiments of the present invention.
  • FIG. 4 shows a spike scene from a volleyball video, FIG. 5 a shot put video, and FIG. 6 a penalty kick scene from a soccer video.
  • GT denotes the ground truth.
  • the action frame detection apparatus may be implemented in a logic circuit by hardware, firmware, software, or a combination thereof, or may be implemented using a general purpose or specific purpose computer.
  • the device may be implemented using a hardwired device, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or the like.
  • the device may be implemented as a System on Chip (SoC) including one or more processors and controllers.
  • SoC System on Chip
  • the action frame detection apparatus may be mounted in a form of software, hardware, or a combination thereof on a computing device or server provided with a hardware element.
  • A computing device or server may refer to any of various devices including all or some of: a communication device, such as a communication modem, for communicating with various devices or over wired/wireless communication networks; a memory storing data for executing a program; and a microprocessor that executes the program to perform computation and issue commands.
  • Although each process is described as being executed sequentially, this is merely illustrative; those skilled in the art may change the order shown in FIG. 2, execute one or more processes in parallel, or add other processes, and various modifications and variations may be applied, without departing from the essential characteristics of the embodiments of the present invention.
  • Computer-readable medium refers to any medium that has participated in providing instructions to a processor for execution.
  • the computer-readable medium may include program instructions, data files, data structures, or a combination thereof.
  • Examples include magnetic media, optical recording media, memory, and the like.
  • Computer programs may be distributed over networked computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present embodiments may be easily inferred by programmers skilled in the art to which the embodiments belong.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present invention relate to a method and device for accurately detecting action frames based on weakly supervised learning by generating an adjusted class activation sequence through an action classification model that suppresses background frames in a feature map extracted from a video.
PCT/KR2020/012645 2019-11-22 2020-09-18 Method and device for detecting action frames based on weakly supervised learning, using background frame suppression WO2021101052A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020190151551A KR102201353B1 (ko) 2019-11-22 2019-11-22 Method and apparatus for detecting action frames based on weakly supervised learning through background frame suppression
KR10-2019-0151551 2019-11-22

Publications (1)

Publication Number Publication Date
WO2021101052A1 true WO2021101052A1 (fr) 2021-05-27

Family

ID=74127672

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/012645 WO2021101052A1 (fr) 2019-11-22 2020-09-18 Method and device for detecting action frames based on weakly supervised learning, using background frame suppression

Country Status (2)

Country Link
KR (1) KR102201353B1 (fr)
WO (1) WO2021101052A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818829B (zh) * 2021-01-27 2022-09-09 中国科学技术大学 Weakly supervised temporal action localization method and system based on a structured network
CN116612420B (zh) * 2023-07-20 2023-11-28 中国科学技术大学 Weakly supervised video temporal action detection method, system, device, and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019099226A1 (fr) * 2017-11-14 2019-05-23 Google Llc Weakly-supervised action localization by sparse temporal pooling network
KR20190127261A (ko) * 2018-05-04 2019-11-13 연세대학교 산학협력단 Method and apparatus for learning class scores of a two-stream network for action recognition

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100785076B1 (ko) 2006-06-15 2007-12-12 삼성전자주식회사 Method and apparatus for real-time event detection in sports video
US9098923B2 (en) 2013-03-15 2015-08-04 General Instrument Corporation Detection of long shots in sports video

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019099226A1 (fr) * 2017-11-14 2019-05-23 Google Llc Weakly-supervised action localization by sparse temporal pooling network
KR20190127261A (ko) * 2018-05-04 2019-11-13 연세대학교 산학협력단 Method and apparatus for learning class scores of a two-stream network for action recognition

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SU, Haisheng; ZHAO, Xu; LIN, Tianwei; FEI, Haiping: "Weakly Supervised Temporal Action Detection with Shot-Based Temporal Pooling Network", Lecture Notes in Computer Science, vol. 11304, chap. 37, Springer International Publishing, Cham, 17 November 2018, pp. 426-436, DOI: 10.1007/978-3-030-04212-7_37, XP047496472 *
ALWASSEL, Humam; PARDO, Alejandro; CABA HEILBRON, Fabian; THABET, Ali; GHANEM, Bernard: "RefineLoc: Iterative Refinement for Weakly-Supervised Action Localization", arXiv.org, Cornell University Library, Ithaca, NY, 30 March 2019, XP081538029 *
SHOU, Zheng; GAO, Hang; ZHANG, Lei; MIYAZAWA, Kazuyuki; CHANG, Shih-Fu: "AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos", Lecture Notes in Computer Science, vol. 11220, chap. 10, Springer, Berlin, Heidelberg, 6 October 2018, pp. 162-179, DOI: 10.1007/978-3-030-01270-0_10, XP047488473 *

Also Published As

Publication number Publication date
KR102201353B1 (ko) 2021-01-08

Similar Documents

Publication Publication Date Title
WO2018217019A1 Device for detecting a variant malicious code based on neural network learning, method therefor, and computer-readable recording medium storing a program for executing the method
WO2020246834A1 Method for recognizing an object in an image
WO2021101052A1 Method and device for detecting action frames based on weakly supervised learning, using background frame suppression
WO2021261696A1 Visual object instance segmentation using foreground-specialized model imitation
WO2020071701A1 Method and device for detecting an object in real time by means of a deep learning network model
EP3568828A1 Apparatus and method for processing images using a multi-channel feature map
WO2022158819A1 Method and electronic device for determining motion saliency and video playback style in a video
WO2021054706A1 Teaching GANs (generative adversarial networks) to generate per-pixel annotation
WO2017138766A1 Hybrid-based image clustering method and server for operating the same
CN107316035A Object recognition method and device based on a deep learning neural network
WO2013048159A1 Method, apparatus, and computer-readable recording medium for detecting the location of a facial feature point using an AdaBoost learning algorithm
WO2021246810A1 Method for training a neural network via autoencoder and multiple-instance learning, and computing system for performing the same
WO2020231005A1 Image processing device and operating method thereof
WO2020231226A1 Method for performing, by an electronic device, a convolution operation at a given layer in a neural network, and electronic device therefor
WO2021246811A1 Method and system for training a neural network to determine severity
WO2020017829A1 Method for generating a license plate image using a noise pattern, and apparatus therefor
CN108921023A Method and device for determining low-quality portrait data
WO2023013809A1 Control method for a sports activity classification learning apparatus, and recording medium and apparatus for implementing the same
WO2021091096A1 Method and apparatus for visual question answering using a fairness classification network
CN110852209A Target detection method and apparatus, medium, and device
WO2019225799A1 Method and device for removing user information using a deep learning generative model
WO2022191366A1 Electronic device and control method therefor
WO2023277448A1 Method and system for training an artificial neural network model for image processing
WO2023033194A1 Knowledge distillation method and system specialized for pruning-based deep neural network lightweighting
WO2021071258A1 Artificial intelligence-based mobile security image learning device and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20890263

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20890263

Country of ref document: EP

Kind code of ref document: A1