WO2021107333A1 - Method for detecting acoustic event in deep learning-based detection environment - Google Patents

Method for detecting acoustic event in deep learning-based detection environment

Info

Publication number
WO2021107333A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
event
cnn
lstm
detect
Prior art date
Application number
PCT/KR2020/010760
Other languages
English (en)
Korean (ko)
Inventor
김홍국 (Hong Kook Kim)
박인영 (Inyoung Park)
Original Assignee
광주과학기술원 (Gwangju Institute of Science and Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from KR1020200035181A external-priority patent/KR102314824B1/ko
Application filed by 광주과학기술원 (Gwangju Institute of Science and Technology)
Publication of WO2021107333A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B13/00 Burglar, theft or intruder alarms
    • G08B13/16 Actuation by interference with mechanical vibrations in air or other fluid
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments

Definitions

  • The present disclosure relates to a method and system for detecting an acoustic event in a deep learning-based sensing situation. By using a fast regional convolution-based network (Fast R-CNN) to detect the frequency-domain regions of acoustic signals with different characteristics, it becomes possible to detect complex sounds in real time regardless of distance.
  • Fast R-CNN: fast regional convolution-based network
  • Technology for detecting acoustic events and classifying their types has been studied continuously so that, combined with context-aware technology, it can support a user's judgment of the surrounding environment.
  • Conventional acoustic event detection systems are highly environment-dependent: when noise louder than the target sound source is present, or when the sound source is located far away, detection accuracy drops.
  • The present disclosure aims at training an artificial intelligence model using a fast regional convolution-based network (Fast R-CNN) and at detecting acoustic events in real time in a sensing situation.
  • An object of the present disclosure is complex acoustic event detection: extracting acoustic features regardless of distance and detecting multiple sound occurrences in an acoustic sensing situation.
  • To this end, the present disclosure provides a method for detecting an event using sound data, the method comprising: receiving sound data including at least one sound source; extracting a composite acoustic feature from the sound data; classifying each of the at least one sound source included in the composite acoustic feature using an artificial intelligence model; and detecting an event based on a combination of the classified sound sources.
  • Classifying the sound sources included in the composite acoustic feature using the artificial intelligence model may include: obtaining a convolutional feature map of the static feature and classifying a sound source by extracting a feature vector for a preset region of interest (RoI) from that map; and obtaining a feature map of the differential feature, extracting a feature vector from it, and classifying the sound source represented by that feature vector.
  • Using a fast regional convolution-based network (Fast R-CNN), the present disclosure can provide improved real-time detection based on sound information, in addition to situation detection that uses image information.
  • Because the present disclosure identifies acoustic characteristics regardless of distance, it can compensate for the disadvantages of existing acoustic-based sensing systems and, by detecting multiple simultaneous sounds, can detect the occurrence of acoustic events accurately.
  • FIG. 1 shows a flowchart according to an embodiment of the present disclosure.
  • FIG. 3 illustrates the progress of an algorithm according to an embodiment of the present disclosure.
  • CTC: connectionist temporal classification
  • AED: acoustic event detection
  • R-CNN: regional convolutional neural networks
  • LSTM: long short-term memory
  • The present disclosure proposes an acoustic event detection system that operates across a variety of noise environments and distances.
  • The method uses a regional convolution-based network (R-CNN) to detect the frequency-domain regions of time-series acoustic events with different characteristics in a noisy environment.
  • The input of the artificial intelligence model may be designed to use multiple feature values so that event detection performance does not depend on signal strength, which varies with distance.
  • Hereinafter, FIG. 1, which shows a flowchart of the present disclosure, will be described.
  • The acoustic event detection system of the present disclosure may include receiving sound data including at least one sound source (S10).
  • Here, a sound source means the origin of a sound, and the received sound data may contain one or more such sources.
  • A single piece of sound data may include any sounds generated by specific objects or living things, such as dialogue, the sound of objects colliding, or a gunshot. The sound data may also include acoustic feature data extracted from the sound sources.
  • Noise removal may be performed during preprocessing, and when the spectral frequency band of the noise differs from that of the event, or when each event has a unique frequency band, region-based convolutional neural networks (R-CNN) can be used.
  • The acoustic event detection system of the present disclosure may include extracting a composite feature value from the sound data (S20).
  • The composite feature value may include a static feature extracted from a spectrogram of the received sound data and a differential feature based on the difference between the current sound data and the sound data from a preset time earlier.
  • The static feature may be a log mel-band energy-based image containing the characteristics of the sound data.
  • Log mel-band energy is an energy representation that captures the characteristics of an acoustic signal well.
  • Acoustic data is generally normalized by its power, but the characteristic information may be damaged when the intensity weakens with distance or the signal is mixed with noise.
  • Because acoustic data is a time series, the change in log mel-band energy can be measured as the difference between the current data and the previous data.
  • Because the differential feature generated in this way is based on the amount of change between the current sound data and the earlier sound data, each expressed as a spectrogram image, it can reflect the characteristics of the sound data regardless of the distance from the point where the sound was generated.
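  • As a rough illustration only, the composite feature extraction described above (static log mel-band energy plus its frame-to-frame difference) could be sketched as follows in Python. The library choice (librosa), sampling rate, and frame/mel parameters are assumptions for the sketch, not values taken from this disclosure.

      # A minimal sketch of the composite feature extraction described above.
      # librosa and all parameter values here are illustrative assumptions.
      import librosa
      import numpy as np

      def composite_features(wav_path, n_mels=40):
          y, sr = librosa.load(wav_path, sr=16000)
          # Static feature: log mel-band energy "image" (mel bands x frames).
          mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                               hop_length=512, n_mels=n_mels)
          static = librosa.power_to_db(mel)
          # Differential feature: change relative to the previous frame, which is
          # less sensitive to absolute signal strength (and hence to distance).
          diff = np.diff(static, axis=1, prepend=static[:, :1])
          return static, diff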
  • The present disclosure may classify the at least one sound source included in the composite feature value using an artificial intelligence model (S30), and may detect an event based on the combination of the classified sound sources (S40).
  • The present disclosure discloses a Fast R-CNN-LSTM based real-time sensing system that uses region information across various types of deep learning output layers.
  • First, a Fast R-CNN-Attention LSTM is trained using training data in which acoustic data is expressed as images, generating the FFast R-CNN-LSTM model 31.
  • Next, training data whose decoded score from the generated FFast R-CNN-LSTM model 31 is greater than or equal to a predetermined reference value is selected, and semi-supervised region labeling is performed with the selected data to generate the SFast R-CNN-LSTM model 32.
  • Because the SFast R-CNN-LSTM model 32 generated by this process can extract, by itself, the region representing the characteristics of a sound source within a spectrum section, it can minimize hand-labeling cost.
  • CNN: convolutional neural network
  • To use acoustic data as CNN input, multiple time points are collected: several frames of acoustic data are assembled into one image so that image-processing techniques can be applied.
  • The artificial intelligence model of the present disclosure may include an event detection algorithm whose acoustic structure connects a CNN and an RNN (LSTM).
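  • As a rough sketch of such a CNN-to-RNN (LSTM) connection, the following PyTorch fragment treats the feature image as CNN input and feeds one vector per frame to an LSTM. Layer sizes, the number of mel bands, and the class count are illustrative assumptions, not values specified in this disclosure.

      # A minimal CNN-to-LSTM (CRNN) sketch; all sizes are assumptions.
      import torch
      import torch.nn as nn

      class CRNN(nn.Module):
          def __init__(self, n_mels=40, n_classes=10):
              super().__init__()
              # CNN front end over the (mel bands x frames) feature image.
              self.cnn = nn.Sequential(
                  nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                  nn.MaxPool2d((2, 1)),   # pool frequency only, keep time steps
                  nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                  nn.MaxPool2d((2, 1)),
              )
              self.lstm = nn.LSTM(input_size=64 * (n_mels // 4), hidden_size=128,
                                  batch_first=True, bidirectional=True)
              self.fc = nn.Linear(2 * 128, n_classes)

          def forward(self, x):            # x: (batch, 1, n_mels, frames)
              f = self.cnn(x)              # (batch, 64, n_mels // 4, frames)
              b, c, m, t = f.shape
              f = f.permute(0, 3, 1, 2).reshape(b, t, c * m)  # one vector per frame
              h, _ = self.lstm(f)          # (batch, frames, 256)
              return self.fc(h)            # per-frame class scores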
  • The algorithm of the present disclosure may include the FFast R-CNN-LSTM model 31, the SFast R-CNN-LSTM model 32, and an artificial intelligence model 33 in which differential features are reflected.
  • The FFast R-CNN-LSTM model 31 may be a model pre-trained with unlabeled training data at the full-band height of the existing Fast R-CNN-LSTM.
  • The fast regional convolutional neural network (Fast R-CNN) refers to a method proposed to remedy the bottleneck of R-CNN, in which bounding-box proposals are generated at all locations in the image where feature data exists.
  • Each bounding-box proposal does not pass through the CNN individually; instead, the entire image passes through the CNN once, and object detection is performed on the resulting output feature map.
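  • The essence of this "one CNN pass, many regions" idea can be illustrated with torchvision's RoI pooling utilities; the tensor shapes and box coordinates below are hypothetical.

      # One shared feature map; per-region vectors are cut out afterwards (Fast R-CNN idea).
      import torch
      from torchvision.ops import roi_align

      feature_map = torch.randn(1, 64, 25, 100)  # (batch, channels, H, W) after one CNN pass
      # Regions of interest as (batch_index, x1, y1, x2, y2) in input coordinates.
      rois = torch.tensor([[0, 10.0, 0.0, 50.0, 90.0],
                           [0, 60.0, 5.0, 95.0, 80.0]])
      # spatial_scale maps input coordinates onto the feature map (e.g. 1/4 after pooling).
      pooled = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=0.25)
      print(pooled.shape)  # torch.Size([2, 64, 7, 7])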
  • The present disclosure may also include the SFast R-CNN-LSTM model 32.
  • The SFast R-CNN-LSTM model 32 can extract, by itself, the region representing the characteristics of a sound source in a spectrum section.
  • To train the SFast R-CNN-LSTM model 32, the present disclosure may select suitable training data, namely data that obtained a score greater than or equal to a preset value when tested with the FFast R-CNN-LSTM.
  • The SFast R-CNN-LSTM model 32 of the present disclosure may be a Fast R-CNN-LSTM that uses semi-supervised region labeling; because it trains on the selected data, hand-labeling cost can be minimized. A sketch of this selection follows below.
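  • A hedged sketch of this selection step, in Python: keep only samples whose decoded score from the pre-trained model clears a threshold, and reuse the model's own region predictions as pseudo-labels. The names pretrained_model and decode, and the 0.9 threshold, are hypothetical illustrations, not elements defined in this disclosure.

      # Hypothetical semi-supervised data selection for region labeling.
      def select_pseudo_labeled(pretrained_model, unlabeled_samples, threshold=0.9):
          selected = []
          for x in unlabeled_samples:
              regions, score = pretrained_model.decode(x)  # model's own region proposals
              if score >= threshold:                       # keep only confident decodings
                  selected.append((x, regions))            # regions act as pseudo-labels
          return selected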
  • The sensing system of the present disclosure may include an artificial intelligence model 33 in which differential features are reflected.
  • The artificial intelligence model 33 reflecting the differential feature may be trained in parallel with the SFast R-CNN-LSTM model 32.
  • By sharing some layers for acoustic detection with the SFast R-CNN-LSTM model 32, a single event detection can be performed.
  • The shared layers may include a CTC algorithm layer.
  • The artificial intelligence model 33 in which the differential feature is reflected includes a layer connecting a CNN and an LSTM; when the differential feature is provided to this connected CNN-RNN (LSTM) algorithm, a feature vector reflecting the differential feature is generated.
  • The differential feature includes the amount of change obtained by subtracting the image information of the sound data at time T-1 from the image information of the sound data at time T extracted from the sound source (i.e., ΔX_T = X_T − X_{T-1}), and the artificial intelligence model 33 can be trained using this differential feature value.
  • FIG. 3 shows the process of the sound sensing system of the present disclosure.
  • The sound sensing system of the present disclosure may receive sound data including at least one sound source and extract the composite feature value included in the sound data (S20). The static feature among the extracted composite feature values is input to the FFast R-CNN-LSTM model 31, and based on the output of the FFast R-CNN-LSTM model 31, the preset region of interest (RoI) set in step S20 may be reset.
  • The sensing system of the present disclosure provides the static feature of the sound data to the SFast R-CNN-LSTM model 32, extracts the reconfigured region-of-interest feature vector from the convolutional feature map of the static feature, and classifies the sound sources included in the composite feature value.
  • In parallel with the SFast R-CNN-LSTM model 32, the artificial intelligence model 33 in which differential features are reflected extracts a feature vector from the convolutional feature map of the differential feature of the sound data and classifies the sound sources included in the composite feature value.
  • The artificial intelligence model 33 reflecting the differential feature and the SFast R-CNN-LSTM model 32 reflecting the static feature share a specific layer and together perform the step of classifying the at least one sound source included in the composite feature value (S30) and the step of detecting an event based on the combination of the classified sound sources (S40).
  • The specific shared layer may include an Attention-LSTM algorithm, in which time-series and regional features are reflected, and a connectionist temporal classification (CTC) algorithm.
  • the Attention-LSTM algorithm and the CTC algorithm are included in each of the models 31, 32, and 33 shown in FIG. 2 to perform sound sensing.
  • Event detection in the present disclosure must be able to detect events of various lengths in the obtained acoustic data.
  • The present disclosure extracts regional features of the sound data from a feature map and applies an attention algorithm that assigns weights to those regional features, increasing sound detection accuracy.
  • The attention algorithm can be trained by unsupervised learning that does not require labeling.
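  • As a small sketch of attention-weighted pooling over per-frame features of the kind described (the scoring layer and sizes are assumptions, not elements of this disclosure):

      # Attention pooling: learn per-frame weights, then form a weighted summary.
      import torch
      import torch.nn as nn

      class AttentionPool(nn.Module):
          def __init__(self, hidden=256):
              super().__init__()
              self.score = nn.Linear(hidden, 1)  # learns which frames matter

          def forward(self, h):                  # h: (batch, frames, hidden)
              w = torch.softmax(self.score(h), dim=1)  # per-frame weights summing to 1
              return (w * h).sum(dim=1)          # weighted summary vector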
  • The sensing system of the present disclosure may detect a final event by labeling the acoustic data frame-wise, i.e., per time step, using connectionist temporal classification (CTC).
  • For example, using the sound sources detected in three frames of the obtained sound data, the detection system can output 'Traffic Accident Occurrence' as the final event; connectionist temporal classification may be used as the method of determining this final event.
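  • The CTC decoding idea can be illustrated with a greedy collapse of per-frame labels (take the best label per frame, merge repeats, drop blanks); the frame labels below are made-up illustration data, not results from this disclosure.

      # Greedy CTC-style collapse of frame-wise labels into an event sequence.
      def ctc_greedy_collapse(frame_labels, blank=0):
          events, prev = [], blank
          for lab in frame_labels:
              if lab != blank and lab != prev:  # a new non-blank label starts an event
                  events.append(lab)
              prev = lab
          return events

      # Hypothetical per-frame argmax labels: 0=blank, 1=crash, 2=horn, 3=scream
      print(ctc_greedy_collapse([0, 1, 1, 0, 2, 2, 3]))  # -> [1, 2, 3]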
  • FIG. 6 shows a comparison of results with and without the post-processing of the present disclosure. Referring to FIG. 6, the F1 score is higher when post-processing is applied.
  • In addition, the F1 score is highest when composite feature values are used together with the attention algorithm.
  • FIGS. 7 and 8 show statistics indicating that the algorithm of the present disclosure achieves its highest performance when a composite feature value including both a static feature and a differential feature is used.
  • The present disclosure described above can be implemented as computer-readable code on a medium in which a program is recorded.
  • The computer-readable medium includes all kinds of recording devices in which data readable by a computer system is stored. Examples of computer-readable media include a hard disk drive (HDD), solid-state disk (SSD), silicon disk drive (SDD), ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage device.
  • The computer may include a processor of the terminal.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a method that: extracts multiple features from acoustic data of interest within acoustic data containing one or more sound sources; classifies the one or more sound sources included in the multiple features using an artificial intelligence model based on a fast region convolutional neural network with long short-term memory (Fast R-CNN-LSTM); and detects an event using the classified sound sources.
PCT/KR2020/010760 2019-11-25 2020-08-19 Method for detecting acoustic event in deep learning-based detection environment WO2021107333A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962940058P 2019-11-25 2019-11-25
US62/940,058 2019-11-25
KR10-2020-0035181 2020-03-23
KR1020200035181A KR102314824B1 (ko) 2019-11-25 2020-03-23 Method for detecting acoustic event in deep learning-based sensing situation

Publications (1)

Publication Number Publication Date
WO2021107333A1 (fr) 2021-06-03

Family

ID=76129728

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/010760 WO2021107333A1 (fr) 2019-11-25 2020-08-19 Method for detecting acoustic event in deep learning-based detection environment

Country Status (1)

Country Link
WO (1) WO2021107333A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007114413A * (ja) 2005-10-19 2007-05-10 Toshiba Corp Speech/non-speech discrimination apparatus, speech section detection apparatus, speech/non-speech discrimination method, speech section detection method, speech/non-speech discrimination program, and speech section detection program
US20160099010A1 * (en) 2014-10-03 2016-04-07 Google Inc. Convolutional, long short-term memory, fully connected deep neural networks
KR20170022445A * (ko) 2015-08-20 2017-03-02 삼성전자주식회사 Apparatus and method for speech recognition based on a unified model
KR101794543B1 * (ko) 2016-04-18 2017-11-08 주식회사 세화 Fault detection and identification system for railroad switch machines through sound analysis
KR102006206B1 * (ko) 2017-08-14 2019-08-01 오토시맨틱스 주식회사 Acoustics-based diagnosis method for water supply pipe leakage using deep learning

Similar Documents

Publication Publication Date Title
WO2013048159A1 (fr) Method, apparatus, and computer-readable recording medium for detecting a facial feature point location using an AdaBoost learning algorithm
CN112735473B (zh) Method and system for recognizing an unmanned aerial vehicle based on sound
WO2020153572A1 (fr) Method and apparatus for training a sound event detection model
CN105895078A (zh) Speech recognition method and device that dynamically select a speech model
WO2015111771A1 (fr) Method for determining alcohol consumption, and recording medium and terminal for carrying out same
CN111724770B (zh) Audio keyword recognition method based on a deep convolutional generative adversarial network
KR102314824B1 (ko) Method for detecting acoustic event in deep learning-based sensing situation
WO2020032506A1 (fr) Vision detection system and vision detection method using same
WO2021107422A1 (fr) Non-intrusive load monitoring method using energy consumption data
WO2021153861A1 (fr) Method for detecting multiple objects and apparatus therefor
EP2907121A1 (fr) Real-time traffic detection
CN114863221A (zh) Training method, apparatus, system, and device for a detection model, and storage medium
CN108615532A (zh) Classification method and device applied to acoustic scenes
Liu et al. Slippage fault diagnosis of dampers for transmission lines based on faster R-CNN and distance constraint
Dong et al. At the speed of sound: Efficient audio scene classification
WO2021225296A1 (fr) Explainable active learning method to be used for an object detector, using a deep encoder, and active learning device using same
WO2021107333A1 (fr) Method for detecting acoustic event in deep learning-based detection environment
WO2022139009A1 (fr) Method and apparatus for configuring a deep learning algorithm for autonomous driving
CN116594057B (zh) Earthquake early-warning method and device based on deep learning and edge computing
JP4886460B2 (ja) Abnormality monitoring device
WO2023096185A1 (fr) Method for diagnosing machine failure based on deep learning by using sounds and vibrations, and diagnostic device using same
Hassan et al. Intelligent sign language recognition using enhanced fourier descriptor: a case of Hausa sign language
WO2022108057A1 (fr) Multi-modal apparatus and method for classifying emotions
WO2020231188A1 (fr) Classification result verification method and classification result learning method using a verification neural network, and computing device for performing the methods
WO2021153843A1 (fr) Method for determining stress of voice signal by using weights, and device therefor

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20893944

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05.09.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20893944

Country of ref document: EP

Kind code of ref document: A1