WO2021107333A1 - Method for detecting acoustic events in a deep-learning-based sensing environment
- Publication number: WO2021107333A1 (PCT/KR2020/010760)
- Authority: WIPO (PCT)
- Prior art keywords: feature, event, CNN, LSTM, detect
Classifications
- G06N3/04: Neural networks; architecture, e.g. interconnection topology
- G06N3/08: Neural networks; learning methods
- G06V10/24: Image preprocessing; aligning, centring, orientation detection or correction of the image
- G06V10/46: Extraction of image or video features; descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform (SIFT) or bags of words (BoW); salient regional features
- G08B13/16: Burglar, theft or intruder alarms; actuation by interference with mechanical vibrations in air or other fluid
- G10H1/00: Details of electrophonic musical instruments
Definitions
- The present disclosure relates to a method and system for detecting acoustic events in a deep-learning-based sensing situation. By using a fast regional convolution-based network (Fast R-CNN) to detect the frequency-domain regions of acoustic signals with different characteristics, complex sounds can be detected in real time regardless of distance.
- A technology for detecting acoustic events and classifying their types has been studied continuously so that, fused with context-aware technology, it can support a user's judgment of the surrounding environment.
- The conventional acoustic event detection system is highly environment-dependent: when noise louder than the target sound source is present, or when the sound source is located far away, detection accuracy drops.
- The present disclosure aims to train an artificial intelligence model using a fast regional convolution-based network (Fast R-CNN) and to detect acoustic events in real time in a sensing situation.
- An object of the present disclosure is complex acoustic event detection that extracts acoustic features regardless of distance and detects multiple sound occurrences in an acoustic sensing situation.
- The present disclosure provides a method for detecting an event using sound data, the method comprising: receiving sound data including at least one sound source; extracting a composite acoustic feature from the sound data; classifying each of the at least one sound source included in the composite acoustic feature using an artificial intelligence model; and detecting an event based on the combination of the classified sound sources.
- Classifying the sound sources included in the composite acoustic feature using the artificial intelligence model may include: obtaining a convolutional feature map of the static feature; extracting a feature vector for a preset region of interest (RoI) from the convolutional feature map and classifying the sound source; and obtaining a feature map of the differential feature, extracting a feature vector from it, and classifying the sound source included in that feature vector.
- The present disclosure can provide improved real-time detection based on sound information, in addition to situation detection using image information, by means of a fast regional convolution-based network (Fast R-CNN).
- Because the present disclosure identifies acoustic characteristics regardless of distance, it can compensate for the disadvantages of existing acoustic-based sensing systems and accurately detect acoustic events by detecting multiple simultaneous sounds.
- FIG. 1 shows a flowchart according to an embodiment of the present disclosure.
- FIG. 3 illustrates the algorithm flow according to an embodiment of the present disclosure.
- CTC: connectionist temporal classification
- AED: acoustic event detection
- R-CNN: regional convolutional neural networks
- LSTM: long short-term memory
- The present disclosure proposes an acoustic event detection system that operates across various noise environments and distances.
- The method uses a regional convolution-based network (R-CNN) to detect the frequency regions of time-series acoustic events with different characteristics in a noisy environment.
- The input of the artificial intelligence model may be designed to use multi-feature values in order to solve the problem that event detection performance depends on the signal strength, which varies with distance.
- FIG. 1, a flowchart of the present disclosure, will be described first.
- The acoustic event detection system of the present disclosure may include receiving acoustic data including at least one sound source (S10).
- A sound source is the origin of a sound, and the acoustic event detection system may receive sound data containing at least one such source.
- One piece of sound data may include any sound generated by a specific object or living thing, such as speech, the sound of colliding objects, or a gunshot. The sound data may also include acoustic feature data extracted from the sound source.
- Noise removal may be performed during preprocessing; when the spectral band of the noise differs from that of the event, or when each event has its own unique frequency band, region-convolutional neural networks (R-CNN) can be used.
- The acoustic event detection system of the present disclosure may include extracting a composite feature value from the acoustic data (S20).
- The composite feature value may include a static feature extracted from a spectrogram of the received sound data and a differential feature based on the difference between the current sound data and the sound data from a preset time earlier.
- The static feature may be a log mel-band energy image containing the features of the sound data; log mel-band energy represents the characteristics of an acoustic signal well.
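As an illustration of the static feature described above, the following sketch computes a log mel-band energy image from a raw waveform using NumPy. The parameter choices (16 kHz sample rate, 512-point FFT, 256-sample hop, 40 mel bands) are assumptions for this example, not values stated in the disclosure.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_band_energy(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal with a Hann window and take each frame's power spectrum.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (n_frames, n_fft//2 + 1)

    # Triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)

    # Log-compress the mel-band energies; the small floor avoids log(0).
    return np.log(power @ fb.T + 1e-10)               # (n_frames, n_mels)
```

With librosa installed, `librosa.feature.melspectrogram` followed by `librosa.power_to_db` computes an equivalent, better-optimized feature.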
- Acoustic data is generally normalized by its power, but the characteristic information may be damaged when the intensity weakens with distance or the signal is mixed with noise.
- Because acoustic data is a time series, the change in log mel-band energy can be measured as the difference between the current acoustic data and the previous data.
- Because the differential feature generated this way is based on the amount of change between the image of the current sound data and the image of the sound data from a preset time earlier, it can reflect the characteristics of the sound data regardless of the distance from the point where the sound was generated.
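A minimal sketch of the differential feature: the log mel-band frame from a previous time step is subtracted from the current one. Since attenuation with distance acts roughly as a constant offset in the log domain, it cancels in the difference, which illustrates the distance robustness claimed above. The `lag` parameter and the zero padding of the first rows are assumptions of this sketch.

```python
import numpy as np

def differential_feature(log_mel, lag=1):
    # Difference between each frame and the frame `lag` steps earlier;
    # the first `lag` rows have no predecessor and are left at zero.
    delta = np.zeros_like(log_mel)
    delta[lag:] = log_mel[lag:] - log_mel[:-lag]
    return delta
```

A constant gain shift (e.g. a quieter, more distant source) leaves the differential feature unchanged, while the static feature shifts with it.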
- The present disclosure may classify each of the at least one sound source included in the composite feature value using an artificial intelligence model (S30), and may detect an event based on the combination of the classified sound sources (S40).
- The present disclosure discloses a Fast R-CNN-LSTM based real-time sensing system that uses region information in various types of deep learning output layers.
- A Fast R-CNN-Attention LSTM is trained using training data in which acoustic data is represented as images, generating the FFast R-CNN-LSTM model 31.
- The present disclosure selects the training data for which the decoded output of the generated FFast R-CNN-LSTM model 31 scores at or above a predetermined threshold, and uses the selected data to perform semi-supervised region labeling, generating the SFast R-CNN-LSTM model 32.
- The SFast R-CNN-LSTM model 32 generated by this process can minimize the hand-labeling cost, because the model extracts by itself the regions representing the characteristics of the sound source in the spectrum.
- CNN: convolutional neural network
- Multiple time points can be collected to use acoustic data as CNN input, and several frames of acoustic data can be composed into one image so that image-processing techniques can be applied.
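The frame-stacking step described above can be sketched as a sliding window over feature frames; the window length (`context`) is an assumed parameter, not one specified in the disclosure.

```python
import numpy as np

def frames_to_image(feature_frames, context=5):
    # Stack `context` consecutive feature frames into one 2-D "image"
    # so a CNN can treat each time-frequency patch like a picture.
    n = len(feature_frames) - context + 1
    return np.stack([feature_frames[i:i + context] for i in range(n)])
```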
- The artificial intelligence model of the present disclosure may include an event detection algorithm with an acoustic structure that connects a CNN and an RNN (LSTM).
- The algorithm of the present disclosure may include the FFast R-CNN-LSTM model 31, the SFast R-CNN-LSTM model 32, and an artificial intelligence model 33 in which differential features are reflected.
- The FFast R-CNN-LSTM model 31 may be a model pre-trained on unlabeled training data over the full band height of the existing Fast R-CNN-LSTM.
- The fast regional convolutional neural network (Fast R-CNN) was proposed to remove the bottleneck of R-CNN, in which bounding-box proposals are generated at every location in the image where feature data exists.
- Instead of passing each bounding-box proposal through the CNN separately, the entire image passes through the CNN once and object detection is performed at the output feature map stage.
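The point above, that the whole image passes through the CNN once and each bounding-box proposal is then read off the shared feature map, hinges on RoI pooling. Below is a minimal NumPy sketch of RoI max pooling; the function name and the 2x2 output grid are illustrative choices, not details from the disclosure.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    # Max-pool one region of interest (y0, x0, y1, x1) of a shared CNN
    # feature map into a fixed-size grid, as Fast R-CNN's RoI pooling does.
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    oh, ow = out_size
    ys = np.linspace(0, region.shape[0], oh + 1).astype(int)
    xs = np.linspace(0, region.shape[1], ow + 1).astype(int)
    out = np.empty(out_size)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

Because every proposal pools from the same feature map, the expensive convolutional pass runs once per image rather than once per proposal.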
- The present disclosure may also include the SFast R-CNN-LSTM model 32.
- The SFast R-CNN-LSTM model 32 can extract by itself the regions representing the characteristics of the sound source in the spectrum.
- To train the SFast R-CNN-LSTM model 32, the present disclosure may select suitable training data that obtained a score at or above a preset value in a test of the FFast R-CNN-LSTM.
- The SFast R-CNN-LSTM model 32 of the present disclosure may be a Fast R-CNN-LSTM that uses semi-supervised region labeling; because it uses selected training data, the hand-labeling cost can be minimized.
- the sensing system of the present disclosure may include an artificial intelligence model 33 in which differential features are reflected.
- the artificial intelligence model 33 reflecting the differential feature may be trained in parallel with the SFast R-CNN-LSTM model 32 .
- A single event detection can be performed by sharing some of the acoustic detection layers with the SFast R-CNN-LSTM model 32.
- some layers may include a CTC algorithm layer.
- The artificial intelligence model 33 in which the differential feature is reflected includes a layer connecting a CNN and an LSTM; when the differential feature is provided to this connected CNN-RNN (LSTM) algorithm, a feature vector reflecting the differential feature is generated.
- The differential feature is the amount of change obtained by subtracting the image of the sound data at time T-1 from the image of the sound data at time T, both extracted from the sound source, and the artificial intelligence model 33 can be trained using this differential feature value.
- FIG. 3 shows the process of the sound sensing system of the present disclosure.
- The sound sensing system of the present disclosure may receive sound data including at least one sound source and extract a composite feature value from it (S20). The static feature among the extracted composite feature values is input to the FFast R-CNN-LSTM model 31, and the output of the FFast R-CNN-LSTM model 31 may reset the region of interest (RoI) preset in step S20.
- The sensing system of the present disclosure provides the static feature of the acoustic data to the SFast R-CNN-LSTM model 32, extracts a feature vector for the reconfigured region of interest from the convolutional feature map of the static feature, and classifies the sound sources included in the composite feature value.
- In parallel with the SFast R-CNN-LSTM model 32, the artificial intelligence model 33 in which differential features are reflected extracts a feature vector from the convolutional feature map of the differential feature of the acoustic data and classifies the sound sources included in the composite feature value.
- The artificial intelligence model 33 reflecting the differential feature and the SFast R-CNN-LSTM model 32 reflecting the static feature share a specific layer; each receives at least one sound source included in the composite feature value, classifies it (S30), and an event is detected based on the combination of the classified sound sources (S40).
- The shared layer may include an Attention-LSTM algorithm, in which time-series and regional features are reflected, and a connectionist temporal classification (CTC) algorithm.
- The Attention-LSTM algorithm and the CTC algorithm are included in each of the models 31, 32, and 33 shown in FIG. 2 to perform sound sensing.
- Event detection in the present disclosure must be able to handle events of various lengths in the obtained acoustic data.
- The present disclosure extracts regional features of the sound data from a feature map and applies an attention algorithm that assigns weights to the regional features, increasing sound detection accuracy.
- the attention algorithm can be trained by unsupervised learning that does not require labeling.
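A minimal sketch of the attention weighting described above: regional feature vectors are scored against a query vector and combined with softmax weights. In the actual model the query would be learned; here it is an arbitrary input, so the numbers are purely illustrative.

```python
import numpy as np

def attention_pool(features, query):
    # Score each regional feature vector against the query, turn the
    # scores into softmax weights, and return the weighted sum.
    scores = features @ query                  # (n_regions,)
    w = np.exp(scores - scores.max())          # stable softmax
    w /= w.sum()
    return w @ features, w                     # pooled vector, weights
```

Regions most similar to the query dominate the pooled vector, which is how attention lets the model focus on the spectral regions that characterize an event.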
- The sensing system of the present disclosure may detect a final event by labeling the acoustic data frame-wise (per time step) using connectionist temporal classification (CTC).
- For example, the detection system can output "Traffic Accident Occurrence" as the final event from three frames of obtained sound data.
- Connectionist Temporal Classification may be used as a method of determining the final event.
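The frame-wise CTC labeling can be illustrated with the standard greedy decoding rule: merge consecutive repeated labels, then drop blanks. The event names used below are hypothetical; the disclosure does not list its label set.

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    # Collapse per-frame labels into an event sequence the CTC way:
    # merge consecutive repeats, then remove blank symbols.
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```

The blank symbol is what lets CTC represent events of different lengths and separate genuinely repeated events from one long one.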
- FIG. 6 shows results with and without the post-processing of the present disclosure; the F1 score is higher when post-processing is applied.
- The F1 score is highest when composite feature values and the attention algorithm are used.
- FIGS. 7 and 8 show statistics indicating that the algorithm of the present disclosure performs best when a composite feature value including both a static feature and a differential feature is used.
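The F1 score referenced in FIGS. 6 to 8 is the harmonic mean of precision and recall; for reference, computed from raw detection counts:

```python
def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall, from true-positive,
    # false-positive and false-negative counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```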
- the present disclosure described above can be implemented as computer-readable code on a medium in which a program is recorded.
- Computer-readable media include all kinds of recording devices that store data readable by a computer system, for example a hard disk drive (HDD), solid-state disk (SSD), silicon disk drive (SDD), ROM, RAM, CD-ROM, magnetic tape, floppy disk, or optical data storage device.
- the computer may include a processor of the terminal.
Abstract
The present invention relates to a method that: extracts multiple features from the acoustic data of interest within acoustic data containing one or more sound sources; classifies the sound source(s) included in the multiple features using an artificial intelligence model based on a fast region-CNN long short-term memory network (Fast R-CNN-LSTM); and detects an event using the classified sound sources.
Applications Claiming Priority (4)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201962940058P | 2019-11-25 | 2019-11-25 | |
| US62/940,058 | 2019-11-25 | | |
| KR10-2020-0035181 | 2020-03-23 | | |
| KR1020200035181A (KR102314824B1) | 2019-11-25 | 2020-03-23 | Method for detecting acoustic events in a deep-learning-based sensing situation |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| WO2021107333A1 | 2021-06-03 |
Family
ID=76129728

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/KR2020/010760 (WO2021107333A1) | Method for detecting acoustic events in a deep-learning-based sensing environment | | 2020-08-19 |

Country Status (1)

| Country | Link |
|---|---|
| WO | WO2021107333A1 |
Citations (5)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2007114413A | 2005-10-19 | 2007-05-10 | Toshiba Corp | Speech/non-speech discrimination device and method, and speech-section detection device and method |
| US20160099010A1 | 2014-10-03 | 2016-04-07 | Google Inc. | Convolutional, long short-term memory, fully connected deep neural networks |
| KR20170022445A | 2015-08-20 | 2017-03-02 | Samsung Electronics Co., Ltd. | Speech recognition apparatus and method based on an integrated model |
| KR101794543B1 | 2016-04-18 | 2017-11-08 | Sehwa Co., Ltd. | System for detecting and identifying railroad-switch failures through sound analysis |
| KR102006206B1 | 2017-08-14 | 2019-08-01 | Autosemantics Co., Ltd. | Acoustic-based water-supply leak diagnosis method using deep learning |
2020-08-19: PCT/KR2020/010760 filed as WO2021107333A1 (active application filing)
Legal Events
- 121: EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20893944; Country: EP; Kind code: A1)
- NENP: Non-entry into the national phase (Ref country code: DE)
- 32PN: EP: public notification in the EP bulletin, as the address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05.09.2022))
- 122: EP: PCT application non-entry in European phase (Ref document number: 20893944; Country: EP; Kind code: A1)