KR102626550B1

KR102626550B1 - Deep learning-based environmental sound classification method and device

Info

Publication number: KR102626550B1
Application number: KR1020210038804A
Authority: KR
Inventors: 김기두; 셴 굽따 샨또누
Original assignee: 국민대학교산학협력단
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2024-01-18
Also published as: KR20220133552A

Abstract

The present invention relates to a deep learning-based environmental sound classification method and device. The method includes: a data collection step of collecting raw signals of sounds to generate a data set; a data preprocessing step of generating, by sampling the raw signals, a plurality of sound frames each having a predetermined time length; a data augmentation step of augmenting the data by separating the signal of each of the sound frames into phase and magnitude components; a feature scaling step of converting the augmented data into values within a predetermined range; and a model building step of training on the converted data to build a learning model for environmental sound classification (ESC).

Description

Deep learning-based environmental sound classification method and device {DEEP LEARNING-BASED ENVIRONMENTAL SOUND CLASSIFICATION METHOD AND DEVICE}

The present invention relates to environmental sound classification technology, and more specifically to a deep learning-based environmental sound classification method and device that can effectively classify ambient sounds by building a learning model with low computational complexity under a very limited environment.

Recently, much research has been conducted on music and speech signal processing. Such research may include music tagging or genre classification, music information retrieval analysis of rhythm, harmonic analysis, and other low-level or high-level analyses. For speech signals, research has focused on speaker identification, speech-to-text conversion and vice versa, automatic speech recognition, and the like. In contrast, little research has been conducted on environmental sound classification (ESC). The classification or tagging of other ambient sounds may correspond to ESC.

In fact, ESC is one part of environmental sound processing. The two other areas of environmental sound processing are acoustic scene classification and acoustic event detection. Acoustic scene classification focuses on classifying a sound recording into a single scene tag such as 'indoor', 'outdoor', or 'home'. Acoustic event detection aims to predict the start and end points of a single sound label within the entire audio.

ESC, or environmental sound tagging, may operate to detect a single sound source label in the input audio. In addition to speech and music, environmental sounds can convey a great deal of information, particularly about the surrounding environment. Sound signals of this kind can contribute significantly to evolving urban acoustic monitoring devices, intelligent audio-based surveillance systems, environmental context-aware processing, automatic crime scene investigation, and the like. Beyond the development of surveillance and security systems, ESC can be used for a variety of everyday purposes, including searching and indexing large multimedia catalogs, context-aware portable devices, robots that interact skillfully with their environment, and audiovisual safety equipment. Above all, ESC can play an important role at the edge of the Internet of Things (IoT).

Korean Patent Application Publication No. 10-2013-0117844 (2013.10.28)

An embodiment of the present invention seeks to provide a deep learning-based environmental sound classification method and device that can effectively classify ambient sounds by building a learning model with low computational complexity under a very limited environment.

An embodiment of the present invention seeks to provide a deep learning-based environmental sound classification method and device that can improve the performance of environmental sound classification by building independent models with similar architectures and selectively applying each model according to the operating environment.

Among embodiments, a deep learning-based environmental sound classification method includes: a data collection step of collecting raw signals of sounds to generate a data set; a data preprocessing step of generating, by sampling the raw signals, a plurality of sound frames each having a predetermined time length; a data augmentation step of augmenting the data by separating the signal of each of the sound frames into phase and magnitude components; a feature scaling step of converting the augmented data into values within a predetermined range; and a model building step of training on the converted data to build a learning model for environmental sound classification (ESC).

The data collection step may include determining ESC-10 or US-8K as the data set.

The data preprocessing step may include: generating sampled data by sampling the raw signal at a predetermined frequency; and successively generating, from the sampled data, 1-second frames each having a time length of 1 second (s) while maintaining a predetermined hop length.

The data preprocessing step may include generating a gammatone spectrogram (GTS) corresponding to the raw signal based on the 1-second frames.

The data augmentation step may include: an analysis step of expressing an input signal as a summation of the output signals of a plurality of band-pass filters and performing a short-time Fourier transform (STFT) on each of the output signals to separate it into the phase and magnitude components; a processing step of performing time stretching and pitch shifting based on the phase and magnitude components of each of the output signals; and a synthesis step of synthesizing the signals produced by the processing step for the output signals to generate at least one augmented signal corresponding to the input signal.

The feature scaling step may include applying a minimization technique, including z-score normalization, min-max scaling, and mean normalization, to redistribute the features of the augmented data so that they are distributed according to a predetermined mean and standard deviation.
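The three scaling techniques named above can be illustrated with a minimal sketch. It uses only the Python standard library; the function names and the toy feature values are assumptions made for illustration and do not come from the patent, and the input is assumed to be non-constant so that no division by zero occurs.

```python
from statistics import fmean, pstdev

def z_score(xs):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max(xs, lo=0.0, hi=1.0):
    """Map the values linearly onto the range [lo, hi]."""
    x_min, x_max = min(xs), max(xs)
    return [lo + (x - x_min) * (hi - lo) / (x_max - x_min) for x in xs]

def mean_normalize(xs):
    """Center on the mean, then divide by the value range (max - min)."""
    mu, span = fmean(xs), max(xs) - min(xs)
    return [(x - mu) / span for x in xs]

# Hypothetical feature values, for illustration only.
features = [2.0, 4.0, 6.0, 8.0]
scaled = z_score(features)
```

Z-score normalization fixes the mean and standard deviation of the result, whereas min-max scaling fixes its range; which one is preferable depends on what the downstream model expects.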

The model building step may include building, as the learning model, a 1D CNN model that receives the raw signal as input and a 2D CNN model that receives the gammatone spectrogram as input.

The model building step may include performing the convolution of the 1D CNN model along the time axis and performing the convolution of the 2D CNN model along the time and frequency axes.
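The difference between the two convolution modes can be shown with a small pure-Python sketch. It is not the patent's implementation: the kernels and input values are hypothetical, no padding is applied, and the 2D input is assumed to be laid out with rows as frequency bands and columns as time steps. As in most deep learning frameworks, the operation is computed as a cross-correlation.

```python
def conv1d(signal, kernel, stride=1):
    """Slide the kernel along the single (time) axis of a raw signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(signal) - k + 1, stride)]

def conv2d(image, kernel, stride=(1, 1)):
    """Slide the kernel along both axes (frequency rows, time columns)."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(image), len(image[0])
    return [[sum(image[r + a][c + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for c in range(0, cols - kw + 1, stride[1])]
            for r in range(0, rows - kh + 1, stride[0])]

# 1D case: five raw samples, a 3-tap kernel moving over time only.
out_1d = conv1d([1, 2, 3, 4, 5], [1, 0, -1])

# 2D case: a 3 x 4 "spectrogram", a 2 x 2 kernel, stride 1 in frequency and 2 in time.
spectrogram = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12]]
out_2d = conv2d(spectrogram, [[1, 1], [1, 1]], stride=(1, 2))
```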

The model building step may include configuring each of the 1D CNN model and the 2D CNN model to include five convolution blocks, and each convolution block may include convolution, activation, and batch normalization layers.

The model building step may include applying a common rule for the convolution layer structure to each of the 1D CNN model and the 2D CNN model, and the common rule may include '[receptive field / stride, number of filters] × number of repetitions'.
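One way to read the '[receptive field / stride, number of filters] × number of repetitions' rule is as a compact layer specification. The sketch below encodes such a specification and traces the output shape of a stack of unpadded 1D convolutions. The `ConvSpec` type, the layer values, and the input length of 22,500 samples are illustrative assumptions; the actual layer values belong to the model figures and are not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConvSpec:
    receptive_field: int  # kernel size along the time axis
    stride: int
    filters: int          # number of output channels
    repeats: int = 1      # how many times the layer is repeated

def output_shapes(input_length, specs):
    """Trace (length, channels) through unpadded 1D convolutions."""
    shapes = []
    length = input_length
    for spec in specs:
        for _ in range(spec.repeats):
            # Standard output-length formula for a convolution without padding.
            length = (length - spec.receptive_field) // spec.stride + 1
            shapes.append((length, spec.filters))
    return shapes

# Hypothetical specification: [64/2, 16] x 1 followed by [3/1, 32] x 2.
specs = [ConvSpec(64, 2, 16), ConvSpec(3, 1, 32, repeats=2)]
shapes = output_shapes(22500, specs)
```

With these assumed values the traced shapes are (11219, 16), (11217, 32), and (11215, 32); a larger stride in early layers is the main lever for shrinking a long raw-audio input quickly.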

The model building step may include applying the same stride to every convolution block of the 1D CNN model, and alternately applying strides of different sizes to the convolution blocks of the 2D CNN model.

The model building step may include configuring each of the 1D CNN model and the 2D CNN model to include four dense blocks, and each dense block may include flatten and activation layers.

The method may further include an environmental sound classification step of performing a classification operation on a given environmental sound based on the learning model.

The environmental sound classification step may include, when a 1D CNN model and a 2D CNN model have each been built as the learning model, selectively applying one of the 1D CNN model and the 2D CNN model according to the environment in which the classification operation is performed.

Among embodiments, a deep learning-based environmental sound classification device includes: a data collection unit that collects raw signals of sounds to generate a data set; a data preprocessing unit that generates, by sampling the raw signals, a plurality of sound frames each having a predetermined time length; a data augmentation unit that augments the data by separating the signal of each of the sound frames into phase and magnitude components; a feature scaling unit that converts the augmented data into values within a predetermined range; and a model building unit that trains on the converted data to build a learning model for environmental sound classification (ESC).

The disclosed technology may have the following effects. However, this does not mean that a specific embodiment must include all of the following effects or only the following effects, so the scope of rights of the disclosed technology should not be understood as being limited thereby.

The deep learning-based environmental sound classification method and device according to an embodiment of the present invention can effectively classify ambient sounds by building a learning model with low computational complexity under a very limited environment.

The deep learning-based environmental sound classification method and device according to an embodiment of the present invention can improve the performance of environmental sound classification by building independent models with similar architectures and selectively applying each model according to the operating environment.

FIG. 1 is a diagram illustrating an environmental sound classification system according to the present invention.
FIG. 2 is a diagram illustrating the system configuration of the environmental sound classification device of FIG. 1.
FIG. 3 is a diagram illustrating the functional configuration of the environmental sound classification device of FIG. 1.
FIG. 4 is a flowchart illustrating the deep learning-based environmental sound classification method according to the present invention.
FIG. 5 is a diagram illustrating the resampling and framing operations on an input signal according to the present invention.
FIG. 6 is a diagram illustrating a gammatone spectrogram according to the present invention.
FIG. 7 is a diagram illustrating the data augmentation process according to the present invention.
FIG. 8 is a diagram illustrating the structure of a 1D CNN model according to the present invention.
FIG. 9 is a diagram illustrating the structure of a 2D CNN model according to the present invention.
FIG. 10 is a diagram illustrating an environmental sound classification system using a CNN model according to the present invention.

The description of the present invention is merely a set of embodiments for structural or functional explanation, so the scope of the present invention should not be construed as limited by the embodiments described herein. That is, since the embodiments may be modified in various ways and may take various forms, the scope of the present invention should be understood to include equivalents capable of realizing its technical idea. In addition, the objects or effects presented in the present invention do not mean that a specific embodiment must include all of them or only those effects, so the scope of the present invention should not be understood as being limited thereby.

Meanwhile, the terms used in this application should be understood as follows.

Terms such as "first" and "second" are used to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be named a second component, and similarly, a second component may be named a first component.

When a component is referred to as being "connected" to another component, it should be understood that it may be directly connected to that other component, but that other components may also exist in between. In contrast, when a component is referred to as being "directly connected" to another component, it should be understood that no other component exists in between. Other expressions that describe the relationship between components, such as "between" and "directly between" or "neighboring" and "directly neighboring", should be interpreted in the same way.

Singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Identification codes for the steps (for example, a, b, c) are used for convenience of explanation and do not describe the order of the steps. Unless a specific order is clearly stated in context, the steps may occur in an order different from the stated one. That is, the steps may occur in the stated order, may be performed substantially simultaneously, or may be performed in the reverse order.

The present invention can be implemented as computer-readable code on a computer-readable recording medium, and the computer-readable recording medium includes all types of recording devices that store data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. In addition, the computer-readable recording medium may be distributed across computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner.

Unless otherwise defined, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted as having an ideal or excessively formal meaning unless clearly defined in this application.

FIG. 1 is a diagram illustrating an environmental sound classification system according to the present invention.

Referring to FIG. 1, the environmental sound classification system 100 may include a user terminal 110, an environmental sound classification device 130, and a database 150.

The user terminal 110 may correspond to a computing device that can provide data on environmental sounds and check the classification results for them. For example, through the user terminal 110, a user may input a sound signal as input data to the learning model for environmental sound classification and check the classification result output by the learning model. In one embodiment, the user terminal 110 may generate the sound signal directly. For example, the user terminal 110 may include a microphone module that records surrounding sounds and may thereby collect the surrounding sounds as environmental sounds.

The user terminal 110 may be implemented as a smartphone, laptop, or computer that operates in connection with the environmental sound classification device 130, but is not necessarily limited thereto and may also be implemented as various other devices such as a tablet PC. The user terminal 110 may be connected to the environmental sound classification device 130 through a wired or wireless network, and a plurality of user terminals 110 may be connected to the environmental sound classification device 130 at the same time.

The environmental sound classification device 130 may be implemented as a server corresponding to a computer or program that can build a learning model for environmental sound classification and classify environmental sounds based on that model. The environmental sound classification device 130 may be connected to the user terminal 110 through a wireless network such as Bluetooth or WiFi, and may exchange data with the user terminal 110 through the network. The environmental sound classification device 130 may also be implemented to operate in conjunction with a separate external system (not shown in FIG. 1) in order to collect data or provide additional functions.

The database 150 may correspond to a storage device that stores the various information required during the operation of the environmental sound classification device 130. For example, the database 150 may store training data collected from various sources and information about the learning models built through training. It is not necessarily limited thereto, and may store information collected or processed in various forms while the environmental sound classification device 130 performs the deep learning-based environmental sound classification method.

In FIG. 1, the database 150 is shown as a device independent of the environmental sound classification device 130. However, it is not necessarily limited thereto and may of course be implemented as a logical storage device included in the environmental sound classification device 130.

FIG. 2 is a diagram illustrating the system configuration of the environmental sound classification device of FIG. 1.

Referring to FIG. 2, the environmental sound classification device 130 may be implemented to include a processor 210, a memory 230, a user input/output unit 250, and a network input/output unit 270.

The processor 210 may execute the procedures that process each step of the operation of the environmental sound classification device 130, may manage the memory 230 that is read or written throughout that process, and may schedule the synchronization time between the volatile and non-volatile memory in the memory 230. The processor 210 may control the overall operation of the environmental sound classification device 130 and is electrically connected to the memory 230, the user input/output unit 250, and the network input/output unit 270 to control the data flow among them. The processor 210 may be implemented as the central processing unit (CPU) of the environmental sound classification device 130.

The memory 230 may include an auxiliary storage device, implemented as non-volatile memory such as a solid state drive (SSD) or hard disk drive (HDD), that stores all the data required by the environmental sound classification device 130, and may include a main storage device implemented as volatile memory such as random access memory (RAM).

The user input/output unit 250 may include an environment for receiving user input and an environment for outputting specific information to the user. For example, the user input/output unit 250 may include an input device with an adapter such as a touch pad, touch screen, on-screen keyboard, or pointing device, and an output device with an adapter such as a monitor or touch screen. In one embodiment, the user input/output unit 250 may correspond to a computing device accessed through a remote connection, in which case the environmental sound classification device 130 may operate as an independent server.

The network input/output unit 270 includes an environment for connecting to external devices or systems through a network and may include, for example, adapters for communication over a local area network (LAN), metropolitan area network (MAN), wide area network (WAN), or value added network (VAN).

FIG. 3 is a diagram illustrating the functional configuration of the environmental sound classification device of FIG. 1.

Referring to FIG. 3, the environmental sound classification device 130 may include a data collection unit 310, a data preprocessing unit 320, a data augmentation unit 330, a feature scaling unit 340, a model building unit 350, an environmental sound classification unit 360, and a control unit (not shown in FIG. 3).

The data collection unit 310 may collect raw signals of sounds to generate a data set. That is, the data set may correspond to a set of raw signals of environmental sounds. In one embodiment, the data collection unit 310 may determine ESC-10 or US-8K as the data set. The ESC-10 data set may correspond to a sound data set in which the acoustic data carries labels classified into a total of 10 classes. The US-8K (UrbanSound-8K) data set may correspond to a sound data set in which 10 types of sound are recorded for 4 seconds. Both ESC-10 and US-8K may consist of sound data sampled at 44.1 kHz. The data collection unit 310 may collect ambient sounds directly through a separate sound collection device, or may use a previously built data set as needed.

The data preprocessing unit 320 may generate, by sampling the raw signal, a plurality of sound frames each having a predetermined time length. The data preprocessing unit 320 may perform the sampling operation at various sampling rates, and the sampled sound data may be used as input to the learning model. For example, the sampling rate may be 8 kHz, 16 kHz, 22.5 kHz, or the like; the description here assumes that sound is sampled at 22.5 kHz. Accordingly, a frame of 1 second contains 22,500 samples, and the input shape of the 1D CNN model may correspond to 22,500 (samples).
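The rate conversion described above can be sketched as follows. This is a deliberately simple illustration, not the resampling method of the patent, which this passage does not specify: it uses linear interpolation with no anti-aliasing filter, so a production pipeline would normally substitute a band-limited resampler. The function name and the silent test clip are assumptions.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation between neighbouring source samples.

    Rates are integers in Hz. No low-pass filtering is applied, so
    downsampling real audio this way may introduce aliasing.
    """
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional index into the source
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append((1.0 - frac) * samples[left] + frac * samples[right])
    return out

# One second of (silent) audio at 44.1 kHz becomes 22,500 samples at 22.5 kHz.
clip = [0.0] * 44100
resampled = resample_linear(clip, 44100, 22500)
```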

In one embodiment, the data preprocessing unit 320 may sample the raw signal at a predetermined frequency to generate sampling data, and may successively generate 1-second frames, each with a time length of 1 second (s), from the sampling data while maintaining a predetermined hop length. This is described in more detail with reference to FIG. 5.

In one embodiment, the data preprocessing unit 320 may generate a gammatone spectrogram (GTS) corresponding to the raw signal based on the 1-second frames. This is described in more detail with reference to FIG. 6.

The data augmentation unit 330 may augment the data by separating the signal of each of the plurality of sound frames into phase and magnitude components. Data augmentation is a very common technique in deep learning-based image processing. A main advantage of data augmentation in image processing is that the human eye can easily detect various types of image patterns. By contrast, it can be difficult for the human ear to accurately gauge sound signal patterns such as frequency estimates or SNR values.

In addition to generating realistic examples, data augmentation may also be used to increase the number of training samples per class, which makes it possible to build machine learning models with better accuracy.

In one embodiment, the data augmentation unit 330 may perform data augmentation broadly through analysis, processing, and synthesis stages. More specifically, in the analysis stage, the data augmentation unit 330 may express the input signal as a summation of the output signals of a plurality of band-pass filters and may perform a short-time Fourier transform (STFT) on each output signal to separate it into phase and magnitude components. This is described in more detail with reference to FIG. 7.

In the processing stage, the data augmentation unit 330 may perform time stretching and pitch shifting based on the phase and magnitude components of each output signal. Time stretching aims to change the time profile of a signal while keeping the phase profile the same, and pitch shifting aims to change the pitch profile while keeping the time length the same. In the synthesis stage, the data augmentation unit 330 may synthesize the signals produced by the processing stage for the output signals to generate at least one augmented signal corresponding to the input signal. That is, the data augmentation unit 330 may generate a plurality of augmented signals from the same augmented data by applying various methods in the process of combining the augmented data into one. This is described in more detail with reference to FIG. 7.

The feature scaling unit 340 may convert the augmented data into values within a predetermined range. That is, the feature scaling unit 340 may contribute to better performance by scaling the feature set before the data is provided as input to the CNN model.

In one embodiment, the feature scaling unit 340 may apply a minimization technique, including z-score normalization, min-max scaling, and mean normalization, to redistribute the features of the augmented data so that they follow a predetermined mean and standard deviation. For example, the feature scaling unit 340 may perform z-score normalization according to Equation 1 below.

[Equation 1]

$$y' = \frac{y - \bar{y}}{\sigma_y}$$

Here, y and y' are the input data and the scaled data, respectively, $\bar{y}$ is the mean of y, and $\sigma_y$ is the standard deviation of y. Through z-score normalization, the feature scaling unit 340 may redistribute the features so that the mean is μ = 0 and the standard deviation is σ = 1.
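As an illustrative, non-limiting sketch of z-score normalization, the following Python code subtracts the mean from each value and divides by the standard deviation. The function name `z_score` and the use of the population standard deviation are assumptions made for this example; the description does not state which estimator is used.

```python
import statistics


def z_score(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation (divide by N, not N - 1).
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("z-score is undefined for constant input")
    return [(v - mean) / std for v in values]


# Example input with mean 5 and population standard deviation 2.
scaled = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After scaling, the values have a mean of 0 and a standard deviation of 1, whatever the original range of the features was.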

The model building unit 350 may build a learning model for environmental sound classification (ESC) by training on the data transformed through the preprocessing and scaling steps. To achieve better classification performance, the model building unit 350 may build the learning model by adopting at least one of various learning models according to the operating environment or the data characteristics.

In one embodiment, the model building unit 350 may build, as learning models, a 1D CNN model that receives the raw signal as input and a 2D CNN model that receives the gammatone spectrogram as input. The model building unit 350 may build independent learning models that use different types of input data, and may select and apply a learning model adaptively to the processing environment of the environmental sound classification. In addition, the model building unit 350 may build the learning models with mutually similar network structures so that their performance can be compared. This is described in more detail with reference to FIGS. 8 and 9.

The environmental sound classification unit 360 may perform a classification operation on a given environmental sound based on the learning model. In one embodiment, when the 1D CNN model and the 2D CNN model have each been built as learning models, the environmental sound classification unit 360 may selectively apply one of the two according to the environment in which the classification operation is performed. That is, the 1D CNN model can process the classification computation faster based on a low-complexity input signal, whereas the 2D CNN model can produce more accurate classification results by performing environmental sound classification in the traditional manner. The environmental sound classification unit 360 may increase the efficiency of environmental sound classification by selecting one of the independently built learning models according to the classification purpose, conditions, and the like.

In one embodiment, the environmental sound classification unit 360 may calculate a load state based on the number and size of the given environmental sounds, the network speed, and the performance of the computation module, and may select the 1D CNN model from among the built learning models to perform environmental sound classification when the load state exceeds a preset threshold. Conversely, when the load state is below the preset threshold, the environmental sound classification unit 360 may select the 2D CNN model to perform environmental sound classification.
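The threshold-based selection can be sketched in Python as follows. This is a hedged illustration only: the description does not give a formula for the load state, so `Workload`, `estimate_load`, and its transfer-time-plus-processing-cost formula are hypothetical, as is treating a load exactly equal to the threshold as the 2D CNN case.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    num_clips: int          # number of environmental sounds to classify
    total_megabytes: float  # combined size of those sounds
    network_mbps: float     # available network speed
    compute_score: float    # relative speed of the computation module


def estimate_load(w):
    """Hypothetical load score: transfer time plus a per-clip processing cost."""
    transfer = w.total_megabytes * 8.0 / w.network_mbps
    processing = w.num_clips / w.compute_score
    return transfer + processing


def select_model(load, threshold):
    """Pick the lighter 1D CNN under heavy load, otherwise the 2D CNN."""
    # Assumption: a load exactly equal to the threshold selects the 2D CNN.
    return "1d_cnn" if load > threshold else "2d_cnn"


light = Workload(num_clips=2, total_megabytes=1.0, network_mbps=100.0, compute_score=10.0)
heavy = Workload(num_clips=500, total_megabytes=400.0, network_mbps=20.0, compute_score=5.0)
choice_light = select_model(estimate_load(light), threshold=10.0)
choice_heavy = select_model(estimate_load(heavy), threshold=10.0)
```

Keeping the load estimate separate from the selection rule allows the scoring formula to be replaced without touching the decision logic.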

The control unit (not shown in FIG. 3) may control the overall operation of the environmental sound classification device 130 and may manage the control flow or data flow among the data collection unit 310, the data preprocessing unit 320, the data augmentation unit 330, the feature scaling unit 340, the model building unit 350, and the environmental sound classification unit 360.

FIG. 4 is a flowchart illustrating the deep learning-based environmental sound classification method according to the present invention.

Referring to FIG. 4, the environmental sound classification device 130 may collect raw signals of sounds through the data collection unit 310 to generate a data set (step S410). The environmental sound classification device 130 may generate, through the data preprocessing unit 320, a plurality of sound frames sampled from the raw signal and each having a predetermined time length (step S430).

In addition, the environmental sound classification device 130 may augment the data through the data augmentation unit 330 by separating the signal of each of the plurality of sound frames into phase and magnitude components (step S450). The environmental sound classification device 130 may convert the augmented data into values within a predetermined range through the feature scaling unit 340 (step S470).

The environmental sound classification device 130 may build a learning model for environmental sound classification (ESC) by training on the converted data through the model building unit 350 (step S490).

The environmental sound classification device 130 according to the present invention may process operations related to environmental sound classification using the learning model built through the above process. In particular, the environmental sound classification device 130 may independently build learning models of different complexity and may process environmental sound classification adaptively by selectively applying a learning model according to the classification environment.

FIG. 5 is a diagram illustrating the resampling and framing of an input signal according to the present invention.

Referring to FIG. 5, the environmental sound classification device 130 may generate sampling data through the data preprocessing unit 320 by sampling the raw signal at a predetermined frequency. For example, the data preprocessing unit 320 may sample the raw signal at a frequency of 22.5 kHz (22,500 Hz) so that the size of the data input to the learning model is 22,500 samples.

In FIG. 5, the data preprocessing unit 320 may successively generate 1-second frames (1st frame, 2nd frame, 3rd frame), each with a time length of 1 second (s) (corresponding to the arrow length in FIG. 5), from the sampling data while maintaining a hop length of 8,000. Here, the hop length may correspond to the size of the overlap region between frames and may be set to about 35% of the data size (in samples), but is not necessarily limited thereto. The 1-second frames generated by the data preprocessing unit 320 may later be used to generate a 2D spectrogram.
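The framing operation can be sketched in Python as follows, using the 22,500-sample frame length and the hop length of 8,000 mentioned above. The function name `frame_signal` is hypothetical. The sketch assumes that the hop length is the offset between the starts of consecutive frames (the usual signal-processing meaning) and that a trailing partial frame is discarded; neither point is confirmed by the description.

```python
def frame_signal(samples, frame_len=22_500, hop_len=8_000):
    """Split a 1-D sequence into fixed-length frames whose starts are hop_len apart."""
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# A 4-second clip at 22,500 samples per second (a ramp stands in for audio).
clip = list(range(4 * 22_500))
frames = frame_signal(clip)
```

Under these assumptions, consecutive frames share 22,500 − 8,000 = 14,500 samples, and a 4-second clip yields nine complete 1-second frames.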

FIG. 6 is a diagram illustrating a gammatone spectrogram according to the present invention.

Referring to FIG. 6, the environmental sound classification device 130 may generate, through the data preprocessing unit 320, a gammatone spectrogram (GTS) corresponding to the raw signal based on the 1-second frames. The data preprocessing unit 320 may also use a Mel scale representation instead of the gammatone filter representation, but a detailed description is omitted here.

The gammatone spectrogram generated by the data preprocessing unit 320 may be used as training data for the 2D CNN model. The term gammatone derives from the product of a sinusoidal tone and a gamma distribution, and can be expressed mathematically as Equation 2 below.

[Equation 2]

Here, a is the signal amplitude, f₀ is the center frequency in Hz, φ is the carrier phase (in radians), b is the bandwidth of the filter (Hz), and t is the time (in seconds).

In addition, the bandwidth (Hz) at the center frequency f₀ (kHz) may be calculated through Equation 3 below.

[Equation 3]

Here, ERB is the equivalent rectangular bandwidth.

Then, the gammatone spectrogram may be generated by multiplying the band-pass filterbank obtained through Equations 2 and 3 above by the FFT-based spectrogram. Panel (a) of FIG. 6 shows the filterbank on the ERB scale, and panels (b) and (c) show the GTS of two representative classes in the data set used. That is, panel (b) shows the GTS for a sound of the 'children playing' class, and panel (c) shows the GTS for a sound of the 'street music' class.
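For illustration only, the following Python sketch evaluates a single gammatone impulse response. It assumes the textbook form a·t^(n−1)·e^(−2πbt)·cos(2πf₀t + φ) and the Glasberg-Moore approximation ERB = 24.7(4.37·f₀ + 1) with f₀ in kHz; the filter order n (default 4), the function names, and the use of the ERB directly as the bandwidth b are assumptions that may differ in detail from Equations 2 and 3.

```python
import math


def erb_hz(center_khz):
    """Glasberg-Moore equivalent rectangular bandwidth (Hz) for a center frequency in kHz."""
    return 24.7 * (4.37 * center_khz + 1.0)


def gammatone(t, f0_hz, bandwidth_hz, amplitude=1.0, order=4, phase=0.0):
    """Textbook gammatone impulse response evaluated at time t (seconds)."""
    if t < 0:
        return 0.0  # the filter is causal
    # Gamma-distribution-shaped envelope multiplied by a sinusoidal tone.
    envelope = amplitude * t ** (order - 1) * math.exp(-2.0 * math.pi * bandwidth_hz * t)
    return envelope * math.cos(2.0 * math.pi * f0_hz * t + phase)


# One filter centered at 1 kHz, sampled at 22,500 Hz for 10 ms.
bandwidth = erb_hz(1.0)  # simplification: the ERB is used directly as b
impulse = [gammatone(i / 22_500, 1000.0, bandwidth) for i in range(225)]
```

A filterbank would repeat this for a set of center frequencies spaced on the ERB scale.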

FIG. 7 is a diagram illustrating the data augmentation process according to the present invention.

Referring to FIG. 7, the environmental sound classification device 130 may perform, through the data augmentation unit 330, a data augmentation operation consisting of three stages.

More specifically, in the analysis stage, the signal f(t) may be expressed as the sum of the outputs of a series of band-pass filters BP_1, ..., BP_N. If f_n(t) is the output of the n-th of the N band-pass filters, the input signal may be expressed as Equation 4 below. In addition, performing a short-time Fourier transform (STFT) makes it possible to separate the phase and magnitude parts through Equations 5 to 7 below.

[Equation 4]

$$f(t) = \sum_{n=1}^{N} f_n(t)$$

[Equation 5]

[Equation 6]

[Equation 7]

Here, a and b correspond to the real and imaginary parts of the complex spectrum, respectively. F(ω_n, t) and h(t) correspond to the envelope function.
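The separation of a complex spectrum into magnitude and phase can be sketched in Python as follows. This is a simplified illustration under stated assumptions: a plain discrete Fourier transform of one short frame stands in for the STFT of a band-pass filter output, and the function names are hypothetical. For a coefficient a + jb, the magnitude is sqrt(a² + b²) and the phase is atan2(b, a).

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame (stand-in for one STFT column)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n)
    ]


def split_magnitude_phase(spectrum):
    """Split each complex coefficient a + jb into its magnitude and phase."""
    magnitude = [math.hypot(c.real, c.imag) for c in spectrum]
    phase = [math.atan2(c.imag, c.real) for c in spectrum]
    return magnitude, phase


def recombine(magnitude, phase):
    """Rebuild the complex spectrum from magnitude and phase."""
    return [cmath.rect(m, p) for m, p in zip(magnitude, phase)]


# One cycle of a cosine over 8 samples: the energy sits in bins 1 and 7.
frame = [math.cos(2.0 * math.pi * i / 8) for i in range(8)]
spectrum = dft(frame)
magnitude, phase = split_magnitude_phase(spectrum)
rebuilt = recombine(magnitude, phase)
```

Because the two components fully determine the spectrum, they can be modified independently during processing and recombined at synthesis.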

In the processing stage, time stretching and pitch shifting may be performed. The goal of time stretching is to change the time profile of the signal while keeping the phase profile the same; likewise, the goal of pitch shifting is to change the pitch profile while keeping the time length the same. For example, each signal may be augmented by stretching it to twice its time length through time stretching, and by raising each pitch by 1.5 steps through pitch shifting.

Then, in the final synthesis stage, the signal may be reconstructed through Equation 8 below, and Equation 4 above may be used again to sum the outputs of the n channels.

[Equation 8]

FIG. 8 is a diagram illustrating the structure of the 1D CNN model according to the present invention, FIG. 9 is a diagram illustrating the structure of the 2D CNN model according to the present invention, and FIG. 10 is a diagram illustrating an environmental sound classification system using the CNN models according to the present invention.

Referring to FIGS. 8 to 10, the environmental sound classification device 130 may build, through the model building unit 350, a learning model for environmental sound classification (ESC) by training on the data transformed by preprocessing. In particular, the model building unit 350 may independently build learning models divided into a 1D CNN model and a 2D CNN model according to the type of input data. That is, the 1D CNN model may receive the raw signal as input, and the 2D CNN model may receive the gammatone spectrogram as input.

The independently built 1D and 2D CNN models may be designed to have similar network structures. FIGS. 8 and 9 illustrate the architectures of the 1D and 2D CNN models built according to the present invention. For the 1D CNN model, the convolution step may be performed along the time axis; for the 2D CNN model, the convolution step may be performed along the time and frequency axes.

In addition, the model building unit 350 may configure each of the 1D CNN model and the 2D CNN model to include five convolution blocks (shown as ConvBlock-n). Each convolution block may include convolution, activation, and batch normalization layers.

In addition, the model building unit 350 may apply a common rule for the convolution layer structure to each of the 1D CNN model and the 2D CNN model. The common rule may include '[receptive field / stride, number of filters] × number of repetitions'. The receptive field may be applied identically in the 1D and 2D CNN models. In the first layer, a large receptive field can help in learning discriminative features.

In addition, the model building unit 350 may apply the same stride to every convolution block of the 1D CNN model, whereas for the 2D CNN model it may apply strides of different sizes alternately across the convolution blocks. For example, in the 2D CNN model, strides of 3 and 1 may be applied alternately from block to block.

In addition, the model building unit 350 may provide non-linearity by using an activation function in the next layer within each convolution block. Instead of the 'ReLU' activation function, the model building unit 350 may use the 'Leaky ReLU' activation function expressed by Equation 9 below.

[Equation 9]

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$$

Here, α is a specific constant.

One strong reason to use this activation function is that the derivative of ReLU is always 0 for negative inputs. With Leaky ReLU, by contrast, negative inputs are multiplied by the specific constant α. This makes it possible to overcome the dying ReLU problem, particularly for raw audio signals. In one embodiment, the model building unit 350 may apply 0.03 as the value of the specific constant α.
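A minimal Python sketch of the Leaky ReLU activation and its derivative, using α = 0.03 as in the embodiment above, is given below. The function names are hypothetical, and the derivative at exactly x = 0 is taken to be α by convention in this sketch.

```python
def leaky_relu(x, alpha=0.03):
    """Pass positive inputs through unchanged; scale the rest by alpha."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.03):
    """Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha


def relu_grad(x):
    """Derivative of plain ReLU, shown for comparison."""
    return 1.0 if x > 0 else 0.0


negative_output = leaky_relu(-2.0)
negative_grad = leaky_relu_grad(-2.0)
```

Because the gradient for negative inputs is α rather than 0, a unit that receives mostly negative inputs, as can happen with zero-centered raw audio, still receives weight updates.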

In addition, the model building unit 350 may use a batch normalization layer after each convolution block to prevent overfitting and to achieve quick convergence. Accordingly, each convolution block can be expressed mathematically as Equation 10 below.

[Equation 10]

Here, n(.) and a(.) correspond to the batch normalization layer and the activation layer, respectively, and the remaining operator symbol denotes the convolution operation. Before proceeding to the dense layers, global features may be learned through a GlobalAveragePooling layer, after which the output of the previous layer may be flattened in the dense block.

In addition, the model building unit 350 may configure each of the 1D CNN model and the 2D CNN model to include four dense blocks with different numbers of neurons. Each dense block may include flatten and activation layers and may use 'Leaky ReLU' as the activation function. For the last layer, however, a softmax activation function may be used to provide prediction probabilities for the output classes.

Accordingly, the output of the convolution layers can be expressed as Equation 11 below.

[Equation 11]

Here, P(.) is the probability output of the last layer, O(.) is the output of each layer, and the parameter symbol represents the weights (W) and biases (b).

The 1D and 2D CNN models for the environmental sound classification (ESC) system 100 according to the present invention can be depicted as shown in FIG. 10. That is, the environmental sound classification device 130 may receive a raw signal of a sound or a gammatone spectrogram (GTS) as input data. The raw signal may be used as the input of the 1D CNN model, and the GTS may be used as the input of the 2D CNN model. Whichever type of input it receives, the environmental sound classification device 130 may convert it into the other type through a preprocessing operation, and may thereby process environmental sound classification by selectively using a learning model as needed.

Each CNN model may be implemented with a similar architecture, and the input data may be transformed into output data by passing through the convolution blocks, the pooling block, and the dense blocks in turn. In FIG. 10, each CNN model may be implemented to output probability information for a total of 10 sound classes, and the environmental sound classification device 130 may use this result to classify the input data into one of the 10 sound classes.
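The final step, turning the last layer's outputs into class probabilities and picking one of the 10 classes, can be sketched in Python as follows. The function names, the example scores, and the placeholder class names `class_0` to `class_9` are hypothetical; the description does not state that the most probable class is chosen, so the arg-max rule is an assumption.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, class_names):
    """Return the most probable class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Hypothetical scores from the last dense layer for 10 sound classes.
class_names = [f"class_{i}" for i in range(10)]
logits = [0.2, -1.0, 3.1, 0.0, 0.5, -0.3, 1.2, 0.7, -2.0, 0.9]
label, probability = classify(logits, class_names)
```

Both the 1D and the 2D CNN model could share this step, since each produces one score per sound class.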

The environmental sound classification device 130 according to the present invention may build a 1D CNN model for environmental sound classification that requires no preprocessing. In addition, the environmental sound classification device 130 may build a 2D CNN model with an architecture similar to that of the 1D CNN model and use it together with the 1D CNN model for environmental sound classification.

A comparison of the 1D CNN model and the 2D CNN model according to the present invention on different input representations of the same data set verified, in terms of accuracy, that the 1D CNN model learns discriminative features from the raw signal input very well. Accordingly, the environmental sound classification device 130 may provide a high level of environmental sound classification performance by selectively applying the 1D CNN model and the 2D CNN model under constraints such as accuracy and hardware load.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the present invention as set forth in the claims below.

100: Environmental sound classification system
110: User terminal 130: Environmental sound classification device
150: Database
210: Processor 230: Memory
250: User input/output unit 270: Network input/output unit
310: Data collection unit 320: Data preprocessing unit
330: Data augmentation unit 340: Feature scaling unit
350: Model building unit 360: Environmental sound classification unit

Claims

A deep learning-based environmental sound classification method comprising:
a data collection step of collecting raw signals of sounds to generate a data set;
a data preprocessing step of generating a plurality of sound frames sampled from the raw signal and each having a predetermined time length;
a data augmentation step of augmenting data by separating the signal of each of the plurality of sound frames into phase and magnitude components;
a feature scaling step of converting the augmented data into values within a predetermined range;
a model building step of building a learning model for environmental sound classification (ESC) by training on the converted data; and
an environmental sound classification step of performing a classification operation on a given environmental sound based on the learning model,
wherein the model building step builds, as the learning model, a 1D CNN model that receives the raw signal as input and a 2D CNN model that receives a gammatone spectrogram corresponding to the raw signal as input, and applies a common rule for the convolution layer structure to each of the 1D CNN model and the 2D CNN model, the common rule including '[receptive field / stride, number of filters] × number of repetitions', the same stride being applied to each convolution block of the 1D CNN model and strides of different sizes being applied alternately to the convolution blocks of the 2D CNN model, and
wherein the environmental sound classification step includes, when the 1D CNN model and the 2D CNN model have each been built as the learning model, selectively applying one of the 1D CNN model and the 2D CNN model according to the environment in which the classification operation is performed, based on a load state calculated from the number and size of the given environmental sounds, the network speed, and the performance of the computation module.

The deep learning-based environmental sound classification method of claim 1, wherein the data collection step comprises determining ESC-10 or US-8K as the data set.

The deep learning-based environmental sound classification method of claim 1, wherein the data preprocessing step comprises:
generating sampling data by sampling the raw signal at a predetermined frequency; and
successively generating frames, each having a time length of 1 second (s), from the sampling data while maintaining a predetermined hop length.
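
The framing in claim 3 can be sketched as follows. The sampling rate and hop length are only "predetermined" in the claim, so the values in the usage note and the helper name `frame_signal` are hypothetical; incomplete trailing frames are dropped here, which is also an assumption.

```python
def frame_signal(samples, sample_rate, hop_length):
    """Split samples into successive 1-second frames, advancing by hop_length samples."""
    if hop_length <= 0:
        raise ValueError("hop_length must be positive")
    frame_length = sample_rate  # one second of audio at the given sampling rate
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped (assumption).
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += hop_length
    return frames
```

For a toy signal of 20 samples at 8 samples per second with a hop of 4 samples, `frame_signal(list(range(20)), 8, 4)` yields overlapping frames that start at samples 0, 4, 8, and 12.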

The deep learning-based environmental sound classification method of claim 3, wherein the data preprocessing step comprises generating a gammatone spectrogram (GTS) corresponding to the raw signal based on the 1-second frames.

The deep learning-based environmental sound classification method of claim 1, wherein the data augmentation step comprises:
an analysis step of expressing an input signal as a sum of output signals of a plurality of band-pass filters and performing a Short Time Fourier Transform (STFT) on each of the output signals to separate it into the phase and magnitude components;
a processing step of performing time stretching and pitch shifting based on the phase and magnitude components of each of the output signals; and
a synthesis step of synthesizing the signals generated by the processing step for the output signals to generate at least one augmented signal corresponding to the input signal.
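
The separation into magnitude and phase components underlying claim 5 can be illustrated on a single frame. This is a deliberately reduced sketch, not the claimed pipeline: it uses a naive discrete Fourier transform on one frame and omits the band-pass filter bank, windowing, frame overlap, and the time-stretching and pitch-shifting processing. The function names are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame (O(n^2), for illustration only)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def split_polar(spectrum):
    """Separate a complex spectrum into magnitude and phase components."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def inverse_dft(magnitudes, phases):
    """Recombine magnitude and phase, then transform back to the time domain."""
    n = len(magnitudes)
    spectrum = [cmath.rect(m, p) for m, p in zip(magnitudes, phases)]
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]
```

With no processing applied between `split_polar` and `inverse_dft`, the round trip reproduces the input frame; an augmentation step would modify the magnitude and phase lists before resynthesis.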

The deep learning-based environmental sound classification method of claim 1, wherein the feature scaling step comprises applying a normalization technique including z-score normalization, min-max scaling, and mean normalization to redistribute the features of the augmented data so that they follow a predetermined mean and standard deviation.
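
The three scaling techniques named in claim 6 have standard textbook definitions, sketched below for a flat list of feature values. The function names are hypothetical, the population standard deviation is an assumed choice for the z-score, and the input is assumed to be non-constant so that no denominator is zero.

```python
import statistics


def z_score(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes values are not all identical
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def mean_normalize(values):
    """Center on the mean and divide by the value range."""
    mean = statistics.fmean(values)
    lo, hi = min(values), max(values)
    return [(v - mean) / (hi - lo) for v in values]
```

Only `z_score` fixes both the mean and the standard deviation of the result; `min_max` and `mean_normalize` instead bound the range of the output.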

delete

The deep learning-based environmental sound classification method of claim 1, wherein the model building step comprises performing a convolution step of the 1D CNN model along the time axis and performing a convolution step of the 2D CNN model along the time and frequency axes.
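
The difference between the two convolution steps in claim 8 can be sketched with plain lists: a 1D kernel slides along the time axis of a waveform, while a 2D kernel slides along both the frequency and time axes of a spectrogram. The sketch uses unpadded cross-correlation, as is conventional in CNN layers; the function names and the row-as-frequency layout are assumptions.

```python
def conv1d(signal, kernel, stride=1):
    """Slide a kernel along the time axis of a 1D signal (no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(signal) - k + 1, stride)]


def conv2d(image, kernel, stride=(1, 1)):
    """Slide a kernel along both axes of a 2D array (rows: frequency, columns: time)."""
    kh, kw = len(kernel), len(kernel[0])
    sh, sw = stride
    out = []
    for r in range(0, len(image) - kh + 1, sh):
        row = []
        for c in range(0, len(image[0]) - kw + 1, sw):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out
```

A larger stride shortens the output along the corresponding axis, which is how the stride in the claimed layer rule controls downsampling.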

The deep learning-based environmental sound classification method of claim 1, wherein the model building step comprises constructing each of the 1D CNN model and the 2D CNN model to include five convolution blocks, wherein each convolution block includes convolution, activation, and batch normalization layers.
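
The five-block structure of claim 9, combined with the stride rule of claim 1, can be written out as a list of block specifications. This is a sketch of the bookkeeping only: the receptive field, stride, and filter values in the usage note are hypothetical, since the claims do not state them, and `expand_rule` is an illustrative helper name.

```python
def expand_rule(receptive_field, strides, filters, repetitions=5):
    """Expand '[receptive field / stride, number of filters] x repetition number'
    into one specification per convolution block.

    `strides` is cycled: a single entry gives the same stride in every block,
    two entries give alternating strides.
    """
    blocks = []
    for i in range(repetitions):
        blocks.append({
            "layers": ["convolution", "activation", "batch_normalization"],
            "receptive_field": receptive_field,
            "stride": strides[i % len(strides)],
            "filters": filters,
        })
    return blocks
```

For example, `expand_rule(3, [2], 32)` describes a 1D model with the same stride in all five blocks, while `expand_rule((3, 3), [(1, 1), (2, 2)], 32)` describes a 2D model whose stride alternates between two sizes.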

delete

The deep learning-based environmental sound classification method of claim 9, wherein the model building step comprises constructing each of the 1D CNN model and the 2D CNN model to include four dense blocks, wherein each dense block includes flatten and activation layers.

delete

A deep learning-based environmental sound classification device comprising:
a data collection unit that collects raw signals related to sound and generates a data set;
a data preprocessing unit that generates a plurality of sound frames, each sampled from the raw signal and having a predetermined time length;
a data augmentation unit that augments data by separating the corresponding signal into phase and magnitude components for the plurality of sound frames;
a feature scaling unit that converts the augmented data into values within a predetermined range;
a model building unit that builds a learning model for environmental sound classification (ESC) by training on the converted data; and
an environmental sound classification unit that performs a classification operation for a given environmental sound based on the learning model,
wherein the model building unit constructs, as the learning model, each of a 1D CNN model that receives the raw signal as an input and a 2D CNN model that receives a gammatone spectrogram corresponding to the raw signal as an input, a common rule regarding the convolution layer structure is applied to each of the 1D CNN model and the 2D CNN model, the common rule includes '[receptive field / stride, number of filters] × repetition number', the same stride is applied to each convolution block in the case of the 1D CNN model, and strides of different sizes are applied alternately to the convolution blocks in the case of the 2D CNN model, and
wherein, when the 1D CNN model and the 2D CNN model are each constructed as the learning model, the environmental sound classification unit selectively applies one of the 1D CNN model and the 2D CNN model based on a load state calculated from the number of environmental sounds, capacity, network speed, and performance of the computation module given according to the execution environment of the classification operation.