
KR102276964B1 - Apparatus and Method for Classifying Animal Species Noise Robust

Info

Publication number: KR102276964B1
Application number: KR1020190126635A
Authority: KR
Inventors: 고한석; 이영로; 김동현; 박충호; 김정민; 고경득
Original assignee: 고려대학교 산학협력단
Priority date: 2019-10-14
Filing date: 2019-10-14
Publication date: 2021-07-14
Also published as: WO2021075709A1; KR20210043833A

Abstract

A noise-robust apparatus and method for identifying animal species detect animal cry sections based on the cries of animal species, estimate noise from sections in which no animal cry is present, remove the noise from the received sound, and then identify the animal species by learning acoustic features that are enhanced to withstand various types of noise.

Description

Apparatus and Method for Classifying Animal Species Noise Robust

The present invention relates to an apparatus for identifying animal species, and more particularly to a noise-robust apparatus and method for identifying animal species that detect animal cry sections based on the cries of animal species even in the presence of various noises, and identify the species by learning enhanced acoustic features.

Research on the automatic classification of environmental sound sources has progressed steadily for many years. Automatic sound-source classification is becoming increasingly important because it can be applied, directly or indirectly, to fields such as speech recognition, pattern recognition, situation detection, and situation awareness.

Public-works projects such as railway, electricity, gas, and water projects are large in scale and have a significant impact on the natural environment. An environmental impact assessment must therefore be carried out to identify and analyze in advance the factors that could adversely affect the local environment, and measures must be taken to keep the damage to a minimum.

Surveying species as part of an environmental impact assessment is essential for conserving the diversity of the organisms living in the area and for maintaining the ecosystem.

Species can be identified by actually collecting organisms or by listening to their cries, but it is impossible for a person to directly identify every species within a given area.

Unlike an indoor environment, the outdoor environment is noisy, so it is difficult to capture only the cries of the organisms.

Korean Patent Publication No. 10-2014-0122881

To solve these problems, an object of the present invention is to provide a noise-robust apparatus and method for identifying animal species that detect animal cry sections based on the cries of animal species, estimate noise from sections in which no animal cry is present, remove the noise from the received sound, and then identify the animal species by learning acoustic features that are enhanced to withstand various types of noise.

Another object of the present invention is to provide a noise-robust, artificial-intelligence-based algorithm and system that deliver good classification performance even for species-identification acoustic signals acquired in environments with many kinds of noise.

To achieve the above object, a noise-robust apparatus for identifying animal species according to a feature of the present invention comprises:

an animal cry section detection unit that receives a one-dimensional acoustic signal of an animal cry, converts the one-dimensional acoustic signal into a log-mel spectrogram, which is a two-dimensional acoustic feature having a frequency axis and a time axis, and detects the section in which the animal cry is present by applying sound section detection to the converted two-dimensional acoustic feature;

a feature enhancement unit that receives the detected section in which the animal cry is present as input data, estimates a noise signal from the received input data using a neural-network-based feature extraction method, and enhances the features of the acoustic signal by subtracting the estimated noise signal from the input data to remove unnecessary signal components; and

an animal species identification unit that applies a classification algorithm consisting of convolutional layers and fully connected layers (FCL) to the enhanced acoustic signal features, produces as many results in the output layer as there are labels, and derives the label with the highest score among those results as the final result.

The animal cry section detection unit further comprises an acoustic feature extraction unit that computes specific sections of the acoustic signal sampled at a specific frequency by applying a short-time Fourier transform (STFT) to the one-dimensional acoustic signal, passes the computed sections through a Mel filter bank, performs a logarithm operation to convert them into the log-mel spectrogram that is the two-dimensional acoustic feature, and selects neighboring acoustic feature vectors from it.

The animal cry section detection unit further comprises a convolution operation unit that receives the acoustic feature vectors from the acoustic feature extraction unit and extracts features by performing a convolution operation along the frequency axis with a convolutional layer of 32 filters of size 1×1, performing a dilated convolution operation with a convolutional layer of 64 filters of size 5×1, and performing a further convolution operation with a convolutional layer of 32 filters of size 1×1 for the residual connection.

The animal cry section detection unit further comprises an acoustic feature improvement unit that applies a sigmoid function, whose values lie between 0 and 1, to the convolution result received from the convolution operation unit, thereby extracting an acoustic feature in which the animal cry frequency band is improved.

A noise-robust method for identifying animal species according to a feature of the present invention comprises:

receiving a one-dimensional acoustic signal of an animal cry, converting the one-dimensional acoustic signal into a log-mel spectrogram, which is a two-dimensional acoustic feature having a frequency axis and a time axis, and detecting the section in which the animal cry is present by applying sound section detection to the converted two-dimensional acoustic feature;

receiving the detected section in which the animal cry is present as input data, estimating a noise signal from the received input data using a neural-network-based feature extraction method, and enhancing the features of the acoustic signal by subtracting the estimated noise signal from the input data to remove unnecessary signal components; and

applying a classification algorithm consisting of convolutional layers and fully connected layers (FCL) to the enhanced acoustic signal features, producing as many results in the output layer as there are labels, and deriving the label with the highest score among those results as the final result.

With the above configuration, applying the noise-robust species identification algorithm and system makes it possible to grasp the ecosystem environment accurately through species identification, even from signals obtained in a noisy environment.

The present invention can be applied not only to species identification but also to algorithms and systems that use speech data, yielding improved performance in any noise environment.

Based on accurate ecosystem information derived from the collected data, the present invention makes it possible to grasp the unique characteristics of the organisms in the area and to conserve them through environmental improvement and maintenance.

FIG. 1 is a block diagram showing the configuration of a noise-robust animal species identification apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram schematically showing the internal configuration of the animal cry section detection unit according to an embodiment of the present invention.
FIG. 3 is a diagram showing the feature enhancement process performed in the feature enhancement unit according to an embodiment of the present invention.
FIG. 4 is a diagram showing the noise component detection process according to an embodiment of the present invention.
FIG. 5 is a diagram showing the domain adaptation process according to an embodiment of the present invention.
FIG. 6 is a diagram showing the CNN structure for recognizing enhanced acoustic signal features according to an embodiment of the present invention.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the drawings.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a block diagram showing the configuration of a noise-robust animal species identification apparatus according to an embodiment of the present invention, FIG. 2 is a block diagram schematically showing the internal configuration of the animal cry section detection unit, FIG. 3 is a diagram showing the feature enhancement process performed in the feature enhancement unit, FIG. 4 is a diagram showing the noise component detection process, FIG. 5 is a diagram showing the domain adaptation process, and FIG. 6 is a diagram showing the CNN structure for recognizing enhanced acoustic signal features.

The noise-robust animal species identification apparatus 100 according to an embodiment of the present invention includes a preprocessing unit 110 that detects animal cry sections, and a classifier 120 that performs noise component detection and domain adaptation.

The animal species identification apparatus 100 detects animal cry sections in the received sound based on the cries of animal species, estimates noise from sections in which no animal cry is present, removes the noise from the received sound, and then identifies the animal species by learning acoustic features that are enhanced to withstand various types of noise.

The preprocessing unit 110 performs the tasks carried out before sound classification: acoustic feature extraction, sound section detection, and noise removal.

The preprocessing unit 110 performs animal cry section detection and includes an acoustic signal input unit 111, an animal cry section detection unit 112, and a sound quality improvement unit 113.

The animal cry section detection unit 112 applies the NAVAD (Neural Attentive Voice Activity Detection) algorithm to detect the sections of the input acoustic signal in which an animal cry is present.

The animal cry section detection unit 112 includes an acoustic feature extraction unit 112a, a convolution operation unit 112b, an acoustic feature improvement unit 112c, an attention module unit 112d, and a final probability calculation unit 112e; together these modules can be regarded as the structure of the NAVAD algorithm.

The acoustic signal input unit 111 receives the acoustic signal of the animal cry to be detected.

The acoustic feature extraction unit 112a converts a one-dimensional acoustic signal into a log-mel spectrogram, which is a two-dimensional acoustic feature having a frequency axis and a time axis.

The acoustic feature extraction unit 112a receives the one-dimensional acoustic signal from the acoustic signal input unit 111, applies a short-time Fourier transform (STFT) to it, passes the result through a Mel filter bank unit, and then performs a logarithm operation to obtain the log-mel spectrogram, the two-dimensional acoustic feature.

The acoustic feature extraction unit 112a includes an STFT module and a Mel filter bank unit.

The STFT module segments the acoustic signal using a window of fixed length and then analyzes the frequency content of the acoustic signal.

The Mel filter bank unit may extract feature values of the input acoustic signal from the frequencies analyzed by the STFT module, following the Mel Frequency Cepstral Coefficient (MFCC) approach.

After the conversion into the log-mel spectrogram, the acoustic feature extraction unit 112a selects seven neighboring acoustic feature vectors from it and sends them to the convolution operation unit 112b.

In other words, the acoustic feature extraction unit 112a computes specific sections of the acoustic signal sampled at a specific frequency through the STFT, passes the computed sections through the Mel filter bank to obtain Mel filter bank values for specific frequency bands, and applies a logarithm operation to those values to obtain the log-mel spectrogram, the two-dimensional acoustic feature.
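
The STFT, Mel filter bank, and logarithm steps described above can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: the sample rate, FFT size, hop length, Hann window, and triangular filter shape are all assumptions, and the function names are hypothetical. The 40 Mel bands follow the 40-dimensional log Mel filter bank values mentioned later in the description, and the seven-frame context is assumed to be centered on the frame of interest.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_mels, n_fft, sample_rate):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of the triangle
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling edge of the triangle
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(signal, sample_rate=16000, n_fft=512, hop=160, n_mels=40):
    """1-D signal -> 2-D log-mel feature of shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)             # assumed window type
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # STFT power spectrum
    mel_energy = power @ mel_filter_bank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + 1e-10)      # small offset avoids log(0)

def neighbouring_frames(feats, center, width=7):
    """Select `width` neighbouring feature vectors around frame `center`."""
    half = width // 2
    return feats[center - half : center + half + 1]
```

For a one-second signal at the assumed 16 kHz rate, `log_mel_spectrogram` returns a 97 × 40 array, and `neighbouring_frames(feats, 10)` returns the 7 × 40 block that would be passed to the convolution operation unit.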

The convolution operation unit 112b extracts features of the input data through a convolution operation between the input data and filters, each of which is a set of weights.

The convolution operation unit 112b receives the seven acoustic feature vectors. As shown in Equation 1 below, it performs a convolution operation on each input acoustic feature along the frequency axis with a convolutional layer of 32 filters of size 1×1, and then applies the nonlinear activation function ReLU (Rectified Linear Unit).

In Equation 1, the input is the i-th acoustic feature vector, the filter is the j-th filter of size 1×1, and the output is the result of the convolution of the i-th input with the j-th filter.

Then, the convolution operation unit 112b performs a dilated convolution operation with a convolutional layer of 64 filters of size 5×1 and, for the residual connection with the previously extracted features, performs another convolution operation with a convolutional layer of 32 filters of size 1×1. The nonlinear activation function ReLU is applied after each convolution operation.

The corresponding convolution operation is given by Equation 2 below.

In Equation 2, the first filter term is the k-th filter of size 5×1 corresponding to the j-th input filter, and the second is the filter of size 5×1 corresponding to the k-th input filter.

To match the dimensions of the input and output used in the network, a final convolution operation with filters of size 1×1 is performed as in Equation 3 below.

In Equation 3, the filter has size 1×1, and the output has the same size and dimension as the input.
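
The three convolution stages can be sketched in NumPy as below. This is an illustrative reading of the text, not the exact network: the dilation rate (2 here), the zero padding that keeps the number of frequency bins unchanged, and the choice to add the third stage's output to the first stage's output as the residual connection are assumptions, and all function and weight names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    """1x1 convolution: x (F, C_in), w (C_in, C_out) -> (F, C_out)."""
    return x @ w

def dilated_conv5(x, w, dilation=2):
    """5x1 dilated convolution along the frequency axis.
    x: (F, C_in), w: (5, C_in, C_out); zero padding keeps F bins."""
    n_bins = x.shape[0]
    pad = 2 * dilation
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((n_bins, w.shape[2]))
    for tap in range(5):
        out += xp[tap * dilation : tap * dilation + n_bins] @ w[tap]
    return out

def conv_block(x, w1, w2, w3, dilation=2):
    """x: one acoustic feature vector of F frequency bins."""
    h1 = relu(conv1x1(x[:, None], w1))            # 32 filters of size 1x1
    h2 = relu(dilated_conv5(h1, w2, dilation))    # 64 dilated filters of size 5x1
    h3 = relu(conv1x1(h2, w3))                    # back to 32 channels
    return h1 + h3                                # assumed residual connection
```

With `w1` of shape (1, 32), `w2` of shape (5, 32, 64), and `w3` of shape (64, 32), a 40-bin feature vector yields a 40 × 32 output. The final 1×1 stage is what makes the channel counts match so that the residual sum is defined.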

The acoustic feature improvement unit 112c applies a sigmoid function, whose values lie between 0 and 1, to the convolution result received from the convolution operation unit 112b, thereby extracting an improved acoustic feature in which the animal cry frequency band is enhanced (Equation 4). Here, the sigmoid function is an activation function that normalizes its input to an output value between 0 and 1.

The acoustic feature improvement unit 112c passes the extracted improved acoustic feature to the attention module unit 112d.

The attention module unit 112d finds various acoustic events by applying a pattern recognition technique to the acoustic signal, and consists of the LSTM (Long Short-Term Memory) of a recurrent neural network (RNN) model and an Attention Layer.

The pattern recognition technique is a prediction method based on an artificial neural network: once the result of the output layer has been predicted from the input layer, the input values can in turn be predicted from the results during training. Because the inputs and outputs of an artificial neural network are not in one-to-one correspondence, the input layer cannot be recovered exactly from the output layer. However, if the output data computed from the results by the backpropagation algorithm, taking the prediction algorithm into account, differ from the original input data, the prediction of the network can be regarded as inaccurate, so training proceeds by changing the prediction coefficients until the output data computed under the constraints become similar to the original input data.

A deep neural network is a neural network composed of multiple layers. Each layer consists of several nodes, and the actual computation takes place at the nodes; this computation is designed to mimic the processes that occur in the neurons of the human nervous system. A typical artificial neural network is divided into an input layer, a hidden layer, and an output layer. The input data are fed to the input layer, the output of the input layer becomes the input of the hidden layer, the output of the hidden layer becomes the input of the output layer, and the output of the output layer is the final output.

The attention module unit 112d uses a predictive deep neural network that receives input data at the input layer and writes the predicted values to a buffer of the output layer. The structure and form of the predictive deep neural network are not limited; representative options include the DNN (Deep Neural Network), CNN (Convolutional Neural Network), and RNN (Recurrent Neural Network), and these can be combined to build deep neural networks of various structures. The attention module unit 112d of the present invention uses an RNN as the structure of its predictive deep neural network.

The RNN takes as input the 40-dimensional log Mel filter bank values of T frames (T × 40). Its overall structure consists of three GRU (Gated Recurrent Unit) layers with 256 units each, four FNN layers, and a final output layer with a sigmoid function.

In the FNN (Feedforward Neural Network), the 40-dimensional log Mel filter bank features of five consecutive frames are concatenated into a 200-dimensional feature vector, which is used as the FNN input. The two hidden layers each consist of 1,600 units with a ReLU activation function.
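
The frame concatenation that produces the 200-dimensional FNN input can be sketched as follows. The sliding-window stride of one frame and the function name `stack_frames` are assumptions made for illustration; the text only states that five consecutive 40-dimensional frames are combined.

```python
import numpy as np

def stack_frames(features, context=5):
    """Concatenate `context` consecutive frames into one vector.
    features: (T, 40) -> (T - context + 1, context * 40)."""
    n_frames = features.shape[0]
    return np.stack([features[t : t + context].reshape(-1)
                     for t in range(n_frames - context + 1)])
```

For example, a 10 × 40 feature matrix becomes a 6 × 200 matrix, in which each row holds frames t through t + 4 laid end to end.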

The output layer has as many units as there are acoustic event classes to be classified, and each unit uses a sigmoid activation function. The output of the sigmoid activation function is regarded as the posterior probability of each class; it is binarized by comparison with a threshold of 0.5 and then compared with the Ground Truth Table to compute the accuracy of the FNN.
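
The binarization and accuracy computation can be illustrated with a short sketch. Treating a posterior exactly equal to 0.5 as positive, and measuring accuracy as the fraction of matching entries, are assumptions; the function names are hypothetical.

```python
import numpy as np

def binarize(posteriors, threshold=0.5):
    """Turn per-class sigmoid outputs into 0/1 decisions."""
    return (posteriors >= threshold).astype(int)

def accuracy(posteriors, ground_truth, threshold=0.5):
    """Fraction of entries whose binarized value matches the ground truth."""
    return float(np.mean(binarize(posteriors, threshold) == ground_truth))

posteriors = np.array([[0.9, 0.2], [0.4, 0.7]])
ground_truth = np.array([[1, 0], [1, 1]])
print(accuracy(posteriors, ground_truth))  # 0.75
```

In this toy case three of the four binarized decisions match the ground truth table, giving an accuracy of 0.75.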

The RNN model is a neural network structure that processes sequential data such as speech and sentences. The state, which can be thought of as the memory of the RNN, is updated each time input data arrive, and holds information summarizing the input data from the beginning up to the current time t.

As a result, once all of the input has been processed, the state of the RNN summarizes the entire input data. In actual model training, the output and state at time t are computed from the input data at time t and the state at time t-1.
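
The recurrence described above, in which the state at time t is computed from the input at time t and the state at time t-1, can be sketched with a plain tanh RNN. The tanh nonlinearity, the zero initial state, and the parameter names `W_x`, `W_h`, and `b` are assumptions for illustration; the patent's network uses gated units rather than this simple cell.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """New state from the current input and the previous state."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

def run_rnn(inputs, W_x, W_h, b):
    """inputs: (T, D). Returns the state at every time step, shape (T, H)."""
    h = np.zeros(W_h.shape[0])      # assumed zero initial state
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return np.stack(states)
```

The last row of the returned array is the state after all inputs have been processed, that is, the summary of the whole sequence.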

The RNN minimizes the cost function using backpropagation-based gradient descent and thereby finds the optimal parameter values.

LSTM is a type of RNN that predicts results using a forget gate instead of the weights of the RNN. When time-series input data are processed sequentially with a plain RNN, the contribution of older data shrinks according to the weights and, past a certain number of steps, becomes 0, so that it is no longer reflected regardless of the weights.

Because the LSTM uses addition instead of multiplication, it has the advantage that the recurrent input value does not become 0.
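
The additive update can be made concrete with one step of a textbook LSTM cell. This sketch follows the standard formulation rather than anything specified in the patent; the gate ordering (input, forget, output, candidate) and the parameter names `W`, `U`, and `b` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (D, 4H), U: (H, 4H), b: (4H,)."""
    n_hidden = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[:n_hidden])                    # input gate
    f = sigmoid(z[n_hidden:2 * n_hidden])        # forget gate
    o = sigmoid(z[2 * n_hidden:3 * n_hidden])    # output gate
    g = np.tanh(z[3 * n_hidden:])                # candidate values
    c = f * c_prev + i * g                       # additive cell-state update
    h = o * np.tanh(c)
    return h, c
```

When the forget gate is close to 1 and the input gate close to 0, the cell state `c` is carried forward almost unchanged, which is how older information can survive many steps.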

When the acoustic features extracted by the acoustic feature improvement unit 112c are input, the attention module unit 112d computes, as a probability, attention information indicating which of the seven neighboring acoustic features contain an animal cry, according to Equation 5 below.

Here, a and the two accompanying terms are the learnable parameters of the Attention Layer, and the remaining terms are the internal state of the LSTM layer and the improved i-th acoustic feature.

The attention module unit 112d computes a single vector by taking the weighted average of the improved acoustic features, using the attention information computed with Equation 5 as the weights (Equation 6).

The attention module unit 112d sends the resulting vector c to the final probability calculation unit 112e.
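
The attention weighting and the weighted average can be sketched as below. Because Equations 5 and 6 are not reproduced in this text, the scoring function used here (a tanh projection of each LSTM state followed by a dot product with a, then a softmax) is an assumed, common form of additive attention, and the names `W` and `b` for the other two parameters are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / e.sum()

def attention_pool(states, features, a, W, b):
    """states: (7, H) LSTM states, features: (7, D) improved features.
    Returns the attention weights (7,) and the pooled vector c (D,)."""
    scores = np.tanh(states @ W + b) @ a   # one score per neighbouring feature
    weights = softmax(scores)              # attention as a probability
    context = weights @ features           # weighted average -> vector c
    return weights, context
```

The weights sum to 1 over the seven neighboring features, so `context` is a convex combination of them. If all scores are equal, it reduces to the plain mean.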

Neural network structures that process sequential data, such as the RNN, have the drawback that information stored early on fades as the sequence gets longer; the Attention algorithm is used to compensate for this. The Attention algorithm processes the sequential input data with an RNN and outputs the state vector h at every time step. The final result is then derived by giving attention to specific time steps rather than referring to the vectors of all time steps in equal proportion.

The final probability calculation unit 112e feeds the attention information computed by the attention module unit 112d into two fully connected layers (FCL) having a single output node and a sigmoid activation function, and finally derives the probability that an animal cry is present in the section (Equation 7).
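
A minimal sketch of this final stage follows. Equation 7 is not reproduced in this text, so the layer sizes, the ReLU between the two fully connected layers, and the weight names are assumptions; only the single output node and the sigmoid come from the description.

```python
import numpy as np

def cry_probability(c, W1, b1, W2, b2):
    """c: pooled vector from the attention module.
    W1: (D, H), b1: (H,), W2: (H, 1), b2: (1,)."""
    hidden = np.maximum(c @ W1 + b1, 0.0)    # first fully connected layer (assumed ReLU)
    logit = hidden @ W2 + b2                 # second layer with a single output node
    return float(1.0 / (1.0 + np.exp(-logit[0])))   # sigmoid -> probability in (0, 1)
```

The returned value can be read directly as the probability that the section contains an animal cry.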

The operations of Equations 1 to 7 are performed for every section of the input sound signal.

The noise removal step is applied to the parts of the sound signal in which no animal cry is present, and the feature enhancement step is applied to the parts in which an animal cry is present.

The sound quality improvement unit 113 receives, from the animal cry section detection unit 112, the sound signal in which the animal cry sections have been detected, and improves the quality of the sound signal by removing the noise signal from it.

Because the noise signal changes over time, the noise variance is updated whenever no animal cry is present, and the posterior signal-to-noise ratio (posterior SNR) and the prior signal-to-noise ratio (prior SNR) are estimated.
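One way to picture the noise update is a recursive average that only runs in cry-free frames. The sketch below is an assumption-laden illustration: the description does not state the update rule, so the exponential smoothing, the `threshold` of 0.5 on the cry probability, the `smoothing` factor of 0.9, and the function name `update_noise_power` are all hypothetical.

```python
def update_noise_power(noise_power, frame_power, cry_prob,
                       threshold=0.5, smoothing=0.9):
    """Update the per-frequency noise power only in cry-free frames."""
    if cry_prob >= threshold:
        return list(noise_power)  # cry present: keep the old estimate
    return [smoothing * n + (1.0 - smoothing) * p
            for n, p in zip(noise_power, frame_power)]


noise = [1.0, 1.0]
noise = update_noise_power(noise, [2.0, 0.0], cry_prob=0.1)   # updated
frozen = update_noise_power(noise, [9.0, 9.0], cry_prob=0.9)  # unchanged
```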

A noise removal technique can be expressed as a spectral gain that depends on the signal-to-noise ratio. Accordingly, the sound quality improvement unit 113 receives the sound signal in which the animal cry sections have been detected from the animal cry section detection unit 112, and calculates the posterior SNR and the prior SNR of the received sound signal by Equations 8 and 9 below.

The posterior SNR [...] is the ratio of the noisy signal to the noise, and the prior SNR [...] is the ratio of the noise-free signal to the noise.

In a given environment the noise-free signal cannot be known, so the prior SNR must be estimated; the posterior SNR and the estimate of the prior SNR are calculated as in Equations 8 and 9 below.

Here, [...] is the posterior SNR calculated at the k-th frequency component, [...] is the power spectrum of the sound signal at the k-th frequency, [...] is the power spectrum of the noise signal at the k-th frequency, [...] is the prior SNR estimated at the k-th frequency component, and [...] is the prior SNR estimated at the previous time step. The initial prior SNR value is initialized to 0.

The sound quality improvement unit 113 calculates the spectral gain by substituting the prior SNR estimated with Equations 8 and 9 into the gain function of Equation 10 below. The spectral gain takes a value between 0 and 1 and is multiplied directly with each frequency component of the input sound signal.

Here, [...] is the spectral gain and [...] is the prior SNR estimated at the k-th frequency component.

The sound quality improvement unit 113 multiplies the gain function Gain_k directly with the k-th frequency power spectrum [...] of the noisy sound signal to calculate the k-th frequency power spectrum [...] of the denoised sound signal.

In other words, the sound quality improvement unit 113 receives the sound signal in which the animal cry sections have been detected from the animal cry section detection unit 112 and removes the noise signal by multiplying each frequency component of the received sound signal by the spectral gain.
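The gain step can be sketched as below. Equation 10 is not reproduced in this text, so the Wiener-style gain xi / (1 + xi) is an assumption, chosen because it stays between 0 and 1 for any non-negative prior SNR as the description requires. The function names `spectral_gain` and `apply_gain` are hypothetical.

```python
def spectral_gain(xi):
    """Assumed Wiener-style gain: xi / (1 + xi), which lies in [0, 1)."""
    return [x / (1.0 + x) for x in xi]


def apply_gain(gain, noisy_power):
    """Multiply each frequency component by its gain."""
    return [g * y for g, y in zip(gain, noisy_power)]


xi = [0.0, 1.0, 9.0]                      # prior SNR per frequency component
gain = spectral_gain(xi)
clean = apply_gain(gain, [2.0, 2.0, 2.0])
```

Components with a low prior SNR are attenuated strongly, while components with a high prior SNR pass almost unchanged.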

The sound quality improvement unit 113 cuts out the noise-removed sections in which an animal cry is present and transmits them to the feature enhancement unit 121.

The classifier 120 includes a feature enhancement unit 121 and an animal species identification unit 122.

The feature enhancement unit 121 receives an input spectrogram containing an animal cry from the sound quality improvement unit 113, estimates a noise signal from the received input spectrogram through a neural-network-based feature extraction method, and enhances the features of the sound signal by subtracting the estimated noise signal from the input data to remove unnecessary components. A spectrogram is a tool for visualizing sound or waves and combines the characteristics of a waveform and a spectrum. A waveform shows how amplitude changes along the time axis, and a spectrum shows how amplitude changes along the frequency axis. A spectrogram shows differences in amplitude across the time and frequency axes as differences in print density or display color.

After the subtraction, the feature enhancement unit 121 sets negative values to 0 through the nonlinear activation function ReLU and adjusts the range of the features to 0 to 1 through a normalization process. This keeps the features non-negative and prevents the loss of features caused by the subtraction.

The feature enhancement unit 121 uses RNN-LSTM and an autoencoder algorithm to train the neural network so that it is robust to domain changes such as noise and recording location.

In other words, the training of the feature enhancement network (noise component detection and domain adaptation) is carried out end to end, simultaneously with the classifier 120.

The feature enhancement unit 121 performs noise component detection (FIG. 4) on an energy basis.

The feature enhancement unit 121 learns the energy magnitudes among the feature elements of the input spectrogram, which has [...] dimensions, and removes components that have relatively low energy and are unnecessary for learning.

The feature enhancement unit 121 does this through dimension reduction and expansion, extracting a low-dimensional feature for each axis of the input spectrogram using RNN-LSTM.

The time axis has [...] dimensions and satisfies the condition N > K. The frequency axis has [...] dimensions and satisfies the condition M > K.

The feature enhancement unit 121 expresses each feature element in the range of -1 to 1 using the hyperbolic tangent function (tanh) of Equation 12, and sets negative values to 0 through the nonlinear activation function ReLU (Rectified Linear Unit) of Equation 13. Through this process, the LSTM output vector for each axis always takes a probability value in the range of 0 to 1.
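The effect of applying tanh and then ReLU can be checked with a few numbers. This is a minimal sketch using the standard definitions of both functions; the helper name `squash` is hypothetical and the input values are toy data.

```python
import math


def squash(values):
    """Apply tanh (range -1 to 1), then ReLU (negative values become 0)."""
    return [max(0.0, math.tanh(v)) for v in values]


out = squash([-3.0, -0.1, 0.0, 0.5, 10.0])
```

Every negative input ends up at 0 and every positive input lands strictly between 0 and 1, which is why the result can be used as a weight.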

As shown in FIG. 4, the feature enhancement unit 121 uses Equation 14 below to combine the two features obtained through RNN-LSTM into an enhanced spectrogram in which noise components and non-noise components are distinguished.

V₁, V₂, and X denote the frequency-axis LSTM output vector, the time-axis LSTM output vector, and the input data, respectively. The two operator symbols [...] denote the vector outer product and element-wise multiplication, respectively. [...] is the function obtained by applying Equations 12 and 13 in succession.

The [...]-dimensional matrix obtained through the outer product of the two LSTM output vectors assigns weights in the range of 0 to 1 to the elements of the input spectrogram, and thus serves to detect components that are unnecessary for learning.
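The masking idea can be sketched with plain lists. Equation 14 is not reproduced in this text, so the form used here, the outer product of the two vectors multiplied element-wise with the input, is an assumption based on the operators named above. The function names `outer` and `weight_spectrogram` are hypothetical, and the vectors are assumed to be already squashed into the range 0 to 1.

```python
def outer(v1, v2):
    """Outer product: an M x N matrix from vectors of length M and N."""
    return [[a * b for b in v2] for a in v1]


def weight_spectrogram(x, v1, v2):
    """Element-wise product of the input with the outer-product mask."""
    mask = outer(v1, v2)
    return [[m * e for m, e in zip(mask_row, x_row)]
            for mask_row, x_row in zip(mask, x)]


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]      # M = 2 frequency bins, N = 3 time frames
v1 = [1.0, 0.5]            # frequency-axis weights (toy values)
v2 = [0.0, 1.0, 0.5]       # time-axis weights (toy values)
weighted = weight_spectrogram(x, v1, v2)
```

A weight of 0 on the first time frame removes that frame entirely, while the other elements are scaled by the product of their frequency and time weights.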

After noise component detection, the feature enhancement unit 121 performs domain adaptation by applying an autoencoder algorithm with a CNN (Convolutional Neural Network)-based bottleneck structure, as shown in FIG. 5, to the spectrogram calculated with Equation 14.

The CNN-based autoencoder algorithm serves to transform the input data into a domain suitable for the subtraction operation. An autoencoder is an algorithm that compresses high-dimensional input data into a low-dimensional representation and then restores it to the original data. The encoding stage compresses the input data, and in doing so extracts the important features of the input data through a neural network.

The decoding stage receives the features compressed in the encoding stage and restores them to the original input data; noise is removed in this process so that only the desired information of the input data remains.

A CNN repeatedly performs convolution and pooling operations on the input data, reducing the size of the data, and thus the number of parameters to be learned, while preserving the features of the input data. The layers performing these operations are called the convolutional layer and the pooling layer, respectively.

The convolutional layer performs a convolution operation on a two-dimensional input with a two-dimensional filter of fixed size. The pooling layer takes the output of the convolutional layer as input and reduces its dimensions by selecting the maximum value within each neighborhood.
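The two operations can be illustrated on a small grid. This is a minimal sketch with toy data: the 2x2 kernel, the 2x2 pooling window, and the function names `conv2d` and `max_pool` are assumptions for illustration, since the description does not state the filter or window sizes used here.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution as used in CNNs (the kernel is not flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(cols)]
            for r in range(rows)]


def max_pool(image, size=2):
    """Non-overlapping max pooling; incomplete edge windows are dropped."""
    rows = len(image) // size
    cols = len(image[0]) // size
    return [[max(image[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]


image = [[1, 2, 3, 0, 1],
         [0, 1, 2, 3, 0],
         [1, 0, 1, 2, 3],
         [2, 1, 0, 1, 2],
         [3, 2, 1, 0, 1]]
kernel = [[1, 0],
          [0, 1]]
features = conv2d(image, kernel)   # 4 x 4 feature map
pooled = max_pool(features)        # 2 x 2 after pooling
```

The 5x5 input shrinks to 4x4 after the convolution and to 2x2 after pooling, which shows how the data size falls while the strongest responses are kept.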

To perform domain adaptation, the feature enhancement unit 121 is configured with four encoder layers and two decoder layers.

In the encoder layers, features are extracted through convolutional layers of a fixed size while the output size and dimensions are reduced; in the decoder layers, the output dimensions are increased through convolutional layers of a fixed size so that the output size matches that of the input spectrogram.

After each encoder layer's convolution operation, the feature enhancement unit 121 applies a max pooling layer, batch normalization as the normalization operation, and the ReLU activation function. The max pooling layer resizes the output after the convolution operation has been performed in the convolutional layer.

Batch normalization is placed before the activation function and can be trained through backpropagation.

After each decoder layer's convolution operation, the feature enhancement unit 121 applies batch normalization as the normalization operation and the ReLU activation function.

The feature enhancement unit 121 removes unnecessary components by subtracting the noise feature obtained from domain adaptation from the input spectrogram, then applies the nonlinear activation function ReLU (Rectified Linear Unit), and enhances the features of the sound signal by scaling them through min-max normalization to the range of 0 to 1.
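The final enhancement step (subtract, ReLU, min-max normalization) can be sketched as follows. The noise estimate is supplied directly as toy data, standing in for the autoencoder output; the function name `enhance` and the handling of a constant input are assumptions for illustration.

```python
def enhance(spectrogram, noise_estimate):
    """Subtract the noise estimate, clip negatives to 0, scale to [0, 1]."""
    diff = [[max(0.0, x - n) for x, n in zip(x_row, n_row)]
            for x_row, n_row in zip(spectrogram, noise_estimate)]
    flat = [v for row in diff for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant result: avoid dividing by zero
        return [[0.0 for _ in row] for row in diff]
    return [[(v - lo) / (hi - lo) for v in row] for row in diff]


x = [[5.0, 1.0],
     [3.0, 2.0]]
noise = [[1.0, 2.0],
         [1.0, 1.0]]
enhanced = enhance(x, noise)
```

The element where the noise estimate exceeds the input is clipped to 0 instead of becoming negative, and the largest remaining value is scaled to 1.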

As shown in FIG. 6, the animal species identification unit 122 consists of a CNN-based classification algorithm and a fully connected network (FCN), and derives the final result through a softmax function at the end. Specifically, the animal species identification unit 122 consists of five convolutional layers and two fully connected layers (FCL).

The animal species identification unit 122 performs max pooling after the convolution operations of the first, second, and last convolutional layers.

When the enhanced features of the sound signal are input from the feature enhancement unit 121, the animal species identification unit 122 performs convolution operations in the convolutional layers; after the convolution operation of the last convolutional layer, the result is passed to the FCN, whose output layer produces as many results as there are labels.

The animal species identification unit 122 converts the calculated results into scores through the softmax function of Equation 15 below, and finally classifies the label with the highest score as the final result (S(yi)).
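The scoring and selection step can be sketched with the standard softmax definition. The output-layer values and the label names are toy data, and the function names `softmax` and `classify` are hypothetical.

```python
import math


def softmax(logits):
    """Standard softmax: exponentiate and normalize so scores sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels):
    """Return the label with the highest softmax score and all scores."""
    scores = softmax(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores


labels = ["species_a", "species_b", "species_c"]  # hypothetical labels
label, scores = classify([0.2, 2.5, -1.0], labels)
```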

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

100: animal species identification device 110: preprocessing unit
111: sound signal input unit 112: animal cry section detection unit
112a: acoustic feature extraction unit 112b: convolution operation unit
112c: acoustic feature improvement unit 112d: attention module unit
112e: final probability calculation unit 113: sound quality improvement unit
120: classifier 121: feature enhancement unit
122: animal species identification unit

Claims

1. An animal species identification device comprising:
an animal cry section detection unit that receives a one-dimensional sound signal of an animal cry, converts the one-dimensional sound signal into a log-mel spectrogram, which is a two-dimensional acoustic feature having a frequency axis and a time axis, and detects sections in which the animal cry is present by applying sound section detection to the converted two-dimensional acoustic feature;
a feature enhancement unit that receives the detected sections in which the animal cry is present as input data, estimates a noise signal from the received input data through a neural-network-based feature extraction method, and enhances the features of the sound signal by subtracting the estimated noise signal from the input data to remove unnecessary signal components; and
an animal species identification unit that, for the enhanced features of the sound signal, uses a classification algorithm consisting of convolutional layers and fully connected layers (FCL) to calculate as many results as there are labels in an output layer, and derives the label with the highest score among the calculated results as the final result,
wherein the animal cry section detection unit further comprises:
an acoustic feature extraction unit that calculates a specific section of the one-dimensional sound signal, sampled at a specific frequency, through a Short-Time Fourier Transform (STFT), passes the calculated section through a mel filter bank and then applies a logarithm to convert it into the log-mel spectrogram, which is the two-dimensional acoustic feature, and selects neighboring acoustic feature vectors from it; and
a convolution operation unit that receives the acoustic feature vectors from the acoustic feature extraction unit and extracts features by performing, in the frequency-axis direction, a convolution operation with a convolutional layer of 32 filters of size 1x1, a dilated convolution operation with a convolutional layer of 64 filters of size 5x1, and, for a residual connection, a further convolution operation with a convolutional layer of 32 filters of size 1x1.

delete

4. The device of claim 1,
wherein the animal cry section detection unit further comprises an acoustic feature improvement unit that applies a sigmoid function, which takes a value between 0 and 1, to the convolution result received from the convolution operation unit and thereby extracts acoustic features in which the frequency band of the animal cry is improved.

5. The device of claim 4,
wherein the animal cry section detection unit further comprises an attention module unit that consists of a Long Short-Term Memory (LSTM) of a Recurrent Neural Network (RNN) model and an attention layer and that, when the acoustic features extracted by the acoustic feature improvement unit are input, calculates attention information according to Equation 1 below, indicating which of the extracted acoustic features contains the animal cry.
[Equation 1]

where a, [...] and [...] are learning parameters of the attention layer, [...] is the internal state of the LSTM layer, and [...] is the improved i-th acoustic feature.

6. The device of claim 5,
wherein the animal cry section detection unit further comprises a final probability calculation unit that feeds the attention information calculated by the attention module unit into two fully connected layers (FCL) having a single output node and a sigmoid activation function, and finally derives the probability that the animal cry is present in the corresponding section.

7. The device of claim 6,
further comprising a sound quality improvement unit that receives, from the final probability calculation unit, the sound signal in which the animal cry sections have been detected, calculates the posterior signal-to-noise ratio (posterior SNR) and the prior SNR of the received sound signal by Equations 2 and 3 below, calculates a spectral gain by substituting the calculated prior SNR into the gain function of Equation 4 below, and removes the noise signal by multiplying each frequency component of the received sound signal by the spectral gain.
[Equation 2]

[Equation 3]

where [...] is the posterior SNR calculated at the k-th frequency component, [...] is the power spectrum of the sound signal at the k-th frequency, [...] is the power spectrum of the noise signal at the k-th frequency, [...] is the prior SNR estimated at the k-th frequency component, and [...] is the prior SNR estimated at the previous time step.
[Equation 4]

where [...] is the spectral gain and [...] is the prior SNR estimated at the k-th frequency component.

8. The device of claim 7,
wherein the feature enhancement unit receives an input spectrogram containing the animal cry from the sound quality improvement unit, extracts a feature for each axis of the received input spectrogram using a Long Short-Term Memory (LSTM) of a Recurrent Neural Network (RNN) model, and calculates from the extracted features a spectrogram in which noise components and non-noise components are distinguished.

9. The device of claim 8,
wherein the feature enhancement unit performs domain adaptation by applying to the calculated spectrogram an autoencoder algorithm that comprises one or more Convolutional Neural Network (CNN)-based encoder layers and one or more decoder layers, compresses high-dimensional input data into a low-dimensional representation, and then restores it to the original data.

10. The device of claim 9,
wherein the feature enhancement unit removes unnecessary components by subtracting, from the received input spectrogram, the noise feature resulting from the domain adaptation, then applies the nonlinear activation function ReLU (Rectified Linear Unit), and enhances the features of the sound signal by scaling them through min-max normalization to the range of 0 to 1.

11. The device of claim 10,
wherein, when the enhanced features of the sound signal are input from the feature enhancement unit, the animal species identification unit calculates as many results as there are labels in the output layer using a CNN-based classification algorithm consisting of five convolutional layers and two fully connected layers (FCL), converts the calculated results into scores through the softmax function of Equation 5 below, and classifies the label with the highest score as the final result (S(yi)).
[Equation 5]

12. An animal species identification method in which an animal species identification device identifies an animal species, the method comprising:
receiving a one-dimensional sound signal of an animal cry, converting the one-dimensional sound signal into a log-mel spectrogram, which is a two-dimensional acoustic feature having a frequency axis and a time axis, and detecting sections in which the animal cry is present by applying sound section detection to the converted two-dimensional acoustic feature;
receiving the detected sections in which the animal cry is present as input data, estimating a noise signal from the received input data through a neural-network-based feature extraction method, and enhancing the features of the sound signal by subtracting the estimated noise signal from the input data to remove unnecessary signal components; and
for the enhanced features of the sound signal, using a classification algorithm consisting of convolutional layers and fully connected layers (FCL) to calculate as many results as there are labels in an output layer, and deriving the label with the highest score among the calculated results as the final result,
wherein the step of detecting the sections in which the animal cry is present further comprises:
calculating a specific section of the one-dimensional sound signal, sampled at a specific frequency, through a Short-Time Fourier Transform (STFT), passing the calculated section through a mel filter bank and then applying a logarithm to convert it into the log-mel spectrogram, which is the two-dimensional acoustic feature, and selecting neighboring acoustic feature vectors from it; and
receiving the acoustic feature vectors as input and extracting features by performing, in the frequency-axis direction, a convolution operation with a convolutional layer of 32 filters of size 1x1, a dilated convolution operation with a convolutional layer of 64 filters of size 5x1, and, for a residual connection, a further convolution operation with a convolutional layer of 32 filters of size 1x1.

13. The method of claim 12,
wherein the step of detecting the sections in which the animal cry is present further comprises
extracting acoustic features in which the frequency band of the animal cry is improved by applying a sigmoid function, which takes a value between 0 and 1, to the convolution result obtained in the step of extracting features by performing the convolution operations.

14. The method of claim 13,
wherein the step of detecting the sections in which the animal cry is present further comprises
calculating attention information according to Equation 1 below, using a Long Short-Term Memory (LSTM) of a Recurrent Neural Network (RNN) model and an attention layer, when the acoustic features extracted in the step of extracting the improved acoustic features are input, the attention information indicating which of the extracted acoustic features contains the animal cry.
[Equation 1]

where a, [...] and [...] are learning parameters of the attention layer, [...] is the internal state of the LSTM layer, and [...] is the improved i-th acoustic feature.

15. The method of claim 14,
further comprising feeding the calculated attention information into two fully connected layers (FCL) having a single output node and a sigmoid activation function to finally derive the probability that the animal cry is present in the corresponding section.

16. The method of claim 15,
wherein the step of enhancing the features of the sound signal comprises
receiving an input spectrogram containing the animal cry, extracting a feature for each axis of the received input spectrogram using a Long Short-Term Memory (LSTM) of a Recurrent Neural Network (RNN) model, and calculating from the extracted features a spectrogram in which noise components and non-noise components are distinguished.

17. The method of claim 16, further comprising:
performing domain adaptation by applying to the calculated spectrogram an autoencoder algorithm that comprises one or more Convolutional Neural Network (CNN)-based encoder layers and one or more decoder layers, compresses high-dimensional input data into a low-dimensional representation, and then restores it to the original data; and
removing unnecessary components by subtracting, from the received input spectrogram, the noise feature resulting from the domain adaptation, then applying the nonlinear activation function ReLU (Rectified Linear Unit), and enhancing the features of the sound signal by scaling them through min-max normalization to the range of 0 to 1.

18. The method of claim 17,
wherein the step of deriving the final result further comprises,
when the enhanced features of the sound signal are input, calculating as many results as there are labels in the output layer using a CNN-based classification algorithm consisting of five convolutional layers and two fully connected layers (FCL), converting the calculated results into scores through the softmax function of Equation 2 below, and classifying the label with the highest score as the final result (S(yi)).
[Equation 2]