KR20230144822A

KR20230144822A - Neural network learning unit and improved system and method for providing sound source maps including the same

Info

Publication number: KR20230144822A
Application number: KR1020220043976A
Authority: KR
Inventors: 장지호; 조완호; 이승철; 이수영
Original assignee: 한국표준과학연구원; 포항공과대학교 산학협력단
Priority date: 2022-04-08
Filing date: 2022-04-08
Publication date: 2023-10-17

Abstract

The present invention discloses a sound source map providing system. More specifically, the present invention relates to a deep learning-based system and method for providing a sound source map, in which acoustic signals measured with a plurality of microphones are used to display, in image form, the intensity and location of one or more sound sources on a flat or curved surface.
According to an embodiment of the present invention, in a system that provides an acoustic map using an ultrasonic microphone array, the deep learning model introduced for analysis is equipped with at least two encoders set for different frequencies, so that the low band and the high band are handled differently and the spatial aliasing phenomenon can be minimized.

Description

Neural network learning unit and improved sound source map providing system including the same {NEURAL NETWORK LEARNING UNIT AND IMPROVED SYSTEM AND METHOD FOR PROVIDING SOUND SOURCE MAPS INCLUDING THE SAME}

The present invention relates to a sound source map providing system, and in particular to a deep learning-based sound source map providing system that provides a sound source map displaying, in image form, the intensity and location of one or more sound sources on a flat or curved surface, using acoustic signals measured with a plurality of microphones.

A sound source map is an image produced by calculating the locations and intensities of a plurality of sound sources, whose locations and intensities are not known in advance, and displaying them on the flat or curved surface on which those sources are assumed to exist.

Figure 1 is a schematic diagram of a conventional beamforming method for providing a sound source map. In the conventional technique for generating a sound source map by beamforming, a plurality of microphones are arranged on the measurement plane of a measurement device to form a microphone array, and a beamforming map is generated by predicting the location and intensity of each sound source on the sound source plane from the measurement signals that the microphone array receives in the x, y, and z directions.

In particular, in this beamforming method, the locations and intensities of the sound sources are calculated from the sound pressure values acquired with the microphone array, and the map is completed by mapping them onto a plane.

However, the beamforming method described above has the problem that spatial aliasing occurs in the frequency range where the spacing between the microphones that detect the sound source exceeds half a wavelength.

Spatial aliasing refers to a phenomenon in which peaks appear on the sound source map not only at the location of the actual sound source but also at other locations, so that sound sources that do not actually exist appear to be present. This spatial aliasing is also referred to by another term, ghost source.
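As a rough illustration of where such ghost sources can appear, the following minimal sketch uses the standard far-field grating-lobe relation for a uniform line array. The line-array geometry, the far-field assumption, the sound speed of 343 m/s, and the function name `ghost_source_angles` are all assumptions made for illustration; none of them is taken from the present disclosure.

```python
import math


def ghost_source_angles(freq_hz, mic_spacing_m, source_angle_deg, sound_speed_mps=343.0):
    """Far-field ghost (grating-lobe) directions of a uniform line array, in degrees.

    A ghost peak appears where sin(theta) = sin(theta_0) + m * wavelength / spacing
    for a nonzero integer m, provided the right-hand side lies within [-1, 1].
    """
    wavelength = sound_speed_mps / freq_hz
    ratio = wavelength / mic_spacing_m
    sin_source = math.sin(math.radians(source_angle_deg))
    # |m| * ratio cannot exceed 2 for a valid angle, so this bound covers every order.
    max_order = int(2.0 / ratio) + 1
    angles = []
    for m in range(-max_order, max_order + 1):
        if m == 0:
            continue  # m = 0 is the true source direction, not a ghost
        sin_ghost = sin_source + m * ratio
        if abs(sin_ghost) <= 1.0:
            angles.append(math.degrees(math.asin(sin_ghost)))
    return sorted(angles)


# Hypothetical array with 5 cm spacing and a source straight ahead (0 degrees).
ghosts_high = ghost_source_angles(8000.0, 0.05, 0.0)  # spacing exceeds one wavelength
ghosts_low = ghost_source_angles(2000.0, 0.05, 0.0)   # spacing is below half a wavelength
```

With these assumed values the high-frequency case yields two symmetric ghost directions, while the low-frequency case yields none.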

To solve this problem, deep learning technology from the field of artificial intelligence has been applied, and various solutions have been proposed to overcome the ghost source phenomenon described above.

Meanwhile, among the existing methods of producing sound source maps, there are deconvolution methods that improve the resolution of sound source maps obtained by beamforming methods such as delay-and-sum beamforming. Even with these existing methods, however, the spatial aliasing problem is known to persist at high frequencies. That is, not only with general beamforming methods but even when the latest deep learning technology is applied, there is a limit to solving the problem of numerous ghost sources appearing because of spatial aliasing.

Korean Laid-open Patent Publication No. 10-2022-0032382 (publication date: 2022.03.15.)

The present invention has been devised to solve the problems described above. Its object is to mitigate the loss of accuracy that spatial aliasing causes in the sound source map, in a system that detects one or more sound sources using a microphone array and provides a sound source map in which their intensities and locations are mapped onto a plane.

To this end, an object of the present invention is to provide a system in which part of the deep learning model is configured differently for the frequency band below the spatial Nyquist frequency and for the frequency band above it.

To solve the problems described above, a neural network learning unit according to a preferred embodiment of the present invention may include: a first encoder that is composed of a plurality of layers and generates, through convolution operations, a feature map including one or more main features extracted from input data in a first frequency band; a second encoder that is composed of a plurality of layers and generates, through convolution operations, a feature map including one or more main features extracted from the input data in a second frequency band; a switch that selects the output of either the first or the second encoder according to the form of the input data; and a decoder that, according to the selection of the switch, receives the feature map output from the first or second encoder and restores it to generate an output image.

The first and second frequency bands may be divided into a high band and a low band, respectively, with reference to the spatial Nyquist frequency (wavelength) for the input data.

The plurality of layers may include one or more each of a convolutional layer, a pooling layer, a batch normalization layer, and a nonlinearity layer.

The first and second convolutional encoders may be configured so that the kernel sizes of their convolutional layers differ from each other.

During training, the switch may select the first encoder when spatial aliasing is not present in the input data, and may select the second encoder when spatial aliasing is present in the input data.

In addition, to solve the problems described above, an improved sound source map providing system including a neural network learning unit according to a preferred embodiment of the present invention is a system that generates a sound source map from acoustic signals received from a microphone array in which a plurality of microphones are arranged at regular intervals, and may include: an image generation unit that generates a beamforming image through a beamforming process from the acoustic signals received from the microphone array; a target map generation unit that generates a target map using the beamforming image; and a neural network learning unit that receives the target map and performs training, and that, based on the configured deep learning model, generates a sound source map imaging the locations and intensities of the sound sources contained in an analysis image according to one or more main features included in the image.

The neural network learning unit may include: a first encoder that is composed of a plurality of layers and generates, through convolution operations, a feature map including one or more main features extracted from the beamforming image in a first frequency band; a second encoder that is composed of a plurality of layers and generates, through convolution operations, a feature map including one or more main features extracted from the beamforming image in a second frequency band; a switch that selects the output of either the first or the second encoder according to the form of the beamforming image; and a decoder that, according to the selection of the switch, restores the feature map output from the first or second encoder to generate the sound source map.

The first and second frequency bands may be divided into a high band and a low band, respectively, with reference to the spatial Nyquist frequency (wavelength) for the beamforming image.

The switch may select the first encoder when spatial aliasing is not present in the beamforming image, and may select the second encoder when spatial aliasing is present in the beamforming image.

According to an embodiment of the present invention, in a system that provides an acoustic map using an ultrasonic microphone array, the deep learning model introduced for analysis is equipped with at least two encoders set for different frequencies, so that the low band and the high band are handled differently and the spatial aliasing phenomenon can be minimized.

Accordingly, the convolution operations of the deep learning model can refer to a wider area during training, so that the spacing between sensors can be set narrower when the microphone array is designed. The number of microphones required for the microphone array is thereby reduced, which has the effect of lowering the cost of building the system.
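One way to see how kernel size relates to the "wider area" referenced by the convolution is to compare receptive fields. The sketch below is only illustrative: the stage layout (one stride-1 convolution followed by one pooling layer per stage), the kernel sizes 3 and 7, and the function name `receptive_field` are hypothetical choices, not values given in the present disclosure.

```python
def receptive_field(kernel_size, n_stages, pool_size=2):
    """Receptive field (in input pixels) of stacked [convolution -> pooling] stages.

    Each stage is assumed to be one stride-1 convolution followed by one
    non-overlapping pooling layer whose stride equals its window size.
    """
    field = 1  # a single output pixel initially sees one input pixel
    jump = 1   # distance, in input pixels, between neighbouring feature-map pixels
    for _ in range(n_stages):
        field += (kernel_size - 1) * jump  # stride-1 convolution widens the view
        field += (pool_size - 1) * jump    # the pooling window widens it further
        jump *= pool_size                  # the pooling stride coarsens the sampling
    return field


small_kernel_field = receptive_field(kernel_size=3, n_stages=3)
large_kernel_field = receptive_field(kernel_size=7, n_stages=3)
```

Under these assumptions the encoder with the larger kernel sees more than twice as many input pixels per output pixel, which is the kind of difference that would let one encoder take in artifacts spread across the whole map.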

Figure 1 is a schematic diagram of a conventional beamforming method for providing a sound source map.
Figure 2 is a diagram showing the structure of an improved sound source map providing system including a neural network learning unit according to an embodiment of the present invention.
Figure 3 is a diagram schematically illustrating a method of generating a target map by an improved sound source map providing system including a neural network learning unit according to an embodiment of the present invention.
Figure 4 is a diagram showing the network structure of the neural network learning unit included in the improved sound source map providing system according to an embodiment of the present invention.
Figure 5 is a diagram showing a machine learning method of an improved sound source map providing system including a neural network learning unit according to an embodiment of the present invention.
Figure 6 is a diagram showing a sound source map providing method of an improved sound source map providing system including a neural network learning unit according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the present invention pertains is fully informed of the scope of the invention, which is defined only by the scope of the claims.

The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the wording specifically states otherwise. As used in the specification, "comprises" or "comprising" does not exclude the presence or addition of one or more elements other than those mentioned. Like reference numerals refer to like elements throughout the specification, and the expression includes each of the mentioned elements and every combination of one or more of them. Although "first", "second", and the like are used to describe various elements, these elements are of course not limited by these terms. The terms are used only to distinguish one element from another. Therefore, a first element mentioned below may of course be a second element within the technical spirit of the present invention.

In addition, unless otherwise defined, all terms used in this specification are used with the meanings commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined.

In the following description, the "neural network learning unit" and the "improved sound source providing system" of the present invention, and each component constituting them, may be executed by a known microprocessor, stored in a readable and writable recording medium, and installed in a computing device.

In the following description, the term referring to the "improved sound source map providing system including a neural network learning unit" of the present invention may be abbreviated to "sound source map providing system" or "system" for convenience of description.

Hereinafter, a neural network learning unit according to an embodiment of the present invention and an improved sound source map providing system including the same are described with reference to the drawings.

Figure 2 is a diagram showing the structure of an improved sound source map providing system including a neural network learning unit according to an embodiment of the present invention, and Figure 3 is a diagram schematically illustrating a method of generating a target map by that system.

Referring to Figure 2, the sound source map providing system 100 according to an embodiment of the present invention may include: an image generation unit 110 that generates a beamforming image through a beamforming process from the acoustic signals received from the microphone array 10; a target map generation unit 120 that generates a target map using the beamforming image; and a neural network learning unit 130 that receives the target map and performs training, and that, based on the configured deep learning model, generates a sound source map imaging the locations and intensities of the sound sources contained in an analysis image according to one or more main features included in the image.

Specifically, the image generation unit 110 may receive the acoustic signals that the microphone array 10 picks up from the sound sources and generate a beamforming image through a beamforming technique such as delay-and-sum beamforming.
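For readers unfamiliar with delay-and-sum beamforming, the following is a minimal single-frequency sketch. It is not the image generation unit of this disclosure: the one-dimensional scan line, the eight-microphone line array with 5 cm spacing, the simulated unit point source, the sound speed, and all function names are assumptions chosen to keep the example short.

```python
import cmath
import math

SOUND_SPEED = 343.0  # m/s, assumed value for air


def propagation(mic_x, point_x, point_z, freq_hz):
    """Return (phase factor, distance) for the path from a point to a microphone on the x axis."""
    dist = math.hypot(point_x - mic_x, point_z)
    wavenumber = 2.0 * math.pi * freq_hz / SOUND_SPEED
    return cmath.exp(-1j * wavenumber * dist), dist


def simulate_pressures(mic_xs, src_x, src_z, freq_hz):
    """Complex pressure at each microphone for one unit-strength point source."""
    pressures = []
    for mic_x in mic_xs:
        phase, dist = propagation(mic_x, src_x, src_z, freq_hz)
        pressures.append(phase / dist)  # spherical spreading: exp(-jkr) / r
    return pressures


def delay_and_sum(pressures, mic_xs, scan_xs, scan_z, freq_hz):
    """Beamforming power at each scan point: undo the propagation phase, then sum."""
    power = []
    for scan_x in scan_xs:
        total = 0j
        for pressure, mic_x in zip(pressures, mic_xs):
            phase, _ = propagation(mic_x, scan_x, scan_z, freq_hz)
            total += pressure * phase.conjugate()  # phase-align this channel
        power.append(abs(total / len(mic_xs)) ** 2)
    return power


# Hypothetical setup: 8 microphones spaced 5 cm apart, source at x = 0.1 m, 1 m away.
mic_xs = [0.05 * (i - 3.5) for i in range(8)]
scan_xs = [0.05 * i for i in range(-10, 11)]
pressures = simulate_pressures(mic_xs, src_x=0.1, src_z=1.0, freq_hz=2000.0)
beam_map = delay_and_sum(pressures, mic_xs, scan_xs, scan_z=1.0, freq_hz=2000.0)
peak_x = scan_xs[beam_map.index(max(beam_map))]
```

At 2000 Hz this assumed array is below its spatial Nyquist frequency, so the strongest scan point coincides with the simulated source position. A two-dimensional beamforming image would repeat the same computation over a grid of scan points.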

Here, the microphone array 10 consists of a plurality of microphones arranged side by side. The overall width over which the microphones are arranged affects acoustic signals at low-band frequencies, and the spacing between the microphones affects acoustic signals at high-band frequencies.

Accordingly, a designer of a conventional microphone array 10 has had to design the array configuration differently, for example by narrowing the spacing between the microphones of the microphone array 10, in consideration of the spatial aliasing that may occur in the acoustic signals. The improved sound source map providing system 100 of the present invention, by contrast, generates the sound source map through the encoder appropriate to the acoustic signal, and is thus characterized in that it can cover both low-band and high-band frequencies without changing the design of the microphone array.

The target map generation unit 120 generates, based on the acoustic signals, a target map to be used as training data and inputs it to the neural network learning unit 130 so that deep learning is performed. To this end, the target map generation unit 120 may be composed of a plurality of modules according to function.

For example, the target map generation unit 120 may include: a grid generation module 121 that generates a grid with a spacing within a predetermined range around the sound sources; a result value calculation module 123 that calculates a result value for each coordinate of the grid such that the result value is maximal at the location of a sound source and decreases with distance from the sound source; and a mapping module 125 that arranges the result values at the positions in a matrix corresponding to the coordinates of the grid and generates a target map in image form using the result values arranged in the matrix.

Referring also to Figure 3, the grid generation module 121 may generate a grid of regularly spaced grid points over a specific region in which one or more sound sources exist, and may mark the location of each sound source on the grid.

Here, the narrower the spacing K between the grid points, the more accurate the sound source map that can be obtained, but the longer the computation time. The wider the spacing K, the shorter the computation time needed to obtain the sound source map, but the accuracy of the sound source map is somewhat lower. Accordingly, according to an embodiment of the present invention, the preferred spacing K between the grid points of the grid provided by the grid generation module 121 may be set to an appropriate size through repeated experiments.

The result value calculation module 123 may calculate the result values from the distances R between the grid points {(1,1), (1,2), ..., (N,M)} of the grid provided by the grid generation module 121, where N and M are natural numbers, and the one or more sound sources A, B, C located on the grid.

Here, a result value is the specific value that each pixel takes when the acquired acoustic signal is rendered as an image, and may be calculated using the distance R between a sound source (A, B, C) and each coordinate of the grid.

In addition, in calculating the result value, the rate of change of the result value with distance R can be adjusted through the exponent applied to R. That is, by setting the exponent applied to the square root of R to a relatively large value, the rate of change of the result value with distance R is increased, so that differences in the locations or intensities of the sound sources in the sound source map can be shown more clearly.

The mapping module 125 may arrange the result values calculated from the distance R at the positions in a matrix corresponding to the grid. In particular, the mapping module 125 may repeat the operation until result values have been calculated for all coordinates, generate a matrix of the same size as the grid generated by the grid generation module 121, arrange the result values in that matrix, and use it to generate a target map in image form.
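The three modules described above (grid generation, result value calculation, and mapping) can be sketched roughly as follows. The text here does not give the exact result value formula, so the decay function `strength / (1 + R) ** exponent`, the use of the maximum over sources, and the names `make_grid` and `target_map` are assumptions chosen only so that the value peaks at each source and falls off with distance R.

```python
import math


def make_grid(n_rows, n_cols, spacing):
    """Grid generation: (x, y) coordinates of regularly spaced grid points."""
    return [[(col * spacing, row * spacing) for col in range(n_cols)]
            for row in range(n_rows)]


def target_map(n_rows, n_cols, spacing, sources, exponent=2.0):
    """Result value calculation and mapping into a matrix the same size as the grid.

    sources is a list of (x, y, strength) tuples. The result value is largest at a
    source and decreases with distance R; a larger exponent makes it fall faster.
    The specific decay function is an illustrative assumption.
    """
    grid = make_grid(n_rows, n_cols, spacing)
    matrix = [[0.0] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        for col in range(n_cols):
            grid_x, grid_y = grid[row][col]
            matrix[row][col] = max(
                strength / (1.0 + math.hypot(grid_x - src_x, grid_y - src_y)) ** exponent
                for src_x, src_y, strength in sources
            )
    return matrix


# Hypothetical example: a 5 x 5 grid with spacing K = 0.1 and one source at (0.2, 0.2).
example_map = target_map(5, 5, 0.1, [(0.2, 0.2, 1.0)])
steeper_map = target_map(5, 5, 0.1, [(0.2, 0.2, 1.0)], exponent=4.0)
```

Rendering `example_map` as pixel intensities would give the image-form target map. Raising `exponent` sharpens the peak around each source, which mirrors the adjustment of the rate of change described for the result value.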

Accordingly, the target map generated by the target map generation unit 120 may be input to the neural network learning unit 130, described below, and used as training data for the corresponding beamforming image.

The neural network learning unit 130 receives target maps from the target map generation unit 120 and repeatedly performs machine learning to build feature maps for various sound sources, that is, for beamforming images. It also receives analysis data in the form of beamforming images corresponding to acoustic signals in various frequency bands and extracts their features, thereby extracting data from which spatial aliasing has been removed and generating and providing a sound source map.

Because spatial aliasing appears spread across the whole sound source map, the structure of the deep learning model needs to differ between frequencies below the spatial Nyquist frequency and frequencies above it. Here, the spatial Nyquist frequency refers to the frequency at which the spacing between neighboring microphones on the microphone array equals half a wavelength.
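The definition above translates directly into a formula: if the spacing d equals half a wavelength, then d = c / (2f), so f = c / (2d). The short sketch below evaluates it. The sound speed of 343 m/s, the 5 cm spacing, and the function names are assumed values for illustration only.

```python
def spatial_nyquist_frequency(mic_spacing_m, sound_speed_mps=343.0):
    """Frequency at which the microphone spacing equals half a wavelength."""
    if mic_spacing_m <= 0:
        raise ValueError("mic_spacing_m must be positive")
    # spacing = wavelength / 2 = c / (2 f)  =>  f = c / (2 * spacing)
    return sound_speed_mps / (2.0 * mic_spacing_m)


def is_spatially_aliased(freq_hz, mic_spacing_m, sound_speed_mps=343.0):
    """True when the spacing exceeds half a wavelength at the given frequency."""
    return freq_hz > spatial_nyquist_frequency(mic_spacing_m, sound_speed_mps)


# Hypothetical array with 5 cm spacing.
nyquist_hz = spatial_nyquist_frequency(0.05)
```

For the assumed 5 cm spacing the threshold is 3430 Hz, so a 2000 Hz signal falls in the band without spatial aliasing and a 5000 Hz signal falls in the band with it.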

To this end, according to an embodiment of the present invention, the deep learning model applied to the system may have a two-encoder, one-decoder structure based on a convolutional encoder-decoder network.

Specifically, the neural network learning unit 130 according to an embodiment of the present invention may include: a first encoder 131 that is composed of a plurality of layers and generates, through convolution operations, a feature map including one or more main features extracted from input data in a first frequency band; a second encoder 132 that is composed of a plurality of layers and generates, through convolution operations, a feature map including one or more main features extracted from the input data in a second frequency band; a switch 135 that selects the output of either the first or the second encoder 131, 132 according to the form of the input data; and a decoder 137 that, according to the selection of the switch 135, receives the feature map output from the first or second encoder 131, 132 and restores it to generate an output image.

The first encoder 131 and the second encoder 132 are encoders each composed of a plurality of convolutional layers, and can extract feature maps from the input data according to different frequency criteria. As an example, the first encoder 131 may be configured to suit input data at ordinary frequencies (wavelengths) where spatial aliasing does not occur, and the second encoder 132 may be configured to suit input data at frequencies where spatial aliasing occurs.

Here, the frequency criterion is set with reference to the spatial Nyquist frequency, and may be set differently depending on the size of the microphone array and the distance between microphones.

Accordingly, the first encoder 131 may be designed to generate feature maps from input data at ordinary frequencies, and the second encoder 132 may be designed to generate feature maps from input data at higher frequencies.

The specific network structure and characteristics of the first and second encoders 131, 132 and of the decoder 137 are described in detail later.

The switch 135 serves to connect whichever of the first and second encoders 131, 132 is selected to the decoder 137 described below. In particular, according to an embodiment of the present invention, one of the encoders, which have different settings, is operated selectively according to the frequency when analysis data is analyzed. The switch 135 determines from the wavelength of the analysis data which frequency range the data belongs to, and passes the feature map generated by the corresponding encoder to the decoder 137.

The decoder 137 may restore, through decoding, the feature map extracted by the encoder to generate output data. The decoder 137 may be configured with a structure symmetric to the encoder, and can derive an image in which the features of a specific frequency component are maximized through a deconvolution process applied to the feature map. Accordingly, when the analysis data being analyzed is a beamforming image of one or more sound sources, a sound source map in which the locations and intensities of those sound sources are maximized is provided.
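The routing performed by the switch between the two encoders and the single decoder can be sketched as below. This shows only the control flow: the encoders and decoder are placeholder callables rather than convolutional networks, and deciding the branch by comparing the signal frequency with a spatial Nyquist threshold is an assumed selection rule. The class and function names are hypothetical.

```python
class TwoEncoderOneDecoder:
    """Control-flow sketch of a two-encoder, one-decoder model with a switch."""

    def __init__(self, encoder_low, encoder_high, decoder, nyquist_hz):
        self.encoder_low = encoder_low    # stands in for the first encoder (no aliasing)
        self.encoder_high = encoder_high  # stands in for the second encoder (aliasing)
        self.decoder = decoder            # single decoder shared by both branches
        self.nyquist_hz = nyquist_hz      # assumed threshold used by the switch

    def select_encoder(self, freq_hz):
        """Switch: choose the encoder that matches the frequency range of the input."""
        if freq_hz > self.nyquist_hz:
            return self.encoder_high
        return self.encoder_low

    def forward(self, image, freq_hz):
        features = self.select_encoder(freq_hz)(image)  # only one encoder runs
        return self.decoder(features)


def toy_encoder(tag):
    """Placeholder encoder; a real one would apply convolution, pooling, and so on."""
    def encode(image):
        return {"branch": tag, "features": [sum(row) / len(row) for row in image]}
    return encode


def toy_decoder(encoded):
    """Placeholder decoder; a real one would upsample the features back to an image."""
    return encoded


model = TwoEncoderOneDecoder(toy_encoder("low"), toy_encoder("high"), toy_decoder,
                             nyquist_hz=3430.0)
image = [[0.0, 1.0], [2.0, 3.0]]
low_result = model.forward(image, freq_hz=2000.0)
high_result = model.forward(image, freq_hz=8000.0)
```

Keeping a single decoder means both branches must emit feature maps of the same shape, which is one reason the two encoders would differ in settings such as kernel size rather than in output dimensions.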

With the structure described above, the improved sound source map system 100 according to an embodiment of the present invention can use deep learning to provide, from a beamforming image, a sound source map in which the location and intensity of each sound source can be checked intuitively. In particular, because the neural network learning unit is built with at least two encoders set to extract different feature maps for different frequencies, the system can provide a sound source map in which spatial aliasing is minimized.

Hereinafter, the technical idea of the present invention is described in detail, with reference to the drawings, through the network structure of the neural network learning unit mounted in the improved sound source map providing system according to an embodiment of the present invention.

FIG. 4 is a diagram showing the network structure of the neural network learning unit included in the improved sound source map providing system according to an embodiment of the present invention.

Referring to FIG. 4, the neural network learning unit according to an embodiment of the present invention may be composed of two convolutional encoders and one convolutional decoder. Each encoder is symmetrical to the decoder. The convolutional encoder-decoder first reduces the size of the feature map extracted from the input image and then enlarges it back to the size of the input image, so that each pixel of the input image can be assigned a class.

Each of the two convolutional encoders may be composed of a plurality of layers. The plurality of layers may include at least one convolutional layer, a pooling layer, a batch normalization layer, and a nonlinear activation layer. Within a convolutional encoder, these may be arranged in the order convolutional layer, batch normalization layer, nonlinear activation layer.

In addition, one layer may be repeated as many times as the number of channels (ch). That is, one layer may have the structure '(convolutional layer, batch normalization layer, nonlinear activation layer) × number of channels (ch) + pooling layer'.

The convolutional layer outputs a feature map by performing a convolution operation on the input image. The filter that performs the convolution operation is called a kernel, and the operation parameters that make up the kernel are called kernel parameters or weights. A convolutional layer may apply several different kernels to one input.

The convolutional layer performs the convolution operation on a specific region of the input image at a time, and that region is called a window. The window can move one cell at a time from the top left to the bottom right of the image, and the distance it moves per step can be adjusted. This step size is called the stride.

By moving the window across the input image, the convolutional layer performs the convolution operation on every region of the input image.
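The sliding-window operation described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the network of the embodiment: the function name `conv2d_valid`, the absence of padding, and the use of cross-correlation (no kernel flip, as is common in deep learning libraries) are choices made here for the example.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of rows) without padding."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride  # top-left corner of the window
            acc = 0.0
            for u in range(k_h):
                for v in range(k_w):
                    acc += image[top + u][left + v] * kernel[u][v]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 6x6 ramp image and 2x2 kernel, chosen only for illustration.
image = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
kernel = [[1.0, 0.0], [0.0, -1.0]]

full = conv2d_valid(image, kernel, stride=1)     # window moves one cell at a time
strided = conv2d_valid(image, kernel, stride=2)  # window moves two cells at a time
```

With a stride of 1 the 6×6 input yields a 5×5 feature map, and with a stride of 2 it yields a 3×3 feature map, which shows how the stride controls the size of the output.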

As an example, when a convolution operation is performed on an input, the output of the convolutional layer for an input of size (N, C_in, H, W) and an output of size (N, C_out, H, W) can be expressed by Equation 1 below (where 'N' is the batch size, 'C' is the number of channels, 'H' is the height of the kernel, and 'W' is the width of the kernel).
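Equation 1 appears as an image in the original publication and is not reproduced in this text. The following form is a reconstruction from the symbols that the description defines ('out', 'bias', 'weight', '*', 'input', N, C_in, C_out) and should be read as an assumption rather than as the published equation. The indices i, j, and c are introduced here for illustration.

```latex
\mathrm{out}\left(N_i, C_{\mathrm{out}_j}\right)
  = \mathrm{bias}\left(C_{\mathrm{out}_j}\right)
  + \sum_{c=0}^{C_{\mathrm{in}}-1}
    \mathrm{weight}\left(C_{\mathrm{out}_j}, c\right) * \mathrm{input}\left(N_i, c\right)
```

In this form each output channel is the bias plus the sum, over all input channels, of that channel's input plane convolved with the corresponding kernel.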

Here, 'out' denotes the output data, 'bias' the bias value, 'weight' the weights, '*' the convolution, and 'input' the input data.

Accordingly, in an embodiment of the present invention, the first and second convolutional encoders each perform convolution with a different kernel size {(H, W) = k}, that is, with a different value of 'k'.

In particular, according to an embodiment of the present invention, the kernel size (k) of each of the two convolutional encoders is set differently to correspond to the frequency band of its input data, so that the two encoders generate different feature maps.

In a convolutional encoder, the feature map extracted during convolution is determined by the kernel size (k). A relatively small kernel is suited to low-band images, that is, images in which spatial aliasing does not occur. A relatively large kernel is suited to high-band images, that is, images in which spatial aliasing occurs and must be minimized.

Therefore, taking the spatial Nyquist frequency as the reference, the first convolutional encoder of the present invention may be set to a relatively small kernel size (e.g., k = 5) and the second convolutional encoder to a relatively large kernel size (e.g., k = 7).

The pooling layer subsamples the feature map obtained from the convolutional layer. Pooling operations include max pooling and average pooling. Max pooling takes the largest sample value within the window, and average pooling takes the mean of the values within the window. In pooling, the stride is usually set equal to the window size.
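The two pooling operations can be compared on a small example. The sketch below is written in plain Python and assumes, as the paragraph above states is usual, that the stride equals the window size. The function name `pool2d` and the sample values are hypothetical.

```python
def pool2d(feature_map, window=2, mode="max"):
    """Subsample a 2-D feature map using a stride equal to the window size."""
    out = []
    for top in range(0, len(feature_map) - window + 1, window):
        row = []
        for left in range(0, len(feature_map[0]) - window + 1, window):
            values = [feature_map[top + u][left + v]
                      for u in range(window) for v in range(window)]
            if mode == "max":
                row.append(max(values))                # largest sample in the window
            else:
                row.append(sum(values) / len(values))  # mean of the window
        out.append(row)
    return out


fmap = [[1.0, 3.0, 2.0, 0.0],
        [4.0, 2.0, 1.0, 1.0],
        [0.0, 0.0, 5.0, 6.0],
        [2.0, 2.0, 7.0, 8.0]]

max_pooled = pool2d(fmap, window=2, mode="max")  # [[4.0, 2.0], [2.0, 8.0]]
avg_pooled = pool2d(fmap, window=2, mode="avg")  # [[2.5, 1.0], [1.0, 6.5]]
```

Both variants halve the width and height of the 4×4 map. They differ only in how the four values in each window are reduced to one.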

The nonlinear activation layer determines the output value of a neuron and may use a predetermined transfer function. Known transfer functions include ReLU and sigmoid, and the system according to an embodiment of the present invention may use the ReLU function.
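For reference, the two transfer functions named above have the following standard definitions, shown here as a plain-Python sketch applied to a few hypothetical sample values.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, clamps the rest to zero."""
    return x if x > 0.0 else 0.0


def sigmoid(x):
    """Logistic sigmoid: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


samples = (-2.0, 0.0, 3.5)
relu_out = [relu(v) for v in samples]        # [0.0, 0.0, 3.5]
sigmoid_out = [sigmoid(v) for v in samples]  # middle value is 0.5
```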

With the layer structure described above, each of the two convolutional encoders can generate a feature map for the input image.

The switch passes the output of one of the two convolutional encoders to the convolutional decoder described below. The switch is controlled according to the frequency band of the input data.

The convolutional decoder can generate an output image from the feature map that the convolutional encoders produce and the switch passes on. Whereas there are two convolutional encoders, there is a single convolutional decoder, which may be composed of a plurality of layers.

One layer of the decoder may include an upsampling layer and a deconvolutional layer. The deconvolutional layer can perform the inverse operation of the convolutional layer of the convolutional encoder. In addition, the deconvolutional layer may repeat the structure of convolutional layer, batch normalization layer, and nonlinear activation layer.

Because the convolutional decoder may have a structure symmetrical to that of the convolutional encoder, it has a similar number of convolutional layers and the same convolutional kernel size (k) and number of kernels as the convolutional encoder.

The upsampling layer can perform the inverse operation of the pooling layer. It carries out upsampling and, in contrast to the pooling layer, enlarges the dimensions of the feature map.
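The description does not state which upsampling method is used. As a minimal sketch, the example below assumes nearest-neighbor upsampling, in which every value is repeated along both axes. The function name `upsample_nearest` and the sample values are hypothetical.

```python
def upsample_nearest(feature_map, factor=2):
    """Enlarge a 2-D feature map by repeating each value `factor` times per axis."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(factor)]  # repeat columns
        out.extend([list(wide) for _ in range(factor)])         # repeat rows
    return out


small = [[1.0, 2.0],
         [3.0, 4.0]]
large = upsample_nearest(small, factor=2)
# large is 4x4: each input value now fills a 2x2 block
```

This undoes the size reduction of a 2×2 pooling step, although the values discarded by pooling are not recovered.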

The deconvolutional layer can perform the inverse operation of the convolutional layer. It carries out the convolution operation in the opposite direction to the convolutional layer: it receives a feature map as input and, through a convolution operation using a kernel, generates an output image for visualization.

Here, with a stride of 1 the deconvolutional layer can output an image whose width and height are the same as those of the feature map, and with a stride of 2 it can output an image whose width and height are half those of the feature map.

Accordingly, the output image obtained after all deconvolution steps have been completed is restored from the feature map of the selected encoder, in which spatial aliasing is minimized, and therefore has an accuracy above a certain level.

Hereinafter, a method of generating a sound source map with the system according to an embodiment of the present invention is described in detail with reference to the drawings.

FIG. 5 is a diagram showing a machine learning method of the improved sound source map providing system including the neural network learning unit according to an embodiment of the present invention, and FIG. 6 is a diagram showing a method by which the improved sound source map providing system including the neural network learning unit provides a sound source map.

In the following description, unless otherwise stated, each step is performed by the neural network learning unit, the system, or one of their components according to an embodiment of the present invention.

Referring to FIG. 5, in the machine learning method using the improved sound source map providing system according to an embodiment of the present invention, a microphone array in which one or more microphones are arranged receives an acoustic signal from a sound source for data collection (S100), and the system receives that acoustic signal from the microphone array in the form of an electrical signal.

Step S100 is not limited to collecting sound in real time to generate the sound source map. Sound that was received and recorded in advance through a separate microphone array may also be used.

Next, the system generates a beamforming image from the input acoustic signal using a beamforming technique (S110). Beamforming techniques such as delay-and-sum beamforming may be used in step S110.
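As a rough illustration of step S110, the sketch below computes a single-frequency, frequency-domain delay-and-sum map in plain Python. Everything specific in it is an assumption made for the example and is not taken from the description: the 4×4 planar array with 5 cm spacing, the scan plane 1 m away, the 2 kHz monopole source, the speed of sound, and the function names `simulate_pressures` and `das_map`.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air


def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def simulate_pressures(mics, source, freq_hz):
    """Complex pressure at each microphone from one monopole source."""
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND  # wavenumber
    return [cmath.exp(-1j * k * distance(m, source)) / distance(m, source)
            for m in mics]


def das_map(mics, pressures, grid, freq_hz):
    """Delay-and-sum output power at every grid point."""
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    power = []
    for g in grid:
        # Compensate the propagation phase from grid point g to each microphone,
        # then sum: contributions add in phase only where a source really is.
        total = sum(p * cmath.exp(1j * k * distance(m, g))
                    for m, p in zip(mics, pressures))
        power.append(abs(total / len(mics)) ** 2)
    return power


# Hypothetical setup: 4x4 planar array in the z = 0 plane, 11x11 scan grid at z = 1 m.
mics = [(0.05 * ix - 0.075, 0.05 * iy - 0.075, 0.0)
        for iy in range(4) for ix in range(4)]
grid = [(0.1 * ix - 0.5, 0.1 * iy - 0.5, 1.0)
        for iy in range(11) for ix in range(11)]
source = grid[5 * 11 + 7]  # place the simulated source on a grid point

pressures = simulate_pressures(mics, source, freq_hz=2000.0)
beam_image = das_map(mics, pressures, grid, freq_hz=2000.0)
peak_index = max(range(len(grid)), key=beam_image.__getitem__)
```

Reshaping `beam_image` into 11 rows of 11 values gives the beamforming image. In this aliasing-free example its peak falls on the grid point where the source was placed.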

The system may then carry out a procedure for generating a target map in order to obtain training data.

Next, from the acoustic signal, the system generates a grid of regularly spaced grid points over a specific region in which one or more sound sources exist (S120). The position of each sound source is marked on the grid, and a suitable spacing between grid points is determined through repeated experiments.

Next, the system calculates result values, which are the values that individual pixels take when the acoustic signal is rendered as an image, according to the positions of the grid points and of the one or more sound sources on the grid and the distances between them (S130).

Next, the system places the result values, calculated from the distance to the sound source in the coordinate system, at the matrix positions corresponding to the grid, and uses them to generate a target map in image form (S140).
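Steps S120 to S140 can be sketched as follows. This passage does not give the function that maps distance to a pixel value, so the example assumes a Gaussian falloff purely for illustration. The function name `make_target_map`, the grid size, the spacing, and the source positions are likewise hypothetical.

```python
import math


def make_target_map(sources, n_rows, n_cols, spacing, sigma):
    """Build an n_rows x n_cols matrix of pixel values on a regular grid.

    `sources` is a list of (x, y, strength) tuples in the same length unit as
    `spacing`. The Gaussian falloff with width `sigma` is an assumed choice.
    """
    target = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            gx, gy = c * spacing, r * spacing  # coordinates of this grid point
            value = 0.0
            for sx, sy, strength in sources:
                dist_sq = (gx - sx) ** 2 + (gy - sy) ** 2
                falloff = math.exp(-dist_sq / (2.0 * sigma ** 2))
                value = max(value, strength * falloff)
            row.append(value)  # result value placed at the matching matrix cell
        target.append(row)
    return target


# Two hypothetical sources on a 6x6 grid with 0.1 m spacing.
sources = [(0.2, 0.1, 1.0), (0.4, 0.4, 0.5)]
target_map = make_target_map(sources, n_rows=6, n_cols=6, spacing=0.1, sigma=0.05)
```

Each row of `target_map` corresponds to one row of grid points, so the matrix can be displayed directly as an image whose brightest pixels mark the source positions.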

The system then inputs the target map to the mounted neural network learning unit and trains it, using the target map as training data for the corresponding beamforming image (S150).

Next, referring to FIG. 6, in the method of providing a sound source map using the system according to an embodiment of the present invention, the acoustic signal acquired from the microphone array is converted into a beamforming image, which is input to the trained neural network learning unit. Through feature extraction by convolution, the unit generates a sound source map in which the location and intensity of each sound source are clearly shown on the image.

To this end, when the neural network learning unit receives analysis data, it extracts the main low-band features from the analysis data through the first encoder, whose kernel is no larger than (or smaller than) a reference size, and generates a feature map containing them (S200). The neural network learning unit also extracts the main high-band features from the analysis data through the second encoder, whose kernel is larger than (or at least as large as) the reference size, and generates a feature map containing them (S210).

Next, the neural network learning unit selects an encoder according to the form of the analysis data. If the frequency of the analysis data is one at which the microphone spacing described above exceeds half a wavelength, it selects the second encoder. Otherwise, it selects the first encoder (S220).
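The selection rule of step S220 can be written down directly. The sketch below is an illustration under stated assumptions: the speed of sound, the 5 cm microphone spacing, and the function names are hypothetical, while the kernel sizes 5 and 7 are the example values given earlier in the description.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air


def spatial_nyquist_frequency(mic_spacing_m):
    """Frequency at which the microphone spacing equals half a wavelength."""
    return SPEED_OF_SOUND / (2.0 * mic_spacing_m)


def select_encoder(freq_hz, mic_spacing_m):
    """Return 'second' when the spacing exceeds half a wavelength, else 'first'."""
    wavelength = SPEED_OF_SOUND / freq_hz
    return "second" if mic_spacing_m > wavelength / 2.0 else "first"


# Example kernel sizes from the description (k = 5 and k = 7).
KERNEL_SIZE = {"first": 5, "second": 7}

spacing = 0.05  # hypothetical 5 cm microphone spacing
choice_low = select_encoder(2000.0, spacing)   # below the spatial Nyquist frequency
choice_high = select_encoder(8000.0, spacing)  # above it, so aliasing is expected
```

For a 5 cm spacing the spatial Nyquist frequency is 3430 Hz under the assumed speed of sound, so a 2 kHz image is routed to the first encoder and an 8 kHz image to the second.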

Next, the feature map output from the encoder selected in step S220 is passed to the decoder. The decoder restores the feature map through deconvolution and generates a sound source map in which the intensity and location of each sound source are shown accurately and in which the spatial aliasing that arises at high-band frequencies is minimized (S230).

Although the above description sets out many details specifically, these should be interpreted as examples of preferred embodiments rather than as limiting the scope of the invention. Therefore, the scope of the invention should be determined not by the described embodiments but by the claims and their equivalents.

10: microphone array 100: improved sound source map system
110: image generation unit 120: target map generation unit
121: grid generation module 123: result value calculation module
125: mapping module 130: neural network learning unit
131: first encoder 132: second encoder
135: switch 137: decoder

Claims

A neural network learning unit comprising:
a first encoder that is composed of a plurality of layers and generates, through a convolution operation, a feature map including one or more main features extracted from input data in a first frequency band;
a second encoder that is composed of a plurality of layers and generates, through a convolution operation, a feature map including one or more main features extracted from input data in a second frequency band;
a switch that selects the output of one of the first and second encoders according to the form of the input data; and
a decoder that, according to the selection of the switch, receives the feature map output from the first or second encoder and restores it to generate an output image.

The neural network learning unit according to claim 1,
wherein the first and second frequency bands are separated into a high band and a low band, with the spatial Nyquist frequency of the input data as the boundary.

The neural network learning unit according to claim 2,
wherein the plurality of layers each include one or more convolutional layers, a pooling layer, a batch normalization layer, and a nonlinear activation layer.

The neural network learning unit according to claim 3,
wherein the first and second convolutional encoders have convolutional layers whose kernel sizes are set differently from each other.

The neural network learning unit according to claim 2,
wherein, during learning, the switch selects the first encoder when spatial aliasing does not exist in the input data and selects the second encoder when spatial aliasing exists in the input data.

An improved sound source map providing system that generates a sound source map from acoustic signals received from a microphone array in which a plurality of microphones are arranged at regular intervals, the system comprising:
an image generation unit that generates a beamforming image, through a beamforming process, from the acoustic signal received from the microphone array;
a target map generation unit that generates a target map using the beamforming image; and
a neural network learning unit that receives the target map and performs learning, and that generates, based on a set deep learning model, a sound source map imaging the location and intensity of the sound source included in an analysis image according to one or more main features included in that image.

The improved sound source map providing system according to claim 6,
wherein the neural network learning unit comprises:
a first encoder that is composed of a plurality of layers and generates, through a convolution operation, a feature map including one or more main features extracted in a first frequency band from the beamforming image;
a second encoder that is composed of a plurality of layers and generates, through a convolution operation, a feature map including one or more main features extracted in a second frequency band from the beamforming image;
a switch that selects the output of one of the first and second encoders according to the form of the beamforming image; and
a decoder that generates the sound source map by restoring the feature map output from the first or second encoder according to the selection of the switch.

The improved sound source map providing system according to claim 7,
wherein the first and second frequency bands are separated into a high band and a low band, with the spatial Nyquist frequency of the beamforming image as the boundary.

The improved sound source map providing system according to claim 7,
wherein the switch selects the first encoder when spatial aliasing does not exist in the beamforming image and selects the second encoder when spatial aliasing exists in the beamforming image.