KR102504043B1

KR102504043B1 - Face-to-face Recording Apparatus and Method with Robust Dialogue Voice Separation in Noise Environments

Info

Publication number: KR102504043B1
Application number: KR1020210040098A
Authority: KR
Inventors: 김선만; 이광훈; 김회민; 전성국
Original assignee: 한국광기술원
Priority date: 2021-03-29
Filing date: 2021-03-29
Publication date: 2023-02-28
Also published as: KR20220134877A

Abstract

Disclosed are a face-to-face recording apparatus and method with a dialogue voice separation function that is robust in noisy environments.
According to an embodiment of the present invention, there is provided a face-to-face recording system comprising: a microphone that is disposed in a predetermined place and records sound signals generated in that place; a face-to-face recording device that receives the sound signal recorded by the microphone, filters out only the noise signal in the sound signal, and separates each of the voice signals within the noise-filtered sound signal; and a sound source storage device that receives and stores the voice signals separated by the face-to-face recording device.

Description

Face-to-face Recording Apparatus and Method with Robust Dialogue Voice Separation in Noise Environments

The present invention relates to a face-to-face recording apparatus and method that provide a dialogue voice separation function robust to noisy environments and allow the separated voices to be tagged easily.

The content described in this section merely provides background information on the present embodiment and does not constitute prior art.

Sellers of financial products have a duty to explain investment products to customers, and cases of financial and psychological harm arising from customer claims, such as a claim that a product was not fully explained, have recently been increasing. As an example of financial harm, if a financial product seller such as a bank or securities firm breaches its duty of explanation or engages in unfair practices, it must pay a punitive penalty of up to 50% of the revenue related to the violation.

Therefore, when a customer claims damages for breach of the duty of explanation, asserting that the product was not properly explained (even though the seller fulfilled its duty), the seller must prove that it explained the financial product in order to avoid such harm. As part of this effort, financial product sellers are adopting face-to-face recording systems, and the related market is steadily growing.

A recorder is placed in the space where the customer and the (financial product) seller hold their consultation, and the recorded voice is stored in preparation for the situation described above. It is efficient to separate ambient noise from the recorded voice and to separate and store the customer's and the seller's voices individually. In conventional recording systems, however, doing all of this is technically difficult, which leads to inconveniences such as degraded quality of the recorded voice or an increased size of the recorded audio.

An object of an embodiment of the present invention is to provide a face-to-face recording apparatus and method that record voice, remove noise from the recorded file, separate the recording by sound source, and tag and store the result.

According to one aspect of the present invention, there is provided a face-to-face recording system comprising: a microphone that is disposed in a predetermined place and records sound signals generated in that place; a face-to-face recording device that receives the sound signal recorded by the microphone, filters out only the noise signal in the sound signal, and separates each of the voice signals within the noise-filtered sound signal; and a sound source storage device that receives and stores the voice signals separated by the face-to-face recording device.

According to one aspect of the present invention, the microphone may have one or more channels.

According to one aspect of the present invention, the face-to-face recording device converts the received sound signal into the frequency domain.

According to one aspect of the present invention, the face-to-face recording device includes databases that store noise signals and voice signals, respectively, and learns the noise signals and voice signals in each database.

According to one aspect of the present invention, the face-to-face recording device infers a noise canceling frequency filter for filtering out the noise signal, based on the learning result and the frequency-converted sound signal.

According to one aspect of the present invention, there is provided a voice signal separation method in which a face-to-face recording system records a sound signal and separates and stores only the voice signals, the method comprising: a recording process of recording a sound signal generated in a predetermined place; a filtering process of receiving the sound signal recorded in the recording process and filtering out only the noise signal in the sound signal; a separation process of separating each of the voice signals within the sound signal from which the noise signal has been filtered in the filtering process; and a storage process of receiving and storing the voice signals separated in the separation process.

According to one aspect of the present invention, there is provided a face-to-face recording device comprising: a domain conversion unit that receives a sound signal from a predetermined place and converts the sound signal into the frequency domain; a filter inference unit that learns voice signals and noise signals, receives the signal converted by the domain conversion unit, and infers a noise canceling frequency filter from the learned result; a filtering unit that uses the inferred filter to filter out the noise signal in the signal converted by the domain conversion unit; a voice separation unit that separates the individual voice signals within the sound signal from which the noise signal has been removed by the filtering unit; and a tagging unit that tags the location or direction of the source of each separated voice signal.

According to one aspect of the present invention, the domain conversion unit converts the sound signal into the frequency domain by performing a fast Fourier transform (FFT).

According to one aspect of the present invention, the sound signal received by the domain conversion unit is a sound signal recorded from one or more channels.

According to one aspect of the present invention, the filter inference unit includes a voice database that stores voice signals and a noise database that stores noise signals.

According to one aspect of the present invention, in learning voice signals, the filter inference unit learns voice signals that contain a number of simultaneous voices no greater than the number of channels on which the sound signal was recorded.

According to one aspect of the present invention, the voice separation unit uses blind source separation.

According to one aspect of the present invention, there is provided a face-to-face recording method in which a face-to-face recording device separates and tags only the voice signals within a sound signal, the method comprising: a conversion process of receiving a sound signal from a predetermined place and converting the sound signal into the frequency domain; an inference process of learning voice signals and noise signals, receiving the signal converted in the conversion process, and inferring a noise canceling frequency filter from the learned result; a filtering process of using the filter inferred in the inference process to filter out the noise signal in the signal converted in the conversion process; a separation process of separating the individual voice signals within the sound signal from which the noise signal has been removed in the filtering process; and a tagging process of tagging the location or direction of the source of each voice signal separated in the separation process.

According to one aspect of the present invention, the conversion process converts the sound signal into the frequency domain by performing a fast Fourier transform (FFT).

As described above, according to one aspect of the present invention, voice is recorded, noise in the recorded file is removed, and the recording is separated by sound source and tagged, so that each sound source can be stored separately and the sound quality of the stored sound sources can be improved.

FIG. 1 is a diagram showing the configuration of a face-to-face recording system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the configuration of a face-to-face recording device according to an embodiment of the present invention.
FIG. 3 is a diagram showing the configuration of a filter inference unit according to an embodiment of the present invention.
FIG. 4 is a diagram showing a histogram analyzed by a tagging unit according to an embodiment of the present invention.
FIG. 5 is a flowchart showing a method by which a face-to-face recording device according to an embodiment of the present invention records sound and separates and tags each sound source.
FIG. 6 is a diagram showing a method by which a face-to-face recording system according to an embodiment of the present invention records a sound signal and separates and stores only the voice signals.

Since the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. Like reference numerals are used for like elements throughout the description of the figures.

Terms such as first, second, A, and B may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but intervening components may also be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" should be understood as not precluding the presence or possible addition of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in this application, should not be interpreted in an idealized or excessively formal sense.

In addition, each configuration, process, step, or method included in each embodiment of the present invention may be shared among embodiments to the extent that they do not technically contradict one another.

FIG. 1 is a diagram showing the configuration of a face-to-face recording system according to an embodiment of the present invention.

Referring to FIG. 1, a face-to-face recording system 100 according to an embodiment of the present invention includes a microphone 110, a face-to-face recording device 120, and a sound source storage device 130.

The face-to-face recording system 100 records the voice of the person or persons present in a space occupied by one or more people. Owing to the nature of such a space, noise is likely to be present in addition to the voices. The face-to-face recording system 100 filters the noise out of the recording and separates and stores only high-quality voice. When two or more people are in the space, the system separates each person's voice, identifies the location (direction) of each voice's source, and tags which person each separated voice came from. The sound source storage device 130 can therefore use the tags to classify and store the voices of the desired speakers, and unneeded voices can be deleted or managed separately. Because the stored voices have had noise filtered out, their quality is also improved.

The microphone 110 records the sound signals generated in the space where the person or persons are present. As shown in FIG. 1, the microphone 110 has as many channels as there are people in the space, and each channel records the sound signals generated in the space. The microphone 110 passes the recording to the face-to-face recording device 120.

The face-to-face recording device 120 receives the sound signal recorded by the microphone 110, filters out the noise signal, and distinguishes and tags the voice signal produced by each person. First, on the basis of learning from big data, the face-to-face recording device 120 infers a filter that separates (filters) noise signals from human voice signals and uses it to separate the two. Second, when several people have spoken in the place, the face-to-face recording device 120 splits the noise-separated human voice signal into individual voice signals using sound source separation. After separating the voice signals, the face-to-face recording device 120 determines the location (direction) of each signal's source in order to identify who produced the signal and from where. When the face-to-face recording system 100 is used in a space where financial products are explained, as mentioned in the background, the positions of the customer, the seller, and any other participants are usually fixed. Hence, if the location of a voice signal's source is known, the signal can be tagged with the identity of the speaker. Because the face-to-face recording device 120 separates noise signals from human voice signals on the basis of big-data learning (deep learning), it can do so simply and accurately. In addition, because the face-to-face recording device 120 applies blind source separation to the noise-separated voice signal, it can accurately separate the voices without having to learn each individual's voice. The detailed structure of the face-to-face recording device 120 is described below with reference to FIGS. 2 and 3.

The sound source storage device 130 stores the voice signals separated and tagged by the face-to-face recording device 120. Because the voice signals arrive separated by person and tagged, the sound source storage device 130 can store each person's voice signals separately, which makes management and search easier. In addition, because the noise signal is removed from the sound signal recorded by the microphone 110, the data size of the voice signal is relatively reduced, which leaves more free storage capacity in the sound source storage device 130.

FIG. 2 is a diagram showing the configuration of a face-to-face recording device according to an embodiment of the present invention.

Referring to FIG. 2, the face-to-face recording device 120 according to an embodiment of the present invention includes a domain conversion unit 210, a filter inference unit 220, a filtering unit 230, a voice separation unit 240, and a tagging unit 250.

The domain conversion unit 210 receives the sound signal recorded by the microphone 110 and converts it into the frequency domain. When the microphone 110 has recorded sound signals on multiple channels, the domain conversion unit 210 receives the channels as separate inputs. Because the received sound signal is time-domain data, the domain conversion unit 210 converts it into the frequency domain, and it may perform a fast Fourier transform (FFT) to do so.
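The per-channel conversion described above can be illustrated with a minimal sketch. It assumes a single short frame per channel whose length is a power of two and uses a textbook recursive radix-2 FFT written with the standard library only; the function names `fft` and `to_frequency_domain` and the test tones are illustrative and do not come from the patent. A practical implementation would more likely apply a windowed short-time transform from a numerical library.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def to_frequency_domain(channels):
    """Transform each microphone channel separately so channels stay distinct."""
    return [fft(frame) for frame in channels]

# Hypothetical 2-channel frame of 8 samples: a 1-cycle and a 2-cycle cosine.
N = 8
frame = [
    [math.cos(2 * math.pi * 1 * n / N) for n in range(N)],
    [math.cos(2 * math.pi * 2 * n / N) for n in range(N)],
]
spectra = to_frequency_domain(frame)

# Strongest non-negative-frequency bin per channel.
peak_bins = [max(range(N // 2 + 1), key=lambda k: abs(s[k])) for s in spectra]
```

Each channel keeps its own spectrum, which matters later because the separation and tagging stages rely on differences between channels.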

The filter inference unit 220 learns stored voice signals and noise signals and, using the sound signal converted by the domain conversion unit 210, infers the optimal filter for separating the noise signal from the input sound signal. The filter inference unit 220 learns a variety of noise signals that can occur in various spaces and the voice signals of a variety of people. It then receives the sound signal converted by the domain conversion unit 210 and, on the basis of what it has learned, infers the filter best suited to the received data. The detailed structure of the filter inference unit 220 is shown in FIG. 3.

FIG. 3 is a diagram showing the configuration of a filter inference unit according to an embodiment of the present invention.

Referring to FIG. 3, the filter inference unit 220 according to an embodiment of the present invention includes a voice database 310, a noise database 315, a multi-voice generation unit 320, a deep learning unit 330, and an inference unit 340.

The voice database 310 stores the various kinds of voice signals that the face-to-face recording device 120 is intended to separate from noise signals and retain. As in the example above, if the aim is to separate the voice signals of the person or persons located where the microphone 110 is (or will be) placed, the voice database 310 stores a large number of voice signals from a variety of people for the deep learning unit 330 to learn.

The noise database 315 stores the various kinds of noise signals that the face-to-face recording device 120 is intended to separate from voice signals. As in the example above, the noise database 315 stores, for the deep learning unit 330 to learn, the various kinds of noise signals that can occur where the microphone 110 is or will be placed.

The multi-voice generation unit 320 generates, from the voice signals stored in the voice database 310, the multi-voice signals that the deep learning unit 330 will learn. The multi-voice generation unit 320 generates mixtures that contain a number of voice signals no greater than the number of microphone channels installed in the recording device. For example, when the microphone 110 records sound on two channels in a place where two people are located, the multi-voice generation unit 320 randomly alternates between a single voice signal and a mixture of two simultaneous voice signals, generating enough of them for the deep learning unit 330 to learn from. Only when the deep learning unit 330 learns voice signals that match the number of people actually present in the place can it learn and infer a filter that accurately separates only the noise signal from the sound signal recorded by the microphone 110.
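The random alternation between single-voice and multi-voice training examples can be sketched as follows. This is only an illustration under stated assumptions: the toy `voice_db` list stands in for the voice database 310, utterances are plain lists of samples, and the helper name `make_training_mixture` is hypothetical. The patent does not specify how the mixtures are formed beyond the limit on the number of simultaneous voices, so simple sample-wise summation is assumed here.

```python
import random

def make_training_mixture(voice_db, num_channels, rng):
    """Pick 1..num_channels utterances at random and sum them sample by sample."""
    k = rng.randint(1, num_channels)      # number of simultaneous talkers
    chosen = rng.sample(voice_db, k)      # distinct utterances
    length = min(len(u) for u in chosen)  # truncate to the shortest utterance
    mixture = [sum(u[i] for u in chosen) for i in range(length)]
    return k, mixture

# Toy stand-in for the voice database 310: three short "utterances".
voice_db = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, -0.5, 0.5, -0.5],
    [-0.1, -0.1, -0.1, -0.1],
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
batch = [make_training_mixture(voice_db, num_channels=2, rng=rng) for _ in range(100)]
talker_counts = {k for k, _ in batch}
```

With `num_channels=2`, every generated example contains either one or two voices, mirroring the two-person, two-channel case described above.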

The deep learning unit 330 receives noise signals from the noise database 315 and voice signals from the voice database 310 or the multi-voice generation unit 320, and performs learning for generating a noise canceling frequency filter. The deep learning unit 330 learns the data stored in the databases 310 and 315. It learns the data stored in the noise database 315 to recognize information about noise signals, and it learns both the individual voice signals from the voice database 310 and the multi-voice signals produced by the multi-voice generation unit 320, in which no more voices are mixed than there are microphone channels, to recognize information about voice signals. Through this learning, the deep learning unit 330 learns to generate a noise canceling frequency filter that can filter out only the noise signal from a sound signal in which voice and noise are mixed.

Based on the learning result of the deep learning unit 330, the inference unit 340 receives the frequency values converted by the domain conversion unit 210 and infers a noise-canceling frequency filter. When the domain conversion unit 210 has converted sound signals recorded on a plurality of channels into the frequency domain, the inference unit 340 does not receive the converted values of each channel individually; it receives a single value obtained by averaging the converted values across the channels. If the inference unit 340 instead received each channel's converted values separately, a different filter would be inferred for each channel and the spatial characteristics of the channels would diverge, making it difficult to infer the position (direction) of each voice's source when the voice signals are separated after noise removal. Only when the inference unit 340 receives the average of the values converted by the domain conversion unit 210 can a single frequency filter be used in common for all channels, so that the spatial characteristics of the noise-removed voice signals remain the same. Accordingly, the inference unit 340 receives the average of the values converted from the sound signals input on each channel and infers the noise-canceling frequency filter from that average.
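The channel averaging described above can be sketched as follows. The source does not say whether complex values or magnitudes are averaged; this sketch assumes magnitudes, and the function name `channel_average` and the array layout are hypothetical.

```python
import numpy as np

def channel_average(spectra):
    """Collapse per-channel spectra into the single input used for filter inference.

    spectra: complex array of shape (channels, frames, bins).
    Returns a real array of shape (frames, bins).
    Averaging magnitudes (not complex values) is an assumption of this sketch.
    """
    return np.abs(spectra).mean(axis=0)
```

Because only one averaged input exists, only one filter is inferred, and that filter can then be shared by every channel.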

Referring back to FIG. 2, the filtering unit 230 uses the filter inferred by the filter inference unit 220, specifically by the inference unit 340, to filter the noise signal out of the sound signal input to the domain conversion unit 210. The filtering unit 230 multiplies the inferred noise-canceling frequency filter by the frequency-domain values produced by the domain conversion unit 210, thereby removing only the noise signal from the recorded sound signal. When sound signals recorded on a plurality of channels are input to the domain conversion unit 210, the filtering unit 230 filters the frequency-converted values of each sound signal individually. As a result, a voice (as frequency-domain values) from which the noise signal has been removed is obtained for each channel.
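The multiplication step can be sketched as below, assuming the inferred filter is a real-valued per-bin gain (the source only says the filter is multiplied by the frequency-domain values). The name `apply_common_filter` and the array shapes are hypothetical. The short check at the end illustrates why one shared filter leaves the spatial characteristics intact: the ratio between the two channels is the same before and after filtering.

```python
import numpy as np

def apply_common_filter(spectra, mask):
    """Multiply every channel's spectrum by the same real-valued filter.

    spectra: complex array of shape (channels, frames, bins).
    mask: real array of shape (frames, bins), assumed to lie in [0, 1].
    """
    return spectra * mask[np.newaxis, :, :]

# Illustration with random data (not real recordings).
rng = np.random.default_rng(0)
spectra = rng.standard_normal((2, 4, 8)) + 1j * rng.standard_normal((2, 4, 8))
mask = rng.uniform(0.1, 1.0, size=(4, 8))
filtered = apply_common_filter(spectra, mask)

# The inter-channel ratio carries amplitude and phase differences between
# microphones; a shared real gain cancels out of it.
ratio_preserved = np.allclose(filtered[0] / filtered[1], spectra[0] / spectra[1])
```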

The voice separation unit 240 separates the voice signals using blind source separation. When a plurality of voice signals are present, the voice separation unit 240 separates each of them. For example, at frequency index k and time t, voice signals S₁(t, k) and S₂(t, k) originating somewhere in a space may be recorded on microphone channels 1 and 2. If the signals recorded on microphone channels 1 and 2 are denoted X₁(t, k) and X₂(t, k), respectively, then X₁(t, k) and X₂(t, k) satisfy the following.

X₁(t, k) = H₁₁(t, k)S₁(t, k) + H₁₂(t, k)S₂(t, k)

X₂(t, k) = H₂₁(t, k)S₁(t, k) + H₂₂(t, k)S₂(t, k)

Here, H₁₁(t, k) denotes the complex transfer weight between microphone channel 1 and the first voice signal, H₁₂(t, k) the complex transfer weight between microphone channel 1 and the second voice signal, H₂₁(t, k) the complex transfer weight between microphone channel 2 and the first voice signal, and H₂₂(t, k) the complex transfer weight between microphone channel 2 and the second voice signal.

The voice separation unit 240 estimates and separates S₁(t, k) and S₂(t, k) from the noise-removed recorded signals X₁(t, k) and X₂(t, k). From the equations above,

$$\begin{bmatrix} X_1(t,k) \\ X_2(t,k) \end{bmatrix} = \begin{bmatrix} H_{11}(t,k) & H_{12}(t,k) \\ H_{21}(t,k) & H_{22}(t,k) \end{bmatrix} \begin{bmatrix} S_1(t,k) \\ S_2(t,k) \end{bmatrix}$$

holds, and therefore

$$\begin{bmatrix} S_1(t,k) \\ S_2(t,k) \end{bmatrix} = \begin{bmatrix} H_{11}(t,k) & H_{12}(t,k) \\ H_{21}(t,k) & H_{22}(t,k) \end{bmatrix}^{-1} \begin{bmatrix} X_1(t,k) \\ X_2(t,k) \end{bmatrix}$$

is satisfied.

Assuming that a sound source separation matrix W(k) exists that satisfies

$$\begin{bmatrix} S_1(t,k) \\ S_2(t,k) \end{bmatrix} = W(k) \begin{bmatrix} X_1(t,k) \\ X_2(t,k) \end{bmatrix},$$

the voice separation unit 240 estimates W(k) by computing the W(k) for which S₁(t, k) and S₂(t, k) are mutually independent (no correlation exists between them). This computation is possible because X₁(t, k) and X₂(t, k) are known values.

By estimating the sound source separation matrix W(k) using blind source separation, the voice separation unit 240 can separate the sound sources S₁(t, k) and S₂(t, k).
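The idea of finding a separation matrix that makes the outputs independent can be illustrated with a deliberately simplified sketch. It is not the algorithm of this document, which does not name one: the sketch works on real-valued, instantaneously mixed signals rather than complex values per frequency bin, and it swaps in a basic kurtosis-based independent component analysis (whitening followed by a rotation search). The function names and the grid of 360 angles are hypothetical.

```python
import numpy as np

def excess_kurtosis(y):
    """Excess kurtosis of each row of y (rows assumed zero-mean, unit-variance)."""
    return np.mean(y**4, axis=1) - 3.0

def separate_two_sources(X, n_angles=360):
    """Estimate a 2x2 matrix W such that the rows of W @ X are independent.

    X: real array of shape (2, samples) holding the two recorded channels.
    """
    X = X - X.mean(axis=1, keepdims=True)

    # Whitening: decorrelate the two channels and give them unit variance.
    eigval, eigvec = np.linalg.eigh(np.cov(X))
    V = eigvec @ np.diag(eigval**-0.5) @ eigvec.T
    Z = V @ X

    # After whitening only a rotation remains unknown. Choose the angle whose
    # outputs are most non-Gaussian (largest total absolute kurtosis).
    best_R, best_score = np.eye(2), -np.inf
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, s], [-s, c]])
        score = np.sum(np.abs(excess_kurtosis(R @ Z)))
        if score > best_score:
            best_R, best_score = R, score
    return best_R @ V

# Synthetic example: two independent Laplacian sources, mixed by a known matrix.
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 20000))
H = np.array([[1.0, 0.6], [0.4, 1.0]])   # stand-in for the transfer weights
X = H @ S                                # what the two channels record
W = separate_two_sources(X)
S_est = W @ X                            # recovered up to order, sign, and scale
```

As with any blind method, the recovered sources come back in arbitrary order and scale, which is one reason a separate direction-based tagging step is useful afterwards.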

The tagging unit 250 uses the sound source separation matrix estimated by the voice separation unit 240 to analyze the direction of each voice and tags the position (direction) of the source of each voice signal. To do so, the tagging unit 250 computes the phase angles of the sound source separation matrix.

That is, the tagging unit 250 computes the phase angles of the complex weights that belong to the same sound source within the sound source separation matrix.

The tagging unit 250 builds and analyzes a histogram from the computed phase angles. An example of a histogram analyzed by the tagging unit 250 is shown in FIG. 4.

FIG. 4 is a diagram showing histograms analyzed by the tagging unit according to an embodiment of the present invention.

FIG. 4 shows the histograms obtained when the microphone 110 records sound signals on two channels in a place where two people are located. In the graphs of FIG. 4, the x-axis represents the angle and the y-axis represents the count (frequency of occurrence).

If a voice signal originates to the left of a reference value (threshold; for example, the central axis of the microphone 110), markedly higher counts appear at angles smaller than the reference value, as shown in FIG. 4A.

Conversely, if a voice signal originates to the right of the reference value (threshold; for example, the central axis of the microphone 110), markedly higher counts appear at angles larger than the reference value, as shown in FIG. 4B.

Because the positions of the customer, the seller, and any other persons are usually fixed, the tagging unit 250 tags whose voice each voice signal is, based on the position (direction) of the source of that voice signal.
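The histogram-and-threshold decision described above can be sketched as follows. The angle range of -90 to 90 degrees, the bin count, the default threshold of 0 degrees, and the seating plan in `SEATS` are all assumptions made for illustration; the function names are hypothetical as well.

```python
import numpy as np

def tag_direction(phase_angles_deg, threshold_deg=0.0, n_bins=36):
    """Return "left" or "right" according to which side of the threshold
    holds most of the histogram counts."""
    counts, edges = np.histogram(
        phase_angles_deg, bins=n_bins, range=(-90.0, 90.0)
    )
    centers = 0.5 * (edges[:-1] + edges[1:])
    left = counts[centers < threshold_deg].sum()
    right = counts[centers >= threshold_deg].sum()
    return "left" if left > right else "right"

# Hypothetical seating plan: who sits on which side of the microphone axis.
SEATS = {"left": "customer", "right": "seller"}

def tag_speaker(phase_angles_deg, threshold_deg=0.0):
    """Map the dominant direction of a separated voice to a speaker label."""
    return SEATS[tag_direction(phase_angles_deg, threshold_deg)]
```

With a fixed seating plan, the direction label alone is enough to attribute each separated voice to a person, which matches the reasoning in the paragraph above.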

Because the inference unit 340 infers the noise-canceling frequency filter from the average of the frequency-domain values of the sound signals input on each channel, the spatial characteristics are not altered. Consequently, sound source separation in the voice separation unit 240 and tagging in the tagging unit 250 are performed without difficulty.

The voice file for which tagging by the tagging unit 250 is complete is stored in the sound source storage device 130.

FIG. 5 is a flowchart illustrating a method by which a face-to-face recording device according to an embodiment of the present invention records sound, separates each sound source, and tags it.

The domain conversion unit 210 receives the sound signals recorded on each channel of the microphone and converts each of them into the frequency domain (S510).

The filter inference unit 220 learns noise signals and voice signals whose number does not exceed the number of channels (S520).

The filter inference unit 220 receives the average of the sound signals converted into the frequency domain and infers a noise-canceling frequency filter from the learned result (S530).

The filtering unit 230 filters out the noise signal within each sound signal converted into the frequency domain, based on the inferred filter (S540).

The voice separation unit 240 separates each voice signal from the noise-removed sound signals (S550).

The tagging unit 250 tags the position or direction of the source of each separated voice signal (S560).
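Step S510 of the flow above, on which every later step depends, can be sketched as a framed FFT applied to each channel. The frame length, hop size, and Hann window are assumptions of this sketch, as is the function name `to_frequency_domain`; the source states only that each channel is converted into the frequency domain.

```python
import numpy as np

def to_frequency_domain(signals, frame_len=512, hop=256):
    """Convert multi-channel audio into per-frame spectra.

    signals: real array of shape (channels, samples).
    Returns a complex array of shape (channels, frames, frame_len // 2 + 1),
    indexed by channel, time frame t, and frequency index k.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (signals.shape[1] - frame_len) // hop
    # Slice overlapping frames for all channels at once.
    frames = np.stack(
        [signals[:, i * hop : i * hop + frame_len] for i in range(n_frames)],
        axis=1,
    )
    # Window each frame, then take the FFT along the sample axis.
    return np.fft.rfft(frames * window, axis=-1)
```

The resulting array provides the values X(t, k) per channel that the averaging, filtering, and separation steps operate on.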

FIG. 6 is a flowchart illustrating a method by which a face-to-face recording system according to an embodiment of the present invention records a sound signal, separates only the voice signals, and stores them.

The microphone 110 is disposed in a preset place and records the sound signals generated there (S610).

The face-to-face recording device 120 receives the sound signals recorded on each channel of the microphone and converts each of them into the frequency domain (S620).

The face-to-face recording device 120 learns noise signals and voice signals whose number does not exceed the number of channels (S630).

The face-to-face recording device 120 receives the average of the sound signals converted into the frequency domain and infers a noise-canceling frequency filter from the learned result (S640).

The face-to-face recording device 120 filters out the noise signal within each sound signal converted into the frequency domain, based on the inferred filter (S650).

The face-to-face recording device 120 separates each voice signal from the noise-removed sound signals (S660).

The face-to-face recording device 120 tags the position or direction of the source of each separated voice signal (S670).

The sound source storage device 130 receives and stores the voice signals tagged by the face-to-face recording device 120 (S680).

Although FIGS. 5 and 6 describe the processes as being executed sequentially, this merely illustrates the technical idea of an embodiment of the present invention. In other words, a person of ordinary skill in the art to which an embodiment of the present invention pertains could, without departing from the essential characteristics of the embodiment, change the order described in each drawing or execute one or more of the processes in parallel. FIGS. 5 and 6 are therefore not limited to a time-series order.

Meanwhile, the processes shown in FIGS. 5 and 6 can be implemented as computer-readable code on a computer-readable recording medium. A computer-readable recording medium includes every kind of recording device in which data readable by a computer system is stored. That is, computer-readable recording media include storage media such as magnetic storage media (for example, ROM, floppy disks, and hard disks) and optically readable media (for example, CD-ROMs and DVDs). A computer-readable recording medium may also be distributed across computer systems connected through a network, so that the computer-readable code is stored and executed in a distributed manner.

The above description merely illustrates the technical idea of the present embodiment, and a person of ordinary skill in the art to which the present embodiment pertains can make various modifications and variations without departing from its essential characteristics. The present embodiments are therefore intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be construed according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present embodiment.

100: face-to-face recording system
110: microphone
120: face-to-face recording device
130: sound source storage device
210: domain conversion unit
220: filter inference unit
230: filtering unit
240: voice separation unit
250: tagging unit
310: voice database
315: noise database
320: multi-voice generation unit
330: deep learning unit
340: inference unit

Claims

1. A face-to-face recording system comprising:
a microphone disposed in a predetermined place, having as many channels as the number of people present in the place, and recording, on each channel, a sound signal generated in the place;
a face-to-face recording device that receives the sound signal recorded by the microphone, filters out only the noise signal within the sound signal, and separates each voice signal from the sound signal from which the noise signal has been filtered; and
a sound source storage device that receives and stores the voice signals separated by the face-to-face recording device,
wherein the face-to-face recording device comprises:
a domain conversion unit that receives the sound signal recorded by the microphone and converts it into the frequency domain;
a noise database that stores a plurality of types of noise signals;
a voice database that stores a plurality of types of voice signals;
a multi-voice generation unit that generates, from the voice signals stored in the voice database, a multi-voice signal to be learned;
a deep learning unit that receives noise signals from the noise database and voice signals from the voice database or the multi-voice generation unit and performs learning for generating a noise-canceling frequency filter; and
an inference unit that, based on the learning result of the deep learning unit, infers a noise-canceling frequency filter by receiving a single value obtained by averaging the values that the domain conversion unit converted from the sound signals input on the respective channels.

2. The face-to-face recording system of claim 1, wherein the microphone has one or more channels.

3. The face-to-face recording system of claim 1, wherein the face-to-face recording device converts the received sound signal into the frequency domain.

4. The face-to-face recording system of claim 3, wherein the face-to-face recording device includes databases that store the noise signals and the voice signals, respectively, and learns the noise signals and the voice signals in each database.

5. The face-to-face recording system of claim 4, wherein the face-to-face recording device infers, based on the learning result and the frequency-converted sound signal, a noise-canceling frequency filter for filtering the noise signal.

6. (Deleted)

7. A face-to-face recording device comprising:
a domain conversion unit that receives a sound signal from a preset place and converts the sound signal into the frequency domain;
a filter inference unit that learns voice signals and noise signals, receives the signal converted by the domain conversion unit, and infers a noise-canceling frequency filter from the learned result;
a filtering unit that filters out the noise signal within the signal converted by the domain conversion unit, using the inferred filter;
a voice separation unit that separates each voice signal from the sound signal from which the noise signal has been removed by the filtering unit; and
a tagging unit that tags the position or direction of the source of each separated voice signal,
wherein the filter inference unit comprises:
a noise database that stores a plurality of types of noise signals;
a voice database that stores a plurality of types of voice signals;
a multi-voice generation unit that generates, from the voice signals stored in the voice database, a multi-voice signal to be learned;
a deep learning unit that receives noise signals from the noise database and voice signals from the voice database or the multi-voice generation unit and performs learning for generating a noise-canceling frequency filter; and
an inference unit that, based on the learning result of the deep learning unit, infers a noise-canceling frequency filter by receiving a single value obtained by averaging the values that the domain conversion unit converted from the sound signals input on the respective channels.

8. The face-to-face recording device of claim 7, wherein the domain conversion unit converts the sound signal into the frequency domain by performing a Fast Fourier Transform (FFT).

9. The face-to-face recording device of claim 7, wherein the sound signal received by the domain conversion unit is a sound signal recorded on one or more channels.

10. (Deleted)

11. The face-to-face recording device of claim 9, wherein, in learning the voice signals, the filter inference unit learns voice signals in which as many voices are present as the number of channels on which the sound signal is recorded.

12. The face-to-face recording device of claim 7, wherein the voice separation unit uses blind source separation.

13. (Deleted)