KR102343811B1

KR102343811B1 - Method for detecting voice

Info

Publication number: KR102343811B1
Application number: KR1020200025491A
Authority: KR
Inventors: 신종원; 황수중; 진유광
Original assignee: 광주과학기술원
Priority date: 2020-02-28
Filing date: 2020-02-28
Publication date: 2021-12-28
Also published as: KR20210110054A

Abstract

A voice detection method is disclosed. A voice detection method according to an embodiment of the present invention includes: receiving an audio signal through a first microphone and a second microphone; selecting, based on a preset condition, reliable frequency bins from among a plurality of frequency bins corresponding to the audio signal; and detecting whether voice is present in the audio signal using the selected reliable frequency bins.

Description

Voice detection method {METHOD FOR DETECTING VOICE}

The present invention relates to a voice detection method that selects reliable frequencies based on a preset condition and uses the reliable frequencies to detect whether voice is present in an audio signal.

Voice activity detection (VAD) is the detection of the presence or absence of a human voice.

Voice is a signal that varies considerably over time, so it is common to divide an audio signal into frames of 20 ms to 30 ms and to detect voice frame by frame. In general, VAD aims to determine whether a human voice, or the voice of a target speaker, is present in a given frame of the audio signal.

VAD is widely used as a pre-processing step for speech enhancement (noise suppression), noise estimation, pitch extraction, variable-rate voice communication codecs, speech recognition, and the like.

Conventional VAD techniques transform the audio signal into the frequency domain, compute a test statistic by averaging the values corresponding to all frequency bins, and compare this test statistic with a threshold to decide whether voice is present.

However, the components of speech are sparsely distributed across frequency bins. A method that uses the values of all frequency bins uses the values of unreliable bins as well as those of reliable bins, which lowers the accuracy of voice detection.

The present invention is intended to solve the above problem. An object of the present invention is to provide a voice detection method that selects reliable frequencies based on a preset condition and uses the reliable frequencies to detect whether voice is present in an audio signal.

A voice detection method according to an embodiment of the present invention includes: receiving an audio signal through a first microphone and a second microphone; selecting, based on a preset condition, reliable frequency bins from among a plurality of frequency bins corresponding to the audio signal; and detecting whether voice is present in the audio signal using the selected reliable frequency bins.

In this case, the selecting of the reliable frequency bins may determine a frequency bin that satisfies the preset condition to be a reliable frequency bin, and the preset condition may be based on at least one of the magnitude of the audio signal, the inter-channel level difference of the audio signal, and the inter-channel time difference of the audio signal.

In this case, the preset condition may be that, in the frequency bin in question, the magnitude of the first audio signal received through the first microphone, which is closer to the expected position of the sound source, is greater than a preset value.

The preset condition may also be that, in the frequency bin in question, the inter-channel level difference between the first audio signal received by the first microphone and the second audio signal received by the second microphone is greater than a preset value.

The preset condition may also be that, in the frequency bin in question, the inter-channel time difference of the audio signal is within a preset range, or the frequency bin is lower than a first preset value, or the frequency bin is higher than a second preset value.

The preset condition may include two or more sub-conditions, and the selecting of the reliable frequency bins may determine a frequency bin to be a reliable frequency bin when that bin satisfies all of the two or more sub-conditions.

The detecting of whether voice is present in the audio signal using the selected reliable frequency bins may use a VAD algorithm based on at least one of NDPSD, LTIPD, and algorithms that detect voice activity from a test statistic computed from the values corresponding to the selected reliable frequency bins.

The detecting of whether voice is present in the audio signal using the selected reliable frequency bins may include combining inter-channel time difference information and inter-channel level difference information corresponding to the selected reliable frequency bins. This combining step detects whether voice is present in the audio signal using a VAD algorithm based on at least one of AND, SVM, and algorithms that detect voice activity with a classifier that takes the values corresponding to the selected reliable frequency bins as input.

In this case, the detecting of whether voice is present in the audio signal using the selected reliable frequency bins may determine that voice is not present in the audio signal when the number of selected reliable frequency bins is smaller than a preset number.
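The selection and decision rules summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `select_reliable_bins` and `detect_speech`, the use of a simple mean over the selected bins as the test statistic, and all numeric values are hypothetical. Only two points come from the text: a bin is reliable when it satisfies every sub-condition, and a frame with fewer reliable bins than a preset number is declared to contain no voice.

```python
import numpy as np

def select_reliable_bins(*masks):
    """Combine binary masks of shape (frames, bins) with a logical AND."""
    combined = np.ones_like(masks[0], dtype=bool)
    for mask in masks:
        combined &= mask.astype(bool)
    return combined

def detect_speech(values, reliable, min_bins, threshold):
    """Per-frame decision using only the reliable bins.

    values    : per-bin feature (for example an ILD), shape (frames, bins)
    reliable  : boolean mask of the same shape
    min_bins  : preset minimum number of reliable bins (assumed name)
    threshold : decision threshold on the mean of the selected values (assumed statistic)
    """
    decisions = np.zeros(values.shape[0], dtype=bool)
    for frame in range(values.shape[0]):
        selected = values[frame, reliable[frame]]
        if selected.size < min_bins:
            continue  # too few reliable bins: declare "no voice"
        decisions[frame] = selected.mean() > threshold
    return decisions

# Hypothetical masks and feature values for two frames of four bins each.
m1 = np.array([[1, 1, 1, 0], [1, 0, 0, 0]])
m2 = np.array([[1, 1, 0, 0], [1, 1, 0, 0]])
values = np.array([[10.0, 8.0, -9.0, -9.0], [10.0, 8.0, 1.0, 1.0]])

reliable = select_reliable_bins(m1, m2)
decisions = detect_speech(values, reliable, min_bins=2, threshold=6.0)
print(decisions)
```

In the second frame only one bin passes both masks, so the frame is rejected before any statistic is computed.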

A mobile terminal according to an embodiment of the present invention may include: a receiver including a first microphone and a second microphone; and a controller that receives an audio signal through the first microphone and the second microphone, selects, based on a preset condition, reliable frequency bins from among a plurality of frequency bins corresponding to the audio signal, and detects whether voice is present in the audio signal using the selected reliable frequency bins.

In this case, the first microphone may be placed closer to the mouth of a speaker who is on a call using the mobile terminal, and the second microphone may be placed farther from the speaker's mouth than the first microphone.

According to the present invention, reliable frequencies are selected by imposing conditions, and the presence or absence of voice is detected using the values corresponding to those reliable frequencies, which has the advantage of improving the accuracy of voice activity detection.

FIG. 1 is a block diagram illustrating a voice detection device.
FIG. 2 is a diagram illustrating a mobile terminal as an example of the voice detection device.
FIG. 3 is a flowchart illustrating a voice detection method.
FIG. 4 is a diagram describing the test conditions.
FIG. 5 is a diagram showing the overall test results.
FIG. 6 is a diagram showing the test results for each SNR level.
FIGS. 7 and 8 are diagrams showing experimental results from another test.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting the specification and do not in themselves have distinct meanings or roles. In describing the embodiments disclosed in this specification, detailed descriptions of related known technologies are omitted where they could obscure the gist of the embodiments. The accompanying drawings are provided only to facilitate understanding of the embodiments disclosed in this specification; they do not limit the technical idea disclosed herein, which should be understood to cover all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinal numbers, such as "first" and "second", may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is described as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present. In contrast, when a component is described as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

FIG. 1 is a block diagram illustrating a voice detection device.

An audio signal may include voice signals and other acoustic signals within the audible range.

A voice detection device may refer to a device that uses an audio signal to detect voice in that audio signal.

The voice detection device 100 according to an embodiment of the present invention may include an output unit 120, a receiver 110, a controller 130, and a memory 140.

The receiver 110 may receive an audio signal.

Specifically, the receiver 110 may include a multi-channel microphone, and the multi-channel microphone may receive an audio signal from the outside. Here, the multi-channel microphone may include two or more microphones.

The memory 140 may store data that supports the various functions of the voice detection device 100.

The output unit 120 may output whether voice is present in the audio signal. Specifically, the output unit 120 may include at least one of a display and a device that outputs the voice detection result as an electrical signal, and may display, or output as an electrical signal, whether voice is present in the audio signal as detected by the controller 130.

The controller 130 may control the overall operation of the voice detection device 100.

In addition, when an audio signal generated by a sound source is received through the multi-channel microphone, the controller 130 may use the received audio signal to detect whether voice is present in it.

FIG. 2 is a diagram illustrating a mobile terminal as an example of the voice detection device.

The voice detection device 100 may be a mobile terminal such as a mobile phone or a smartphone. Accordingly, in this specification the terms voice detection device 100 and mobile terminal 100 are used interchangeably.

The mobile terminal 100 may include a multi-channel microphone. This specification assumes that the mobile terminal 100 includes two microphones, namely a first microphone 210 and a second microphone 220.

The two microphones may be placed on different sides of the mobile terminal 100. For example, the first microphone 210 may be placed on the bottom side of the mobile terminal 100, and the second microphone 220 may be placed on the top side of the mobile terminal 100.

The first microphone 210 may be placed closer to the expected position of the sound source.

For example, when the speaker uses the mobile terminal 100 in the usual way, the first microphone 210 may be positioned closer to the speaker's mouth. Specifically, when the speaker makes a phone call using the mobile terminal 100, the first microphone 210 may be located closer to the mouth of the speaker on the call than the second microphone 220; conversely, the second microphone 220 may be located farther from the speaker's mouth than the first microphone 210.

The mobile terminal may also include a wireless communication unit and an output unit.

The wireless communication unit may include one or more modules that enable wireless communication between the mobile terminal 100 and a wireless communication system, between the mobile terminal 100 and another mobile terminal 100, or between the mobile terminal 100 and an external server. The wireless communication unit may also include one or more modules that connect the mobile terminal 100 to one or more networks.

The wireless communication unit may include a mobile communication module. The mobile communication module transmits and receives radio signals to and from at least one of a base station, an external terminal, and a server over a mobile communication network built according to technical standards or communication schemes for mobile communication (for example, GSM (Global System for Mobile communication), CDMA (Code Division Multiple Access), CDMA2000 (Code Division Multiple Access 2000), EV-DO (Evolution-Data Optimized or Evolution-Data Only), WCDMA (Wideband CDMA), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), LTE (Long Term Evolution), LTE-A (Long Term Evolution-Advanced), and the like).

The radio signals may include voice call signals, video call signals, or various types of data associated with the transmission and reception of text/multimedia messages.

The mobile terminal may also include a speaker, and the speaker may output call audio in call mode.

FIG. 3 is a flowchart illustrating a voice detection method.

The controller 130 of the voice detection device 100 may receive an audio signal through the first microphone and the second microphone (S310).

The audio signal may contain noise. In addition, the audio signal may or may not contain a human voice in any given time interval.

Accordingly, the controller 130 may divide the audio signal into time intervals (for example, 20 to 30 ms) to generate a plurality of frames, and may detect frame by frame whether voice is present in the audio signal.

In this case, the controller 130 may apply the short-time Fourier transform (STFT) to each of the plurality of frames to obtain the complex spectrum of the audio signal, and may obtain the STFT coefficients for the plurality of frequency bins present in each of the plurality of frames.

The following description takes the l-th frame among the plurality of frames as an example.

The controller 130 may obtain the STFT coefficients for the plurality of frequency bins for both the first audio signal received through the first microphone and the second audio signal received through the second microphone.

In this case, for the k-th frequency bin of the l-th frame, the STFT coefficient of the first audio signal received by the first microphone can be written as $Y_1(l, k)$.

Likewise, for the k-th frequency bin of the l-th frame, the STFT coefficient of the second audio signal received by the second microphone can be written as $Y_2(l, k)$.
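The framing and transform step described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function name `stft_frames`, the Hann window, the 10 ms hop, the 16 kHz sample rate, and the random test signals are all hypothetical; the text only specifies frames of roughly 20 to 30 ms and an STFT applied to each microphone signal.

```python
import numpy as np

def stft_frames(signal, sample_rate, frame_ms=20.0, hop_ms=10.0):
    """Return the complex spectrum of each frame, shape (frames, bins)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)  # assumed analysis window
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    # rfft keeps only the non-negative frequency bins of a real signal.
    return np.fft.rfft(frames, axis=1)

# Hypothetical two-microphone recording: 1 s of noise per channel at 16 kHz.
rng = np.random.default_rng(0)
sample_rate = 16000
mic1 = rng.standard_normal(sample_rate)
mic2 = rng.standard_normal(sample_rate)

Y1 = stft_frames(mic1, sample_rate)  # coefficients for the first microphone
Y2 = stft_frames(mic2, sample_rate)  # coefficients for the second microphone
print(Y1.shape, Y2.shape)
```

Row l and column k of each array then hold the coefficient for the k-th frequency bin of the l-th frame.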

The components of speech are sparsely distributed across frequency bins. A method that uses the values of all frequency bins uses the values of unreliable bins as well as those of reliable bins, which lowers the accuracy of voice detection.

This specification therefore proposes a method that selects frequency bins with high reliability and detects whether voice is present based on the spatial information of the selected frequency bins.

In this specification, a frequency bin selected as having high reliability is called a reliable frequency bin.

The controller 130 may select, based on a preset condition, reliable frequency bins from among the plurality of frequency bins corresponding to the audio signal (S330).

Specifically, the controller 130 may determine a frequency bin that satisfies the preset condition to be a reliable frequency bin.

Here, the preset condition may be based on at least one of the magnitude of the audio signal, the inter-channel level difference of the audio signal, and the inter-channel time difference of the audio signal.

The first sub-condition is described first. The first sub-condition may also be called the first mask.

The first sub-condition (first mask) may be written as m1(l, k), where l denotes the l-th frame and k denotes the k-th frequency bin. m1(l, k)=1 may mean that, under the first sub-condition, the k-th frequency bin of the l-th frame is a reliable frequency bin. m1(l, k)=0 may mean that, under the first sub-condition, the k-th frequency bin of the l-th frame is not a reliable frequency bin.

The first sub-condition is based on the magnitude of the audio signal, specifically the magnitude of the first audio signal received through the first microphone, which is closer to the expected position of the sound source (for example, the speaker's mouth).

Specifically, the first sub-condition may be that, in the frequency bin in question, the magnitude of the first audio signal received through the first microphone, which is closer to the expected position of the sound source (for example, the speaker's mouth), is greater than a preset value.

This can be expressed by the following formula.

$$m_1(l, k) = \begin{cases} 1, & |Y_1(l, k)| > \theta_1 \\ 0, & \text{otherwise} \end{cases}$$

where $m_1(l, k)$ is the first mask, $Y_1(l, k)$ is the STFT coefficient for the k-th frequency bin of the l-th frame of the first audio signal received by the first microphone, $|\cdot|$ denotes the magnitude, and $\theta_1$ is a preset value.

That is, when the speaker talks, the magnitude of the audio signal received at the first microphone near the speaker's mouth can become large regardless of whether ambient noise is present. The first sub-condition (first mask) may therefore be determined based on the power of the first audio signal received through the first microphone near the speaker's mouth.

When the magnitude of the first audio signal received through the first microphone, which is closer to the speaker's mouth, is greater than the preset value, the controller 130 may determine the frequency bin in question to be a reliable frequency bin. In this way, the k-th frequency bin of the l-th frame is determined to be either a reliable or an unreliable frequency bin.
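The first mask can be sketched as a per-bin magnitude threshold on the first microphone's spectrum. This is an illustration only: the function name `magnitude_mask`, the variable names `Y1` and `theta1`, and the numeric values are assumptions chosen for the example, and the text does not state how the preset value is chosen.

```python
import numpy as np

def magnitude_mask(Y1, theta1):
    """Return 1 where |Y1(l, k)| exceeds the preset value theta1, else 0."""
    return (np.abs(Y1) > theta1).astype(np.uint8)

# One hypothetical frame with three frequency bins.
# Magnitudes are 0.1, 5.0 and 2.0 respectively.
Y1 = np.array([[0.1 + 0.0j, 3.0 + 4.0j, 0.0 - 2.0j]])

m1 = magnitude_mask(Y1, theta1=1.5)
print(m1)
```

Bins whose magnitude stays below the threshold, such as the first one here, are treated as unreliable and left out of the later decision.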

The second sub-condition is described next. The second sub-condition may also be called the second mask.

The second sub-condition (second mask) may be written as m2(l, k). m2(l, k)=1 may mean that, under the second sub-condition, the k-th frequency bin of the l-th frame is a reliable frequency bin. m2(l, k)=0 may mean that, under the second sub-condition, the k-th frequency bin of the l-th frame is not a reliable frequency bin.

The second sub-condition is based on the inter-channel level difference of the audio signal.

The inter-channel level difference (ILD) may refer to the level difference between the audio signals acquired by the two microphones. The ILD for the k-th frequency bin of the l-th frame, $\mathrm{ILD}(l, k)$, is expressed in terms of $Y_1(l, k)$ and $Y_2(l, k)$ [equation image not reproduced]. Here, $Y_1(l, k)$ may denote the STFT coefficient of the first audio signal received by the first microphone in the k-th frequency bin of the l-th frame, and $Y_2(l, k)$ may denote the STFT coefficient of the second audio signal received by the second microphone in the k-th frequency bin of the l-th frame.

The second sub-condition may be that, at the frequency in question, the inter-channel level difference between the first audio signal received by the first microphone and the second audio signal received by the second microphone is greater than a preset value. This can be expressed as

$$m_2(l, k) = \begin{cases} 1, & \mathrm{ILD}(l, k) > \theta_2 \\ 0, & \text{otherwise} \end{cases}$$

where $m_2(l, k)$ is the second mask, $\mathrm{ILD}(l, k)$ is the inter-channel level difference in the k-th frequency bin of the l-th frame, and $\theta_2$ is a preset value.

Assume that a speaker is on a phone call using the mobile terminal while another person is talking from behind. Assume further that the first microphone, which is closer to the speaker's mouth, picks up mainly the speaker's voice, while the second microphone picks up mainly the other person's voice. In this case, the magnitude of the first audio signal received through the first microphone and the magnitude of the second audio signal received through the second microphone may not differ much.

According to the prior art, the presence of voice is detected by averaging the signal magnitudes over all frequency bins. The pitch of the voice differs from person to person, and in the frequency domain a voice appears as harmonics of its fundamental (pitch) frequency. If the inter-channel level differences in the frequency bins belonging to the speaker's voice (sign: +) are averaged together with those in the frequency bins belonging to the other person's voice (sign: -), the average comes out low, and the presence of the speaker's voice may not be detected properly.
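A small numeric example makes the cancellation concrete. The values below are invented for illustration, and expressing the level difference in decibels is an assumption; the point is only that positive and negative per-bin differences offset each other in a full-band average.

```python
import numpy as np

# Hypothetical per-bin level differences for one frame, in dB.
# Positive values: bins dominated by the near speaker (first microphone louder).
# Negative values: bins dominated by the other talker (second microphone louder).
ild_db = np.array([12.0, 12.0, -12.0, -12.0, 12.0, -12.0])

# Conventional statistic: average over every frequency bin.
full_band_mean = ild_db.mean()

# Average restricted to the bins that belong to the near speaker.
speaker_bins_mean = ild_db[ild_db > 0].mean()

print(full_band_mean, speaker_bins_mean)
```

The full-band average is zero even though half of the bins carry a clear 12 dB indication of the near speaker, which is why a threshold on that average would miss the speaker's voice.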

따라서 본 발명에서는, 음원의 예상 위치(예를 들어 화자의 입)과 가까운 제1 마이크에서 수신되는 제1 오디오 신호의 크기가 제2 마이크에서 수신되는 제2 오디오 신호의 크기보다 충분히(기 설정된 값(

))보다 클 때, 제어부(130)는 해당하는 주파수 빈을 신뢰 주파수 빈으로 결정할 수 있다. 이에 따라 l번째 프레임의 k번째 주파수 빈은 신뢰 주파수 또는 비 신뢰 주파수로 결정될 수 있다.Accordingly, in the present invention, the level of the first audio signal received from the first microphone close to the expected position of the sound source (eg, the speaker's mouth) is sufficiently greater than the level of the second audio signal received from the second microphone (preset value). (

)), the controller 130 may determine a corresponding frequency bin as a reliable frequency bin. Accordingly, the k-th frequency bin of the l-th frame may be determined as a reliable frequency or an unreliable frequency.

이에 따라 제2 마스크는 제2 마이크에 가까운 노이즈 소스에 대하여, 음성 검출의 강인성(robustness)을 향상시킬 수 있다.Accordingly, the second mask may improve robustness of voice detection with respect to a noise source close to the second microphone.
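The second mask described above can be illustrated with a minimal sketch. The source does not give the exact form of the inter-channel level difference, so the dB ratio of STFT magnitudes used here, the function name `second_mask`, and the 3 dB threshold are all assumptions made for illustration only.

```python
import numpy as np

def second_mask(X1, X2, ild_threshold_db=3.0, eps=1e-12):
    """Return m2(l, k): 1 where the inter-channel level difference exceeds the threshold.

    X1, X2: complex STFT arrays of shape (frames, bins) for microphone 1 and 2.
    The dB-ratio definition of the level difference is an assumption.
    """
    ild_db = 20.0 * np.log10((np.abs(X1) + eps) / (np.abs(X2) + eps))
    return (ild_db > ild_threshold_db).astype(np.uint8)

# One frame, three bins: bin 0 is dominated by mic 1, bin 1 is balanced,
# bin 2 is dominated by mic 2 (for example, a talker behind the phone).
X1 = np.array([[4.0 + 0j, 1.0 + 0j, 0.5 + 0j]])
X2 = np.array([[1.0 + 0j, 1.0 + 0j, 2.0 + 0j]])
m2 = second_mask(X1, X2)
```

Only the bin in which the first microphone is clearly louder is kept, which is the behavior the text attributes to the second mask.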

Next, the third detailed condition is described. The third detailed condition may also be referred to as a third mask.

The third detailed condition (third mask) may be expressed as m3(l, k). Here, m3(l, k)=1 may mean that the k-th frequency bin of the l-th frame is a reliable frequency bin under the third detailed condition, and m3(l, k)=0 may mean that it is not.

The third detailed condition is based on the inter-channel time difference of the audio signal.

Here, the inter-channel time difference (ITD) may mean the time difference between audio signals acquired by two or more microphones.

That is, the inter-channel time difference may mean the time difference between the first audio signal received at the first microphone and the second audio signal received at the second microphone when both microphones receive the same audio signal.

The third detailed condition may be that, at the corresponding frequency, the inter-channel time difference of the audio signal is within a preset range, or the corresponding frequency is less than a first preset value, or the corresponding frequency is greater than a second preset value.

Here, m3(l, k) is the third mask, and the other symbol in the equation denotes the inter-channel time difference in the k-th frequency bin of the l-th frame.

In terms of frequency bins, the third detailed condition may be that, in the corresponding frequency bin k, the inter-channel time difference of the audio signal is within the preset range, or the corresponding frequency bin k is smaller than the first preset value k1, or the corresponding frequency bin k is larger than the second preset value k2.

Here, k1 and k2 may be frequency bin indices that bound a range. For example, k1 may correspond to 100 Hz and k2 to 1000 Hz.

k1 may be set so that frequency bins in which the direction-of-arrival (DoA) estimate of the sound source is too sensitive to small errors in the phase measurement are excluded from selection by the inter-channel time difference. Accordingly, when the corresponding frequency bin k is smaller than the first preset value k1, it may be determined as a reliable frequency bin.

k2 may be set, with a certain margin, so as to avoid spatial aliasing. Accordingly, when the corresponding frequency bin k is greater than the second preset value k2, it may be determined as a reliable frequency bin.

On the other hand, when the corresponding frequency bin k is greater than the first preset value k1 and smaller than the second preset value k2, whether it is a reliable frequency bin may be determined based on the inter-channel time difference.

In Equation 3, the preset range represents time (for example, its two bounds may be 0.008 s and 0.04 s) and may be determined based on the expected location of the sound source.

For example, when the speaker makes a phone call using the mobile terminal 100, there is a range in which the speaker's mouth is expected to be located. In this case, the preset range may be set to correspond to the range in which the speaker's mouth is expected to be located.

That is, when the actual sound source lies in the same direction as the expected position of the sound source, the inter-channel time difference of the received audio signal may fall within the preset range. Accordingly, when the inter-channel time difference is within the preset range, the controller 130 may determine the corresponding frequency bin as a reliable frequency bin.

On the other hand, when the actual sound source lies in a direction different from the expected position of the sound source, the inter-channel time difference of the received audio signal may fall outside the preset range. Accordingly, when the inter-channel time difference is outside the preset range, the controller 130 may determine the corresponding frequency bin as an unreliable frequency bin.

That is, when the inter-channel time difference is outside the preset range, the spatial information corresponding to that frequency bin may not be used to detect the presence of a voice.

Meanwhile, based on a preset condition, the controller 130 may select reliable frequency bins from among the plurality of frequency bins corresponding to the audio signal.

The first, second, and third detailed conditions have been described above as examples of detailed conditions.

Any one of the detailed conditions may constitute the preset condition on its own. However, the present invention is not limited thereto, and two or more detailed conditions may together constitute the preset condition.

In this case, the two or more detailed conditions may be joined by "and". That is, the preset condition includes two or more detailed conditions, and the controller 130 may determine the corresponding frequency bin as a reliable frequency bin when it satisfies all of them.

Equation 4 shows the preset condition m(l, k), which includes the first detailed condition m1(l, k), the second detailed condition m2(l, k), and the third detailed condition m3(l, k).

In this case, the corresponding frequency bin (the k-th frequency bin of the l-th frame) may be determined as a reliable frequency bin only when the first, second, and third detailed conditions are all satisfied.

Here, m(l, k)=1 may mean that the k-th frequency bin of the l-th frame is a reliable frequency bin under the preset condition in which the plurality of detailed conditions are combined, and m(l, k)=0 may mean that it is not.
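The "and" combination of the detailed conditions can be sketched as an element-wise logical AND over binary masks. The helper name `combine_masks` and the example values are hypothetical.

```python
import numpy as np

def combine_masks(*masks):
    """Return m(l, k) = 1 only where every detailed-condition mask is 1."""
    combined = np.ones_like(masks[0], dtype=bool)
    for mask in masks:
        combined &= mask.astype(bool)
    return combined.astype(np.uint8)

# One frame, four bins; each row is one detailed condition.
m1 = np.array([[1, 1, 0, 1]])
m2 = np.array([[1, 0, 0, 1]])
m3 = np.array([[1, 1, 1, 0]])
m = combine_masks(m1, m2, m3)
```

Because the function accepts any number of masks, the same sketch covers the case where only one or two detailed conditions make up the preset condition.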

Meanwhile, the controller 130 may detect the presence of a voice in the audio signal using the selected reliable frequency bins (S350).

Specifically, the controller 130 may detect the presence of a voice in the audio signal using a voice activity detection algorithm based on at least one of NDPSD, LTIPD, AND, and SVM. The selected reliable frequency bins may be used in this detection.

First, the NDPSD-based voice detection method is described.

The NDPSD-based voice activity detection algorithm determines whether a voice is present according to the magnitude difference (power difference) of the audio signals between two microphones, and is presented in Prior Document 1 (Jeub, M.; Herglotz, C.; Nelke, C.; Beaugeant, C.; Vary, P. Noise reduction for dual-microphone mobile phones exploiting power level differences. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, 25-30 March 2012; pp. 1693-1696). This algorithm may be based on the inter-channel level difference (ILD).

In this case, the existing NDPSD-based voice activity detection method can be expressed by the following equations.

Here, the symbols in the equations denote, in order: the magnitude; the short-time Fourier transform (STFT) coefficients of the signal received from the first microphone in the k-th frequency bin of the l-th frame; the STFT coefficients of the signal received from the second microphone in the k-th frequency bin of the l-th frame; the normalized magnitude difference between the signals in the k-th frequency bin of the l-th frame; the average of the inter-signal magnitude differences over all frequency bins; and the presence or absence of a voice in the l-th frame.

That is, in the existing NDPSD-based voice detection method, the presence or absence of a voice is detected by comparing the average of the inter-signal magnitude differences over all frequency bins with a threshold.

Meanwhile, the improved NDPSD-based voice detection method proposed by the present invention can be expressed by the following equations.

In these equations, one symbol denotes the presence or absence of a voice in the l-th frame as calculated by the improved NDPSD voice activity detection algorithm, and another denotes the average of the values (normalized magnitude differences between the signals) corresponding to the selected reliable frequency bins, that is, the frequency bins for which m(l, k) is 1.

Accordingly, Equation 7 may mean that a voice is determined to be present in a frame when the average of the values corresponding to the reliable frequency bins selected in that frame is greater than the threshold.

In addition, a new condition relating the number of reliable frequency bins to a preset number may be added for determining that a voice is present in the frame.

Accordingly, the controller 130 may determine that a voice is present in the frame when the average of the values corresponding to the reliable frequency bins selected in the frame is greater than or equal to the threshold and the number of reliable frequency bins in the frame is greater than or equal to the preset number.

On the other hand, when the average of the values corresponding to the reliable frequency bins selected in the frame is greater than the threshold but the number of reliable frequency bins in the frame is smaller than the preset number, the controller 130 may determine that no voice is present in the frame.

That is, when the number of reliable frequency bins is too small, errors may occur because the sample is too small. Therefore, a voice may be determined to be present in the audio signal only when the number of reliable frequency bins is greater than the preset number.

Next, the LTIPD-based voice detection method is described.

The LTIPD-based voice activity detection algorithm determines that a voice is present when a signal arrives continuously from the target direction, and is presented in Prior Document 2 (Guo, Y.; Li, K.; Fu, Q.; Yan, Y. A two microphone based voice activity detection for distant talking speech in wide range of direction of arrival. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, 25-30 March 2012; pp. 4901-4904). This algorithm may be based on the inter-channel time difference (ITD).

That is, the LTIPD-based voice activity detection algorithm may measure how much energy is concentrated in frequency bins that share the same direction-of-arrival (DoA) estimate, that is, whose DoA estimates fall in the same sector.

In this case, the range of the target direction of arrival (DoA) of the sound source may be divided into a number of sectors of equal width.

The LTIPD-based voice detection method can be expressed by the following equations.

Here, i is the sector index; Ci(l, k) is the number of frames, among the most recent L frames, in which the DoA estimate for the k-th frequency bin indicates the i-th sector; ki is the threshold of DoA concentration. The remaining symbols denote, in order: the sum of the powers corresponding to the frequencies whose inter-channel time difference is within the target range; the threshold of energy concentration; and the presence or absence of a voice in the l-th frame.

That is, the existing LTIPD-based voice detection method sums the signal magnitudes of all frequency bins that point to the target direction of arrival, and then detects the presence or absence of a voice by comparing the sum of the sector with the largest sum against the threshold of energy concentration.

Meanwhile, the improved LTIPD-based voice detection method proposed by the present invention can be expressed by the following equations.

That is, in the improved LTIPD-based voice detection method, the sum of powers may be determined by summing the magnitudes of only those frequency bins that satisfy the preset condition (the reliable frequency bins). The presence or absence of a voice in the l-th frame may then be detected by comparing this sum of powers with the threshold of energy concentration.

Meanwhile, in addition to NDPSD and LTIPD, any algorithm that detects voice activity based on detection statistics obtained from the values corresponding to the selected reliable frequency bins may be used.

Hereinafter, the voice detection methods based on AND and SVM are described. AND and SVM may detect the presence of a voice in the audio signal by combining inter-channel time difference information with inter-channel level difference information. The improved AND and the improved SVM proposed in the present invention may detect the presence of a voice in the audio signal by combining the inter-channel time differences and the inter-channel level differences corresponding to the selected reliable frequency bins.

First, the AND-based voice detection method is described.

Because the inter-channel time difference (ITD) and the inter-channel level difference (ILD) carry different characteristics of the sound source position, the AND-based voice activity detection algorithm combines the result of ITD-based voice activity detection with the result of ILD-based voice activity detection. It is presented in Prior Document 3 (Park, J.; Jin, Y.G.; Hwang, S.; Shin, J.W. Dual Microphone Voice Activity Detection Exploiting Interchannel Time and Level Difference. IEEE Signal Process. Lett. 2016, 23, 1335-1339).

The ITD-based voice activity detection and the ILD-based voice activity detection may be combined by "and". The AND-based voice activity detection algorithm can therefore be expressed by the following equation.

(The two operands of the equation are the result of voice activity detection based on the inter-channel time difference (ITD) and the result of voice activity detection based on the inter-channel level difference (ILD).)

That is, the AND-based voice activity detection algorithm finally determines that a voice is present when the ITD-based voice activity detection yields "voice present" and the ILD-based voice activity detection also yields "voice present".

Meanwhile, the improved AND-based voice detection method proposed by the present invention may be an "and" operation on the voice presence decision calculated by the improved NDPSD voice activity detection algorithm and the voice presence decision calculated by the improved LTIPD voice activity detection algorithm.

That is, the controller 130 may finally determine that a voice is present when a voice is determined to be present by the improved NDPSD voice activity detection algorithm and a voice is also determined to be present by the improved LTIPD voice activity detection algorithm.
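As a small illustration, the improved AND decision reduces to a per-frame logical AND of two decision sequences. The function name and the example decisions are hypothetical, and the two inputs are assumed to come from whatever improved NDPSD and improved LTIPD detectors are in use.

```python
def improved_and_vad(ndpsd_decisions, ltipd_decisions):
    """Declare speech in a frame only if both detectors declare speech."""
    return [bool(a) and bool(b) for a, b in zip(ndpsd_decisions, ltipd_decisions)]

# Four frames covering every combination of the two detector outputs.
decisions = improved_and_vad(
    [True, True, False, False],
    [True, False, True, False],
)
```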

Next, the SVM-based voice detection method is described.

In the SVM-based voice activity detection algorithm, a classifier is trained using the inter-channel time difference (ITD) and the inter-channel level difference (ILD) as features, and the trained classifier then classifies the presence or absence of a voice using the ITD and ILD as features. It is presented in Prior Document 3 (Park, J.; Jin, Y.G.; Hwang, S.; Shin, J.W. Dual Microphone Voice Activity Detection Exploiting Interchannel Time and Level Difference. IEEE Signal Process. Lett. 2016, 23, 1335-1339).

The existing SVM-based voice activity detection algorithm trained the classifier by providing the ITD and ILD for all frequency bins. Likewise, it provided the ITD and ILD for all frequency bins to the classifier so that the classifier output the presence or absence of a voice.

In contrast, the improved SVM-based voice detection method proposed by the present invention may train the classifier by providing the ITD and ILD corresponding to the reliable frequency bins.

The improved SVM-based voice detection method may also provide the ITD and ILD corresponding to the reliable frequency bins to the classifier so that the classifier outputs the presence or absence of a voice.

Specifically, this may be implemented by providing the classifier with the information corresponding to frequency bins that satisfy the preset condition as it is, and with the information corresponding to frequency bins that do not satisfy the preset condition after setting it to 0.
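The zeroing step can be sketched as follows. The layout of the feature vector (ITD values followed by ILD values for each frame) and the name `masked_features` are assumptions; the source only states that information from unreliable bins is set to 0 before it reaches the classifier.

```python
import numpy as np

def masked_features(itd, ild, mask):
    """Build per-frame feature vectors with unreliable bins set to 0.

    itd, ild, mask: arrays of shape (frames, bins). The concatenated
    [ITD | ILD] layout is an assumed convention.
    """
    reliable = mask.astype(bool)
    itd_masked = np.where(reliable, itd, 0.0)
    ild_masked = np.where(reliable, ild, 0.0)
    return np.concatenate([itd_masked, ild_masked], axis=1)

itd = np.array([[0.1, 0.2, 0.3]])
ild = np.array([[6.0, -2.0, 4.0]])
mask = np.array([[1, 0, 1]])
features = masked_features(itd, ild, mask)
```

The same function would be applied both when building the training set and at inference time, so that the classifier always sees features of a fixed length regardless of how many bins are reliable.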

Meanwhile, in addition to AND and SVM, any algorithm that detects voice activity based on detection statistics obtained from the values corresponding to the selected reliable frequency bins may be used.

FIG. 4 is a diagram illustrating the test conditions.

The room measures 3119 × 3232 × 2080 mm³, and the reverberation time (RT60) is 100 to 200 ms.

The speaker stood in the center of the room holding a mobile phone. The distance between the two microphones of the mobile phone is about 140 mm.

The omnidirectional noise was generated from the white, babble, and car noises of the NOISEX-92 database, and four loudspeakers (410, 420, 430, 440) were installed facing the corners of the room to produce complex reflections.

The directional noise consisted of utterances spoken by four male and female speakers selected from the TIMIT database, and was played toward the speaker by one of four loudspeakers located 1000 mm from the speaker in the directions of 45, 135, 225, and 315 degrees.

The speaker's voice was mixed with either the directional noise from one of the four loudspeakers or the omnidirectional noise. The SNR levels at the first microphone were -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB.

For the third mask, the frequency index range (k1, k2) was set to (4, 31), corresponding to (125 Hz, 968.75 Hz), in consideration of the 140 mm distance between the first microphone and the second microphone (see Equation 3).

Assuming that 0 degrees is the endfire direction toward the first microphone, the range of the target DoA was set to 10 to 70 degrees.

FIG. 5 shows the overall test results, and FIG. 6 shows the test results for each SNR level.

Here, the dotted lines represent the conventional algorithms (PLDR, MNDPSD, LTIPD, SVM, AND), and the solid lines represent the algorithms improved by the present invention (MNDPSD^FS, LTIPD^FS, SVM^FS, AND^FS).

Referring to FIGS. 5 and 6, it can be seen that the algorithms improved by the present invention exhibit higher performance.

FIGS. 7 and 8 show the experimental results of another test.

This experiment was conducted in a larger lecture room, and the reverberation time (RT60) was 400 to 500 ms.

In FIG. 8, "clean waveform" denotes the waveform of the audio signal without noise, "noisy waveform" denotes the waveform of the audio signal mixed with noise, and "Ground truth of voice actively" denotes the actual voice activity.

Referring to FIGS. 7 and 8, it can be seen that the improved algorithms (MNDPSD^FS, LTIPD^FS, SVM^FS, AND^FS) exhibit higher performance than the conventional algorithms (PLDR, MNDPSD, LTIPD, SVM, AND).

As described above, according to the present invention, reliable frequencies are selected by imposing conditions, and the presence or absence of a voice is detected using the values corresponding to the reliable frequencies, so that the accuracy of voice activity detection can be improved.

Meanwhile, the control unit is a component that generally controls the device, and the term may be used interchangeably with terms such as central processing unit, microprocessor, and processor.

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium includes all types of recording devices that store data readable by a computer system. Examples of the computer-readable medium include a hard disk drive (HDD), a solid state disk (SSD), a silicon disk drive (SDD), ROM, RAM, CD-ROM, magnetic tape, a floppy disk, and an optical data storage device. The computer may also include the control unit 180 of the terminal. Accordingly, the above detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all modifications within the equivalent scope of the present invention are included in the scope of the present invention.

110: receiving unit 120: output unit
130: control unit

Claims

receiving an audio signal through a first microphone and a second microphone;
determining, based on a preset condition, a frequency bin that satisfies the preset condition among a plurality of frequency bins corresponding to the audio signal as a reliable frequency bin and a frequency bin that does not satisfy the preset condition as an unreliable frequency bin, and selecting the determined reliable frequency bins; and
detecting whether a voice is present in the audio signal using the selected reliable frequency bins,
wherein the preset condition includes:
a first detailed condition in which, in a corresponding frequency bin, the magnitude of a first audio signal received through the first microphone, which is closer to the expected location of the sound source, is greater than a preset value;
a second detailed condition in which, in the corresponding frequency bin, an inter-channel level difference between the first audio signal received by the first microphone and a second audio signal received by the second microphone is greater than a preset value; and
a third detailed condition in which, in the corresponding frequency bin, an inter-channel time difference of the audio signal is within a preset range, the corresponding frequency bin is lower than a first preset value, or the corresponding frequency bin is higher than a second preset value, and
wherein selecting the reliable frequency bins comprises:
determining the corresponding frequency bin as the reliable frequency bin when the corresponding frequency bin satisfies all of two or more detailed conditions among the first to third detailed conditions,
voice detection method.
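
The bin-selection step of the claim above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function name, the parameter names, and the convention that the per-bin magnitude, inter-channel level difference (in dB), and inter-channel time difference (in seconds) have already been computed are all hypothetical. The `required` argument reflects one reading of the claim, in which two or more of the three detailed conditions are enforced and a bin must satisfy every enforced condition.

```python
def select_reliable_bins(mag1, ild_db, itd_s, *, mag_min, ild_min_db,
                         itd_range_s, low_bin, high_bin, required=(1, 2, 3)):
    """Return the indices of frequency bins judged reliable.

    mag1    : per-bin magnitude at the first (nearer) microphone
    ild_db  : per-bin inter-channel level difference, in dB (assumed unit)
    itd_s   : per-bin inter-channel time difference, in seconds (assumed unit)
    required: which detailed conditions (1, 2, 3) are enforced; at least two
    """
    if len(set(required)) < 2:
        raise ValueError("at least two detailed conditions must be enforced")

    itd_lo, itd_hi = itd_range_s
    reliable = []
    for k in range(len(mag1)):
        checks = {
            # Condition 1: magnitude at the nearer microphone exceeds a preset value.
            1: mag1[k] > mag_min,
            # Condition 2: inter-channel level difference exceeds a preset value.
            2: ild_db[k] > ild_min_db,
            # Condition 3: time difference within range, or the bin lies below
            # the first preset bin or above the second preset bin.
            3: (itd_lo <= itd_s[k] <= itd_hi) or k < low_bin or k > high_bin,
        }
        if all(checks[c] for c in required):
            reliable.append(k)
    return reliable
```

Passing `required=(1, 3)`, for example, enforces only the magnitude and time-difference conditions; which subset to enforce, and every threshold value, is a design choice the sketch leaves open.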

delete

The method of claim 1,
wherein detecting whether a voice is present in the audio signal using the selected reliable frequency bins includes:
detecting whether a voice is present in the audio signal using a voice activity detection algorithm based on at least one of NDPSD, LTIPD, and an algorithm that detects voice activity based on a detection statistic obtained using values corresponding to the selected reliable frequency bins,
voice detection method.

The method of claim 1,
wherein detecting whether a voice is present in the audio signal using the selected reliable frequency bins includes:
detecting whether a voice is present in the audio signal by combining inter-channel time difference information and inter-channel level difference information corresponding to the selected reliable frequency bins, and
wherein detecting whether a voice is present in the audio signal by combining the inter-channel time difference information and the inter-channel level difference information includes:
detecting whether a voice is present in the audio signal using a voice activity detection algorithm based on at least one of AND, SVM, and an algorithm that detects voice activity based on a classifier that takes values corresponding to the selected reliable frequency bins as input,
voice detection method.
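
As a rough illustration of the AND option named in the claim above, the following Python sketch makes one decision from the inter-channel time differences and one from the inter-channel level differences over the reliable bins, and reports a voice only when both agree. The claim does not say how each individual decision is made, so the two rules used here (a fraction of bins with an in-range time difference, and a mean level difference above a threshold), together with all names and default values, are assumptions made for the example.

```python
def vad_and_rule(reliable_bins, itd_s, ild_db, *, itd_range_s, ild_min_db,
                 min_fraction=0.5):
    """Combine a time-difference vote and a level-difference vote with logical AND."""
    if not reliable_bins:
        return False  # nothing to base a decision on

    itd_lo, itd_hi = itd_range_s
    n = len(reliable_bins)

    # Assumed rule: enough reliable bins have a time difference in the target range.
    in_range = sum(1 for k in reliable_bins if itd_lo <= itd_s[k] <= itd_hi)
    itd_vote = in_range / n >= min_fraction

    # Assumed rule: the mean level difference over reliable bins exceeds a threshold.
    ild_vote = sum(ild_db[k] for k in reliable_bins) / n > ild_min_db

    # AND combination: both cues must indicate a voice.
    return itd_vote and ild_vote
```

Requiring both cues to agree trades some missed detections for fewer false alarms; an SVM or another classifier, as the claim also allows, would learn the combination instead of fixing it.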

8. The method of claim 7,
wherein detecting whether a voice is present in the audio signal using the selected reliable frequency bins includes:
determining that no voice is present in the audio signal when the number of the selected reliable frequency bins is less than a preset number,
voice detection method.
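
The minimum-count rule in the claim above amounts to a guard placed in front of whatever detector is used. A minimal Python sketch follows; `frame_has_voice`, `min_reliable_bins`, and the `detector` callback are hypothetical names introduced only for illustration.

```python
def frame_has_voice(reliable_bins, min_reliable_bins, detector):
    """Declare no voice when too few reliable bins were selected.

    detector: any callable that takes the reliable bin indices and returns a
              truthy value when it judges a voice to be present (assumed interface).
    """
    if len(reliable_bins) < min_reliable_bins:
        # Too little trustworthy evidence in this frame: report no voice.
        return False
    return bool(detector(reliable_bins))
```

The guard runs before the detector, so a frame with too few reliable bins is rejected without evaluating any detection statistic.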

a receiver including a first microphone and a second microphone; and
a control unit configured to receive an audio signal through the first microphone and the second microphone, determine, based on a preset condition, a frequency bin that satisfies the preset condition among a plurality of frequency bins corresponding to the audio signal as a reliable frequency bin and a frequency bin that does not satisfy the preset condition as an unreliable frequency bin, select the determined reliable frequency bins, and detect whether a voice is present in the audio signal using the selected reliable frequency bins,
wherein the preset condition includes:
a first detailed condition in which, in a corresponding frequency bin, the magnitude of a first audio signal received through the first microphone, which is closer to the expected location of the sound source, is greater than a preset value;
a second detailed condition in which, in the corresponding frequency bin, an inter-channel level difference between the first audio signal received by the first microphone and a second audio signal received by the second microphone is greater than a preset value; and
a third detailed condition in which, in the corresponding frequency bin, an inter-channel time difference of the audio signal is within a preset range, the corresponding frequency bin is lower than a first preset value, or the corresponding frequency bin is higher than a second preset value, and
wherein the control unit determines the corresponding frequency bin as the reliable frequency bin when the corresponding frequency bin satisfies all of two or more detailed conditions among the first to third detailed conditions,
mobile terminal.

11. The method of claim 10,
wherein the first microphone is disposed closer to the mouth of a talker who is on a call using the mobile terminal, and
the second microphone is disposed farther from the mouth of the talker than the first microphone,
mobile terminal.