KR20230018783A

KR20230018783A - Multi-channel voice activity detection system and operation method thereof

Info

Publication number: KR20230018783A
Application number: KR1020210100706A
Authority: KR
Inventors: 박형민; 박상훈
Original assignee: 서강대학교산학협력단
Priority date: 2021-07-30
Filing date: 2021-07-30
Publication date: 2023-02-07

Abstract

A multi-channel speech detection system, according to one embodiment of the present invention, may include: a noise estimation unit; a feature extraction unit; and a speech detector. The noise estimation unit may estimate noise included in a multi-channel input signal based on the multi-channel input signal and provide the estimated noise. The feature extraction unit may provide a speech presence probability corresponding to the probability that a speech signal is included for each frame and frequency based on the estimated noise. The speech detector may determine the presence or absence of a speech signal based on the speech presence probability and provide the determination result thereof. The multi-channel speech detection system, according to the present invention, can simultaneously update weight values used in a first neural network and a second neural network according to a total loss value calculated based on a first loss value determined according to a noise extraction mask and a second loss value determined according to the determination result, thereby improving the performance of speech detection in a multi-channel environment.

Description

Multi-channel voice detection system and its operating method {MULTI-CHANNEL VOICE ACTIVITY DETECTION SYSTEM AND OPERATION METHOD THEREOF}

The present invention relates to a multi-channel voice detection system and an operating method thereof.

An input signal transmitted through a plurality of channels may include a voice signal and a noise signal. Recently, research has been conducted on extracting only the voice signal component by removing the noise signal from the input signal.

Korean Registered Patent No. 10-2063492 (registration date: 2020.01.02)

The technical problem to be solved by the present invention is to provide a multi-channel voice detection system capable of improving voice detection performance in a multi-channel environment by simultaneously updating the weight values used in a first neural network and a second neural network according to a total loss value calculated based on a first loss value determined according to a noise extraction mask and a second loss value determined according to a determination result.

To solve this problem, a multi-channel voice detection system according to an embodiment of the present invention may include a noise estimation unit, a feature extraction unit, and a voice detector. The noise estimation unit may estimate noise included in a multi-channel input signal based on the multi-channel input signal and provide estimated noise. The feature extraction unit may provide, based on the estimated noise, a speech presence probability corresponding to the probability that a voice signal is included for each frame and frequency. The voice detector may determine the presence or absence of the voice signal based on the speech presence probability and provide a determination result.

In an embodiment, the noise estimation unit may include a first neural network and a noise extractor. The first neural network may provide a noise extraction mask determined according to the multi-channel input signal. The noise extractor may provide the estimated noise based on the noise extraction mask and the multi-channel input signal.

In an embodiment, the voice detector may include a second neural network. The second neural network may receive the speech presence probability and provide the determination result.

In an embodiment, the multi-channel voice detection system may further include a loss calculation unit. The loss calculation unit may calculate a first loss value determined according to the noise extraction mask and a second loss value determined according to the determination result.

In an embodiment, weight values used in the first neural network and the second neural network may be updated according to a total loss value determined based on the first loss value and the second loss value.

In an embodiment, the total loss value may be the sum of a first weighted loss value, obtained by multiplying the first loss value by a first loss weight, and a second weighted loss value, obtained by multiplying the second loss value by a second loss weight.

In an embodiment, the multi-channel voice detection system may simultaneously update the weights applied to nodes of the first neural network and the second neural network based on the total loss value.

To solve this problem, in an operating method of a multi-channel voice detection system according to an embodiment of the present invention, a noise estimation unit may estimate noise included in a multi-channel input signal based on the multi-channel input signal and provide estimated noise. A feature extraction unit may provide, based on the estimated noise, a speech presence probability corresponding to the probability that a voice signal is included for each frame and frequency. A voice detector may determine the presence or absence of the voice signal based on the speech presence probability and provide a determination result.

In an embodiment, the noise estimation unit may include a first neural network and a noise extractor, and the voice detector may include a second neural network. The first neural network may provide a noise extraction mask determined according to the multi-channel input signal. The noise extractor may provide the estimated noise based on the noise extraction mask and the multi-channel input signal. The second neural network may receive the speech presence probability and provide the determination result.

In an embodiment, the multi-channel voice detection system may calculate a first loss value determined according to the noise extraction mask and a second loss value determined according to the determination result, and may update the weight values used in the first neural network and the second neural network according to a total loss value determined based on the first loss value and the second loss value.

To solve this problem, a multi-channel voice detection system according to an embodiment of the present invention may include a noise estimation unit, a feature extraction unit, a voice detector, and a loss calculation unit. The noise estimation unit may estimate noise included in a multi-channel input signal based on the multi-channel input signal and provide estimated noise. The feature extraction unit may provide, based on the estimated noise, a speech presence probability corresponding to the probability that a voice signal is included for each frame and frequency. The voice detector may determine the presence or absence of the voice signal based on the speech presence probability and provide a determination result. The loss calculation unit may calculate a first loss value determined according to a noise extraction mask and a second loss value determined according to the determination result.

To solve this problem, in an operating method of a multi-channel voice detection system according to an embodiment of the present invention, a noise estimation unit may estimate noise included in a multi-channel input signal based on the multi-channel input signal and provide estimated noise. A feature extraction unit may provide, based on the estimated noise, a speech presence probability corresponding to the probability that a voice signal is included for each frame and frequency. A voice detector may determine the presence or absence of the voice signal based on the speech presence probability and provide a determination result. A loss calculation unit may calculate a first loss value determined according to a noise extraction mask and a second loss value determined according to the determination result.

In an embodiment, the weight values used in a first neural network included in the noise estimation unit and a second neural network included in the voice detector may be updated according to a total loss value determined based on the first loss value and the second loss value.

In addition to the technical problem mentioned above, other features and advantages of the present invention are described below, or will be clearly understood from that description by those of ordinary skill in the art to which the present invention pertains.

The present invention as described above has the following effects.

The multi-channel voice detection system according to the present invention can improve voice detection performance in a multi-channel environment by simultaneously updating the weight values used in the first neural network and the second neural network according to a total loss value calculated based on a first loss value determined according to a noise extraction mask and a second loss value determined according to a determination result.

In addition, other features and advantages of the present invention may be newly identified through the embodiments of the present invention.

FIG. 1 is a diagram illustrating a multi-channel voice detection system according to embodiments of the present invention.
FIG. 2 is a diagram illustrating a noise estimation unit included in the multi-channel voice detection system of FIG. 1.
FIG. 3 is a diagram illustrating a voice detector included in the multi-channel voice detection system of FIG. 1.
FIG. 4 is a diagram for explaining a speech presence probability provided from a feature extraction unit included in the multi-channel voice detection system of FIG. 1.
FIG. 5 is a diagram for explaining a determination result provided from a voice detector included in the multi-channel voice detection system of FIG. 1.
FIG. 6 is a diagram for explaining a first loss value and a second loss value used in the multi-channel voice detection system of FIG. 1.
FIG. 7 is a diagram illustrating a loss calculation unit included in the multi-channel voice detection system of FIG. 1.
FIG. 8 is a diagram for explaining a first loss weight and a second loss weight used in the multi-channel voice detection system of FIG. 1.
FIGS. 9 and 10 are flowcharts illustrating an operating method of a multi-channel voice detection system according to embodiments of the present invention.
FIG. 11 is a diagram illustrating a multi-channel voice detection system according to embodiments of the present invention.

In this specification, when reference numerals are assigned to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings.

Meanwhile, the terms used in this specification should be understood as follows.

A singular expression should be understood to include the plural unless the context clearly indicates otherwise, and the scope of rights should not be limited by these terms.

Terms such as "include" or "have" should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, preferred embodiments of the present invention, designed to solve the above problem, are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a multi-channel voice detection system according to embodiments of the present invention, FIG. 2 is a diagram illustrating a noise estimation unit included in the multi-channel voice detection system of FIG. 1, FIG. 3 is a diagram illustrating a voice detector included in the multi-channel voice detection system of FIG. 1, FIG. 4 is a diagram for explaining a speech presence probability provided from a feature extraction unit included in the multi-channel voice detection system of FIG. 1, and FIG. 5 is a diagram for explaining a determination result provided from the voice detector included in the multi-channel voice detection system of FIG. 1.

Referring to FIGS. 1 to 5, a multi-channel voice detection system 10 according to an embodiment of the present invention may include a noise estimation unit 100, a feature extraction unit 200, and a voice detector 300. The noise estimation unit 100 may estimate noise included in a multi-channel input signal (MCI) based on the multi-channel input signal (MCI) and provide estimated noise (ENO). For example, the multi-channel input signal (MCI) may be a signal received through a plurality of microphones, and a voice signal and a noise signal may be mixed in the signal received through the plurality of microphones.

In an embodiment, the noise estimation unit 100 may include a first neural network 110 and a noise extractor 120. The first neural network 110 may provide a noise extraction mask (NEM) determined according to the multi-channel input signal (MCI). For example, the first neural network 110 may be a Speech Enhancement Network (SE Network): the multi-channel input signal (MCI) may be input to the SE Network, and the SE Network may provide the noise extraction mask (NEM) based on the multi-channel input signal (MCI).

In addition, the noise extractor 120 may provide the estimated noise (ENO) based on the noise extraction mask (NEM) and the multi-channel input signal (MCI). For example, the noise extractor 120 may provide the estimated noise (ENO) by multiplying the noise extraction mask (NEM) provided from the first neural network 110 by the multi-channel input signal (MCI).
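The multiplication described above can be sketched as follows. This is only an illustration under stated assumptions: the mask and the input are taken to be real-valued magnitude spectrograms stored as nested lists indexed [frame][frequency], and a single mask is shared by all channels. The source does not specify the representation, and `extract_noise` is a hypothetical name.

```python
def extract_noise(mask, channels):
    """Multiply a noise extraction mask element-wise with each channel.

    mask:     list of frames, each a list of per-frequency values in [0, 1]
    channels: list of channels, each shaped like `mask` (assumed magnitudes)
    Returns the estimated noise, one masked spectrogram per channel.
    """
    estimated = []
    for channel in channels:
        masked = [
            [m * x for m, x in zip(mask_frame, frame)]
            for mask_frame, frame in zip(mask, channel)
        ]
        estimated.append(masked)
    return estimated


# Two channels, two frames, three frequency bins (illustrative values only).
nem = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5]]
mci = [
    [[2.0, 4.0, 6.0], [1.0, 3.0, 8.0]],
    [[4.0, 2.0, 2.0], [5.0, 1.0, 2.0]],
]
eno = extract_noise(nem, mci)
```

A mask value near 1 keeps a time-frequency bin as noise, while a value near 0 suppresses it.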

The feature extraction unit 200 may provide, based on the estimated noise (ENO), a speech presence probability (SPP) corresponding to the probability that a voice signal is included for each frame and frequency. For example, the plurality of frames may include a first frame (FR1) to an Nth frame (N is a natural number), and the plurality of frequencies may include a first frequency (F1) to an Mth frequency (M is a natural number). In the first frame (FR1), the probability that a voice signal is included may be 0.9 at the first frequency (F1), 0.05 at the second frequency (F2), and 0.8 at the third frequency (F3). In the second frame (FR2), the probability that a voice signal is included may be 0.1 at the first frequency (F1), 0.85 at the second frequency (F2), and 0.93 at the third frequency (F3). The per-frequency probabilities for the third frame (FR3) to the Nth frame can be determined in the same way. In this manner, the feature extraction unit 200 may provide, based on the estimated noise (ENO), the speech presence probability (SPP) for each frame and frequency.
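The example values above can be arranged as a frame-by-frequency table. The nested-list layout and the helper name `probability` are assumptions made for illustration; the source only states that one probability exists per frame and frequency.

```python
# Rows are frames FR1, FR2; columns are frequencies F1, F2, F3.
spp = [
    [0.9, 0.05, 0.8],
    [0.1, 0.85, 0.93],
]


def probability(spp_table, frame, frequency):
    """Look up the SPP for 1-based frame and frequency indices."""
    return spp_table[frame - 1][frequency - 1]


p = probability(spp, 2, 3)  # FR2, F3
```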

The voice detector 300 may determine the presence or absence of a voice signal based on the speech presence probability (SPP) and provide a determination result (DR). In an embodiment, the voice detector 300 may include a second neural network 310. The second neural network 310 may receive the speech presence probability (SPP) and provide the determination result (DR). The determination result (DR) may be used to determine, frame by frame, whether a voice signal is present in the frame. For example, the reference value used to determine whether a voice signal is present in a frame may be 0.5. In this case, the value of the determination result (DR) in the first frame (FR1) may be 0.8; since 0.8 is greater than the reference value, it can be determined that a voice signal is present in that frame. The value of the determination result (DR) in the second frame (FR2) may be 0.1; since 0.1 is less than the reference value, it can be determined that no voice signal is present in that frame. In the same way, the value of the determination result (DR) in the third frame (FR3) may be 0.75; since 0.75 is greater than the reference value, it can be determined that a voice signal is present in that frame.
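The per-frame comparison against the reference value can be sketched as below. The function name is hypothetical, and treating a value exactly equal to the reference value as "no voice" is an assumption, since the source only covers the greater-than and less-than cases.

```python
def frames_with_voice(decision_results, reference=0.5):
    """Return one boolean per frame: True when DR exceeds the reference value."""
    return [value > reference for value in decision_results]


# DR values for FR1, FR2, FR3 taken from the example in the text.
dr = [0.8, 0.1, 0.75]
flags = frames_with_voice(dr)
```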

FIG. 6 is a diagram for explaining a first loss value and a second loss value used in the multi-channel voice detection system of FIG. 1, FIG. 7 is a diagram illustrating a loss calculation unit included in the multi-channel voice detection system of FIG. 1, and FIG. 8 is a diagram for explaining a first loss weight and a second loss weight used in the multi-channel voice detection system of FIG. 1.

Referring to FIGS. 1 to 8, the multi-channel voice detection system 10 according to an embodiment of the present invention may include a noise estimation unit 100, a feature extraction unit 200, a voice detector 300, and a loss calculation unit 400. The loss calculation unit 400 may calculate a first loss value (L1) determined according to the noise extraction mask (NEM) and a second loss value (L2) determined according to the determination result (DR). The loss calculation unit 400 may include a first loss calculation unit 410 and a second loss calculation unit 420. For example, the signal input to the first neural network 110 may be a pre-labeled multi-channel input signal, and the ideal output of the first neural network 110 corresponding to the labeled multi-channel input signal may be a labeling noise mask (LNM). The labeling noise mask (LNM) may be known to the user in advance. In this case, the output actually obtained by providing the labeled multi-channel input signal to the first neural network 110 as its input may be the noise extraction mask (NEM). The first loss value (L1) may then be obtained by comparing the labeling noise mask (LNM) with the noise extraction mask (NEM).

Likewise, for example, the ideal output of the second neural network 310 may be a labeling determination result (LDR). The labeling determination result (LDR) may be known to the user in advance. In this case, the second loss value (L2) may be obtained by comparing the determination result (DR), which is the output of the second neural network 310, with the labeling determination result (LDR).
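The source says only that each loss is obtained by "comparing" the network output with its label and does not name a metric. The sketch below assumes mean squared error for the mask comparison and binary cross-entropy for the per-frame decision; both choices and both function names are assumptions for illustration.

```python
import math


def mask_loss(predicted_mask, label_mask):
    """First loss (L1): mean squared error between NEM and LNM (assumed metric)."""
    diffs = [
        (p - t) ** 2
        for p_frame, t_frame in zip(predicted_mask, label_mask)
        for p, t in zip(p_frame, t_frame)
    ]
    return sum(diffs) / len(diffs)


def decision_loss(predicted, labels, eps=1e-7):
    """Second loss (L2): binary cross-entropy between DR and LDR (assumed metric)."""
    total = 0.0
    for p, t in zip(predicted, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(predicted)


l1 = mask_loss([[0.5, 0.5]], [[1.0, 0.0]])
l2 = decision_loss([0.8, 0.1], [1, 0])
```

Both functions return 0 (or a value very close to 0) when the output matches its label, so a smaller value means the network output is closer to the ideal output.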

In an embodiment, the weight values used in the first neural network 110 and the second neural network 310 may be updated according to a total loss value determined based on the first loss value (L1) and the second loss value (L2). For example, the total loss value may be the sum of a first weighted loss value, obtained by multiplying the first loss value (L1) by a first loss weight (LW1), and a second weighted loss value, obtained by multiplying the second loss value (L2) by a second loss weight (LW2).
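Written as a formula, with $L_{\text{total}}$ introduced here as shorthand for the total loss value (the symbol does not appear in the source) and with numbers chosen only to show the arithmetic:

```latex
\begin{align}
L_{\text{total}} &= LW_1 \cdot L_1 + LW_2 \cdot L_2 \\
% Illustrative values (assumed): L_1 = 0.25,\ L_2 = 0.16,\ LW_1 = 0.3,\ LW_2 = 0.7
L_{\text{total}} &= 0.3 \cdot 0.25 + 0.7 \cdot 0.16 = 0.075 + 0.112 = 0.187
\end{align}
```

Raising $LW_1$ relative to $LW_2$ would make the mask error count for more in the update, and vice versa.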

In an embodiment, the multi-channel voice detection system 10 may simultaneously update the weights applied to the nodes of the first neural network 110 and the second neural network 310 based on the total loss value. For example, the first neural network 110 and the second neural network 310 may each include a plurality of layers, and each of the layers may include a plurality of nodes. The multi-channel voice detection system 10 according to the present invention may update the values of the weights applied to the nodes included in the first neural network 110 and the second neural network 310 according to the total loss value.
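A simultaneous update from one total loss value can be illustrated with a toy example. The source does not name an optimizer, so everything here is an assumption: plain gradient descent, gradients approximated by central finite differences as a stand-in for backpropagation, and two one-weight "networks" whose losses are invented for the demonstration.

```python
def joint_update(params1, params2, total_loss, lr=0.1, h=1e-6):
    """One gradient-descent step on both parameter lists from one total loss.

    total_loss(params1, params2) -> float.
    """
    joined = list(params1) + list(params2)
    split = len(params1)

    def loss_at(values):
        return total_loss(values[:split], values[split:])

    grads = []
    for i in range(len(joined)):
        up = joined[:i] + [joined[i] + h] + joined[i + 1:]
        down = joined[:i] + [joined[i] - h] + joined[i + 1:]
        grads.append((loss_at(up) - loss_at(down)) / (2 * h))

    # Every gradient is computed before any weight changes, so both
    # parameter sets are updated from the same total loss value.
    updated = [w - lr * g for w, g in zip(joined, grads)]
    return updated[:split], updated[split:]


def toy_total_loss(p1, p2, lw1=0.5, lw2=0.5):
    """Toy losses: l2 depends on both weights, as DR depends on both networks."""
    l1 = (p1[0] - 1.0) ** 2
    l2 = (p1[0] * p2[0] - 2.0) ** 2
    return lw1 * l1 + lw2 * l2


w1, w2 = [0.0], [1.0]
before = toy_total_loss(w1, w2)
for _ in range(200):
    w1, w2 = joint_update(w1, w2, toy_total_loss)
after = toy_total_loss(w1, w2)
```

Because the second loss depends on the first network's weight as well, the first network receives a training signal from the detection error and not only from the mask error, which is the point of updating both networks from a single total loss.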

FIGS. 9 and 10 are flowcharts illustrating an operating method of a multi-channel voice detection system according to embodiments of the present invention.

Referring to FIGS. 1 to 10, in the operating method of the multi-channel voice detection system 10 according to an embodiment of the present invention, the noise estimation unit 100 may estimate noise included in the multi-channel input signal (MCI) based on the multi-channel input signal (MCI) and provide the estimated noise (ENO) (S100). The feature extraction unit 200 may provide, based on the estimated noise (ENO), the speech presence probability (SPP) corresponding to the probability that a voice signal is included for each frame and frequency (S200). The voice detector 300 may determine the presence or absence of a voice signal based on the speech presence probability (SPP) and provide the determination result (DR) (S300).
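The three steps can be chained as in the sketch below. The stage functions are passed in as arguments, and the toy stand-ins used to exercise the pipeline are invented for illustration and are not the units described in the source.

```python
def detect_voice(mci, estimate_noise, extract_features, detect):
    """Run S100 -> S200 -> S300 and return the determination result (DR)."""
    eno = estimate_noise(mci)      # S100: estimated noise from the input
    spp = extract_features(eno)    # S200: SPP per frame and frequency
    return detect(spp)             # S300: determination result per frame


# Toy stand-ins (assumptions), operating on [frame][frequency] lists.
def toy_estimate_noise(mci):
    return [[0.5 * x for x in frame] for frame in mci]


def toy_extract_features(eno):
    return [[1.0 / (1.0 + n) for n in frame] for frame in eno]


def toy_detect(spp):
    return [sum(frame) / len(frame) for frame in spp]


dr = detect_voice([[0.0, 2.0]], toy_estimate_noise, toy_extract_features, toy_detect)
```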

In an embodiment, the noise estimation unit 100 may include the first neural network 110 and the noise extractor 120, and the voice detector 300 may include the second neural network 310. The first neural network 110 may provide the noise extraction mask (NEM) determined according to the multi-channel input signal (MCI). The noise extractor 120 may provide the estimated noise (ENO) based on the noise extraction mask (NEM) and the multi-channel input signal (MCI). The second neural network 310 may receive the speech presence probability (SPP) and provide the determination result (DR).

In an embodiment, the multi-channel voice detection system 10 may calculate the first loss value (L1) determined according to the noise extraction mask (NEM) and the second loss value (L2) determined according to the determination result (DR), and may update the weight values used in the first neural network 110 and the second neural network 310 according to the total loss value determined based on the first loss value (L1) and the second loss value (L2).

The multi-channel voice detection system 10 according to an embodiment of the present invention may include the noise estimation unit 100, the feature extraction unit 200, the voice detector 300, and the loss calculation unit 400. The noise estimation unit 100 may estimate noise included in the multi-channel input signal (MCI) based on the multi-channel input signal (MCI) and provide the estimated noise (ENO). The feature extraction unit 200 may provide, based on the estimated noise (ENO) and according to a likelihood ratio method, the speech presence probability (SPP) corresponding to the probability that a voice signal is included for each frame and frequency. The voice detector 300 may determine the presence or absence of a voice signal based on the speech presence probability (SPP) and provide the determination result (DR). The loss calculation unit 400 may calculate the first loss value (L1) determined according to the noise extraction mask (NEM) and the second loss value (L2) determined according to the determination result (DR).
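The source names a likelihood ratio method but gives no formula. As one possible reading, the sketch below uses a commonly used single-bin form under a complex Gaussian model, in which the ratio of the "voice present" likelihood to the "voice absent" likelihood is turned into a posterior probability. The fixed a priori SNR `xi`, the prior of 0.5, the exponent clamp, and the function name are all assumptions, not details from the source.

```python
import math


def speech_presence_probability(signal_power, noise_power, xi=10.0, prior=0.5):
    """Posterior probability of voice in one time-frequency bin (assumed model).

    signal_power: observed power in the bin
    noise_power:  estimated noise power in the bin (from ENO)
    xi:           assumed fixed a priori SNR when voice is present (linear scale)
    prior:        assumed prior probability of voice presence
    """
    noise_power = max(noise_power, 1e-12)          # avoid division by zero
    posterior_snr = signal_power / noise_power
    exponent = min(posterior_snr * xi / (1.0 + xi), 50.0)  # avoid overflow
    likelihood_ratio = math.exp(exponent) / (1.0 + xi)
    odds = (prior / (1.0 - prior)) * likelihood_ratio
    return odds / (1.0 + odds)


low = speech_presence_probability(1.0, 1.0)    # power equal to the noise estimate
high = speech_presence_probability(10.0, 1.0)  # power well above the noise estimate
```

Under these assumptions, a bin whose power barely exceeds the estimated noise gets a probability below 0.5, and a bin far above the noise estimate gets a probability close to 1.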

In the method of operating the multi-channel voice detection system (10) according to an embodiment of the present invention, the noise estimation unit (100) may estimate noise included in the multi-channel input signal (MCI) based on the multi-channel input signal (MCI) and provide the estimated noise (ENO) (S100). The feature extraction unit (200) may provide, based on the estimated noise (ENO), the speech presence probability (SPP) corresponding to the probability that a voice signal is included, for each frame and frequency (S200). The voice detector (300) may determine the presence or absence of the voice signal based on the speech presence probability (SPP) and provide the determination result (DR) (S300). The loss calculator (400) may calculate the first loss value (L1) determined according to the noise extraction mask (NEM) and the second loss value (L2) determined according to the determination result (DR) (S400).
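The data flow of steps S100 to S300 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the array shapes, the constant mask, the power-ratio placeholder for the speech presence probability, and the threshold decision are hypothetical stand-ins that do not come from the specification, which uses trained neural networks (110, 310) and a likelihood ratio method.

```python
import numpy as np


def mask_net(mci):
    """Hypothetical stand-in for the first neural network (110).

    Returns a noise extraction mask (NEM) with values in [0, 1] and the same
    shape as the input. A constant mask is used purely for illustration.
    """
    return np.full(mci.shape, 0.25)


def noise_extractor(nem, mci):
    """Stand-in for the noise extractor (120): applies the mask to the input."""
    return nem * mci


def feature_extractor(mci, eno, eps=1e-8):
    """Placeholder SPP per frame and frequency.

    Assumption: a simple power ratio averaged over channels, NOT the
    likelihood ratio method of the specification.
    """
    input_power = np.mean(mci ** 2, axis=0)   # shape: (frames, freqs)
    noise_power = np.mean(eno ** 2, axis=0)   # shape: (frames, freqs)
    return np.clip(1.0 - noise_power / (input_power + eps), 0.0, 1.0)


def voice_detector(spp, threshold=0.5):
    """Stand-in for the second neural network (310).

    Assumption: one decision per frame, obtained by thresholding the mean SPP
    over frequency.
    """
    return spp.mean(axis=1) > threshold


def detect(mci):
    """Run S100 to S300 on an input of shape (channels, frames, freqs)."""
    nem = mask_net(mci)                 # S100: noise extraction mask (NEM)
    eno = noise_extractor(nem, mci)     # S100: estimated noise (ENO)
    spp = feature_extractor(mci, eno)   # S200: speech presence probability
    dr = voice_detector(spp)            # S300: determination result (DR)
    return nem, spp, dr
```

The mask is returned alongside the determination result because step S400 derives the first loss value (L1) from the mask and the second loss value (L2) from the determination result.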

In one embodiment, the weight values used in the first neural network (110) included in the noise estimation unit (100) and in the second neural network (310) included in the voice detector (300) may be updated according to the total loss value determined based on the first loss value (L1) and the second loss value (L2).

In one embodiment, the total loss value may be the sum of a first weighted loss value, obtained by multiplying the first loss value (L1) by a first loss weight (LW1), and a second weighted loss value, obtained by multiplying the second loss value (L2) by a second loss weight (LW2).
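As a minimal sketch, the weighted sum described above, LW1 × L1 + LW2 × L2, can be written as a small Python function. The function and parameter names are hypothetical, and the specification does not state how the loss weights are chosen.

```python
def total_loss(l1: float, l2: float, lw1: float, lw2: float) -> float:
    """Total loss as the sum of the two weighted loss values.

    l1, l2   : first and second loss values (L1, L2)
    lw1, lw2 : first and second loss weights (LW1, LW2)
    """
    first_weighted = lw1 * l1    # first weighted loss value
    second_weighted = lw2 * l2   # second weighted loss value
    return first_weighted + second_weighted
```

Raising LW2 relative to LW1 makes the detection error dominate the total loss, while raising LW1 emphasizes the mask error.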

The multi-channel voice detection system (10) according to the present invention can improve voice detection performance in a multi-channel environment by simultaneously updating the weight values used in the first neural network (110) and the second neural network (310) according to the total loss value calculated based on the first loss value (L1), which is determined according to the noise extraction mask (NEM), and the second loss value (L2), which is determined according to the determination result (DR).
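The idea of a simultaneous update can be illustrated with a toy Python example in which each network is reduced to a single scalar weight. Everything here is an assumption made for illustration: the quadratic losses, the targets, the learning rate, and the use of plain gradient descent are not taken from the specification, which does not name an optimizer.

```python
# Hypothetical targets for the toy losses.
MASK_TARGET = 0.5       # target for the output of the "first network"
DECISION_TARGET = 2.0   # target for the output of the "second network"


def toy_total_loss(w1, w2, lw1=1.0, lw2=1.0):
    """Total loss for scalar weights w1 (first network) and w2 (second network).

    The second loss depends on both weights because the detector operates
    downstream of the noise estimation stage.
    """
    l1 = (w1 - MASK_TARGET) ** 2
    l2 = (w1 * w2 - DECISION_TARGET) ** 2
    return lw1 * l1 + lw2 * l2


def joint_step(w1, w2, lw1=1.0, lw2=1.0, lr=0.1):
    """One gradient-descent step that updates both weights together.

    Both gradients are evaluated at the current (w1, w2) before either weight
    is changed, so the two updates are based on the same total loss.
    """
    err1 = w1 - MASK_TARGET
    err2 = w1 * w2 - DECISION_TARGET
    grad_w1 = lw1 * 2.0 * err1 + lw2 * 2.0 * err2 * w2
    grad_w2 = lw2 * 2.0 * err2 * w1
    return w1 - lr * grad_w1, w2 - lr * grad_w2
```

In this toy setting the gradient with respect to w1 contains a contribution from both losses, which is the property that lets the detection error also shape the noise estimation stage.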

FIG. 11 is a diagram illustrating a multi-channel voice detection system according to embodiments of the present invention.

Referring to FIGS. 1 to 11, the multi-channel voice detection system (10) according to the present invention may include a noise estimation unit (100), a feature extraction unit (200), and a voice detector (300). The noise estimation unit (100) may estimate noise included in the multi-channel input signal (MCI) based on the multi-channel input signal (MCI) and provide the estimated noise (ENO). The feature extraction unit (200) may provide, based on the estimated noise (ENO), the speech presence probability (SPP) corresponding to the probability that a voice signal is included, for each frame and frequency. The voice detector (300) may determine the presence or absence of the voice signal based on at least one of the multi-channel input signal (MCI), the noise extraction mask (NEM) provided from the noise estimation unit (100), and the speech presence probability (SPP), and provide the determination result (DR). In addition, although the noise extraction mask (NEM) is described as an input to the voice detector (300), the frequency features of enhanced speech estimated from the noise extraction mask (NEM) may instead be provided as the input to the voice detector (300).

10: multi-channel voice detection system 100: noise estimation unit
200: feature extraction unit 300: voice detector
110: first neural network 120: noise extractor
310: second neural network 400: loss calculator

Claims

A multi-channel voice detection system comprising:
a noise estimation unit configured to estimate noise included in a multi-channel input signal based on the multi-channel input signal and to provide estimated noise;
a feature extraction unit configured to provide, based on the estimated noise, a speech presence probability corresponding to a probability that a voice signal is included, for each frame and frequency; and
a voice detector configured to determine the presence or absence of the voice signal based on the speech presence probability and to provide a determination result.

The multi-channel voice detection system of claim 1, wherein the noise estimation unit comprises:
a first neural network providing a noise extraction mask determined according to the multi-channel input signal; and
a noise extractor providing the estimated noise based on the noise extraction mask and the multi-channel input signal.

The multi-channel voice detection system of claim 2, wherein the voice detector comprises a second neural network receiving the speech presence probability and providing the determination result.

The multi-channel voice detection system of claim 3, further comprising a loss calculator configured to calculate a first loss value determined according to the noise extraction mask and a second loss value determined according to the determination result.

The multi-channel voice detection system of claim 4, wherein weight values used in the first neural network and the second neural network are updated according to a total loss value determined based on the first loss value and the second loss value.

The multi-channel voice detection system of claim 5, wherein the total loss value is a sum of a first weighted loss value obtained by multiplying the first loss value by a first loss weight and a second weighted loss value obtained by multiplying the second loss value by a second loss weight.

The multi-channel voice detection system of claim 6, wherein the weights applied to nodes of the first neural network and the second neural network are simultaneously updated based on the total loss value.

A method of operating a multi-channel voice detection system, the method comprising:
estimating, by a noise estimation unit, noise included in a multi-channel input signal based on the multi-channel input signal, and providing estimated noise;
providing, by a feature extraction unit, based on the estimated noise, a speech presence probability corresponding to a probability that a voice signal is included, for each frame and frequency; and
determining, by a voice detector, the presence or absence of the voice signal based on the speech presence probability, and providing a determination result.

The method of claim 8, wherein the noise estimation unit comprises:
a first neural network providing a noise extraction mask determined according to the multi-channel input signal; and
a noise extractor providing the estimated noise based on the noise extraction mask and the multi-channel input signal,
and wherein the voice detector comprises a second neural network receiving the speech presence probability and providing the determination result.

The method of claim 9, wherein the multi-channel voice detection system calculates a first loss value determined according to the noise extraction mask and a second loss value determined according to the determination result,
and wherein weight values used in the first neural network and the second neural network are updated according to a total loss value determined based on the first loss value and the second loss value.

A multi-channel voice detection system comprising:
a noise estimation unit configured to estimate noise included in a multi-channel input signal based on the multi-channel input signal and to provide estimated noise;
a feature extraction unit configured to provide, based on the estimated noise, a speech presence probability corresponding to a probability that a voice signal is included, for each frame and frequency;
a voice detector configured to determine the presence or absence of the voice signal based on the speech presence probability and to provide a determination result; and
a loss calculator configured to calculate a first loss value determined according to the noise extraction mask and a second loss value determined according to the determination result.

A method of operating a multi-channel voice detection system, the method comprising:
estimating, by a noise estimation unit, noise included in a multi-channel input signal based on the multi-channel input signal, and providing estimated noise;
providing, by a feature extraction unit, based on the estimated noise, a speech presence probability corresponding to a probability that a voice signal is included, for each frame and frequency;
determining, by a voice detector, the presence or absence of the voice signal based on the speech presence probability, and providing a determination result; and
calculating, by a loss calculator, a first loss value determined according to the noise extraction mask and a second loss value determined according to the determination result.

The method of claim 12, wherein weight values used in a first neural network included in the noise estimation unit and a second neural network included in the voice detector are updated according to a total loss value determined based on the first loss value and the second loss value.

The method of claim 13, wherein the total loss value is a sum of a first weighted loss value obtained by multiplying the first loss value by a first loss weight and a second weighted loss value obtained by multiplying the second loss value by a second loss weight.

A multi-channel voice detection system comprising:
a noise estimation unit configured to estimate noise included in a multi-channel input signal based on the multi-channel input signal and to provide estimated noise;
a feature extraction unit configured to provide, based on the estimated noise, a speech presence probability corresponding to a probability that a voice signal is included, for each frame and frequency; and
a voice detector configured to determine the presence or absence of the voice signal based on at least one of the multi-channel input signal, a noise extraction mask provided from the noise estimation unit, and the speech presence probability, and to provide a determination result.