KR20220081129A

KR20220081129A - Audio signal processing method and apparatus

Info

Publication number: KR20220081129A
Application number: KR1020200170663A
Authority: KR
Inventors: 김의성; 전재진
Original assignee: 주식회사 카카오엔터프라이즈
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2022-06-15
Also published as: KR102503895B1

Abstract

The present disclosure relates to a method and apparatus for processing an acoustic signal that includes a voice signal and noise. An acoustic signal processing method according to an embodiment includes: obtaining an input acoustic signal including a voice signal and noise; generating a first signal by removing the noise from the input acoustic signal; obtaining a parameter that is input through a user interface for noise control and that controls the intensity of the noise; and generating an output acoustic signal by computing a weighted sum of the input acoustic signal and the first signal based on the parameter.

Description

Acoustic signal processing method and apparatus {AUDIO SIGNAL PROCESSING METHOD AND APPARATUS}

The present disclosure relates to a method and apparatus for processing an acoustic signal and, more particularly, to a method and apparatus for processing an acoustic signal that includes a voice signal and noise.

Noise removal is a technique for removing various unintended distortions of an input signal, including interference from other signals, from an acoustic signal. For example, when a voice signal input through a microphone contains noise caused by sounds in the surrounding environment, background noise removal can remove the noise other than the voice. In other words, noise removal is a kind of voice pre-processing that excludes or mitigates the effect of unwanted signals entering the microphone. Noise removal may be used to improve call quality by removing background noise that interferes with communication during a call, and may also be used to improve the efficiency of voice recognition in a voice recognition system.

Much research has been conducted on noise removal. Traditional noise removal algorithms were developed using statistical techniques, but many algorithms based on deep learning have been developed recently. When a deep learning model trained to output a noise-removed signal is used, controlling the strength of noise removal requires holding the weights of several models. This demands excessive software capacity to store the weights of the several models, and latency increases as memory is allocated to the many weight values.

A voice signal processing technique that controls the degree of noise removal according to a set background noise intensity may be provided, as may a technique that allows a user to adjust the degree of noise removal through a user interface in an environment in which the user receives a voice signal.

A technique may be provided for synthesizing an output signal in which sections that contain voice and sections that do not contain voice are joined naturally, by adjusting the mixing ratio of the noise-removed voice signal and the input voice signal in each kind of section.

A technique may be provided that determines when noise removal is needed in a call environment and, in such a situation, guides the user to control background noise removal appropriately through an interface.

However, the technical problems are not limited to those described above, and other technical problems may exist.

An acoustic signal processing method performed by a processor according to one aspect includes: obtaining an input acoustic signal including a voice signal and noise; generating a first signal by removing the noise from the input acoustic signal; obtaining a parameter that is input through a user interface for noise control and that controls the intensity of the noise; and generating an output acoustic signal by computing a weighted sum of the input acoustic signal and the first signal based on the parameter.

The acoustic signal processing method may further include: obtaining a signal-to-noise ratio (SNR) value of the input acoustic signal; and determining, based on the SNR value, whether to activate the user interface.
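The SNR-based decision above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the SNR is estimated or which direction the decision takes, so the sketch treats the first signal as the speech estimate, the residual (input minus first signal) as the noise estimate, and activates the interface when the SNR falls below a threshold. The function names and the 10 dB default are hypothetical.

```python
import math
from typing import Sequence


def estimate_snr_db(x: Sequence[float], y: Sequence[float]) -> float:
    """Estimate SNR in dB (assumption: y is the speech estimate, x - y the noise estimate)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and equally long")
    signal_power = sum(v * v for v in y) / len(y)
    noise_power = sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)
    if noise_power == 0.0:
        return math.inf   # no residual noise at all
    if signal_power == 0.0:
        return -math.inf  # nothing but noise
    return 10.0 * math.log10(signal_power / noise_power)


def should_activate_ui(snr_db: float, threshold_db: float = 10.0) -> bool:
    """Hypothetical rule: offer the noise control UI only when the input is noisy."""
    return snr_db < threshold_db
```

A caller might evaluate `should_activate_ui(estimate_snr_db(x, y))` periodically during a call and show or hide the control accordingly.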

Generating the output acoustic signal may include: obtaining a speech presence probability for each section of the first signal; and computing a weighted sum of the input acoustic signal and the first signal section by section, based on the speech presence probability and the parameter.

Computing the weighted sum may include: in a section where the speech presence probability is high, computing the weighted sum with a higher weight on the input acoustic signal than on the first signal, based on the parameter; and in a section where the speech presence probability is low, computing the weighted sum with a higher weight on the first signal than on the input acoustic signal, based on the parameter.

Computing the weighted sum may include: generating, by computing weighted sums of the input acoustic signal and the first signal based on the parameter, a second signal in which the input acoustic signal has a high proportion and a third signal in which the first signal has a high proportion; and computing a weighted sum of the second signal and the third signal section by section, based on the speech presence probability.
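The two-stage mixing above can be sketched as follows. The text does not specify how the parameter maps to the weights of the second and third signals, so the mapping used here (`0.4` scaling, with `alpha` read as removal strength in [0, 1]) is purely an assumption chosen so that the second signal always favors the input and the third always favors the first signal. All names are hypothetical.

```python
from typing import List, Sequence


def blend(x: Sequence[float], y: Sequence[float], w_in: float) -> List[float]:
    """Weighted sum with weight w_in on the input x and (1 - w_in) on the first signal y."""
    return [w_in * xi + (1.0 - w_in) * yi for xi, yi in zip(x, y)]


def mix_with_speech_probability(
    x: Sequence[float],
    y: Sequence[float],
    alpha: float,
    probs: Sequence[float],
    frame_len: int,
) -> List[float]:
    """Blend an input-heavy and a denoised-heavy signal per frame using speech probability."""
    if len(x) != len(y):
        raise ValueError("x and y must be equally long")
    n_frames = (len(x) + frame_len - 1) // frame_len
    if len(probs) != n_frames:
        raise ValueError("one probability per frame is required")
    # Assumed mapping: input weight stays in [0.6, 1.0] for the second signal
    # and in [0.0, 0.4] for the third signal.
    second = blend(x, y, 1.0 - 0.4 * alpha)   # input-heavy
    third = blend(x, y, 0.4 * (1.0 - alpha))  # first-signal-heavy
    out = []
    for i in range(len(x)):
        p = probs[i // frame_len]  # speech presence probability of this frame
        out.append(p * second[i] + (1.0 - p) * third[i])
    return out
```

Because the per-frame probability varies continuously, the output moves gradually between the two mixes, which is one way to avoid audible jumps at the boundary between voice and non-voice sections.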

Obtaining the speech presence probability may include obtaining, for each of the frames generated by dividing the first signal into specific time units, the probability that the frame contains the voice signal, based on a voice activity detection algorithm.
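As a rough illustration of per-frame speech probabilities, the sketch below splits a signal into frames and maps each frame's mean energy to a value in [0, 1). This energy rule is a stand-in chosen for brevity, not the voice activity detection algorithm of the disclosure, which is not specified; `noise_floor` and the function name are hypothetical.

```python
from typing import List, Sequence


def frame_speech_probability(
    signal: Sequence[float], frame_len: int, noise_floor: float = 1e-3
) -> List[float]:
    """Return one pseudo-probability per frame from mean frame energy (toy VAD stand-in)."""
    if frame_len <= 0 or noise_floor <= 0.0:
        raise ValueError("frame_len and noise_floor must be positive")
    probs = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]  # last frame may be shorter
        energy = sum(v * v for v in frame) / len(frame)
        # Energy far above the floor gives a value near 1; silence gives 0.
        probs.append(energy / (energy + noise_floor))
    return probs
```

Applying this to the first signal rather than the raw input matches the text: after noise removal, frame energy is a more meaningful cue for speech.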

Generating the first signal may include generating the first signal by performing a noise removal algorithm in the time domain of the input acoustic signal.

Generating the first signal may include generating the first signal based on a deep learning model trained to output a noise-removed acoustic signal from an acoustic signal that contains noise.

An acoustic signal processing method according to one aspect includes: generating a first signal by removing noise from an input acoustic signal including a voice signal and the noise; obtaining a speech presence probability for the first signal; obtaining a parameter relating to the degree of removal of the noise; and controlling the degree of removal of the noise by combining the input acoustic signal and the first signal based on at least one of the speech presence probability and the parameter.

The controlling may include: in a section where the speech presence probability is high, controlling the degree of removal of the noise by computing a weighted sum with a higher weight on the input acoustic signal than on the first signal, based on the parameter; and in a section where the speech presence probability is low, controlling the degree of removal of the noise by computing a weighted sum with a higher weight on the first signal than on the input acoustic signal, based on the parameter.

The controlling may further include: dividing the first signal into voice sections and non-voice sections by applying a voice activity detection algorithm to the first signal; computing, based on the parameter, a weighted sum of the first signal and the input acoustic signal corresponding to a voice section, with a higher weight on the input acoustic signal than on the first signal; and computing, based on the parameter, a weighted sum of the first signal and the input acoustic signal corresponding to a non-voice section, with a higher weight on the first signal than on the input acoustic signal.
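The division into voice and non-voice sections can be sketched as a grouping of consecutive frames by a thresholded decision. The per-frame probabilities are assumed to come from some voice activity detector; the 0.5 threshold and the function name are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def split_sections(
    probs: Sequence[float], threshold: float = 0.5
) -> List[Tuple[int, int, bool]]:
    """Group consecutive frames into (start_frame, end_frame_exclusive, is_voice) sections."""
    sections = []
    start = 0
    for is_voice, run in groupby(p >= threshold for p in probs):
        length = sum(1 for _ in run)  # number of consecutive frames with the same label
        sections.append((start, start + length, is_voice))
        start += length
    return sections
```

Each returned section can then be mixed with its own pair of weights: input-heavy for voice sections and first-signal-heavy for non-voice sections.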

An acoustic signal processing apparatus according to one aspect includes at least one processor that obtains an input acoustic signal including a voice signal and noise, generates a first signal by removing the noise from the input acoustic signal, obtains a parameter that is input through a user interface for noise control and that controls the intensity of the noise, and generates an output acoustic signal by computing a weighted sum of the input acoustic signal and the first signal based on the parameter.

The processor may obtain a signal-to-noise ratio (SNR) value of the input acoustic signal and determine, based on the SNR value, whether to activate the user interface.

In generating the output acoustic signal, the processor may obtain a speech presence probability for each section of the first signal and compute a weighted sum of the input acoustic signal and the first signal section by section, based on the speech presence probability and the parameter.

In computing the weighted sum, the processor may, based on the parameter, place a higher weight on the input acoustic signal than on the first signal in a section where the speech presence probability is high, and place a higher weight on the first signal than on the input acoustic signal in a section where the speech presence probability is low.

In computing the weighted sum, the processor may generate, by computing weighted sums of the input acoustic signal and the first signal based on the parameter, a second signal in which the input acoustic signal has a high proportion and a third signal in which the first signal has a high proportion, and may compute a weighted sum of the second signal and the third signal section by section, based on the speech presence probability.

In generating the first signal, the processor may perform a noise removal algorithm in the time domain of the input acoustic signal.

In generating the first signal, the processor may generate the first signal based on a deep learning model trained to output a noise-removed acoustic signal from an acoustic signal that contains noise.

An acoustic signal processing apparatus according to one aspect includes at least one processor that generates a first signal by removing noise from an input acoustic signal including a voice signal and the noise, obtains a speech presence probability for the first signal, obtains a parameter relating to the degree of removal of the noise, and controls the degree of removal of the noise by combining the input acoustic signal and the first signal based on at least one of the speech presence probability and the parameter.

By providing a user interface for controlling the intensity of background noise in an environment in which a voice signal is received, the user can control the degree of noise removal of the voice signal in real time while receiving it, and can intuitively exercise appropriate control over background noise removal in situations where noise removal is needed.

By synthesizing an output signal in which sections that contain voice and sections that do not contain voice are joined naturally, a noise-removed signal of improved quality can be output.

FIG. 1 is an exemplary diagram of a voice signal processing system according to an embodiment.
FIGS. 2A and 2B are exemplary diagrams of a noise control interface provided on a user's terminal.
FIG. 3 is an exemplary diagram of a noise control interface screen provided during a group video call.
FIG. 4 is an exemplary diagram of a voice signal processing system according to an embodiment.
FIGS. 5 and 6 are flowcharts of an acoustic signal processing method according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for purposes of illustration only, and the embodiments may be modified and implemented in various forms. Accordingly, the actual implementation is not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, and substitutes that fall within the technical idea described through the embodiments.

Although terms such as "first" and "second" may be used to describe various components, these terms should be interpreted only as distinguishing one component from another. For example, a first component may be termed a second component, and similarly, a second component may be termed a first component.

When a component is described as being "connected" to another component, it may be directly connected or coupled to that other component, but it should be understood that another component may exist in between.

A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this specification.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description, identical components are given identical reference numerals regardless of the figure in which they appear, and redundant descriptions of them are omitted.

FIG. 1 is an exemplary diagram of a voice signal processing system according to an embodiment.

Referring to FIG. 1, a voice signal processing system 100 according to an embodiment may correspond to a system that controls the degree of noise removal applied to an acoustic signal.

The system 100 according to an embodiment may include an apparatus 110 that receives an acoustic signal containing noise and generates an output acoustic signal in which the degree of noise removal is controlled according to a set noise intensity. In other words, the apparatus 110 can control the degree of noise removal by generating an output acoustic signal with a low degree of noise removal from the input acoustic signal when the noise intensity is set high, and an output acoustic signal with a high degree of noise removal from the input acoustic signal when the noise intensity is set low. The apparatus 110 performs the acoustic signal processing method according to an embodiment, and may correspond to an apparatus that processes an input acoustic signal according to that method to generate an output acoustic signal. The apparatus 110 may include at least one processor, a memory, and an input/output device, and may correspond to, for example, a server or a user's device (for example, a mobile phone or a computer).

The memory of the apparatus 110 according to an embodiment may be a volatile memory or a non-volatile memory, and may store information related to the acoustic signal processing method or a program that implements the acoustic signal processing method.

The at least one processor of the apparatus 110 according to an embodiment may perform the acoustic signal processing method. The processor of the apparatus 110 may execute a program and control the apparatus 110. The code of the program executed by the processor may be stored in the memory of the apparatus 110.

The apparatus 110 according to an embodiment may be connected to an external device (for example, a personal computer or a network) through the input/output device and exchange data with it, and may receive input data from a user or provide output data to the user.

Referring to FIG. 1, the apparatus 110 may include a noise removal unit 111 and a noise control unit 112. Each component of the apparatus 110 may correspond to a module, and each component may perform a process for acoustic signal processing. The configuration of the apparatus 110 shown in FIG. 1 is divided into functional units in order to describe the acoustic signal processing method according to an embodiment, and does not limit the configuration of the apparatus. As described above, the acoustic signal processing method according to an embodiment may be performed by the at least one processor of the apparatus 110.

The noise removal unit 111 according to an embodiment may correspond to a module that receives an acoustic signal 101 containing a voice signal and noise and outputs a noise-removed signal. The noise contained in the input acoustic signal 101 may include background noise that arises when sounds from the surrounding environment, other than the voice signal, are picked up by the input device. The noise removal unit 111 may perform a process of removing noise from the input acoustic signal 101 using a noise removal algorithm. The noise removal algorithm may include any of various algorithms for removing noise from an acoustic signal, such as an algorithm that uses statistical techniques or an algorithm that uses a deep learning model.

According to an embodiment, the noise removal algorithm may include an algorithm performed in the time domain of the acoustic signal, that is, an algorithm that removes noise from an acoustic signal expressed in the time domain. A noise removal algorithm performed in the time domain of an acoustic signal is distinguished from one performed in the frequency domain of the acoustic signal.

For example, the noise removal algorithm may include estimating a noise-removed signal by applying one-dimensional convolution to an acoustic signal expressed in the time domain, using a deep learning model trained by supervised learning to output a noise-removed acoustic signal from an acoustic signal that contains noise. Here, training the deep learning model may include mapping the result of applying one-dimensional convolution to a noisy acoustic signal expressed in the time domain onto the corresponding noise-removed acoustic signal. Hereinafter, the acoustic signal obtained by removing noise, with the noise removal algorithm, from the acoustic signal 101 that is input to the system and contains noise and a voice signal is referred to as the first signal.
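To make the basic operation concrete, the sketch below applies a single one-dimensional convolution (in the cross-correlation form common in deep learning) with zero padding to a time-domain signal. It is only the building block: the disclosure's model, its layers, and its trained kernels are not specified, and the kernels used here are arbitrary illustrations.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Slide the kernel over a zero-padded signal; output has the same length as the input."""
    if len(kernel) == 0:
        raise ValueError("kernel must not be empty")
    left = len(kernel) // 2
    right = len(kernel) - 1 - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]
```

In a trained time-domain denoiser, many such convolutions with learned kernels and nonlinearities would be stacked, and supervised training would adjust the kernels so that noisy inputs map to clean targets.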

The noise control unit 112 according to an embodiment may correspond to a module that combines the input acoustic signal 101 with the first signal, from which the noise has been removed, to generate an output acoustic signal 102 with a controlled degree of noise removal. The noise control unit 112 may perform this process using a noise control algorithm. The degree of noise removal may be controlled based on an input received through the noise control interface 120.

According to an embodiment, the noise control algorithm that generates the output acoustic signal 102 may include computing a weighted sum of the first signal and the input acoustic signal 101. The weight for the first signal and the weight for the input acoustic signal 101 may be determined based on a parameter, obtained through the noise control interface 120, that controls the intensity of the noise. According to an embodiment, this parameter may be input from a user terminal through the noise control interface 120.

For example, the output acoustic signal 102 may be expressed as Equation 1 below.

[Equation 1]

$$z(t) = \alpha\, y(t) + (1 - \alpha)\, x(t)$$

Here, x(t), y(t), and z(t) are acoustic signals expressed in the time domain, that is, as functions of time t, and correspond to the input acoustic signal, the first signal, and the output signal, respectively. α is a parameter relating to the intensity of the noise, input from the user terminal through the noise control interface 120, and may take a real value from 0 to 1 inclusive.

According to an embodiment, the parameter α is used as the weight of the first signal and (1-α) as the weight of the input acoustic signal 101, and the output acoustic signal 102 may be generated by computing the weighted sum of the first signal and the input acoustic signal 101 with these weights.
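The weighted sum described above can be written as a short sketch. Following the preceding paragraph, `alpha` weights the first signal `y` and `1 - alpha` weights the input `x`; the function name and the use of plain Python lists for sampled signals are assumptions made for illustration.

```python
from typing import List, Sequence


def mix(x: Sequence[float], y: Sequence[float], alpha: float) -> List[float]:
    """Return z with z[t] = alpha * y[t] + (1 - alpha) * x[t] for every sample t."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(x) != len(y):
        raise ValueError("x and y must be equally long")
    return [alpha * yi + (1.0 - alpha) * xi for xi, yi in zip(x, y)]
```

With `alpha = 0` the output equals the unprocessed input, and with `alpha = 1` it equals the first signal. A single denoising model therefore suffices for every setting in between, which avoids storing the weights of several models.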

According to an embodiment, the weighted sum of the input acoustic signal and the first signal may be computed per section or per frame. A frame is a segment obtained by dividing a signal expressed in the time domain into fixed time units. Summing per frame means computing the weighted sum of a first frame of the input acoustic signal and a first frame of the first signal, where the two frames correspond to the same time interval. For example, when the frame size is 1 s, the 0 to 1 s frame of the input acoustic signal is summed with the 0 to 1 s frame of the first signal, the 1 to 2 s frame of the input acoustic signal is summed with the 1 to 2 s frame of the first signal, and so on, until the entire input acoustic signal and the entire first signal have been summed. The weights of the input acoustic signal and the first signal may differ from frame to frame. For example, when the frame size is 1 s, the input acoustic signal x(0) and the first signal y(0) of the 0 to 1 s frame may be summed as (1-α₀)x(0)+α₀y(0), and the input acoustic signal x(1) and the first signal y(1) of the 1 to 2 s frame may be summed as (1-α₁)x(1)+α₁y(1), where α₀ and α₁ may be determined independently of each other.
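
The frame-wise summation can be sketched as follows. The function name `mix_per_frame`, the sample-count frame length, and the list representation are assumptions for illustration; each α weights the first signal, following the convention used in this description.

```python
def mix_per_frame(x, y, frame_len, alphas):
    """Weighted sum computed frame by frame with an independent alpha per frame.

    alphas[k] is the weight of the first signal y in frame k;
    (1 - alphas[k]) is the weight of the input signal x in that frame.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    n_frames = (len(x) + frame_len - 1) // frame_len  # ceiling division
    if len(alphas) != n_frames:
        raise ValueError("one alpha is required per frame")
    out = []
    for k, alpha in enumerate(alphas):
        start, stop = k * frame_len, (k + 1) * frame_len
        # Frames of x and y cover the same time interval.
        out.extend(alpha * yi + (1.0 - alpha) * xi
                   for xi, yi in zip(x[start:stop], y[start:stop]))
    return out
```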

According to an embodiment, the noise control unit 112 may generate the output acoustic signal 102 in consideration of a speech presence probability. This is described in detail below with reference to FIG. 4.

The system 100 according to an embodiment may include a noise control interface 120. The noise control interface 120 may include a user interface (UI) that supports interaction between a user and the system 100 for noise control. The noise control interface 120 may be mounted on a device other than the device 110 that performs the acoustic signal processing method, and may be connected to the device 110 through a network to transmit and receive data. For example, the device 110 may be a server for acoustic signal processing, and the noise control interface 120 may be mounted on a user terminal, transmit data received at the user terminal to the device 110 through the network, and receive data from the device 110 through the network for display on the user terminal.

Unlike what is shown in FIG. 1, the noise control interface 120 according to an embodiment may instead be mounted on the device 110 that performs the acoustic signal processing method. For example, the device 110 may be a user terminal, and the noise control interface 120 may be a user interface provided on that user terminal.

The acoustic signal processing device 110 according to an embodiment may obtain a noise-control input entered through the noise control interface 120. The noise-control input may include an input that controls the noise intensity. For example, a user may enter a parameter controlling the noise intensity through the interface 120, and the acoustic signal processing device 110 may obtain the entered parameter through the network.

The noise intensity refers to the level of noise contained in the output acoustic signal 102, and the degree to which noise is removed from the input acoustic signal 101 may be determined by the noise intensity setting. For example, a low noise intensity setting means that the output acoustic signal should contain little noise, so the degree of noise removal may be set high. Conversely, when the noise intensity is set high, the degree of noise removal may be set low.

For example, the noise control interface provided on the user terminal may be configured as shown in FIGS. 2A and 2B. Referring to FIG. 2A, the user may set the intensity of the noise contained in the output acoustic signal 102 by changing the position of the displayed bar 201 along the reference line 202 through the provided noise control interface. The further left the bar 201 is positioned on the reference line 202, the lower the noise intensity it indicates, and the further right, the higher the noise intensity. Referring to FIGS. 2A and 2B, the bar 211 shown in FIG. 2B is located further to the right on the reference line than the bar 201 shown in FIG. 2A, so the noise intensity is set higher in FIG. 2B than in FIG. 2A. The noise control interface according to an embodiment may receive an input from the user that changes the position of the bar 201, and may obtain the parameter related to the noise intensity by parameterizing the point on the reference line 202 at which the bar 201 is located.
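
One way to parameterize the bar position is a linear mapping onto the range 0 to 1, sketched below. The description does not specify the mapping, so the linear form, the pixel coordinates, and the name `bar_to_intensity` are all assumptions for illustration.

```python
def bar_to_intensity(bar_x, line_left, line_right):
    """Map the bar position on the reference line to a value in [0, 1].

    0.0 corresponds to the left end (lowest noise intensity) and
    1.0 to the right end (highest noise intensity).
    """
    if line_right <= line_left:
        raise ValueError("line_right must be greater than line_left")
    # Keep the bar within the reference line before normalizing.
    clamped = min(max(bar_x, line_left), line_right)
    return (clamped - line_left) / (line_right - line_left)
```

A bar placed further to the right, as in FIG. 2B relative to FIG. 2A, yields a larger value.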

The interface shown in FIGS. 2A and 2B is only one example of the noise control interface 120 according to an embodiment. The noise control interface 120 may also include an interface for entering the noise intensity as a numerical value, and may include an interface that supports various inputs such as touch input through a touch screen, key input through a keyboard, and click input through a mouse.

By adjusting the intensity of the noise contained in the output acoustic signal 102 through the noise control interface 120, the user may send a noise-control input to the system, and the system may generate the output acoustic signal 102 by adjusting the degree of noise removal based on the noise-control input received through the interface.

According to an embodiment, the noise control interface 120 may be provided in an environment in which a voice signal is received. For example, the noise control interface 120 may be provided during a voice call or a video call, where the voice signal of the other party is received.

For example, FIG. 3 illustrates a noise control interface provided during a group video call. Referring to FIG. 3, when a group video call is conducted through a user terminal, the noise control interface may be provided in an environment in which video and audio transmitted from the terminals of the other parties are received. In a group video call, a noise control interface may be provided for the voice signal of each of the plurality of call parties, and the user may control the noise intensity of the voice signal received from a given party through the noise control interface displayed below that party's video. By controlling the noise intensity of the voice signal received from each party through the corresponding interface, the user may receive that party's voice signal with the degree of noise removal adjusted. In addition, the noise intensities for the plurality of call parties may be set differently from one another, so that the acoustic signals are output with the degree of noise removal adjusted separately according to each setting.

Referring back to FIG. 1, whether the noise control interface 120 according to an embodiment is activated may be determined based on the signal-to-noise ratio (SNR) of the input acoustic signal 101.

The signal-to-noise ratio (SNR) expresses the ratio of the received signal to the noise. When the strength of the received signal is $V_s$ and the strength of the noise is $V_n$, the SNR is defined as $a \log(V_s / V_n)$ in dB, where $a$ is a positive constant. A larger SNR means the signal is larger relative to the noise. For example, speech sounds louder and noise quieter when the SNR of a voice signal is 20 dB than when it is 7 dB; 0 dB means that the speech and the noise have the same level; and a negative SNR such as -10 dB means that the noise is louder than the speech.
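
The definition above can be evaluated as in the following sketch. The text only states that a is a positive constant; the default of 20 with a base-10 logarithm (the usual convention for amplitude ratios, with 10 used for power ratios) is an assumption, as is the function name `snr_db`.

```python
import math


def snr_db(v_s, v_n, a=20.0):
    """Return a * log10(v_s / v_n), the SNR in dB.

    a = 20.0 is an assumed default for amplitude ratios; the
    description only requires a to be a positive constant.
    """
    if v_s <= 0 or v_n <= 0 or a <= 0:
        raise ValueError("v_s, v_n and a must be positive")
    return a * math.log10(v_s / v_n)
```

Equal signal and noise strengths give 0 dB, and a noise strength larger than the signal strength gives a negative value.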

The device 110 according to an embodiment may obtain the SNR value of the input acoustic signal 101 and may determine, based on the SNR value, whether to activate the noise control interface 120. For example, when the SNR value is less than a predetermined reference value, the input acoustic signal 101 may be judged to contain a large amount of noise, and it may be determined that the noise control interface 120 is to be activated.

According to an embodiment, when it is determined that the interface 120 is to be activated, the interface 120 is activated and the device 110 may perform the noise removal algorithm and the noise control algorithm.

Referring back to FIG. 3, in a group video call a noise control interface may be provided for the voice signal of each of the plurality of call parties, and whether each noise control interface is activated may be determined based on the SNR value of the input acoustic signal received from the corresponding party. Whether a noise control interface is activated may be indicated on the interface in various ways. For example, it may be indicated by the color difference between the figures 301 and 302 and the color difference between the figures 303 and 304 displayed on the interface. When the noise control interface is activated, the user can move the bar 305 representing the noise intensity through an input device in order to enter the parameter controlling the noise intensity. When the noise control interface is not activated, the user cannot move the bar 306 representing the noise intensity through the input device.
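
The per-party activation and the locked bar can be sketched as below. The party identifiers, the 10 dB threshold in the usage, and both function names are hypothetical; the "SNR below a reference value" rule is the example criterion given earlier in this description.

```python
def active_controls(snr_by_party, threshold_db):
    """Decide per call party whether its noise control interface is active.

    A control is active when that party's input SNR is below threshold_db.
    """
    return {party: snr < threshold_db for party, snr in snr_by_party.items()}


def move_bar(is_active, current_pos, requested_pos):
    """Accept the requested bar position only if the control is active."""
    return requested_pos if is_active else current_pos
```

For instance, `active_controls({"A": 5.0, "B": 25.0}, 10.0)` marks only party A's control as active, and `move_bar(False, 0.3, 0.8)` leaves the bar at 0.3.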

FIG. 4 is an exemplary diagram of a voice signal processing system according to an embodiment.

Referring to FIG. 4, the device that performs the acoustic signal processing method according to an embodiment may further include a voice activity detection unit for obtaining a speech presence probability. As with FIG. 1, the configuration shown in FIG. 4 is divided into functional units in order to describe the acoustic signal processing method according to an embodiment, and does not limit the configuration of the device.

The voice activity detection unit 410 according to an embodiment may be a module that obtains the speech presence probability of the first signal, that is, the input acoustic signal with noise removed. The voice activity detection unit 410 may obtain the speech presence probability of the first signal using a voice activity detection (VAD) algorithm. A voice activity detection algorithm distinguishes speech sections from silent (or non-speech) sections in an acoustic signal, and may be any of various algorithms that detect voice activity. The algorithm may output a speech presence probability, which is the probability that the acoustic signal given as its input is a voice signal.

The voice activity detection unit 410 according to an embodiment may obtain the speech presence probability for each frame of the first signal according to the voice activity detection algorithm.

The noise control unit 420 according to an embodiment may generate the output acoustic signal by computing the weighted sum of the input acoustic signal and the first signal based on the speech presence probability and the parameter controlling the noise intensity.

The noise control unit 420 according to an embodiment may compute weighted sums of the input acoustic signal and the first signal based on the parameter controlling the noise intensity, thereby generating a second signal in which the input acoustic signal has the larger share and a third signal in which the first signal has the larger share, and may then generate the output acoustic signal by computing the weighted sum of the second signal and the third signal based on the speech presence probability of the first signal.

For example, when the parameter controlling the noise intensity entered through the noise control interface is set to α₁ (where α₁ is any real number greater than or equal to 0 and less than 0.5), the input acoustic signal x(t) and the first signal y(t) may be summed with weights as in Equation 2 below to synthesize the second signal z₀(t), in which the input acoustic signal x(t) has the larger share.
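
Equation 2 can be reconstructed from the weights stated in the next paragraph, (1-α₁) for x(t) and α₁ for y(t), as:

[Equation 2]
$$z_0(t) = \alpha_1\, y(t) + (1 - \alpha_1)\, x(t)$$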

Here, since α₁ is a real number greater than or equal to 0 and less than 0.5, the weight (1-α₁) of the input acoustic signal x(t) is greater than the weight α₁ of the first signal y(t). Accordingly, the second signal z₀(t) is generated with a larger share of the input acoustic signal x(t) than of the first signal y(t).

In addition, the input acoustic signal x(t) and the first signal y(t) may be summed with weights as in Equation 3 below to synthesize the third signal z₁(t), in which the first signal y(t) has the larger share.
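
Equation 3 can be reconstructed from the weights stated in the following paragraphs, (1-α₂) for x(t) and α₂ for y(t), as:

[Equation 3]
$$z_1(t) = \alpha_2\, y(t) + (1 - \alpha_2)\, x(t)$$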

Here, α₂ is a parameter determined from the α₁ entered through the noise control interface, and may be a parameter that becomes smaller as α₁ becomes larger. For example, α₂ may be determined as (1-α₁).

When α₂ is determined as (1-α₁), α₁ being a real number greater than or equal to 0 and less than 0.5 makes α₂ a real number greater than 0.5 and less than or equal to 1, so the weight (1-α₂) of the input acoustic signal x(t) is smaller than the weight α₂ of the first signal y(t). Accordingly, the third signal z₁(t) is generated with a larger share of the first signal y(t) than of the input acoustic signal x(t).

The output acoustic signal z(t) may be generated by computing the weighted sum of the second signal z₀(t) and the third signal z₁(t) based on the speech presence probability p of the first signal, as in Equation 4 below.
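
Equation 4 is taken here to have the following form. This is inferred from the description: a higher speech presence probability gives the second signal more weight, and setting p = 1 or p = 0 selects the speech-frame or non-speech-frame case.

[Equation 4]
$$z(t) = p\, z_0(t) + (1 - p)\, z_1(t)$$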

According to an embodiment, the first signal and the input acoustic signal are combined so that, when the speech presence probability is high, the output acoustic signal contains more of the second signal, in which the input acoustic signal has the larger share, and when the speech presence probability is low, it contains more of the third signal, in which the first signal has the larger share. This yields an output acoustic signal whose degree of noise removal is controlled per section. By adjusting the weights α₁ and α₂ used to synthesize the second signal and the third signal, an output acoustic signal may be generated in which portions containing speech and portions not containing speech are joined naturally.
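
The two-stage blend can be sketched as follows. It assumes α₂ = 1 - α₁ (the example given above) and a final blend of the form p·z₀ + (1-p)·z₁; the function name and list representation are illustrative.

```python
def blend_with_speech_probability(x, y, alpha1, p):
    """Blend input x and denoised y using alpha1 and speech probability p.

    z0 favors the input signal (alpha1 < 0.5 weights y), z1 favors the
    denoised signal (alpha2 = 1 - alpha1 weights y), and p mixes the two.
    """
    if not 0.0 <= alpha1 < 0.5:
        raise ValueError("alpha1 must lie in [0, 0.5)")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    alpha2 = 1.0 - alpha1  # assumed relation between the two weights
    out = []
    for xi, yi in zip(x, y):
        z0 = alpha1 * yi + (1.0 - alpha1) * xi  # second signal
        z1 = alpha2 * yi + (1.0 - alpha2) * xi  # third signal
        out.append(p * z0 + (1.0 - p) * z1)
    return out
```

With `x = [1.0]`, `y = [0.0]`, and `alpha1 = 0.25`, the output moves from 0.25 at `p = 0` (mostly denoised) to 0.75 at `p = 1` (mostly input).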

According to an embodiment, when the speech presence probability p of the first signal is obtained per frame, the noise control unit 420 may generate the output acoustic signal by computing the weighted sum of the input acoustic signal and the first signal per frame, based on the speech presence probability and the parameter controlling the noise intensity. For example, when the speech presence probability in frame t₁ of the first signal is p₁, the output acoustic signal z(t₁) corresponding to frame t₁ may be expressed as in Equation 5 below.
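
Applying the same blend frame by frame, Equation 5 can be reconstructed as follows, with z₀ and z₁ denoting the second and third signals:

[Equation 5]
$$z(t_1) = p_1\, z_0(t_1) + (1 - p_1)\, z_1(t_1)$$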

According to an embodiment, when the speech presence probability p of the first signal is obtained per frame, the noise control unit 420 may generate the output acoustic signal by determining, for each frame of the first signal, whether the frame has a high or a low speech presence probability.

Whether a given frame of the first signal has a high or a low speech presence probability may be determined by comparing the speech presence probability obtained for that frame with a predetermined threshold. For example, with a threshold of 0.5, a frame whose speech presence probability is 0.5 or more may be determined to be a frame with a high speech presence probability, and a frame whose speech presence probability is less than 0.5 may be determined to be a frame with a low speech presence probability. According to an embodiment, frames with a high speech presence probability may be classified as speech sections and frames with a low speech presence probability as non-speech sections. In other words, the frames of the first signal may be divided into speech sections and non-speech sections based on the per-frame speech presence probability output by applying the voice activity detection algorithm to the first signal.
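
The threshold comparison can be sketched as below; the function name and the string labels are illustrative, and 0.5 is the example threshold from the text.

```python
def classify_frames(speech_probs, threshold=0.5):
    """Label each frame as speech or non-speech by its speech probability.

    A probability equal to the threshold counts as speech, matching the
    "0.5 or more" rule in the description.
    """
    return ["speech" if p >= threshold else "non-speech" for p in speech_probs]
```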

For frames of the first signal with a high speech presence probability, the noise control unit 420 according to an embodiment may compute the weighted sum with a higher weight on the input acoustic signal than on the first signal, based on the parameter controlling the noise intensity. For frames of the first signal with a low speech presence probability, the noise control unit 420 may compute the weighted sum with a higher weight on the first signal than on the input acoustic signal, based on the parameter controlling the noise intensity.

For example, when the parameter controlling the noise intensity entered through the noise control interface is set to α₁ (where α₁ is any real number greater than or equal to 0 and less than or equal to 0.5), the first signal y(t₁) and the input acoustic signal x(t₁) may be combined as in Equation 6 below in a frame t₁ of the first signal with a high speech presence probability.
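
Equation 6 can be reconstructed from the weights stated in the next paragraph, (1-α₁) for x(t₁) and α₁ for y(t₁), as:

[Equation 6]
$$z(t_1) = \alpha_1\, y(t_1) + (1 - \alpha_1)\, x(t_1)$$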

Here, since α₁ is a real number greater than or equal to 0 and less than or equal to 0.5, the weight (1-α₁) of the input acoustic signal x(t₁) is greater than or equal to the weight α₁ of the first signal y(t₁). Accordingly, in a frame with a high speech presence probability, the output acoustic signal z(t₁) is generated with a larger share of the input acoustic signal x(t₁) than of the first signal y(t₁).

Conversely, in a frame t₂ of the first signal with a low speech presence probability, the first signal y(t₂) and the input acoustic signal x(t₂) may be combined as in Equation 7 below.
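
Equation 7 can be reconstructed from the weights stated in the next paragraph, (1-α₂) for x(t₂) and α₂ for y(t₂), as:

[Equation 7]
$$z(t_2) = \alpha_2\, y(t_2) + (1 - \alpha_2)\, x(t_2)$$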

When α₂ is determined as (1-α₁), α₁ being a real number greater than or equal to 0 and less than or equal to 0.5 makes α₂ a real number greater than or equal to 0.5 and less than or equal to 1, so the weight (1-α₂) of the input acoustic signal x(t₂) is less than or equal to the weight α₂ of the first signal y(t₂). Accordingly, in a frame with a low speech presence probability, the output acoustic signal z(t₂) is generated with a larger share of the first signal than of the input acoustic signal.

According to an embodiment, generating the output acoustic signal by determining whether each frame of the first signal has a high or a low speech presence probability (that is, whether it is a speech section or a non-speech section) can be understood as computing the weighted sum of the second signal and the third signal in Equation 4 above with p = 1 for frames with a high speech presence probability and p = 0 for frames with a low speech presence probability.
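
The hard decision can be sketched as a special case of the soft blend. The example assumes α₂ = 1 - α₁ and a blend of the form p·z₀ + (1-p)·z₁, which collapses to a single effective weight on the first signal; the function name and the 0.5 threshold default are illustrative.

```python
def mix_frame_hard(x_frame, y_frame, alpha1, speech_prob, threshold=0.5):
    """Mix one frame using p = 1 for speech frames and p = 0 otherwise.

    The effective weight of the denoised frame y is
    p * alpha1 + (1 - p) * alpha2, with alpha2 = 1 - alpha1 (assumed).
    """
    p = 1.0 if speech_prob >= threshold else 0.0
    alpha2 = 1.0 - alpha1
    w_y = p * alpha1 + (1.0 - p) * alpha2  # alpha1 for speech, alpha2 otherwise
    return [w_y * yi + (1.0 - w_y) * xi for xi, yi in zip(x_frame, y_frame)]
```

A speech frame therefore keeps mostly the input signal, while a non-speech frame keeps mostly the denoised signal.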

FIGS. 5 and 6 are flowcharts of an acoustic signal processing method according to an embodiment.

More specifically, FIGS. 5 and 6 illustrate the operation of the acoustic signal system in an embodiment in which the acoustic signal processing device is a server and the noise control interface is mounted on a user terminal.

Referring to FIG. 5, the server may receive (510), through a network, an input acoustic signal captured by an input device of the user terminal such as a microphone. The server may obtain the first signal by performing (530) the noise removal algorithm described above on the received input acoustic signal. The user terminal may receive a noise-control input from the user through the noise control interface and forward it to the server, and the server may obtain (520) the noise-control input received through the noise control interface. According to an embodiment, obtaining the noise-control input and performing the noise removal algorithm may be carried out in parallel.

Based on the obtained noise-control input, the server may generate the output acoustic signal by performing (540) the noise control algorithm, which computes the weighted sum of the first signal and the input acoustic signal. The server may transmit (550) the output acoustic signal to the user terminal through the network, and the user terminal that receives the output acoustic signal may play it through an output device such as a speaker.
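
The server-side steps 530 and 540 of FIG. 5 can be sketched as a single function. The noise removal algorithm is passed in as a callable because the description does not fix a particular algorithm; the function name `process_on_server` and the zeroing stand-in used in the usage note are hypothetical, not a real denoiser.

```python
def process_on_server(x, alpha, denoise):
    """Run noise removal (530) and then noise control (540) on input x.

    denoise is any callable returning the first signal for x; alpha is the
    weight of the first signal obtained from the noise-control input (520).
    """
    y = denoise(x)  # step 530: obtain the first signal
    if len(y) != len(x):
        raise ValueError("denoise must preserve the signal length")
    # Step 540: weighted sum of the first signal and the input signal.
    return [alpha * yi + (1.0 - alpha) * xi for xi, yi in zip(x, y)]
```

For example, `process_on_server([2.0, 4.0], 0.5, lambda s: [0.0] * len(s))` returns `[1.0, 2.0]`, where the lambda merely stands in for a noise removal algorithm.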

According to an embodiment, the input acoustic signal received by the server may be an acoustic signal on which a noise removal algorithm has already been performed at the user terminal. In other words, the user terminal may perform a process of removing noise from the acoustic signal received through its input device.

Referring to FIG. 6, before performing the subsequent operations shown in FIG. 5, the server that has received the input acoustic signal from the user terminal may obtain (610) an SNR value from the received input acoustic signal and determine whether to activate the noise control interface. The SNR value may be calculated by the processor of the server, or it may be calculated by an external device and obtained by the server.

When the obtained SNR value meets a predetermined criterion, the server may activate (620) the noise control interface of the user terminal. The predetermined criterion may be an SNR-related criterion for deciding whether to perform the noise removal and noise control algorithms on the input acoustic signal. For example, the SNR value may be compared with a predetermined threshold or threshold range, and it may be determined that the noise control interface is to be activated when the SNR value is greater than the threshold, less than the threshold, or within the threshold range, depending on how the criterion is defined.
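
The alternative criteria can be sketched with one function that accepts exactly one kind of rule. The keyword names `below`, `above`, and `within` and the function name are hypothetical; the description leaves the concrete criterion open.

```python
def should_activate(snr_db, *, below=None, above=None, within=None):
    """Check the SNR against one predetermined criterion.

    Pass exactly one of: below (threshold), above (threshold),
    or within (a (low, high) range, inclusive).
    """
    given = [c for c in (below, above, within) if c is not None]
    if len(given) != 1:
        raise ValueError("specify exactly one criterion")
    if below is not None:
        return snr_db < below
    if above is not None:
        return snr_db > above
    low, high = within
    return low <= snr_db <= high
```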

When the server activates the noise control interface, the interface may be activated on the user terminal, for example by being displayed, and the user may exchange data with the server through the noise control interface.

When the server determines, based on the SNR value, that the noise control interface is to be activated, the noise removal algorithm may be performed. In addition, because the server can obtain the noise-control input through the noise control interface, the noise control algorithm may also be performed.

Conversely, when the SNR value does not meet the predetermined criterion, the server may leave the noise control interface inactive, and the noise removal algorithm may not be performed. When the noise control interface is not activated on the user terminal, the user cannot enter a noise-control parameter through the interface. Because the noise removal algorithm is not performed on the server and no noise-control parameter is received through the noise control interface, the noise control algorithm is not performed either. In this case, the acoustic signal input to the user terminal may be output through the output device without any separate noise removal or noise control processing on the server.
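
The gating of FIG. 6 can be sketched as a wrapper that either bypasses processing or runs noise removal followed by noise control. The "SNR below threshold" rule, the function name `handle_signal`, and the pluggable `denoise` callable are assumptions for illustration.

```python
def handle_signal(x, snr_db, threshold_db, alpha, denoise):
    """Process x only when its SNR meets the activation criterion.

    If snr_db is not below threshold_db, the signal is passed through
    unchanged; otherwise it is denoised and mixed with weight alpha
    on the denoised signal.
    """
    if snr_db >= threshold_db:
        return list(x)  # criterion not met: no noise removal or control
    y = denoise(x)      # noise removal algorithm
    return [alpha * yi + (1.0 - alpha) * xi for xi, yi in zip(x, y)]
```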

FIGS. 5 and 6 illustrate a case in which the user terminal that transmits the input acoustic signal to the server and the user terminal that receives the output acoustic signal from the server are the same terminal. According to an embodiment, however, the two may be different terminals. For example, in a video call or voice call, an acoustic signal input through one user's terminal is output through the terminal of the other party to the call, so the terminal that transmits the input acoustic signal to the server over the network and the terminal that receives the output acoustic signal from the server over the network may be different terminals.

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatuses, methods, and components described in the embodiments may be implemented using a general-purpose or special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and software applications that execute on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in a computer-readable recording medium.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on them. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or if components of the described systems, structures, devices, circuits, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

1. An acoustic signal processing method performed by a processor, the method comprising:
obtaining an input acoustic signal including a voice signal and noise;
generating a first signal by removing the noise from the input acoustic signal;
obtaining a parameter that is input through a user interface related to noise control and that controls the intensity of the noise; and
generating an output acoustic signal by computing a weighted sum of the input acoustic signal and the first signal based on the parameter.
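For illustration only, the weighted sum recited in the preceding claim can be written as a single expression. The symbols are assumptions introduced here: x[n] is the input acoustic signal, s_1[n] is the first signal, y[n] is the output acoustic signal, and α is the user parameter normalized to the interval [0, 1]. The claim does not fix any particular mapping from the parameter to the weights.

```latex
y[n] = \alpha \, x[n] + (1 - \alpha) \, s_1[n], \qquad 0 \le \alpha \le 1
```

Under this assumed mapping, α = 1 reproduces the input signal with its noise intact, and α = 0 outputs only the noise-removed first signal.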

2. The method of claim 1, further comprising:
obtaining a signal-to-noise ratio (SNR) value of the input acoustic signal; and
determining whether to activate the user interface based on the SNR value.

3. The method of claim 1, wherein generating the output acoustic signal comprises:
obtaining a speech presence probability corresponding to each section of the first signal; and
computing, for each section, a weighted sum of the input acoustic signal and the first signal based on the speech presence probability and the parameter.

4. The method of claim 3, wherein computing the weighted sum comprises:
in a section in which the speech presence probability is high, computing the weighted sum with a higher weight on the input acoustic signal than on the first signal, based on the parameter; and
in a section in which the speech presence probability is low, computing the weighted sum with a higher weight on the first signal than on the input acoustic signal, based on the parameter.

5. The method of claim 3, wherein computing the weighted sum comprises:
generating, by computing weighted sums of the input acoustic signal and the first signal based on the parameter, a second signal having a high proportion of the input acoustic signal and a third signal having a high proportion of the first signal; and
computing, for each section, a weighted sum of the second signal and the third signal based on the speech presence probability.
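As a hedged illustration of the two-stage weighting recited in the preceding claim, the sketch below first builds a second and a third signal from the parameter and then blends them by the speech presence probability. The function name `two_stage_mix`, the weight mapping (0.75 + 0.25·param and 0.25·param), and the use of one probability per sample are assumptions chosen for this example; the claim does not specify them.

```python
from typing import List, Sequence


def two_stage_mix(
    input_signal: Sequence[float],
    first_signal: Sequence[float],
    spp: Sequence[float],
    param: float,
) -> List[float]:
    """Blend input and denoised signals in two stages (illustrative sketch).

    spp holds one speech presence probability in [0, 1] per sample, assumed
    to be already expanded from per-section values. param is assumed to be
    normalized to [0, 1].
    """
    # Hypothetical mapping from the user parameter to input-signal weights.
    w_second = 0.75 + 0.25 * param  # second signal: mostly the input signal
    w_third = 0.25 * param          # third signal: mostly the first signal
    output = []
    for x, s, p in zip(input_signal, first_signal, spp):
        second = w_second * x + (1.0 - w_second) * s
        third = w_third * x + (1.0 - w_third) * s
        # High speech probability favors the second signal, low favors the third.
        output.append(p * second + (1.0 - p) * third)
    return output


# Hypothetical usage: speech in the first sample, no speech in the second.
mixed = two_stage_mix([1.0, 1.0], [0.0, 0.0], [1.0, 0.0], 0.0)
```

Blending the second and third signals by a probability, rather than switching between them, avoids abrupt weight changes at section boundaries.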

6. The method of claim 3, wherein obtaining the speech presence probability comprises:
obtaining, for each of a plurality of frames generated by dividing the first signal into units of a specific time length, a probability that the voice signal is included in the frame, based on a voice activity detection algorithm.
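The framing step in the preceding claim can be illustrated with a minimal sketch. The claim does not name a particular voice activity detection algorithm, so the energy-based score used below is only a stand-in, and `frame_signal`, `speech_presence_probabilities`, and `energy_ref` are hypothetical names and values.

```python
from typing import List, Sequence


def frame_signal(signal: Sequence[float], frame_len: int) -> List[List[float]]:
    """Split a signal into consecutive frames of frame_len samples."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    return [list(signal[i:i + frame_len]) for i in range(0, len(signal), frame_len)]


def speech_presence_probabilities(
    first_signal: Sequence[float],
    frame_len: int,
    energy_ref: float = 0.01,
) -> List[float]:
    """Return one score in [0, 1) per frame of the first signal.

    Stand-in for a real VAD: the mean frame energy is squashed so that
    silent frames score 0 and loud frames approach 1. energy_ref is an
    assumed tuning constant.
    """
    probabilities = []
    for frame in frame_signal(first_signal, frame_len):
        energy = sum(v * v for v in frame) / len(frame)
        probabilities.append(energy / (energy + energy_ref))
    return probabilities


# Hypothetical usage: four silent samples followed by four loud samples.
probs = speech_presence_probabilities([0.0] * 4 + [1.0] * 4, frame_len=4)
```

Applying the detector to the first signal rather than to the noisy input, as the claim recites, means the score is computed after noise removal, which should make residual noise less likely to be mistaken for speech.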

7. The method of claim 1, wherein generating the first signal comprises:
generating the first signal by performing a noise removal algorithm on the input acoustic signal in the time domain.

8. The method of claim 1, wherein generating the first signal comprises:
generating the first signal based on a deep learning model trained to output, from an acoustic signal containing noise, an acoustic signal from which the noise has been removed.

9. An acoustic signal processing method performed by a processor, the method comprising:
generating a first signal by removing noise from an input acoustic signal including a voice signal and the noise;
obtaining a speech presence probability corresponding to the first signal;
obtaining a parameter related to a degree of removal of the noise; and
controlling the degree of removal of the noise by synthesizing the input acoustic signal and the first signal based on at least one of the speech presence probability and the parameter.

10. The method of claim 9, wherein the controlling comprises:
in a section in which the speech presence probability is high, controlling the degree of removal of the noise by computing a weighted sum with a higher weight on the input acoustic signal than on the first signal, based on the parameter; and
in a section in which the speech presence probability is low, controlling the degree of removal of the noise by computing a weighted sum with a higher weight on the first signal than on the input acoustic signal, based on the parameter.

11. The method of claim 9, wherein the controlling further comprises:
dividing the first signal into a voice section and a non-voice section by applying a voice activity detection algorithm to the first signal;
computing, based on the parameter, a weighted sum of the first signal corresponding to the voice section and the input acoustic signal corresponding to the voice section, with a higher weight on the input acoustic signal than on the first signal; and
computing, based on the parameter, a weighted sum of the first signal corresponding to the non-voice section and the input acoustic signal corresponding to the non-voice section, with a higher weight on the first signal than on the input acoustic signal.
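The section-wise weighting in the preceding claim might look like the following sketch. It assumes that a voice activity detector has already produced one score per frame (`frame_scores`), that frames are `frame_len` samples long and cover the whole signal, and that the parameter maps to weights as shown; none of these details, nor the name `control_noise_by_section`, come from the claim itself.

```python
from typing import List, Sequence


def control_noise_by_section(
    input_signal: Sequence[float],
    first_signal: Sequence[float],
    frame_scores: Sequence[float],
    frame_len: int,
    param: float,
    vad_threshold: float = 0.5,
) -> List[float]:
    """Weight input vs. denoised signal per voice / non-voice section (sketch)."""
    # Hypothetical mapping of the parameter (assumed in [0, 1]) to weights on
    # the input signal: always above 0.5 in voice sections and below 0.5 in
    # non-voice sections, so the ordering required by the claim holds.
    w_voice = 0.75 + 0.25 * param
    w_non_voice = 0.25 * param
    output = []
    for n, (x, s) in enumerate(zip(input_signal, first_signal)):
        # Hard decision: a frame is a voice section when its score
        # reaches the assumed threshold.
        is_voice = frame_scores[n // frame_len] >= vad_threshold
        w = w_voice if is_voice else w_non_voice
        output.append(w * x + (1.0 - w) * s)
    return output


# Hypothetical usage: the first frame is voice, the second is non-voice.
controlled = control_noise_by_section([1.0] * 4, [0.0] * 4, [0.9, 0.1], 2, 0.0)
```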

12. A computer program stored in a medium for executing, in combination with hardware, the method of any one of claims 1 to 11.

13. An acoustic signal processing apparatus comprising at least one processor configured to:
obtain an input acoustic signal including a voice signal and noise;
generate a first signal by removing the noise from the input acoustic signal;
obtain a parameter that is input through a user interface related to noise control and that controls the intensity of the noise; and
generate an output acoustic signal by computing a weighted sum of the input acoustic signal and the first signal based on the parameter.

14. The apparatus of claim 13, wherein the processor is further configured to:
obtain a signal-to-noise ratio (SNR) value of the input acoustic signal; and
determine whether to activate the user interface based on the SNR value.

15. The apparatus of claim 13, wherein, in generating the output acoustic signal, the processor is configured to:
obtain a speech presence probability corresponding to each section of the first signal; and
compute, for each section, a weighted sum of the input acoustic signal and the first signal based on the speech presence probability and the parameter.

16. The apparatus of claim 15, wherein, in computing the weighted sum, the processor is configured to:
in a section in which the speech presence probability is high, compute the weighted sum with a higher weight on the input acoustic signal than on the first signal, based on the parameter; and
in a section in which the speech presence probability is low, compute the weighted sum with a higher weight on the first signal than on the input acoustic signal, based on the parameter.

17. The apparatus of claim 15, wherein, in computing the weighted sum, the processor is configured to:
generate, by computing weighted sums of the input acoustic signal and the first signal based on the parameter, a second signal having a high proportion of the input acoustic signal and a third signal having a high proportion of the first signal; and
compute, for each section, a weighted sum of the second signal and the third signal based on the speech presence probability.

18. The apparatus of claim 13, wherein, in generating the first signal, the processor is configured to:
generate the first signal by performing a noise removal algorithm on the input acoustic signal in the time domain.

19. The apparatus of claim 13, wherein, in generating the first signal, the processor is configured to:
generate the first signal based on a deep learning model trained to output, from an acoustic signal containing noise, an acoustic signal from which the noise has been removed.

20. An acoustic signal processing apparatus comprising at least one processor configured to:
generate a first signal by removing noise from an input acoustic signal including a voice signal and the noise;
obtain a speech presence probability corresponding to the first signal;
obtain a parameter related to a degree of removal of the noise; and
control the degree of removal of the noise by synthesizing the input acoustic signal and the first signal based on at least one of the speech presence probability and the parameter.